• Develop and manage reliability metrics (SLOs) for AI-driven API services and agentic AI platform features
• Implement comprehensive observability and monitoring systems for real-time performance and fault detection
• Design and drive automated failover, recovery, and incident response strategies for high-availability AI infrastructure
• Optimize resource utilization, particularly GPU/accelerator efficiency, ensuring cost-effective AI system operation
• Collaborate closely with engineering, platform, and product teams to align reliability efforts with broader organizational goals
• Lead efforts to build internal tooling and automation focused on AI system stability and operational excellence
• Drive continuous improvement in deployment practices, monitoring approaches, and incident management processes
About You
Requirements
Postman is seeking an experienced AI Systems Reliability Engineer to help define, build, and maintain the infrastructure and processes that ensure the reliability, scalability, and performance of Postman’s AI-powered API and agentic systems in production.
• Have a strong background in AI reliability engineering, SRE, or DevOps for distributed systems
• Understand the unique challenges of maintaining large-scale AI systems and integrating AI-specific metrics into reliability frameworks
• Are comfortable collaborating across teams to influence best practices for AI system reliability and operational health
• Thrive in dynamic, fast-paced environments with a focus on delivering reliable, safe AI-powered services

• Experience with AI/ML infrastructure, including GPU/xPU optimization and scaling
• Familiarity with API platform operations and large-scale distributed services
• Prior experience building or operating observability tools tailored for AI and agentic systems
Benefits
P.S.: We highly recommend reading the "API-First World" graphic novel to understand the bigger picture and our vision at Postman.
The Opportunity
The reasonably estimated base salary for this role ranges from $256,000 to $276,000, plus a competitive equity package.
Actual compensation is based on the candidate's skills, qualifications, and experience.
What Else?
In addition to Postman's pay-on-performance philosophy, and a flexible schedule working with a fun, collaborative team, Postman offers a comprehensive set of benefits, including full medical coverage, flexible PTO, wellness reimbursement, and a monthly lunch stipend.
Our frequent and fascinating team-building events will keep you connected, while our donation-matching program can support the causes you care about.
In our work, we focus on specific goals that add up to a larger vision.
Contact
Learn more at postman.com or connect with Postman on X via @getpostman.
Additional details
Postman is helping developers and professionals across the globe build the API-first world by simplifying each step of the API lifecycle and streamlining collaboration—enabling users to create better APIs, faster.
The company is headquartered in San Francisco and has offices in Boston, New York, Austin, Tokyo, London, and Bangalore - where Postman was founded.
• Are experienced with cloud platforms, monitoring tools, and incident response automation
• Contribution to open-source projects or reliability engineering thought leadership
Along with that, our wellness programs will help you stay in the best of your physical and mental health.
We’re building a long-term company with an inclusive culture where everyone can be the best version of themselves.
We are in the office five days a week for all roles based out of our hubs in the San Francisco Bay Area, Boston, Austin, New York City, Tokyo, and London.
For roles based in Bangalore, employees currently work in the office three days a week and will transition to five days per week by the end of the year.
We were thoughtful in our approach, which is based on collaboration and grounded in feedback from our workforce, leadership team, and peers. The benefits of our in-office model will be shared knowledge, brainstorming sessions, communication, and building trust in person that cannot be replicated via Zoom.
Our Values