infrastructure
Posted 4 weeks ago
Principal Site Reliability Engineer
at UiPath
Hybrid
Requirements
LIFE AT UIPATH
- The people at UiPath believe in the transformative power of automation to change how the world works.
- Could that be you?
YOUR MISSION
- UiPath is seeking a Principal Site Reliability Engineer to redefine how reliability is engineered using AI.
- You will operate at the intersection of SRE, distributed systems, and applied AI, designing systems that transform raw telemetry into actionable insights, enable predictive reliability, and introduce self-healing capabilities into production environments.
- Reliability platform tooling - Build internal systems that enable engineering teams to debug faster using AI-assisted tooling and proactively identify and mitigate reliability risks.
- AI-assisted Incident response & RCA - Build AI-powered systems that determine impact and use historical data to improve detection and response over time.
- Technical Leadership & Org Impact - Influence standards for building AI-driven tooling, mentor junior and senior engineers, and elevate reliability focus across the organization.
- Experience in SRE, platform, or cloud infrastructure engineering roles, with a track record of building internal tooling to improve reliability.
- Strong conceptual understanding of distributed systems, performance bottlenecks, failure modes, and trade-offs inherent to large-scale systems.
AI/ML Application to Systems & Operations
- Experience building applications or internal tools using LLMs to automate non-trivial workflows (e.g., AIOps, automated code reviews, automated flagging of reliability risks).
- Hands-on experience building agents/copilots using modern ML frameworks (PyTorch, vLLM, or equivalent) in a production setting.
Scripting & Tooling
- Proficiency in at least one programming language (e.g., Python, Go, or similar).
- Experience with Infrastructure as Code (e.g., Terraform, Pulumi) and container orchestration (e.g., Kubernetes).
Cloud & Infrastructure Expertise
- Hands-on experience working with one or more major cloud providers (Azure, AWS, GCP), with practical knowledge of networking, deployments, and scaling.
Observability & Operational Practices
- Proven experience with monitoring/observability stacks (metrics, logs, traces) and building meaningful dashboards and alerts that improve reliability signals.
Incident Response & Post-Incident Learning
- Experience participating in and improving incident response, blameless postmortems, and implementing systemic fixes rather than symptomatic patches.
Collaboration & Influence