Staff Site Reliability Engineer

San Francisco, United States | Remote

Responsibilities

- Define and drive adoption of SLOs, SLIs, and error budgets across engineering teams.
- Architect and continuously improve observability platforms (metrics, logging, tracing).
- Own reliability strategy and roadmap, proactively identifying risks and driving long-term improvements.
- Lead cross-team initiatives to improve system performance, scalability, and resilience.
- Drive root cause analysis and systemic improvements through blameless postmortems.
- Guide capacity planning, load testing, and performance optimization efforts.
- Design and validate disaster recovery, failover strategies, and resilience testing.
- Track record of technical leadership and cross-functional influence across engineering and product teams.

- Experience in software engineering, with a focus on distributed systems and production infrastructure.
- Extensive experience operating and scaling distributed systems in cloud environments, with a strong preference for AWS.
- Deep expertise in system reliability, scalability, and performance engineering at scale.
- Demonstrated experience implementing SLO-driven engineering practices and reliability frameworks.
- Strong background building and owning observability ecosystems (e.g., Datadog, Prometheus, Grafana).
- Proficiency with Infrastructure as Code tooling, particularly Terraform or equivalent.
- Proven experience leading incident management, postmortems, and production operations.
- Strong software engineering fundamentals with the ability to contribute to and review complex codebases.
- Experience designing or operating multi-region and globally distributed systems.
- Deep expertise in distributed tracing and performance analysis across complex service architectures.
- Hands-on experience with database scalability and performance tuning at scale.
- Familiarity with compliance-driven engineering environments (e.g., SOC 2, FedRAMP, or similar frameworks).
- Experience applying chaos engineering practices to validate and improve system resilience.