Senior Site Reliability Engineer

San Francisco, United States (Remote)

Responsibilities

- Build and improve observability systems (metrics, logs, tracing) to provide deep visibility into system behavior and performance.
- Lead efforts to improve system reliability and performance, including capacity planning, load testing, and performance tuning.
- Drive continuous improvement of system resilience, including disaster recovery and chaos testing.
- Establish best practices for monitoring, alerting, and incident management to ensure rapid detection and resolution of issues.

- Experience in site reliability engineering, infrastructure, or a related software engineering discipline.
- Strong experience operating and scaling distributed systems in cloud environments, with AWS preferred.
- Hands-on experience defining SLOs/SLIs and leveraging them to inform and drive engineering priorities.
- Proficiency with Infrastructure as Code tooling, particularly Terraform or equivalent.
- Deep understanding of system performance, reliability patterns, and distributed system failure modes.
- Experience implementing distributed tracing systems, such as OpenTelemetry or similar frameworks.
- Experience with capacity planning and performance benchmarking at scale.
- Familiarity with database performance tuning and observability across high-traffic systems.
- Exposure to regulated or compliance-heavy engineering environments (e.g., SOC 2, FedRAMP, or equivalent frameworks).
- Experience applying chaos engineering practices to proactively test and strengthen system resilience.

Benefits include:

- Competitive compensation packages with meaningful ownership
- Flexible PTO
- 401(k)
- Wellness benefits, including a bundle of free therapy sessions
- Technology & Work from Home reimbursement
- Flexible work schedules