infrastructure
Posted Apr 30
Senior Site Reliability Engineer
at Fieldguide
San Francisco, United States
Remote
Responsibilities
- Build and improve observability systems (metrics, logs, tracing) to provide deep visibility into system behavior and performance.
- Lead efforts to improve system reliability and performance, including capacity planning, load testing, and performance tuning.
- Drive continuous improvement of system resilience, including disaster recovery and chaos testing.
- Establish best practices for monitoring, alerting, and incident management to ensure rapid detection and resolution of issues.
Requirements
- 5+ years of experience in site reliability engineering, infrastructure, or a related software engineering discipline.
- Strong experience operating and scaling distributed systems in cloud environments, with AWS preferred.
- Hands-on experience defining SLOs/SLIs and leveraging them to inform and drive engineering priorities.
- Proficiency with Infrastructure as Code tooling, particularly Terraform or equivalent.
- Deep understanding of system performance, reliability patterns, and distributed system failure modes.
- Experience implementing distributed tracing systems, such as OpenTelemetry or similar frameworks.
- Experience with capacity planning and performance benchmarking at scale.
- Familiarity with database performance tuning and observability across high-traffic systems.
- Exposure to regulated or compliance-heavy engineering environments (e.g., SOC 2, FedRAMP, or equivalent frameworks).
- Experience applying chaos engineering practices to proactively test and strengthen system resilience.
Benefits
- Competitive compensation packages with meaningful ownership
- Flexible PTO
- 401(k)
- Wellness benefits, including a bundle of free therapy sessions
- Technology & Work from Home reimbursement
- Flexible work schedules