infrastructure
Posted Apr 28
Staff Site Reliability Engineer
at Fieldguide
San Francisco, United States
Remote
Responsibilities
- Define and drive adoption of SLOs, SLIs, and error budgets across engineering teams.
- Architect and continuously improve observability platforms (metrics, logging, tracing).
- Own reliability strategy and roadmap, proactively identifying risks and driving long-term improvements.
- Lead cross-team initiatives to improve system performance, scalability, and resilience.
- Drive root cause analysis and systemic improvements through blameless postmortems.
- Guide capacity planning, load testing, and performance optimization efforts.
- Design and validate disaster recovery, failover strategies, and resilience testing.
- Track record of technical leadership and cross-functional influence across engineering and product teams.
Requirements
- Experience in software engineering, with a focus on distributed systems and production infrastructure.
- Extensive experience operating and scaling distributed systems in cloud environments, with a strong preference for AWS.
- Deep expertise in system reliability, scalability, and performance engineering at scale.
- Demonstrated experience implementing SLO-driven engineering practices and reliability frameworks.
- Strong background building and owning observability ecosystems (e.g., Datadog, Prometheus, Grafana).
- Proficiency with Infrastructure as Code tooling, particularly Terraform or equivalent.
- Proven experience leading incident management, post-mortems, and production operations.
- Strong software engineering fundamentals with the ability to contribute to and review complex codebases.
- Experience designing or operating multi-region and globally distributed systems.
- Deep expertise in distributed tracing and performance analysis across complex service architectures.
- Hands-on experience with database scalability and performance tuning at scale.
- Familiarity with compliance-driven engineering environments (e.g., SOC 2, FedRAMP, or similar frameworks).
- Experience applying chaos engineering practices to validate and improve system resilience.