Posted 3 weeks ago
Senior Site Reliability Engineer
at Replit
France (Remote)
Responsibilities
- Create dashboards and metrics that provide real-time visibility into system health and performance.
- Implement logging strategies that enable quick problem identification and resolution.
- Drive Automation and Infrastructure as Code: Architect and implement infrastructure automation solutions using tools like Terraform, Ansible, or Pulumi.
- Design and maintain CI/CD pipelines that enable reliable and consistent deployments.
- Create self-healing systems that can automatically respond to common failure scenarios.
- Establish SLOs and SLIs: Work with product and engineering teams to define and implement Service Level Objectives (SLOs) and Service Level Indicators (SLIs).
- Build systems to track and report on these metrics, ensuring we maintain high reliability standards while balancing innovation speed.
- Incident Management and Response: Lead incident response efforts, conduct thorough post-mortems, and implement improvements to prevent future occurrences.
- Develop and maintain runbooks for critical services.
- Build tools and processes that reduce Mean Time To Recovery (MTTR).
- Performance Optimization: Identify and resolve performance bottlenecks across our infrastructure.
- Implement capacity planning strategies and optimize resource utilization.
Requirements
- Experience in Site Reliability Engineering or similar roles (DevOps, Systems Engineering, Infrastructure Engineering)
- Strong programming skills in languages commonly used for automation (Python, Go, or similar)
- Deep understanding of distributed systems
- Experience with container orchestration platforms (Kubernetes) and cloud-native technologies
- Proven track record of implementing and maintaining monitoring/observability solutions
- Strong incident management skills with experience leading incident response
- Experience with Google Cloud Platform (GCP) services and tools
- Knowledge of modern observability platforms (Prometheus, Grafana, Datadog, etc.)
What We Value
- Problem-solving mindset: Ability to approach complex operational challenges systematically and devise effective solutions
- Self-directed and autonomous: Capable of working independently while collaborating effectively with cross-functional teams
- Strong communication skills: Ability to explain complex technical concepts to both technical and