infrastructure
Posted Apr 22
Lead Site Reliability Engineer
United States, Remote
Responsibilities
- Architect observability frameworks to translate telemetry data into actionable roadmaps that reduce toil and enhance resilience.
Requirements
- You are not just an operator; you are an experienced software engineer who excels at architecture, code optimization, and deep troubleshooting.
- Core: Applications written in Ruby on Rails and Node.js, PostgreSQL, MongoDB, Redis, Memcached, Sidekiq, ActiveJob, Elasticsearch, WebSockets
- Infrastructure: 100% Linux-based cloud infrastructure (AWS, Google Cloud, MongoDB Atlas) and services (ECS/EC2/Kubernetes, ElastiCache, Memorystore, RDS, Cloud SQL, BigQuery, etc.)
- Infrastructure as Code (IaC): GitHub, Terragrunt, Terraform, Ansible
- Observability & Alerting: New Relic, AWS CloudWatch, Google Cloud Stackdriver, Squadcast
- Agile/Scrum practices utilizing Jira
Responsibilities
- Expertise in Cloud Computing (AWS/GCP) and Infrastructure as Code (Terraform/Ansible).
- Strong proficiency with SQL databases (PostgreSQL) and the ability to quickly navigate and optimize complex, unfamiliar codebases.
- Experience designing monitoring solutions (Datadog, New Relic, Prometheus) based on the "Golden Signals".
- SLO Governance: Demonstrated ability to define SLIs/SLOs from scratch, negotiate error budgets, and use data to balance feature velocity with reliability.
- Security Focus: Experience securing cloud environments and container platforms (Kubernetes), including hands-on management of WAF rules and edge security.
- Experience leading post-incident reviews (RCAs) and implementing action items that directly improve MTTR (Mean Time to Recovery) and MTTD (Mean Time to Detection).
- Leadership: Proven experience leading technical teams, mentoring engineers, and working in a team-oriented, collaborative environment with strong communication skills.
- Experience in developing solutions using server automation tools such as Terraform and Ansible.
- CI/CD Expertise: Experience in writing and maintaining CI/CD pipelines and services.
- Kubernetes: Experience in building, deploying, and optimizing Kubernetes-based infrastructure.
- Perimeter Defense: Experience configuring and managing Web Application Firewalls (WAF) (e.g., Cloudflare, AWS WAF, Akamai) and DDoS protection mechanisms.
Education