engineering
Posted 1 week ago
Member of Technical Staff, Site Reliability Engineer
at Vapi
San Francisco, United States
Remote
Responsibilities
- Define the first set of SLOs for the call-completion path.
- 60 Day: Stand up error budgets and SLO-based alerting in Chronosphere/Prometheus for the highest-impact services.
- Run the first proper load test against provider rate limits and per-org concurrency.
- Tune autoscaling for wscaler / workerpool-cron-scaler.
- 90 Day: Ship a real platform service (capacity forecaster, auto-remediation, or oncall tooling) in Go or TypeScript.
- Own the postmortem process.
- Drive a measurable improvement in p99 call completion or MTTR.
Requirements
- Voice AI that resolves, not transfers.
- We’ve had 15 stability-gap outages worth learning from, and we need someone who runs incident command, owns SLOs and error budgets, and builds the reliability culture from scratch.
- This is not a bash-and-YAML role.
- You’ll ship code (Go or TypeScript) for services that monitor and manage the platform: auto-remediation, capacity forecasters, oncall tooling.
- Capacity planning, load testing, and KEDA-based autoscaling for Vapi’s wscaler and workerpool-cron-scaler are on your plate.
- WHO YOU ARE: Must-haves
  - You’ve run incident command and postmortem discipline at scale on a real oncall rotation.
  - You’ve operated SLOs and error budgets in Chronosphere, Prometheus, Grafana, or Datadog.
  - You’ve done capacity planning and load testing for production systems with real users.
  - You’re fluent in Kubernetes production ops: pod crash diagnosis, HPA/VPA tuning, PodDisruptionBudgets, graceful shutdown.
  - You know backpressure and autoscaling patterns: KEDA, custom metrics scaling.
  - You can build platform services in Go or TypeScript (matches Vapi’s cluster-manager, database-health, wscaler, incidentManager).
  - Real-time / latency-sensitive product background where degraded means a dropped call, not a slow dashboard.
- Tech stack you’ll work in
  - Languages: Go and TypeScript (you ship code, not just scripts), Bash.
  - Orchestration: Kubernetes on EKS, with production ops (HPA/VPA tuning, PodDisruptionBudgets, graceful shutdown, pod crash diagnosis).
  - Autoscaling and backpressure: KEDA, custom metrics scaling (matches Vapi’s wscaler and workerpool-cron-scaler).
- Where you likely come from
  - A real-time / latency-sensitive product (Discord, Zoom, Mux, Twitch, Twilio, LiveKit, Cloudflare, a trading firm, a gaming backend), or a FAANG SRE / Production Engineer (Google, Uber, Twitter/X, Meta) who misses being hands-on.
- Weak fit: SRE from analytics or CRM backends where “degraded” means a slow dashboard, not a dropped call.