Member of Technical Staff, Site Reliability Engineer

at Vapi

San Francisco, United States · Remote

Kubernetes, Prometheus, Grafana, TypeScript

$10,000

Responsibilities

- Define the first set of SLOs for the call-completion path.

60 Day:
- Stand up error budgets and SLO-based alerting in Chronosphere/Prometheus for the highest-impact services.
- Run the first proper load test against provider rate limits and per-org concurrency.
- Tune autoscaling for wscaler / workerpool-cron-scaler.

90 Day:
- Ship a real platform service (capacity forecaster, auto-remediation, or oncall tooling) in Go or TypeScript.
- Own the postmortem process.
- Drive a measurable improvement in p99 call completion or MTTR.

Requirements

Voice AI that resolves, not transfers.
We’ve had 15 stability-gap outages worth learning from, and we need someone who runs incident command, owns SLOs and error budgets, and builds the reliability culture from scratch.
- This is not a bash-and-YAML role. You’ll ship code (Go or TypeScript) for services that monitor and manage the platform: auto-remediation, capacity forecasters, oncall tooling.
- Capacity planning, load testing, and KEDA-based autoscaling for Vapi’s wscaler and workerpool-cron-scaler are on your plate.
WHO YOU ARE: Must-haves
- You’ve run incident command and postmortem discipline at scale on a real oncall rotation.
- You’ve operated SLOs and error budgets in Chronosphere, Prometheus, Grafana, or Datadog.
- You’ve done capacity planning and load testing for production systems with real users.
- You’re fluent in Kubernetes production ops: pod crash diagnosis, HPA/VPA tuning, PodDisruptionBudgets, graceful shutdown.
- You know backpressure and autoscaling patterns: KEDA, custom metrics scaling.
- You can build platform services in Go or TypeScript (matches Vapi’s cluster-manager, database-health, wscaler, incidentManager).
- Real-time / latency-sensitive product background where degraded means a dropped call, not a slow dashboard.
Tech stack you’ll work in
- Languages: Go and TypeScript (you ship code, not just scripts), Bash.
- Orchestration: Kubernetes on EKS, including production ops (HPA/VPA tuning, PodDisruptionBudgets, graceful shutdown, pod crash diagnosis).
- Autoscaling and backpressure: KEDA, custom metrics scaling (matches Vapi’s wscaler and workerpool-cron-scaler).
Where you likely come from
- A real-time / latency-sensitive product (Discord, Zoom, Mux, Twitch, Twilio, LiveKit, Cloudflare, a trading firm, a gaming backend), or a FAANG SRE / Production Engineer (Google, Uber, Twitter/X, Meta) who misses being hands-on.
- Weak fit: SRE from analytics or CRM backends where “degraded” means a slow dashboard, not a dropped call.
