Posted 3 weeks ago
Member of Technical Staff, Infrastructure
at Vapi
San Francisco, United States
On-site
Responsibilities
- Ship a control-plane improvement in Go and drive a measurable reliability or capacity win.
- 90 Day: Lead a roadmap pillar of the cell-based build-out: a new region, a stateful workload migration, or unblocking the SIP gateway SPOF.
Requirements
- Voice AI that resolves, not transfers.
- We’re building cell-based, multi-region infrastructure to drive 99.99% call completion, and this hire owns the foundation: multi-cluster Kubernetes on EKS, a stateful data plane (Postgres, Redis, Kafka, Temporal, ClickHouse), Envoy/Cilium networking, and multi-region Kafka on MSK across EU and ANZ.
- You’ll write Go for control-plane services like cluster-manager, traffic-control-plane, and environment-manager, and you’ll set the bar for how Vapi runs stateful workloads at scale.
WHAT YOU’LL DO:
- 30 Day: Ramp on the cell-based architecture, the regional EKS clusters (backend / networking / persistence / monitoring / models / kafka), and the Pulumi stacks.
WHO YOU ARE:
- You’ve run multi-cluster Kubernetes on EKS in production (backend, networking, persistence, monitoring, models, and kafka clusters per region), and you’ve used Cluster API or similar for programmatic cluster creation.
- You’ve operated a stateful data plane (Postgres, Redis, Kafka, Temporal, etcd, ClickHouse) at scale: you’ve sharded it, migrated data between instances, and lived with the consequences.
- You’re fluent in Envoy and Cilium/eBPF. VPC/NAT/Cloudflare alone isn’t enough.
- You’ve run multi-region Kafka on MSK in production, not just Kafka. You’ve dealt with regional topic naming, MSK Pulumi drift, and compliance constraints.
- You write Go for control-plane services.
- You have cell-based architecture experience: Shopify pods, AWS cell-based reference arch, Slack shards, or equivalent. Microservices experience alone isn’t the same.
- You likely come from one of: a company that ran cell-based in prod (Shopify, AWS service teams, Slack); a distributed systems shop (Cockroach, MongoDB, Confluent, Temporal, Redpanda, ClickHouse Inc.); a voice/video/CPaaS company (Twilio, Plivo, Bandwidth, Vonage, LiveKit, Daily.co, Dialpad); an Envoy/service-mesh org (Lyft, Stripe, Airbnb, Pinterest, Isovalent/Cilium); or a streaming-infra team (Confluent, Uber, LinkedIn, Datadog) that ran MSK/Kafka multi-region.