Posted 6 days ago
Senior Platform & Reliability Engineer
at OpenArt
San Francisco, United States
Hybrid
Responsibilities
- Build and maintain end-to-end observability (logs, metrics, traces, dashboards) so engineers can quickly understand “what broke” and “why.”
- Strengthen deploy safety through CI/CD improvements, automated rollbacks, canary releases, and feature flag patterns.
- Adopt containerized or more managed approaches as we scale.
Requirements
- About OpenArt: OpenArt is an AI Storytelling and Visual Creation Platform used by millions worldwide.
- We’re building the next generation of creative tools powered by cutting-edge AI, enabling anyone to create videos, visuals, characters, and stories with unprecedented speed and imagination.
- We believe the future of creativity is AI-native, and we're shaping that future.
- AI-native product, you’ll design how cutting-edge AI models are exposed as real user experiences.
- You’ll work across cloud infrastructure, distributed systems, backend services, and developer tooling, making pragmatic decisions that balance product velocity, system reliability, and cost efficiency—in a fast-moving, AI-native environment.
- You’ll partner closely with product engineers to evolve the platform that powers OpenArt, contributing to key decisions around infrastructure architecture, improving multi-provider AI reliability, and helping us scale systems to millions of users—while raising the overall engineering bar.
- Improve system resilience at external boundaries (AI providers, storage, etc.), including timeouts, retries, circuit breakers, and fallback strategies.
- Experience with cloud-native systems (AWS or GCP), including serverless/event-driven architectures and at least one container-based approach (e.g., ECS/Fargate, Cloud Run, Kubernetes).
- Solid understanding of observability and reliability practices: metrics, alerting, tracing, and incident response.
- Experience designing resilient systems with external dependencies (timeouts, retries/backoff, idempotency, circuit breakers).
- Ability to communicate technical tradeoffs clearly to engineers across different domains.
- Comfortable operating in ambiguous, fast-moving environments and taking ownership of problems.
Nice to Have