Posted May 4
Senior Scalability Engineer - Observability
at Judi Health
Denver, United States (Remote)
Responsibilities
- Beyond maintaining infrastructure, you'll architect and develop a custom observability platform that gives engineering teams powerful, fast, and cost-effective visibility into every layer of our infrastructure—from application logs and metrics to distributed traces. You'll build production-grade internal products using React/TypeScript frontends with Python and Rust backends, creating tools that fundamentally improve how engineers at Judi Health debug, monitor, and
- Architect observability platform: Design, implement, and maintain the LGTM stack (Loki, Grafana, Tempo, Mimir/Prometheus) as the primary observability platform across all engineering teams, making architectural decisions that balance cost, performance, and developer experience.
- Build internal observability products: Design and develop production-grade internal platform products with React/TypeScript frontends and Python/Rust backends that provide engineers with powerful log search, metrics visualization, and trace analysis capabilities.
- Develop custom log indexing systems: Architect and build high-performance log indexing solutions using Rust that process logs and provide sub-second search across billions of log lines at a fraction of the cost.
- Integrate SQL analytics for logs: Design and implement solutions leveraging AWS Athena or similar SQL query engines (DuckDB, ClickHouse) for ad-hoc log analysis and historical queries, enabling engineers to run complex SQL queries over S3-based log data for deep investigations and trend analysis.
- Create advanced query interfaces: Build sophisticated web interfaces that allow engineers to query logs, metrics, and traces with features like saved queries, query templates, correlation analysis, and pattern detection, supporting both full-text search and SQL-based analytics.
- Balance cloud-native and open-source: Architect solutions that thoughtfully leverage both AWS-managed services (CloudWatch, Athena, Kinesis) and open-source tooling (LGTM stack, Quickwit) to optimize for cost, performance, and operational flexibility based on use case requirements.
- Integrate AWS observability: Design seamless integration between AWS CloudWatch Logs/Metrics and our custom observability platform, providing unified visibility across managed and self-hosted infrastructure.
- Build intelligent alerting: Develop smart dashboards, monitors, and alerting systems that reduce noise, detect anomalies, and help teams respond to incidents quickly.
- Enable performance optimization: Provide the observability foundation that allows the Scalability team to identify performance bottlenecks, track optimization impact, and measure platform stability with data-driven insights.