infrastructure
Posted 3 weeks ago
Senior Cloud Infrastructure Engineer
at Langfuse
Germany, Remote
Responsibilities
- Build world-class observability: You'll own our Datadog setup end to end — dashboards, alerts, and SLOs.
- Automate everything: CI/CD pipelines, infrastructure-as-code, automated scaling, zero-downtime deployments.
Requirements
ABOUT LANGFUSE
- Open Source LLM Engineering Platform that helps teams build useful AI applications via tracing, evaluation, and prompt management (mission https://tracking.us.nylas.com/l/6d586a21a6fc4e1a8aacc7eb75882b72/0/82383757e54352130f65066e1b2fc4708aacab7897561bcb8000fe4c8a9c6a21?cache_buster=1761124921, product https://tracking.us.nylas.com/l/6d586a21a6fc4e1a8aacc7eb75882b72/1/b9fba3a93b6ffcc0f99ecda62767a17cc437fe8fe0b16181d1c43c1391212e3d?cache_buster=1761124921).
- Largest open source solution in this category: trusted by 19 of the Fortune 50, >2k customers, >26M monthly SDK downloads, >6M Docker pulls.
- Together we can move faster on product while staying true to open source and self-hosting, and join forces on GTM and sales to accelerate revenue.
- You'll operate Langfuse Cloud on AWS ECS Fargate and ClickHouse Cloud, with Datadog as the observability backbone.
- You'll also own our public self-hosted infrastructure — including our Helm chart, Docker Compose setup, and everything in between — so that teams from startups to enterprises can run Langfuse on their own terms.
YOU WILL GROW AT LANGFUSE BY
- Own Langfuse Cloud operations: You'll run our production environments on AWS ECS Fargate and ClickHouse Cloud.
- You'll own and evolve our Helm chart, Docker Compose configuration, and deployment documentation.
- Experience operating production workloads on AWS (ECS/Fargate, networking, IAM, S3, etc.) or on comparable hyperscale vendors.
- Comfortable with container orchestration: Kubernetes and/or ECS, Helm charts, Docker
- Experience with infrastructure-as-code (Terraform, Pulumi, CloudFormation, or similar)
- Strong monitoring and observability instincts: you've built dashboards and alerts that actually caught problems (Datadog
- Experience with ClickHouse Cloud or other managed analytical databases
- Background in operating high-throughput event processing or observability infrastructure
- Contributions to open source infrastructure tooling (Helm charts, Terraform modules, etc.)
- Former founder
PROCESS
- We can run the full process to your offer letter in less than 7 days (hiring process https://langfuse.com/handbook/how-we-hire/hiring-process).
TECH STACK
- We run a TypeScript monorepo: Next.js on the frontend, Express workers for background jobs, PostgreSQL for transactional data, ClickHouse for tracing at scale, S3 for file storage, and Redis for queues and caching.
- New joiners get all PRs reviewed to learn the codebase, patterns, and how the systems work (onboarding guide https://langfuse.com/handbook/product-engineering/how-we-work/onboarding).
- We use AI as much as possible in our workflows to make our users happy.
- We encourage everyone to experiment with new tooling and AI workflows.
WHY LANGFUSE (NOW PART OF CLICKHOUSE)
- This role puts you at the forefront of the AI revolution, partnering with engineering teams who are building the technology that will define the next decade(s).
- This is an open-source devtools company.
- You’ll own the full delivery end to end.
- We're solving hard engineering problems: figuring out which features actually help users improve AI product performance, building SDKs developers love, visualizing data-rich traces, rendering massive LLM prompts and completions efficiently in the UI, and processing terabytes of data per day through our ingestion pipeline.
- You'll work closely with the ClickHouse team and learn how they build a world-class infrastructure company.
- The AI space develops at breakneck speed and our customers are at the forefront.
Benefits
- You have strong opinions about reliability, automation, and how to ship infrastructure changes safely
- Interest in open source software and genuine enjoyment helping users debug their self-hosted deployments
- Thrives in a small, accountable team where your output is visible and matters
- CS or quantitative degree preferred
Bonus points:
Contact
- We are also hiring for engineering in EU timezones and expect one week per month in our Berlin office (how we work https://langfuse.com/handbook/how-we-work/principles).
- You should be familiar with a good chunk of this, but we trust you'll pick up the rest quickly (Stack https://langfuse.com/handbook/product-engineering/tech-stack, Architecture https://langfuse.com/handbook/product-engineering/architecture).
HOW WE SHIP
- Link to handbook: https://langfuse.com/handbook/how-we-work/principles
- We trust you to take ownership (ownership overview https://langfuse.com/handbook/how-we-work/ownership) for your area.
- You'll appear on changelog posts https://langfuse.com/changelog for the features you build, and during launch weeks, you'll produce videos https://langfuse.com/blog/2025-10-29-launch-week-4 to announce what you've shipped to the community.
- We're in a period of strong growth: Langfuse is growing organically and accelerating through ClickHouse's GTM (Why we joined ClickHouse https://langfuse.com/blog/joining-clickhouse).
- If you wonder what to build next, our users are a Slack message or a GitHub Discussions post away.
- You’re on a continuous learning journey.
Additional details
- We're building the "Datadog" of this category; model capabilities continue to improve, but building useful applications is really hard, both in startups and enterprises.
- Previously backed by Y Combinator, Lightspeed, and General Catalyst.
- We're a small, engineering-heavy, and experienced team in Berlin and San Francisco.
WHY CLOUD INFRASTRUCTURE AT LANGFUSE
- Your work will keep Langfuse running, everywhere.
- Langfuse processes over a billion trace events per month.
- When a Fortune 50 company relies on Langfuse in production, they're relying on the infrastructure you operate.
- You'll own uptime, performance, and cost efficiency across our entire cloud footprint — and you'll make sure every self-hosted deployment runs just as smoothly.
- We're scaling fast, and you'll be the person who makes sure the infrastructure grows ahead of demand — not behind it.
- Langfuse is now part of ClickHouse, which means the team behind the database at the core of our stack is one channel away.
- Few infrastructure roles give you that kind of direct access to the people who build your most critical dependency.