Design, build, and operate reconciliation systems, including the Stack State Service (SSS) backend, that track desired stack state and detect and repair drift across stack templates, grafana.com state, Hosted Grafana, and actual customer stack configuration
Collaborate across SSS, grafana.com, and deployment configurations to ensure stack lifecycle workflows remain reliable, observable, and resilient
Improve operational efficiency by reducing deployment complexity (e.g., aiming for single-PR regional SSS deployments) and contributing to the Stack Config Reconciliation project
Manage rollout mechanisms for provisioned plugins, dashboards, data sources, Grafana versions, release channels, and stack-level configuration
Support new region and cluster rollouts, including the operational paths required to bring stacks online safely in new Grafana Cloud regions
Improve incident response and recovery paths for stack misalignment, reconciliation failures, plugin rollout issues, and Hosted Grafana integration failures
Requirements
There are more than 20M users of Grafana, the open source visualization tool, around the globe, monitoring everything from beehives to climate change in the Alps.
Grafana Labs also helps more than 3,000 companies -- including Bloomberg, JPMorgan Chase, and eBay -- manage their observability strategies with the Grafana LGTM Stack, which can be run fully managed with Grafana Cloud or self-managed with the Grafana Enterprise Stack, both featuring scalable metrics (Grafana Mimir), logs (Grafana Loki), and traces (Grafana Tempo).
Our work includes maintaining the billing engine responsible for customer usage calculation, automating provisioning after a customer signs a contract, integrating with cloud marketplaces such as AWS, Azure, and GCP, and building and maintaining the user portal our customers rely on to manage their accounts.
Engineers at Grafana also have the opportunity to contribute to Open Source communities and collaborate across teams beyond their immediate scope.
A stack is the customer-facing Grafana Cloud environment that connects an organization to Grafana and the backend services it uses, including Mimir, Loki, Tempo, plugins, dashboards, data sources, and stack-level configuration.
You can use modern AI coding assistants as part of your daily workflow (your choice of tools, within security guidelines), backed by a company-funded usage budget so you can iterate quickly without unnecessary friction. We encourage pragmatic AI-assisted development: faster prototyping, test generation, refactors, documentation, and incident follow-ups—always paired with strong code review and quality standards. You’ll also have access to frontier models (e.g., GPT-Codex 5/3, Claude Opus 4.6,
At Grafana, we actively embrace AI-assisted and agentic development practices, integrating these technologies into both our engineering workflows and the systems we deliver.
We encourage our engineers to thoughtfully leverage AI tools to enhance every stage of the lifecycle, from design and implementation to testing, documentation, and operations.
Our team is small and operates with a high degree of independence; you will be expected to lead major projects, coordinate across service boundaries, and help define the technical direction for our domain.
You have worked on a big SaaS platform and dealt with common distributed systems problems (e.g. scalability, multi-tenancy, data isolation, HA, …)
You have professional experience with Golang and are willing to work across both backend service and application code
You have experience delivering projects in a self-driven way, from gathering requirements and brainstorming ideas to shipping a product into the customer’s hands
You have experience mentoring junior engineers in a collaborative but asynchronous environment
You make your plans transparent, bring stakeholders on board, and are open to feedback and suggestions
Strong Kubernetes experience in AWS, GCP, or Azure, and familiarity with infrastructure-as-code tooling (Helm, Terraform, Jsonnet, etc.)
Experience participating in blameless incident response and writing high-quality post-incident reviews
Bonus Points For:
Experience with TypeScript/Node.js
Experience with Kubernetes control-plane patterns, operators, reconcilers, or desired-state systems
Experience with Jsonnet/Tanka, Terraform, Flux, Argo, or similar deployment/configuration tooling
Experience working on SaaS provisioning, tenancy, regional expansion, plugin rollout, or customer lifecycle systems
Experience with incident response involving configuration drift, partial failure, or cross-service state mismatch
Grafana Labs may utilize AI tools in its recruitment process to assist in matching information provided in CVs to job postings.
Experience
You have at least 1 year of fully remote work experience
Benefits
Compensation & Rewards:
In Canada, the base compensation range for this role is CAD 186,368 - CAD 223,642. Actual compensation may vary based on level, experience, and skillset as assessed in the interview process.
Benefits include equity, bonus (if applicable), and other benefits listed here.
*Compensation ranges are country specific. If you are applying for this role from a different location than listed above, your recruiter will discuss your specific market’s defined pay range &
Balance is Key - We operate a global annual leave policy of 30 days per annum. Three days of your annual leave entitlement are reserved for Grafana Shutdown Days to allow the team to really disconnect. *We will comply with local legislation where applicable.
Equal Opportunity Employer: We will recruit, train, compensate and promote regardless of race, religion, color, national origin, gender, disability, age, veteran status, and all the other fascinating characteristics that make us different and unique.
Contact
We build on the grafana.com platform to create custom solutions and integrations across the many systems that support a modern software company.
We build the control-plane services and workflows that keep stack state aligned across grafana.com, Stack State Service (SSS), Hosted Grafana, cloud regions, and the underlying Grafana Cloud infrastructure.
Additional details
Grafana Labs is a remote-first, open-source powerhouse.
The instantly recognizable dashboards have been spotted everywhere from a NASA launch and Minecraft HQ to Wimbledon and the Tour de France.
We’re scaling fast and staying true to what makes us different: an open-source legacy, a global collaborative culture, and a passion for meaningful work.
Our team thrives in an innovation-driven environment where transparency, autonomy, and trust fuel everything we do.
You may not meet every requirement, and that’s okay. If this role excites you, we’d love you to raise your hand for what could be a truly career-defining opportunity.
This is a remote opportunity, and we are interested in applicants located in Canadian time zones (EST and CST only at this time).
Application Core Services (AppCore) partners closely with our Cloud, Enterprise, and Grafana teams to deliver reliable internal and customer-facing systems that power critical parts of the Grafana business.
The team owns important domain areas that help keep both our customer workflows and internal business processes running smoothly.
AppCore is made up of multiple squads, each focused on one or more of these domains.
This is a team working at the intersection of product, platform, and business operations.