jobloom

infrastructure

Posted May 8

Staff Site Reliability Engineer - Site Experience

at Reddit, Inc.

Dublin, Ireland (On-site)

Responsibilities

  • Lead Reliability Engineering for User Experience:
  • Drive reliability, scalability, and operational excellence for critical user-facing systems and services.
  • Improve performance and resiliency across APIs, content delivery, feed generation, search, messaging, and real-time experiences.
  • Architect for Scale:
  • Guide architectural decisions around failover, redundancy, graceful degradation, traffic management, and capacity planning.
  • Identify systemic risks and reliability bottlenecks across services, dependencies, deployments, and infrastructure.
  • Build proactive mitigation strategies and drive engineering improvements that reduce incidents and improve service health.
  • Drive Automation:
  • Eliminate repetitive operational work through automation and tooling.
  • Build systems that improve deployment safety, incident response, remediation workflows, and reliability guardrails.
  • Incident Management:
  • Lead complex incident response efforts across engineering teams.
  • Drive blameless postmortems, identify root causes, and ensure sustainable long-term fixes are implemented.
  • Define and champion best practices around reliability engineering, SLIs/SLOs, capacity management, release engineering, and operational maturity across the company.

Requirements

  • With 100,000+ active communities and approximately 126 million daily active unique visitors, Reddit is one of the internet’s largest sources of information.
  • In this role, you will partner closely with product and infrastructure teams to improve availability, latency, scalability, and operational excellence across Reddit’s most business-critical experiences.
  • 8+ years of experience in Site Reliability Engineering, Infrastructure Engineering, or related roles operating large-scale distributed systems.
  • Strong collaboration and communication skills with the ability to influence technical direction across teams.
  • Strong experience supporting high-traffic, user-facing production environments.
  • Deep understanding of one or more of: distributed systems, networking, Linux systems, cloud-native architectures.
  • Experience designing highly available systems with strong operational and reliability practices.
  • Strong programming skills in languages such as Go, Python, or similar.
  • Strong understanding of observability systems including metrics, logging, tracing, and alerting.
  • Experience improving reliability through SLOs, automation, incident management, and performance optimization.
  • Demonstrated ability to troubleshoot complex issues across applications, infrastructure, networking, and services.
  • Nice to Have:
  • Experience with Kubernetes, containers, cloud infrastructure, and modern deployment platforms.
  • Familiarity with technologies such as Prometheus, Grafana, OpenTelemetry, Envoy, Kafka, ClickHouse, Cassandra, Redis, or similar distributed infrastructure technologies.
  • Experience with CDN optimization, edge reliability, traffic engineering, or global infrastructure.
  • Experience leading large-scale incident response and operational transformation initiatives.
  • In select roles and locations, the interviews will be recorded, transcribed and summarized by artificial intelligence (AI).

Experience

  • 8+ years of experience

Benefits

  • Gender-Affirming Care
  • Private Medical, Dental, and Vision Benefits
  • Personal Retirement Savings Account with matching contribution
  • Flexible Vacation & Paid Volunteer Time Off
  • Generous Paid Parental Leave

Contact

  • For more information, visit www.redditinc.com.

Additional details

  • Reddit is built on shared interests, passion, and trust, and is home to the most open and authentic conversations on the internet.
  • The Experience SRE team sits at the intersection of infrastructure, product engineering, and user experience, ensuring that every interaction across web, mobile, APIs, feeds, media delivery, and real-time systems is fast, reliable, and resilient.
  • This is a highly technical leadership role for someone who thrives in large-scale distributed systems, enjoys solving complex reliability challenges, and can influence engineering culture across the organization.
  • Partner with product and infrastructure engineering teams to design systems that remain highly available and performant under massive global load.
  • Provide technical leadership and mentorship to engineers across SRE and software engineering teams.
  • Help shape reliability culture and raise the operational excellence bar across the organization.
  • Experience operating systems at internet-scale traffic volumes.
  • Contributions to open source software or participation in technical communities.
  • Why join Reddit: You’ll help shape the reliability and performance of one of the internet’s largest platforms, influencing experiences used by millions of people every day.
