infrastructure
Posted May 8
Staff Site Reliability Engineer - Site Experience
at Reddit, Inc.
Dublin, Ireland
On-site
Responsibilities
- Lead Reliability Engineering for User Experience
  - Drive reliability, scalability, and operational excellence for critical user-facing systems and services.
  - Improve performance and resiliency across APIs, content delivery, feed generation, search, messaging, and real-time experiences.
- Architect for Scale
  - Guide architectural decisions around failover, redundancy, graceful degradation, traffic management, and capacity planning.
  - Identify systemic risks and reliability bottlenecks across services, dependencies, deployments, and infrastructure.
  - Build proactive mitigation strategies and drive engineering improvements that reduce incidents and improve service health.
- Drive Automation
  - Eliminate repetitive operational work through automation and tooling.
  - Build systems that improve deployment safety, incident response, remediation workflows, and reliability guardrails.
- Incident Management
  - Lead complex incident response efforts across engineering teams.
  - Drive blameless postmortems, identify root causes, and ensure sustainable long-term fixes are implemented.
- Define and champion best practices around reliability engineering, SLIs/SLOs, capacity management, release engineering, and operational maturity across the company.
Requirements
- With 100,000+ active communities and approximately 126 million daily active unique visitors, Reddit is one of the internet’s largest sources of information.
- In this role, you will partner closely with product and infrastructure teams to improve availability, latency, scalability, and operational excellence across Reddit’s most business-critical experiences.
- 8+ years of experience in Site Reliability Engineering, Infrastructure Engineering, or related roles operating large-scale distributed systems.
- Strong collaboration and communication skills, with the ability to influence technical direction across teams.
- Strong experience supporting high-traffic, user-facing production environments.
- Deep understanding of one or more of: distributed systems, networking, Linux systems, cloud-native architectures.
- Experience designing highly available systems with strong operational and reliability practices.
- Strong programming skills in languages such as Go, Python, or similar.
- Strong understanding of observability systems, including metrics, logging, tracing, and alerting.
- Experience improving reliability through SLOs, automation, incident management, and performance optimization.
- Demonstrated ability to troubleshoot complex issues across applications, infrastructure, networking, and services.
Nice to Have
- Experience with Kubernetes, containers, cloud infrastructure, and modern deployment platforms.
- Familiarity with technologies such as Prometheus, Grafana, OpenTelemetry, Envoy, Kafka, ClickHouse, Cassandra, Redis, or similar distributed infrastructure technologies.
- Experience with CDN optimization, edge reliability, traffic engineering, or global infrastructure.
- Experience leading large-scale incident response and operational transformation initiatives.
- In select roles and locations, the interviews will be recorded, transcribed and summarized by artificial intelligence (AI).
Experience
- 8+ years of experience
Benefits
- Gender-Affirming Care
- Private Medical, Dental, and Vision Benefits
- Personal Retirement Savings Account with matching contribution
- Flexible Vacation & Paid Volunteer Time Off
- Generous Paid Parental Leave
Contact
- For more information, visit www.redditinc.com.
Additional details
- Reddit is built on shared interests, passion, and trust, and is home to the most open and authentic conversations on the internet.
- The Site Experience SRE team sits at the intersection of infrastructure, product engineering, and user experience, ensuring that every interaction across web, mobile, APIs, feeds, media delivery, and real-time systems is fast, reliable, and resilient.
- This is a highly technical leadership role for someone who thrives in large-scale distributed systems, enjoys solving complex reliability challenges, and can influence engineering culture across the organization.
What you’ll do:
- Partner with product and infrastructure engineering teams to design systems that remain highly available and performant under massive global load.
- Provide technical leadership and mentorship to engineers across SRE and software engineering teams.
- Help shape reliability culture and raise the operational excellence bar across the organization.
- Experience operating systems at internet-scale traffic volumes.
- Contributions to open source software or participation in technical communities.
- Why join Reddit? You’ll help shape the reliability and performance of one of the internet’s largest platforms, influencing experiences used by millions of people every day.