Customer Reliability Engineer

at LazyApply

Hybrid

AWS GCP Azure Terraform Pulumi Python Elasticsearch Kubernetes Docker Grafana

Responsibilities

Own a Sev-1 incident where a large financial services customer sees asymmetric latency from a single POP. Trace it through BGP routing and origin configuration. Produce the fix upstream.
Diagnose a recurring WebSocket disconnect that a media customer has been fighting for weeks. Isolate it to a specific interaction between WAF and their origin load balancer. Drive the fix with Product Engineering.
Build, with Product Engineering, a distributed tracing capability that correlates Cloudflare edge signals with customer origin metrics so a single query tells the story of a failing request end-to-end.
Ship a detector for a class of WAF false positives silently degrading several customers. Get it into production before the next renewal cycle.
Prototype an AI agent that takes a new customer case, pulls relevant logs and config, and proposes a root cause with linked evidence. Deploy it internally.
Own the most complex, high-severity customer issues end-to-end, from first signal through confirmed resolution.
Lead deep-dive debugging across the full stack: edge, network, DNS, transport, APIs, application, customer-side configuration.
Reproduce defects, validate fixes with Engineering, and confirm customer-side resolution.
Produce postmortems other engineers rely on.
Analyze support and telemetry signals across the customer base to find systemic risks before they become incidents.
Define customer-facing reliability metrics (error rates, resolution times, repeat-contact rates) and drive measurable improvement.
Write automation that reduces mean-time-to-detect and mean-time-to-resolve.
Manage the technical escalation lifecycle with clear ownership and timely communication.
Document diagnostic procedures and resolution patterns in runbooks, internal knowledge bases, and AI skills.
Maintain deep, current expertise across Cloudflare's product portfolio: edge networking, DNS, CDN, WAF, DDoS mitigation, Zero Trust, Workers, and our developer platform.
Anticipate customer impact from new releases and architecture changes.
Track record of building internal tooling or diagnostic utilities that measurably improved team efficiency.

Requirements

As a result, Cloudflare customers see significant improvement in performance and a decrease in spam and other attacks.
We value candidates who have the instinct to spot a "normalized" problem and the AI-native curiosity to create a solution using the latest tools.
Our culture is built on iteration, leveraging AI to ship faster today to make it better tomorrow, while ensuring that every improvement, no matter how small, is shared across the team to lift everyone up.
If you’re the type of person who values curiosity over bureaucracy and sees AI as a partner in solving tough problems to keep the Internet moving forward, you’ll fit right in.
Cloudflare is building CRE as an AI-native function.
Engineers who ship AI-assisted diagnostics are the ones defining this discipline.
Serve as a go-to subject-matter expert in one or more domains.
Experience in site reliability engineering, escalation engineering, systems engineering, or a comparable deeply technical support / operations role, with at least 2 years in customer-facing environments.
TCP/IP fundamentals: OSI model, IPv4/IPv6 addressing, subnetting, routing, switching.
Core protocols: DNS, HTTP/S, TLS/SSL, SMTP, SNMP, NTP.
Routing protocols: BGP, OSPF, including path selection and route propagation.
Firewall concepts: stateful/stateless inspection, rule sets, NAT, ACLs.
VPN and encryption: IPSec, SSL/TLS tunnels, GRE.
Proficiency with observability and diagnostic tooling: packet capture and analysis (Wireshark, tcpdump), log aggregation (Kibana, Elasticsearch), metrics dashboards (Grafana), distributed tracing.
Strong scripting and automation skills (Bash, Python) with a track record of shipping tooling that improves reliability and reduces toil.
Experience with incident management, postmortem culture, and SLO/SLI-based reliability practices.
Experience with direct customer-facing accountability.
Deep expertise at both L3/L4 (network infrastructure) and L7 (application protocols, DNS, HTTP, WebSocket).
Expert-level proficiency with Linux command-line tools: curl, dig, git, traceroute, mtr, strace, ss.
Data-at-scale analysis using SQL, PromQL, or equivalent.
Familiarity with CI/CD pipelines, infrastructure-as-code (Terraform, Pulumi), and container orchestration (Kubernetes, Docker).
Experience applying AI/ML to production engineering or operational workflows.
Experience with Workers, Pages, R2, D1, or other developer platform services.
Experience across AWS, Azure, or GCP.
Web programming (HTML, JavaScript) and regular expressions.
Chaos engineering or formal reliability frameworks (e.g., Google SRE principles).
Fundamental to our mission to help build a better Internet is protecting the free and open Internet.
Please note that applicants who progress to the offer stage of the interview process may be asked to attend an in-person interview within one of the Cloudflare Offices or Cloudflare Hubs. More details about this will be available at that stage of the interview process.

Benefits

Cloudflare built its reputation helping build a better Internet: defending millions of sites and giving away SSL and DDoS mitigation when the industry charged premium prices.
Health systems depend on us to provide care.
Comfort engaging directly with enterprise customer engineering teams, including on calls during incidents.
Bonus Points
Managing or configuring non-HTTP services: email, DNS authoritative/recursive, FTP, SSH.
Equity
This role is eligible to participate in Cloudflare's equity plan.
Our benefits programs can help you pay health care expenses, support caregiving, build capital for the future, and make life a little easier and more fun. Below is a description of our benefits.
Health & Welfare Benefits
Medical/Rx Insurance
Dental Insurance
Vision Insurance
On-demand mental health support and Employee Assistance Program
Global Travel Medical Insurance
Financial Benefits
Short and Long Term Disability Insurance
Life & Accident Insurance
401(k) Retirement Savings Plan
Employee Stock Participation Plan
Time Off
Flexible paid time off covering vacation and sick leave
Leave programs, including parental, pregnancy health, medical, and bereavement leave
All qualified applicants will be considered for employment without regard to their, or any other person's, perceived or actual race, color, religion, sex, gender, gender identity, gender expression, sexual orientation, national origin, ancestry, citizenship, age, physical or mental disability, medical condition, family care status, or any other basis protected by law.

Contact

Examples of reasonable accommodations include, but are not limited to, changing the application process, providing documents in an alternate format, using a sign language interpreter, or using specialized equipment. If you require a reasonable accommodation to apply for a job, please contact us via e-mail at hr@cloudflare.com or via mail at 101 Townsend St. San Francisco, CA 94107.

Additional details

At Cloudflare, we are on a mission to help build a better Internet.
Today the company runs one of the world’s largest networks that powers millions of websites and other Internet properties for customers ranging from individual bloggers to SMBs to Fortune 500 companies.
Cloudflare protects and accelerates any Internet application online without adding hardware, installing software, or changing a line of code.
Internet properties powered by Cloudflare all have web traffic routed through its intelligent global network, which gets smarter with every request.
Cloudflare was named to Entrepreneur Magazine’s Top Company Cultures list and ranked among the World’s Most Innovative Companies by Fast Company.
At Cloudflare, we’re not looking for people who wait for a polished roadmap; we’re looking for the builders who see the cracks in the Internet that everyone else has simply learned to live with.
In an increasingly dangerous world, the scope of that mission has changed.
We are becoming something more: critical infrastructure.
Banks run their payment rails on us. Governments run public services on us. Media companies depend on us during live events.
Reliability for these customers is no longer a feature of our product. It is a mission.
