engineering
Posted 2 hours ago
Customer Reliability Engineer
at LazyApply
Hybrid
Responsibilities
- Own a Sev-1 incident where a large financial services customer sees asymmetric latency from a single POP. Trace it through BGP routing and origin configuration. Produce the fix upstream.
- Diagnose a recurring WebSocket disconnect that a media customer has been fighting for weeks. Isolate it to a specific interaction between WAF and their origin load balancer. Drive the fix with Product Engineering.
- Build, with Product Engineering, a distributed tracing capability that correlates Cloudflare edge signals with customer origin metrics so a single query tells the story of a failing request end-to-end.
- Ship a detector for a class of WAF false positives silently degrading several customers. Get it into production before the next renewal cycle.
- Prototype an AI agent that takes a new customer case, pulls relevant logs and config, and proposes a root cause with linked evidence. Deploy it internally.
- Own the most complex, high-severity customer issues end-to-end, from first signal through confirmed resolution.
- Lead deep-dive debugging across the full stack: edge, network, DNS, transport, APIs, application, customer-side configuration.
- Reproduce defects, validate fixes with Engineering, and confirm customer-side resolution.
- Produce postmortems other engineers rely on.
- Analyze support and telemetry signals across the customer base to find systemic risks before they become incidents.
- Define customer-facing reliability metrics (error rates, resolution times, repeat-contact rates) and drive measurable improvement.
- Write automation that reduces mean-time-to-detect and mean-time-to-resolve.
- Manage the technical escalation lifecycle with clear ownership and timely communication.
- Document diagnostic procedures and resolution patterns in runbooks, internal knowledge bases, and AI skills.
- Maintain deep, current expertise across Cloudflare's product portfolio: edge networking, DNS, CDN, WAF, DDoS mitigation, Zero Trust, Workers, and our developer platform.
- Anticipate customer impact from new releases and architecture changes.
- Track record of building internal tooling or diagnostic utilities that measurably improved team efficiency.
Requirements
- Customers on Cloudflare's network see significant improvement in performance and a decrease in spam and other attacks.
- We value candidates who have the instinct to spot a "normalized" problem and the AI-native curiosity to create a solution using the latest tools.
- Our culture is built on iteration, leveraging AI to ship faster today to make it better tomorrow, while ensuring that every improvement, no matter how small, is shared across the team to lift everyone up.
- If you’re the type of person who values curiosity over bureaucracy and believes that AI is a partner in solving tough problems to keep the Internet moving forward, you’ll fit right in.
- Cloudflare is building CRE as an AI-native function.
- Engineers who ship AI-assisted diagnostics are the ones defining this discipline.
- Serve as a go-to subject-matter expert in one or more domains.
- Experience in site reliability engineering, escalation engineering, systems engineering, or a comparable deeply technical support or operations role, with at least 2 years in customer-facing environments.
- TCP/IP fundamentals: OSI model, IPv4/IPv6 addressing, subnetting, routing, switching.
- Core protocols: DNS, HTTP/S, TLS/SSL, SMTP, SNMP, NTP.
- Routing protocols: BGP, OSPF, including path selection and route propagation.
- Firewall concepts: stateful/stateless inspection, rule sets, NAT, ACLs.
- VPN and encryption: IPSec, SSL/TLS tunnels, GRE.
- Proficiency with observability and diagnostic tooling: packet capture and analysis (Wireshark, tcpdump), log aggregation (Kibana, Elasticsearch), metrics dashboards (Grafana), distributed tracing.
- Strong scripting and automation skills (Bash, Python) with a track record of shipping tooling that improves reliability and reduces toil.
- Experience with incident management, postmortem culture, and SLO/SLI-based reliability practices.
- Experience with direct customer-facing accountability.
- Deep expertise at both L3/L4 (network infrastructure) and L7 (application protocols, DNS, HTTP, WebSocket).
- Expert-level proficiency with Linux command-line tools: curl, dig, git, traceroute, mtr, strace, ss.
- Data-at-scale analysis using SQL, PromQL, or equivalent.
- Familiarity with CI/CD pipelines, infrastructure-as-code (Terraform, Pulumi), and container orchestration (Kubernetes, Docker).
- Experience applying AI/ML to production engineering or operational workflows.
- Experience with Workers, Pages, R2, D1, or other developer platform services.
- Experience across AWS, Azure, or GCP.
- Web programming (HTML, JavaScript) and regular expressions.
- Chaos engineering or formal reliability frameworks (e.g., Google SRE principles).
- Fundamental to our mission to help build a better Internet is protecting the free and open Internet.
- Please note that applicants who progress to the offer stage of the interview process may be asked to attend an in-person interview within one of the Cloudflare Offices or Cloudflare Hubs. More details about this will be available at that stage of the interview process.
Benefits
- Cloudflare built its reputation helping build a better Internet, defending millions of sites, giving away SSL and DDoS mitigation when the industry charged premium prices.
- Health systems depend on us to provide care.
- Comfort engaging directly with enterprise customer engineering teams, including on calls during incidents.
Bonus Points
- Managing or configuring non-HTTP services: email, DNS authoritative/recursive, FTP, SSH.
Equity
- This role is eligible to participate in Cloudflare's equity plan.
- Our benefits programs can help you pay health care expenses, support caregiving, build capital for the future, and make life a little easier and more fun. Below is a description of our benefits.
Health & Welfare Benefits
- Medical/Rx Insurance
- Dental Insurance
- Vision Insurance
- On-demand mental health support and Employee Assistance Program
- Global Travel Medical Insurance
Financial Benefits
- Short and Long Term Disability Insurance
- Life & Accident Insurance
- 401(k) Retirement Savings Plan
- Employee Stock Participation Plan
Time Off
- Flexible paid time off covering vacation and sick leave
- Leave programs, including parental, pregnancy health, medical, and bereavement leave
- All qualified applicants will be considered for employment without regard to their, or any other person's, perceived or actual race, color, religion, sex, gender, gender identity, gender expression, sexual orientation, national origin, ancestry, citizenship, age, physical or mental disability, medical condition, family care status, or any other basis protected by law.
Contact
- Examples of reasonable accommodations include, but are not limited to, changing the application process, providing documents in an alternate format, using a sign language interpreter, or using specialized equipment. If you require a reasonable accommodation to apply for a job, please contact us via e-mail at hr@cloudflare.com or via mail at 101 Townsend St. San Francisco, CA 94107.
Additional details
- At Cloudflare, we are on a mission to help build a better Internet.
- Today the company runs one of the world’s largest networks that powers millions of websites and other Internet properties for customers ranging from individual bloggers to SMBs to Fortune 500 companies.
- Cloudflare protects and accelerates any Internet application online without adding hardware, installing software, or changing a line of code.
- Internet properties powered by Cloudflare all have web traffic routed through its intelligent global network, which gets smarter with every request.
- Cloudflare was named to Entrepreneur Magazine’s Top Company Cultures list and ranked among the World’s Most Innovative Companies by Fast Company.
- At Cloudflare, we’re not looking for people who wait for a polished roadmap; we’re looking for the builders who see the cracks in the Internet that everyone else has simply learned to live with.
- In an increasingly dangerous world, the scope of that mission has changed.
- We are becoming something more: critical infrastructure.
- Banks run their payment rails on us. Governments run public services on us. Media companies depend on us during live events.
- Reliability for these customers is no longer a feature of our product. It is a mission.