Staff Machine Learning Engineer, AI Serving

at Reddit, Inc.

United States (Remote)

Python, AWS, Kubernetes, Terraform, GCP, PyTorch

Responsibilities

Lead the end-to-end design, implementation, and maintenance of a highly available, low-latency GPU-based model serving system for search, ranking, and LLMs, supporting millions of QPS.
Design and develop ML and Generative AI systems in cloud-based production environments on Kubernetes at scale.
Lead a unified GPU model export framework to support converting trained models into optimized GPU inference models.
Build an end-to-end (E2E) inference performance benchmarking framework.

Requirements

With 100,000+ active communities and approximately 126 million daily active unique visitors, Reddit is one of the internet’s largest sources of information.
The Machine Learning Platform team at Reddit is a high-impact team that owns the infrastructure that powers recommendations, content discovery, and user and content quantification, while directly impacting other teams such as Growth, Ads, Feeds, and Core Machine Learning.

What You’ll Do:
As a Staff Machine Learning Engineer, you will lead the development of a large-scale ML Inference Platform at Reddit.
Strong understanding of real-time ML observability to track feature and model performance.
Experience working with LLM serving online at scale.
Deep understanding of multi-cluster compute environments and network topology specific to ML inference use cases.

Who You Might Be:
7+ years of experience in ML Engineering, AI Platform Engineering, or Cloud AI Deployment roles.
Experience operating orchestration systems such as Kubernetes at scale.
Deep experience with cloud-based technologies for supporting an ML platform, including tools like AWS, Google Cloud Storage, infrastructure-as-code (Terraform), and more.
Proficiency with the common programming languages and frameworks of ML, such as Go, Python, etc.
Excellent communication skills with the ability to articulate technical AI concepts to non-technical stakeholders.
Strong knowledge of model serving, inference pipelines, monitoring, and observability for AI systems is a plus.
Strong proficiency in Python and deep experience with modern AI/ML frameworks (Triton, Dynamo, vLLM, PyTorch).
