infrastructure
Posted Apr 2, 2024
Site Reliability Engineer
at Mistral AI
Paris, France
Remote
Responsibilities
- Design, build, and maintain scalable, highly available and fault-tolerant infrastructure to support our web services and ML workloads
- Operate systems and troubleshoot issues in production environments (interrupts, on-call responses, user administration, data extraction, infrastructure scaling, etc.)
- Implement and improve monitoring, alerting, and incident response systems to ensure optimal system performance and minimize downtime
- Implement and maintain workflows and tools (CI/CD, containerization, orchestration, monitoring, logging and alerting systems) for both our client-facing APIs and large training runs
- Drive continuous improvement in infrastructure automation, deployment, and orchestration using tools such as Kubernetes, Flux, and Terraform
- Collaborate with AI/ML researchers to develop and implement solutions that enable safe and reproducible model-training experiments
- Build a cloud-agnostic platform offering an abstraction layer between science and infrastructure
- Design and develop new workflows and tooling to improve the reliability, availability and performance of our systems (automation scripts, refactoring, new API-based features, web apps, dashboards, etc.)
- Collaborate with the security team to ensure infrastructure adheres to best security practices and compliance requirements
- Document processes and procedures to ensure consistency and knowledge sharing across the team
About Mistral
- At Mistral AI, we believe in the power of AI to simplify tasks, save time, and enhance learning and creativity.
- We democratize AI through high-performance, optimized, open-source and cutting-edge models, products and solutions.
- Our comprehensive AI platform is designed to meet enterprise needs, whether on-premises or in cloud environments.
- Our offerings include le Chat, the AI assistant for life and work.
- We are a dynamic, collaborative team passionate about AI and its potential to transform society.
- Join us to be part of a pioneering company shaping the future of AI.
- See more about our culture on https://mistral.ai/careers.
Requirements
- Master’s degree in Computer Science, Engineering or a related field
- Strong experience with cloud computing and highly available distributed systems
- Experience working against reliability KPIs (observability, alerting, SLAs)
- Hands-on experience with CI/CD, containerization and orchestration tools (Docker, Kubernetes...)
- Knowledge of monitoring, logging, alerting and observability tools (Prometheus, Grafana, ELK Stack, Datadog...)
- Familiarity with infrastructure-as-code tools like Terraform or CloudFormation
- Proficiency in scripting languages (Python, Go, Bash...) and knowledge of software development best practices
- Strong understanding of networking, security, and system administration concepts
- Self-motivated and able to work well in a fast-paced startup environment
Your application will be all the more interesting if you also have:
- Experience in an AI/ML environment
- Experience with high-performance computing (HPC) systems and workload managers (Slurm)
Experience
- 7+ years of experience in a DevOps/SRE role
Additional details
- Our technology is designed to integrate seamlessly into daily working life.
- Our diverse workforce thrives in competitive environments and is committed to driving innovation.
- Our teams are distributed between France, USA, UK, Germany and Singapore.
Role Summary
- We are seeking highly experienced Site Reliability Engineers (SRE) to shape the reliability, scalability and performance of our platform and customer-facing applications.
- You will work closely with our software engineers and research teams to ensure our systems meet and exceed our internal and external customers' expectations.
What you will do
- As a Site Reliability Engineer, you balance the day-to-day operations on production systems with long-term software engineering improvements to reduce operational toil and foster the reliability, availability, and performance of these systems.
Operations
- Make sure our platform, inference and model training environments are always highly available and enable seamless replication of work environments across several HPC clusters
- Participate occasionally in on-call rotations to respond to incidents and perform root cause analysis to prevent future occurrences
Development
- Contribute to open-source projects, research publications, blog articles and conferences
About you
- Exposure to site reliability issues in critical environments (issue root cause analysis, in-production troubleshooting, on-call rotations...)