Posted Apr 24
Staff Data Engineer
New York, United States
On-site
Responsibilities
- Own the data layer and architecture: the models, schemas, and infrastructure decisions that everything downstream depends on
- Build and operate the pipelines and transformations that move data from ingestion through normalization, enrichment, and into the formats that support analytics, ML training, and production model serving
- Own data quality and observability: build the systems that make data issues visible and correctable before they compound
- Define how clinical and operational data is governed across the system
- Evaluate and select the tools and technologies that make up the data stack, with a clear point of view on build vs. buy
- Partner with ML and engineering teams to identify what's modelable, define training data requirements, and build the data foundations for new predictive capabilities
By combining deep expertise in clinical trials with cutting-edge AI, we empower research teams and study sponsors to expand and expedite access to novel therapeutics for patients in need.
About the Role
Your job is to build the pipelines, data models, and AI infrastructure that make this asset real, from ingestion and normalization through to the systems that power predictions on top of it.
You'll also have a direct hand in shaping how this data drives our AI strategy: what we model, what we predict, and what becomes possible.
Requirements
- Experience in data engineering or related roles, with significant time spent building data systems
- Experience with healthcare data (HL7, FHIR, claims, EHR extracts) strongly preferred, or with other complex, regulated data domains
- Experience applying AI and LLMs to data engineering problems: extraction, normalization, classification, entity resolution
- Strong understanding of how data infrastructure supports ML workflows, from feature engineering to training data pipelines to model serving
- Fluent in SQL and at least one modern programming language (Python, Java, Scala, Go), with experience across modern data infrastructure: distributed processing, streaming, cloud-native storage, orchestration, and transformation frameworks
- Experience building data infrastructure that directly supports ML model training and evaluation