About this role
About the Team

The ML Infrastructure team builds the systems that power every stage of Decagon's model lifecycle. We own the platforms for model training, the infrastructure for model evaluation and experimentation, and the routing layer that manages inference across multiple providers.

We work at the intersection of research and production: translating cutting-edge ML models into reliable, scalable systems that run in customer environments. We collaborate closely with Research, Infrastructure, and Product teams to ensure models train efficiently, serve reliably, and deliver exceptional user experiences.

The team values technical rigor, pragmatic decision-making, and building systems that others love to use.

About the Role

We're hiring a Senior ML Infrastructure Engineer to own the platforms powering Decagon's model training and inference. You'll build distributed training systems, design inference architecture across multiple providers, and create the frameworks that let our Research and Product teams ship faster.

This role is for someone who thrives on technical depth, can lead multi-quarter initiatives, and wants to shape the long-term architecture of our ML stack.

In this role, you will

- Design and build distributed training platforms for LLM and multimodal fine-tuning and post-training at scale
- Integrate state-of-the-art training algorithms into production pipelines
- Own inference architecture and multi-provider routing, including failover and optimization
- Lead initiatives to improve latency and cost efficiency across the training and serving stack
- Build evaluation and experimentation infrastructure that enables rapid, reliable iteration
- Drive technical direction, mentor engineers, and establish best practices for ML infrastructure

Your background looks something like this

- 8+ years building ML infrastructure or production systems at scale
- Deep experience with distributed training: multi-node GPU clusters, fault tolerance, and optimization
- Strong understanding of LLM inference: latency optimization, provider tradeoffs, and serving architecture
- Proven track record leading complex, multi-quarter technical projects