
ABOUT THE ROLE
You will join our pre-training team, which builds the distributed training and inference infrastructure for our Large Language Models (LLMs). This is a hands-on role focused on software reliability and fault tolerance: you will work on cross-platform checkpointing, NCCL recovery, and hardware fault detection. You will build high-level tooling, but you should not be afraid of debugging Linux kernel modules. You will have access to thousands of GPUs to test your changes.
Strong engineering skills are a prerequisite. We assume solid knowledge of PyTorch, NVIDIA GPU architecture, reliability concepts, distributed systems, and best coding practices. A basic understanding of LLM training and inference principles is required. We look for fast learners who are prepared for a steep learning curve and are not afraid to step out of their comfort zone.
YOUR MISSION
To help train the world's best foundation models for source code generation
RESPONSIBILITIES
Identify, study, and troubleshoot hardware problems during training at scale
Minimize GPU idle time during faults, both operationally and strategically
Design and develop tools and add-ons to accelerate training recovery
Improve the performance and reliability of checkpointing
Write high-quality Python (PyTorch), Cython, C/C++, and CUDA code
SKILLS & EXPERIENCE
Understanding of Large Language Models (LLM)
Basic knowledge of Transformers
Knowledge of deep learning fundamentals
Strong engineering background
Programming experience
Linux API, Linux kernel
Strong algorithmic skills
Python with NumPy, PyTorch, or JAX
C/C++
NCCL
Willingness to use modern tools and to continuously improve
Strong critical thinking and the ability to question code quality policies when applicable
Distributed systems
Reliability
Observability
Fault-tolerance
Kubernetes (K8s) stack
Take the next step in your career journey