
ABOUT THE ROLE
You will join our pre-training team, which builds the distributed training and inference infrastructure for our Large Language Models (LLMs). This is a hands-on role focused on software reliability and fault tolerance: you will work on cross-platform checkpointing, NCCL recovery, and hardware fault detection. You will build high-level tooling, but you should not be afraid of debugging Linux kernel modules. You will have access to thousands of GPUs to test your changes.
Strong engineering skills are a prerequisite. We assume solid knowledge of PyTorch, NVIDIA GPU architecture, reliability concepts, distributed systems, and best coding practices. A basic understanding of LLM training and inference principles is required. We look for fast learners who are prepared for a steep learning curve and are not afraid to step out of their comfort zone.
YOUR MISSION
To help train the world's best foundation models for source code generation
RESPONSIBILITIES
Identify, study, and troubleshoot hardware problems during training at scale
Minimize GPU idle time during faults, both operationally and strategically
Design and develop tools and add-ons to accelerate training recovery
Improve the performance and reliability of checkpointing
Write high-quality Python (PyTorch), Cython, C/C++, and CUDA code
SKILLS & EXPERIENCE
Understanding of Large Language Models (LLM)
Basic knowledge of Transformers
Knowledge of deep learning fundamentals
Strong engineering background
Programming experience
Linux API, Linux kernel
Strong algorithmic skills
Python with NumPy, PyTorch, or JAX
C/C++
NCCL
Willingness to use modern tools and to continuously improve
Strong critical thinking and the ability to question code quality policies when applicable
Distributed systems
Reliability
Observability
Fault-tolerance
Kubernetes (K8s) stack
Take the next step in your career journey