Inference

Senior Software Engineer - Model Performance

Full-time • San Francisco • $220k - $320k

Summary

Location: San Francisco
Salary: $220k - $320k
Type: Full-time
Experience: 2-5 years

Company links: Website • LinkedIn

About this role

Help us make inference blazingly fast. If you love squeezing every last drop of performance out of GPUs, diving deep into CUDA kernels, and turning optimization techniques into production systems, we'd love to meet you.

About Inference.net

Inference.net trains and hosts specialized language models for companies that need frontier-quality AI at a fraction of the cost. The models we train match GPT-5 accuracy but are smaller, faster, and up to 90% cheaper. Our platform handles everything end-to-end: distillation, training, evaluation, and planet-scale hosting.

We are a well-funded ten-person team of engineers working in person in downtown San Francisco on difficult, high-impact engineering problems. Everyone on the team has been writing code for over 10 years and has founded and run their own software company. We are high-agency, adaptable, and collaborative, and we value creativity alongside technical prowess and humility. We work hard and deeply enjoy the work we do. Most of us are in the office four days a week; hybrid arrangements work for Bay Area candidates.

About the Role

You will be responsible for making our inference stack as fast and efficient as possible. Your work spans from implementing known optimization techniques to experimenting with novel approaches, always with the goal of serving models faster and cheaper at scale.

Your north star is inference performance: latency, throughput, cost efficiency, and how quickly we can bring new model architectures into production. You'll work across the full inference stack, from CUDA kernels to serving frameworks, to find and eliminate bottlenecks. This role reports directly to the founding team, and you'll have the autonomy, a large compute budget, and the technical support to push the limits of what's possible in model serving.
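
To make those metrics concrete, below is a minimal sketch of the two numbers this work revolves around, time to first token and generation throughput, measured against an OpenAI-compatible completions endpoint such as the one vLLM serves. The URL and model name are placeholders, and counting one streamed chunk as one token is only a rough approximation.

    # Minimal latency/throughput probe against an OpenAI-compatible
    # completions endpoint (vLLM serves one). The URL and model name
    # are placeholders; one streamed chunk ~= one token is approximate.
    import json
    import time

    import requests

    URL = "http://localhost:8000/v1/completions"  # assumed local server

    payload = {
        "model": "my-model",              # placeholder model name
        "prompt": "The quick brown fox",
        "max_tokens": 128,
        "stream": True,
    }

    start = time.perf_counter()
    first_token_at = None
    n_chunks = 0
    with requests.post(URL, json=payload, stream=True, timeout=120) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if not line.startswith(b"data: "):
                continue
            data = line[len(b"data: "):]
            if data == b"[DONE]":
                break
            chunk = json.loads(data)
            if chunk["choices"][0].get("text"):
                if first_token_at is None:
                    first_token_at = time.perf_counter()
                n_chunks += 1

    elapsed = time.perf_counter() - start
    assert first_token_at is not None, "no tokens generated"
    print(f"time to first token: {(first_token_at - start) * 1e3:.1f} ms")
    print(f"throughput: {n_chunks / elapsed:.1f} tok/s (approx.)")

A production benchmark would sweep batch sizes, prompt lengths, and concurrency levels; this only shows the shape of the measurement.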

Key Responsibilities

  • Implement and productionize optimization techniques including quantization, speculative decoding, KV cache optimization, continuous batching, and LoRA serving

  • Deep dive into inference frameworks (vLLM, SGLang, TensorRT-LLM) and underlying libraries to debug and improve performance

  • Profile and optimize CUDA kernels and GPU utilization across our serving infrastructure (a first-pass profiling sketch follows this list)

  • Add support for new model architectures, ensuring they meet our performance standards before going to production

  • Experiment with novel inference techniques and bring successful approaches into production

  • Build tooling and benchmarks to measure and track inference performance across our fleet

  • Collaborate with applied ML engineers to ensure trained models can be served efficiently
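
As referenced above, here is a minimal sketch of what a first profiling pass with torch.profiler might look like. The toy MLP and input stand in for a real serving workload, and a CUDA-capable GPU is assumed.

    # First-pass GPU hotspot hunting with torch.profiler. The toy MLP
    # stands in for a real serving workload; assumes a CUDA GPU.
    import torch
    from torch.profiler import ProfilerActivity, profile

    model = torch.nn.Sequential(          # placeholder workload
        torch.nn.Linear(4096, 4096),
        torch.nn.GELU(),
        torch.nn.Linear(4096, 4096),
    ).cuda().half()
    x = torch.randn(32, 4096, device="cuda", dtype=torch.half)

    # Warm up so one-time setup costs don't pollute the numbers.
    with torch.no_grad():
        for _ in range(3):
            model(x)
    torch.cuda.synchronize()

    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
                 record_shapes=True) as prof:
        with torch.no_grad():
            for _ in range(10):
                model(x)
        torch.cuda.synchronize()

    # Rank kernels by time spent on the GPU, not CPU-side launch cost.
    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

Sorting by GPU time rather than CPU time is the detail that matters: kernel launch overhead and actual device work are easy to conflate.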

Requirements

  • 2+ years of experience in ML systems, inference optimization, or GPU programming

  • Strong proficiency in Python and familiarity with C++

  • Hands-on experience with LLM inference frameworks (vLLM, SGLang, TensorRT-LLM, or similar)

  • Deep understanding of GPU architecture and experience profiling GPU workloads

  • Familiarity with LLM optimization techniques (quantization, speculative decoding, continuous batching, KV cache management); a toy sketch of speculative decoding follows this list

  • Experience with PyTorch and understanding of how models execute on hardware

  • Track record of measurably improving system performance
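
For readers new to speculative decoding, the accept/reject rule at its core is small enough to show whole. This is a toy with stand-in distributions over a tiny vocabulary, not any framework's real implementation: a cheap drafter proposes k tokens, and the target accepts token x with probability min(1, p_target(x)/p_draft(x)), resampling from the residual on rejection, which keeps the output distribution exactly equal to sampling from the target.

    # Toy speculative decoding (the accept/reject rule from
    # Leviathan et al., 2023). The "models" here are stand-in
    # distributions over a tiny vocabulary, purely illustrative.
    import torch

    VOCAB = 16

    def toy_probs(ctx, temperature):
        # Deterministic toy distribution seeded by the context, so the
        # draft and target "models" score the same positions consistently.
        g = torch.Generator().manual_seed(hash(tuple(ctx)) % (2**31))
        return torch.softmax(torch.randn(VOCAB, generator=g) / temperature, -1)

    def draft_probs(ctx):
        return toy_probs(ctx, 2.0)    # smoother: the cheap drafter

    def target_probs(ctx):
        return toy_probs(ctx, 1.0)    # the expensive model we must match

    def speculative_step(ctx, k=4):
        # 1) The drafter proposes k tokens autoregressively.
        proposed, draft_dists, c = [], [], list(ctx)
        for _ in range(k):
            p = draft_probs(c)
            tok = int(torch.multinomial(p, 1))
            proposed.append(tok)
            draft_dists.append(p)
            c.append(tok)
        # 2) The target verifies: accept x with prob min(1, p_t[x]/p_d[x]);
        #    on rejection, resample from the residual max(p_t - p_d, 0).
        out, c = [], list(ctx)
        for tok, p_d in zip(proposed, draft_dists):
            p_t = target_probs(c)
            if torch.rand(()) < min(1.0, float(p_t[tok] / p_d[tok])):
                out.append(tok)
                c.append(tok)
            else:
                residual = torch.clamp(p_t - p_d, min=0.0)
                out.append(int(torch.multinomial(residual / residual.sum(), 1)))
                break  # draft tokens after a rejection are discarded
        return out

    print(speculative_step([1, 2, 3]))

The speedup comes from verifying several draft tokens per target forward pass; the verification math is what keeps it lossless.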

Nice-to-Have

  • Experience with CUDA programming

  • Familiarity with serving non-LLM models (TTS, vision, embeddings)

  • Experience with distributed inference and multi-GPU serving

  • Contributions to open-source inference frameworks

  • Experience with Docker and Kubernetes

You don't need to tick every box. Curiosity and the ability to learn quickly matter more.

Compensation

We offer competitive compensation, equity in a high-growth startup, and comprehensive benefits. The base salary range for this role is $220,000 - $320,000, plus equity and benefits, depending on experience.

Equal Opportunity

Inference.net is an equal opportunity employer. We welcome applicants from all backgrounds and don't discriminate based on race, color, religion, gender, sexual orientation, national origin, genetics, disability, age, or veteran status.

If you're excited about making AI inference faster for everyone, we'd love to hear from you. Please send your resume and GitHub to [email protected] and/or apply here on Ashby.


About Inference

Inference.net trains and hosts specialized language models for companies that need frontier-quality AI at a fraction of the cost.


Frequently Asked Questions

What does Inference pay for a Senior Software Engineer - Model Performance?

Inference offers a competitive compensation package for the Senior Software Engineer - Model Performance role. The base salary range is $220k - $320k per year, plus equity and benefits. Apply through Clera to learn more about the full compensation details.

What does a Senior Software Engineer - Model Performance do at Inference?

As a Senior Software Engineer - Model Performance at Inference, you will be responsible for making the inference stack as fast and efficient as possible: implementing optimization techniques such as quantization and speculative decoding, and experimenting with novel approaches to serve models faster and cheaper at scale.

Is the Senior Software Engineer - Model Performance position at Inference remote?

The Senior Software Engineer - Model Performance position at Inference is based in San Francisco, United States, with hybrid arrangements possible for Bay Area candidates. Contact the company through Clera for specific work arrangement details.

How do I apply for the Senior Software Engineer - Model Performance position at Inference?

You can apply for the Senior Software Engineer - Model Performance position at Inference directly through Clera. Click the "Apply Now" button above to start your application. Clera's AI-powered platform will help match your profile with this opportunity and guide you through the application process.
