Jobs at Sarvam (Now Hiring) — 5 open

Sarvam

Infrastructure SRE - HPC

Bengaluru, Karnataka, India · On-site

Senior · $41M raised

About Sarvam Sarvam is building the bedrock of Sovereign AI for India. The company is developing India’s full-stack sovereign AI platform, building across research, models, infrastructure and applications with a singular…

Skills: GPU Cluster Operation, Site Reliability Engineering, Python, Go, Kubernetes

Sarvam

AI Engineer - Healthcare

Bengaluru, Karnataka, India · On-site

Mid level · $41M raised

About Sarvam Sarvam is building the bedrock of Sovereign AI for India: a full-stack platform spanning research, models, infrastructure, and applications, with a singular focus on making AI genuinely work for India. We ar…

Skills: LLMs, Prompting, Retrieval Augmented Generation, Fine-tuning, Agentic Systems

Sarvam

Principal Security Engineer

On-site

Senior+ · $41M raised

About Sarvam Sarvam is building the bedrock of Sovereign AI for India. The company is developing India's full-stack sovereign AI platform, building across research, models, infrastructure and applications with a singular…

Skills: Security Strategy, Threat Modeling, Product Security, API Security, Cloud Security

Sarvam

IT Lead

Bengaluru, Karnataka, India · On-site

Senior · $41M raised

About Sarvam Sarvam is building the bedrock of Sovereign AI for India. The company is developing India’s full-stack sovereign AI platform, building across research, models, infrastructure and applications with a singular…

Skills: IdP Administration, Endpoint MDM, EDR Operations, ZTNA Administration, SOC 2 Compliance

Sarvam

Applied AI Engineer, Sarvam Agents

Bengaluru, Karnataka, India · On-site

Mid level · $41M raised

About Sarvam Sarvam is building the bedrock of Sovereign AI for India. The company is developing India’s full-stack sovereign AI platform, building across research, models, infrastructure and applications with a singular…

Skills: Python, LLM API, LangGraph, MCP Servers, RAG

Infrastructure SRE - HPC

Sarvam

Bengaluru, Karnataka, India • On-site

Senior

  • Full-time
  • Posted 2d ago
  • ~40 hrs/week

Responsibilities

Operate and maintain a large multi-vendor GPU fleet for AI training and inference workloads. Build internal tooling and partner with ML teams to ensure fleet health and predictable serving latency.

Requirements

Requires 5+ years of infrastructure or SRE experience, including at least 2 years operating GPU clusters at scale. Proficiency in Python or Go and deep expertise in one of five specialized infrastructure areas is expected.

Full job description

About Sarvam

Sarvam is building the bedrock of Sovereign AI for India. The company is developing India’s full-stack sovereign AI platform, building across research, models, infrastructure and applications with a singular focus on making AI genuinely work for India. Sarvam works with leading enterprises and public institutions and is backed by Lightspeed, Peak XV, and Khosla Ventures. Sarvam partners with India’s leading brands, including Tata Capital, SBI Life, CRED, IDFC, and LIC.

About the Role

Sarvam runs a large, multi-vendor GPU fleet that serves two demanding workloads on the same physical infrastructure: training jobs that span hundreds of GPUs and must run uninterrupted for weeks, and inference services that must hold a flat p99 under production load. Keeping both healthy at once is a hard, specialized reliability problem, and it is the problem this team exists to solve.

This is not a Kubernetes administration role. We assume Kubernetes fluency as a baseline. The difficulty lies above and below it: in parallel filesystems under heavy checkpoint load, in RDMA fabrics that degrade quietly, in NCCL hangs whose root cause may be the network or the kernel, in driver and firmware drift across heterogeneous hardware, and in distributed training failures that masquerade as infrastructure faults.

We are hiring a team of specialists rather than a set of identical generalists. This posting covers five areas of focus. We expect candidates to bring genuine depth in one and working fluency across the others, because on a shared fleet a storage problem often first appears as a training hang, and the engineer on call must route an incident correctly before anyone can resolve it.

When you apply, please indicate the area of focus that best matches your experience. Strong generalists are welcome; we will place you where your depth is most useful.

What You’ll Do

  • Operate the GPU fleet end to end across training and serving: provisioning, observability, capacity, and fleet health.

  • Hold a meaningful on-call rotation, write runbooks that hold up under pressure, and drive postmortems that produce durable fixes.

  • Build the internal tooling the team relies on, rather than operating off-the-shelf systems alone.

  • Partner with ML and platform teams to keep large runs alive and serving latency predictable.

What We're Looking For

  • 5+ years in infrastructure or site reliability engineering, including 2+ years operating GPU clusters at scale.

  • Demonstrated on-call ownership of infrastructure that mattered, with a track record of postmortems that led to real change.

  • Proficiency in Python or Go, used to build and maintain internal tooling.

  • Working fluency across all five areas of focus below, enough to recognize, triage, and route a problem outside your specialty, even if the fix belongs to a teammate.

For the Storage and Fabric areas of focus, we will weigh deep domain expertise against the GPU-cluster requirement; exceptional specialists with less direct GPU-fleet time are encouraged to apply.

Bonus Points

  • Slurm and Kubernetes hybrid environments.

  • On-premise GPU deployment, including coordination with datacenter operations on power, cooling, and InfiniBand cabling.

  • Experience with Indian NCPs, DGX SuperPOD, Lambda, CoreWeave, NeevCloud, etc.

  • Multi-tenant GPU isolation (MIG, MPS, time-slicing) in production.

Why Sarvam?

Sarvam is a fast-moving, high talent-density team building full-stack AI for India, working on problems that push the frontiers of AI with real population-scale impact.

  • Work alongside researchers, engineers, builders, and business leaders who move fast and hold each other to a very high bar.

  • High ownership and high impact, from day one.

  • Everything we do is AI-first, from the way we build and ship to the way we think about problems.

  • You can work on problems that could change how an entire country learns, works, and communicates.

If you want to work on problems at the frontier of AI in India, Sarvam is the place to be.

Related keywords

GPU, SRE, HPC, Kubernetes, RDMA, NCCL, Slurm, InfiniBand, Python, Go, DGX SuperPOD, Lambda, CoreWeave, NeevCloud, MIG, MPS

About Sarvam

Building the full-stack for Generative AI

Industry: Software Development
Company size: 51-200 employees
LinkedIn followers: 94,311
Total funding: $41M

At Sarvam, we're on a mission to build the bedrock of Sovereign AI for India. We’re working toward creating a sovereign AI ecosystem that empowers governments, enterprises, and nonprofits to use GenAI solutions. We are a full-stack Generative AI platform built with sovereign models that deeply understand India’s culture and diversity, enabling AI agents and applications tailored for the nation. Today, Sarvam is backed by top investors, including Lightspeed, Peak XV, and Khosla Ventures. Indus is live in beta. Try now: https://indus.sarvam.ai

Software, Generative AI, Artificial Intelligence (AI)
© 2026 Clera Labs, Inc.