
Team: Core Speech Research
Location: Bangalore, India
Type: Full-time
Experience: No fixed bar — skill and depth matter more than years
Smallest.ai builds real-time voice intelligence systems operating at enterprise scale.
We work across speech recognition, speech generation, and speech-to-speech systems with a strong focus on low latency, multilingual intelligence, and production reliability.
Our goal is simple: Smaller models. Lower latency. Higher intelligence.
As a Speech Research Scientist, you will work on the core speech stack at Smallest.ai.
You will research, train, evaluate, and productionize models across:
Speech to Text (ASR)
Text to Speech (TTS)
Speech to Speech (S2S)
This is not an offline research role.
You will work at the intersection of research, engineering, and real-world deployment.
Streaming and non-streaming ASR
Multilingual and code-mixed speech
Low-latency decoding and inference
Long-context speech modeling
Robustness to accents, noise, and telephony audio
Neural TTS and generative speech models
Controllable speech generation including emotion, style, pitch, rate, and prosody
Speaker adaptation and voice cloning
Stability, expressiveness, and naturalness optimization
End-to-end speech-to-speech models
Streaming voice-to-voice architectures
Codec-based or token-based speech representations
Low-latency conversational speech generation
Multilingual speaker understanding
Cross-lingual speaker embeddings
Speaker identification and verification
Accent and dialect robustness
Low-resource language modeling
Multi-speaker diarization
Overlapping speech detection and separation
Speaker-aware ASR pipelines
Joint diarization and recognition modeling
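The speaker-verification items above usually reduce to comparing fixed-size speaker embeddings. A minimal sketch, assuming embeddings are plain float vectors; the 0.7 threshold is a placeholder, not a recommended operating point:

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two speaker embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def same_speaker(emb_a, emb_b, threshold=0.7):
    # Verification decision: accept if similarity clears a tuned threshold.
    # The threshold here is illustrative; real systems calibrate it on
    # held-out trials (e.g. to a target false-accept rate).
    return cosine_similarity(emb_a, emb_b) >= threshold
```

The same pairwise scoring is the building block for diarization clustering, where segments are grouped by embedding similarity.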
Full-duplex speech models
Simultaneous listening and speaking
Interruption handling and barge-in detection
Half-duplex conversational models
Turn detection
Latency-aware response generation
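As a point of reference for the turn-detection items above, here is a deliberately naive energy-based endpointing sketch. Production systems use learned VAD and endpointing models; the threshold and frame counts below are made-up illustrative values:

```python
def detect_turn_end(frame_energies, threshold=0.01, min_silence_frames=30):
    # Fire a turn-end event after `min_silence_frames` consecutive
    # low-energy frames (e.g. 30 frames of 10 ms = 300 ms of silence).
    silence = 0
    for i, energy in enumerate(frame_energies):
        silence = silence + 1 if energy < threshold else 0
        if silence >= min_silence_frames:
            return i  # frame index at which the turn is declared over
    return None  # speaker still talking (or stream ended mid-turn)
```

The tension this toy exposes is the real research problem: a short silence window cuts off slow speakers, a long one inflates response latency.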
Novel model architectures and training strategies
Large-scale multilingual datasets and pipelines
Evaluation frameworks for WER (word error rate), DER (diarization error rate), MOS (mean opinion score), latency, and RTF (real-time factor)
Streaming inference systems for real-time speech
Research prototypes converted into production models
Your work will directly power live customer-facing systems.
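Of the evaluation metrics listed above, WER is the workhorse for ASR. A minimal, self-contained sketch of its standard edit-distance computation:

```python
def wer(ref: str, hyp: str) -> float:
    # Word Error Rate: word-level Levenshtein distance (substitutions,
    # insertions, deletions) normalized by reference length.
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / max(len(r), 1)
```

For example, `wer("the cat sat on the mat", "the cat sat mat")` counts two deletions against six reference words, giving 2/6. Production pipelines add text normalization before scoring, which this sketch omits.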
Strong background in speech processing or deep learning
Deep expertise in at least one of the following:
ASR
TTS
Speech-to-speech systems
Strong understanding of modern architectures:
Transformers, Conformers, diffusion or flow-based models
Experience with CTC, Transducer, attention-based decoding
Strong proficiency in PyTorch
Experience training models at scale
Multilingual speech experience (Indic or European languages)
Speaker embeddings and diarization systems
Parameter-efficient fine-tuning methods such as LoRA
Streaming inference optimization
Deployment experience using ONNX, TensorRT, or Triton
Publications, open-source contributions, or serious personal research projects
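On the decoding side, the CTC experience asked for above rests on a simple collapse rule: merge repeated per-frame tokens, then drop blanks. A sketch with illustrative token IDs (blank = 0):

```python
def ctc_greedy_collapse(frame_ids, blank=0):
    # Greedy CTC decoding of per-frame argmax token IDs:
    # collapse consecutive repeats, then remove blank tokens.
    out, prev = [], None
    for t in frame_ids:
        if t != prev and t != blank:
            out.append(t)
        prev = t
    return out

# Frames: _ h h _ e l l _ l o o  ->  h e l l o
# (toy id->char mapping; the blank between the two l's preserves the repeat)
ctc_greedy_collapse([0, 7, 7, 0, 4, 11, 11, 0, 11, 14, 14])
```

The blank symbol is what lets CTC emit genuine repeated characters, which is why the example above keeps both `l` tokens.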
Depth over buzzwords
Clean experiments and reproducibility
Strong benchmarking discipline
Latency, memory, and throughput awareness
Research that translates into shipped systems
We value people who ask:
“How does this behave at scale?”
Not just: “Does this work on the dataset?”
Work on real-world speech systems at scale
Direct ownership from research to production
Close collaboration with founders and infrastructure teams
Fast iteration cycles with minimal bureaucracy
Competitive compensation and meaningful ESOPs
One of the deepest speech research stacks in India
It would be great if you could also share:
Resume
Research papers, GitHub repositories, or technical writing
Examples of models you trained or systems you built
A short note on what aspect of speech research excites you most
Email: [email protected]