vLLM-Omni is a framework for efficient inference of omni-modality models — text, speech, audio, and vision. I’m an active contributor, mostly on the TTS serving path.
What I’ve shipped
- OmniVoice voice cloning — end-to-end voice cloning support in the OmniVoice TTS path.
- Qwen3-TTS streaming — incremental token-by-token output, dynamic time-to-first-audio (TTFA) tuning based on Code2Wav load, flexible initial chunking.
- Code2Wav performance — batched decoding, a Triton SnakeBeta kernel, CUDA Graph capture, and `torch.compile` in reduce-overhead mode with dynamic shapes off.
- High-concurrency throughput — large wins in model throughput and latency for Qwen3-TTS at high concurrency.
- Voice / speaker management — voice cache manager refactor, `speaker_id`/`voices` loading from model config, `ref_text` handling.
- Bug fixes — chunk-transfer adapter deque mutation, CodePredictor CUDA Graph pool issue, streaming initial-chunk recomputation.
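The streaming work above hinges on picking an initial audio chunk size that trades first-audio latency against decoder load. Here is a minimal sketch of that idea; the function names, thresholds, and load metric are hypothetical illustrations, not vLLM-Omni's actual implementation:

```python
def pick_initial_chunk(pending_decodes: int,
                       min_chunk: int = 4,
                       max_chunk: int = 32) -> int:
    """Choose how many codec tokens to accumulate before the first
    Code2Wav decode. Under light load, decode early for low TTFA;
    under heavy load, batch more tokens per decode call.
    All names and thresholds here are illustrative only."""
    if pending_decodes <= 2:        # idle decoder: decode ASAP
        return min_chunk
    if pending_decodes <= 8:        # moderate load: medium chunks
        return (min_chunk + max_chunk) // 2
    return max_chunk                # saturated: maximize batching

def next_chunk(prev_chunk: int, max_chunk: int = 32) -> int:
    """Grow later chunks geometrically so steady-state throughput
    dominates once the first audio has been emitted."""
    return min(prev_chunk * 2, max_chunk)
```

The point of the sketch is the shape of the policy: small first chunk when the decoder is idle, larger chunks as load grows, then geometric growth for subsequent chunks.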
Why it matters
Serving omni-modality models at production scale has a very different shape from pure-LLM serving: a TTS request drives both an autoregressive code predictor and a waveform decoder (Code2Wav), each with its own batching and latency budget. The work above focuses on making that path fast, streamable, and low-TTFA for real-world usage, which directly maps to the TTS/generative-audio work I do in my day job.
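The code-predictor-to-Code2Wav split described above can be pictured as two chained generators: codec tokens stream out of the predictor and are decoded to audio in chunks, so playback can begin before prediction finishes. This is an illustrative toy with made-up stand-in functions, not the framework's actual API:

```python
from typing import Iterator, List

def code_predictor(text: str) -> Iterator[int]:
    """Stand-in for the autoregressive code predictor: emits one
    fake codec token per input character (purely illustrative)."""
    for ch in text:
        yield ord(ch) % 256

def code2wav(codes: Iterator[int], chunk: int = 4) -> Iterator[List[float]]:
    """Stand-in for the waveform decoder: consumes codec tokens in
    fixed-size chunks and emits a 'waveform' chunk for each, so
    audio can start streaming before all tokens are predicted."""
    buf: List[int] = []
    for c in codes:
        buf.append(c)
        if len(buf) == chunk:
            yield [x / 256.0 for x in buf]  # fake decode step
            buf = []
    if buf:  # flush the final partial chunk
        yield [x / 256.0 for x in buf]

def stream_tts(text: str) -> Iterator[List[float]]:
    """Chain the two stages; each stage keeps its own buffer,
    mirroring the separate batching/latency budgets in the text."""
    return code2wav(code_predictor(text))
```

Because each stage is a generator with its own buffering, each can be batched and scheduled independently, which is the property the serving work above exploits.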
Full commit list: github.com/vllm-project/vllm-omni/commits?author=JuanPZuluaga.