vLLM-Omni is a framework for efficient inference of omni-modality models — text, speech, audio, and vision. I’m an active contributor, mostly on the TTS serving path.
What I’ve shipped
- OmniVoice voice cloning — end-to-end voice cloning support in the OmniVoice TTS path.
- Qwen3-TTS streaming — incremental token-by-token output, dynamic time-to-first-audio (TTFA) tuning based on Code2Wav load, flexible initial chunking.
- Code2Wav performance — batched decoding, a Triton SnakeBeta kernel, CUDA Graph capture, and `torch.compile` in reduce-overhead mode with dynamic shapes off.
- High-concurrency throughput — large wins in model throughput and latency for Qwen3-TTS at high concurrency.
- Voice / speaker management — voice cache manager refactor, `speaker_id`/`voices` loading from model config, `ref_text` handling.
- Bug fixes — chunk-transfer adapter deque mutation, CodePredictor CUDA Graph pool issue, streaming initial-chunk recomputation.
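The streaming work above hinges on picking an initial audio chunk size that trades first-audio latency against decoder load. Here is a minimal sketch of that idea; the function names, thresholds, and load metric are hypothetical illustrations, not vLLM-Omni's actual implementation:

```python
def pick_initial_chunk(pending_decodes: int,
                       min_chunk: int = 4,
                       max_chunk: int = 32) -> int:
    """Choose how many codec tokens to accumulate before the first
    Code2Wav decode. Under light load, decode early for low TTFA;
    under heavy load, batch more tokens per decode call.
    All names and thresholds here are illustrative only."""
    if pending_decodes <= 2:        # idle decoder: decode ASAP
        return min_chunk
    if pending_decodes <= 8:        # moderate load: medium chunks
        return (min_chunk + max_chunk) // 2
    return max_chunk                # saturated: maximize batching

def next_chunk(prev_chunk: int, max_chunk: int = 32) -> int:
    """Grow later chunks geometrically so steady-state throughput
    dominates once the first audio has been emitted."""
    return min(prev_chunk * 2, max_chunk)
```

The point of the sketch is the shape of the policy: small first chunk when the decoder is idle, larger chunks as load grows, then geometric growth for subsequent chunks.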
Why it matters
Serving omni-modality models at production scale has a very different shape from pure-LLM serving: a TTS request drives both an autoregressive code predictor and a waveform decoder (Code2Wav), each with its own batching and latency budget. The work above focuses on making that path fast, streamable, and low-TTFA for real-world usage, which directly maps to the TTS/generative-audio work I do in my day job.
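The code-predictor-to-Code2Wav split described above can be pictured as two chained generators: codec tokens stream out of the predictor and are decoded to audio in chunks, so playback can begin before prediction finishes. This is an illustrative toy with made-up stand-in functions, not the framework's actual API:

```python
from typing import Iterator, List

def code_predictor(text: str) -> Iterator[int]:
    """Stand-in for the autoregressive code predictor: emits one
    fake codec token per input character (purely illustrative)."""
    for ch in text:
        yield ord(ch) % 256

def code2wav(codes: Iterator[int], chunk: int = 4) -> Iterator[List[float]]:
    """Stand-in for the waveform decoder: consumes codec tokens in
    fixed-size chunks and emits a 'waveform' chunk for each, so
    audio can start streaming before all tokens are predicted."""
    buf: List[int] = []
    for c in codes:
        buf.append(c)
        if len(buf) == chunk:
            yield [x / 256.0 for x in buf]  # fake decode step
            buf = []
    if buf:  # flush the final partial chunk
        yield [x / 256.0 for x in buf]

def stream_tts(text: str) -> Iterator[List[float]]:
    """Chain the two stages; each stage keeps its own buffer,
    mirroring the separate batching/latency budgets in the text."""
    return code2wav(code_predictor(text))
```

Because each stage is a generator with its own buffering, each can be batched and scheduled independently, which is the property the serving work above exploits.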
Full commit list: github.com/vllm-project/vllm-omni/commits?author=JuanPZuluaga.