vLLM-Omni is a framework for efficient inference of omni-modality models — text, speech, audio, and vision. I’m an active contributor, mostly on the TTS serving path.

What I’ve shipped

  • OmniVoice voice cloning — end-to-end voice cloning support in the OmniVoice TTS path.
  • Qwen3-TTS streaming — incremental token-by-token output, dynamic time-to-first-audio (TTFA) based on Code2Wav load, and flexible initial chunking.
  • Code2Wav performance — batched decoding, a Triton SnakeBeta kernel, CUDA Graph capture, and torch.compile in reduce-overhead mode with dynamic shapes disabled.
  • High-concurrency throughput — substantial throughput and latency improvements for Qwen3-TTS under high concurrency.
  • Voice / speaker management — voice cache manager refactor, speaker_id/voices loading from model config, ref_text handling.
  • Bug fixes — chunk-transfer adapter deque mutation, CodePredictor CUDA Graph pool issue, streaming initial-chunk recomputation.
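The dynamic-TTFA idea in the streaming bullet can be sketched as a load-aware first-chunk policy: when the Code2Wav decoder queue is deep, emit a smaller first chunk so audio starts sooner. The function name, the halving heuristic, and the thresholds below are all illustrative, not the actual vLLM-Omni code:

```python
def initial_chunk_size(base: int, pending_decodes: int, max_chunk: int = 8) -> int:
    """Illustrative policy: shrink the first audio chunk when the waveform
    decoder is busy, trading chunk size for lower time-to-first-audio."""
    # Halve the chunk (up to 3 times) for every ~4 pending decode jobs.
    size = base >> min(pending_decodes // 4, 3)
    return max(1, min(size, max_chunk))

initial_chunk_size(8, 0)  # idle decoder: full-size first chunk
initial_chunk_size(8, 8)  # busy decoder: smaller first chunk, faster TTFA
```

The real policy would likely also consider per-request latency budgets; the point is only that first-chunk size is a runtime decision, not a static config.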
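The deque-mutation fix in the bug-fix bullet is an instance of a classic CPython pitfall: mutating a `collections.deque` while iterating over it raises `RuntimeError: deque mutated during iteration`. A minimal sketch of the safe pattern — the chunk structure and function name here are hypothetical, not the adapter's actual code:

```python
from collections import deque

def drain_ready_chunks(chunks: deque) -> list:
    """Drain leading ready chunks without mutating the deque mid-iteration.

    Popping inside a `for chunk in chunks:` loop would raise RuntimeError,
    so we pop explicitly from the left instead of iterating.
    """
    drained = []
    while chunks and chunks[0].get("ready"):
        drained.append(chunks.popleft())
    return drained

q = deque([
    {"ready": True, "id": 0},
    {"ready": True, "id": 1},
    {"ready": False, "id": 2},
])
out = drain_ready_chunks(q)  # two ready chunks drained; the unready one stays queued
```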

Why it matters

Serving omni-modality models at production scale has a very different shape from pure-LLM serving: a single TTS request drives both an autoregressive code predictor and a waveform decoder (Code2Wav), each with its own batching strategy and latency budget. The work above focuses on making that path fast, streamable, and low-TTFA for real-world usage, which maps directly onto the TTS/generative audio work I do in my day job.
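The two-stage shape described above can be sketched as a toy scheduler: stage 1 advances every active request by one speech code per step, while stage 2 decodes whole chunks, batched across requests on its own cadence. Everything here is a placeholder (fake codes, fake samples, fixed chunk size); nothing mirrors the actual vLLM-Omni scheduler:

```python
from dataclasses import dataclass, field

@dataclass
class TTSRequest:
    text: str
    codes: list = field(default_factory=list)
    audio: list = field(default_factory=list)

def predict_step(reqs):
    """Stage 1: autoregressive code predictor (placeholder: one fake
    speech code per request per step)."""
    for r in reqs:
        r.codes.append(len(r.codes))

def code2wav_batch(chunks):
    """Stage 2: waveform decoder, batched across requests independently
    of stage 1 (placeholder: one fake sample per code)."""
    return [[c / 10 for c in chunk] for chunk in chunks]

def serve(texts, steps=4, chunk=2):
    reqs = [TTSRequest(t) for t in texts]
    decoded = {id(r): 0 for r in reqs}   # codes already sent to stage 2
    for _ in range(steps):
        predict_step(reqs)               # stage-1 batch: every request advances
        ready = [r for r in reqs if len(r.codes) - decoded[id(r)] >= chunk]
        if ready:                        # stage-2 batch: decode full chunks only
            chunks = [r.codes[decoded[id(r)]:decoded[id(r)] + chunk] for r in ready]
            for r, wav in zip(ready, code2wav_batch(chunks)):
                r.audio.extend(wav)      # in real streaming, yield wav here
                decoded[id(r)] += chunk
    return reqs
```

The design point this illustrates is that the two stages batch independently: stage 1 wants large, uniform autoregressive batches, while stage 2's batch composition is dictated by which requests have accumulated a full chunk.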

Full commit list: github.com/vllm-project/vllm-omni/commits?author=JuanPZuluaga.