[{"content":"vLLM-Omni is a framework for efficient inference of omni-modality models — text, speech, audio, and vision. I\u0026rsquo;m an active contributor, mostly on the TTS serving path.\nWhat I\u0026rsquo;ve shipped OmniVoice voice cloning — end-to-end voice cloning support in the OmniVoice TTS path. Qwen3-TTS streaming — incremental token-by-token output, dynamic TTFA based on Code2Wav load, flexible initial chunking. Code2Wav performance — batched decoding, Triton SnakeBeta kernel, CUDA Graph, and torch.compile (reduce-overhead, dynamic-shape off). High-concurrency throughput — large wins on model throughput and latency at high concurrency for Qwen3-TTS. Voice / speaker management — voice cache manager refactor, speaker_id/voices loading from model config, ref_text handling. Bug fixes — chunk-transfer adapter deque mutation, CodePredictor CUDA Graph pool issue, streaming initial-chunk recomputation. Why it matters Serving omni-modality models at production scale has a very different shape than pure-LLM serving: a TTS request carries an autoregressive code-predictor and a waveform decoder (Code2Wav), each with its own batching and latency budget. 
The work above is focused on making that path fast, streamable, and low-TTFA for real-world usage — which directly maps to the TTS/generative audio work I do in my day job.\nFull commit list: github.com/vllm-project/vllm-omni/commits?author=JuanPZuluaga.\n","permalink":"https://juanpzuluaga.github.io/projects/vllm-omni/","summary":"Ongoing contributions to vLLM-Omni\u0026rsquo;s Qwen3-TTS and OmniVoice paths: streaming output, Code2Wav batched decoding, CUDA Graph + torch.compile, voice cloning, and throughput/latency optimization for high-concurrency TTS serving.","title":"vLLM-Omni"},{"content":"ATCO2 is a four-year effort (2020–2024) funded by the European SESAR Joint Undertaking, aiming to produce a large-scale open corpus of Air Traffic Control communications.\nWhat\u0026rsquo;s in it\n~5,000 hours of audio from live VHF recordings across European airspace\nTranscripts (semi-automatic, manually verified on a subset)\nSpeaker role labels (pilot vs. air traffic controller)\nContextual metadata (callsigns, commands, waypoints — linked to surveillance data)\nFour subsets: 4 hours gold-standard · 1 hour test-full · the full 5k-hour pool · ATCO2-PL (pilot-only)\nWhy it matters\nATC speech is uniquely hard for ASR: channel noise, rapid speech, heavy code-switching between English and local languages, and a strictly domain-specific vocabulary. Before ATCO2, the largest open ATC dataset was a few dozen hours — too small to train modern self-supervised systems. ATCO2 makes it possible to study domain-shift behavior of large pretrained models at a realistic scale.\nDownstream use\nThe corpus has been used as a training set or benchmark by follow-up work on contextual ASR, speaker diarization, callsign recognition, and speech translation.
It\u0026rsquo;s available on ELRA for research use.\n","permalink":"https://juanpzuluaga.github.io/projects/atco2-corpus/","summary":"A multilingual, semi-automatically labeled corpus built to advance ASR and natural language understanding on one of the hardest real-world speech domains. Includes audio, transcripts, speaker role annotations, and a preprocessing pipeline. Used as a benchmark by follow-up work across Europe.","title":"ATCO2 Corpus"},{"content":"Wav2Vec2 and XLS-R models fine-tuned on public ATC datasets (ATCOSIM, LDC-ATCC, UWB-ATCC), released through HuggingFace for anyone to benchmark or build on.\nHeadline results\n20–40% relative WER reduction vs. supervised Conformer baselines on in-domain ATC test sets\nCross-accent generalization via XLS-R — a single model trained on mixed European ATC data\n~6% WER on ATCOSIM with the Wav2Vec2-Large fine-tune\nWhat\u0026rsquo;s released\nModel · Training data · Link\nWav2Vec2-Large ATC · ATCOSIM · HuggingFace ↗\nWav2Vec2-Base ATC · LDC-ATCC · HuggingFace ↗\nXLS-R ATC · All public ATC · HuggingFace ↗\nTry it\nThe Colab notebook loads any of the models and transcribes an audio sample in under a minute — no GPU required for inference.\nContext\nThis work is part of my PhD at Idiap/EPFL and was presented at IEEE SLT 2022. The accompanying paper systematically studies how self-supervised representations transfer under heavy domain shift — something surprisingly under-studied before we published.\n","permalink":"https://juanpzuluaga.github.io/projects/wav2vec2-atc/","summary":"A family of Wav2Vec2 models that achieve 20–40% relative WER reduction on ATC data compared to supervised baselines. Released with training recipes, evaluation scripts, and a Colab notebook for immediate inference.
The benchmark paper at SLT 2022 studies self-supervised pretraining behavior under heavy domain shift.","title":"wav2vec2-atc"},{"content":"The idea\nTraditional speaker diarization (\u0026ldquo;who spoke when\u0026rdquo;) relies on acoustic features — but for Air Traffic Control, the audio signal is poor: VHF noise, short turns (~2s average), and a single mono channel for both parties.\nBERTraffic sidesteps that: take the ASR transcript, fine-tune BERT on a two-head classification task, and output both (a) turn boundaries and (b) the role of each turn (pilot/controller). Surprisingly, text-only beats the strongest audio-only baselines available.\nResults\n27% relative DER reduction vs. audio baseline on ATCO2 test set\nWorks with noisy ASR transcripts, not just gold text — the model is robust to ~15% WER input\nJoint training of the two heads is better than pipelining them\nReleased\nFull training / evaluation code on GitHub\nFine-tuned BERT models on HuggingFace (pilot/controller classifier, turn-change classifier)\n","permalink":"https://juanpzuluaga.github.io/projects/bertraffic/","summary":"Most ATC diarization systems rely on audio signals, which in ATC are low-quality and carry very short turns. BERTraffic reframes the problem as text classification: given a transcript, predict speaker turns and whether each turn is a pilot or controller. Beats audio-only baselines by a 27% relative DER reduction.","title":"BERTraffic"},{"content":"Why\nConformer has become the default ASR encoder, but its self-attention scales quadratically with sequence length — a real bottleneck for streaming and long-form recognition. HyperMixer is a linear-complexity alternative to attention that had been shown to work well on NLP tasks; we asked whether it generalizes to speech.\nWhat we did\nReplaced Conformer\u0026rsquo;s self-attention with a multi-head HyperMixer block, keeping the convolutional module and macaron feedforward layers unchanged.
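As a rough illustration of why this helps (a toy sketch with invented names, not the HyperConformer code): a HyperMixer-style block derives an N × k token-mixing matrix from the tokens themselves and mixes through that k-dimensional bottleneck, so it never materializes the N × N matrix that self-attention needs.

```python
# Toy sketch of HyperMixer-style token mixing; invented stand-in for
# the learned hypernetwork, NOT the HyperConformer implementation.
# Self-attention mixes tokens through an N x N score matrix: O(N^2 * d).
# HyperMixer generates an N x k weight matrix from the tokens and mixes
# through that k-dim bottleneck instead: O(N * k * d), linear in N.
def hypermixer_token_mix(X, k=2):
    N, d = len(X), len(X[0])
    # stand-in hypernetwork: real models map each token to its k mixing
    # weights with a learned MLP; here we just reuse token entries
    W = [[x[j % d] for j in range(k)] for x in X]           # N x k
    # pool tokens into the bottleneck: H = W^T X, shape k x d
    H = [[sum(W[n][j] * X[n][i] for n in range(N)) for i in range(d)]
         for j in range(k)]
    # redistribute to tokens: Y = W H, shape N x d (no N x N matrix)
    return [[sum(W[n][j] * H[j][i] for j in range(k)) for i in range(d)]
            for n in range(N)]

Y = hypermixer_token_mix([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```

The bottleneck width k is fixed, so cost grows linearly with the number of tokens N — the property that makes the swap attractive for long utterances.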
Trained on LibriSpeech and CommonVoice with SpeechBrain.\nResults\nOn par with Conformer on LibriSpeech (same WER, no regression)\nBetter than Conformer in limited-data settings (CommonVoice)\nLinear time complexity — meaningful wins for longer utterances\nStatus\nMerged into SpeechBrain recipes. Published at Interspeech 2023.\n","permalink":"https://juanpzuluaga.github.io/projects/hyperconformer/","summary":"Attention is the expensive part of Conformer-based ASR models. HyperConformer swaps it for a multi-head HyperMixer, which scales linearly in sequence length rather than quadratically. Same WER as Conformer at a meaningful compute cut.","title":"HyperConformer"},{"content":"SpeechBrain is an open-source toolkit built to make conversational AI research easier. 1.0 is the stable milestone — a coherent API across dozens of tasks, production-ready recipes, and comprehensive documentation.\nMy contributions\nATC ASR recipes (Wav2Vec2, XLS-R fine-tuning on air traffic control data)\nHyperConformer encoder (see separate project)\nAccent classification benchmarks (CommonAccent)\nBug fixes, documentation, and review across the speech modules\nWhy it matters\nBefore SpeechBrain, building a speech system with PyTorch meant gluing together bits from ESPnet, Kaldi, Fairseq, and custom code. SpeechBrain unifies that into a single toolkit that covers ASR, TTS, speaker recognition, speech enhancement, and dialogue — all with reproducible recipes.\nUsed by research groups at Idiap, EPFL, Mila, Meta, and dozens of startups.\nCitation\nIf you use SpeechBrain in your work, please cite the JMLR 2024 paper.\n","permalink":"https://juanpzuluaga.github.io/projects/speechbrain/","summary":"SpeechBrain is a PyTorch-based toolkit for speech and language tasks, used by dozens of research groups and startups.
The 1.0 release (JMLR 2024) consolidates years of contributions into a stable API with comprehensive recipes for ASR, TTS, speaker recognition, and dialogue understanding.","title":"SpeechBrain 1.0"},{"content":"The best ways to reach me.\nEmail · juan.zuluaga@eu4m.eu\nProfessional · GitHub ↗ · Google Scholar ↗ · LinkedIn ↗ · ORCID ↗ · ResearchGate ↗ · HuggingFace ↗\nLess professional · Twitter ↗\nLocation · Zürich, Switzerland\nI\u0026rsquo;m especially happy to hear from:\nResearchers working on speech, NLP, or speech-and-audio LLMs\nEngineers interested in production ASR / TTS deployment\nStudents applying to speech labs — happy to share advice\nCollaborators on open-source speech / NLP projects\nI try to reply to every email within a week. If you don\u0026rsquo;t hear back, it got lost — please nudge me.\n","permalink":"https://juanpzuluaga.github.io/contact/","summary":"How to reach Juan Pablo Zuluaga.","title":"Contact"},{"content":"↓ Resume (PDF) \u0026nbsp; · \u0026nbsp; ↓ Long CV (PDF)\nCurrent role\n2025 — now Senior Research Engineer · Agigo AG Zürich, Switzerland\nProduction speech-and-audio LLMs, synthetic conversational data, GPU-efficient multi-client inference.\nExperience\n2024 — 2025 Research Engineer · Telepathy Labs Zürich, Switzerland\nSpeech recognition, understanding, and generation for conversational AI agents.\nSummer 2023 ML Engineer Intern · Apple Cambridge, MA\nDiscriminative training of language models for ASR on tail named-entity data.\nSpring 2023 Applied Scientist Intern · Amazon Web Services Seattle, WA\nJoint speech-to-text translation and transcription research. Work published at EMNLP 2023.\n2020 — 2024 PhD Researcher · Idiap Research Institute \u0026amp; EPFL Martigny, Switzerland\nThesis: \"Low-Resource Speech Recognition and Understanding for Challenging Applications.\" Advised by Dr. Petr Motlicek and Prof. 
Hervé Bourlard.\n2019 — 2020 Research Engineer · Idiap Research Institute Martigny, Switzerland\nATCO2 project (EU Horizon 2020). Automatic speech recognition and contextual understanding for air traffic control.\nEducation\n2024 PhD · EPFL \u0026amp; Idiap Lausanne / Martigny, Switzerland\nComputer Science. Dissertation on Automatic Speech Recognition for Air Traffic Control — domain shift, self-supervised pretraining, and contextual biasing.\n2017 — 2019 MSc · Erasmus Mundus EU4M Oviedo, Spain · Nancy, France · Cluj-Napoca, Romania\nMechatronics \u0026amp; Micro-Mechatronics. Thesis on computer vision for breast cancer diagnosis (SBRA EU project, Universidad de Oviedo).\n2011 — 2016 BSc · Universidad Autónoma del Caribe Barranquilla, Colombia\nMechatronics Engineering.\nAwards\nBest Student Paper nominee — Interspeech 2023 (CommonAccent)\n1st place — International Create Challenge 2020 · HealthTech Award (Groupe Mutuel)\nErasmus Mundus Scholarship — EU Commission (EU4M programme, 2017)\nDAAD Research Scholarship — Germany (2014)\nSkills\nPython · PyTorch · SpeechBrain · Kaldi · HuggingFace · LaTeX · Git · Linux · GPU training \u0026amp; inference · LLMs · ASR · TTS · Self-supervised learning · NLP\nLanguages\nSpanish (native) · English (fluent) · French (intermediate)\nContact\njuan.zuluaga@eu4m.eu · GitHub ↗ · Scholar ↗ · LinkedIn ↗\n","permalink":"https://juanpzuluaga.github.io/cv/","summary":"Curriculum Vitae of Juan Pablo Zuluaga — research, experience, and education.","title":"CV"}]