I build and deploy speech-and-audio LLMs, production ASR, and spoken language understanding systems.
PhD from EPFL & IDIAP. Previously at Apple and AWS.
I'm a Senior Research Scientist at Agigo AG, a Swiss AI company building autonomous AI agents.
My work sits at the intersection of natural language processing and automatic speech recognition, with a strong focus on speech-and-audio LLMs.
I completed my PhD at EPFL and IDIAP in 2024.
My thesis tackled automatic speech recognition for air traffic control — one of the hardest real-world ASR domains.
Along the way I built the ATCO2 corpus, fine-tuned self-supervised models for this domain,
and published work on speaker diarization, speaker role detection, and contextual ASR.
Before Agigo, I interned at Apple (ML for ASR on tail named entities) and
at AWS (speech translation and transcription).
I hold master's and bachelor's degrees in Mechatronics Engineering from Universidad de Oviedo and Universidad Autónoma del Caribe.
I live in Zürich. Originally from Baranoa, Colombia.
currently
High-throughput LLM & speech model serving: production deployment with vLLM-Omni, covering streaming decoders, CUDA Graphs, torch.compile, and multi-client inference scheduling.
TTS controllability & steering: conditioning generative TTS on prosody, speaker identity, and style, including voice cloning and zero-shot cross-lingual synthesis.
Full-stack model development: end to end, from data curation and large-scale synthetic data generation to training optimization targeting 50% MFU and deployment tuning for low-latency serving.
Local LLM deployment & KV-cache offloading: memory-efficient inference using disaggregated caching and offloading techniques (e.g., LMCache) for constrained-GPU and edge scenarios.
Holistic TTS evaluation pipelines: automated evaluation covering intelligibility, speaker similarity, naturalness, prosody, and robustness across languages and domains.
Omni-modal data pipelines: large-scale processing with omni LLMs to generate multi-task RL training signals across speech, text, and audio.
ai-assisted engineering
Agentic development tools — primarily Claude Code — are central to how I build.
I pair them with deep systems knowledge to move fast across the full stack that powers production speech and language AI:
from low-level GPU kernels up to high-concurrency serving. Used well, they compress the loop from idea to shipped
system and let a small team operate with outsized leverage.
CUDA & Triton kernel development
TTS systems & controllability
LLM & omni-modal inference
High-concurrency deployment
Large-scale data curation
Rapid prototyping & evaluation
Unifying Global and Near-Context Biasing in a Single Trie Pass
TSD 2025
I. Thorbecke, E. Villatoro-Tello, J. P. Zuluaga, S. Kumar, S. Burdisso, P. Rangappa, A. Carofilis, S. Madikeri, P. Motlicek, K. Pandia, K. Hacioglu, A. Stolcke.
Single-pass trie unifies global vocabulary biasing with utterance-level context biasing for transducer ASR.
@inproceedings{unifyingglobal2025,
title = {Unifying Global and Near-Context Biasing in a Single Trie Pass},
author = {I. Thorbecke and E. Villatoro-Tello and J. P. Zuluaga and S. Kumar and S. Burdisso and P. Rangappa and A. Carofilis and S. Madikeri and P. Motlicek and K. Pandia and K. Hacioglu and A. Stolcke},
booktitle = {TSD},
year = {2025},
url = {https://doi.org/10.1007/978-3-032-02548-7_15}
}
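As a rough illustration of the idea summarized above, the sketch below stores global and near-context phrases in one trie, so a single walk over a token prefix returns a combined bonus. The class names, the word-level tokens, the bonus values, and the "drop the bonus on mismatch" policy are all assumptions made for illustration; this is not the paper's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class TrieNode:
    children: dict = field(default_factory=dict)
    # Bonus for reaching this node, keyed by source ("global" or "context").
    bonus: dict = field(default_factory=dict)


class BiasingTrie:
    """Hypothetical shared trie for global and near-context biasing phrases."""

    def __init__(self):
        self.root = TrieNode()

    def add(self, tokens, source, bonus_per_token):
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, TrieNode())
            # Keep the largest bonus seen for this source on this node.
            node.bonus[source] = max(node.bonus.get(source, 0.0), bonus_per_token)

    def score(self, tokens):
        """Walk the trie once and sum the bonuses from every source."""
        node, total = self.root, 0.0
        for tok in tokens:
            node = node.children.get(tok)
            if node is None:
                return 0.0  # assumed policy: no bonus once the prefix stops matching
            total += sum(node.bonus.values())
        return total


trie = BiasingTrie()
trie.add(["lufthansa", "four", "two"], "global", 1.0)   # static vocabulary entry
trie.add(["lufthansa", "four", "two"], "context", 0.5)  # also seen in nearby context
trie.add(["swiss", "one"], "global", 1.0)

print(trie.score(["lufthansa", "four"]))  # both sources contribute on each token
print(trie.score(["swiss"]))              # only the global source contributes
print(trie.score(["swiss", "two"]))       # prefix leaves the trie
```

In a real decoder the bonus would be added to hypothesis scores during beam search; here it is only computed for fixed prefixes to keep the sketch small.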
Speech Data Selection for Efficient ASR Fine-Tuning
ICASSP 2025
P. Rangappa, S. Madikeri, J. P. Zuluaga, J. Villatoro-Tello, P. Motlicek.
A domain classifier plus pseudo-label filtering cuts ASR fine-tuning compute by ~40% at matched WER.
@inproceedings{speechdata2025,
title = {Speech Data Selection for Efficient ASR Fine-Tuning},
author = {P. Rangappa and S. Madikeri and J. P. Zuluaga and J. Villatoro-Tello and P. Motlicek},
booktitle = {ICASSP},
year = {2025},
url = {https://ieeexplore.ieee.org/abstract/document/10888138/}
}
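A minimal sketch of what a two-stage selection like the one summarized above could look like: keep an utterance only if a domain classifier considers it in-domain and its pseudo-label is confident enough. The field names and both thresholds are hypothetical, chosen only to make the example concrete; the paper's actual criteria may differ.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    utt_id: str
    domain_prob: float        # assumed: classifier probability that the audio is in-domain
    pseudo_label_conf: float  # assumed: ASR confidence of the automatic transcript


def select_for_finetuning(utts, min_domain_prob=0.5, min_conf=0.8):
    """Keep utterances that pass both the domain and the confidence filter."""
    return [
        u for u in utts
        if u.domain_prob >= min_domain_prob and u.pseudo_label_conf >= min_conf
    ]


pool = [
    Utterance("a", domain_prob=0.9, pseudo_label_conf=0.95),  # kept
    Utterance("b", domain_prob=0.9, pseudo_label_conf=0.40),  # noisy pseudo-label
    Utterance("c", domain_prob=0.2, pseudo_label_conf=0.99),  # out of domain
]
selected = select_for_finetuning(pool)
print([u.utt_id for u in selected])
```

Fine-tuning on the smaller selected set is where the compute saving would come from.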
XLSR-Transducer: Streaming ASR for Self-Supervised Pretrained Models
ICASSP 2025
S. Madikeri, J. P. Zuluaga, P. Rangappa, J. Villatoro-Tello, P. Motlicek.
Streaming ASR atop a frozen self-supervised backbone, without sacrificing non-streaming accuracy.
@inproceedings{xlsrtransducer2025,
title = {XLSR-Transducer: Streaming ASR for Self-Supervised Pretrained Models},
author = {S. Madikeri and J. P. Zuluaga and P. Rangappa and J. Villatoro-Tello and P. Motlicek},
booktitle = {ICASSP},
year = {2025},
url = {https://arxiv.org/abs/2407.04439}
}
Open-source Conversational AI with SpeechBrain 1.0
JMLR 2024
M. Ravanelli, T. Parcollet, A. Moumen, S. de Langen, C. Subakan, P. Plantinga, Y. Liao, S. Cornell, D. Roman, S. Moradi, D. Chander, D. Petermann, Y. Wang, J. P. Zuluaga, et al.
Co-authored the 1.0 release of SpeechBrain — a PyTorch toolkit for conversational AI.
@article{opensource2024,
title = {Open-source Conversational AI with SpeechBrain 1.0},
author = {M. Ravanelli and T. Parcollet and A. Moumen and S. de Langen and C. Subakan and P. Plantinga and Y. Liao and S. Cornell and D. Roman and S. Moradi and D. Chander and D. Petermann and Y. Wang and J. P. Zuluaga and others},
journal = {JMLR},
year = {2024},
url = {https://jmlr.org/papers/v25/24-0991.html}
}
End-to-end single-channel speaker-turn aware conversational speech translation
EMNLP 2023
J. P. Zuluaga, Z. Huang, X. Niu, R. Paturi, S. Srinivasan, P. Mathur, B. Thompson, M. Federico.
@inproceedings{endto2023,
title = {End-to-end single-channel speaker-turn aware conversational speech translation},
author = {J. P. Zuluaga and Z. Huang and X. Niu and R. Paturi and S. Srinivasan and P. Mathur and B. Thompson and M. Federico},
booktitle = {EMNLP},
year = {2023},
url = {https://aclanthology.org/2023.emnlp-main.493/}
}
HyperConformer: Multi-Head HyperMixer for Efficient Speech Recognition
Interspeech 2023
F. Mai, J. P. Zuluaga, T. Parcollet, P. Motlicek.
Replaces Conformer attention with HyperMixer, matching accuracy at a fraction of the compute.
@inproceedings{hyperconformermulti2023,
title = {HyperConformer: Multi-Head HyperMixer for Efficient Speech Recognition},
author = {F. Mai and J. P. Zuluaga and T. Parcollet and P. Motlicek},
booktitle = {Interspeech},
year = {2023},
url = {https://arxiv.org/abs/2305.18281}
}
CommonAccent: Exploring Large Acoustic Pretrained Models for Accent Classification
Interspeech 2023 ★ Best Student Paper nominee
J. P. Zuluaga, S. Sarfjoo, A. Prasad, I. Nigmatulina, P. Motlicek, K. Ondrej, O. Ohneiser, H. Helmke.
Accent classification benchmark on Common Voice using large self-supervised models.
@inproceedings{commonaccentexploring2023,
title = {CommonAccent: Exploring Large Acoustic Pretrained Models for Accent Classification},
author = {J. P. Zuluaga and S. Sarfjoo and A. Prasad and I. Nigmatulina and P. Motlicek and K. Ondrej and O. Ohneiser and H. Helmke},
booktitle = {Interspeech},
year = {2023},
url = {https://arxiv.org/abs/2305.18283}
}
How Does Pre-Trained Wav2Vec 2.0 Perform on Domain Shifted ASR? An Extensive Benchmark on Air Traffic Control Communications
IEEE SLT 2022
J. P. Zuluaga, A. Prasad, I. Nigmatulina, S. Sarfjoo, P. Motlicek, M. Kleinert, H. Helmke, O. Ohneiser, Q. Zhan.
Systematic study of self-supervised pretraining under domain shift — 20–40% relative WER cut on Air Traffic Control.
@inproceedings{howdoes2022,
title = {How Does Pre-Trained Wav2Vec 2.0 Perform on Domain Shifted ASR? An Extensive Benchmark on Air Traffic Control Communications},
author = {J. P. Zuluaga and A. Prasad and I. Nigmatulina and S. Sarfjoo and P. Motlicek and M. Kleinert and H. Helmke and O. Ohneiser and Q. Zhan},
booktitle = {IEEE SLT},
year = {2022},
url = {https://arxiv.org/abs/2203.16822}
}
Active contributor to vLLM-Omni — the production inference engine for omni-modality models.
20+ merged PRs across Qwen3-TTS and OmniVoice: streaming output, CUDA Graph + torch.compile, batched Code2Wav decoding, global speaker cache manager, and large throughput & latency wins under high concurrency.
5,000 hours of Air Traffic Control communications — the largest open ATC speech dataset.
A multilingual, semi-automatically labeled corpus built to advance ASR and natural language understanding on one of the hardest real-world speech domains. Includes audio, transcripts, speaker role annotations, and a preprocessing pipeline. Used as a benchmark by follow-up work across Europe.
Self-supervised ASR models fine-tuned for Air Traffic Control, available on HuggingFace.
A family of Wav2Vec2 models that achieve 20–40% relative WER reduction on ATC data compared to supervised baselines. Released with training recipes, evaluation scripts, and a Colab notebook for immediate inference. The benchmark paper at SLT 2022 studies self-supervised pretraining behavior under heavy domain shift.
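For readers unfamiliar with the metric, a relative WER reduction compares the drop in word error rate to the baseline's own WER. The helper below is a small sketch of that arithmetic; the function name and the example numbers are hypothetical and are not results from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction of the baseline WER that was removed (0.25 means 25% relative)."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical numbers: a baseline at 20.0% WER and a fine-tuned model at 14.0% WER.
print(f"{relative_wer_reduction(20.0, 14.0):.0%}")  # prints 30%
```

Note that this is 6 points absolute but 30% relative, which is why the two kinds of figure should not be compared directly.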
Joint speaker-role and speaker-change detection from ATC transcripts — no audio required.
Most ATC diarization systems rely on audio signals, which are low-quality and short. BERTraffic reframes the problem as text classification: given a transcript, predict speaker turns and whether each turn is a pilot or controller. Beats audio-only baselines by 27% DER.
A Conformer variant where attention is replaced with HyperMixer — matched accuracy, less compute.
Attention is the expensive part of Conformer-based ASR models. HyperConformer swaps it for a multi-head HyperMixer, which scales linearly in sequence length rather than quadratically. It reaches the same WER as Conformer with meaningfully less compute.
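The scaling difference can be made concrete with a back-of-the-envelope count of multiply-accumulates for the token-mixing step alone. This is a simplified sketch: it ignores constants, projections, and the cost of generating HyperMixer's mixing weights, and the function names and layer sizes are assumptions for illustration.

```python
def attention_mixing_macs(n_frames: int, d_model: int) -> int:
    # The score matrix (Q times K transposed) and the weighted sum over V
    # each cost about n * n * d multiply-accumulates.
    return 2 * n_frames * n_frames * d_model


def hypermixer_mixing_macs(n_frames: int, d_model: int, d_hidden: int) -> int:
    # Two products with (n x d_hidden) mixing weights, each about n * d_hidden * d.
    return 2 * n_frames * d_hidden * d_model


# Hypothetical sizes: 256-dim model, 512-dim mixing hidden layer.
for n in (500, 1000, 2000):
    print(n, attention_mixing_macs(n, 256), hypermixer_mixing_macs(n, 256, 512))
```

Doubling the number of frames quadruples the attention count but only doubles the HyperMixer count, which is the linear-versus-quadratic behavior described above.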
Co-authored the 1.0 release of the open-source conversational AI toolkit.
SpeechBrain is a PyTorch-based toolkit for speech and language tasks, used by dozens of research groups and startups. The 1.0 release (JMLR 2024) consolidates years of contributions into a stable API with comprehensive recipes for ASR, TTS, speaker recognition, and dialogue understanding.