Juan Pablo Zuluaga

Senior Research Scientist · Agigo AG · Zürich

🇨🇴 🇨🇭

I build and deploy speech-and-audio LLMs, production ASR, and spoken language understanding systems. PhD from EPFL & Idiap. Previously at Apple and AWS.

40+ publications · Interspeech · ICASSP · EMNLP · SLT · JMLR · TSD

about

I'm a Senior Research Scientist at Agigo AG, a Swiss AI company building autonomous AI agents. My work sits at the intersection of natural language processing and automatic speech recognition, with a strong focus on speech-and-audio LLMs.

I completed my PhD at EPFL and Idiap in 2024. My thesis tackled automatic speech recognition for air traffic control, one of the hardest real-world ASR domains. Along the way I built the ATCO2 corpus, fine-tuned self-supervised models for this domain, and published work on speaker diarization, speaker role detection, and contextual ASR.

Before Agigo, I interned at Apple (ML for ASR on tail named entities) and at AWS (speech translation and transcription). I hold a master's degree from Universidad de Oviedo and a bachelor's degree from Universidad Autónoma del Caribe, both in Mechatronics Engineering.

I live in Zürich. Originally from Baranoa, Colombia.

currently

High-throughput LLM & speech model serving: production deployment with vLLM-Omni, covering streaming decoders, CUDA Graphs, torch.compile, and multi-client inference scheduling.
TTS controllability & steering: conditioning generative TTS on prosody, speaker identity, and style, including voice cloning and zero-shot cross-lingual synthesis.
Full-stack model development: end-to-end work spanning data curation, large-scale synthetic data generation, training optimization targeting 50% model FLOPs utilization (MFU), and deployment tuning for low-latency serving.
Local LLM deployment & KV-cache offloading: memory-efficient inference using disaggregated caching and offloading techniques (e.g., LMCache) for constrained-GPU and edge scenarios.
Holistic TTS evaluation pipelines: state-of-the-art automated evaluation covering intelligibility, speaker similarity, naturalness, prosody, and robustness across languages and domains.
Omni-modal data pipelines: large-scale processing with omni LLMs to generate multi-task RL training signals across speech, text, and audio.

ai-assisted engineering

Agentic development tools — primarily Claude Code — are central to how I build. I pair them with deep systems knowledge to move fast across the full stack that powers production speech and language AI: from low-level GPU kernels up to high-concurrency serving. Used well, they compress the loop from idea to shipped system and let a small team operate with outsized leverage.

CUDA & Triton kernel development · TTS systems & controllability · LLM & omni-modal inference · High-concurrency deployment · Large-scale data curation · Rapid prototyping & evaluation

experience

2025 – present · Agigo AG · Zürich, Switzerland

Senior Research Scientist

Production speech-and-audio LLMs, synthetic conversational data, and GPU-efficient multi-client inference for real-world ASR and TTS deployments.

2024 – 2025 · Telepathy Labs · Zürich, Switzerland

Research Engineer

Speech recognition, understanding, and generation for conversational AI agents.

Summer 2023 · Apple · Cambridge, MA

ML Engineer Intern

Discriminative training of language models for ASR on tail named-entity data.

Spring 2023 · Amazon Web Services · Seattle, WA

Applied Scientist Intern

Joint speech-to-text translation and transcription research. Work published at EMNLP 2023.

2020 – 2024 · Idiap Research Institute & EPFL · Martigny, Switzerland

PhD Researcher

Thesis: Low-Resource Speech Recognition and Understanding for Challenging Applications. Advised by Dr. Petr Motlicek and Prof. Hervé Bourlard.

2019 – 2020 · Idiap Research Institute · Martigny, Switzerland

Research Engineer

ATCO2 project (EU Horizon 2020). Automatic speech recognition and contextual understanding for air traffic control communications.

2017 – 2019 · Universidad de Oviedo · Oviedo · Nancy · Cluj-Napoca

MSc · Erasmus Mundus EU4M

Mechatronics & Micro-Mechatronics. Fully funded Erasmus Mundus scholarship (European Commission). Thesis on computer vision for breast cancer diagnosis.

2011 – 2016 · Universidad Autónoma del Caribe · Barranquilla, Colombia

BSc · Mechatronics Engineering

DAAD Research Scholarship (Germany, 2014).

→ full cv with education & awards

featured publications

Unifying Global and Near-Context Biasing in a Single Trie Pass

TSD 2025

I. Thorbecke, E. Villatoro-Tello, J. P. Zuluaga, S. Kumar, S. Burdisso, P. Rangappa, A. Carofilis, S. Madikeri, P. Motlicek, K. Pandia, K. Hacioglu, A. Stolcke.

Single-pass trie unifies global vocabulary biasing with utterance-level context biasing for transducer ASR.

Speech Data Selection for Efficient ASR Fine-Tuning

ICASSP 2025

P. Rangappa, S. Madikeri, J. P. Zuluaga, E. Villatoro-Tello, P. Motlicek.

A domain classifier plus pseudo-label filtering cuts ASR fine-tuning compute by ~40% at matched WER.

XLSR-Transducer: Streaming ASR for Self-Supervised Pretrained Models

ICASSP 2025

S. Madikeri, J. P. Zuluaga, P. Rangappa, E. Villatoro-Tello, P. Motlicek.

Streaming ASR atop a frozen self-supervised backbone, without sacrificing non-streaming accuracy.

Open-source Conversational AI with SpeechBrain 1.0

JMLR 2024

M. Ravanelli, T. Parcollet, A. Moumen, S. de Langen, C. Subakan, P. Plantinga, Y. Liao, S. Cornell, D. Roman, S. Moradi, D. Chander, D. Petermann, Y. Wang, J. P. Zuluaga, et al.

Co-authored the 1.0 release of SpeechBrain — a PyTorch toolkit for conversational AI.

End-to-end single-channel speaker-turn aware conversational speech translation

EMNLP 2023

J. P. Zuluaga, Z. Huang, X. Niu, R. Paturi, S. Srinivasan, P. Mathur, B. Thompson, M. Federico.

First end-to-end speech translation system that handles speaker turns and overlapped speech on a single channel.

HyperConformer: Multi-Head HyperMixer for Efficient Speech Recognition

Interspeech 2023

F. Mai, J. P. Zuluaga, T. Parcollet, P. Motlicek.

Replaces Conformer attention with HyperMixer, matching accuracy at a fraction of the compute.

CommonAccent: Exploring Large Acoustic Pretrained Models for Accent Classification

Interspeech 2023 ★ Best Student Paper nominee

J. P. Zuluaga, S. Sarfjoo, A. Prasad, I. Nigmatulina, P. Motlicek, K. Ondrej, O. Ohneiser, H. Helmke.

Accent classification benchmark on Common Voice using large self-supervised models.

How Does Pre-Trained Wav2Vec 2.0 Perform on Domain Shifted ASR? An Extensive Benchmark on Air Traffic Control Communications

IEEE SLT 2022

J. P. Zuluaga, A. Prasad, I. Nigmatulina, S. Sarfjoo, P. Motlicek, M. Kleinert, H. Helmke, O. Ohneiser, Q. Zhan.

Systematic study of self-supervised pretraining under domain shift — 20–40% relative WER cut on Air Traffic Control.

→ all publications

featured projects

vLLM-Omni

Active contributor to vLLM-Omni — the production inference engine for omni-modality models.

20+ merged PRs across Qwen3-TTS and OmniVoice: streaming output, CUDA Graph + torch.compile, batched Code2Wav decoding, global speaker cache manager, and large throughput & latency wins under high concurrency.

Inference · TTS · CUDA · Open-source · vLLM

ATCO2 Corpus

5,000 hours of Air Traffic Control communications — the largest open ATC speech dataset.

A multilingual, semi-automatically labeled corpus built to advance ASR and natural language understanding on one of the hardest real-world speech domains. Includes audio, transcripts, speaker role annotations, and a preprocessing pipeline. Used as a benchmark by follow-up work across Europe.

Speech · ASR · Dataset · NLU

wav2vec2-atc

Self-supervised ASR models fine-tuned for Air Traffic Control, available on HuggingFace.

A family of Wav2Vec2 models that achieve 20–40% relative WER reduction on ATC data compared to supervised baselines. Released with training recipes, evaluation scripts, and a Colab notebook for immediate inference. The benchmark paper at SLT 2022 studies self-supervised pretraining behavior under heavy domain shift.

ASR · Self-supervised · Wav2Vec2 · HuggingFace

BERTraffic

Joint speaker-role and speaker-change detection from ATC transcripts — no audio required.

Most ATC diarization systems rely on audio signals, which are low-quality and short. BERTraffic reframes the problem as text classification: given a transcript, predict speaker turns and whether each turn is a pilot or controller. Beats audio-only baselines by 27% in diarization error rate (DER).

NLP · BERT · Diarization · ATC

HyperConformer

A Conformer variant where attention is replaced with HyperMixer — matched accuracy, less compute.

Attention is the expensive part of Conformer-based ASR models. HyperConformer swaps it for a multi-head HyperMixer, which scales linearly in sequence length rather than quadratically. Same WER as Conformer at a meaningful compute cut.

ASR · Architecture · Efficient ML

SpeechBrain 1.0

Co-authored the 1.0 release of the open-source conversational AI toolkit.

SpeechBrain is a PyTorch-based toolkit for speech and language tasks, used by dozens of research groups and startups. The 1.0 release (JMLR 2024) consolidates years of contributions into a stable API with comprehensive recipes for ASR, TTS, speaker recognition, and dialogue understanding.

Open Source · PyTorch · Speech · NLP

→ all projects