Juan Pablo Zuluaga

Senior Research Engineer · Agigo AG · Zürich

I build and deploy speech-and-audio LLMs, production ASR, and spoken language understanding systems. PhD from EPFL & IDIAP. Previously at Apple and AWS.

40+ publications · Interspeech · ICASSP · EMNLP · IEEE SLT · JMLR · TSD

about

I'm a Senior Research Engineer at Agigo AG, a Swiss AI company building autonomous AI agents. My work sits at the intersection of natural language processing and automatic speech recognition, with a strong focus on speech-and-audio LLMs.

I completed my PhD at EPFL and IDIAP in 2024. My thesis tackled automatic speech recognition for air traffic control — one of the hardest real-world ASR domains. Along the way I built the ATCO2 corpus, fine-tuned self-supervised models for this domain, and published work on speaker diarization, speaker role detection, and contextual ASR.

Before Agigo, I interned at Apple (ML for ASR on tail named entities) and at AWS (speech translation and transcription). I hold master's and bachelor's degrees in Mechatronics Engineering from Universidad de Oviedo and Universidad Autónoma del Caribe.

I live in Zürich. Originally from Baranoa, Colombia.

currently

Production speech LLMs — scaling acoustic and language models for multi-client inference.
Synthetic conversational data — generating high-quality dialogues for model training pipelines.
GPU efficiency — maximizing throughput for inference workloads across clients.
Generative TTS with universal phonetizers — improving zero-shot, multilingual text-to-speech with shared phonetic representations.
Omni-LLM data pipelines — large-scale data processing with omni LLMs to fuel multi-task RL training.

featured publications

Unifying Global and Near-Context Biasing in a Single Trie Pass

TSD 2025

I. Thorbecke, E. Villatoro-Tello, J. P. Zuluaga, S. Kumar, S. Burdisso, P. Rangappa, A. Carofilis, S. Madikeri, P. Motlicek, K. Pandia, K. Hacioglu, A. Stolcke.

Single-pass trie unifies global vocabulary biasing with utterance-level context biasing for transducer ASR.
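The core idea can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: global vocabulary entries and utterance-level context phrases share one trie, and a hypothesis is scored by walking the trie and collecting a bonus whenever a full bias phrase completes. The phrases and bonus value are made up for the example.

```python
# Toy trie-based contextual biasing: global entries ("runway") and
# utterance-level phrases (a callsign) live in the same trie, scored
# in a single left-to-right pass per start position.

class BiasTrie:
    END = "<end>"

    def __init__(self):
        self.root = {}

    def add(self, phrase, bonus=2.0):
        node = self.root
        for tok in phrase.split():
            node = node.setdefault(tok, {})
        node[self.END] = bonus          # mark a complete bias phrase

    def score(self, tokens):
        """Total biasing bonus collected over a token sequence."""
        total = 0.0
        for i in range(len(tokens)):    # restart matching at each token
            node = self.root
            for tok in tokens[i:]:
                if tok not in node:
                    break
                node = node[tok]
                total += node.get(self.END, 0.0)
        return total

trie = BiasTrie()
trie.add("lufthansa three two one")     # utterance-level context phrase
trie.add("runway")                      # global vocabulary entry

hyp = "lufthansa three two one cleared runway two eight".split()
print(trie.score(hyp))  # → 4.0 (callsign phrase + "runway")
```

In a real transducer decoder this score would be added per-step to hypothesis log-probabilities during beam search rather than computed after the fact.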

Speech Data Selection for Efficient ASR Fine-Tuning

ICASSP 2025

P. Rangappa, S. Madikeri, J. P. Zuluaga, E. Villatoro-Tello, P. Motlicek.

A domain classifier plus pseudo-label filtering cuts ASR fine-tuning compute by ~40% at matched WER.
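The selection logic reduces to a two-gate filter. A minimal sketch, assuming made-up scores and thresholds (the paper's actual classifier, confidence measure, and cutoffs are not reproduced here): keep an utterance only if a domain classifier says it is in-domain and its ASR pseudo-label looks reliable.

```python
# Toy two-stage data selection: domain gate + pseudo-label confidence gate.

def select(utterances, domain_prob, pseudo_conf,
           domain_thr=0.5, conf_thr=0.8):
    """Keep utterances passing both the domain and confidence gates."""
    return [u for u in utterances
            if domain_prob[u] >= domain_thr and pseudo_conf[u] >= conf_thr]

utts = ["u1", "u2", "u3"]
domain_prob = {"u1": 0.9, "u2": 0.2, "u3": 0.95}  # made-up scores
pseudo_conf = {"u1": 0.85, "u2": 0.9, "u3": 0.6}
print(select(utts, domain_prob, pseudo_conf))     # → ['u1']
```

Fine-tuning only on the surviving subset is what yields the compute savings at matched WER.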

XLSR-Transducer: Streaming ASR for Self-Supervised Pretrained Models

ICASSP 2025

S. Madikeri, J. P. Zuluaga, P. Rangappa, E. Villatoro-Tello, P. Motlicek.

Streaming ASR atop a frozen self-supervised backbone, without sacrificing non-streaming accuracy.

Open-source Conversational AI with SpeechBrain 1.0

JMLR 2024

M. Ravanelli, T. Parcollet, A. Moumen, S. de Langen, C. Subakan, P. Plantinga, Y. Liao, S. Cornell, D. Roman, S. Moradi, D. Chander, D. Petermann, Y. Wang, J. P. Zuluaga, et al.

Co-authored the 1.0 release of SpeechBrain — a PyTorch toolkit for conversational AI.

End-to-end single-channel speaker-turn aware conversational speech translation

EMNLP 2023

J. P. Zuluaga, Z. Huang, X. Niu, R. Paturi, S. Srinivasan, P. Mathur, B. Thompson, M. Federico.

First end-to-end speech translation system that handles speaker turns and overlapped speech on a single channel.

HyperConformer: Multi-Head HyperMixer for Efficient Speech Recognition

Interspeech 2023

F. Mai, J. P. Zuluaga, T. Parcollet, P. Motlicek.

Replaces Conformer attention with HyperMixer, matching accuracy at a fraction of the compute.

CommonAccent: Exploring Large Acoustic Pretrained Models for Accent Classification

Interspeech 2023

J. P. Zuluaga, S. Sarfjoo, A. Prasad, I. Nigmatulina, P. Motlicek, K. Ondrej, O. Ohneiser, H. Helmke.

Accent classification benchmark on Common Voice using large self-supervised models — Best Student Paper nominee.

How Does Pre-Trained Wav2Vec 2.0 Perform on Domain Shifted ASR? An Extensive Benchmark on Air Traffic Control Communications

IEEE SLT 2022

J. P. Zuluaga, A. Prasad, I. Nigmatulina, S. Sarfjoo, P. Motlicek, M. Kleinert, H. Helmke, O. Ohneiser, Q. Zhan.

Systematic study of self-supervised pretraining under domain shift — 20–40% relative WER cut on Air Traffic Control.

→ all publications

featured projects

ATCO2 Corpus

5,000 hours of Air Traffic Control communications — the largest open ATC speech dataset.

A multilingual, semi-automatically labeled corpus built to advance ASR and natural language understanding on one of the hardest real-world speech domains. Includes audio, transcripts, speaker role annotations, and a preprocessing pipeline. Used as a benchmark by follow-up work across Europe.

Speech · ASR · Dataset · NLU

wav2vec2-atc

Self-supervised ASR models fine-tuned for Air Traffic Control, available on HuggingFace.

A family of Wav2Vec2 models that achieve 20–40% relative WER reduction on ATC data compared to supervised baselines. Released with training recipes, evaluation scripts, and a Colab notebook for immediate inference. The benchmark paper at SLT 2022 studies self-supervised pretraining behavior under heavy domain shift.

ASR · Self-supervised · Wav2Vec2 · HuggingFace

BERTraffic

Joint speaker-role and speaker-change detection from ATC transcripts — no audio required.

Most ATC diarization systems rely on the audio signal, which in this domain is low-quality and made up of very short exchanges. BERTraffic reframes the problem as text classification: given a transcript, predict the speaker turns and whether each turn comes from a pilot or a controller. Beats audio-only baselines by 27% DER.

NLP · BERT · Diarization · ATC
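The decoding side of this reframing is simple enough to sketch. Assuming a token classifier that tags each transcript token with a B/I turn marker plus a speaker role (an illustrative label scheme, not necessarily BERTraffic's exact one), recovering the diarization is just grouping tags back into turns:

```python
# Toy decoder for text-based diarization: group per-token B/I + role
# tags (e.g. "B-pilot", "I-pilot") into (role, text) speaker turns.

def labels_to_turns(tokens, labels):
    turns = []
    for tok, lab in zip(tokens, labels):
        marker, role = lab.split("-", 1)   # "B-pilot" -> ("B", "pilot")
        if marker == "B" or not turns or turns[-1][0] != role:
            turns.append((role, [tok]))    # open a new speaker turn
        else:
            turns[-1][1].append(tok)       # extend the current turn
    return [(role, " ".join(ws)) for role, ws in turns]

tokens = ("lufthansa three two one descend flight level eight zero "
          "descending eight zero lufthansa three two one").split()
labels = ["B-controller"] + ["I-controller"] * 8 \
       + ["B-pilot"] + ["I-pilot"] * 6
print(labels_to_turns(tokens, labels))
```

The example transcript and labels are fabricated; the point is that speaker-change and speaker-role detection collapse into one sequence-labeling pass over text.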

HyperConformer

A Conformer variant where attention is replaced with HyperMixer — matched accuracy, less compute.

Attention is the expensive part of Conformer-based ASR models. HyperConformer swaps it for a multi-head HyperMixer, which scales linearly in sequence length rather than quadratically. Same WER as Conformer at a meaningful compute cut.

ASR · Architecture · Efficient ML
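The linear-vs-quadratic point is worth seeing in code. A heavily simplified numpy sketch of HyperMixer-style token mixing (the paper's weight-generating MLPs are collapsed to single linear maps here, and layer norms, positions, and residuals are omitted): the token-mixing weights are generated from the tokens themselves, so every matmul is O(N·d·d′), linear in sequence length N, rather than attention's O(N²).

```python
# Toy HyperMixer-style token mixing: weights W1, W2 are generated from
# the token embeddings, then used to mix information across the token axis.
import numpy as np

rng = np.random.default_rng(0)
N, d, d_prime = 6, 8, 4              # tokens, model dim, mixing hidden dim

X = rng.normal(size=(N, d))          # token embeddings
P1 = rng.normal(size=(d, d_prime)) / np.sqrt(d)  # "hypernetwork" for W1
P2 = rng.normal(size=(d, d_prime)) / np.sqrt(d)  # "hypernetwork" for W2

W1 = X @ P1                          # (N, d'): input-dependent weights
W2 = X @ P2                          # (N, d')

def gelu(x):
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

# Mix across tokens: project the token axis down to d', apply a
# nonlinearity, project back. No N x N matrix is ever formed.
Y = W2 @ gelu(W1.T @ X)              # (N, d)
print(Y.shape)                       # → (6, 8)
```

In HyperConformer this mixer stands in for multi-head self-attention inside an otherwise standard Conformer block.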

SpeechBrain 1.0

Co-authored the 1.0 release of the open-source conversational AI toolkit.

SpeechBrain is a PyTorch-based toolkit for speech and language tasks, used by dozens of research groups and startups. The 1.0 release (JMLR 2024) consolidates years of contributions into a stable API with comprehensive recipes for ASR, TTS, speaker recognition, and dialogue understanding.

Open Source · PyTorch · Speech · NLP

→ all projects