Open-source projects, datasets, and research artifacts I’ve built or co-authored. Most are available on GitHub or HuggingFace.
Active contributor to vLLM-Omni — the inference engine for omni-modality models (text, speech, audio, vision).
Ongoing contributions to vLLM-Omni’s Qwen3-TTS and OmniVoice paths: streaming output, Code2Wav batched decoding, CUDA Graph + torch.compile, voice cloning, and throughput/latency optimization for high-concurrency TTS serving.
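The streaming-output idea above can be sketched in a few lines: decode codec frames in small batches and emit audio as soon as each batch is ready, instead of waiting for the full utterance. This is a toy illustration of the latency trade-off only; all names (`decode_frames`, `stream_tts`, `FRAME_SAMPLES`) are hypothetical and not vLLM-Omni's actual API.

```python
from typing import Iterator, List

# Hypothetical stand-ins for a Code2Wav-style codec decoder;
# none of these names come from vLLM-Omni itself.
FRAME_SAMPLES = 480  # assumed audio samples produced per codec frame

def decode_frames(frames: List[int]) -> List[float]:
    """Pretend codec decoder: maps each discrete code to FRAME_SAMPLES samples."""
    return [float(code) for code in frames for _ in range(FRAME_SAMPLES)]

def stream_tts(code_stream: Iterator[List[int]], chunk_frames: int = 4) -> Iterator[List[float]]:
    """Yield audio as soon as `chunk_frames` codec frames are buffered,
    rather than after the full utterance -- lower first-chunk latency."""
    buffer: List[int] = []
    for frames in code_stream:
        buffer.extend(frames)
        while len(buffer) >= chunk_frames:
            chunk, buffer = buffer[:chunk_frames], buffer[chunk_frames:]
            yield decode_frames(chunk)
    if buffer:  # flush any remaining frames at end of stream
        yield decode_frames(buffer)
```

Batched decoding in a real server would decode chunks from many concurrent requests together; the generator above only shows the per-request streaming contract.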
5,000 hours of Air Traffic Control communications — the largest open ATC speech dataset.
A multilingual, semi-automatically labeled corpus built to advance ASR and natural language understanding on one of the hardest real-world speech domains. Includes audio, transcripts, speaker role annotations, and a preprocessing pipeline. Used as a benchmark by follow-up work across Europe.
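To make the annotation layers concrete, here is what one record might look like. The field names and values below are purely illustrative, not the corpus's actual schema:

```python
# Hypothetical record layout -- field names are illustrative only,
# not the dataset's real schema.
example = {
    "audio_path": "recordings/sample_0001.wav",
    "transcript": "ryanair one two alfa descend flight level eight zero",
    "speaker_role": "atco",   # controller ("atco") or "pilot"
    "language": "en",
}

def words_per_segment(records):
    """Tiny preprocessing-style helper: word counts per transcript,
    e.g. for filtering out empty or truncated segments."""
    return [len(r["transcript"].split()) for r in records]
```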
Self-supervised ASR models fine-tuned for Air Traffic Control, available on HuggingFace.
A family of Wav2Vec2 models that achieve 20–40% relative WER reduction on ATC data compared to supervised baselines. Released with training recipes, evaluation scripts, and a Colab notebook for immediate inference. The benchmark paper at SLT 2022 studies self-supervised pretraining behavior under heavy domain shift.
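For readers unfamiliar with the metric, "20–40% relative WER reduction" is computed against the baseline's own error rate, not in absolute points. A minimal sketch of word error rate and the relative-reduction arithmetic:

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            # min of deletion, insertion, substitution/match
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (r != h))
    return dp[-1]

def wer(refs, hyps):
    """WER = total word errors / total reference words."""
    errors = sum(edit_distance(r.split(), h.split()) for r, h in zip(refs, hyps))
    words = sum(len(r.split()) for r in refs)
    return errors / words

def relative_wer_reduction(baseline, improved):
    """E.g. baseline WER 0.30 -> improved 0.21 is a 30% relative reduction,
    even though the absolute drop is only 9 points."""
    return (baseline - improved) / baseline
```

The numbers passed to `relative_wer_reduction` here are examples, not results from the released models.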
Joint speaker-role and speaker-change detection from ATC transcripts — no audio required.
Most ATC diarization systems rely on the audio signal, but ATC audio is noisy and utterances are short. BERTraffic reframes the problem as text classification: given a transcript, predict the speaker turns and whether each turn belongs to a pilot or a controller. It beats audio-only baselines by 27% DER.
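The text-only framing can be illustrated with a toy pipeline: take turns produced by speaker-change detection, then classify each turn's role from its words alone. The keyword heuristic below is a deliberately simple stand-in for the fine-tuned BERT classifiers used in the actual work; the cue lists are invented for illustration.

```python
# Toy stand-in for a text-based role classifier. Real BERTraffic fine-tunes
# BERT; these hand-picked cue words are illustrative only.
CONTROLLER_CUES = {"cleared", "contact", "descend", "climb", "turn"}
PILOT_CUES = {"wilco", "roger", "request", "with", "you"}

def classify_role(turn: str) -> str:
    """Label one turn as controller ('atco') or 'pilot' from its words."""
    words = set(turn.lower().split())
    atco_hits, pilot_hits = len(words & CONTROLLER_CUES), len(words & PILOT_CUES)
    return "atco" if atco_hits >= pilot_hits else "pilot"

def label_turns(turns):
    """Given pre-segmented turns (the speaker-change step's output), label roles."""
    return [(turn, classify_role(turn)) for turn in turns]
```

The point of the framing is that both subtasks, turn segmentation and role assignment, become standard sequence/text classification problems with no acoustic features at all.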
A Conformer variant where attention is replaced with HyperMixer — matched accuracy, less compute.
Attention is the expensive part of Conformer-based ASR models. HyperConformer swaps it for a multi-head HyperMixer, which scales linearly in sequence length rather than quadratically. Same WER as Conformer at a meaningful compute cut.
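The linear-in-length property comes from generating the token-mixing weights with a per-token hypernetwork instead of forming an N×N attention matrix. A minimal NumPy sketch of that idea, simplified from the paper (single matrices stand in for the small hypernetwork MLPs, and all shapes are illustrative):

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def hypermixer_token_mixing(X, P1, P2):
    """HyperMixer-style token mixing. The (N, d_hidden) mixing weights W1, W2
    are generated from the tokens themselves, one row per token, so the cost
    is O(N * d * d_hidden) -- linear in sequence length N, unlike the O(N^2)
    pairwise scores of self-attention."""
    W1 = X @ P1                   # (N, d_hidden), generated per token
    W2 = X @ P2                   # (N, d_hidden)
    return W2 @ gelu(W1.T @ X)    # (N, d_hidden) @ (d_hidden, d) -> (N, d)

rng = np.random.default_rng(0)
N, d, d_hidden = 6, 8, 16         # illustrative sizes only
X = rng.normal(size=(N, d))
P1 = rng.normal(size=(d, d_hidden))
P2 = rng.normal(size=(d, d_hidden))
Y = hypermixer_token_mixing(X, P1, P2)
```

Note that no intermediate has an N×N shape: the largest mixing-related tensor is `W1.T @ X`, which is (d_hidden, d) regardless of sequence length.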
Co-authored the 1.0 release of the open-source conversational AI toolkit.
SpeechBrain is a PyTorch-based toolkit for speech and language tasks, used by dozens of research groups and startups. The 1.0 release (JMLR 2024) consolidates years of contributions into a stable API with comprehensive recipes for ASR, TTS, speaker recognition, and dialogue understanding.