We release Wav2Vec2 and XLS-R models fine-tuned on public ATC datasets (ATCOSIM, LDC-ATCC, UWB-ATCC) through HuggingFace, for anyone to benchmark or build on.
## Headline results
- 20–40% relative WER reduction vs. supervised Conformer baselines on in-domain ATC test sets
- Cross-accent generalization via XLS-R: a single model trained on mixed European ATC data
- ~6% WER on ATCOSIM with the Wav2Vec2-Large fine-tune
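The numbers above use the standard word error rate. As a refresher, WER is (substitutions + insertions + deletions) divided by the number of reference words; a minimal self-contained computation (word-level Levenshtein distance, assuming a non-empty reference):

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate: edit distance over words / reference length.

    Assumes the reference transcript is non-empty.
    """
    r, h = ref.split(), hyp.split()
    # One-row dynamic program for Levenshtein distance over words.
    d = list(range(len(h) + 1))
    for i, rw in enumerate(r, 1):
        prev, d[0] = d[0], i
        for j, hw in enumerate(h, 1):
            # d[j] (old) = deletion, d[j-1] (new) = insertion,
            # prev = substitution / match on the diagonal.
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (rw != hw))
    return d[len(h)] / len(r)
```

For example, `wer("turn left heading three two zero", "turn left heading tree two zero")` is one substitution over six reference words, i.e. about 16.7%.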
## What’s released
| Model | Training data | Link |
|---|---|---|
| Wav2Vec2-Large ATC | ATCOSIM | HuggingFace ↗ |
| Wav2Vec2-Base ATC | LDC-ATCC | HuggingFace ↗ |
| XLS-R ATC | ATCOSIM + LDC-ATCC + UWB-ATCC | HuggingFace ↗ |
## Try it
The Colab notebook loads any of the models and transcribes an audio sample in under a minute; no GPU is required for inference.
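If you prefer a local script, a minimal sketch using the HuggingFace `transformers` ASR pipeline. The model ID below is a placeholder, not a real repo name; substitute the actual checkpoint name from the table above.

```python
import wave


def is_16khz(path: str) -> bool:
    # Wav2Vec2/XLS-R checkpoints expect 16 kHz audio input.
    with wave.open(path, "rb") as f:
        return f.getframerate() == 16000


def transcribe(path: str, model_id: str = "<org>/wav2vec2-large-atc") -> str:
    # model_id is a hypothetical placeholder -- use the checkpoint
    # name from the "What's released" table instead.
    from transformers import pipeline  # pip install transformers torch

    if not is_16khz(path):
        raise ValueError("resample the clip to 16 kHz before transcribing")
    asr = pipeline("automatic-speech-recognition", model=model_id)
    return asr(path)["text"]
```

The sample-rate check matters in practice: ATC recordings are often distributed at 8 kHz, and feeding them to a 16 kHz model without resampling silently degrades WER.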
## Context
This work is part of my PhD at Idiap/EPFL and was presented at IEEE SLT 2022. The accompanying paper systematically studies how self-supervised representations transfer under heavy domain shift, a setting that was surprisingly under-studied before we published.