Wav2Vec2 and XLS-R models fine-tuned on public ATC datasets (ATCOSIM, LDC-ATCC, UWB-ATCC), released through HuggingFace for anyone to benchmark or build on.

Headline results

  • 20–40% relative WER reduction vs. supervised Conformer baselines on in-domain ATC test sets
  • Cross-accent generalization via XLS-R: a single model trained on mixed European ATC data
  • ~6% WER on ATCOSIM with the Wav2Vec2-Large fine-tune
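For concreteness, relative WER reduction is the baseline WER minus the model WER, divided by the baseline WER. A minimal sketch of both computations in plain Python (the helper names and toy numbers are my own, not from the paper):

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    r, h = ref.split(), hyp.split()
    # One-row dynamic-programming Levenshtein over word sequences
    d = list(range(len(h) + 1))
    for i in range(1, len(r) + 1):
        prev, d[0] = d[0], i
        for j in range(1, len(h) + 1):
            cur = d[j]
            d[j] = min(d[j] + 1,                         # deletion
                       d[j - 1] + 1,                     # insertion
                       prev + (r[i - 1] != h[j - 1]))    # substitution
            prev = cur
    return d[-1] / len(r)

def relative_wer_reduction(baseline: float, model: float) -> float:
    """Relative improvement of `model` over `baseline`, as a fraction."""
    return (baseline - model) / baseline

# Toy numbers: a baseline at 10% WER and a fine-tune at 7% WER
print(round(relative_wer_reduction(0.10, 0.07), 2))  # → 0.3, i.e. 30% relative
```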

What’s released

Model                 Training data    Link
Wav2Vec2-Large ATC    ATCOSIM          HuggingFace ↗
Wav2Vec2-Base ATC     LDC-ATCC         HuggingFace ↗
XLS-R ATC             All public ATC   HuggingFace ↗

Try it

The Colab notebook loads any of the models and transcribes an audio sample in under a minute; no GPU is needed for inference.
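Under the hood, all three releases are CTC fine-tunes, so transcription reduces to a frame-wise argmax over the logits followed by greedy CTC decoding: collapse consecutive repeats, then drop blanks. A minimal sketch of that decoding step (the vocabulary and frame sequence are hypothetical; blank id 0 is a common convention, assumed here):

```python
BLANK = 0  # CTC blank token id (assumed to sit at index 0)

def ctc_greedy_decode(frame_ids, id_to_char):
    """Greedy CTC decode: collapse consecutive repeats, then drop blanks."""
    out = []
    prev = None
    for t in frame_ids:
        if t != prev and t != BLANK:
            out.append(id_to_char[t])
        prev = t
    return "".join(out)

# Hypothetical vocabulary and per-frame argmax ids; the blank between the
# two 3s keeps the genuine double letter from being collapsed
vocab = {1: "h", 2: "e", 3: "l", 4: "o"}
frames = [1, 1, 2, 0, 3, 3, 0, 3, 4]
print(ctc_greedy_decode(frames, vocab))  # → hello
```

Beam-search decoding with a language model can lower WER further, but greedy decoding is what keeps CPU-only inference fast enough for the notebook.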

Context

This work is part of my PhD at Idiap/EPFL and was presented at IEEE SLT 2022. The accompanying paper systematically studies how self-supervised speech representations transfer under heavy domain shift, a setting that had received surprisingly little attention before we published.