The idea
Traditional speaker diarization (“who spoke when”) relies on acoustic features — but for Air Traffic Control, the audio signal is poor: VHF noise, short turns (~2s average), and a single mono channel for both parties.
BERTraffic sidesteps that: take the ASR transcript, fine-tune BERT on a two-head classification task, and output both (a) turn boundaries and (b) the role of each turn (pilot/controller). Surprisingly, text-only beats the strongest audio-only baselines available.
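A minimal sketch of the two-head idea, with NumPy standing in for the BERT encoder (the layer sizes, labels, and random embeddings are illustrative assumptions, not the paper's actual architecture): both heads read the same shared token representations, and their losses are summed into one joint objective.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for BERT output: one embedding per transcript token (T tokens, H dims).
# In the real model these would come from a fine-tuned BERT encoder.
T, H = 8, 16
hidden = rng.normal(size=(T, H))

# Two classification heads sharing the encoder output (hypothetical sizes):
W_turn = rng.normal(size=(H, 2)) * 0.1  # head (a): turn boundary vs. no boundary
W_role = rng.normal(size=(H, 2)) * 0.1  # head (b): pilot vs. controller

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

turn_probs = softmax(hidden @ W_turn)  # (T, 2) per-token boundary probabilities
role_probs = softmax(hidden @ W_role)  # (T, 2) per-token role probabilities

# Joint training: the two cross-entropy losses are summed and optimized
# together, rather than running segmentation and role classification
# as separate pipeline stages.
y_turn = rng.integers(0, 2, size=T)  # toy labels for illustration
y_role = rng.integers(0, 2, size=T)
loss = -(np.log(turn_probs[np.arange(T), y_turn]).mean()
         + np.log(role_probs[np.arange(T), y_role]).mean())
```

Because the heads share one encoder, gradients from both tasks shape the same representations, which is the mechanism behind the joint-training gain reported below.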
Results
- 27% relative DER reduction vs. audio baseline on ATCO2 test set
- Works with noisy ASR transcripts, not just gold text — the model is robust to ~15% WER input
- Joint training of the two heads outperforms a pipelined setup (segment turns first, then classify each turn's role)
Released
- Full training / evaluation code on GitHub
- Fine-tuned BERT models on HuggingFace (pilot/controller classifier, turn-change classifier)