The idea

Traditional speaker diarization (“who spoke when”) relies on acoustic features, but for Air Traffic Control the audio signal is poor: VHF noise, short turns (about 2 s on average), and a single mono channel carrying both parties.

BERTraffic sidesteps that: take the ASR transcript, fine-tune BERT on a two-head classification task, and output both (a) turn boundaries and (b) the role of each turn (pilot/controller). Surprisingly, this text-only approach beats the strongest available audio-only baselines.
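A minimal sketch of what such a two-head setup could look like: one shared BERT encoder feeding two per-token classification heads. Head names, label sets, and hyperparameters here are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
from transformers import BertConfig, BertModel

class TwoHeadBert(nn.Module):
    """Joint model: one BERT encoder feeding two per-token heads,
    one for turn-boundary detection and one for speaker role."""
    def __init__(self, encoder):
        super().__init__()
        self.encoder = encoder
        hidden = encoder.config.hidden_size
        self.boundary_head = nn.Linear(hidden, 2)  # turn change: no / yes
        self.role_head = nn.Linear(hidden, 2)      # pilot / controller

    def forward(self, input_ids, attention_mask=None):
        h = self.encoder(input_ids=input_ids,
                         attention_mask=attention_mask).last_hidden_state
        return self.boundary_head(h), self.role_head(h)

# Tiny randomly initialised encoder so the sketch runs anywhere;
# in practice you would start from a pretrained checkpoint via
# BertModel.from_pretrained("bert-base-uncased").
cfg = BertConfig(vocab_size=1000, hidden_size=64, num_hidden_layers=2,
                 num_attention_heads=2, intermediate_size=128)
model = TwoHeadBert(BertModel(cfg))

ids = torch.randint(0, 1000, (1, 12))  # one 12-token transcript chunk
boundary_logits, role_logits = model(ids)
# Joint objective: cross-entropy on each head, summed, e.g.
# loss = ce(boundary_logits.transpose(1, 2), boundary_labels) \
#      + ce(role_logits.transpose(1, 2), role_labels)
```

Because both heads share the encoder, a single backward pass trains them jointly, which is the setup the paper reports as stronger than a two-stage pipeline.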

Results

  • 27% relative DER reduction vs. the audio-only baseline on the ATCO2 test set
  • Works with noisy ASR transcripts, not just gold text; the model stays robust on input at roughly 15% WER
  • Joint training of the two heads outperforms pipelining them
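Once the two heads emit per-word predictions, turning them into diarization output is a small decoding step. A minimal sketch, assuming a `1` boundary flag marks the first word of a new turn and taking each turn's role by majority vote (the paper's exact aggregation may differ):

```python
def decode_turns(words, boundary, roles):
    """Group words into turns: a new turn starts where boundary[i] == 1.
    Each turn's role is the majority role among its words."""
    turns, cur_words, cur_roles = [], [], []
    for word, flag, role in zip(words, boundary, roles):
        if flag == 1 and cur_words:  # close the previous turn
            turns.append((max(set(cur_roles), key=cur_roles.count),
                          " ".join(cur_words)))
            cur_words, cur_roles = [], []
        cur_words.append(word)
        cur_roles.append(role)
    if cur_words:  # flush the final turn
        turns.append((max(set(cur_roles), key=cur_roles.count),
                      " ".join(cur_words)))
    return turns

# Toy mono-channel transcript split into two turns:
words = "lufthansa one two cleared to land cleared to land lufthansa one two".split()
boundary = [1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
roles = ["controller"] * 6 + ["pilot"] * 6
turns = decode_turns(words, boundary, roles)
# → [('controller', 'lufthansa one two cleared to land'),
#    ('pilot', 'cleared to land lufthansa one two')]
```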

Released

  • Full training and evaluation code on GitHub
  • Fine-tuned BERT models on HuggingFace (pilot/controller classifier, turn-change classifier)