ATCO2 is a project funded under the European Union's Horizon 2020 programme through the Clean Sky 2 Joint Undertaking. It ran from November 2019 to February 2022 and aimed to produce a large-scale open corpus of Air Traffic Control communications.

What’s in it

  • ~5,000 hours of audio from live VHF recordings across European airspace
  • Transcripts (semi-automatic, manually verified on a subset)
  • Speaker role labels (pilot vs. air traffic controller)
  • Contextual metadata (callsigns, commands, waypoints — linked to surveillance data)
  • Three subsets: 4-hour gold-standard test set · 1-hour test subset · ATCO2-PL, the full ~5,000-hour pseudo-labeled pool
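
To make the list above concrete, here is a minimal sketch of how one annotated utterance could be represented in Python. The class name, field names, and sample values are all hypothetical and chosen for illustration; they are not the corpus's actual file format or schema.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class SpeakerRole(Enum):
    PILOT = "pilot"
    ATCO = "atco"  # air traffic controller


@dataclass
class Utterance:
    """One transcribed radio transmission (hypothetical schema, not the official one)."""
    audio_path: str
    transcript: str
    speaker: SpeakerRole
    manually_verified: bool = False  # True only for the verified subset
    callsigns: List[str] = field(default_factory=list)
    commands: List[str] = field(default_factory=list)
    waypoints: List[str] = field(default_factory=list)


def by_role(utterances: List[Utterance], role: SpeakerRole) -> List[Utterance]:
    """Return only the utterances spoken by the given role."""
    return [u for u in utterances if u.speaker is role]


# Made-up sample records, for illustration only.
sample = [
    Utterance(
        audio_path="a.wav",
        transcript="lufthansa three two one descend flight level eight zero",
        speaker=SpeakerRole.ATCO,
        callsigns=["DLH321"],
        commands=["descend"],
    ),
    Utterance(
        audio_path="b.wav",
        transcript="descending flight level eight zero lufthansa three two one",
        speaker=SpeakerRole.PILOT,
        callsigns=["DLH321"],
    ),
]

pilot_only = by_role(sample, SpeakerRole.PILOT)
```

Keeping the speaker role as an enum rather than a free-text string makes role-based filtering, such as separating pilot and controller speech, less error-prone.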

Why it matters

ATC speech is hard for ASR: noisy VHF channels, rapid speech, heavy code-switching between English and local languages, and a strictly domain-specific vocabulary. Before ATCO2, the largest open ATC datasets held only a few dozen hours, too little to train modern self-supervised systems. ATCO2 makes it possible to study how large pretrained models behave under domain shift at a realistic scale.

Downstream use

The corpus has been used as a training set or benchmark by follow-up work on contextual ASR, speaker diarization, callsign recognition, and speech translation. It is distributed through the ELRA catalogue for research use.