Transcription of air traffic controllers’ speech recordings

Over the last months the MALORCA project team at ACG had to tackle some significant challenges but finally succeeded in reliably transcribing air traffic controllers’ utterances. The utterances were made available as 8 kHz speech recordings of live conversations with aircraft. Transcription describes the process of transforming speech (wave files) into plain text (a sequence of words).

Some of the encountered issues were difficult to overcome within the predetermined timeframe. For example, the recorded speech data often contained the controllers’ communication not only with arrivals but also with departure and en-route flights, as control responsibility often rests with the same controller. Therefore, the language model needed to be upgraded: the transcription tool from the former AcListant® project did not support departures and overflights. Although this limitation was known in advance, it made transcription significantly more difficult and demanding.

Furthermore, in daily operations ATC usually takes advantage of frequency coupling. This technology is very beneficial when ATC sectors are merged or split. In earlier years, adjacent units needed to be informed about changes of the frequency to which aircraft should be transferred whenever the sectorisation was changed. Today a unique frequency is permanently in use for each sector, and as soon as two or more sectors are merged the relevant frequencies simply need to be coupled. In other words, adjacent units may transfer aircraft to the same frequency at all times, regardless of the sectorisation of the downstream unit. However, frequency coupling is technically unfavourable for the production of usable recordings, so procedural mitigations had to be found and implemented. In practical terms, frequencies needed to be de-coupled and adjacent units had to be kept updated about all frequency changes again, which led to a substantial increase in controller workload during these phases.
Another challenge related to the fact that in many cases one voice recording included more than a single controller utterance, e.g. a sequence of commands to aircraft A, ten seconds of silence, a sequence of commands to aircraft B, silence, and so on. Therefore, Idiap developed a tool that automatically splits the controller utterances into separate files wherever silence occurs.
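The idea behind such silence-based splitting can be sketched as follows. This is a minimal illustration, not Idiap's actual tool: it assumes the audio is available as a list of PCM sample values and simply cuts a recording wherever the amplitude stays below a threshold for long enough. The function name, threshold, and minimum-gap parameter are all illustrative assumptions.

```python
def split_on_silence(samples, sample_rate=8000, threshold=500, min_silence_s=1.0):
    """Return (start, end) sample ranges of non-silent segments.

    A run of at least `min_silence_s` seconds in which the absolute
    amplitude stays below `threshold` separates two utterances.
    Hypothetical sketch; a real splitter would use frame energies
    and smoothing rather than per-sample amplitudes.
    """
    min_gap = int(min_silence_s * sample_rate)
    segments = []
    start = None       # index where the current utterance began
    silent_run = 0     # length of the current run of quiet samples
    for i, s in enumerate(samples):
        if abs(s) >= threshold:
            if start is None:
                start = i
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= min_gap:
                # close the utterance just before the silence began
                segments.append((start, i - silent_run + 1))
                start = None
                silent_run = 0
    if start is not None:
        segments.append((start, len(samples)))
    return segments
```

Each returned range would then be written out as its own wave file containing a single controller utterance.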

After successfully addressing all these issues, the emphasis could be placed on the actual transcription tool. Initial attempts at automatically transcribing the speech data were not fully satisfactory: almost every controller utterance required at least some manual corrections. However, within a few weeks significant improvements were achieved, and the number of errors in the transcriptions has since been reduced substantially.

After the transcription of the speech data files had been accomplished, the annotation process was started, i.e. transforming the sequence of words into the relevant ATC concepts. The word sequence “sunturk four zero zulu fly direct whisky whisky nine seven two” seems quite easy; the relevant concepts are “PGT40Z DIRECT_TO WW972”. We try to transform a command into three concepts, i.e. callsign, command and value, and in some cases we have a fourth concept, the unit. Consider “swiss one one five zulu turn left heading zero seven zero descend three thousand feet”: the concept elements here are “SWR115Z TURN_LEFT_HEADING 70” and “SWR115Z DESCEND 3000 ALT”. However, what do you suggest for the following sequence of words: “november eight nine two delta echo descend flight level one one zero when clear of weather you are cleared direct to gesgi or correction to movos or destination whatever you prefer”? We suggest “N892DE DESCEND 110 FL”. Readers are encouraged to send us their feedback by email.
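For the two straightforward examples above, the word-to-concept mapping can be sketched as a small rule-based parser. This is a deliberately simplified toy, not the project's actual annotation grammar: the vocabulary tables hold only the handful of words needed here, and the airline-designator mapping and command keywords are illustrative assumptions.

```python
DIGITS = {"zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
          "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9"}
NATO = {"whisky": "W", "zulu": "Z"}           # excerpt of the spelling alphabet
AIRLINES = {"sunturk": "PGT", "swiss": "SWR"}  # illustrative telephony designators

def take_code(words, i):
    """Consume consecutive digit/letter words from index i; return (code, next i)."""
    out = []
    while i < len(words) and (words[i] in DIGITS or words[i] in NATO):
        out.append(DIGITS.get(words[i]) or NATO[words[i]])
        i += 1
    return "".join(out), i

def annotate(utterance):
    """Map a word sequence to a list of 'CALLSIGN COMMAND VALUE [UNIT]' concepts."""
    words = utterance.split()
    code, i = take_code(words, 1)
    callsign = AIRLINES[words[0]] + code
    concepts = []
    while i < len(words):
        if words[i:i + 2] == ["fly", "direct"]:
            waypoint, i = take_code(words, i + 2)
            concepts.append(f"{callsign} DIRECT_TO {waypoint}")
        elif words[i:i + 3] == ["turn", "left", "heading"]:
            heading, i = take_code(words, i + 3)
            concepts.append(f"{callsign} TURN_LEFT_HEADING {int(heading)}")
        elif words[i] == "descend":
            # only handles the "<digit> thousand feet" pattern of the example
            value = int(DIGITS[words[i + 1]]) * 1000
            concepts.append(f"{callsign} DESCEND {value} ALT")
            i += 3
        else:
            i += 1
    return concepts
```

The hard part, as the “november eight nine two delta echo …” example shows, is everything this sketch ignores: corrections, conditions (“when clear of weather”), and alternatives within a single utterance.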

Naturally, even the smallest deviations of the pronounced words from what the annotation tool expected initially led to errors in the translation from speech to software-processed commands. Careful calibration was required to handle these deviations, and the importance of having comprehensive context (a set of command hypotheses) available must not be underestimated. The database also needs permanent updating to ensure that all common airline and waypoint names, etc., are always included. MALORCA will, however, automate this process by automatically learning airline and waypoint names from examples. MALORCA will thus not only improve speech recognition performance but also improve the annotation and transcription process.
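The basic idea of harvesting new names from examples can be illustrated with a toy sketch. MALORCA's actual learning is model-based; the following merely shows the simplest form of the concept, under the assumption that frequent out-of-vocabulary words in the transcripts are likely new airline or waypoint names worth reviewing.

```python
from collections import Counter

def vocabulary_candidates(transcripts, known_vocab, min_count=3):
    """Return out-of-vocabulary words that occur at least `min_count`
    times across the transcripts; these are candidates for new airline
    or waypoint names to be added to the database after review."""
    counts = Counter(word
                     for transcript in transcripts
                     for word in transcript.split()
                     if word not in known_vocab)
    return [word for word, count in counts.items() if count >= min_count]
```

In practice such candidates would feed a review step rather than being added blindly, since misrecognised words are also out-of-vocabulary.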

We can finally conclude that a pleasingly high degree of accuracy has been achieved in the automatic transfer of transcribed speech data into software-processed commands.