Training an 8 kHz recognizer with 16 kHz samples

The MALORCA project team has been working on the narrowband/wideband mismatch in speech recognition of ATC data. In recent years, large corpora of wideband training data (i.e. sampled at 16 kHz or higher) have been collected for building state-of-the-art applications that process wideband speech. One difficulty in deploying automatic speech recognition systems arises when the target systems in operation record only narrowband speech (i.e. sampled at 8 kHz), as is often the case in the ATC world. Most of the speech corpora already recorded for incident evaluation are sampled only at 8 kHz.

The performance obtained by restricting the input bandwidth of the recognizer to 8 kHz is significantly lower, since wideband (16 kHz) speech contains information that is useful for classifying phones. Nevertheless, many training corpora for adapting the acoustic models are available from the ATC world. These wideband training sets are not directly usable for 8 kHz applications, and simply downsampling them to 8 kHz discards information.
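To make the bandwidth loss concrete, the short sketch below (assuming NumPy and SciPy; the signal content is illustrative) downsamples a 16 kHz signal to 8 kHz. Everything above the new 4 kHz Nyquist limit is removed by the anti-aliasing filter and cannot be recovered afterwards.

    # Illustrative sketch: downsampling 16 kHz audio to 8 kHz.
    import numpy as np
    from scipy.signal import resample_poly

    fs_wide = 16000
    t = np.arange(fs_wide) / fs_wide  # one second of audio
    # Two tones: 1 kHz survives downsampling; 6 kHz lies above the
    # new 4 kHz Nyquist limit and is removed by the anti-aliasing filter.
    x_wide = np.sin(2 * np.pi * 1000 * t) + np.sin(2 * np.pi * 6000 * t)

    x_narrow = resample_poly(x_wide, up=1, down=2)  # 16 kHz -> 8 kHz

    spectrum = np.abs(np.fft.rfft(x_narrow))
    freqs = np.fft.rfftfreq(len(x_narrow), d=1 / 8000)
    print(freqs[spectrum.argmax()])  # ~1000 Hz; the 6 kHz component is gone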

The MALORCA project therefore investigated mixed-bandwidth training using deep neural networks (DNNs) through multi-task learning and hierarchical DNN structures. These networks jointly predict wideband and narrowband targets, and thereby indirectly exploit both narrowband and wideband speech features. Our experimental work shows that speech features extracted with deep learning remain portable for domain adaptation even when the sampling rates of the corpora do not match.
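A minimal PyTorch sketch of the multi-task idea is given below; the layer sizes, target inventories, and feature dimension are illustrative placeholders, not MALORCA's actual configuration. Shared hidden layers feed two task-specific output heads, one trained on narrowband (8 kHz) targets and one on wideband (16 kHz) targets, so the shared layers see both bandwidths during training.

    # Sketch of a multi-task acoustic model: shared layers, two heads.
    # All dimensions below are illustrative placeholders.
    import torch
    import torch.nn as nn

    class MultiTaskAM(nn.Module):
        def __init__(self, feat_dim=40, hidden=512,
                     nb_targets=2000, wb_targets=2000):
            super().__init__()
            self.shared = nn.Sequential(        # layers shared by both tasks
                nn.Linear(feat_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
            )
            self.nb_head = nn.Linear(hidden, nb_targets)  # narrowband targets
            self.wb_head = nn.Linear(hidden, wb_targets)  # wideband targets

        def forward(self, feats):
            h = self.shared(feats)
            return self.nb_head(h), self.wb_head(h)

    model = MultiTaskAM()
    loss_fn = nn.CrossEntropyLoss()
    opt = torch.optim.SGD(model.parameters(), lr=0.01)

    # One illustrative step: each minibatch comes from one corpus, so only
    # the matching head is updated, while the shared layers learn from both.
    feats = torch.randn(32, 40)                # dummy feature frames
    nb_labels = torch.randint(0, 2000, (32,))  # dummy narrowband targets
    nb_logits, _ = model(feats)
    loss = loss_fn(nb_logits, nb_labels)
    opt.zero_grad()
    loss.backward()
    opt.step()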

The proposed approaches were evaluated on air traffic control operator speech sampled at 8 kHz. The results (about a 10% relative improvement in word error rate compared to simply downsampling the training data to 8 kHz) show that wideband training data can be exploited to improve the recognition of narrowband speech through multi-task DNN learning. The MALORCA project plans to use these findings, i.e. combining 8 kHz and 16 kHz data, in building and improving the baseline speech recognition system, which will be further improved by machine learning techniques in work package 4.

Furthermore, the Idiap Research Institute has made progress on automatic acoustic model learning using unsupervised adaptation techniques. We are currently experimenting with low-dimensional structures of phone posterior probabilities and their further enhancement for deep-learning-based acoustic modeling. The proposed technique efficiently exploits untranscribed speech data (new ATC recordings) for automatic adaptation of the acoustic models to the ATC domain.
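As a rough illustration of how low-dimensional posterior structure might be exploited (a generic sketch assuming NumPy, not Idiap's exact method), one can compute a rank-k reconstruction of the frame-by-phone posterior matrix produced by a seed recognizer on untranscribed audio and renormalize it into enhanced soft targets for adaptation:

    # Hedged sketch: low-rank enhancement of phone posteriors via
    # truncated SVD. Illustrates the general idea only.
    import numpy as np

    def enhance_posteriors(post, rank=10):
        """post: (frames, phones) posterior matrix from a seed recognizer."""
        u, s, vt = np.linalg.svd(post, full_matrices=False)
        low_rank = u[:, :rank] @ np.diag(s[:rank]) @ vt[:rank, :]
        low_rank = np.clip(low_rank, 1e-8, None)  # keep probabilities positive
        return low_rank / low_rank.sum(axis=1, keepdims=True)

    # Dummy posteriors for 1000 untranscribed frames over 40 phone classes.
    rng = np.random.default_rng(0)
    post = rng.dirichlet(np.ones(40), size=1000)
    soft_targets = enhance_posteriors(post, rank=10)
    print(soft_targets.shape, soft_targets.sum(axis=1)[:3])  # rows sum to 1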