Various accents pose a problem for automated speech recognition software. If accents can be detected reliably, accent-specific spoken language models can be applied to speech recognition software, yielding a more accurate interpretation of spoken language.
In this tutorial, we describe our exploration of Cambridge University's Hidden Markov Model Toolkit (HTK) as a tool for spoken accent prediction. We explore the classification of the pronunciation of the word "security" as spoken by native English and native Spanish speakers.
HTK is distributed through the HTK website. The HTK developers require that you register for a username and password on their site before downloading the software. After registering, visit the downloads page and download the HTK source code (available as a tarball). It is also useful to download the HTKBook as a PDF (available on the downloads page, below the software). If you do not wish to download the book, you can view it online after registering.
Northwestern University’s Online Speech/Corpora Archive and Analysis Resource (OSCAAR) is a collection of speech recordings from speakers with different backgrounds, assembled from various datasets.
The dataset that we found most appropriate for our goal of accent detection and classification is the ALLSTAR dataset from the Speech and Communication Research Group at Northwestern University.
It is possible to record your own audio for this classification using HSLab. If, like us, you already have a dataset and will not use HSLab's recording features, you can supply a configuration file that changes the input format the program expects.
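For example, a minimal configuration file telling HSLab to expect standard WAV input (the file names here are our own choices, not prescribed by HTK) might look like this:

```shell
# Write an HTK configuration file declaring that source audio is
# Microsoft WAV rather than HTK's native format.
cat > hslab.conf <<'EOF'
SOURCEFORMAT = WAV
EOF

# A recording can then be opened for labeling with:
#   HSLab -C hslab.conf security01.wav
```

The same SOURCEFORMAT setting is understood by the other HTK tools that read audio.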
HSLab allowed us to label the boundaries between words and the surrounding silence in each .wav file. Once all important sections of a .wav file were labeled, HSLab produced a .lab file for that clip.
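A .lab file is plain text: one line per labeled segment, giving the start time, the end time (both in HTK's 100-nanosecond units), and the label. A hypothetical label file for a clip containing one utterance of "security" surrounded by silence might look like:

```shell
# Hypothetical label file: start time, end time (100 ns units), label.
cat > security01.lab <<'EOF'
0 4500000 sil
4500000 12300000 security
12300000 16000000 sil
EOF
```

The times shown here are illustrative; HSLab fills in the real boundaries you mark.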
The .wav files themselves cannot be analyzed using HTK, so we used HCopy to convert the original .wav files into .mfcc files. The .mfcc files, which each contain a set of vector representations of the sound signal, can be analyzed. Each 25ms segment is represented by a vector of acoustical coefficients, which provides a description of that segment’s spectral properties.
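HCopy reads its conversion parameters from a configuration file. A sketch of such a file, with typical values drawn from the HTKBook (not necessarily the exact values we used), followed by the HCopy invocation:

```shell
# Conversion parameters for HCopy (values are illustrative defaults).
cat > hcopy.conf <<'EOF'
# input is standard WAV
SOURCEFORMAT = WAV
# output MFCC vectors with an appended energy coefficient
TARGETKIND = MFCC_0
# 25 ms analysis window, expressed in 100 ns units
WINDOWSIZE = 250000.0
# 10 ms between successive windows
TARGETRATE = 100000.0
# 12 cepstral coefficients per vector
NUMCEPS = 12
EOF

# Convert one file:
#   HCopy -C hcopy.conf security01.wav security01.mfcc
```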
Training is the process of estimating the parameters of the HMMs from labeled sound examples. To start, we initialize each HMM with the HTK tool HInit, which time-aligns the training data using the Viterbi algorithm.
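A sketch of preparing an HInit run for one accent model (the script file name, directory layout, and prototype name proto_security are our own inventions, not HTK requirements):

```shell
# List the training feature files for the English-accent model
# (file names are hypothetical).
cat > train_english.scp <<'EOF'
mfcc/english01.mfcc
mfcc/english02.mfcc
EOF

# Directory for the initialized model.
mkdir -p hmm0

# Initialize a prototype HMM on the labeled "security" segments;
# -l selects the label to train on, -L points at the .lab directory:
#   HInit -S train_english.scp -M hmm0 -l security -L labels proto_security
```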
Then we train the models by running HRest on each HMM until convergence. HRest uses Baum-Welch parameter re-estimation to perform one re-estimation iteration on an input HMM, producing a new HMM.
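Because each HRest call performs a single iteration, re-estimation is typically repeated until the log-likelihood change HRest reports becomes small. A sketch of the loop (three iterations shown; the directory layout and file names are our own):

```shell
# Run three Baum-Welch re-estimation passes, each reading the model
# written by the previous pass into its own directory.
for i in 1 2 3; do
  mkdir -p "hmm$i"
  prev=$((i - 1))
  # Each pass would invoke HRest roughly like this:
  #   HRest -S train_english.scp -M "hmm$i" -l security -L labels \
  #         -H "hmm$prev/proto_security" proto_security
done
```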
Once you have created HMMs for each of the accents you’ll be including in your test sample, you’re ready to define the task. The first step in defining the task is creating a grammar, which contains the syntactic structure of examples to be tested.
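For this task the grammar only has to choose between the accent models. A sketch of a grammar file and a dictionary mapping each grammar word to its HMM (the names ENGLISH, SPANISH, security_english, and security_spanish are our own), plus the HParse call that compiles the grammar into a word network:

```shell
# Task grammar: the input is a single word produced by one accent model.
cat > grammar <<'EOF'
$accent = ENGLISH | SPANISH;
( $accent )
EOF

# Dictionary mapping each grammar word to its whole-word HMM.
cat > dict <<'EOF'
ENGLISH security_english
SPANISH security_spanish
EOF

# Compile the grammar into a word network:
#   HParse grammar wdnet
```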
The tool used to run a new test sample (an MFCC file) through the network is HVite, which writes the recognized word, in our case the accent label, to an output transcription.
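A sketch of the recognition step (all file names here are illustrative; the HMM list simply names every trained model):

```shell
# List the names of all trained HMMs for HVite.
cat > hmmlist <<'EOF'
security_english
security_spanish
EOF

# Run recognition on one test file; the winning accent model's word
# appears in the output label file results.mlf:
#   HVite -H hmmsdef -i results.mlf -w wdnet dict hmmlist test.mfcc
```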