Various accents pose a problem for automated speech recognition software. If accents can be detected reliably, accent-specific spoken language models can be applied to speech recognition software, yielding a more accurate interpretation of spoken language.
In this tutorial, we describe our exploration of Cambridge University's Hidden Markov Model Toolkit (HTK) as a tool for spoken accent prediction. We explore the classification of the pronunciation of the word "security" as spoken by native English and native Spanish speakers.
HTK is distributed through the HTK website. The HTK developers require that you register for a username and password on their site before downloading the software. After registering, visit the downloads page and download the HTK source code (available as a tarball). It is also useful to download the HTKBook as a PDF (available on the downloads page, below the software). If you do not wish to download the book, you can view it online after registering.
Northwestern University’s Online Speech/Corpora Archive and Analysis Resource (OSCAAR) is a collection of speech recordings from speakers with different backgrounds, assembled from various datasets.
The dataset that we found most appropriate for our goal of accent detection and classification is the ALLSTAR dataset from the Speech and Communication Research Group at Northwestern University.
It is possible to record your own audio for this classification using HSLab. If, like us, you already have a dataset and will not use HSLab's recording features, you can supply a configuration file that changes the input format the program expects.
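For example, a minimal configuration file telling HSLab to expect standard WAV input (the file names here are our own choices, not prescribed by HTK) might look like this:

```shell
# Write an HTK configuration file declaring that source audio is
# Microsoft WAV rather than HTK's native format.
cat > hslab.conf <<'EOF'
SOURCEFORMAT = WAV
EOF

# A recording can then be opened for labeling with:
#   HSLab -C hslab.conf security01.wav
```

The same SOURCEFORMAT setting is understood by the other HTK tools that read audio.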
HSLab allowed us to label the boundaries between words and the surrounding silence in each .wav file. Once all important sections of a .wav file were labeled, HSLab produced a .lab file for that clip.
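A .lab file is plain text: one line per labeled segment, giving the start time, the end time (both in HTK's 100-nanosecond units), and the label. A hypothetical label file for a clip containing one utterance of "security" surrounded by silence might look like:

```shell
# Hypothetical label file: start time, end time (100 ns units), label.
cat > security01.lab <<'EOF'
0 4500000 sil
4500000 12300000 security
12300000 16000000 sil
EOF
```

The times shown here are illustrative; HSLab fills in the real boundaries you mark.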
The .wav files themselves cannot be analyzed using HTK, so we used HCopy to convert the original .wav files into .mfcc files. The .mfcc files, which each contain a set of vector representations of the sound signal, can be analyzed. Each 25ms segment is represented by a vector of acoustical coefficients, which provides a description of that segment’s spectral properties.
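HCopy reads its conversion parameters from a configuration file. A sketch of such a file, with typical values drawn from the HTKBook (not necessarily the exact values we used), followed by the HCopy invocation:

```shell
# Conversion parameters for HCopy (values are illustrative defaults).
cat > hcopy.conf <<'EOF'
# input is standard WAV
SOURCEFORMAT = WAV
# output MFCC vectors with an appended energy coefficient
TARGETKIND = MFCC_0
# 25 ms analysis window, expressed in 100 ns units
WINDOWSIZE = 250000.0
# 10 ms between successive windows
TARGETRATE = 100000.0
# 12 cepstral coefficients per vector
NUMCEPS = 12
EOF

# Convert one file:
#   HCopy -C hcopy.conf security01.wav security01.mfcc
```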
Training is the process of estimating the parameters of the HMMs from labeled sound examples. To start, we initialize each HMM with the HTK tool HInit, which time-aligns the training data using the Viterbi algorithm.
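A sketch of preparing an HInit run for one accent model (the script file name, directory layout, and prototype name proto_security are our own inventions, not HTK requirements):

```shell
# List the training feature files for the English-accent model
# (file names are hypothetical).
cat > train_english.scp <<'EOF'
mfcc/english01.mfcc
mfcc/english02.mfcc
EOF

# Directory for the initialized model.
mkdir -p hmm0

# Initialize a prototype HMM on the labeled "security" segments;
# -l selects the label to train on, -L points at the .lab directory:
#   HInit -S train_english.scp -M hmm0 -l security -L labels proto_security
```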
Then we train the models by running HRest on each HMM until convergence. HRest uses Baum-Welch parameter re-estimation to perform one re-estimation iteration on an input HMM, producing a new HMM.
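Because each HRest call performs a single iteration, re-estimation is typically repeated until the log-likelihood change HRest reports becomes small. A sketch of the loop (three iterations shown; the directory layout and file names are our own):

```shell
# Run three Baum-Welch re-estimation passes, each reading the model
# written by the previous pass into its own directory.
for i in 1 2 3; do
  mkdir -p "hmm$i"
  prev=$((i - 1))
  # Each pass would invoke HRest roughly like this:
  #   HRest -S train_english.scp -M "hmm$i" -l security -L labels \
  #         -H "hmm$prev/proto_security" proto_security
done
```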
Once you have created HMMs for each of the accents you’ll be including in your test sample, you’re ready to define the task. The first step in defining the task is creating a grammar, which contains the syntactic structure of examples to be tested.
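For this task the grammar only has to choose between the accent models. A sketch of a grammar file and a dictionary mapping each grammar word to its HMM (the names ENGLISH, SPANISH, security_english, and security_spanish are our own), plus the HParse call that compiles the grammar into a word network:

```shell
# Task grammar: the input is a single word produced by one accent model.
cat > grammar <<'EOF'
$accent = ENGLISH | SPANISH;
( $accent )
EOF

# Dictionary mapping each grammar word to its whole-word HMM.
cat > dict <<'EOF'
ENGLISH security_english
SPANISH security_spanish
EOF

# Compile the grammar into a word network:
#   HParse grammar wdnet
```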
The tool used to run a new test sample (an MFCC file) through the network is HVite, which writes the recognized word, in our case the accent label, to an output transcription.
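A sketch of the recognition step (all file names here are illustrative; the HMM list simply names every trained model):

```shell
# List the names of all trained HMMs for HVite.
cat > hmmlist <<'EOF'
security_english
security_spanish
EOF

# Run recognition on one test file; the winning accent model's word
# appears in the output label file results.mlf:
#   HVite -H hmmsdef -i results.mlf -w wdnet dict hmmlist test.mfcc
```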