Isolated Word Recognition


05.07.10 Posted in EEN502Project by

This is the third and final project of EEN540 – Digital Speech and Audio Processing.

For this project, I will be using a corpus of pre-recorded subjects to test two different methods of isolated word recognition. Both methods involve dynamic time warping (DTW), and this will be a speaker-dependent system. The corpus can be found here. Each subject in the corpus has 50 utterances: 5 repetitions of each digit, zero through nine. The first four repetitions of each digit will be used for training, and the last for testing.

The first system uses 17 linearly spaced band energies as features. First, a template of features for each digit is found by using dynamic time warping: the training features for each digit are warped to the shortest one in the collection. Then, each testing digit is compared to all ten templates, giving a length-10 vector of DTW distances. The index of the minimum of this vector is the digit the system recognized.
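To make the comparison step concrete, here is a minimal Python sketch of a DTW distance and the pick-the-minimum decision. It is not the MATLAB code from this project: the function names, the Euclidean frame distance, and the unconstrained step pattern (horizontal, vertical, diagonal) are all assumptions chosen for illustration.

```python
import math

def frame_distance(a, b):
    """Euclidean distance between two feature frames (assumed local cost)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw_distance(seq_a, seq_b):
    """Accumulated DTW cost between two sequences of feature frames."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j] = best accumulated cost aligning seq_a[:i] with seq_b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(seq_a[i - 1], seq_b[j - 1])
            # Allowed steps: advance in seq_a, advance in seq_b, or both
            cost[i][j] = d + min(cost[i - 1][j],
                                 cost[i][j - 1],
                                 cost[i - 1][j - 1])
    return cost[n][m]

def recognize(test_features, templates):
    """Return the index of the template with the smallest DTW distance."""
    distances = [dtw_distance(test_features, t) for t in templates]
    return distances.index(min(distances))
```

With ten templates ordered zero through nine, the index returned by `recognize` is the recognized digit. Building the templates themselves (warping the training repetitions onto the shortest one) is not shown here.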

This recognized digit is compared with the ground truth, and the results for the 17 linearly spaced band energies are below. The data are collected in a confusion matrix, and the word error rate (WER) is also reported.
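For isolated words, the WER reduces to the percentage of test utterances that were misclassified. The following Python sketch shows one way to tabulate both; the function names and the rows-are-truth, columns-are-recognized layout are assumptions, not taken from the project code.

```python
def confusion_matrix(truth, predicted, num_classes=10):
    """Count utterances by (true digit, recognized digit).

    Rows index the true digit, columns the recognized digit (assumed layout).
    """
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(truth, predicted):
        matrix[t][p] += 1
    return matrix

def word_error_rate(matrix):
    """Percentage of utterances that fall off the diagonal."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return 100.0 * (total - correct) / total
```

A perfect recognizer puts every count on the diagonal, so off-diagonal entries show directly which digits are being confused with which.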


The next system is identical to the first, except that it uses mel-frequency cepstral coefficients (MFCCs) as the features. This proved to be far more successful, and the results can be seen below.
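As a rough illustration of the last stage of a typical MFCC front end, the Python sketch below turns one frame of mel-spaced band energies into cepstral coefficients by taking the log and applying a DCT-II. The number of coefficients, the unnormalized DCT, and the small floor on the energies are assumptions; the project's actual settings are not stated here.

```python
import math

def cepstral_coefficients(band_energies, num_coeffs=13):
    """Log band energies followed by an unnormalized DCT-II.

    band_energies: one frame of mel filterbank energies (assumed input).
    num_coeffs: how many coefficients to keep (13 is an assumed default).
    """
    # Floor the energies so that log() never sees zero
    log_e = [math.log(max(e, 1e-12)) for e in band_energies]
    n = len(log_e)
    return [
        sum(log_e[k] * math.cos(math.pi * c * (k + 0.5) / n) for k in range(n))
        for c in range(num_coeffs)
    ]
```

Coefficient 0 tracks the overall log energy of the frame, while the higher coefficients describe the shape of the spectral envelope.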

Finally, the templates from the mel-cepstral coefficient (MFCC) system are used to test spoken data that I recorded of myself saying the digits. All fifty utterances were correctly classified.

I am quite happy with the results from the MFCC recognition systems. The 6.00% WER for the second system seemed to be due to a single speaker who was consistently misclassified. That subject's speech was likely the most dissimilar to the rest of the corpus. I was also pleasantly surprised that all 50 of my utterances were correctly classified with the MFCC approach.

The 17 linearly spaced band energies did not provide an adequate system for speech classification. I am completely sure that my code is correct; these energies are just not good features. I manually inspected the features to see if something was going wrong: only the first three bands provided any meaningful data as a feature, and the rest were very close to zero. This is most likely why mel/bark spacing is preferred to linear spacing for feature extraction.
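The difference between the two spacings can be seen by comparing band edges. The Python sketch below uses the common mel formula m = 2595 log10(1 + f/700); the 4000 Hz upper edge in the usage note is an assumption (it corresponds to an 8 kHz sampling rate, which may not match the corpus), and the function names are illustrative.

```python
import math

def hz_to_mel(f):
    """Common mel-scale approximation."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def linear_edges(num_bands, f_max):
    """Band edges in Hz, equally spaced in frequency."""
    return [f_max * i / num_bands for i in range(num_bands + 1)]

def mel_edges(num_bands, f_max):
    """Band edges in Hz, equally spaced on the mel scale."""
    m_max = hz_to_mel(f_max)
    return [mel_to_hz(m_max * i / num_bands) for i in range(num_bands + 1)]
```

Calling `linear_edges(17, 4000.0)` and `mel_edges(17, 4000.0)` side by side shows that the mel-spaced bands are narrower at low frequencies and wider at high frequencies, so more of the bands land in the low-frequency region where the first few linear bands were carrying all the information.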

For my recorded utterances, see here.

For all of the MATLAB code used, see here.



