Archive for EEN502Project

Isolated Word Recognition

EEN502Project on May 7th, 2010

This is the third and final project of EEN540 – Digital Speech and Audio Processing.

For this project, I will be using a corpus of pre-recorded subjects to test two different methods of isolated word recognition. Both methods involve dynamic time warping (DTW), and the system is speaker-dependent. The corpus can be found here. Each subject in the corpus has 50 utterances: 5 repetitions of each digit from zero through nine. The first four repetitions of each digit are used for training, and the last is used for testing.

The first system uses 17 linearly spaced band energies as features. First, a template of features for each digit is built using dynamic time warping: the features for each digit are warped to the shortest template in the collection. Then, each test digit is compared to all ten templates, giving a length-10 vector of DTW distances. The index of the minimum of this vector is the digit the system recognizes.
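The comparison step can be sketched in a few lines of Python. This is only a minimal illustration, not the MATLAB code linked in this post: the Euclidean frame distance, the basic three-way step pattern, and the names `dtw_distance` and `recognize` are all assumptions made for the example.

```python
import math


def dtw_distance(a, b):
    """Accumulated DTW cost between two sequences of feature vectors."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = best accumulated cost aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])  # assumed Euclidean frame distance
            cost[i][j] = d + min(cost[i - 1][j],      # stretch the template
                                 cost[i][j - 1],      # stretch the test word
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m]


def recognize(test_features, templates):
    """Return (index of the closest template, list of all DTW distances)."""
    distances = [dtw_distance(test_features, t) for t in templates]
    return distances.index(min(distances)), distances
```

With ten digit templates, `recognize` returns the recognized digit together with the length-10 distance vector described above.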

The recognized digit is compared with the ground truth, and the results for the 17 linearly spaced band energies are shown below. The data are collected in a confusion matrix, and the word error rate (WER) is also reported.


The next system is identical to the first, except that it uses mel-cepstral coefficients (MFCCs) as features. This proved to be far more successful, and the results can be seen below.

And finally, the templates from the mel-cepstral coefficient system are used to test spoken data that I recorded of myself saying the digits. All fifty utterances were correctly classified.

I am quite happy with the results from the MFCC recognition systems. The 6.00% WER for the second system seemed to be due to a single speaker who was consistently misclassified. That subject's speech was likely the most dissimilar to the rest of the corpus. I was also pleasantly surprised that all 50 of my own utterances were correctly classified with the MFCC approach.

The 17 linearly spaced band energies did not provide an adequate system for speech classification. I am completely sure that my code is correct; these energies are just not good features. I manually inspected the features to check whether something was going wrong. Only the first three bands provided any meaningful data as a feature, and the rest were very close to zero. This is most likely why mel/bark spacing is preferred to linear spacing for feature extraction.
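To see how differently the two spacings divide the spectrum, here is a small Python sketch that computes band edges both ways. It uses one common form of the mel formula, and the upper frequency `f_max` is a placeholder I chose for illustration, since the corpus sampling rate is not stated here.

```python
import math


def hz_to_mel(f):
    """One common mel-scale formula (other variants exist)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def linear_edges(num_bands, f_max):
    """Band edges equally spaced in Hz from 0 to f_max."""
    return [f_max * k / num_bands for k in range(num_bands + 1)]


def mel_edges(num_bands, f_max):
    """Band edges equally spaced on the mel scale from 0 to f_max."""
    m_max = hz_to_mel(f_max)
    return [mel_to_hz(m_max * k / num_bands) for k in range(num_bands + 1)]


# f_max = 8000 Hz is an assumed value, not taken from the project.
lin = linear_edges(17, 8000.0)
mel = mel_edges(17, 8000.0)
print("first linear band width (Hz):", lin[1] - lin[0])
print("first mel band width (Hz):   ", mel[1] - mel[0])
```

The mel-spaced bands are much narrower at low frequencies, so more of the 17 features land in the region where the linear bands carried almost all of the energy.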

For my recorded utterances, see here.

For all of the MATLAB code used, see here.


Sound Production Modeling Using Concatenated Acoustic Tubes

EEN502Project on March 22nd, 2010

Results and Code for Part A

Results and Code for Part B

Code for Part C – Hey

Hey Sound

Code for Part C – Wow

Wow Sound


Part A seems to have come out alright. Nothing was extraordinarily surprising – all figures were kind of what I expected. I think the same about part B. The result signal looks very much like a glottal pulse.

Part C is where I experienced some difficulties. As can be heard from the sound samples above, they are not entirely convincing. That being said, the ‘wow’ sounds OK, but the ‘hey’ sounds more like ‘hi’. I didn’t spend too much time on finding an appropriate energy envelope, but I did spend time trying to make the pitch of the synthesized word sound realistic. For each part of the word, I created new glottal pulses with appropriate frequencies. When transitioning between phonemes, a single, new area function is found by interpolating between the others. To achieve a more realistic sound, I tried playing around with different parameters of each phoneme or transition between phonemes: the time/duration in seconds, the pitch, and the envelope. I feel that the ‘wow’ came out well, but I have some issues with my ‘hey’. It seems my ‘a’ is more like a long ‘a’ as in ‘aww’ where I needed an ‘a’ as in ‘apex’. Darn!
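The transition between phonemes can be illustrated with a short Python sketch. It assumes plain linear interpolation between two area functions with the same number of tube sections; the function name and the sample values are made up for the example and are not taken from the project code.

```python
def interpolate_area(area_a, area_b, alpha):
    """Blend two vocal-tract area functions section by section.

    alpha = 0 returns area_a, alpha = 1 returns area_b.
    Linear blending is an assumption made for this sketch.
    """
    if len(area_a) != len(area_b):
        raise ValueError("area functions must have the same number of sections")
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(area_a, area_b)]


# Hypothetical tube areas (arbitrary units), only to show the call.
start = [1.0, 2.0, 4.0, 2.0]
end = [3.0, 2.0, 1.0, 0.5]
print(interpolate_area(start, end, 0.5))
```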


The Acoustic Features of Speech Sounds

EEN502Project on February 14th, 2010


Part A

Click on one of the phonemes below to view its acoustic features.


Or click below to see 3D plots of Power Spectral Density vs Frequency vs Time.


For a downloadable zip file of the sounds, please click here. The differences between the major phonemic categories are seen in the above figures, and the results are as expected. Vowels exhibit highly periodic spectrograms, while unvoiced phonemes were largely characterized by their noisy spectrograms.

Part B

An ideal telephone channel was created using overlap-add frequency-domain filtering with zero padding. The channel is shown below.
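For readers unfamiliar with overlap-add, here is a self-contained Python sketch of the idea: the signal is cut into blocks, each block is zero-padded and multiplied by the filter's spectrum, and the overlapping tails of the results are summed. The small recursive FFT and the function names are assumptions made so the example runs without extra libraries, and the filter `h` is any short impulse response rather than the telephone channel itself.

```python
import cmath


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out


def ifft(spectrum):
    """Inverse FFT via the conjugate trick."""
    n = len(spectrum)
    conj = fft([v.conjugate() for v in spectrum])
    return [v.conjugate() / n for v in conj]


def overlap_add(signal, h, block_len):
    """Filter `signal` with impulse response `h`, one block at a time."""
    n_fft = 1
    while n_fft < block_len + len(h) - 1:  # zero padding avoids circular wrap-around
        n_fft *= 2
    h_spec = fft(list(h) + [0.0] * (n_fft - len(h)))
    out = [0.0] * (len(signal) + len(h) - 1)
    for start in range(0, len(signal), block_len):
        block = list(signal[start:start + block_len])
        x_spec = fft(block + [0.0] * (n_fft - len(block)))
        y = ifft([a * b for a, b in zip(x_spec, h_spec)])
        # Each block's output is longer than the block; the tails overlap and add.
        for i in range(len(block) + len(h) - 1):
            out[start + i] += y[i].real
    return out
```

The result matches ordinary linear convolution of the whole signal with `h`, which is what makes the block-wise approach safe to use on long recordings.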

Both a spoken sentence and a sung sentence were passed through the ideal filter. The plots of their magnitude spectra are below. If interested, check out the original song and the filtered song.

Also, the phonemes from before were run through this ideal filter. This had a particular effect on the voiced and unvoiced fricatives, since the filter removed aspects that previously distinguished the two.


EEN502 Project 3

EEN502Project on December 4th, 2009

Problem 1 consists of four parts.

I first recorded myself saying my first and last name. Then, I generated and displayed the time waveform, wideband spectrogram, and narrowband spectrogram. Finally, I chose two 30 ms vowel segments and displayed their magnitude and phase spectra.

The code for all four parts can be found here:
Code for Problem 1
Sound for Problem 1 (my name)

Problem 2

In this problem, we first generate a chirp signal that sweeps from 20 Hz to 20 kHz (the human audible range). Next, an equal loudness curve is generated using the official ISO 226 standard. Finally, an “equal loudness chirp” is created by filtering the original chirp in the frequency domain, simply multiplying its spectrum by the equal loudness curve. An audio file was also passed through the filtering process to generate an equal loudness version of the sound. The spectra of the original and equal loudness versions can be seen in the last figure in the link below. The boost in the low frequencies is apparent, but the effect on the high frequencies is less obvious because there weren’t many to begin with.
Code and figures for Problem 2
Sound for Problem 2 (chirp)
Sound for Problem 2 (equal loudness chirp)
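As a rough illustration of the chirp itself, here is a Python sketch of a linear sweep. Whether the original chirp was linear or logarithmic is not stated, so the linear sweep, the duration, and the function names are all assumptions for the example.

```python
import math


def instantaneous_frequency(f_start, f_end, duration, t):
    """Frequency (Hz) of a linear sweep at time t seconds."""
    return f_start + (f_end - f_start) * t / duration


def linear_chirp(f_start, f_end, duration, sample_rate):
    """Samples of a sine whose frequency rises linearly from f_start to f_end."""
    n = round(duration * sample_rate)
    rate = (f_end - f_start) / duration  # sweep rate in Hz per second
    samples = []
    for i in range(n):
        t = i / sample_rate
        # Phase is the integral of the instantaneous frequency.
        phase = 2.0 * math.pi * (f_start * t + 0.5 * rate * t * t)
        samples.append(math.sin(phase))
    return samples


# Assumed settings: a 2 second sweep at 44100 Hz.
chirp = linear_chirp(20.0, 20000.0, 2.0, 44100)
print(len(chirp), "samples")
```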

BONUS

I chose the second method outlined in Zwicker, which uses two pure tones and the threshold of narrowband noise that is found by changing the separation of the two tones. The most challenging part of this bonus problem was generating narrowband noise (100 Hz bandwidth) when the sampling rate is as high as 44100 Hz. I tried every kind of filter design technique I knew to get a filter with those specifications: IIR, FIR, Butterworth, Chebyshev, elliptic, Parks-McClellan, cascaded second-order-section filtering, etc. None of these techniques provided the appropriate filter. I believe the problems I was having are discussed in the elliptic filter’s help file under the “Limitations” section. I finally gave up and used Simulink blocks to produce the appropriate noise source. Even then, it took Simulink about 4 minutes to generate two seconds of narrowband noise!

I set out to verify Zwicker’s data in the Critical Band PDF, which shows the test data used to find the critical band at 2 kHz. I have two main loops in the program. The outer loop adjusts the frequency separation of the two pure tones, and the inner loop adjusts the level of the narrowband noise until the user decides it is masked. This was as accurate as I could think to make it. Thankfully, I did end up with a graph that closely resembles the one found in Zwicker’s book, and I got a similar result for the critical bandwidth at 2 kHz.
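The two-loop structure can be sketched in Python as follows. This is only an outline of the control flow: `is_masked` stands in for the listener's yes/no judgment, and the function name, the level steps, and the toy listener below are all hypothetical and carry no psychoacoustic data.

```python
def find_thresholds(separations_hz, noise_levels_db, is_masked):
    """For each tone separation, find the first noise level judged as masked.

    noise_levels_db is tried in the order given (for example, loud to quiet).
    Returns a dict mapping separation -> level, or None if never masked.
    """
    thresholds = {}
    for separation in separations_hz:      # outer loop: tone separation
        thresholds[separation] = None
        for level in noise_levels_db:      # inner loop: noise level
            if is_masked(separation, level):
                thresholds[separation] = level
                break
    return thresholds


# Toy stand-in for the listener; the numbers are invented for the demo only.
def toy_listener(separation, level):
    return level <= (40 if separation < 300 else 20)


print(find_thresholds([100, 500], [60, 50, 40, 30, 20, 10], toy_listener))
```

In the real experiment the callback would play the two tones plus the noise and wait for the user's answer.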

This work definitely could have been extended to find critical bandwidths at other frequencies, but I believe that I have demonstrated the appropriate concepts. All I would have to do is add another outer loop that iterates over different center frequencies. The link below shows the code and the resulting figures from these tests.

Code and figures for Bonus

Conclusion and Results

The first problem was straightforward. I originally forgot to unwrap the phase, but everything else went smoothly. However, the second problem did present quite a challenge. I was not satisfied with estimating an equal loudness curve, so I downloaded the official ISO 226 files from the MathWorks website. This file returned two length-29 arrays (frequencies and levels) that correspond to the equal loudness curve. The frequency values are heavily weighted towards those under 500 Hz; in other words, the SPL values were not linearly spaced in frequency. To remedy this, I used spline interpolation over a range of 0 to 20500 for 100 data points. This created 100 points that were indeed linearly spaced in frequency and also went up to our Nyquist frequency (the ISO 226 spec stops at 12500 Hz). Then, I resampled the new dataset to be the appropriate length for my frequency domain multiplication. The final chirp’s waveform clearly shows the equal loudness curve boosting and attenuating the signal in appropriate places. There is some slight noise in the final signal. This could be a result of the equal loudness curve zeroing out a very small range of high frequencies. When an audio file was used and its equal loudness counterpart was generated, this noise was not noticeable. (The “Bonus” section contains a thorough discussion/conclusion about finding critical bands with MATLAB.)


EEN 502 – Project 2

EEN502Project on October 26th, 2009

This project seeks to explore spatial audio with a library of HRTF data provided from: http://sound.media.mit.edu/resources/KEMAR.html

Problem 1 consists of three parts.

The code for all three parts can be found here:
Code for Problem 1

a) A gaussian noise source moving clockwise at 2 revolutions per second, 4 seconds total
Sound for Problem 1, Part A

b) A gaussian noise source moving counter-clockwise at 1 revolution per second, 2 seconds total
Sound for Problem 1, Part B

c) Both sources from part a and b. Part b was extended to four seconds so that they both ran for the same amount of time.
Sound for Problem 1, Part C

Problem 2

In this problem, a three tone signal moves along a line in front of a “listener.” The sound takes into account the doppler effect, distance envelope, and HRTFs for spatialization.
Code for Problem 2
Sound for Problem 2

BONUS

Moving source 3D MATLAB plot.
Sound for Problem 2

Conclusion and Results

Problem one was a success. The code was even able to spatialize audio when given a musical signal instead of broadband noise. However, problem two has its problems. As heard in the audio sample above, tiny clicks occur in the resulting audio whenever the algorithm switches between HRTF impulse responses. This is likely because our HRTF data are spaced five degrees apart. A simple low-pass filter could mitigate the clicks, but a more elegant solution might be to interpolate between the impulse responses separated by five degrees. This interpolation could yield impulse responses with a resolution of 1 degree or finer. Such an implementation would reduce the clicking and generally provide a more accurate resulting signal.
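One simple way to try the interpolation idea is sketched below in Python. It linearly blends the two measured impulse responses that bracket the desired azimuth; this sample-by-sample blend is a crude assumption (it ignores the differing arrival delays of the two responses), and the function name and toy values are made up for the example.

```python
def interpolate_hrir(hrir_lo, hrir_hi, az_lo, az_hi, azimuth):
    """Blend two impulse responses measured at az_lo and az_hi (degrees).

    Returns an estimated impulse response for an azimuth between them,
    using plain linear weighting of the samples.
    """
    if not az_lo <= azimuth <= az_hi or az_lo == az_hi:
        raise ValueError("azimuth must lie between two distinct measured angles")
    w = (azimuth - az_lo) / (az_hi - az_lo)
    return [(1.0 - w) * a + w * b for a, b in zip(hrir_lo, hrir_hi)]


# Toy two-sample responses at 10 and 15 degrees, only to show the call.
print(interpolate_hrir([1.0, 0.0], [0.0, 1.0], 10.0, 15.0, 12.0))
```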


Experimenting with Doppler Sound Effects

EEN502Project on September 23rd, 2009

This project seeks to explore the doppler sound effect in several different scenarios.

In Part 1, the source sound moves towards a listener on the same plane.
Part 1 code and figures
Part 1 sound

In Part 2, the observer is separated from the source sound trajectory by an offset.
Part 2 code and figures
Part 2 sound
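The geometry of the offset case can be sketched in Python. The example assumes a source moving along the x-axis at constant speed, a stationary observer at a perpendicular offset, a speed of sound of 343 m/s, and no propagation delay; none of these details are taken from the linked m-files.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value


def observed_frequency(f_source, speed, x, offset):
    """Doppler-shifted frequency heard by a stationary observer.

    The source is at position x on its path (x < 0 while approaching,
    x > 0 while receding), moving in the +x direction at `speed` m/s.
    The observer sits `offset` metres away from the path at x = 0.
    """
    r = math.hypot(x, offset)          # source-to-observer distance
    if r == 0.0:
        return f_source                # shift is undefined exactly at the observer
    radial_velocity = speed * x / r    # rate of change of distance (+ = receding)
    return f_source * SPEED_OF_SOUND / (SPEED_OF_SOUND + radial_velocity)


for x in (-50.0, 0.0, 50.0):
    print(x, round(observed_frequency(440.0, 30.0, x, 5.0), 2))
```

With a nonzero offset the pitch glides smoothly through the source frequency as the source passes, instead of jumping abruptly as it does when the offset is zero.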

In Part 3, the offset remains, but the tone from the source consists of three tones. For this, I chose a major triad.
Part 3 code and figures
Part 3 sound

Part 4 had two different components. The components replicate Parts 1 and 2 respectively, but they add a second observer. The two observers act as “ears” of a listener, and a stereo track is produced as a result.
Part 4a code and figures
Part 4a sound
Part 4b code and figures
Part 4b sound

Here, you can download all five m-files and run them on any copy of MATLAB.
Download M-Files as a ZIP

Conclusion and Results

The sound files with an offset sounded much better than those with the source and observer on the same plane. This is due to the clicking that occurs when the two are sufficiently close together to cause the envelope to tend towards infinity. This problem was mitigated by limiting the maximum values of the envelope, but it is a non-ideal solution, and discontinuities still occur.
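The limiting step can be illustrated with a short Python sketch. It assumes the envelope follows a 1/distance law and caps it at a chosen maximum gain; both the law and the name `distance_envelope` are assumptions for the example rather than details from the m-files.

```python
def distance_envelope(distance, max_gain=1.0):
    """Amplitude gain for a source at `distance` metres, capped at max_gain.

    Without the cap, 1/distance grows without bound as the source
    reaches the observer.
    """
    if distance <= 0.0:
        return max_gain
    return min(1.0 / distance, max_gain)


for d in (10.0, 2.0, 0.5, 0.0):
    print(d, distance_envelope(d))
```

The cap keeps the gain finite, but the envelope still has a corner where the cap takes over, which may be one source of the remaining discontinuities.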

When simulating a person’s two ears, the fact that the distance between ears is very small led to the graphs looking very similar. I changed the distance to something unrealistic (i.e. 4 meters) to make sure that the files were working properly.