Prosody-Dependent Speech Recognition

This project is a collaboration between faculty in the University of Illinois Departments of ECE, Linguistics, and Computer Science. This research was initiated by a University of Illinois Critical Research Grant.

Prosody (προσῳδία) is the music of speech: its phrasing and prominence.

  • Phrasing is the way in which syllables are chunked, either consciously (for communicative effect) or epiphenomenally (because short-term memory only allows us to plan a limited number of syllables in advance). Phrasing is communicated primarily by lengthening phonemes in the rhyme of the phrase-final syllable.
  • Prominence is the emphasis placed upon particular syllables, either consciously (for communicative effect) or epiphenomenally (because prominence on certain syllables helps to convey phrase structure). Prominence is communicated by increased duration and energy of every articulator movement in the prominent syllable, and of the signal itself; these duration and energy cues can be measured directly from the signal, as in the sketch following this list.
  • Both phrasing and prominence may be signalled by pitch movements (the “singing” of natural speech), but these pitch movements seem to be very much under the control of the speaker — it’s possible to communicate prosody with or without the pitch movements.
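
As a concrete illustration of the duration and energy cues described above, here is a minimal sketch that computes short-time RMS energy and then summarizes duration and mean energy per syllable. It assumes a syllable-level alignment is already available; the signal and alignment below are made up for illustration, and this is not the front end used in this project.

    import numpy as np

    def frame_rms(signal, sr, frame_ms=10.0):
        """Short-time RMS energy, one value per non-overlapping frame_ms frame."""
        frame_len = int(sr * frame_ms / 1000.0)
        n_frames = len(signal) // frame_len
        frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
        return np.sqrt(np.mean(frames ** 2, axis=1))

    def syllable_cues(rms, syllables, frame_ms=10.0):
        """Duration (s) and mean energy for each (label, start_s, end_s) syllable:
        the two cues for prominence and phrase-final lengthening noted above."""
        cues = []
        for label, start, end in syllables:
            a = int(start * 1000.0 / frame_ms)
            b = max(a + 1, int(end * 1000.0 / frame_ms))
            cues.append((label, end - start, float(np.mean(rms[a:b]))))
        return cues

    # Made-up signal and syllable alignment, purely for illustration:
    sr = 16000
    signal = np.random.randn(2 * sr) * np.hanning(2 * sr)
    rms = frame_rms(signal, sr)
    print(syllable_cues(rms, [("pro", 0.10, 0.35), ("so", 0.35, 0.55), ("dy", 0.55, 1.10)]))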

Landmarks are salient instantaneous acoustic events. Landmarks carry information: if the listener can decode the landmarks, then the listener can understand the signal. The syllable structure of speech is conveyed by the Stevens landmarks: consonant releases, consonant closures, syllable nucleus peaks, and intersyllabic dips. The acoustic events that occur in and around a landmark are shaped by the articulator movements that produced it. The articulator movements, in turn, are controlled by a master gestural score governed by (1) the word being spoken, and (2) the prosody with which it is spoken.
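
For intuition only, here is a minimal sketch of how two of the four Stevens landmark types, syllable nucleus peaks and intersyllabic dips, could be located by peak-picking on an energy contour. The contour below is synthetic, and the actual landmark detectors used in this work are more sophisticated than this.

    import numpy as np
    from scipy.signal import find_peaks

    def nucleus_and_dip_landmarks(energy, min_separation=8):
        """Rough syllable-nucleus peaks (local maxima) and intersyllabic dips
        (local minima), found on a lightly smoothed energy contour."""
        smooth = np.convolve(energy, np.ones(5) / 5.0, mode="same")
        peaks, _ = find_peaks(smooth, distance=min_separation)
        dips, _ = find_peaks(-smooth, distance=min_separation)
        return peaks, dips

    # Toy energy contour with three energy humps standing in for three syllables:
    t = np.linspace(0, 3 * np.pi, 300)
    energy = np.abs(np.sin(t)) + 0.05 * np.random.rand(300)
    peaks, dips = nucleus_and_dip_landmarks(energy)
    print("nucleus peaks at frames:", peaks)
    print("intersyllabic dips at frames:", dips)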

The goal of this research is to implement statistical models of phrasing and prominence, of the articulatory plans and actions that implement them, and of the acoustic landmarks that communicate them, in order to improve the accuracy of automatic speech recognition.

People

  • Investigators
    • Jennifer S. Cole, Department of Linguistics
    • Chilin Shih, Department of Linguistics
    • Margaret Fleck, Department of Computer Science
    • Mark Hasegawa-Johnson, Department of Electrical and Computer Engineering
  • Post-Doc
    • Jeung-Yoon Choi
  • Graduate Students
    • Sarah Borys
    • Tim Mahrt
    • Ken Chen
    • Hansook Choi
    • Aaron Cohen
    • Ameya Deoras
    • Chi Hu
    • Jui-Ting Huang
    • Heejin Kim
    • Sung-Suk Kim
    • Mohamed Kamal Omar
    • Taejin Yoon
    • Tong Zhang
    • Yanli Zheng
    • Xiaodan Zhuang

Results


Star Challenge: language-independent spoken word detection

SST and IFP jointly entered a UIUC team in the 2008 A*STAR Star Challenge, a multimedia retrieval competition held in Singapore. The competition included image retrieval, video shot retrieval, and unknown-language spoken term detection components. UIUC was the only United States team to make the finals, and took third place in the competition.

We wrote a CLIAWS paper based on our Star Challenge system. The system was trained on English, Russian, and Spanish, then tested on Croatian. Acoustic models were either not adapted (AM0) or adapted (AMt) to the Croatian speech; the phoneme bigram language model was likewise either not adapted (LM0) or adapted (LMt). Retrieval was conducted using queries specified either in IPA notation or as an audio example. The resulting scores (MAP = mean average precision) are reported as a function of the degree of query expansion, i.e., the number of allowed phonological feature substitutions.
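
The following is a minimal sketch of the query-expansion idea: each phone is represented by a small phonological feature table (a made-up toy inventory here, not the system's actual phone set or feature set), and a query is expanded into every phone string reachable within a budget of feature substitutions.

    import itertools

    # Toy phonological feature table (hypothetical; the real system uses a fuller
    # phone inventory and feature set):
    FEATURES = {
        "p": {"voice": 0, "place": "lab", "manner": "stop"},
        "b": {"voice": 1, "place": "lab", "manner": "stop"},
        "t": {"voice": 0, "place": "alv", "manner": "stop"},
        "d": {"voice": 1, "place": "alv", "manner": "stop"},
        "s": {"voice": 0, "place": "alv", "manner": "fric"},
        "z": {"voice": 1, "place": "alv", "manner": "fric"},
    }

    def feature_distance(p, q):
        """Number of phonological features on which phones p and q disagree."""
        return sum(FEATURES[p][f] != FEATURES[q][f] for f in FEATURES[p])

    def expand_query(phones, max_substitutions=1):
        """Expand a phone-string query into every variant whose total feature
        distance from the original is at most max_substitutions."""
        per_slot = [[(q, feature_distance(p, q)) for q in FEATURES] for p in phones]
        variants = []
        for combo in itertools.product(*per_slot):
            if sum(cost for _, cost in combo) <= max_substitutions:
                variants.append(" ".join(q for q, _ in combo))
        return variants

    print(expand_query(["b", "s"], max_substitutions=1))
    # -> ['p s', 'b t', 'b s', 'b z', 'd s'] with this toy table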


Prosody reduces the word error rate of a speech recognizer

In our 2005 Speech Communication paper and several conference papers leading up to it, we demonstrated that prosodic tags can reduce the word error rate of a speech recognizer by 13% relative (table below). The most interesting finding was that the benefits of a prosody-dependent acoustic model and of a prosody-dependent language model are super-additive. We believe that these two models serve as a sort of consistency check: if the prosody for a candidate transcription matches the acoustics but not the context, or vice versa, then it can be ruled out.

Word Error Rates                Acoustic Model: no prosody    Acoustic Model: prosody
Language Model: no prosody      24.8%                         24.0%
Language Model: prosody         24.3%                         21.7%
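
In equation form, this combination amounts to a joint search over the word string W and the prosodic tag sequence P; the following is a sketch of the factorization, and the exact decomposition in the paper may differ:

    \[
      (\hat{W}, \hat{P}) \;=\; \arg\max_{W,P} \; p(O \mid W, P) \, P(P \mid W) \, P(W)
    \]

Here p(O | W, P) is the prosody-dependent acoustic model and P(P | W) P(W) is the prosody-dependent language model. A hypothesis must propose a single prosodic tag sequence that scores well under both models, which is the consistency check described above.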

Distinctive features reduce the word error rate of a speech recognizer

The word error rate of a speech recognizer can be reduced slightly if its observations include estimates of the distinctive features of acoustic-phonetic landmarks, computed using support vector machines. Sarah Borys demonstrated an HMM-based system in which telephone-band phone error rates were reduced from 63.9% to 62.8%. Using a better baseline system, Hasegawa-Johnson et al. (Karen Livescu ran the best experiment) demonstrated a DBN-based system in which telephone-band word error rates were reduced from 27.7% to 27.2%.
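
As a rough sketch of the idea (with random stand-in data and scikit-learn, which is not necessarily the toolkit used in these experiments), a support vector machine is trained to classify one binary distinctive feature from acoustic measurements around each landmark, and its decision value is then made available to the recognizer as an extra observation:

    import numpy as np
    from sklearn.svm import SVC

    # Each row stands in for acoustic measurements around one detected landmark;
    # each label stands in for one binary distinctive feature (e.g. [+/- voice]).
    # Both are random placeholders here.
    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(200, 12))
    y_train = rng.integers(0, 2, size=200)

    svm = SVC(kernel="rbf")   # RBF-kernel support vector machine
    svm.fit(X_train, y_train)

    # At recognition time, the signed distance from the SVM margin at each landmark
    # can be appended to the observation vector seen by the HMM or DBN.
    X_test = rng.normal(size=(5, 12))
    print(svm.decision_function(X_test))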