Audiovisual Description and Recognition of Audible and Visible Dysarthric Phonology

Mark Hasegawa-Johnson, Jon Gunderson, Adrienne Perlman and Thomas S. Huang

Audiovisual recordings of sixteen volunteer subjects with cerebral palsy and/or spastic dysarthria are now available, via secure ftp, to researchers at university or government labs who are interested in the development of human-computer interfaces for talkers with dysarthria. See the database page or the Interspeech paper for details; to request sftp access, contact Mark Hasegawa-Johnson.

Parts of this research are funded by the National Science Foundation under grant IIS 05-34106 (2006-2008) and by the National Institutes of Health under grant DC008090A.

Project Summary

Audiovisual Phonologic-Feature-Based Recognition of Dysarthric Speech

This project will study word-based, phone-based, and phonologic-feature-based audio and audiovisual speech recognition models for both small-vocabulary and large-vocabulary speech recognizers, designed to be used for unrestricted text entry on a personal computer. The models will be based on audio and video analysis of phonetically balanced speech samples from a group of speakers with dysarthria. Analysis will include speakers with reduced intelligibility caused by dysarthria, categorized into four groups by degree of intelligibility. Interactive phonetic analysis will seek to describe the talker-dependent characteristics of articulation error in dysarthria; based on analysis of preliminary data, we hypothesize that manner of articulation errors, place of articulation errors, and voicing errors are approximately independent events. Preliminary experiments also suggest that different dysarthric users will require dramatically different speech recognition architectures, because the symptoms of dysarthria vary so much from subject to subject. We propose to develop and test at least three categories of audio-only and audiovisual speech recognition algorithms for dysarthric users: phone-based and whole-word recognizers using hidden Markov models (HMMs), phonologic-feature-based and whole-word recognizers using support vector machines (SVMs), and hybrid SVM-HMM recognizers. The models will be evaluated to determine, first, the overall recognition accuracy of each algorithm; second, changes in accuracy due to learning; third, group differences in accuracy due to severity of dysarthria; and fourth, the dependence of accuracy on vocabulary size. The results of this research will contribute to scientific and technological knowledge about the acoustic and visual properties of dysarthric speech.
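As a rough illustration of the phonologic-feature-based SVM component, the Python sketch below trains one binary classifier per distinctive feature and reports per-feature posteriors for a frame. The three-feature inventory, the scikit-learn toolkit, and the synthetic data are assumptions made for illustration here, not part of the project's published design.

# Hypothetical sketch: a bank of binary SVMs, one per distinctive feature.
# Real inputs would be audio/video frame features; random data stands in here.
import numpy as np
from sklearn.svm import SVC

FEATURES = ["sonorant", "voiced", "labial"]  # illustrative subset

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 13))  # stand-in for acoustic frame vectors
labels = {f: rng.integers(0, 2, size=200) for f in FEATURES}  # 0/1 feature labels

# One probabilistic SVM per distinctive feature.
classifiers = {f: SVC(kernel="rbf", probability=True).fit(X_train, labels[f])
               for f in FEATURES}

def feature_posteriors(frame):
    # P(feature is "+" | frame) for each distinctive feature.
    x = np.asarray(frame).reshape(1, -1)
    return {f: clf.predict_proba(x)[0, 1] for f, clf in classifiers.items()}

print(feature_posteriors(rng.normal(size=13)))

In a hybrid SVM-HMM recognizer, one common approach is to let posteriors like these replace or augment the HMM's acoustic observation probabilities; how best to combine them is among the design questions such a project would evaluate.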

Speech and language disorders result from many types of congenital or traumatic disorders of the brain, nerves, and muscles. Dysarthria refers to the set of disorders in which unintelligible or perceptually abnormal speech results from impaired control of the oral, pharyngeal, or laryngeal articulators. The specific type of speech impairment is often an indication of the neuromotor deficit causing it; therefore, speech-language pathologists have developed a system of dysarthria categories reflecting both the genesis and the symptoms of the disorder. The most common category of dysarthria among children and young adults is spastic dysarthria. Symptoms of spastic dysarthria vary from talker to talker, but typical symptoms include strained phonation, imprecise placement of the articulators, incomplete consonant closure resulting in sonorant implementation of many stops and fricatives, and reduced voice onset time distinctions between voiced and unvoiced stops.

We are interested in spastic dysarthria because it is the most common type of severe, chronic speech disorder experienced by students at the University of Illinois, as well as one of the most common types of dysarthria generally (Love, 1992). Spastic dysarthria is associated with a variety of disabilities including, but not limited to, cerebral palsy and traumatic brain injury (Darley, 1975; Duffy, 1995). Moderate or severe cerebral palsy affects 0.26% of all seven-year-old children in the United States, and an additional 0.2% are reported to have mild cerebral palsy (Leske, 1981). Adults with cerebral palsy are able to perform most of the tasks required of a college student, including reading, listening, thinking, talking, and composing text; in our experience, their greatest handicap is their relative inability to control personal computers. Typing typically requires painstaking selection of individual keys. Some students are unable to type with their hands (or find it too tiring), and therefore choose to type using a head-mounted pointer. Many students with noticeable dysarthria are less impaired by their dysarthria, in daily life, than by their inability to use computers.

The speech impairments resulting from spastic dysarthria are neither arbitrary nor unpredictable. Most of the specific impairments reported in the literature can be characterized as imprecision in the implementation of one or two distinctive features; e.g., /t/->/k/ is a mistake in the place of articulation of the stop, while /d/->/n/ is a mistake in the sonorancy of the consonant. In 1955, Miller and Nicely showed that errors in the perception of different distinctive features are nearly independent: when a phoneme with distinctive features [d_1,…,d_N] is produced, the probability that a listener will hear a phoneme with distinctive features [e_1,…,e_N] is approximately given by p(e_1|d_1)…p(e_N|d_N). One implication of Miller and Nicely's finding is that errors in perception of one distinctive feature are far more common than errors in perception of two features. The speech production errors of subjects with dysarthria show a pattern similar to the Miller and Nicely results: based on patterns of phoneme error under dysarthria reported in the literature, it appears plausible that the probability of a phoneme substitution error may be factored into the independent probabilities of distinctive feature substitutions. If true, this hypothesis could have important implications both for theories of speech production, and for the technological development of automatic speech recognizers for subjects with dysarthria based on a small amount of recorded speech data.
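To make the independence hypothesis concrete, the sketch below computes P(e|d) as the product p(e_1|d_1)…p(e_N|d_N) over a toy three-feature inventory. The feature assignments and per-feature confusion probabilities are invented for illustration, not measured values from Miller and Nicely or from dysarthric speech.

# Factored confusion model (independence hypothesis), with invented numbers.
import numpy as np

FEATURES = ["voiced", "sonorant", "labial"]
p_flip = np.array([0.10, 0.05, 0.08])  # hypothetical P(feature i is confused)

def p_perceived(d, e):
    # P(hearing feature vector e | producing feature vector d), assuming
    # each distinctive feature is confused independently of the others.
    d, e = np.asarray(d), np.asarray(e)
    per_feature = np.where(d == e, 1.0 - p_flip, p_flip)
    return float(per_feature.prod())

d_feats = [1, 0, 0]  # /d/: voiced, non-sonorant, non-labial (illustrative)
n_feats = [1, 1, 0]  # /n/: differs from /d/ in sonorancy only
m_feats = [1, 1, 1]  # /m/: differs in sonorancy and place (labial)

print(p_perceived(d_feats, n_feats))  # ~0.041: one feature flipped
print(p_perceived(d_feats, m_feats))  # ~0.0036: two features flipped

Under this factoring, the single-feature error /d/->/n/ comes out roughly an order of magnitude more probable than the two-feature error /d/->/m/, matching the pattern described above.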

Several studies have demonstrated that adults with dysarthria are capable of using automatic speech recognition (ASR), and that in some cases human-computer interaction using speech recognition is faster and less tiring than interaction using a keyboard (Chang, 1993; Doyle, 1997; Kotler, 1997; Thomas-Stonell, 1998; Hux, 2000). With few exceptions, the technology used in these studies is commercial off-the-shelf speech recognition technology. Dysarthric speakers may have trouble training ASR systems, especially speaker-dependent systems, because of the large amount of training data required: reading a long training passage can be very tiring for a dysarthric speaker. In part because of this limitation, most studies of speech recognition for dysarthric talkers have focused on small-vocabulary applications, with vocabulary sizes ranging from ten to seventy words.