The Shape of Your Voice: Predicting Vocal Tract Geometry from Spectral Information

SUTD research on predicting vocal tract geometry from spectral information opens doors to diagnosing vocal disorders non-invasively

Determining accurate and acoustically meaningful vocal tract geometry (i.e. articulatory information) directly from speech sounds is crucial in vocal tract acoustics and voice research. Yet it remains a classic, unresolved inverse problem in voice science.

An inverse problem uses “output” observations to infer the “input” parameters of the system under investigation. Such problems are often ill-posed: different parameter values can yield the same observation, and each additional input parameter multiplies the size of the parameter space, quickly making an exhaustive examination impractical.
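A toy illustration of both difficulties (the example values and the forward function are purely illustrative, not from the study): if the forward model only exposes the product of two hidden parameters, distinct inputs produce identical observations, and the number of candidate parameter combinations grows exponentially with the number of parameters.

```python
# Toy forward model: we can only observe the product of two hidden parameters.
def forward(a, b):
    return a * b

# Distinct inputs, identical observation -> the inverse mapping is not unique.
obs1 = forward(3.0, 4.0)
obs2 = forward(2.0, 6.0)
print(obs1 == obs2)  # True: the observation alone cannot recover (a, b)

# Exhaustive search blows up: a grid of 10 values per parameter gives
# 10**n combinations for n parameters.
grid_points, n_params = 10, 6
print(grid_points ** n_params)  # 1000000 combinations for just 6 parameters
```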

While direct methods to determine vocal tract geometry and gather articulatory information, such as nasal endoscopy, X-ray fluoroscopy, ultrasound, CT and MRI, are available, they are unsatisfactory for fundamental voice research: they can be highly invasive, expensive, potentially harmful or difficult to interpret. Moreover, these direct methods do not meaningfully relate articulatory gestures to the output acoustic information.

Indirect methods based on acoustic information, using either the voice itself or an excitation signal introduced at the lips (see Figure 1), have succeeded in providing highly resolved frequency information non-invasively. The question remains, however: how do we relate this frequency information back to vocal tract geometry?

In a study published in the Journal of the Acoustical Society of America, a research team from the Singapore University of Technology and Design (SUTD) employed a data-driven approach using artificial neural networks (ANNs) to sidestep the inverse area-function problem, learning the non-linear relationship between a vocal tract's acoustic impedance (output) and its corresponding geometry (input).

The model assumes the vocal tract consists of several concatenated cylindrical waveguides of varying lengths and radii (the input geometry parameters); each geometry yields a unique acoustic impedance spectrum, derived analytically (the output).
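The paper's exact analytical derivation is not reproduced here, but under the standard lossless plane-wave (transmission-line) approximation, the input impedance of a chain of cylinders can be sketched as below. The air constants, cylinder dimensions and the near-rigid termination load are illustrative assumptions, not values from the study.

```python
import math

RHO, C = 1.2, 343.0  # assumed air density (kg/m^3) and speed of sound (m/s)

def z_char(radius_m):
    """Characteristic acoustic impedance of a cylinder (plane-wave approximation)."""
    return RHO * C / (math.pi * radius_m ** 2)

def input_impedance(cylinders, z_load, freq_hz):
    """Input impedance of concatenated lossless cylindrical waveguides.

    cylinders: list of (length_m, radius_m) pairs, ordered from the load end
    towards the measurement end.
    """
    k = 2.0 * math.pi * freq_hz / C  # wavenumber
    z = complex(z_load)
    for length, radius in cylinders:
        z0 = z_char(radius)
        t = math.tan(k * length)
        # Standard transmission-line translation of the load impedance
        # through one cylindrical segment.
        z = z0 * (z + 1j * z0 * t) / (z0 + 1j * z * t)
    return z

# Sanity check: a single 0.17 m tube with a near-rigid far end has an
# impedance minimum near its quarter-wavelength frequency c/(4L).
L, r = 0.17, 0.01
f_res = C / (4.0 * L)
print(abs(input_impedance([(L, r)], 1e12, f_res)))   # near zero at resonance
print(abs(input_impedance([(L, r)], 1e12, 300.0)))   # much larger off resonance
```

Sweeping `freq_hz` over a grid produces the impedance spectrum that serves as the network's input vector for a given geometry.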

For instance, the 3-cylinder model yielded 3648 input-output vector pairs; the physiologically informed 4-, 5- and 6-cylinder models yielded increasingly larger datasets. Because the number of combinations quickly becomes impractical to examine exhaustively, the research team chose a subset of ‘only’ 100 000 randomly selected combinations and their impedances; 70% was used to train the deep neural network, which was then tested on the remaining 30%.
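The sampling-and-split procedure described above can be sketched as follows; the parameter ranges and the 3-cylinder geometry are illustrative assumptions, not the study's actual values.

```python
import random

random.seed(0)  # reproducible sketch
N = 100_000     # subset size, as in the study

def random_geometry(n_cylinders=3):
    """One random vocal tract: (length_m, radius_m) per cylinder.
    Ranges below are illustrative, not taken from the paper."""
    return [(random.uniform(0.02, 0.08), random.uniform(0.003, 0.02))
            for _ in range(n_cylinders)]

# Draw the random subset of geometries (impedance spectra would be
# computed analytically for each one to form the input-output pairs).
dataset = [random_geometry() for _ in range(N)]

# 70/30 train/test split.
random.shuffle(dataset)
split = int(0.7 * N)
train, test = dataset[:split], dataset[split:]
print(len(train), len(test))  # 70000 30000
```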

The predicted and actual cylindrical radii correlated very highly in the 3- and 4-cylinder models (Pearson correlation and Lin concordance coefficients above 95%). For the 6-cylinder model the correlation was, as expected, somewhat lower (Pearson ~75%, Lin ~69%). However, after standardising the impedance spectra (a pre-processing step), the predicted-geometry correlations recovered and improved significantly, exceeding 90% (both Pearson and Lin) in all cases.
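"Standardising" a set of spectra typically means z-scoring each frequency bin across the dataset, so every input feature has zero mean and unit variance before training. A minimal sketch, assuming ordinary per-feature z-score standardisation (the paper's exact pre-processing may differ):

```python
from statistics import mean, stdev

def standardise(spectra):
    """Z-score each frequency bin (column) across a dataset of spectra.

    spectra: list of equal-length lists of impedance magnitudes.
    Returns spectra rescaled so each bin has zero mean and unit variance.
    """
    n_bins = len(spectra[0])
    mus = [mean(s[i] for s in spectra) for i in range(n_bins)]
    sds = [stdev(s[i] for s in spectra) for i in range(n_bins)]
    return [[(s[i] - mus[i]) / sds[i] for i in range(n_bins)] for s in spectra]

# Tiny example: three "spectra" with two frequency bins each.
out = standardise([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
print(out)  # [[-1.0, -1.0], [0.0, 0.0], [1.0, 1.0]]
```

Rescaling like this prevents frequency bins with large impedance magnitudes from dominating the network's loss, which is consistent with the improved correlations the team observed.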

Although not strictly physiological, the researchers’ proof-of-concept approach strategically complemented recent studies applying similar deep neural network techniques to address other intractable inverse problems in voice science such as vocal fold mechanics, subglottal pressure and other physiological control parameters.

“This unified insight thus paves the way to meaningfully resolve fundamental questions of vocal tract geometry, articulatory conditions and voice mechanics for both speech and singing, and thus offers the potential of a diagnostic voice tool for applications in vocal disorders, speech pathology, voice therapy, and language pronunciation training in a natural, ecological and non-invasive context,” said principal investigator Assistant Prof Chen Jer-Ming from SUTD.
