Visual morphing techniques were used together with a high-quality vocoder to study the audiovisual contribution of talker gender to the identification of frequency-shifted vowels. A nine-step continuum ranging from ``bit'' to ``bet'' was constructed from natural recorded syllables spoken by an adult female talker. Upward and downward frequency shifts in spectral envelope (scale factors of 0.85 and 1.0) were applied in combination with shifts in fundamental frequency, F0 (scale factors of 0.5 and 1.0). Downward frequency shifts generally resulted in malelike voices whereas upward shifts were perceived as femalelike. Two separate nine-step visual continua from ``bit'' to ``bet'' were also constructed, one from a male face and the other a female face, each producing the end-point words. Each step along the two visual continua was paired with the corresponding step on the acoustic continuum, creating natural audiovisual utterances. Category boundary shifts were found for both acoustic cues (F0 and formant frequency shifts) and visual cues (visual gender). The visual gender effect was larger when acoustic and visual information were matched appropriately. These results suggest that visual information provided by the speech signal plays an important supplemental role in talker normalization.
Acoustical Society of America Journal
- Pub Date:
- October 2003