Spatial Analysis of Meeting Speech Scenes

Multiparty meetings, common in many business, education and research environments, involve participants who are generally stationary. Speaker location information can therefore be used to spatially analyze a meeting audio scene (primarily speech) for further processing, e.g., steered recording (beamforming), describing speaker-change ‘events’ in metadata, or segmentation/diarisation into each speaker’s ‘turn’ for annotation or transcription. This presentation describes doctoral research that proposes using speech and/or audio coding engines to extract speaker location ‘cues’: the information required for coding and for location cue extraction is not independent, so the two tasks can be processed in parallel.

The proposed techniques build upon current methods that use time-delay estimation (TDE) as a location cue. In addition, the use of speech and spatial audio coders and spatial audio cues for meeting speech analysis is introduced, and the effects of microphone characteristics on spatial cue performance are studied. Evaluations progressed from theoretically degraded stimuli (through room impulse response modeling), namely synthetic vowels, real speech vowels, and anechoically recorded real speech sentences, to speech recorded in both anechoic and reverberant rooms.
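As an illustration of how a time delay can serve as a location cue, the following is a minimal sketch of time-domain TDE by cross-correlation between two microphone signals. It is not the method used in the research described here. The function names, the microphone spacing, the sampling rate and the far-field conversion from delay to angle are all assumptions made for the example.

```python
import math


def estimate_delay(ref, sig, max_lag):
    """Return the lag (in samples) that maximizes the cross-correlation.

    A positive lag means `sig` is a delayed copy of `ref`.
    Hypothetical helper for illustration only.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i, r in enumerate(ref):
            j = i + lag
            if 0 <= j < len(sig):
                score += r * sig[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def delay_to_angle(lag, sample_rate, mic_spacing, speed_of_sound=343.0):
    """Convert a delay in samples to an arrival angle in degrees.

    Assumes a far-field source and a two-microphone array; the ratio is
    clamped to [-1, 1] so that noisy estimates do not break asin().
    """
    ratio = speed_of_sound * (lag / sample_rate) / mic_spacing
    ratio = max(-1.0, min(1.0, ratio))
    return math.degrees(math.asin(ratio))
```

For example, with an assumed 16 kHz sampling rate and 0.2 m spacing, `delay_to_angle(estimate_delay(mic1, mic2, 10), 16000, 0.2)` would give a rough direction estimate for one analysis frame. In reverberant rooms the correlation peak can be ambiguous, which is the situation where additional cues become useful.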

Experimental results have shown that a careful choice of microphone characteristics can improve spatial cue performance; mixed-pattern microphone arrays have been shown to outperform homogeneous arrays when estimating spatial cues. Furthermore, experiments suggest that it would be prudent to employ a meeting speech analysis engine that extracts TDE in the time domain and spatial cues in frequency subbands: where TDE fails, the spatial cues can assist with disambiguating speaker location cues.
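To make the idea of spatial cues in frequency subbands concrete, the sketch below computes an inter-channel level difference per subband for one frame of a two-channel recording. The choice of level difference as the cue, the naive DFT, the band edges and all function names are assumptions for illustration; the abstract does not specify which spatial cues or filterbank the research uses.

```python
import cmath
import math


def dft_power(frame):
    """Power spectrum (bins 0..N/2) of a real frame using a naive DFT."""
    n = len(frame)
    powers = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(frame))
        powers.append(abs(acc) ** 2)
    return powers


def subband_level_difference(left, right, band_edges, eps=1e-12):
    """Return the level difference in dB (left relative to right) per band.

    `band_edges` lists DFT bin indices; band b covers bins
    band_edges[b] up to, but not including, band_edges[b + 1].
    `eps` avoids division by zero in silent bands.
    """
    p_left, p_right = dft_power(left), dft_power(right)
    cues = []
    for lo, hi in zip(band_edges[:-1], band_edges[1:]):
        e_left = sum(p_left[lo:hi]) + eps
        e_right = sum(p_right[lo:hi]) + eps
        cues.append(10.0 * math.log10(e_left / e_right))
    return cues
```

A combined analysis engine could compute such subband cues alongside a time-domain delay estimate for each frame and fall back on the subband cues when the delay estimate is unreliable. The fusion rule itself is left open here, since the source does not describe it.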

Authors: Eva Cheng, Ian Burnett, Christian Ritz

Event: SF08: Search and Information Extraction from Audio Data Workshop
