Providing intelligent access to large collections of spoken audio is one of the most important challenges in speech technology. In pursuit of this goal, interest in Spoken Term Detection (STD) has recently surged. STD involves rapidly and accurately detecting all occurrences of a specified word or phrase in large, heterogeneous audio archives, and there is direct demand for it from security and defence, broadcast monitoring and consumer search.
STD systems first pre-process the audio to create an index that allows subsequent rapid searching. The index is generally based on either an automatic word-level or a phone-level transcription. A word-level index can provide accurate term detection; however, indexing is slow, and new or rare terms cannot easily be detected at search time. A phonetic index, on the other hand, can be created quickly and is inherently open-vocabulary. Phonetic systems also hold promise for languages with limited training data, and they avoid the costly training, development and runtime requirements of large vocabulary continuous speech recognition (LVCSR) engines.
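To illustrate the open-vocabulary argument, the following Python sketch contrasts a toy word-level index with a toy phone-level index. It is a simplification under stated assumptions: both indexes are built from a single best transcription rather than from lattices, the decoded output is invented for the example, and the ARPAbet-style pronunciations are supplied by hand in place of a pronunciation lexicon or letter-to-sound model. The helper names `build_word_index`, `build_phone_index` and `search_phones` are hypothetical.

```python
from collections import defaultdict


def build_word_index(words):
    """Hypothetical word-level index: word -> list of positions.

    A recogniser can only emit words in its vocabulary, so an
    out-of-vocabulary term never appears here.
    """
    index = defaultdict(list)
    for position, word in enumerate(words):
        index[word].append(position)
    return index


def build_phone_index(phones, n=3):
    """Hypothetical phone-level index: phone n-gram -> list of start positions."""
    index = defaultdict(list)
    for start in range(len(phones) - n + 1):
        index[tuple(phones[start:start + n])].append(start)
    return index


def search_phones(index, phones, term_phones, n=3):
    """Return start positions where term_phones occurs exactly in phones."""
    length = len(term_phones)
    if length < n:
        # Terms shorter than the n-gram size fall back to a linear scan.
        return [s for s in range(len(phones) - length + 1)
                if phones[s:s + length] == term_phones]
    # Look up candidates by the term's first n-gram, then verify the full term.
    candidates = index.get(tuple(term_phones[:n]), [])
    return [s for s in candidates if phones[s:s + length] == term_phones]


# Invented decoder output for the spoken phrase "Wallace spoke", assuming
# "wallace" is missing from the recogniser's vocabulary.
decoded_words = ["wall", "us", "spoke"]
decoded_phones = ["w", "ao", "l", "ah", "s", "s", "p", "ow", "k"]

word_index = build_word_index(decoded_words)
phone_index = build_phone_index(decoded_phones)

print("wallace" in word_index)  # False
print(search_phones(phone_index, decoded_phones, ["w", "ao", "l", "ah", "s"]))  # [0]
```

The word index cannot return the new term at all, whereas the phone index finds it as soon as the term is converted to phones, without re-indexing the audio.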
The system developed at the Queensland University of Technology uses Dynamic Match Lattice Spotting to detect occurrences of phonetic sequences that closely match the search term, providing fast, open-vocabulary search without the need for an LVCSR engine. The talk will discuss this approach, together with the view that STD should perhaps not be treated as simply an application of LVCSR to a different problem.
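The idea of accepting phone sequences that *closely* match the search term can be sketched as a weighted minimum edit distance with a threshold. This is a simplified illustration, not the QUT implementation: the indexed sequences and their start times are invented, they stand in for sequences that a real system would extract from phone lattices, and the substitution costs in `SUB_COST` are hypothetical placeholders for costs that might be derived from phone confusion statistics. The names `min_edit_distance` and `spot` are likewise hypothetical.

```python
# Hypothetical costs for substituting acoustically similar phones;
# any other substitution costs 1.0.
SUB_COST = {
    frozenset(("ah", "ih")): 0.3,
    frozenset(("s", "z")): 0.2,
    frozenset(("m", "n")): 0.4,
}


def sub_cost(a, b):
    """Cost of recognising phone a as phone b."""
    if a == b:
        return 0.0
    return SUB_COST.get(frozenset((a, b)), 1.0)


def min_edit_distance(target, indexed, ins_cost=1.0, del_cost=1.0):
    """Weighted Levenshtein distance between two phone sequences."""
    rows, cols = len(target) + 1, len(indexed) + 1
    d = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        d[i][0] = i * del_cost
    for j in range(1, cols):
        d[0][j] = j * ins_cost
    for i in range(1, rows):
        for j in range(1, cols):
            d[i][j] = min(
                d[i - 1][j] + del_cost,  # target phone missing from indexed sequence
                d[i][j - 1] + ins_cost,  # extra phone in indexed sequence
                d[i - 1][j - 1] + sub_cost(target[i - 1], indexed[j - 1]),
            )
    return d[-1][-1]


def spot(target, indexed_sequences, threshold=0.5):
    """Return (start_time, distance) pairs within threshold, best match first."""
    hits = []
    for start_time, sequence in indexed_sequences:
        distance = min_edit_distance(target, sequence)
        if distance <= threshold:
            hits.append((start_time, distance))
    return sorted(hits, key=lambda hit: hit[1])


# Invented index entries: (start time in seconds, phone sequence).
index_entries = [
    (0.42, ["w", "ao", "l", "ih", "s"]),  # one confusable substitution
    (3.10, ["w", "ao", "l", "ah", "s"]),  # exact match
    (7.85, ["b", "ao", "l", "ah", "n"]),  # too dissimilar
]
hits = spot(["w", "ao", "l", "ah", "s"], index_entries)
print(hits)  # [(3.1, 0.0), (0.42, 0.3)]
```

Allowing cheap substitutions between confusable phones lets the search recover occurrences that the phone recogniser decoded imperfectly, while the threshold trades extra hits against false alarms.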
Authors: Roy Wallace, Sridha Sridharan
Event: SF08: Search and Information Extraction from Audio Data Workshop