As part of the Thinking Head project, we are responsible for the language generation component of an embodied conversational agent (ECA). One aspect of language generation that heavily influences the acceptability of an agent is whether it produces speech with appropriate and natural intonation. The ultimate goal of automatic speech synthesis is to produce speech that is indistinguishable from that of a human speaker, but current synthesis methods fall short of this goal. In particular, while current unit concatenation approaches can synthesise speech with very natural voice quality, there is still a noticeable shortfall in the realism of prosody and intonation. We are interested in determining the extent to which these deficits lie in the range of outputs the synthesis engine can produce, and the extent to which they are due to unsophisticated use of the parameters that control the synthesiser. In other words, what are the best results that we can achieve using current state-of-the-art speech synthesis technology? We present an experimental evaluation in which we determine an upper bound on the quality of output that can be produced using current synthesis techniques, and discuss the control of parameters that a natural language generation component needs to exercise in order to achieve this standard.
Authors: Ilya Anisimoff and Robert Dale
Event: SF08: Embodied Interaction in Mobile, Physical and Virtual Environments Workshop