
Pitch in concatenative speech synthesis: HYPRSYN

In concatenative speech synthesis, recorded stretches of speech (`canned speech') are concatenated in order to produce a possibly unlimited number of outputs, as in the example on this page. One application of prosodic analysis is in the design of a simple concatenative text-to-speech (TTS) synthesiser of this kind.

The HYPRSYN synthesiser permits the generation of digit sequences with automatic selection of different pitch patterns, and elementary prosodic grouping (currently with the characters `=', `/', `-', `.', `,'). The synthesiser can generate any English digit sequence, however long - an infinite set. For this reason, a length limit (currently 10) has been imposed in order to restrict the size of the audio output; the synthesiser itself would happily generate thousands of digits one after the other.

HyprSyn Concatenative TTS Synthesiser
Input a sequence of digits:

This technique of `canned word concatenation' with prefabricated prosodic word variants is frequently used in the synthesis of telephone numbers in telephone enquiry systems. If you would like to produce a similar effect manually, check these audiovisual examples, which are based on the same canned signals as the synthesiser.
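
The selection step behind canned word concatenation can be sketched roughly as follows. This is only an illustration under stated assumptions, not the HYPRSYN implementation: the variant names (`level`, `rise`, `fall`), the rule that assigns them, the file naming scheme, and the reading of the length limit as a count of digits are all hypothetical.

```python
# Sketch of unit selection for canned-word concatenation of digit strings.
# ASSUMPTIONS (not taken from HYPRSYN): the three variant names, the rule
# "rise before a boundary, fall at the end, level elsewhere", the file
# naming scheme, and counting the length limit in digits.

DIGITS = "0123456789"
BOUNDARIES = "=/-.,"   # grouping characters named in the text
MAX_DIGITS = 10        # length limit named in the text


def select_units(text):
    """Return one (digit, variant) pair for each digit in `text`."""
    # Keep only digits and grouping characters; ignore everything else.
    symbols = [c for c in text if c in DIGITS or c in BOUNDARIES]
    # Trailing grouping characters carry no information here.
    while symbols and symbols[-1] in BOUNDARIES:
        symbols.pop()

    digit_count = sum(1 for c in symbols if c in DIGITS)
    if digit_count > MAX_DIGITS:
        raise ValueError(f"at most {MAX_DIGITS} digits are allowed")

    units = []
    for i, c in enumerate(symbols):
        if c in BOUNDARIES:
            continue
        if i == len(symbols) - 1:
            variant = "fall"    # last digit of the whole sequence
        elif symbols[i + 1] in BOUNDARIES:
            variant = "rise"    # last digit of a non-final group
        else:
            variant = "level"   # group-internal digit
        units.append((c, variant))
    return units


def unit_filenames(text):
    """Map each selected unit to a (hypothetical) recording file name."""
    return [f"{digit}_{variant}.wav" for digit, variant in select_units(text)]


if __name__ == "__main__":
    print(unit_filenames("12-34"))
```

Concatenating the recordings named by `unit_filenames` in order would then give the output signal. The point of the sketch is that each digit is stored in several prosodic variants, and the grouping characters decide which variant is picked.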

But there are much more sophisticated techniques for speech synthesis than this:

  1. Currently the most popular technique is PSOLA-modified diphone synthesis, with the following features:

    1. Concatenation of diphones in the sense of phoneme pairs, from approximately the centre of one phoneme to the centre of the next; this technique is preferable to concatenating phones (realisations of phonemes) because it captures some transition and coarticulation phenomena.
    2. Adaptation of the pitch contours using the `Pitch Synchronous Overlap and Add (PSOLA)' technique: the fundamental frequency (F0) of the existing diphones is extracted and modified to follow a new pitch contour.

  2. Parametric synthesis: phonological structures such as phoneme strings, syllables and words are mapped to acoustic properties rather than to holistic phones; these properties are then converted to acoustic signals. The best-known software for this conversion is the Klatt synthesiser.
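
The diphone idea from the first technique can be illustrated with a short sketch. It is a minimal example under assumptions, not code from any particular synthesiser: the phoneme symbols, the silence symbol `_`, the `a-b` naming of units and both function names are hypothetical. The second function only shows the arithmetic that links a target F0 to the spacing of pitch periods, which is the quantity a PSOLA-style method adjusts; it does not perform the overlap-add itself.

```python
# Sketch: turning a phoneme string into a diphone sequence.
# ASSUMPTIONS: phoneme symbols, the silence symbol "_" and the "a-b"
# unit naming are illustrative only.


def to_diphones(phonemes, silence="_"):
    """Return the diphone names covering `phonemes`.

    Each diphone runs from roughly the centre of one phoneme to the
    centre of the next, so n phonemes padded with silence on both
    sides give n + 1 diphones.
    """
    padded = [silence] + list(phonemes) + [silence]
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]


def target_period_samples(f0_hz, sample_rate_hz=16000):
    """Length of one pitch period in samples for a target F0.

    A PSOLA-style method re-spaces pitch-synchronous segments so that
    their spacing matches this period; the sample rate is an assumption.
    """
    if f0_hz <= 0:
        raise ValueError("F0 must be positive")
    return round(sample_rate_hz / f0_hz)


if __name__ == "__main__":
    print(to_diphones(["t", "uw"]))      # the word "two"
    print(target_period_samples(100.0))  # period for a 100 Hz target
```

Because every unit boundary falls in the middle of a phoneme, the transitions between phonemes are stored inside the units, which is why diphones capture coarticulation better than single phones.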

Synthesisers of these types are used in the context of Text-To-Speech (TTS) synthesis. Recent research is starting to concentrate more on `Concept to Speech' (CTS) synthesis or, more plausibly from the linguistic point of view, `Meaning to Speech' (MTS) synthesis. Here the text input of the `dictaphone' approach of TTS is not used; instead, the input to the synthesiser consists of linguistic structures derived from a semantic representation. This approach is more promising for the generation of good prosody.





Dafydd Gibbon
Mon Sep 14 14:35:18 MET DST 1998