13 Utterances

The utterance structure lies at the heart of Festival. This chapter describes its basic form and the functions available to manipulate it.

13.1 Utterance structure

An utterance structure consists of a basic type and a number of named streams. Each stream consists of an ordered list of items. Items have features and values and may also be related to items in other streams. This basic structure is similar to the one used in CHATR (black94).

Typical types are Text, Words, and Segments. Example streams are Word, Syllable, and Wave. Typically relations are made between syllables and the words that contain them, and between segments and the syllables that contain them, though relations are not required to form a strict hierarchy.

13.2 Utterance types

The primary purpose of types is to define which modules are to be applied to an utterance. UttTypes are defined in `lib/synthesis.scm'. The function defUttType defines which modules are to be applied to an utterance of that type. The function utt.synth applies this list of modules to an utterance, in order; waveform synthesis is typically the last module in the list.

For example, when a Segments type utterance is synthesized it need only have its values loaded into a Segment stream and a Target stream; then the low level waveform synthesis module Wave_Synth is called. This is defined as follows:

(defUttType Segments
  (Initialize utt)
  (Wave_Synth utt))

A more complex type is the Text type utterance, which requires many more modules to be called before a waveform can be synthesized:

(defUttType Text
  (Initialize utt)
  (Text utt)
  (Token utt)
  (POS utt)
  (Phrasify utt)
  (Word utt)
  (Intonation utt)
  (Duration utt)
  (Int_Targets utt)
  (Wave_Synth utt)
)

The Initialize module should normally be called for all types. It loads the necessary streams from the input form and deletes all other streams (if any exist) ready for synthesis.

Modules may be defined directly as C/C++ functions and declared with a Lisp name, or they may be simple Lisp functions that check some global parameter before calling a specific module (e.g. choosing between different intonation modules).

These types are used when calling the function utt.synth and individual modules may be called explicitly by hand if required.
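
As a rough sketch of calling modules by hand, the sequence below applies the first few modules from the Text type definition above one at a time instead of calling utt.synth. The variable name utt1 is only illustrative, and the example assumes a voice has already been selected so that each module finds the parameters it needs.

```scheme
;; Build a Text utterance but do not synthesize it yet.
(set! utt1 (Utterance Text "This is an example"))

;; Apply the first few modules of the Text type by hand,
;; in the order given in the defUttType definition.
(Initialize utt1)
(Text utt1)
(Token utt1)
(POS utt1)
(Phrasify utt1)
(Word utt1)

;; The streams built so far can now be inspected before any
;; prosody or waveform modules have run.
(utt.streamnames utt1)
```

Stopping part way through like this is mainly useful when debugging a single module, since the intermediate streams can be examined before later modules change them.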

Because we expect waveform synthesis methods themselves to become complex, with a defined set of functions to select, join, and modify units, we now support an additional notion of SynthTypes. Like UttTypes, these define a set of functions to apply to an utterance. They may be defined using the defSynthType function. For example:

(defSynthType Festival
  (print "synth method Festival")
  
  (print "select")
  (simple_diphone_select utt)

  (print "join")
  (cut_unit_join utt)

  (print "impose")
  (simple_impose utt)
  (simple_power utt)

  (print "synthesis")
  (frames_lpc_synthesis utt)
  )

A SynthType is selected by naming it as the value of the parameter Synth_Method.

During the application of the function utt.synth three hooks are applied, which allows additional control of the synthesis process. before_synth_hooks is applied before any modules are applied. after_analysis_hooks is applied at the start of Wave_Synth, when all text, linguistic and prosodic processing has been done. after_synth_hooks is applied after all modules have been applied. These are useful for things such as altering the volume of a voice that happens to be quieter than others, or outputting information for a talking head before waveform synthesis occurs, so that preparation of the facial frames and synthesis of the waveform may be done in parallel (see `festival/lib/th-mode.scm' for an example use of these hooks for a talking head text mode).
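
A minimal sketch of the volume case follows. It assumes that a hook variable may be set to a list of one-argument functions, each of which receives the utterance and returns it, and that utt.wave.rescale takes the utterance and a gain factor. Neither calling convention is spelled out in this chapter, so check them against your installation. The factor 1.5 is arbitrary.

```scheme
;; Assumption: hooks are held in ordinary variables as a list of
;; functions, each called with the utterance being synthesized.
(set! after_synth_hooks
      (list
       (lambda (utt)
         ;; Assumption: (utt.wave.rescale UTT FACTOR) scales the gain.
         (utt.wave.rescale utt 1.5)
         ;; Return the utterance so any later hook receives it.
         utt)))
```

Because the hook is set globally, it affects every later call to utt.synth until the variable is reset, so a voice definition is a natural place for a setting like this.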

13.3 Example utterance types

A number of utterance types are currently supported. It is easy to add new ones but the standard distribution includes the following.

Text
Raw text as a string.
(Utterance Text "This is an example")
Words
A list of words.
(Utterance Words (this is an example))
Words may be atomic or lists if further features need to be specified. For example to specify a word and its part of speech you can use
(Utterance Words (I (live (pos v)) in (Reading (pos n) (tone H-H%))))
Note: the use of the tone feature requires an intonation mode that supports it. Any feature and value named in the input will be added to the Word stream item.
Phrase
This allows explicit phrasing and features on Tokens to be specified. The input consists of a list of phrases, each of which contains a list of tokens.
(Utterance
 Phrase
 ((Phrase ((name B))
   I saw the man
   (in ((EMPH 1)))
   the park)
  (Phrase ((name BB))
   with the telescope)))
ToBI tones and accents may also be specified on Tokens but these will only take effect if the selected intonation method uses them.
Segments
This allows specification of segments, durations and F0 target values.
(Utterance 
 Segments
 ((# 0.19 )
  (h 0.055 (0 115))
  (@ 0.037 (0.018 136))
  (l 0.064 )
  (ou 0.208 (0.0 134) (0.100 135) (0.208 123))
  (# 0.19)))
Note that the times are in seconds, NOT milliseconds. The format of each segment entry is segment name, duration in seconds, and a list of target values. Each target value is a pair consisting of a position within the segment (in seconds) and an F0 value in Hz.
Phones
This allows a simple specification of a list of phones. Synthesis uses fixed durations (specified in FP_duration, default 100 ms) and monotone intonation (specified in FP_F0, default 120 Hz). This may be used for simple checks of waveform synthesizers etc.
(Utterance Phones (# h @ l ou #))
Note the function SayPhones allows synthesis and playing of lists of phones through this utterance type.
Wave
A waveform file. Synthesis here simply involves loading the file.
(Utterance Wave fred.wav)

Others are supported, as defined in `lib/synthesis.scm' but are used internally by various parts of the system. These include Tokens used in TTS and SegF0 used by utt.resynth.
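
As an illustration of how any of these types is used, the sketch below builds a Phones utterance, synthesizes it and plays it. The variable name utt1 is illustrative, utt.play is not described in this chapter and is assumed to be available, and the phone names are taken from the example above, so they must belong to the phone set of the currently selected voice.

```scheme
;; Build a Phones utterance, then synthesize and play it.
(set! utt1 (Utterance Phones (# h @ l ou #)))
(utt.synth utt1)
(utt.play utt1)   ; assumed playback function, see lead-in

;; Roughly the same effect in a single call.
(SayPhones '(# h @ l ou #))
```

The same three steps apply to the other types; only the Utterance form changes.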

13.4 Utterance modules

The module is the basic unit that does the work of synthesis. Within Festival there are duration modules, intonation modules, wave synthesis modules etc. As stated above the utterance type defines the set of modules which are to be applied to the utterance. These modules in turn will create and fill in the streams and stream items so that ultimately a waveform is generated, if required.

Many of the chapters in this manual are solely concerned with particular modules in the system. Note that many modules have internal choices, such as which duration method or which intonation method to use. Such general choices are often made through the Parameter system. Parameters may be set for different features like Duration_Method, Synth_Method etc. Formerly the values for these parameters were atomic values, but now they may be the functions themselves. For example, to select the Klatt duration rules:

(Parameter.set 'Duration_Method Duration_Klatt)

This allows new modules to be added without requiring changes to the central Lisp functions such as Duration, Intonation, and Wave_Synth.
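
To make this concrete, here is a sketch of a user-defined duration method written entirely in Scheme. The function name my_fixed_duration is hypothetical. The sketch assumes that a duration method receives the utterance as its only argument and should return it, and that segments are held in a stream named Segment; treat these as assumptions to verify rather than as the documented interface.

```scheme
;; Hypothetical duration module: every segment lasts 0.1 seconds.
;; Assumptions: the module is called with the utterance only,
;; segments live in the Segment stream, and the utterance is
;; the expected return value.
(define (my_fixed_duration utt)
  (let ((end 0.0))
    (mapcar
     (lambda (seg)
       ;; End times accumulate so each one follows the previous.
       (set! end (+ end 0.1))
       (streamitem.set_end seg end))
     (utt.stream utt 'Segment))
    utt))

;; Select it in the same way as the Klatt rules above.
(Parameter.set 'Duration_Method my_fixed_duration)
```

Nothing in the central Duration function needs editing for this to take effect, which is the point of allowing functions as parameter values.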

13.5 Accessing an utterance

Functions exist in Lisp (and of course C++) for accessing an utterance. The Lisp access functions are

`(utt.streamnames UTT)'
returns a list of the names of the streams currently created in UTT.
`(utt.stream UTT STREAMNAME)'
returns a list of stream items in STREAMNAME in UTT. This is nil if no stream of that name exists.
`(utt.stream.head UTT STREAMNAME)'
returns the first stream item in STREAMNAME. Returns nil if this stream contains no items.
`(utt.stream.tail UTT STREAMNAME)'
returns the last stream item in STREAMNAME. Returns nil if this stream contains no items.
`(utt.streamitem.feat UTT STREAMITEM FEATNAME)'
returns the value of feature FEATNAME in STREAMITEM in UTT. UTT is required as FEATNAME may refer, via relations, to other derived bits of the utterance structure.
`(streamitem.features STREAMITEM)'
Returns an assoc list of feature-value pairs of all local features on this stream item.
`(streamitem.name STREAMITEM)'
Returns the name of this STREAMITEM. This could be accessed using utt.streamitem.feat but this provides faster access for this feature.
`(streamitem.end STREAMITEM)'
Returns the end time of this STREAMITEM. This could be accessed using utt.streamitem.feat but this provides faster access for this feature.
`(streamitem.set_name STREAMITEM NEWNAME)'
Sets name on STREAMITEM to be NEWNAME. Note although name may be accessed as a feature you must use this function to set it rather than streamitem.set_feat.
`(streamitem.set_end STREAMITEM NEWEND)'
Sets the end time on STREAMITEM to be NEWEND. Note although end may be accessed as a feature you must use this function to set it rather than streamitem.set_feat. Care should be taken that this end time is after the previous item's end time and before the next item's. Although this isn't strictly required, a number of modules depend on it.
`(streamitem.set_feat STREAMITEM FEATNAME FEATVALUE)'
Sets the value of FEATNAME to FEATVALUE in STREAMITEM. FEATNAME should be a simple name and not refer to next, previous or other streams via links.
`(utt.streamitem.rel UTT STREAMITEM RELNAME)'
return a list of stream cells related to STREAMITEM in UTT by RELNAME.
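
The following sketch shows several of these access functions used together on a synthesized utterance. The variable name utt1 is illustrative, and the stream name Word and the feature name pos are taken from elsewhere in this chapter. Which streams and features actually exist depends on the voice and on the modules that were run.

```scheme
(set! utt1 (Utterance Text "This is an example"))
(utt.synth utt1)

;; Names of all streams created during synthesis.
(utt.streamnames utt1)

;; Name and end time of every item in the Word stream.
(mapcar
 (lambda (item)
   (list (streamitem.name item)
         (streamitem.end item)))
 (utt.stream utt1 'Word))

;; Part of speech of the first word, looked up as a feature.
(utt.streamitem.feat utt1 (utt.stream.head utt1 'Word) 'pos)
```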

As of version 1.2 the utterance structure may be fully manipulated from Scheme. Streams and stream items may be created and deleted.

`(utt.present UTT STREAMNAME)'
returns t if stream named STREAMNAME is present, nil otherwise.
`(utt.stream.create UTT STREAMNAME)'
Creates a new stream called STREAMNAME. This causes an error if a stream of that name already exists.
`(utt.stream.delete UTT STREAMNAME)'
Deletes the stream called STREAMNAME in UTT. This causes an error if no stream of that name exists.
`(utt.streamitem.insert UTT STREAMNAME STREAMITEM)'
Creates a new stream item in the stream called STREAMNAME in UTT. If there is no stream of that name a new one is created. If STREAMITEM is nil or unspecified the new item is appended to the stream. If STREAMITEM is non-nil then the new stream item is inserted after it.
`(utt.streamitem.delete UTT STREAMITEM)'
Delete STREAMITEM from UTT.
`(utt.relate_items UTT STREAMITEM1 STREAMITEM2)'
Add a relation between these two items in UTT.
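
A short sketch of these manipulation functions follows. The stream name Note and the feature name source are invented for the example, and utt1 is an illustrative variable. Because the return value of utt.streamitem.insert is not described here, the new item is fetched with utt.stream.tail instead of relying on it.

```scheme
(set! utt1 (Utterance Text "This is an example"))
(utt.synth utt1)

;; Create a scratch stream unless it already exists
;; (utt.stream.create signals an error on duplicates).
(if (not (utt.present utt1 'Note))
    (utt.stream.create utt1 'Note))

;; Append an item, then name it via the stream tail.
(utt.streamitem.insert utt1 'Note nil)
(streamitem.set_name (utt.stream.tail utt1 'Note) "first")

;; Attach an ordinary feature to the same item.
(streamitem.set_feat (utt.stream.tail utt1 'Note) 'source "manual")

;; Relate the first word of the utterance to the new item.
(utt.relate_items utt1
                  (utt.stream.head utt1 'Word)
                  (utt.stream.tail utt1 'Note))

;; Remove the scratch stream again.
(utt.stream.delete utt1 'Note)
```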

With the above functions quite elaborate utterance manipulations can be performed. This is useful, for example, in post-lexical rules, where modifications to the segment stream are required based on the words and their context. See section 12.7 Post-lexical rules for an example of using various utterance access functions.

13.6 Features

A stream item has a number of features directly related to it. All stream items have a name, start, end, duration and addr. In addition they have other features depending on which stream they are in and which modules have been called. For example, word stream items have pos for part of speech. Features may be simple feature names or refer to feature functions. Feature functions, which may be written in C++ or Scheme, are available for particular streams. A list of all feature functions is given in an appendix of this document. See section 30 Feature functions.

Feature names may also refer indirectly to values. If a feature name is prefixed by `n.' the next item in the stream is accessed before the remainder of the name is interpreted. Likewise, `p.', `pp.', `nn.' refer to the previous, previous to previous, and next to next items. Note that where a feature name cannot refer to an actual stream (or feature) the value `0' is returned.

Feature names may also follow relations to other streams. If the feature name up to the next `.' is a stream name, a link from the current cell is followed to the first item in that relation. Modifiers are available for the last item in the relation and for the number of items that are related. Here are some example feature names with a description of their meaning; it's really quite simple. Let us assume our current item is a syllable.

`stress'
This item's lexical stress
`n.stress'
The next syllable's lexical stress
`p.stress'
The previous syllable's lexical stress
`Word.name'
The word this syllable is in
`Word.n.name'
The word next to the word this syllable is in
`n.Word.name'
The word the next syllable is in
`Word.Syllable:last.addr'
The addr of the last syllable in this word. Note if addr equals Word.Syllable:last.addr it means this is word-final.
`Word.Syllable:num'
Number of syllables in this word.
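
The feature names above can be passed to utt.streamitem.feat directly. The sketch below collects a few of them for every syllable of a synthesized utterance; utt1 is an illustrative variable, and the values returned depend on the voice and lexicon in use.

```scheme
(set! utt1 (Utterance Text "This is an example"))
(utt.synth utt1)

;; For every syllable, collect its own stress, the next
;; syllable's stress, the name of the containing word and
;; the number of syllables in that word.
(mapcar
 (lambda (syl)
   (list (utt.streamitem.feat utt1 syl 'stress)
         (utt.streamitem.feat utt1 syl 'n.stress)
         (utt.streamitem.feat utt1 syl 'Word.name)
         (utt.streamitem.feat utt1 syl 'Word.Syllable:num)))
 (utt.stream utt1 'Syllable))
```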

In C++ feature values are of class PVal, which may be a string, an int, or a float. In Scheme this distinction cannot always be made, and sometimes when you expect an int you actually get a string. Care should be taken to ensure the right matching functions are used in Scheme. It is recommended you use string-append or string-match as they will always work.

When collecting data from speech databases it is often useful to collect a whole set of features from all utterances in a database. These features can then be used for building various models (both CART tree models and linear regression models use these feature names).

A number of functions exist to help in this task. For example:

(utt.features utt1 'Word '(name pos p.pos n.pos))

will return, for each word in the utterance, a list containing the word's name and its part of speech context.

See section 24.2 Extracting features for an example of extracting sets of features from a database for use in building stochastic models.

13.7 Utterance I/O

A number of functions are available to allow an utterance's structure to be made available for other programs.

The whole structure, that is all streams including relations and features, may be saved in an Xlabel-like format using the function utt.save. This file may be reloaded using the utt.load function. Note that the waveform is not saved in this format. It is possible to modify the external representation of the utterance in this file and reload it, though care should be taken to ensure the structure is still valid.

Individual aspects of an utterance may be selectively saved. The waveform itself may be saved using the function utt.save.wave. This will save the waveform in the named file in the format specified in the Parameter Wavefiletype. All formats supported by the Edinburgh Speech Tools are valid including nist, esps, sun, riff, aiff, raw and ulaw. Note the functions utt.wave.rescale and utt.wave.resample may be used to change the gain and sample frequency of the waveform before saving it. A waveform may be imported into an existing utterance with the function utt.import.wave. This is specifically designed to allow external methods of waveform synthesis. However if you just wish to play an external wave or make it into an utterance you should consider the utterance Wave type.
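
A sketch of saving a resampled waveform is given below. The argument order shown for utt.wave.resample (utterance, then sample rate in Hz) and for utt.save.wave (utterance, then file name) is an assumption, as is the file name example.wav; utt1 is an illustrative variable.

```scheme
(set! utt1 (Utterance Text "This is an example"))
(utt.synth utt1)

;; Assumption: (utt.wave.resample UTT RATE) with RATE in Hz.
(utt.wave.resample utt1 8000)

;; Choose the output format, then write the file.
;; Assumption: (utt.save.wave UTT FILENAME).
(Parameter.set 'Wavefiletype 'riff)
(utt.save.wave utt1 "example.wav")
```

Setting Wavefiletype before saving matters because the format is taken from the parameter, not inferred from the file name.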

The segments of an utterance may be saved in a file using the function utt.save.segs, which saves the segments of the named utterance in Xlabel format. Any other stream may also be saved using the more general utt.save.stream, which takes the additional argument of a stream name. The name and end time of each stream item are saved in the named file, again in Xlabel format. For more elaborate saving methods you can easily write a Scheme function to save data in an utterance in whatever format is required. See the file `lib/mbrola.scm' for an example.

A simple function to allow the displaying of an utterance in Entropic's Xwaves tool is provided by the function display. It simply saves the waveform and the segments and sends appropriate commands to (the already running) Xwaves and xlabel programs.

A function to synthesize an externally specified utterance is provided by utt.resynth, which takes two filename arguments: an Xlabel segment file and an F0 file. This function loads, synthesizes and plays an utterance built from these files. The loading is provided by the underlying function utt.load.segf0.

