

12 Lexicons

A Lexicon in Festival is a subsystem that provides pronunciations for words. It can consist of three distinct parts: an addenda, typically short and consisting of hand-added words; a compiled lexicon, typically large (tens of thousands of words), which sits on disk somewhere; and a method for dealing with words not in either list.

12.1 Lexical entries

Lexical entries consist of three basic parts: a head word, a part of speech and a pronunciation. The head word is what you might normally think of as a word, e.g. `walk', `chairs' etc., but it might be any token.

The part-of-speech field currently consists of a simple atom (or nil if none is specified). Of course there are many part of speech tag sets, and whatever you mark in your lexicon must be compatible with the subsystems that use that information. You can optionally set a part of speech tag mapping for each lexicon. The value should be a reverse assoc-list of the following form

(lex.set.pos.map 
   '((( punc fpunc) punc)
     (( nn nnp nns nnps ) n)))

All part of speech tags not appearing in the left hand side of a pos map are left unchanged.

The third field contains the actual pronunciation of the word. This consists of the syllables, stress markings and the phones themselves. There is an alternative format for lexical entries for compilation where syllable boundaries are not marked; in that case these are determined automatically during compilation.

Some typical example entries are

( "walkers" n ((( w oo ) 1) (( k @ z ) 0)) )
( "present" v ((( p r e ) 0) (( z @ n t ) 1)) )
( "monument" n ((( m o ) 1) (( n y u ) 0) (( m @ n t ) 0)) )

Note that you may have two entries with the same headword; different part of speech fields allow them to be distinguished. For example

( "lives" n ((( l ai v z ) 1)) )
( "lives" v ((( l i v z ) 1)) )

See section 12.3 Lookup process for a description of how multiple entries with the same headword are used during lookup.

By current conventions, single syllable function words should have no stress marking, while single syllable content words should be stressed.
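
For example, under these conventions a single syllable function word and a single syllable content word might be entered as follows (illustrative entries in the mrpa phone set, not taken from any distributed lexicon):

( "the" nil ((( dh @ ) 0)) )    ;; function word: no stress
( "thee" nil ((( dh ii ) 1)) )  ;; content word: stressed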

NOTE: the POS field may change in future to contain more complex formats. The same lexicon mechanism (but different lexicon) is used for holding part of speech tag distributions for the POS prediction module.

12.2 Defining lexicons

As stated above, lexicons consist of three basic parts (compiled form, addenda and unknown word method) plus some other declarations.

Each lexicon in the system has a name which allows different lexicons to be selected efficiently when switching between voices during synthesis. The basic steps involved in a lexicon definition are as follows.

First a new lexicon must be created with a new name

(lex.create "cstrlex")

A phone set must be declared for the lexicon, both to allow checks on the entries themselves and to allow phone mapping between different phone sets used in the system.

(lex.set.phoneset "mrpa")

The phone set must be already declared in the system.

A compiled lexicon, the construction of which is described below, may optionally be specified.

(lex.set.compile.file "/projects/festival/lib/dicts/cstrlex.out")

The method for dealing with unknown words (see section 12.4 Letter to sound rules) may be set

(lex.set.lts.method 'lts_rules)
(lex.set.lts.ruleset 'nrl)

In this case we are specifying the use of a set of letter to sound rules originally developed by the U.S. Naval Research Laboratories. The default method is to give an error if a word is not found in the addenda or compiled lexicon. (This is discussed more fully below.)

Finally addenda items may be added for words that are known to be common, but not in the lexicon and cannot reasonably be analysed by the letter to sound rules.

(lex.add.entry 
  '( "awb" n ((( ei ) 1) ((d uh) 1) ((b @ l) 0) ((y uu) 0) ((b ii) 1))))
(lex.add.entry 
  '( "cstr" n ((( s ii ) 1) (( e s ) 1) (( t ii ) 1) (( aa ) 1)) ))

Using lex.add.entry again for the same word and part of speech will redefine the current pronunciation.

For large lists, compiled lexicons are best. The function lex.compile takes two filename arguments, a file name containing a list of lexical entries and an output file where the compiled lexicon will be saved.
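
A compilation call might look like this (the file names are purely illustrative; the output file is the one you would later name in lex.set.compile.file):

(lex.compile "/projects/festival/lib/dicts/cstrlex.scm"
             "/projects/festival/lib/dicts/cstrlex.out")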

Compilation can take some time and may require lots of memory, as all entries are loaded in, checked and then sorted before being written out again. During compilation, if some entry is malformed the reading process halts with a not particularly useful message. Note that if any of your entries include single or double quotes, the entries will probably be misparsed and cause such a cryptic error. In such cases try setting

(debug_output t)

before compilation. This will print out each entry as it is read in which should help to narrow down where the error is.

12.3 Lookup process

When looking up a word, through either the C++ or the Lisp interface, the word is identified by its headword and part of speech. If no part of speech is specified, nil is assumed, which matches any part of speech tag.

The lexicon look up process first checks the addenda; if there is a full match (head word plus part of speech) that entry is returned. If there is an addenda entry whose head word matches and whose part of speech is nil, that entry is returned.

If no match is found in the addenda, the compiled lexicon, if present, is checked. Again a match requires both head word and part of speech tag to match, unless either the word being searched for or the entry has a part of speech of nil. Unlike the addenda, if no full head word and part of speech match is found, the first word in the lexicon whose head word matches is returned. The rationale is that the letter to sound rules (the next line of defence) are bad enough that any lexicon pronunciation for that head word will be better than what they would produce; all the more so because the presence of an entry with the right head word but a different part of speech suggests the word may have an unusual pronunciation that the letter to sound rules have no chance of producing.

Finally if the word is not found in the compiled lexicon it is passed to whatever method is defined for unknown words. This is most likely a letter to sound module. See section 12.4 Letter to sound rules.
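
For example, assuming the two entries for "lives" given in section 12.1 are in the current lexicon, lookups from Lisp (using the basic access function lex.lookup) would behave as follows:

(lex.lookup "lives" 'n)   ;; returns the noun entry
(lex.lookup "lives" 'v)   ;; returns the verb entry
(lex.lookup "lives" nil)  ;; nil matches either entry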

Compiled lexicons may be created from lists of lexical entries. A compiled lexicon is much more efficient for look up than the addenda. Compiled lexicons use a binary search method while the addenda is searched linearly. Also it would take a prohibitively long time to load in a typical full lexicon as an addenda. If you have more than a few hundred entries in your addenda you should seriously consider adding them to your compiled lexicon.

Because many publicly available lexicons do not have syllable markings for entries the compilation method supports automatic syllabification. Thus for lexicon entries for compilation, two forms for the pronunciation field are supported: the standard full syllabified and stressed form and a simpler linear form found in at least the BEEP and CMU lexicons. If the pronunciation field is a flat atomic list it is assumed syllabification is required.

Syllabification is done by finding the minimum sonorant position between vowels. It is not guaranteed to be accurate but does give a solution that is sufficient for many purposes. A little work would probably improve this significantly. Of course syllabification requires the entry's phones to be in the current phone set. The sonorant values are calculated from the vc, ctype, and cvox features for the current phoneset. See `src/arch/festival/Phone.cc:ph_sonority()' for actual definition.

Additionally, in this flat structure vowels (atoms starting with a, e, i, o or u) may have 1, 2 or 0 appended, marking stress. This again follows the form found in the BEEP and CMU lexicons.

Some example entries in the flat form (taken from BEEP) are

("table" nil (t ei1 b l))
("suspicious" nil (s @ s p i1 sh @ s))

Also, if syllabification is required there is an opportunity to run a set of "letter-to-sound" rules on the input (actually an arbitrary re-write rule system). If the variable lex_lts_set is set, the lts ruleset of that name is applied to the flat input before syllabification. This allows simple predictable changes, such as conversion of final r into a longer vowel for English RP from American-labelled lexicons.

A list of all matching entries in the addenda and the compiled lexicon may be found by the function lex.lookup_all. This function takes a word and returns all matching entries irrespective of part of speech.
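
For example, again assuming the two entries for "lives" from section 12.1:

(lex.lookup_all "lives")  ;; returns both the noun and verb entries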

12.4 Letter to sound rules

Each lexicon may define what action to take when a word cannot be found in the addenda or the compiled lexicon. There are a number of options which will hopefully be added to as more general letter to sound rule systems are added.

The method is set by the command

(lex.set.lts.method METHOD)

Where METHOD can be any of the following

`Error'
Throw an error when an unknown word is found (default).
`lts_rules'
Use an externally specified set of letter to sound rules (described below). The name of the rule set to use is defined with the lex.set.lts.ruleset function. This method runs one set of rules on an exploded form of the word and assumes the rules return a list of phonemes (in the appropriate set). If multiple applications of rulesets are required, use the function method described next.
`function'
Call the lisp function lex_user_unknown_word. This function is given two arguments: the word and the part of speech. It should return a valid lexical entry.
`none'
This returns an entry with a nil pronunciation field. This will only be valid in very special circumstances.

The letter to sound rule system is very simple but is powerful enough to build reasonably complex letter to sound rules.

The basic form of a rule is as follows

( LEFTCONTEXT [ ITEMS ] RIGHTCONTEXT = NEWITEMS )

The interpretation is that if ITEMS appear in the specified left and right context, then the output string is to contain NEWITEMS. Any of LEFTCONTEXT, RIGHTCONTEXT or NEWITEMS may be empty. Note that NEWITEMS is written to a different "tape" and hence cannot feed further rules (within this ruleset). An example is

( # [ c h ] C = k )

The special character # denotes a word boundary, and the symbol C denotes the set of all consonants; sets are declared before the rules. This rule states that a ch at the start of a word followed by a consonant is to be rendered as the k phoneme. Symbols in contexts may be followed by the symbol * for zero or more occurrences, or + for one or more occurrences.

The symbols in the rules are treated as set names if they are declared as such or as symbols in the input/output alphabets. The symbols may be more than one character long and the names are case sensitive.

The rules are tried in order until one matches the first (or more) symbols of the tape. That rule is applied, adding its right-hand side to the output tape, and the rules are then applied again from the start of the list.

The function used to apply a set of rules will, if given an atom, explode it into a list of single characters; if given a list, it will use it as is. This reflects the common usage of wishing to rewrite the individual letters of a word as phonemes, without excluding the possibility of using the system for more complex manipulations, such as multi-pass LTS systems and phoneme conversion.

From Lisp there are three basic access functions; there are corresponding functions in the C/C++ domain.

(lts.ruleset NAME SETS RULES)
Define a new set of lts rules, where NAME is the name for this ruleset, SETS is a list of set definitions of the form (SETNAME e0 e1 ...) and RULES is a list of rules as described above.
(lts.apply WORD RULESETNAME)
Apply the set of rules named RULESETNAME to WORD. If WORD is a symbol it is exploded into a list of the individual characters in its print name. If WORD is a list it is used as is. If the rules cannot be successfully applied an error is given. The result of (successful) application is returned in a list.
(lts.check_alpha WORD RULESETNAME)
The symbols in WORD are checked against the input alphabet of the rules named RULESETNAME. If they are all contained in that alphabet t is returned, else nil. Note this does not necessarily mean the rules will successfully apply (contexts may restrict the application of the rules), but it allows general checking like numerals, punctuation etc, allowing application of appropriate rule sets.
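
As a toy illustration (this ruleset covers only a handful of letters, so most real words will fall outside its alphabet), a ruleset might be defined and applied as follows:

(lts.ruleset
 toy
 ;; sets: C is a (partial) set of consonants
 ( (C b c d h k l n r s t) )
 ;; rules, tried in order
 ( ( # [ c h ] C = k )   ;; word-initial ch before a consonant
   ( [ c h ] = ch )
   ( [ c ] = k )
   ( [ a ] = a )
   ( [ t ] = t ) ))

(lts.apply 'cat 'toy)  ;; should give (k a t)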

The letter to sound rule system may be used directly from Lisp, and can easily be used for relatively complex word analysis without requiring modification of the C/C++ system. For example, the Welsh letter to sound rule system consists of three rule sets: the first explicitly identifies epenthesis, the second identifies stressed vowels, and the third rewrites this augmented letter string to phonemes. This is achieved by the following function

(define (welsh_lts word features)
  (let (epen str wel)
    (set! epen (lts.apply (downcase word) 'newepen))
    (set! str (lts.apply epen 'newwelstr))
    (set! wel (lts.apply str 'newwel))
    (list word
          nil
          (lex.syllabify.phstress wel))))

The LTS method for the Welsh lexicon is set to function, and lex_user_unknown_word is set to the above function when the Welsh speaker is selected. Note that the above function first downcases the word and then applies the rulesets in turn, finally calling the syllabification process and returning a constructed lexical entry.
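
The wiring up for such a voice would then look something like this (the lexicon name "welsh" is illustrative):

(lex.select "welsh")
(lex.set.lts.method 'function)
(set! lex_user_unknown_word welsh_lts)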

12.5 Lexicon requirements

For English there are a number of assumptions made about the lexicon which are worthy of explicit mention. If you are basically going to use the existing token rules you should try to include at least the following in any lexicon that is to work with them.

The lexicon is one of the largest parts of the system, and many may wish to try to reduce its size. There is probably a reasonable way to do this: remove words that the letter to sound rules deal with adequately. With a little more care on the letter to sound rules and the addition of a lexical stress assignment algorithm, I suspect the lexicon could be reduced by a significant proportion, keeping only words whose pronunciation is unusual (many of the common words) and homographs. A morphological decomposition algorithm, like that described in black91, would also make lexical compression more feasible.

12.6 Available lexicons

Currently Festival supports a number of different lexicons. They are all defined in the file `lib/lexicons.scm' each with a number of common extra words added to their addendas. They are

`CUVOALD'
The Computer Users Version of Oxford Advanced Learner's Dictionary is available from the Oxford Text Archive `ftp://ota.ox.ac.uk/pub/ota/public/dicts/710'. It contains about 70,000 entries and is a part of the BEEP lexicon. It is more consistent in its marking of stress, though its syllable marking is not what works best for our synthesis methods. Many syllabic `l's, `n's and `m's mess up the syllabification algorithm, making results sometimes appear over-reduced. It is however our current default lexicon. It is also the only lexicon with part of speech tags that can be distributed (for non-commercial use).
`CMU'
This is automatically constructed from `cmu_dict-0.1', available from many places on the net (see comp.speech archives). It is not in the mrpa phone set because it is American English pronunciation. Although mappings exist between its phone set (`darpa') and `mrpa', the results for British English speakers are not very good. However, this is probably the biggest, most carefully specified lexicon available. It contains just under 100,000 entries. Our distribution has been modified to include part of speech tags on words we know to be homographs.
`mrpa'
A version of the CSTR lexicon which has been floating about for years. It contains about 25,000 entries.
`BEEP'
A British English rival for the `cmu_lex'. BEEP has been made available by Tony Robinson at Cambridge and is available in many archives. It contains 163,000 entries and has been converted to the `mrpa' phoneset (which was a trivial mapping). Although large, it suffers from a certain randomness in its stress markings, making use of them dubious.

All of the above lexicons have some distribution restrictions (though mostly pretty light), but as they are mostly freely available we provide programs that can convert the originals into Festival's format.

The MOBY lexicon has recently been released into the public domain and will be converted into our format soon.

12.7 Post-lexical rules

It is the lexicon's job to produce a pronunciation of a given word. However in most languages the most natural pronunciation of a word cannot be found in isolation from the context in which it is to be spoken. This includes such phenomena as reduction, phrase final devoicing and r-insertion. In Festival this is done by post-lexical rules.

PostLex is a module which is run after accent assignment but before duration and F0 generation. This is because knowledge of accent position is necessary for vowel reduction and other post lexical phenomena and changing the segmental stream will affect durations.

The PostLex module first applies a set of built-in rules (which could be done in Scheme but for historical reasons are still in C++). It then applies the functions set in the hook postlex_rules_hook. These should be a set of functions that take an utterance and apply appropriate rules. This should be set up on a per voice basis.
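
For example, a voice wishing to use only the possessive rule defined in `festival/lib/postlex.scm' might include the following in its selection function:

(set! postlex_rules_hook (list postlex_apos_s_check))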

Although a rule system could be devised for post-lexical sound rules it is unclear what the scope of them should be, so we have left it completely open. Our vowel reduction model uses a CART decision tree to predict which syllables should be reduced, while the "'s" rule is very simple (shown in `festival/lib/postlex.scm').

The 's in English may be pronounced in a number of different ways depending on the preceding context. If the preceding consonant is a fricative or affricative and not palatal, labio-dental or dental, a schwa is required (e.g. bench's); otherwise no schwa is required (e.g. John's). Also, if the previous phoneme is unvoiced the "s" is rendered as an "s", while in all other cases it is rendered as a "z".

For our English voices we have a lexical entry for "'s" as a schwa followed by a "z". We use a post lexical rule function called postlex_apos_s_check to modify the basic given form when required. After lexical lookup the segment stream contains the concatenation of segments directly from lookup in the lexicon. Post lexical rules are applied after that.

In the following rule we check each segment to see if it is part of a word labelled "'s"; if so, we check whether we are currently looking at the schwa or the z part, and test whether modification is required

(define (postlex_apos_s_check utt)
  "(postlex_apos_s_check UTT)
Deal with possessive s for English (American and British).  Delete
schwa of 's if previous is not a fricative or affricative, and
change voiced to unvoiced s if previous is not voiced."
  (mapcar
   ;; for each segment in the stream
   (lambda (seg)
     (if (string-equal "'s" ;; are we in an apostrophe s 
          (utt.streamitem.feat utt seg 'Syllable.Word.name))
         (if (string-equal "a" (utt.streamitem.feat utt seg 'ph_vlng))
             ;; schwa part
             (if (and (member_string (utt.streamitem.feat utt seg 'p.ph_ctype) 
                                     '(f a))
                      (not (member_string
                            (utt.streamitem.feat utt seg 'p.ph_cplace) 
                            '(d b g))))
                 t                                 ;; don't delete schwa
                 (utt.streamitem.delete utt seg))  ;; do delete schwa
             ;; s/z part
             (if (string-equal "-" (utt.streamitem.feat utt seg 'p.ph_cvox))
                 (streamitem.set_name seg "s"))))) ;; change from "z"
   (utt.stream utt 'Segment))
  utt)

