Spoken Text Mark-up Language (STML) is a synthesizer-independent, SGML-based markup language for labelling text (sproat97). It is based on the earlier SSML (Speech Synthesis Markup Language), which was supported by previous versions of Festival (taylor96). Raw text has the problem that it cannot always easily be rendered as speech in the way the author wishes. STML offers a well-defined way of marking up text so that the synthesizer may render it appropriately.
The definition of STML is by no means settled and is still in development. In this release Festival offers people working on STML and related SGML-based markup languages a chance to experiment quickly with prototypes by providing a DTD (document type definition) and a mapping of the elements in the DTD to Festival functions.
Primarily we see STML as a language that will be generated by other programs, e.g. text generation systems, dialog managers etc. A standard, easy-to-parse format is therefore required, even if it seems overly verbose for human writers.
Here is a simple example of STML marked-up text:
<!doctype stml system "STML.dtd" []>
<stml>
<language id="english">
<speaker name="male1">
The boy saw the girl in the park <bound strength=2> with the telescope.
The boy saw the girl <bound strength=2> in the park with the telescope.
Good morning <div> My name is Stuart, which is spelled
<rate scheme="native" speed="1.4">
<literal mode="spell">stuart</literal> </rate>
though some people pronounce it
<phonetic scheme="native"> s t uu @ t </phonetic>.
My telephone number is <literal mode="spell">2787</literal>.
<define words="buccleuch" scheme="native" prons="b uh k l uu1">
I used to work in Buccleuch Place, but no one can pronounce that.
<aimg src="laughter.au">
</stml>
After the initial definition of the STML tags, through the file `STML.dtd', which is distributed as part of Festival, the body is given. There are tags for identifying the language and the voice. Explicit boundary markers may be given in the text. Duration and intonation control can also be explicitly specified, as can new pronunciations of words. The last element specifies an external file to play at that point.
There is not yet a definitive set of tags, but hopefully such a list will form over the next few months. As adding support for new tags is often trivial, the problem lies much more in defining what tags there should be than in actually implementing them.
The tags that are supported in the release versions are as follows. These are mostly taken from sproat97, though sometimes their semantics have been changed slightly.
language
Takes an ID attribute. Valid values in Festival are english, britishenglish, americanenglish, spanish, etc. For example
<language id="english">
voice
Takes a name attribute, which takes values male1, male2, female1, etc. There is currently no definition of what happens when a voice is selected which the synthesizer doesn't support. An example, using the speaker element, is
<speaker name="male1">
genre
bound
Takes a strength attribute. Strength may take the values 2, 3 or 4. A 4 signals an utterance break in Festival; 3 and 2 insert a minor break.
<bound strength=3>
div
A type attribute may be specified but it is ignored by Festival.
phonetic
Takes a scheme attribute, which may take the values ipa, worldbet or native. Festival only supports native at present. An example is
<phonetic scheme="native"> s t uu @ t </phonetic>
literal
The scheme supported by Festival is spell. Examples are
Buccleuch is spelled, <literal mode="spell">Buccleuch</literal>. My room number is <literal mode="spell">E17</literal>
omitted
<omitted> Don't say this bit </omitted>
aimg
Takes a src attribute naming an external file to play at that point.
<aimg src="laughter.au">
define
Takes the attributes words (though Festival only allows definitions of single white space separated tokens), prons, a string of phonemes, and scheme, which can be native, ipa or worldbet; Festival currently only supports native. In Festival the vowels may be appended with 1 to denote stress.
<define words="buccleuch" scheme="native" prons="b uh k l uu1">
call
engid
is not recognised.
<call engid="festival" command="(Parameter.set 'Duration_Stretch 1.0)">
emph
Takes a type attribute. This causes the enclosed text to be "emphasized". In Festival this doesn't work very well, as the default intonation mechanism is statistically trained without any notion of emphasis, so the results are unsatisfying.
I want to go to the <emph>park</emph> not the mall.
intonat
Attributes include med, amplitude and scheme (abs or rel), or values may be specified in some native synthesizer-specific form. This is ignored by Festival.
rate
The interpretation of the speed attribute is identified by the scheme attribute, which takes the values native, wpm and description. Festival only supports the native scheme, which is a multiplicative factor (in the example below, 1.5 gives slower speech and 0.6 faster speech).
<rate speed="1.5" scheme="native"> It can speak very slowly </rate><rate speed="0.6" scheme="native"> or talk very fast indeed,</rate> and then back to normal again.
emotion
Takes a state attribute. This is ignored by Festival.
facial
Takes an expression attribute and is intended for use in conjunction with talking heads. This element is currently ignored by Festival.
word
Takes the attributes name, pos (part of speech), accent (intonation accent), tone (intonation end tone), and phonemes (a list of phonemes in the native phoneme set for the pronunciation). Note that accent and tone are only of use if the intonation method used by the current voice actually checks for them.
Homographs are words that are written the same but have different pronunciations, such as <word name="lives" pos="nnp"> and <word name="lives" pos="vbz">. You say <word name="either" phonemes="ii dh r">, while I say <word name="either" phonemes="ai dh r">.
These tags may change in name but they cover the aspects of speech mark up that we wish to express. Later additions and changes to these are expected.
See the files `festival/examples/example.stml' and `festival/examples/example2.stml' for working examples.
We do not yet claim that there is a fixed standard for STML tags but we wish to move towards such a standard. In the mean time we have made it easy in Festival to add support for new tags without, in general, having to change any of the core functions.
Two changes are necessary to add a new tag. First, change the definition in `lib/STML.dtd' so that STML files may use it. The second stage is to make Festival sensitive to that new tag. The example in festival/lib/stml-mode.scm shows how a new text mode may be implemented for an SGML-based markup language. The basic point is that an identified function will be called on finding a start tag or end tag in the document. It is the tag-function's job to synthesize the given utterance if the tag signals an utterance boundary. The return value from the tag-function is the new status of the current utterance: it may remain unchanged, or, if the current utterance has been synthesized, nil should be returned, signalling a new utterance.
Note that the hierarchical structure of the document is not available in this method of tag-functions. Any hierarchical state that must be preserved has to be kept using explicit stacks. This is a problem with the current implementation, but due to the cross relationship between utterances and tags (utterances may end within start and end tags), the desire to have all specification in Scheme rather than C++, and lack of time to find the correct solution, the current method is offered.
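The explicit stack mentioned above might be kept as in the following minimal sketch. All of the names (my_rate_stack, my_push_rate, my_pop_rate) are hypothetical and are not part of Festival; only core Scheme list operations are used, and the default value 1.0 is an assumption for illustration.

```scheme
;; Hypothetical helper names; nothing here is defined by Festival itself.
;; In Festival's Scheme nil is both the empty list and false.
(define my_rate_stack nil)

(define (my_push_rate rate)
  "Remember RATE on entering a nested element and return it."
  (set! my_rate_stack (cons rate my_rate_stack))
  rate)

(define (my_pop_rate)
  "Forget the innermost rate on leaving an element and return the
rate that is now in force (1.0 is an assumed default)."
  (if my_rate_stack
      (set! my_rate_stack (cdr my_rate_stack)))
  (if my_rate_stack
      (car my_rate_stack)
      1.0))

;; Possible use inside the start and end tag-functions of an element:
;;   (Parameter.set 'Duration_Stretch (my_push_rate 1.4))
;;   (Parameter.set 'Duration_Stretch (my_pop_rate))
```

Keeping the state in a list like this lets nested elements restore the enclosing value on their end tag, which the flat tag-function interface does not do by itself.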
The tag-functions are defined in an elements list. They are identified with names such as "(STML" and ")STML", denoting start and end tags respectively. Two arguments are passed to these tag-functions: an assoc list of attributes and values as specified in the document, and the current utterance. If the tag denotes an utterance break, call xxml_synth on UTT and return nil.
If a tag (start or end) is found in the document and there is no
corresponding tag-function it is ignored.
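As a rough illustration of the tag-function scheme described above, the following sketch shows what a few entries in an elements list might look like. The variable name my_stml_elements and the tag MYTAG are hypothetical, and the entry layout (tag name, then the argument names ATTLIST and UTT, then the body) is an assumption; `lib/stml-mode.scm' remains the model to follow.

```scheme
;; Hypothetical elements list. The entry layout shown here is an
;; assumption; compare with lib/stml-mode.scm before relying on it.
(define my_stml_elements
 '(
   ("(MYTAG" (ATTLIST UTT)
    ;; Start tag with no effect on utterance boundaries:
    ;; return the current utterance unchanged.
    UTT)
   (")MYTAG" (ATTLIST UTT)
    ;; End tag, likewise leaving the utterance unchanged.
    UTT)
   (")STML" (ATTLIST UTT)
    ;; End of document: synthesize the remaining tokens, then
    ;; return nil to signal that a new utterance starts.
    (xxml_synth UTT)
    nil)))
```

The list is quoted data, so the bodies are only evaluated when the corresponding tag is met in a document.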
New features may be added to words with a start and end tag by adding features to the global xxml_word_features. Any features in that variable will be added to each word.
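A minimal sketch of this mechanism follows. It assumes that xxml_word_features holds a list of (NAME VALUE) entries, which is not stated above, and the feature name "my_emph" and both helper functions are hypothetical.

```scheme
;; Assumption: xxml_word_features is a list of (NAME VALUE) entries.
;; The feature name "my_emph" and these helpers are hypothetical.
(define (my_emph_start)
  "Start-tag helper: mark every following word with my_emph."
  (set! xxml_word_features
        (cons (list "my_emph" "1") xxml_word_features)))

(define (my_emph_end)
  "End-tag helper: drop the entry added by my_emph_start."
  (set! xxml_word_features (cdr xxml_word_features)))
```

The end-tag helper simply removes the most recent entry, so it is only correct when start and end tags are properly nested.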
Note this describes one SGML-based markup language; others may be easily specified within Festival. Use `lib/stml-mode.scm' as a model.
STML is an SGML language and hence is best parsed by SGML tools. Such parsing is not part of Festival itself, so Festival support for STML requires an external parser, which must validate an STML file as well as add default tags. We have tested `nsgmls-1.0', which is available as part of the SGML tool set `sp-1.1.tar.gz' from `http://www.jclark.com/sp/index.html'. This seems portable between many platforms.
The types of markup being defined for spoken text probably do not require the full power of SGML. It is possible that, once the markup tagset is more settled, we will restrict it to XML (wwwxml97).
Support in Festival for STML is as a text mode. In the command mode use the following to process an STML file:
(tts "stmlfile" 'stml)
From the Unix command line the Festival script `festival/examples/stml' will render STML files as speech.
stml stmlfile
Ensure either that `stml' has been moved to a directory in your PATH or that you add its directory to your PATH. Also, the automatic selection of mode based on file type has been set up such that files ending `.stml' will be automatically synthesized in this mode.
Another way of using STML is through the Emacs interface. The say-buffer command will send the Emacs buffer mode to Festival as its tts-mode. If the mode is stml or sgml the file is treated as an STML file. See section 10 Emacs interface.
Many people experimenting with STML (and TTS in general) want all the waveform output to be saved so it can be played at a later date. Note that because STML is a tts mode, Festival synthesizes the contents of the file utterance by utterance, so many waveforms are generated. Also note that these waveforms may be of different sample rates (due to selection of different voices, languages and importing files), so they can't be trivially concatenated together. The following command will cause all waveforms to be saved in files named `tts_file_N' in the current directory. You may then use `ch_wave' or similar to resample them and concatenate them together.
(set! tts_hooks (list utt.synth save_tts_output utt.play))
Either execute this in the command interpreter before using STML or add it to your `.festivalrc' if you are not using the command interpreter directly (remember to remove this after using it though).
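One way to make sure the hook is removed again is to save the previous value first and restore it afterwards, as in the sketch below. The variable name my_saved_tts_hooks is hypothetical; everything else is taken from the commands shown above.

```scheme
;; my_saved_tts_hooks is a hypothetical name used only in this sketch.
(define my_saved_tts_hooks tts_hooks)

;; Save each utterance's waveform as well as playing it.
(set! tts_hooks (list utt.synth save_tts_output utt.play))
(tts "stmlfile" 'stml)

;; Put back whatever hooks were in force before.
(set! tts_hooks my_saved_tts_hooks)
```

Restoring the saved value rather than a fixed list avoids having to know what the hooks were set to beforehand.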