MBROLA

M     M  BBBB    RRRR      OOO    L        A
MM   MM  B   B   R   R    O   O   L       A A
M M M M  B   B   R   R   O     O  L      A   A
M  M  M  BBBB    RRR     O     O  L     AAAAAAA
M     M  B   B   R  R    O     O  L     A     A
M     M  B    B  R   R    O   O   L     A     A
M     M  BBBBB   R    R    OOO    LLLLL A     A

Version 3.00, Thu Feb 26 14:48:28 MET 1998

--------------------------------------------------------------
Table of Contents
--------------------------------------------------------------

1.0 License
2.0 A brief description of MBROLA
3.0 Distribution
4.0 Installation and Tests
5.0 Format of input and output files - Limitations
6.0 Joining the MBROLA project as a user
7.0 Joining the MBROLA project as a database provider
8.0 Acknowledgments
9.0 Contacting the author

--------------------------------------------------------------
1.0 License
--------------------------------------------------------------

This program and object code is being provided to "you", the licensee,
by Thierry  Dutoit, the "author",  under  the following license, which
applies to any  program, object code or other   work which contains  a
notice placed  by  the copyright holder saying  it  may be distributed
under the terms of this license.   The "program", below, refers to any
such program, object code or work.

By obtaining,  using and/or copying  this program, you  agree that you
have   read,  understood,  and   will  comply   with  these  terms and
conditions:

Terms and conditions for the distribution of the program
--------------------------------------------------------

This program may not be sold or incorporated into any product which is
sold without prior permission from the author.

When no  charge is made, this  program  may be copied  and distributed
freely, provided that   this notice  is  copied  and distributed  with
it. Each time you redistribute  the program (or any  work based on the
program), the recipient   automatically receives  a license from   the
original  licensor to copy or distribute  the program subject to these
terms and conditions.  You may  not impose any further restrictions on
the recipients' exercise of   the rights granted  herein. You  are not
responsible for enforcing compliance by third parties to this License.

If you wish to incorporate the program  into other free programs whose
distribution conditions are different, write to  the author to ask for
permission.

If, as   a consequence of a   court judgment  or  allegation of patent
infringement or  for any other reason  (not limited to patent issues),
conditions are imposed  on you (whether  by court order,  agreement or
otherwise) that contradict the conditions of this license, they do not
excuse   you from the   conditions  of this  license.    If you cannot
distribute so as to satisfy simultaneously your obligations under this
license and any other pertinent obligations, then as a consequence you
may   not distribute the  program at  all.    For example, if a patent
license would not permit royalty-free redistribution of the program by
all those who receive copies directly or  indirectly through you, then
the only way you  could satisfy both it and  this license would be  to
refrain entirely from distribution of the program.

Terms and conditions on the use of the program
----------------------------------------------

Permission  is  granted   to use  this   software  for non-commercial,
non-military purposes,  with and  only  with  the voice   and language
databases  made available by  the author  from the  MBROLA project www
homepage:

         http://tcts.fpms.ac.be/synthesis

In return, the author asks you to mention the MBROLA reference paper:

T. DUTOIT, V. PAGEL, N. PIERRET, F.  BATAILLE, O. VAN DER VRECKEN
"The MBROLA Project: Towards a Set of High-Quality Speech
Synthesizers Free of Use for Non-Commercial Purposes"
Proc. ICSLP'96, Philadelphia, vol. 3, pp. 1393-1396.  

or,  for a more general  reference   to Text-To-Speech synthesis,  the
book:

An Introduction to Text-To-Speech Synthesis,
T. DUTOIT, Kluwer Academic Publishers, Dordrecht 
Hardbound, ISBN 0-7923-4498-7
April 1997, 312 pp. 

in any scientific publication referring to work for which this program
has been used.

Disclaimer
----------

THIS  SOFTWARE CARRIES NO   WARRANTY, EXPRESSED OR IMPLIED.  THE  USER
ASSUMES ALL RISKS, KNOWN OR UNKNOWN, DIRECT OR INDIRECT, WHICH INVOLVE
THIS SOFTWARE IN ANY WAY. IN PARTICULAR, THE  AUTHOR DOES NOT TAKE ANY
COMMITMENT IN VIEW OF ANY POSSIBLE THIRD PARTY RIGHTS.

--------------------------------------------------------------
2.0 A brief description of MBROLA
--------------------------------------------------------------

MBROLA v3.00 is a speech synthesizer based on the concatenation of
diphones. It takes a list of phonemes as input, together with prosodic
information (duration of phonemes and a piecewise linear description
of pitch), and produces 16-bit linear speech samples at the sampling
frequency of the diphone database.

It is therefore NOT a Text-To-Speech  (TTS) synthesizer, since it does
not accept raw text as input.  In  order to obtain  a full TTS system,
you need to use this synthesizer in combination with a text processing
system that produces phonetic and prosodic commands.

We maintain a web page with pointers to such freely available systems:

http://tcts.fpms.ac.be/synthesis/mbrtts.html

This software is the heart of the MBROLA project, the aim of which is
to obtain a set of speech synthesizers for as many languages as
possible, free of use for non-commercial applications.

The terms of this project can be summarized as follows:

After an official agreement between the author of this software and
the owner of a diphone database, the database is processed by the
author and adapted to the mbrola format, free of charge. The resulting
mbrola diphone database is made available for non-commercial use as
part of the MBROLA project. Commercial rights on the mbrola database
remain with the database provider, for exclusive use with the mbrola
software.

The ultimate goal of this project is to boost academic research on
speech synthesis, and particularly on prosody generation, which is
known as one of the biggest challenges facing Text-To-Speech
synthesizers in the years to come.

More details can be found at the MBROLA project homepage :

http://tcts.fpms.ac.be/synthesis

The synthesizer uses a synthesis method that is itself called MBROLA.

--------------------------------------------------------------
3.0 Distribution
--------------------------------------------------------------

This distribution of mbrola contains the following files:

MBROLA.exe or MBROLA : the executable file of the synthesizer itself
                       (which one depends on the computer that runs it)
README.TXT           : this file

On its own, the synthesizer requires an MBROLA language/voice database
to run properly. A French male voice sampled at 16 kHz has been made
available by the author. Additional languages and voices are, or will
be, available in the context of the MBROLA project.

The main difference between releases 2.0 and 3.0 is that 3.0 includes a
powerful online decoder, which allows MBROLA to use much smaller
diphone databases than before (generally about 1 MB).

Please consult the MBROLA project homepage:

http://tcts.fpms.ac.be/synthesis

--------------------------------------------------------------
4.0 Installation and Tests
--------------------------------------------------------------

The following computers/OS are currently supported:

   * SUN Sparc 5/S5R4 (Solaris 2.4)
   * HPUX 9.0 and HPUX 10.0, tested on:
        HP-UX A.09.05 A 9000/712
        HP-UX A.09.05 A 9000/715
        HP-UX A.09.01 E 9000/755
        HP-UX B.10.01 A 9000/710
     Not tested on HP9000/735, but should work properly.
   * VAX/VMS V6.2 (V5.5-2 won't work), tested on:
        VAXstation 3100-M76
        VAXstation 4000-90A
        VAXstation 4000-60
   * DEC Alpha (AXP)/VMS 6.2, tested on:
        DEC 3000 - M600
        DEC 2000 Model 300
        AlphaStation 200 4/233
        AlphaStation 200 4/166
   * IBM RS6000 AIX 4.12
   * PC486/DOS6 (other PCs/DOS versions should work, too)
   * PC486/WIN31
   * PC486/WIN95
   * PC/LINUX 1.2.11
   * PC Pentium 120/Solaris 2.4
   * OS/2
   * BeBox

Please let us know when mbrola works on a machine not listed here. A
special DLL version, which allows direct audio output, is distributed
for PC Windows; check the MBROLA site.

See the MBROLA Homepage if your computer or OS is not supported yet.

Assuming you have copied the right .zip file, create a directory
called mbrola (the name is not critical), copy the mbrXXX.zip file into
it (where XXX stands for a version number), and unzip the file:

unzip mbrXXX.zip (or pkunzip on PC/DOS)

You are now ready to synthesize your first French words. 

First try: mbrola

to see the terms and conditions on the use of this software. 

Then try: mbrola -h 

to get some help on how to use the software:

> USAGE: mbrola [-e] [-i] [-c CC] [-v VR] [-f FR] [-t TR] [-l VF] [-s] database pho_file* output_file
>
>A - instead of pho_file or output_file means stdin or stdout
>Extension of output_file ( raw, au, wav, aiff ) tells the wanted audio format
>
>e = No fatal error on unkown diphone
>i = Print the database information if any
>CC= Comment Char, escape sequence for a comment
>VR= Volume Ratio, float ratio applied to ouput samples
>FR= Frequency Ratio, float ratio applied to pitch targets
>TR= Time Ratio, float ratio applied to phone durations
>VF= Voice Freq, target freq for voice quality
>s = Disable spectral smoothing (database debugging purpose)

Now in order  to go further, you  need to get a  version of an  MBROLA
language/voice  database from  the  MBROLA project  homepage.  Let  us
assume   you   have copied  the   FR1  database  and   referred to the
accompanying fr1.txt file for its installation.

Then try: mbrola fr1/fr1 fr1/TEST/bonjour.pho bonjour.wav

it uses the format:

mbrola diphone_database command_file1 command_file2 ... output_file

and creates a sound file for the word 'bonjour'. 

Basically, the output file consists of signed 16-bit integers, i.e.
samples at the sampling frequency of the MBROLA voice/language database
(16 kHz for FR1, the diphone database supplied by the author of
MBROLA). MBROLA can produce several audio file formats (.au, .wav,
.aiff, .aif, and .raw), depending on the output_file extension. If the
extension is not recognized, the format is RAW (no header).
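A RAW file is nothing but these samples, with no header. As a rough
illustration of what it contains, the Python sketch below wraps such
a file in a WAV header using only the standard library. It assumes
mono output and little-endian byte order by default; neither is stated
in this README, so check your platform (the byteorder parameter lets
you switch). The function name raw_to_wav is hypothetical; in practice,
simply asking mbrola for a .wav output file is easier.

```python
import struct
import wave


def raw_to_wav(raw_bytes, wav_path, rate=16000, byteorder="little"):
    """Wrap headerless signed 16-bit mono samples in a RIFF WAV header.

    Assumptions (not stated in the README): the .raw output is mono and
    its byte order is `byteorder`. Returns the number of samples written.
    """
    if len(raw_bytes) % 2:
        raise ValueError("raw data must contain whole 16-bit samples")
    count = len(raw_bytes) // 2
    prefix = "<" if byteorder == "little" else ">"
    samples = struct.unpack(f"{prefix}{count}h", raw_bytes)
    with wave.open(wav_path, "wb") as out:
        out.setnchannels(1)
        out.setsampwidth(2)      # 16 bits per sample
        out.setframerate(rate)   # sampling rate of the diphone database
        # WAV stores 16-bit PCM little-endian, so repack explicitly.
        out.writeframes(struct.pack(f"<{count}h", *samples))
    return count
```

For example, raw_to_wav(open("bonjour.raw", "rb").read(), "bonjour.wav")
would produce a playable 16 kHz file from a .raw output of FR1.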

To display information  about the phoneme  set  used by the  database,
type:
		  mbrola -i fr1/fr1

It displays the phonetic  alphabet  as well as copyright   information
about the database.

Option -e makes mbrola ignore wrong or missing diphone sequences (they
are replaced by silence), which can be quite useful when debugging your
TTS. It can also be triggered from the phonetic file with the escape
sequence:

;; E=OFF

to ignore missing diphones, or 

;; E=ON

to reactivate the checking.

Optional parameters let you shorten  or lengthen synthetic speech  and
transpose it by providing optional time and frequency ratios:

mbrola -t 1.2 -f 0.8 -v 0.7 fr1/fr1 TEST/bonjour.pho bonjour.wav

for instance, will result in a RIFF Wav file bonjour.wav that is 1.2
times longer than the previous one (slower rate), in which all
fundamental frequency values have been multiplied by 0.8 (sounds lower)
and all output samples by 0.7 (sounds quieter). You can also set the
time and frequency coefficients directly in a .pho file by adding
special escape sequences such as:

;; F=0.8
;; T=1.2
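The documented effect of these two coefficients can be modelled on a
list of phones as in the Python sketch below. It only mirrors the
behaviour described above (durations times T, pitch values times F);
it is not mbrola's internal code, and the Phone class and apply_ratios
function are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Phone:
    name: str
    duration_ms: float
    # Pitch targets as (position in % of the phone duration, pitch in Hz).
    targets: list = field(default_factory=list)


def apply_ratios(phones, time_ratio=1.0, freq_ratio=1.0):
    """Model the documented effect of T / -t and F / -f on a phone list.

    Durations are multiplied by time_ratio and pitch values by freq_ratio.
    Target positions are percentages of each phone, so they do not change.
    """
    return [
        Phone(p.name,
              p.duration_ms * time_ratio,
              [(pos, hz * freq_ratio) for pos, hz in p.targets])
        for p in phones
    ]
```

Because positions are relative, stretching a phone moves its pitch
targets along with it; only the absolute times and the Hz values change.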

You can change the voice characteristics with the -l parameter. If the
sampling rate of your database is 16000 Hz, specifying -l 18000
shortens the vocal tract by a ratio of 16/18 (a child's voice, or a
woman's voice, depending on the voice you are working on). With -l
10000, you lengthen the vocal tract by a ratio of 16/10 (namely the
voice of a troll).
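Both -l examples reduce to one number: the database sampling rate
divided by the -l value. The Python sketch below only computes that
ratio, following the 16/18 example above; how mbrola applies it
internally is not described in this README, and the function names are
hypothetical.

```python
from fractions import Fraction


def vocal_tract_ratio(db_rate_hz, voice_freq_hz):
    """Vocal-tract scaling implied by -l: database rate / -l value.

    A ratio below 1 shortens the vocal tract, above 1 lengthens it.
    """
    return Fraction(db_rate_hz, voice_freq_hz)


def describe(ratio):
    """Human-readable effect of a vocal-tract ratio."""
    if ratio < 1:
        return "shorter"
    if ratio > 1:
        return "longer"
    return "unchanged"
```

With a 16000 Hz database, -l 18000 gives 8/9 (that is, 16/18, shorter)
and -l 10000 gives 8/5 (that is, 16/10, longer).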

Option -v gives a volume ratio, which multiplies each output sample.
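Scaling 16-bit samples can overflow for ratios above 1. The Python
sketch below multiplies each sample and saturates the result to the
16-bit range; that saturation is an assumption made for the example,
since the README does not say how mbrola handles overflow. The name
apply_volume is hypothetical.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def apply_volume(samples, volume_ratio):
    """Multiply each 16-bit sample by volume_ratio, as -v does.

    Saturating to the int16 range is an assumption; the README does not
    say how mbrola handles samples that overflow.
    """
    return [max(INT16_MIN, min(INT16_MAX, round(s * volume_ratio)))
            for s in samples]
```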

The -c option lets you specify which symbol will be used as an escape
sequence for comments and commands in .pho files. The default value is
the semicolon ';', but you may want to change it if your phonetic
alphabet uses this symbol, as in:

mbrola -c ! fr1/fr1 TEST/test1.pho test2.pho test.wav

A - instead  of command_file or  output_file means stdin or stdout. On
multitasking machines, it is easy to run  the synthesizer in real time
to obtain audio output from the audio device, by using pipes.
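For scripting, the command line can also be assembled by a program. The
Python sketch below (hypothetical helper names, assuming an mbrola
executable on the PATH) builds an argument list from some of the
options in the usage message above, and synthesize_to_stdout uses "-"
for stdin and "-.au" for stdout, as in the pipe examples that follow.
Only a subset of the options is covered.

```python
import subprocess


def mbrola_command(database, pho_files, output="-.au", *, ignore_errors=False,
                   volume_ratio=None, freq_ratio=None, time_ratio=None):
    """Build an argv list following the usage line printed by `mbrola -h`.

    "-" as a pho file means stdin; "-.au" as output means stdout in .au format.
    """
    cmd = ["mbrola"]
    if ignore_errors:
        cmd.append("-e")
    for flag, value in (("-v", volume_ratio), ("-f", freq_ratio),
                        ("-t", time_ratio)):
        if value is not None:
            cmd += [flag, str(value)]
    return cmd + [database, *pho_files, output]


def synthesize_to_stdout(database, pho_text, fmt="au"):
    """Pipe .pho text into mbrola and return the audio bytes it writes to
    stdout. Assumes an `mbrola` executable is on the PATH."""
    cmd = mbrola_command(database, ["-"], f"-.{fmt}")
    result = subprocess.run(cmd, input=pho_text.encode(),
                            stdout=subprocess.PIPE, check=True)
    return result.stdout
```

Passing an argument list (rather than a shell string) avoids quoting
problems with escape characters such as the one given to -c.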

Below are a number of machine dependent hints for best using mbrola.

On MSDOS/Windows or OS/2
------------------------

Type: mbrola fr1/fr1 TEST/bonjour.pho bonjour.wav

Then you can play the RIFF Wav file with the Windows sound utility. On
OS/2, pipes may be used as shown below for Unix.

On modern Unix systems such as Solaris or HPUX or Linux
-------------------------------------------------------

mbrola fr1/fr1 TEST/bonjour.pho -.au | audioplay

where audioplay is your audio file player (the name varies with the
platform, e.g. splayer for HPUX).

If your audio player has problems with Sun .au files, try .wav or .raw
instead.

On Sun4 ( old audio interface )
-------------------------------

These machines are now quite old and only provide mu-law 8 kHz
output. A hack is:

mbrola fr1/fr1 input.pho - | sox -t raw  -sw -r 16000  - -t raw -Ub -r 8000 - > /dev/audio

(provided you have the public domain sox utility developed by Ircam).
You should hear 'bonjour' without the need to create intermediate
files. Note, however, that we strongly recommend that you DON'T use
SOX, since its resampling "algorithm" will permanently damage the sound.

Another solution: the UTILITY.ZIP file available from the MBROLA
homepage provides RAW2SUN, which performs this conversion.

On VAX or AXP workstations
--------------------------

To make it  easier   for users to  find MBROLA,   you should add   the
following command to your system startup procedure:

$ DEFINE/SYSTEM/EXEC MBROLA_DIR disk:[dir]

where "disk:[dir]"  is the name  of the directory  you created for the
MBROLA_DIR files.  You could also  add  the following command to  your
system login command procedure:

$ MBROLA :== $MBROLA_DIR:MBROLA.EXE
$ RAW2SUN :== $MBROLA_DIR:RAW2SUN.EXE

To play a sound file on the decsound device:

$ MCR DECSOUND - volume 40 -play sound.au 

See also the MBR_OLA.COM batch file in  the UTILITY.ZIP file available
from the MBROLA  Homepage if you cannot   play 16 bits sound files  on
your machine.

--------------------------------------------------------------
5.0 Format of input and output files - Limitations
--------------------------------------------------------------

5.1 Phoneme commands
--------------------

The input file bonjour.pho in the above example simply contains :

; bonjour 
_ 51 25 114
b 62 
o~ 127 48 170.42 
Z 110 53.5 116 
u 211 
R 150 50 91 
_ 91

This shows the format of the input data required by MBROLA. Each line
contains a phoneme name, a duration (in ms), and a series (possibly
empty) of pitch targets, each composed of two float numbers: the
position of the pitch target within the phoneme (in % of its total
duration), and the pitch value (in Hz) at this position.
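The line format just described is simple enough to parse in a few
lines. The Python sketch below is a hypothetical reader, not code from
the MBROLA distribution; it skips comment lines and also tolerates the
parenthesized notation for pitch targets described in the next
paragraph.

```python
def parse_pho_line(line, comment_char=";"):
    """Parse one .pho line into (phoneme, duration_ms, [(pos_pct, hz), ...]).

    Returns None for blank lines and for lines starting with comment_char
    (comments and ';;' commands). Parentheses and commas around pitch
    targets are treated as separators, so "_ 51 (25,114)" also parses.
    """
    stripped = line.strip()
    if not stripped or stripped.startswith(comment_char):
        return None
    parts = stripped.split(None, 2)
    if len(parts) < 2:
        raise ValueError(f"missing duration in {line!r}")
    name, duration = parts[0], float(parts[1])
    rest = parts[2] if len(parts) == 3 else ""
    for ch in "(),":
        rest = rest.replace(ch, " ")
    numbers = [float(x) for x in rest.split()]
    if len(numbers) % 2:
        raise ValueError(f"pitch targets must come in pairs: {line!r}")
    return name, duration, list(zip(numbers[0::2], numbers[1::2]))
```

Applied to bonjour.pho, the line "o~ 127 48 170.42" yields
("o~", 127.0, [(48.0, 170.42)]).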

In order to increase readability, it is also possible to enclose each
pitch target in parentheses. Hence, the first line of bonjour.pho
could be written:

_ 51 (25,114)

This tells the synthesizer to produce a silence of 51 ms and to put a
pitch target of 114 Hz at 25% of 51 ms. Pitch targets define a
piecewise linear pitch curve. Notice that the pitch curve they define
is continuous, since the program automatically drops pitch information
when synthesizing unvoiced phones.
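Since target positions are percentages of each phone, the target above
sits 0.25 x 51 = 12.75 ms into the utterance. The Python sketch below
converts per-phone targets into absolute (time, pitch) points and
interpolates linearly between them. Holding the first and last values
outside the range of targets is an assumption of this sketch, and the
function names are hypothetical.

```python
def absolute_targets(phones):
    """Convert per-phone targets (position %, Hz) into (time_ms, Hz)
    points measured from the start of the utterance."""
    points, start = [], 0.0
    for _name, duration, targets in phones:
        for pos_pct, hz in targets:
            points.append((start + duration * pos_pct / 100.0, hz))
        start += duration
    return points


def pitch_at(points, t_ms):
    """Evaluate the piecewise linear pitch curve at t_ms.

    Before the first target and after the last one, the nearest value
    is held (an assumption made for this sketch).
    """
    if not points:
        raise ValueError("no pitch targets")
    if t_ms <= points[0][0]:
        return points[0][1]
    for (t0, f0), (t1, f1) in zip(points, points[1:]):
        if t_ms <= t1:
            return f0 + (f1 - f0) * (t_ms - t0) / (t1 - t0)
    return points[-1][1]
```

The curve simply runs through the unvoiced "b" of bonjour, which
carries no target of its own.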

The data on each line are separated by blank characters or tabs.
Comments can optionally be introduced in command files, starting with
a semicolon ';'. This default can be overridden with the -c option of
the command line.

Another special escape sequence, ';;', allows the user to introduce
commands in the middle of .pho files, as described below. This escape
sequence is also affected by the -c option.
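A reader of .pho files therefore has to tell three kinds of lines
apart. The Python sketch below classifies a line given the comment
character. It assumes that, when -c changes the comment character, the
command escape becomes that character doubled (e.g. '!!' for -c !);
this README only says the escape "is affected" by -c. The function name
is hypothetical.

```python
def classify_line(line, comment_char=";"):
    """Classify a .pho line as ('command', text), ('comment', text),
    ('blank', '') or ('phone', text).

    The command escape is assumed to be the comment character doubled
    (';;' by default).
    """
    s = line.strip()
    if not s:
        return ("blank", "")
    if s.startswith(comment_char * 2):
        return ("command", s[2 * len(comment_char):].strip())
    if s.startswith(comment_char):
        return ("comment", s[len(comment_char):].strip())
    return ("phone", s)
```

Note that the command test must come before the comment test, since
every command line also starts with the comment character.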

5.2 Changing the Freq Ratio or Time Ratio
-----------------------------------------

A command escape sequence containing a line like "T=xx" sets the time
ratio to xx; the same result is obtained on the fundamental frequency
by replacing T with F, as in:

;; T = 1.2
;;F=0.8

5.3 Renaming phonemes in a set
------------------------------

Command escape sequences may also define renaming tables for the
phoneme set. A line like:

;; RENAME A my_a

tells the synthesizer that the phoneme previously called A is now
called my_a. This facility is provided to make your life easier when
your Natural Language Processing unit does not comply with our SAMPA
alphabet. The only limitation is that phoneme names cannot contain
blank characters.

We suggest that you don't mix renaming commands and true .pho files;
for example, group all your rename commands in a '.set' file, and then
call:

mbrola fr1/fr1 fr1.set command1.pho command2.pho output.wav

WARNING: chained renaming can lead to name collisions, as in:
;; RENAME y u
;; RENAME u ou

THIS GENERATES AN ERROR BECAUSE OF NAME COLLISIONS (old y and u would
both be named ou)

which should be written:
;; RENAME u ou
;; RENAME y u

When cycles in renaming can't be avoided, as in:
;; RENAME # _
;; RENAME _ #

you should write:

;; RENAME # temp
;; RENAME _ #
;; RENAME temp _

Once the renaming has occurred, there is absolutely NO PERFORMANCE
DROP related to it, so use it rather than a pre-processor.
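The ordering rules above can be checked with a small model of the
renaming table. In the Python sketch below, each RENAME acts on the
symbol currently in use and fails if the new name is already taken.
Raising the error at the offending RENAME line is an assumption about
when mbrola detects the collision; the RenameTable class is
hypothetical.

```python
class RenameTable:
    """Hypothetical model of ';; RENAME old new' commands.

    Each command renames the symbol currently called `old`; it fails if
    `new` is already used by another phoneme (a name collision).
    """

    def __init__(self, phonemes):
        # Maps the current input symbol -> original database phoneme.
        self.current = {p: p for p in phonemes}

    def rename(self, old, new):
        if old not in self.current:
            raise KeyError(f"unknown phoneme {old!r}")
        if new != old and new in self.current:
            raise ValueError(f"name collision: {new!r} is already in use")
        self.current[new] = self.current.pop(old)

    def lookup(self, symbol):
        """Database phoneme behind a symbol used in .pho files."""
        return self.current[symbol]
```

With this model, "RENAME y u" fails while u is still taken, whereas
"RENAME u ou" followed by "RENAME y u" succeeds, and the three-step swap
through temp exchanges # and _.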

Before renaming anything as # check the paragraph below!

5.4 Flush the output stream
---------------------------

Note, finally, that the synthesizer outputs chunks of synthetic speech
determined as sections of the piecewise linear pitch curve. Phones
inside a section of this curve are synthesized in one go. The last one
of each chunk, however, cannot be properly synthesized until the next
phone is known (since the program uses diphones as base speech units).
When using mbrola with pipes, this may be a problem. Imagine, for
instance, that mbrola is used to create a pipe-based speaking clock on
an HP:

speaking_clock | mbrola - -.au | splayer

which tells the time, say, every 30 seconds. The last phone of each
time announcement will only be synthesized when the next announcement
starts. To bypass this problem, mbrola accepts a special command phone,
"#", which flushes the synthesis buffer.

This default symbol can be replaced by another one with the command:

;; FLUSH new_flush_symbol
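The buffering problem can be illustrated with a simplified model: the
most recent phone stays pending until the following phone, or the flush
symbol, arrives. The Python sketch below ignores the pitch-curve
chunking described above and only shows why "#" is needed at the end
of each announcement; the FlushBuffer class is hypothetical.

```python
class FlushBuffer:
    """Simplified model of mbrola's output buffering.

    A phone can only be finished once the next phone is known (diphones
    span phone pairs), so the latest phone stays pending until another
    phone or the flush symbol arrives.
    """

    def __init__(self, flush_symbol="#"):
        self.flush_symbol = flush_symbol  # changeable with ';; FLUSH x'
        self.pending = None

    def feed(self, phone):
        """Return the phones that can be synthesized after `phone` arrives."""
        done = [] if self.pending is None else [self.pending]
        self.pending = None if phone == self.flush_symbol else phone
        return done
```

Feeding "_", "b", "o~" releases "_" and "b" but keeps "o~" waiting;
only the flush symbol releases it without a following phone.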

Limitations of the program
--------------------------

1. There may be up to 20 pitch targets in each phone, although three
or four are sufficient to copy natural prosody. We have set a higher
limit so as to enable the use of MBROLA to produce synthetic singing
voices, in which case long vowels with vibrato may require a large
number of pitch targets.

2. Phones can be synthesized with a maximum duration which depends on
the fundamental frequency with which they are produced. The higher the
frequency, the shorter the maximum duration. For a frequency of 133 Hz,
the maximum duration is 7.5 sec. For a frequency of 66.5 Hz, it is 15
sec. For a frequency of 266 Hz, it is 3.75 sec.

3. Although pitch targets are optional, the synthesizer will refuse to
produce sequences of more than 250 phones with no pitch information.
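These limits can be checked before sending a .pho file to mbrola. The
three durations quoted in point 2 all give the same product of duration
and frequency (7.5 s x 133 Hz = 15 s x 66.5 Hz = 3.75 s x 266 Hz =
997.5, i.e. roughly 1000 pitch periods), so the Python sketch below
assumes the limit is inversely proportional to F0; extrapolating beyond
the quoted values, and using the highest target of a phone as "its"
frequency, are assumptions. All names are hypothetical.

```python
MAX_TARGETS_PER_PHONE = 20
MAX_UNPITCHED_RUN = 250


def max_duration_s(f0_hz):
    """Inverse-proportional fit to the README's figures:
    133 Hz -> 7.5 s, 66.5 Hz -> 15 s, 266 Hz -> 3.75 s."""
    return 7.5 * 133.0 / f0_hz


def check_limits(phones):
    """Return a list of problems for phones given as
    (name, duration_ms, [(pos_pct, hz), ...])."""
    problems, run = [], 0
    for i, (name, duration_ms, targets) in enumerate(phones):
        if len(targets) > MAX_TARGETS_PER_PHONE:
            problems.append(f"phone {i} ({name}): {len(targets)} pitch targets")
        if targets:
            run = 0
            # Conservative choice: the highest target gives the shortest limit.
            f0 = max(hz for _pos, hz in targets)
            if duration_ms / 1000.0 > max_duration_s(f0):
                problems.append(f"phone {i} ({name}): too long for {f0} Hz")
        else:
            run += 1
            if run == MAX_UNPITCHED_RUN + 1:
                problems.append(
                    f"phone {i}: more than {MAX_UNPITCHED_RUN} phones without pitch")
    return problems
```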

--------------------------------------------------------------
6.0 Joining the MBROLA project as a user 
--------------------------------------------------------------

For convenience, we have defined two mailing lists :

* mbrola-interest@tcts.fpms.ac.be :  a forum for  MBROLA questions and
issues. It is   used  by the   maintainers of  the mbrola  project  to
announce new releases, bug fixes,  new voices and languages, and other
information of interest to all MBROLA  users.  Users who want to share
.pho files  or free applications running on  top of mbrola should send
mail to mbrola-interest.

It is in your interest, as a user, to subscribe to the mbrola-interest
mailing list by sending an e-mail to:

          mbrola-interest-request@tcts.fpms.ac.be

with the word  'subscribe' in either the header  or the main  text. To
unsubscribe, just send another mail with 'unsubscribe'.

BUGS
----

If you detect a bug, or if you find an input  for which the quality of
the speech provided by mbrola is  not as good  as usual, first consult
the  FAQ  file  from  the  MBROLA Project  homepage,   which will   be
frequently updated.

If this is  of no help, send  a kind mail to mbrola@tcts.fpms.ac.be in
which you include  the .pho file with  which  the problem  appears and
mention your machine architecture.

NEW DATABASES
-------------

If you want to participate in the mbrola project by providing a
diphone database (i.e. a set of sample files with one example of each
diphone in your language), refer to the mbrola WWW homepage, or send
an email to: mbrola@tcts.fpms.ac.be.

APPLICATIONS
------------

If you have used mbrola to build speaking applications on top of it
(like talking clocks, talking agendas, talking tools for handicapped
persons, etc.) and want to make them available to the community (for
free, of course, and for non-commercial, non-military applications, as
imposed by the mbrola license agreement), just make an announcement to
the mbrola mailing list:

          mbrola-interest@tcts.fpms.ac.be. 

COMMERCIAL VERSION
------------------

If you are interested in the commercial version of mbrola (source code
available), send an email to: mbrola@tcts.fpms.ac.be 

FEEDBACK
--------

If you simply find this initiative useful, please drop us a note at
mbrola@tcts.fpms.ac.be. We have spent a lot of time providing you with
this program, and we would like to get some feedback in return.

Don't forget, either, to mention the MBROLA reference paper :

T. DUTOIT, V. PAGEL, N. PIERRET, F. BATAILLE, O. VAN DER VRECKEN
"The MBROLA Project: Towards a Set of High-Quality Speech
Synthesizers Free of Use for Non-Commercial Purposes" 
Proc. ICSLP'96, Philadelphia, vol. 3, pp. 1393-1396.

or,  for  a more  general reference  to  Text-To-Speech synthesis, the
book:

An Introduction to Text-To-Speech Synthesis,
T. DUTOIT, Kluwer Academic Publishers, Dordrecht 
Hardbound, ISBN 0-7923-4498-7
April 1997, 312 pp. 

in any scientific publication referring to work for which this program
has been used.

--------------------------------------------------------------
7.0 Joining the MBROLA project as a database provider
--------------------------------------------------------------

One of the biggest interests of the MBROLA project (and definitely its
most  original aspect) lies in its  ability to provide an ever growing
set of languages/voices to users.

To achieve this goal, the MBROLA project has itself been organized so
as to encourage other research labs or companies to share their
diphone databases.

The terms of this sharing policy can be summarized as follows :

1. We shall only  use your database to  adapt it to the mbrola format,
and destroy the copy when this is done.

2.   The resulting mbrola diphone  database will  be copyright Faculte
Polytechnique de Mons.  Non-commercial   use  of the database in   the
framework of  the MBROLA project    will be automatically  granted  to
Internet users. In return, we shall send you a license agreement which
will  transfer  all our  commercial  rights on  the  database  to you,
provided the database is used with and only with the MBROLA program.

3. All these  details will be fixed by  some official agreement before
you send us anything.

If you want to create a database from scratch
---------------------------------------------

First, you should be aware that recording a diphone database is not a
trivial operation. If it is not performed carefully, the result can be
disappointing. FR1, for instance, required about one month of work,
even with the help of some efficient laboratory tools for signal
recording and editing. What is more, some phonetic knowledge of the
targeted language is necessary to create the initial corpus.

So if you just  think of designing a new  diphone database as a  game,
forget it.

If, on the contrary, you are willing to spend some time to provide the
MBROLA community with a new language or voice, or  if you already have
a diphone database and wish to share it  in mbrola format (and receive
in return the rights  for  any commercial  exploitation of the  mbrola
diphone database we will create for you), welcome here.

If you still want to create a database from scratch
---------------------------------------------------

Creating a database is typically achieved in four steps: 

   * Creating a text corpus
   * Recording the corpus
   * Segmenting the speech corpus
   * Equalizing diphones

Creating a text corpus
------------------------

Diphones are speech units that begin in the middle of the stable state
of a  phone and end  in the  middle of the   following one. Their main
interest in synthesis is that   they minimize concatenation  problems,
since they  involve  most  of  the transitions    and co-articulations
between phones,  while requiring an   affordable amount of  memory, as
their number  remains relatively small (as  opposed to other synthesis
units such as half-syllables or triphones).

Hence, the first step to build a diphone database consists of fixing a
list of all the phones of a language. Notice  that phones are acoustic
instances   of  phonemes.  Phonemes   are  themselves   defined  on  a
functional, linguistic level.

Obtaining a list of phones from a list of phonemes requires enumerating
allophones, i.e. acoustic variants of some phonemes that differ
significantly from the standard one, mostly due to co-articulation
constraints. Although it is not necessary to account for all
allophonic variations to build an intelligible synthesizer, the
naturalness of synthetic speech may be affected if too few allophones
are considered. In FR1, for example, we did not consider allophones at
all. As a result, some allophonic phenomena, such as the devoicing of
/R/ when followed or preceded by unvoiced plosives, are only partially
accounted for.

When a complete list of phones has emerged, including allophones if
possible, a corresponding list of diphones is immediately obtained,
and a list of words is carefully compiled in such a way that each
diphone appears at least once (twice is better, for safety).
Unfavorable positions, like inside stressed syllables or in strongly
reduced (i.e. over-co-articulated) contexts, should be excluded. One
typically uses carrier sentences into which the word containing the
diphone considered is inserted. Notice that many diphones only appear
across word boundaries (i.e. not within single words), and some
diphones never appear at all. Hence, the task of creating a text
corpus which contains all existing ones is not trivial.
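Checking a candidate corpus against the diphone list is easy to
automate. The Python sketch below enumerates every ordered pair of
phones and counts how often each pair occurs in a corpus given as
phone sequences (for instance transcribed carrier sentences, with
silence "_" at both ends). The transcription format and function names
are hypothetical; since some pairs never occur in a language, the
"missing" set has to be reviewed by hand rather than filled blindly.

```python
from collections import Counter
from itertools import product


def all_diphones(phones):
    """Every ordered pair of phones (including silence, if listed)."""
    return set(product(phones, repeat=2))


def diphone_coverage(corpus, phones, min_count=2):
    """Count diphones in a corpus of phone sequences and return the
    candidate diphones seen fewer than `min_count` times."""
    counts = Counter()
    for seq in corpus:
        counts.update(zip(seq, seq[1:]))
    missing = {d for d in all_diphones(phones) if counts[d] < min_count}
    return counts, missing
```

The default min_count of 2 follows the advice above that two examples
of each diphone are safer than one.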

Recording the corpus
--------------------

The  corpus is   then read,  by a   professional speaker if  possible,
digitally recorded, and stored in digital format.

IMPORTANT : In order for the mbrola resynthesis operation to achieve
best results, the corpus should be read with the most monotonic
intonation possible (just like when reading a long and boring
enumeration). Even word endings should keep their fundamental
frequency constant. Since this is a totally unnatural way of reading a
text, the speaker should train before starting the recording session.

NOTA BENE : If you  already have a diphone database  which you want to
make  available in mbrola  format, contact the author,  even if it has
not been recorded  with constant pitch.  It is  very  likely that your
database can be used anyway.

It is best to use high quality audio devices (microphone, pre-amp, A/D
converter). The sound recording tools provided with many low-price
commercial boards, for example, should be avoided, as they produce
undesired recording noise. To roughly test the quality of your
recording system, just plug the microphone in, adjust the recording
level, hold your breath, and record. Or, if you can, short-circuit the
microphone input of your system, and record. Then examine the recorded
noise. In the case of FR1, the noise level only corrupted the last
three bits of our data, leaving thirteen significant bits.
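The "hold your breath and record" test can be turned into a number.
The Python sketch below takes the samples of such a silent recording
and reports how many low-order bits the noise occupies, i.e. the bits
needed to represent its peak absolute value; with 16-bit samples, a
noise peak of 7 gives 3 corrupted bits and 16 - 3 = 13 significant
bits, as for FR1. Using the peak rather than, say, an RMS level is a
choice made for this sketch, and the function names are hypothetical.

```python
def corrupted_bits(silence_samples):
    """Low-order bits occupied by noise in a 'silent' integer recording:
    the number of bits needed to represent the peak absolute value."""
    peak = max(abs(int(s)) for s in silence_samples)
    return peak.bit_length()


def significant_bits(silence_samples, word_bits=16):
    """Bits left above the noise floor for word_bits-bit samples."""
    return word_bits - corrupted_bits(silence_samples)
```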

Another important type of noise to avoid is ambient noise and
reverberation. In particular, the recording should be free of
low-frequency noises, due to trucks passing in the neighborhood for
instance. Most of the time you won't hear them, but your microphone
will almost certainly pick them up, especially if it is a high quality
one. The best way to avoid them is to install your recording system
inside a professional soundproof room. For FR1, this is what we did.

Segmenting the corpus
---------------------

Once the corpus has been recorded, all diphones must be spotted,
either manually with the help of signal visualization tools, or
automatically with segmentation algorithms whose decisions are checked
and corrected interactively. A diphone database is finally created,
which centralizes the results in the form of: the names of the
diphones, the related waveforms, their durations, and internal
sub-splittings. In particular, the position of the border between
phones should be stored, so as to be able to modify the duration of
one half-phone without affecting the length of the other one.

NOTA BENE  : For optimal  results  with  mbrola, it   is best to  keep
diphones in   context.   The MBROLA  resynthesis   operation,  indeed,
includes  some  pitch analysis, which  itself   achieves more accurate
results when, say, 50 ms of  speech are kept at  the left and right of
each diphone.
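The information listed above can be pictured as one record per
diphone. The Python sketch below is a hypothetical layout, not the
mbrola database format: it keeps the waveform with some context on
each side, marks where the diphone starts and ends and where the phone
border lies, and derives the two half-phone durations from those
marks. At 16 kHz, the 50 ms of context suggested above is 800 samples
per side.

```python
from dataclasses import dataclass


@dataclass
class DiphoneEntry:
    """Hypothetical record for one segmented diphone (not the mbrola
    file format). Indices refer to `samples`, which includes context."""
    name: str        # e.g. "o~-Z"
    samples: list    # waveform, with left/right context kept
    start: int       # first sample of the diphone proper
    border: int      # phone boundary inside the diphone
    end: int         # one past the last sample of the diphone
    rate: int = 16000

    def half_durations_ms(self):
        """Durations of the left and right half-phones, in ms."""
        to_ms = 1000.0 / self.rate
        return ((self.border - self.start) * to_ms,
                (self.end - self.border) * to_ms)


def context_samples(context_ms=50, rate=16000):
    """Samples of context to keep on each side of a diphone."""
    return round(context_ms * rate / 1000)
```

Storing the border separately is what allows one half-phone to be
stretched without touching the other.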

Equalizing diphones
-------------------

Since diphones to  be chained  up have  generally been extracted  from
different words, that is in   different phonetic contexts, they  often
present amplitude   and  timbre mismatches.    Even  in  the case   of
stationary  vocalic  sounds,  for instance,   a  rough  sequencing  of
diphones typically leads to audible discontinuities.

Amplitude mismatches can be dealt with, to some extent, as early as the
construction of the diphone database, thanks to equalization. This
operation smoothly modifies the energy levels at the beginning and at
the end of segments, in such a way as to eliminate amplitude
mismatches (by setting the energy of all the phones of a given phoneme
to their average value).
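The averaging step can be sketched as follows: measure the energy of
each recorded instance of a phoneme, take the mean, and compute the
gain that brings each instance to it. The Python sketch below computes
only these target gains (energy taken as the mean squared sample);
how the gain is ramped in smoothly at segment boundaries, as described
above, is not shown. Function names are hypothetical.

```python
import math


def mean_energy(samples):
    """Average power (mean of squared samples) of one phone instance."""
    return sum(s * s for s in samples) / len(samples)


def equalization_gains(instances):
    """For recorded instances (phones) of one phoneme, return one gain per
    instance that brings its energy to the average energy of all instances."""
    energies = [mean_energy(x) for x in instances]
    target = sum(energies) / len(energies)
    # Energy scales with the square of amplitude, hence the square root.
    return [math.sqrt(target / e) if e > 0 else 1.0 for e in energies]
```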

In contrast, timbre  conflicts are better  tackled at run-time, by the
mbrola algorithm itself.

Notice, however, that equalization is optional, as the mbrola
resynthesis operation (the one we shall perform to adapt your database
to the mbrola format) also includes some equalization facilities.

IMPORTANT
---------

If you want to build a new diphone database, please contact the author
first.  He will help   you as much  as  he can, by providing  phonetic
information if available for instance.

In all cases, make a first dummy  trial : create  a small corpus for a
few diphones, record them, segment them, equalize them if you can, and
send the result directly to the author.  He will  test your data, tell
you how good it is, and what should be done to make it better.

If you want to share an existing database
-----------------------------------------

Read the information above to see if your database has been designed
and recorded correctly. Contact the author (see below) anyway.

--------------------------------------------------------------
8.0 Acknowledgments
--------------------------------------------------------------

I  would like to thank   Vincent Pagel (Mons  / BE)  for his intensive
programming, testing, and debugging of this program, and for all sorts
of fruitful discussions.

Sam  Przyswa (Paris/FR), Fred  Englert (Frankfurt/DE), Arnaud Gaudinat
(University of  Geneva,  CH),  Cyrille Mastchenko  (Paris/FR), Michael
C. Thornburgh (USA),  Eric  Keller (University of  Lausanne,CH), Bruno
Langlois (Quebec/CA),  Christophe  M.  Vallat  (Domerat/FR), Cristiano
Verondini (Bologna/Italy), and Gerald Kerma (G'K2 Vaugrigneuse/FR) for
their help in the compilation of MBROLA.  

Arnaud Gaudinat    (Lausanne/CH), Vincent Pagel,   Michael   M.  Cohen
(University of California - Santa Cruz),  and Patrick Bouffer (France)
have arranged mirror sites.

Let's greet  our  pioneer database  providers: Marian   Boldea, Denis
Costa,  Arthur Dirksen, Thierry  Dutoit,  Fred Englert, Vincent Pagel
and  the team at  University   Autonoma of Barcelona and   Alistair
Conkie! May they be thanked for their work.

Stephen Isard and  Alistair Conkie have  provided the Freespeech TTS!! 
Alan Black and Paul Taylor have supported the  Mbrola Project in their
great Festival multilingual TTS Project. 

Fabrice Malfrere (Mons/BE), who has developed an efficient speech
alignment program for Windows (distributed on the mbrola site).

Alain Ruelle (Mons/BE), who has developed the MBRPlay DLL and the
Mbroli interactive pho file player for Windows.

Last but not least,  I am also greatly  indebted to  Francois Bataille
(Mons/BE) for having supported the creation of this internet project.

--------------------------------------------------------------
9.0 Contacting the author
--------------------------------------------------------------

Dr Thierry Dutoit

Faculte Polytechnique de Mons, TCTS Lab,
31, bvd Dolez, B-7000 Mons, Belgium. 
tel : /32/65/374133
fax : /32/65/374129
e-mail: mbrola@tcts.fpms.ac.be, for general information, 
questions on the installation of software and databases.


Dafydd Gibbon, Sat Oct 17 18:27:56 CEST 1998