MBROLA
Version 3.00, Thu Feb 26 14:48:28 MET 1998
--------------------------------------------------------------
Table of Contents
--------------------------------------------------------------
1.0 License
2.0 A brief description of MBROLA
3.0 Distribution
4.0 Installation and Tests
5.0 Format of input and output files - Limitations
6.0 Joining the MBROLA project as a user
7.0 Joining the MBROLA project as a database provider
8.0 Acknowledgments
9.0 Contacting the author
--------------------------------------------------------------
1.0 License
--------------------------------------------------------------
This program and object code is being provided to "you", the licensee,
by Thierry Dutoit, the "author", under the following license, which
applies to any program, object code or other work which contains a
notice placed by the copyright holder saying it may be distributed
under the terms of this license. The "program", below, refers to any
such program, object code or work.
By obtaining, using and/or copying this program, you agree that you
have read, understood, and will comply with these terms and
conditions:
Terms and conditions for the distribution of the program
--------------------------------------------------------
This program may not be sold or incorporated into any product which is
sold without prior permission from the author.
When no charge is made, this program may be copied and distributed
freely, provided that this notice is copied and distributed with
it. Each time you redistribute the program (or any work based on the
program), the recipient automatically receives a license from the
original licensor to copy or distribute the program subject to these
terms and conditions. You may not impose any further restrictions on
the recipients' exercise of the rights granted herein. You are not
responsible for enforcing compliance by third parties to this License.
If you wish to incorporate the program into other free programs whose
distribution conditions are different, write to the author to ask for
permission.
If, as a consequence of a court judgment or allegation of patent
infringement or for any other reason (not limited to patent issues),
conditions are imposed on you (whether by court order, agreement or
otherwise) that contradict the conditions of this license, they do not
excuse you from the conditions of this license. If you cannot
distribute so as to satisfy simultaneously your obligations under this
license and any other pertinent obligations, then as a consequence you
may not distribute the program at all. For example, if a patent
license would not permit royalty-free redistribution of the program by
all those who receive copies directly or indirectly through you, then
the only way you could satisfy both it and this license would be to
refrain entirely from distribution of the program.
Terms and conditions on the use of the program
----------------------------------------------
Permission is granted to use this software for non-commercial,
non-military purposes, with and only with the voice and language
databases made available by the author from the MBROLA project www
homepage:
http://tcts.fpms.ac.be/synthesis
In return, the author asks you to mention the MBROLA reference paper:
T. DUTOIT, V. PAGEL, N. PIERRET, F. BATAILLE, O. VAN DER VRECKEN
"The MBROLA Project: Towards a Set of High-Quality Speech
Synthesizers Free of Use for Non-Commercial Purposes"
Proc. ICSLP'96, Philadelphia, vol. 3, pp. 1393-1396.
or, for a more general reference to Text-To-Speech synthesis, the
book:
An Introduction to Text-To-Speech Synthesis,
T. DUTOIT, Kluwer Academic Publishers, Dordrecht
Hardbound, ISBN 0-7923-4498-7
April 1997, 312 pp.
in any scientific publication referring to work for which this program
has been used.
Disclaimer
----------
THIS SOFTWARE CARRIES NO WARRANTY, EXPRESSED OR IMPLIED. THE USER
ASSUMES ALL RISKS, KNOWN OR UNKNOWN, DIRECT OR INDIRECT, WHICH INVOLVE
THIS SOFTWARE IN ANY WAY. IN PARTICULAR, THE AUTHOR DOES NOT TAKE ANY
COMMITMENT IN VIEW OF ANY POSSIBLE THIRD PARTY RIGHTS.
--------------------------------------------------------------
2.0 A brief description of MBROLA
--------------------------------------------------------------
MBROLA v3.00 is a speech synthesizer based on the concatenation of
diphones. It takes a list of phonemes as input, together with prosodic
information (duration of phonemes and a piecewise linear description
of pitch), and produces speech samples on 16 bits (linear), at the
sampling frequency of the diphone database.
It is therefore NOT a Text-To-Speech (TTS) synthesizer, since it does
not accept raw text as input. In order to obtain a full TTS system,
you need to use this synthesizer in combination with a text processing
system that produces phonetic and prosodic commands.
We maintain a web page with pointers to such freely available systems:
http://tcts.fpms.ac.be/synthesis/mbrtts.html
This software is the heart of the MBROLA project, the aim of which is
to obtain a set of speech synthesizers for as many languages as
possible, free to use for non-commercial applications.
The terms of this project can be summarized as follows :
After some official agreement between the author of this software and
the owner of a diphone database, the database is processed by the
author and adapted to the mbrola format, for free. The resulting
mbrola diphone database is made available for non-commercial use as
part of the MBROLA project. Commercial rights on the mbrola database
remain with the database provider, for exclusive use with the mbrola
software.
The ultimate goal of this project is to boost academic research on
speech synthesis, and particularly on prosody generation, known as one
of the biggest challenges facing Text-To-Speech synthesizers in
the years to come.
More details can be found at the MBROLA project homepage :
http://tcts.fpms.ac.be/synthesis
The synthesizer uses a synthesis method that is itself known as MBROLA.
--------------------------------------------------------------
3.0 Distribution
--------------------------------------------------------------
This distribution of mbrola contains the following files :
MBROLA.exe or MBROLA : An executable file of the synthesizer itself
                       (depends on the computer that will run it)
README.TXT           : This file
The synthesizer requires an MBROLA language/voice database to run
properly. A French male voice sampled at 16 kHz has been made available
by the author. Additional languages and voices are or will be
available in the context of the MBROLA project.
The main difference between the 2.0 and 3.0 releases is that version
3.0 includes a powerful online decoder that allows MBROLA to use
much smaller diphone databases than before (generally about 1 MB).
Please consult the MBROLA project homepage:
http://tcts.fpms.ac.be/synthesis
--------------------------------------------------------------
4.0 Installation and Tests
--------------------------------------------------------------
The following computers/OS are currently supported :
SUN Sparc 5/S5R4 (Solaris2.4)
HPUX9.0 and HPUX10.0 tested on :
HP-UX A.09.05 A 9000/712
HP-UX A.09.05 A 9000/715
HP-UX A.09.01 E 9000/755
HP-UX B.10.01 A 9000/710
Not tested on HP9000/735, but should work properly.
VAX/VMS V6.2 (V5.5-2 won't work)
Tested on :
VAXstation 3100-M76
VAXstation 4000-90A
VAXstation 4000-60
DECALPHA(AXP)/VMS 6.2
Tested on :
DEC 3000 - M600
DEC 2000 Model 300
AlphaStation 200 4/233
AlphaStation 200 4/166
IBM RS6000 Aix 4.12
PC486/DOS6 (but other PCs/DOSs should do, too)
PC486/WIN31
PC486/WIN95
PC/LINUX 1.2.11
PCPentium120/Solaris2.4
OS/2
BeBox
Please let us know when mbrola works on a machine not listed
here. A special DLL version is distributed for PC Windows to allow
direct audio output; check the MBROLA site for it.
See the MBROLA Homepage if your computer or OS is not supported yet.
Assuming you have copied the right .zip file, create a directory
mbrola (although this is not critical), copy the mbrXXX.zip file into
it (in which XXX stands for a version number), and unzip the file:
unzip mbrXXX.zip (or pkunzip on PC/DOS)
You are now ready to synthesize your first French words.
First try: mbrola
to see the terms and conditions on the use of this software.
Then try: mbrola -h
to get some help on how to use the software:
> USAGE: mbrola [-e] [-i] [-c CC] [-v VR] [-f FR] [-t TR] [-l VF] [-s] database pho_file* output_file
>
>A - instead of pho_file or output_file means stdin or stdout
>Extension of output_file ( raw, au, wav, aiff ) tells the wanted audio format
>
>e = No fatal error on unkown diphone
>i = Print the database information if any
>CC= Comment Char, escape sequence for a comment
>VR= Volume Ratio, float ratio applied to ouput samples
>FR= Frequency Ratio, float ratio applied to pitch targets
>TR= Time Ratio, float ratio applied to phone durations
>VF= Voice Freq, target freq for voice quality
>s = Disable spectral smoothing (database debugging purpose)
Now in order to go further, you need to get a version of an MBROLA
language/voice database from the MBROLA project homepage. Let us
assume you have copied the FR1 database and referred to the
accompanying fr1.txt file for its installation.
Then try: mbrola fr1/fr1 fr1/TEST/bonjour.pho bonjour.wav
This command uses the format:
mbrola diphone_database command_file1 command_file2 ... output_file
and creates a sound file for the word 'bonjour'.
The output file consists of signed 16-bit integers, one per sample,
at the sampling frequency of the MBROLA voice/language database
(16 kHz for Fr1, the diphone database supplied by the author of
MBROLA). MBROLA can produce different audio file formats (.au, .wav,
.aiff, .aif, and .raw) depending on the output_file extension. If the
extension is not recognized, the format is RAW (no header).
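Since a RAW file is nothing but the samples, it can be wrapped in a
header afterwards. The sketch below does this with Python's standard
wave module. It assumes the samples are mono and little-endian (the
byte order of a RAW file may depend on the machine that produced it);
the function name raw_to_wav_bytes is only illustrative.

```python
import io
import wave


def raw_to_wav_bytes(pcm, sample_rate=16000):
    """Wrap headerless signed 16-bit samples in a RIFF WAV container.

    Assumptions: one channel, little-endian samples, and a sampling
    rate equal to that of the diphone database (16 kHz for Fr1).
    """
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as out:
        out.setnchannels(1)            # mono (assumed)
        out.setsampwidth(2)            # 16 bits = 2 bytes per sample
        out.setframerate(sample_rate)
        out.writeframes(pcm)
    return buffer.getvalue()


if __name__ == "__main__":
    # Four silent samples, just to show the call.
    wav = raw_to_wav_bytes(b"\x00\x00" * 4)
    print(len(wav), "bytes, starts with", wav[:4])
```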
To display information about the phoneme set used by the database,
type:
mbrola -i fr1/fr1
It displays the phonetic alphabet as well as copyright information
about the database.
Option -e makes MBROLA ignore wrong or missing diphone sequences
(they are replaced by silence), which can be quite useful when
debugging your TTS. It can also be triggered from the phonetic file
with the escape sequence:
;; E=OFF
to ignore missing diphones, or
;; E=ON
to reactivate the checking.
Optional parameters let you shorten or lengthen synthetic speech and
transpose it by providing optional time and frequency ratios:
mbrola -t 1.2 -f 0.8 -v 0.7 fr1/fr1 TEST/bonjour.pho bonjour.wav
for instance, will result in a RIFF Wav file bonjour.wav 1.2 times
longer than the previous one (slower rate), and containing speech in
which all fundamental frequency values have been multiplied by 0.8
(sounds lower). You can also set the values of these coefficients
directly in a .pho file by adding special escape sequences such as :
;; F=0.8
;; T=1.2
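To make the effect of the two ratios concrete, here is a small sketch
of what they do to one phone, going by the descriptions in the help
text above (TR applies to durations, FR to pitch targets). It assumes
that target positions, being percentages of the phone duration, are
left untouched; apply_ratios is a hypothetical helper, not part of
MBROLA.

```python
def apply_ratios(duration_ms, targets, time_ratio=1.0, freq_ratio=1.0):
    """Scale one phone the way -t and -f are described to work.

    targets is a list of (position_percent, pitch_hz) pairs.
    Positions are relative to the phone, so only the duration and
    the pitch values change (an assumption of this sketch).
    """
    new_duration = duration_ms * time_ratio
    new_targets = [(pos, hz * freq_ratio) for pos, hz in targets]
    return new_duration, new_targets


if __name__ == "__main__":
    # A 100 ms phone with one 200 Hz target in its middle.
    print(apply_ratios(100, [(50, 200.0)], time_ratio=1.2, freq_ratio=0.8))
```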
You can change the voice characteristics with the -l parameter. If the
sampling rate of your database is 16000, indicating -l 18000 allows
you to shorten the vocal tract by a ratio of 16/18 (a child's or a
woman's voice, depending on the voice you're working on). With -l
10000, you can lengthen the vocal tract by a ratio of 16/10 (namely
the voice of a Troll).
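The ratio in question is simply the database sampling rate divided by
the value given to -l, as in the 16/18 example. A worked version of
that arithmetic (the function name is illustrative only):

```python
def vocal_tract_ratio(database_rate_hz, voice_freq_hz):
    """Ratio by which the vocal tract length is scaled by -l.

    Below 1 the tract is shortened (higher, thinner voice);
    above 1 it is lengthened (deeper voice).
    """
    return database_rate_hz / voice_freq_hz


if __name__ == "__main__":
    for voice_freq in (18000, 16000, 10000):
        ratio = vocal_tract_ratio(16000, voice_freq)
        print("-l %d -> ratio %.3f" % (voice_freq, ratio))
```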
Option "-v" gives a VolumeRatio which multiplies each output sample.
The -c option lets you specify which symbol will be used as an escape
sequence for comments and commands in .pho files. The default value is
the semi-colon ';', but you may want to change this if your phonetic
alphabet uses this symbol, as in:
mbrola -c ! fr1/fr1 TEST/test1.pho test2.pho test.wav
A - instead of command_file or output_file means stdin or stdout. On
multitasking machines, it is easy to run the synthesizer in real time
to obtain audio output from the audio device, by using pipes.
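When mbrola is driven from another program rather than from a shell,
the same command line can be assembled programmatically. The sketch
below builds the argument list from the options documented above and,
optionally, feeds phonetic commands on stdin and collects .au data from
stdout. It assumes an executable called mbrola is on the search path;
both function names are hypothetical.

```python
import subprocess


def build_mbrola_command(database, pho_files, output, time_ratio=None,
                         freq_ratio=None, volume_ratio=None,
                         executable="mbrola"):
    """Return the argument list: options, database, pho files, output."""
    command = [executable]
    if time_ratio is not None:
        command += ["-t", str(time_ratio)]
    if freq_ratio is not None:
        command += ["-f", str(freq_ratio)]
    if volume_ratio is not None:
        command += ["-v", str(volume_ratio)]
    return command + [database] + list(pho_files) + [output]


def synthesize_to_bytes(database, pho_text, executable="mbrola"):
    """Send .pho text on stdin ('-') and read .au audio from stdout ('-.au').

    Requires mbrola and a database to be installed; not run here.
    """
    command = build_mbrola_command(database, ["-"], "-.au",
                                   executable=executable)
    result = subprocess.run(command, input=pho_text.encode("ascii"),
                            stdout=subprocess.PIPE, check=True)
    return result.stdout


if __name__ == "__main__":
    print(build_mbrola_command("fr1/fr1", ["TEST/bonjour.pho"],
                               "bonjour.wav", time_ratio=1.2,
                               freq_ratio=0.8, volume_ratio=0.7))
```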
Below are a number of machine dependent hints for best using mbrola.
On MSDOS/Windows or OS/2
------------------------
Type: mbrola fr1/fr1 TEST/bonjour.pho bonjour.wav
Then you can play the RIFF Wav file with the Windows sound utility.
On OS/2, pipes may be used just as shown below.
On modern Unix systems such as Solaris or HPUX or Linux
-------------------------------------------------------
mbrola fr1/fr1 TEST/bonjour.pho -.au | audioplay
where audioplay is your audio file player (the name varies with the
platform, e.g. splayer for HPUX).
If your audio player has problems with Sun .au files, try .wav or
.raw instead.
On Sun4 ( old audio interface )
-------------------------------
Those machines are now quite old and only provide a mu-law 8 kHz
output. A hack is:
mbrola fr1/fr1 input.pho - | sox -t raw -sw -r 16000 - -t raw -Ub -r 8000 - > /dev/audio
(providing you have the public domain sox utility developed by Ircam).
You should hear 'bonjour' without the need to create intermediate
files. Note that we strongly recommend that you DON'T use SOX, since
its resampling "algorithm" will permanently damage the sound.
Other solution: The UTILITY.ZIP file available from the MBROLA
homepage provides RAW2SUN which does this conversion.
On VAX or AXP workstations
--------------------------
To make it easier for users to find MBROLA, you should add the
following command to your system startup procedure:
$ DEFINE/SYSTEM/EXEC MBROLA_DIR disk:[dir]
where "disk:[dir]" is the name of the directory you created for the
MBROLA_DIR files. You could also add the following command to your
system login command procedure:
$ MBROLA :== $MBROLA_DIR:MBROLA.EXE
$ RAW2SUN :== $MBROLA_DIR:RAW2SUN.EXE
To use the decsound device:
$ MCR DECSOUND - volume 40 -play sound.au
See also the MBR_OLA.COM batch file in the UTILITY.ZIP file available
from the MBROLA Homepage if you cannot play 16-bit sound files on
your machine.
--------------------------------------------------------------
5.0 Format of input and output files - Limitations
--------------------------------------------------------------
5.1 Phoneme commands
--------------------
The input file bonjour.pho in the above example simply contains :
; bonjour
_ 51 25 114
b 62
o~ 127 48 170.42
Z 110 53.5 116
u 211
R 150 50 91
_ 91
This shows the format of the input data required by MBROLA. Each line
contains a phoneme name, a duration (in ms), and a series (possibly
none) of pitch targets composed of two float numbers each : the
position of the pitch target within the phoneme (in % of its
total duration), and the pitch value (in Hz) at this position.
In order to increase readability, it is also possible to enclose pitch
targets in parentheses. Hence, the first phoneme line of bonjour.pho
could be written :
_ 51 (25,114)
It tells the synthesizer to produce a silence of 51 ms, and to put a
pitch target of 114 Hz at 25% of 51 ms. Pitch targets define a
piecewise linear pitch curve. Notice that the pitch curve they
define is continuous; the program automatically drops pitch
information when synthesizing unvoiced phones.
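The piecewise linear curve can be pictured as follows: each target is
first placed on the time axis of the whole utterance, then the pitch at
any instant is interpolated between the two neighbouring targets. The
sketch below is only an illustration of that idea; in particular,
holding the pitch flat before the first and after the last target is
an assumption, not documented MBROLA behaviour.

```python
def absolute_targets(phones):
    """Place pitch targets on the utterance timeline.

    phones is a list of (name, duration_ms, [(percent, hz), ...]).
    Returns a list of (time_ms, hz) points in chronological order.
    """
    points = []
    start = 0.0
    for _name, duration, targets in phones:
        for percent, hz in targets:
            points.append((start + duration * percent / 100.0, hz))
        start += duration
    return points


def pitch_at(points, t_ms):
    """Linearly interpolate the pitch at time t_ms."""
    if not points:
        raise ValueError("no pitch targets")
    if t_ms <= points[0][0]:
        return points[0][1]          # before first target: held flat (assumed)
    for (t0, f0), (t1, f1) in zip(points, points[1:]):
        if t0 <= t_ms <= t1:
            if t1 == t0:
                return f1
            return f0 + (f1 - f0) * (t_ms - t0) / (t1 - t0)
    return points[-1][1]             # after last target: held flat (assumed)


if __name__ == "__main__":
    bonjour = [("_", 51, [(25, 114)]), ("b", 62, []),
               ("o~", 127, [(48, 170.42)])]
    curve = absolute_targets(bonjour)
    print(curve)
    print(pitch_at(curve, 100.0))
```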
The data on each line are separated by blank characters or
tabs. Comments can optionally be introduced in command files, starting
with a semi-colon ';'. This default can be overridden with the -c
option of the command line.
Another special escape sequence ';;' allows the user to introduce
commands in the middle of .pho files as described below. This escape
sequence is also affected by the -c option.
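The line format described above is simple enough to be read with a few
lines of code, which can be handy for checking the .pho files your own
front end produces. The following sketch is not MBROLA's parser: it
assumes that the command escape is the comment character doubled, that
a line holds either a comment, a command or a phone, and it does not
handle comments placed after a phone on the same line.

```python
import re


def parse_pho_line(line, comment_char=";"):
    """Classify one .pho line.

    Returns None for a blank line, ("comment", text),
    ("command", text) or ("phone", name, duration_ms, targets),
    where targets is a list of (position_percent, pitch_hz) pairs.
    """
    text = line.strip()
    if not text:
        return None
    command_prefix = comment_char * 2
    if text.startswith(command_prefix):
        return ("command", text[len(command_prefix):].strip())
    if text.startswith(comment_char):
        return ("comment", text[len(comment_char):].strip())
    parts = text.split(None, 1)      # blanks or tabs separate the fields
    name = parts[0]
    rest = parts[1] if len(parts) > 1 else ""
    # "(25,114)" and "25 114" are equivalent: drop the punctuation.
    numbers = [float(x) for x in re.sub(r"[(),]", " ", rest).split()]
    duration = numbers[0] if numbers else None
    pairs = numbers[1:]
    if len(pairs) % 2:
        raise ValueError("incomplete pitch target in: " + text)
    targets = list(zip(pairs[0::2], pairs[1::2]))
    return ("phone", name, duration, targets)


if __name__ == "__main__":
    for sample in ("; bonjour", "_ 51 (25,114)", "o~ 127 48 170.42",
                   ";; T = 1.2"):
        print(parse_pho_line(sample))
```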
5.2 Changing the Freq Ratio or Time Ratio
-----------------------------------------
A command escape sequence containing a line like "T=xx" sets the
time ratio to xx. The fundamental frequency ratio is set in the same
way by replacing T with F, as in:
;; T = 1.2
;;F=0.8
5.3 Renaming phonemes in a set
------------------------------
Command escape sequences may also define renaming tables for the
phoneme set. A line like:
;; RENAME A my_a
tells the synthesizer that the phoneme previously called A is now
called my_a. This facility is provided to make your life easier when
your Natural Language Processing unit does not comply with our SAMPA
alphabet. The only limitation is that the phoneme name can't contain
blank characters.
We suggest that you don't mix renaming commands and true .pho files;
for example, group all your rename commands in a '.set' file, and
then call:
mbrola fr1/fr1 fr1.set command1.pho command2.pho output.wav
WARNING: circular renaming can lead to name collision, like in
;; RENAME y u
;; RENAME u ou
THIS GENERATES AN ERROR BECAUSE OF NAME COLLISIONS (old y and u will
be named as ou)
which should be written:
;; RENAME u ou
;; RENAME y u
When circuits in renaming can't be avoided, like in:
;; RENAME # _
;; RENAME _ #
you should write:
;; RENAME # temp
;; RENAME _ #
;; RENAME temp _
Once the renaming has occurred, there is absolutely NO PERFORMANCE
DROP related to this renaming, so use it rather than a pre-processor.
Before renaming anything as # check the paragraph below!
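The ordering rules above can be checked before a '.set' file is
written. The sketch below models renaming as a sequence of operations
on the current list of phoneme names and rejects any step whose new
name is already in use. This is only a model of the rule stated in the
warning, under the assumption that commands are applied in file order;
it is not how MBROLA itself is implemented.

```python
def apply_renames(phonemes, renames):
    """Apply (old, new) renames in order, refusing name collisions.

    Returns the renamed phoneme list, or raises ValueError when a
    new name is already taken at the time the rename is applied.
    """
    names = list(phonemes)
    for old, new in renames:
        if new in names:
            raise ValueError("renaming %r to %r collides with an "
                             "existing phoneme" % (old, new))
        names = [new if name == old else name for name in names]
    return names


if __name__ == "__main__":
    # Safe order: free the name 'u' first, then reuse it.
    print(apply_renames(["y", "u", "a"], [("u", "ou"), ("y", "u")]))
    # Swapping two names through a temporary one.
    print(apply_renames(["#", "_"],
                        [("#", "temp"), ("_", "#"), ("temp", "_")]))
```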
5.4 Flush the output stream
---------------------------
Note, finally, that the synthesizer outputs chunks of synthetic speech
determined as sections of the piecewise linear pitch curve. Phones
inside a section of this curve are synthesized in one go. The last
one of each chunk, however, cannot be properly synthesized until the
next phone is known (since the program uses diphones as base
speech units). When using mbrola with pipes, this may be a
problem. Imagine, for instance, that mbrola is used to create a
pipe-based speaking clock on an HP:
speaking_clock | mbrola fr1/fr1 - -.au | splayer
which tells the time, say, every 30 seconds. The last phone of each
time announcement will only be synthesized when the next announcement
starts. To bypass this problem, mbrola accepts a special command
phone, which flushes the synthesis buffer : "#"
This default character can be replaced by another symbol thanks to the
command:
;; FLUSH new_flush_symbol
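A program feeding mbrola through a pipe, such as the speaking clock,
would therefore end each announcement with the flush phone. The sketch
below formats one utterance that way. It assumes the flush symbol is
written alone on its own line, and the helper name is hypothetical.
Remember that the writing program must also flush its own output
buffer, otherwise the commands may not reach mbrola in time.

```python
import sys


def format_utterance(phones, flush_symbol="#"):
    """Return .pho text for one utterance, terminated by a flush phone.

    phones is a list of (name, duration_ms, [(percent, hz), ...]).
    """
    lines = []
    for name, duration, targets in phones:
        fields = [name, str(duration)]
        fields += ["(%s,%s)" % (pos, hz) for pos, hz in targets]
        lines.append(" ".join(fields))
    lines.append(flush_symbol)       # assumed to stand alone on its line
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    announcement = [("_", 51, [(25, 114)]), ("b", 62, []), ("_", 91, [])]
    sys.stdout.write(format_utterance(announcement))
    sys.stdout.flush()               # push the text through the pipe now
```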
Limitations of the program
--------------------------
1. There may be up to 20 pitch targets in each phone, although
three or four are usually sufficient to copy natural prosody. We
have set a higher limit so as to enable the use of MBROLA to
produce synthetic singing voices, in which case long vowels with
vibrato may require a large number of pitch targets.
2. Phones can be synthesized with a maximum duration which depends on
the fundamental frequency with which they are produced. The higher the
frequency, the shorter the maximum duration. For a frequency of
133 Hz, the maximum duration is 7.5 sec. For a frequency of 66.5 Hz,
it is 15 sec. For a frequency of 266 Hz, it is 3.75 sec.
3. Although pitch targets are optional, the synthesizer will
refuse to produce sequences of more than 250 phones with no pitch
information.
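The three figures quoted in limitation 2 share the same product
(133 x 7.5 = 66.5 x 15 = 266 x 3.75 = 997.5), which suggests that the
maximum duration is inversely proportional to the frequency. The
sketch below uses that inferred constant to check a phone list against
the three limitations before it is sent to mbrola. The constant, the
use of the highest pitch target of a phone, and the function names are
assumptions of this sketch, not documented internals.

```python
MAX_TARGETS_PER_PHONE = 20
MAX_PHONES_WITHOUT_PITCH = 250
DURATION_TIMES_FREQUENCY = 133 * 7.5   # 997.5, inferred from the figures


def max_phone_duration_s(f0_hz):
    """Longest phone (in seconds) at a given fundamental frequency."""
    return DURATION_TIMES_FREQUENCY / f0_hz


def check_limits(phones):
    """Return a list of messages for phones that break a limitation.

    phones is a list of (name, duration_ms, [(percent, hz), ...]).
    """
    problems = []
    unpitched_run = 0
    for index, (name, duration_ms, targets) in enumerate(phones):
        if len(targets) > MAX_TARGETS_PER_PHONE:
            problems.append("phone %d (%s): too many pitch targets"
                            % (index, name))
        if targets:
            unpitched_run = 0
            highest = max(hz for _pos, hz in targets)
            if duration_ms / 1000.0 > max_phone_duration_s(highest):
                problems.append("phone %d (%s): too long for %g Hz"
                                % (index, name, highest))
        else:
            unpitched_run += 1
            if unpitched_run == MAX_PHONES_WITHOUT_PITCH + 1:
                problems.append("phone %d (%s): more than %d phones "
                                "without pitch information"
                                % (index, name, MAX_PHONES_WITHOUT_PITCH))
    return problems


if __name__ == "__main__":
    print(max_phone_duration_s(133), max_phone_duration_s(266))
    print(check_limits([("a", 8000, [(50, 133)])]))
```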
--------------------------------------------------------------
6.0 Joining the MBROLA project as a user
--------------------------------------------------------------
For convenience, we have defined two mailing lists :
* mbrola-interest@tcts.fpms.ac.be : a forum for MBROLA questions and
issues. It is used by the maintainers of the mbrola project to
announce new releases, bug fixes, new voices and languages, and other
information of interest to all MBROLA users. Users who want to share
.pho files or free applications running on top of mbrola should send
mail to mbrola-interest.
It is in your interest, as a user, to subscribe to the mbrola-interest
mailing list, by sending an e-mail to :
mbrola-interest-request@tcts.fpms.ac.be
with the word 'subscribe' in either the header or the main text. To
unsubscribe, just send another mail with 'unsubscribe'.
BUGS
----
If you detect a bug, or if you find an input for which the quality of
the speech provided by mbrola is not as good as usual, first consult
the FAQ file from the MBROLA Project homepage, which will be
frequently updated.
If this is of no help, send a kind mail to mbrola@tcts.fpms.ac.be in
which you include the .pho file with which the problem appears and
mention your machine architecture.
NEW DATABASES
-------------
If you want to participate in the mbrola project by providing a
diphone database (i.e. a set of sample files with one example of each
diphone in your language), refer to the mbrola WWW homepage, or send
an email to: mbrola@tcts.fpms.ac.be.
APPLICATIONS
------------
If you have used mbrola to build speaking apps on top of it (like
talking clocks, talking agendas, talking tools for handicapped
persons, etc.) and want to make them available to the community (for
free, of course, and for non-commercial, non-military applications, as
imposed by the mbrola license agreement), just make an announcement to
the mbrola mailing list:
mbrola-interest@tcts.fpms.ac.be.
COMMERCIAL VERSION
------------------
If you are interested in the commercial version of mbrola (source code
available), send an email to: mbrola@tcts.fpms.ac.be
FEEDBACK
--------
If you simply find this initiative useful, please drop us a note at
mbrola@tcts.fpms.ac.be. We have spent a lot of our time providing you
with this program, and we would like to get some feedback in return.
Don't forget, either, to mention the MBROLA reference paper :
T. DUTOIT, V. PAGEL, N. PIERRET, F. BATAILLE, O. VAN DER VRECKEN
"The MBROLA Project: Towards a Set of High-Quality Speech
Synthesizers Free of Use for Non-Commercial Purposes"
Proc. ICSLP 96, Philadelphia, vol. 3, pp. 1393-1396
or, for a more general reference to Text-To-Speech synthesis, the
book:
An Introduction to Text-To-Speech Synthesis,
T. DUTOIT, Kluwer Academic Publishers, Dordrecht
Hardbound, ISBN 0-7923-4498-7
April 1997, 312 pp.
in any scientific publication referring to work for which this program
has been used.
--------------------------------------------------------------
7.0 Joining the MBROLA project as a database provider
--------------------------------------------------------------
One of the main attractions of the MBROLA project (and definitely its
most original aspect) lies in its ability to provide an ever growing
set of languages/voices to users.
To achieve this goal, the MBROLA project has been organized so
as to encourage other research labs or companies to share their diphone
databases.
The terms of this sharing policy can be summarized as follows :
1. We shall only use your database to adapt it to the mbrola format,
and destroy the copy when this is done.
2. The resulting mbrola diphone database will be copyright Faculte
Polytechnique de Mons. Non-commercial use of the database in the
framework of the MBROLA project will be automatically granted to
Internet users. In return, we shall send you a license agreement which
will transfer all our commercial rights on the database to you,
provided the database is used with and only with the MBROLA program.
3. All these details will be fixed by some official agreement before
you send us anything.
If you want to create a database from scratch
---------------------------------------------
First, you should be aware that recording a diphone database is not a
trivial operation. If it is not performed carefully, the result can be
disappointing. FR1, for instance, required about one month of work,
even with the help of some efficient laboratory tools for signal
recording and editing. What is more, some phonetic knowledge of the
targeted language is necessary to create the initial corpus.
So if you just think of designing a new diphone database as a game,
forget it.
If, on the contrary, you are willing to spend some time to provide the
MBROLA community with a new language or voice, or if you already have
a diphone database and wish to share it in mbrola format (and receive
in return the rights for any commercial exploitation of the mbrola
diphone database we will create for you), you are welcome.
If you still want to create a database from scratch
---------------------------------------------------
Creating a database is typically achieved in four steps:
* Creating a text corpus
* Recording the corpus
* Segmenting the speech corpus
* Equalizing diphones
Creating a text corpus
------------------------
Diphones are speech units that begin in the middle of the stable state
of a phone and end in the middle of the following one. Their main
interest in synthesis is that they minimize concatenation problems,
since they involve most of the transitions and co-articulations
between phones, while requiring an affordable amount of memory, as
their number remains relatively small (as opposed to other synthesis
units such as half-syllables or triphones).
Hence, the first step to build a diphone database consists of fixing a
list of all the phones of a language. Notice that phones are acoustic
instances of phonemes. Phonemes are themselves defined on a
functional, linguistic level.
Obtaining a list of phones from a list of phonemes requires
enumerating allophones, i.e. acoustic versions of some phonemes that
significantly differ from the standard one, mostly due to
co-articulation constraints. Although it is not necessary to account
for all allophonic variations to build an intelligible synthesizer,
the naturalness of synthetic speech may be affected if too few
allophones are considered. In FR1, for example, we did not consider
allophones at all. As a result, some allophonic phenomena, such as
devoicing of /R/ when followed or preceded by unvoiced plosives, are
only partially accounted for.
When a complete list of phones has emerged, including allophones if
possible, a corresponding list of diphones is immediately obtained,
and a list of words is carefully compiled in such a way that each
diphone appears at least once (twice is better, for
safety). Unfavorable positions, like inside stressed syllables or in
strongly reduced (i.e. over-co-articulated) contexts, should be
excluded. One typically uses carrier sentences in which the word with
the diphone considered is inserted. Notice that many diphones only
appear across word boundaries (i.e. not in single words). A
number of diphones never appear at all. Hence, the task of
creating a text corpus which contains all existing ones is not
trivial.
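Checking a draft corpus against the diphone list is easy to automate.
The sketch below builds every ordered pair of phones and reports the
pairs that no transcription contains. The phone names and the sample
transcription are made up for illustration; a real list would also
have to drop the pairs that never occur in the language, which only
phonetic knowledge can tell.

```python
from itertools import product


def all_diphones(phones):
    """Every ordered pair of phones, including pairs like ('a', 'a')."""
    return set(product(phones, repeat=2))


def covered_diphones(transcriptions):
    """Diphones found in the corpus; each item is a list of phones."""
    covered = set()
    for sequence in transcriptions:
        covered.update(zip(sequence, sequence[1:]))
    return covered


def missing_diphones(phones, transcriptions):
    """Diphones of the full list that the corpus does not contain yet."""
    return sorted(all_diphones(phones) - covered_diphones(transcriptions))


if __name__ == "__main__":
    phones = ["_", "a", "b"]                    # hypothetical phone set
    corpus = [["_", "a", "b", "a", "_"]]        # hypothetical carrier word
    print(missing_diphones(phones, corpus))
```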
Recording the corpus
--------------------
The corpus is then read, by a professional speaker if possible,
digitally recorded, and stored in digital format.
IMPORTANT : In order for the mbrola resynthesis operation to achieve
best results, the corpus should be read with the most monotonous
intonation possible (just like when reading a long and boring
enumeration). Even the ends of words should keep a constant
fundamental frequency. Since this is a totally unnatural way of
reading a text, the speaker should train before starting the recording
session.
NOTA BENE : If you already have a diphone database which you want to
make available in mbrola format, contact the author, even if it has
not been recorded with constant pitch. It is very likely that your
database can be used anyway.
It is best to use high quality audio devices (microphone, pre-amp, A/D
converter). The sound recording tools provided with many low-price
commercial boards, for example, should be avoided, as they produce
undesired recording noise. To roughly test the quality of your
recording system, just plug the microphone in, adjust the recording
level, hold your breath, and record. Or, if you can, short circuit the
microphone input of your system, and record. Then inspect the
recorded noise.
In the case of FR1, the noise level only corrupted the last three bits
of our data, leaving thirteen significant bits.
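A figure of this kind can be estimated from a recording of "silence".
The sketch below counts how many low-order bits are needed to hold the
peak-to-peak excursion of the noise and subtracts them from the word
size. This is a rough heuristic of our own for illustration, not the
procedure that was used for FR1.

```python
def noisy_bits(silence_samples):
    """Number of low-order bits spanned by the recorded noise."""
    span = max(silence_samples) - min(silence_samples)
    return span.bit_length()


def significant_bits(silence_samples, word_size=16):
    """Bits left for the signal once the noisy ones are set aside."""
    return word_size - noisy_bits(silence_samples)


if __name__ == "__main__":
    # Hypothetical integer samples taken from a silent recording.
    silence = [-4, 3, 0, 1, -2, 2]
    print(noisy_bits(silence), significant_bits(silence))
```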
Another important type of noise to avoid is ambient noise and
reverberation. In particular, the recording should be free of low
frequency noises, due to trucks passing in the neighborhood for
instance. Most of the time you won't hear them, but your microphone
will hardly fail to detect them, especially if it is a high quality
one. The best way to avoid them is to install your recording system
inside a professional soundproof room. For FR1, this is what we did.
Segmenting the corpus
---------------------
Once the corpus has been recorded, all diphones must be located,
either manually with the help of signal visualization tools, or
automatically thanks to segmentation algorithms, the decisions of
which are checked and corrected interactively. A diphone database is
finally created, which gathers the results in the form of : the
names of diphones, the related waveforms, their durations, and
internal sub-splittings. In particular, the position of the border
between phones should be stored, so as to be able to modify the
duration of one half-phone without affecting the length of the other
one.
NOTA BENE : For optimal results with mbrola, it is best to keep
diphones in context. The MBROLA resynthesis operation, indeed,
includes some pitch analysis, which itself achieves more accurate
results when, say, 50 ms of speech are kept at the left and right of
each diphone.
Equalizing diphones
-------------------
Since diphones to be chained up have generally been extracted from
different words, that is in different phonetic contexts, they often
present amplitude and timbre mismatches. Even in the case of
stationary vocalic sounds, for instance, a rough sequencing of
diphones typically leads to audible discontinuities.
Amplitude mismatches can be dealt with to some extent as early as the
construction of the diphone database, thanks to equalization. This
operation smoothly modifies the energy levels at the beginning and at
the end of segments, in such a way as to eliminate amplitude
mismatches (by setting the energy of all the phones of a given phoneme
to their average value).
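The core of the idea, bringing every instance of a phoneme to the
average energy, can be sketched in a few lines. The version below is
deliberately simplified: it computes one constant gain per segment,
whereas a real equalizer would vary the gain smoothly between the two
ends of each diphone. The function names are illustrative, and energy
is taken here as the mean of the squared samples.

```python
import math


def mean_energy(samples):
    """Average of the squared sample values of one segment."""
    return sum(s * s for s in samples) / len(samples)


def equalization_gains(segments):
    """One amplitude gain per segment of the same phoneme.

    Multiplying a segment by its gain brings its energy to the
    average energy of all segments. Silent segments are left alone.
    """
    energies = [mean_energy(segment) for segment in segments]
    target = sum(energies) / len(energies)
    return [math.sqrt(target / e) if e > 0 else 1.0 for e in energies]


if __name__ == "__main__":
    # Two hypothetical instances of the same phoneme, one much louder.
    quiet, loud = [1, -1, 1, -1], [3, -3, 3, -3]
    print(equalization_gains([quiet, loud]))
```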
In contrast, timbre conflicts are better tackled at run-time, by the
mbrola algorithm itself.
Notice, however, that equalization is optional, as the mbrola
resynthesis operation (the one we shall perform to adapt your database
to the mbrola format) also includes some equalization facilities.
IMPORTANT
---------
If you want to build a new diphone database, please contact the author
first. He will help you as much as he can, by providing phonetic
information if available for instance.
In all cases, make a first dummy trial : create a small corpus for a
few diphones, record them, segment them, equalize them if you can, and
send the result directly to the author. He will test your data, tell
you how good it is, and what should be done to make it better.
If you want to share an existing database
-----------------------------------------
Read the information above to see if your database has been designed
and recorded correctly. Contact the authors (see below) anyway.
--------------------------------------------------------------
8.0 Acknowledgments
--------------------------------------------------------------
I would like to thank Vincent Pagel (Mons / BE) for his intensive
programming, testing, and debugging of this program, and for all sorts
of fruitful discussions.
Thanks to Sam Przyswa (Paris/FR), Fred Englert (Frankfurt/DE), Arnaud Gaudinat
(University of Geneva, CH), Cyrille Mastchenko (Paris/FR), Michael
C. Thornburgh (USA), Eric Keller (University of Lausanne,CH), Bruno
Langlois (Quebec/CA), Christophe M. Vallat (Domerat/FR), Cristiano
Verondini (Bologna/Italy), and Gerald Kerma (G'K2 Vaugrigneuse/FR) for
their help in the compilation of MBROLA.
Arnaud Gaudinat (Lausanne/CH), Vincent Pagel, Michael M. Cohen
(University of California - Santa Cruz), and Patrick Bouffer (France)
have arranged mirror sites.
Let's greet our pioneer database providers: Marian Boldea, Denis
Costa, Arthur Dirksen, Thierry Dutoit, Fred Englert, Vincent Pagel
and the team at University Autonoma of Barcelona and Alistair
Conkie! May they be thanked for their work.
Stephen Isard and Alistair Conkie have provided the Freespeech TTS!!
Alan Black and Paul Taylor have supported the Mbrola Project in their
great Festival multilingual TTS Project.
Fabrice Malfrere (Mons/BE) has developed an efficient speech
alignment program for Windows (distributed on the mbrola site).
Alain Ruelle (Mons/BE) has developed the MBRPlay dll and the
Mbroli interactive pho file player for Windows.
Last but not least, I am also greatly indebted to Francois Bataille
(Mons/BE) for having supported the creation of this internet project.
--------------------------------------------------------------
9.0 Contacting the author
--------------------------------------------------------------
Dr Thierry Dutoit
Faculte Polytechnique de Mons, TCTS Lab,
31, bvd Dolez, B-7000 Mons, Belgium.
tel : /32/65/374133
fax : /32/65/374129
e-mail: mbrola@tcts.fpms.ac.be, for general information,
questions on the installation of software and databases.