NAME
klatt - Klatt cascade-parallel formant synthesizer (v3.05)
SYNTAX
klatt [-i input filename] [-o output filename] [-q] [-t output waveform type]
[-c] [-n number of formants in the cascade branch] [-s sample rate]
[-f number of milliseconds per frame] [-v voicing source]
[-V sampled voicing filename] [-r raw samples output type]
[-F percent f0 flutter]
DESCRIPTION
The klatt software is an implementation of a speech synthesizer
first described by Dennis Klatt in 1980 [1]. The
object of the program is to convert a set of parameter values into a waveform
representing speech sound. The following pages describe the command line
options available to the user, and the format of the input and output data
files. Details of the history of this code and the modifications that have been
made can be found in the README file in the distribution.
OPTIONS
-h Displays a help message
-i filename User specified
input filename. The file specified will contain ASCII data in a format
described in a later section. If no filename is specified then input is assumed
to be from stdin.
-o filename User specified output
filename. The output speech waveform will be written to this file. The output
waveform may be written as signed 16 bit integers in raw binary samples, or as
a file of ASCII integers. This second format is suitable for plotting the
waveform using gnuplot etc. If no filename is
specified, then output is assumed to be to stdout.
-q Run in quiet mode, in which no output messages are produced. The default is
to run in verbose mode, where details of the current frame of parameters being
processed are displayed on the screen.
-t Output Waveform Type This option selects the type of waveform that is passed to the output file. The default is the complete speech waveform. Note that a value must be set at compilation time to enable the code which generates the various output waveforms; this code may be disabled to improve the speed and efficiency of the program overall. The options available are listed below.
-c This flag selects full
cascade-parallel operation. The default setting of the synthesizer is parallel
branch only.
-n Number of Formants. This option is
used to set the number of formants in the cascade branch of the synthesizer. The
default number is 5.
-s Sample Rate Sets the sample rate
used for the output waveform. The default is 10000 (10kHz).
-f Number of Milliseconds per Frame This value specifies
the number of milliseconds of output waveform each frame of synthesizer
parameters represents. The default is 10.
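As a rough illustration of how -s and -f interact, the number of output samples covered by one frame is the sample rate multiplied by the frame duration. The sketch below is not taken from the klatt source: the function name is hypothetical, and the integer rounding is an assumption.

```python
def samples_per_frame(sample_rate_hz: int = 10000, ms_per_frame: int = 10) -> int:
    """Number of output samples that one parameter frame covers.

    The defaults mirror the documented defaults for -s and -f.
    Truncating division is an assumption, not confirmed behaviour.
    """
    return sample_rate_hz * ms_per_frame // 1000


print(samples_per_frame())           # documented defaults: -s 10000 -f 10
print(samples_per_frame(16000, 5))   # settings used in EXAMPLES: -s 16000 -f 5
```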
-v Voicing Source Three types of voicing source
are available; these are listed below.
-V Sampled Natural Excitation Filename The sampled
excitation waveform used by the software can be loaded from a file. The file
is expected to contain ASCII text in the following order: first, an integer
giving the total number of samples; second, a floating point value by which
the samples are scaled when used; finally, the stated number of integer
samples.
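The layout described for the -V file can be sketched as a small writer and reader. This is only an illustration: the function names are hypothetical, and the use of whitespace (here, newlines) as the separator between values is an assumption, since only the order of the values is documented.

```python
import io


def write_excitation(stream, samples, scale=1.0):
    """Write the sample count, the scale factor, then the samples, one value per line."""
    stream.write(f"{len(samples)}\n")
    stream.write(f"{scale}\n")
    for sample in samples:
        stream.write(f"{int(sample)}\n")


def read_excitation(stream):
    """Parse the same layout back and return (scale, samples)."""
    tokens = stream.read().split()
    count = int(tokens[0])
    scale = float(tokens[1])
    samples = [int(token) for token in tokens[2:2 + count]]
    if len(samples) != count:
        raise ValueError("file holds fewer samples than its header claims")
    return scale, samples


# Round trip through an in-memory buffer instead of a real file.
buf = io.StringIO()
write_excitation(buf, [0, 120, 340, 120, 0, -200], scale=0.5)
buf.seek(0)
print(read_excitation(buf))
```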
-r Raw Samples Output Type Selecting this
flag will produce the output waveform as a raw binary file, rather than as
ASCII integers. Two types are available: type 1 gives a high byte - low byte
arrangement, and type 2 gives a low byte - high byte arrangement.
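To make the two byte arrangements concrete, the following sketch decodes a buffer of raw output into signed 16-bit integers. It assumes that type 1 (high byte first) corresponds to big-endian and type 2 (low byte first) to little-endian; the function name is hypothetical.

```python
import struct


def decode_raw(data: bytes, raw_type: int):
    """Decode signed 16-bit samples written with -r 1 or -r 2."""
    if raw_type not in (1, 2):
        raise ValueError("raw_type must be 1 or 2")
    # ">" is big-endian (high byte first), "<" is little-endian (low byte first).
    order = ">" if raw_type == 1 else "<"
    count = len(data) // 2
    return list(struct.unpack(f"{order}{count}h", data[:count * 2]))


sample_bytes = b"\x01\x00\xff\xfe"
print(decode_raw(sample_bytes, 1))  # read as high byte - low byte
print(decode_raw(sample_bytes, 2))  # same bytes read as low byte - high byte
```

Decoding the same bytes both ways shows why the wrong -r setting yields noise rather than a slightly altered signal.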
-F Percent f0 Flutter The percentage of f0
flutter to be applied to the synthesized speech, as described in [2]. f0
flutter is an attempt to reduce the lack of naturalness that constant values
of f0 introduce into synthetic speech. A small amount of
quasi-random f0 flutter is applied when this value is greater than 0.
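For orientation, the flutter described in [2] is usually given as a sum of three slow sine waves scaled by the flutter percentage and by f0. The sketch below uses that commonly quoted form. The constants (12.7, 7.1 and 4.7 Hz), the use of seconds as the time unit, and the function name are assumptions and have not been checked against this program's source.

```python
import math


def flutter_delta_f0(flutter_percent: float, f0_hz: float, t_seconds: float) -> float:
    """Offset in Hz added to f0 at time t (assumed formula, see the text above)."""
    slow_sines = (
        math.sin(2 * math.pi * 12.7 * t_seconds)
        + math.sin(2 * math.pi * 7.1 * t_seconds)
        + math.sin(2 * math.pi * 4.7 * t_seconds)
    )
    return (flutter_percent / 50.0) * (f0_hz / 100.0) * slow_sines


# Sample the offset every 10 ms for a 100 Hz voice with -F 25.
for frame in range(5):
    t = frame * 0.010
    print(f"{t:.3f} s  {flutter_delta_f0(25, 100.0, t):+.3f} Hz")
```

Because the three sine frequencies are not harmonically related, the offset does not repeat over short stretches, which is what makes it sound quasi-random.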
INPUT FILE FORMAT
There
are two differences with the v3.03 Unix version:
The input file consists of a series of parameter frames. Each frame of
parameters (usually) represents 10 ms of audio output, although this figure can
be adjusted down to 1 ms per frame. The parameters in each frame are described
below. To avoid confusion, note that the cascade and parallel branches of the
synthesizer duplicate some of the control parameters.
time This is the time at
which the following parameters have been estimated in the original speech file.
This allows the synchronisation of synthesis parameters with the original
signal within the context of copy synthesis. This parameter is not used during
synthesis.
f0 This is the
fundamental frequency (pitch) of the utterance. It is specified in
steps of 0.1 Hz, so 100 Hz is represented by a value of 1000.
av Amplitude of
voicing for the cascade branch of the synthesizer in dB. Range 0-70, value
usually about 60 for a vowel sound.
f1 First formant frequency.
Range usually 200-1300 Hz.
b1 Cascade branch,
bandwidth of first formant. Range usually 40-1000 Hz.
f2 Second formant
frequency. Range usually 550-3000 Hz.
b2 Cascade branch,
bandwidth of second formant. Range usually 40-1000 Hz.
f3 Third formant
frequency. Range usually 1200-4999 Hz.
b3 Cascade branch,
bandwidth of third formant. Range usually 40-1000 Hz.
f4 Fourth formant
frequency. Range usually 1200-4999 Hz.
b4 Cascade branch,
bandwidth of fourth formant. Range usually 40-1000 Hz.
f5 Fifth formant
frequency. Range usually 1200-4999 Hz.
b5 Cascade branch,
bandwidth of fifth formant. Range usually 40-1000 Hz.
f6 Sixth formant
frequency. Range usually 1200-4999 Hz.
b6 Cascade branch,
bandwidth of sixth formant. Range usually 40-2000 Hz.
fnz Frequency of the
nasal zero. Range usually 248-528 Hz (cascade branch only).
bnz Bandwidth of the
nasal zero. Range usually 40-1000 Hz (cascade branch only).
fnp Frequency of the
nasal pole. Range usually 248-528 Hz.
bnp Bandwidth of the
nasal pole. Range 40-1000 Hz.
asp Amplitude of
aspiration, range 0-70 dB.
kopen Open quotient of
voicing waveform, range 0-60, usually 30. Influences the gravelly or smooth
quality of the voice. Only works with impulse and natural simulations. For the
sampled glottal excitation waveform the open quotient is fixed.
aturb Amplitude of
turbulence, range 0-80 dB. A value of 40 is useful. Can be used to
simulate "breathy" voice quality.
tilt Spectral tilt in
dB, range 0-24. Tilts down the output spectrum. The
value refers to dB down at 3 kHz. Increasing the value emphasizes the low
frequency content of the speech and attenuates the high frequency content.
af Amplitude of
frication in dB, range 0-80 (parallel branch).
skew Spectral Skew - skewness of alternate periods, range 0-40
a1 Amplitude of
first formant in the parallel branch, range 0-80 dB.
b1p Bandwidth of the
first formant in the parallel branch, in Hz.
a2 Amplitude of
parallel branch second formant.
b2p Bandwidth of
parallel branch second formant.
a3 Amplitude of
parallel branch third formant.
b3p Bandwidth of
parallel branch third formant.
a4 Amplitude of
parallel branch fourth formant.
b4p Bandwidth of
parallel branch fourth formant.
a5 Amplitude of
parallel branch fifth formant.
b5p Bandwidth of
parallel branch fifth formant.
a6 Amplitude of
parallel branch sixth formant.
b6p Bandwidth of
parallel branch sixth formant.
anp Amplitude of the
parallel branch nasal formant.
ab Amplitude of
bypass frication in dB, range 0-80.
avp Amplitude of
voicing for the parallel branch, range 0-70 dB.
gain Overall gain in
dB, range 0-80.
Ra Ratio of ta to tc-te (characterizing the
return to zero) of the LF source, range 0-100. Typical values are between
10 and 50. The actual value in the synthesizer is divided by 1000.
Rk Ratio of te-tp to tp (characterizing the
end of the open phase) of the LF source, range 0-100. Typical values
are between 10 and 70. The actual value in the synthesizer is divided by 100.
Rg Ratio of half a fundamental period to tp (characterizing the
length of the open phase) of the LF source, range 0-100. Typical values
are between 10 and 70. The actual value in the synthesizer is divided by 50.
F1Nhz First extra nasal formant
frequency.
B1Nhz Cascade branch,
bandwidth of the first extra nasal formant.
A1N Amplitude of first
extra nasal formant in the parallel branch, range 0-80 dB.
B1Nphz Bandwidth
of parallel branch first extra nasal formant.
F2Nhz Second extra nasal formant
frequency.
B2Nhz Cascade branch,
bandwidth of the second extra nasal formant.
A2N Amplitude of second
extra nasal formant in the parallel branch, range 0-80 dB.
B2Nphz Bandwidth
of parallel branch second extra nasal formant.
F3Nhz Third extra nasal formant
frequency.
B3Nhz Cascade branch,
bandwidth of the third extra nasal formant.
A3N Amplitude of third
extra nasal formant in the parallel branch, range 0-80 dB.
B3Nphz Bandwidth
of parallel branch third extra nasal formant.
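The parameter list above can be turned into a small helper for generating frames. The sketch below is an illustration only: it assumes that a frame is written as one line of whitespace-separated integers in the order listed in this section, which this page does not state explicitly. The function names are hypothetical, and the zero defaults are placeholders that a real parameter file would replace with sensible values.

```python
# Parameter names in the order this section lists them (56 per frame).
FRAME_FIELDS = (
    "time f0 av f1 b1 f2 b2 f3 b3 f4 b4 f5 b5 f6 b6 "
    "fnz bnz fnp bnp asp kopen aturb tilt af skew "
    "a1 b1p a2 b2p a3 b3p a4 b4p a5 b5p a6 b6p "
    "anp ab avp gain Ra Rk Rg "
    "F1Nhz B1Nhz A1N B1Nphz F2Nhz B2Nhz A2N B2Nphz F3Nhz B3Nhz A3N B3Nphz"
).split()


def hz_to_f0_units(hz: float) -> int:
    """f0 is stored in steps of 0.1 Hz, so 100 Hz becomes 1000."""
    return round(hz * 10)


def format_frame(values: dict) -> str:
    """Return one frame as a line of integers; unset fields default to 0."""
    unknown = set(values) - set(FRAME_FIELDS)
    if unknown:
        raise KeyError(f"unknown parameter(s): {sorted(unknown)}")
    return " ".join(str(int(values.get(name, 0))) for name in FRAME_FIELDS)


# Illustrative values chosen from within the documented ranges.
frame = format_frame({
    "time": 0, "f0": hz_to_f0_units(100), "av": 60,
    "f1": 500, "b1": 60, "f2": 1500, "b2": 90, "f3": 2500, "b3": 150,
    "kopen": 30, "gain": 60,
})
print(len(FRAME_FIELDS), "fields")
print(frame)
```

Keeping the field order in a single tuple means a misspelled parameter name is rejected instead of silently shifting every later column.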
EXAMPLES
Included with the distribution are two example parameter files. They may be
synthesized using the command line:
klatt -i example.par -o example.raw -f 5 -v 2 -s 16000 -r 1
This produces raw 16-bit signed integers. A package like sox
can be used to convert to your favourite audio format. For example, conversion
to the ulaw encoded format used by Sun Sparc SLCs is given below.
sox -r 16000 -s -w example.raw -r 8000 -b -U example.au
Beware of the byte ordering of your machine: if the above procedure produces
distorted rubbish, try using -r 2 instead of -r 1. This just reverses the byte
ordering in the raw binary output file. It is also worth noting that the above
example reduces the quality of the output, as both the sampling rate and the
number of bits per sample are halved. Ideally output should be at 16 kHz
with 16 bits per sample.
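Where sox is not available, the raw output can also be wrapped in a WAV header without reducing the sample rate or bit depth. The sketch below uses Python's standard wave module. It assumes mono output, and that -r 1 data is high byte first and must be swapped because WAV stores 16-bit samples low byte first. The function names are hypothetical.

```python
import io
import wave


def to_little_endian(data: bytes, raw_type: int) -> bytes:
    """Return 16-bit samples low byte first, swapping byte pairs for -r 1 data."""
    data = data[: len(data) // 2 * 2]  # drop a trailing odd byte, if any
    if raw_type == 2:
        return data
    swapped = bytearray(len(data))
    swapped[0::2] = data[1::2]
    swapped[1::2] = data[0::2]
    return bytes(swapped)


def raw_to_wav(data: bytes, out, sample_rate: int = 16000, raw_type: int = 1) -> None:
    """Write raw klatt output to `out` (a path or a binary file object) as mono WAV."""
    with wave.open(out, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 16 bits per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(to_little_endian(data, raw_type))


# Demonstration with two in-memory samples instead of a real example.raw.
buf = io.BytesIO()
raw_to_wav(b"\x01\x00\xff\xfe", buf, sample_rate=16000, raw_type=1)
print(len(buf.getvalue()), "bytes written")
```

For a real file, the bytes read from example.raw would be passed as `data` and a filename such as example.wav as `out`.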
BUGS
I have
not had a chance to test loading a sampled excitation waveform from a file. Please
let me know if there are problems.
My research does not (yet) require me to use the synthesizer in
its primary mode, which is combined cascade-parallel operation. I have
primarily used the synthesizer in parallel only mode. I would appreciate any
comments regarding use of the cascade branch.
Finally,
there is no protection against rapid parameter changes. Large jumps in many of
the parameters will cause clicks and pops in the output. This may be remedied
in future with some form of parameter clamping that becomes effective when
parameters exceed a set rate of change.
All bug reports and queries to Jon Iles (j.p.iles@cs.bham.ac.uk),
University of Birmingham, School of Computer Science, Edgbaston,
Birmingham, B29 7PY, UK.
AUTHORS
Jon Iles (j.p.iles@cs.bham.ac.uk)
Nick Ing-Simmons (nicki@lobby.ti.com)
ACKNOWLEDGEMENTS
Many thanks to Tony Robinson for his help and support. Thanks also to Alan
Black, Paul Callaghan, Johannes Kiehl, Arthur Dirksen and Gary Murphy for prompt
bug spotting and feedback, to Mark Thornton for help with C7 and to André
Doherty for porting to Borland C++.
REFERENCES