English Text to Speech Synthesizer using Concatenation
Sai Sawant1, M. S. Deshpande2
Department of Electronics and Telecommunication
Vishwakarma Institute of Technology, Pune, India
Abstract. A text to speech (TTS) synthesis system produces artificial human
speech, and text in any language can be converted into a speech signal using
such a system. This paper presents a method to design a TTS system for the
English language using MATLAB. Simple matrix operations and the container map
data structure available in MATLAB are used to design this system. Phoneme
concatenation is performed to obtain the speech signal for the input text.
Initially, a set of words that contains all the phonemes of the English
language is recorded. Phonemes are extracted from these recorded words using
the PRAAT tool. The phonemes of the input text are matched against the
extracted phonemes, which are then concatenated sequentially to reconstruct
the desired words. The method is simple to implement and, because only
phoneme-sized units are stored, needs little memory.
Keywords: Text to speech, English, Phonetic concatenation, MATLAB, PRAAT Tool
A text to speech system transforms linguistic information present in the form
of data or text into a speech signal. TTS acts as an interface between digital
content and a wider population, such as people with literacy difficulties,
learning disabilities or reduced vision, and those learning a language. It
helps people who are looking for simple ways to access digital content. It can
be used in telecommunication, industrial and educational applications.
Synthetic speech can be formed by concatenating recorded speech units that are
stored in a database. Systems that use concatenation for synthesis differ in
the size of the stored speech units. Phones, diphones, syllables etc. can be
used as speech units. A system that uses phones or diphones provides the
largest output range. For some domains, storing entire words or sentences
allows a high quality speech signal. A synthesizer can also incorporate a model
of the vocal tract and other human voice characteristics to form a completely
synthetic voice output.
The rest of the paper is organized as follows: Section 2 gives a brief
description of concatenative synthesis and its subtypes. Section 3 provides the
flow diagram of the implemented TTS system. Sections 4 and 5 describe the
implementation steps and the experimental results, respectively. Section 6
concludes the discussion by summarizing the findings and outlining the future
direction of the work.
Concatenative synthesis is the concatenation of segments of recorded speech.
This synthesis technique is simple to implement as it does not involve any
mathematical model. Speech is produced from natural, human speech.
Concatenation of prerecorded speech utterances produces understandable and
natural sounding synthesized speech. Concatenation can be done using stored
speech units of different sizes. There are 4 subtypes of this synthesis method,
depending upon the speech unit size and use [4]:
Domain specific synthesis
The most important aspect of concatenative synthesis is selecting the correct
unit length. Longer speech units give higher naturalness and fewer
concatenation points, but the number of required units and the memory needed
both increase. Shorter units need less memory, but the sample collection and
labeling procedures become more difficult and complex [10]. The present system
is implemented using phonemes as speech units.
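The memory side of this trade-off can be made concrete with a rough estimate. The sketch below is only an illustration in Python (the paper's own implementation is in MATLAB). The sampling rate, sample width, unit durations and the size of the word inventory are all assumed values chosen for the arithmetic, not figures reported in this work.

```python
# All figures below are illustrative assumptions, not values from the paper.
SAMPLE_RATE_HZ = 16_000   # assumed sampling rate
BYTES_PER_SAMPLE = 2      # assumed 16-bit mono audio


def inventory_bytes(unit_count, mean_duration_ms):
    """Approximate storage needed for an inventory of recorded speech units."""
    total_samples = unit_count * mean_duration_ms * SAMPLE_RATE_HZ // 1000
    return total_samples * BYTES_PER_SAMPLE


# 44 phoneme units of an assumed 100 ms each.
phoneme_inventory = inventory_bytes(unit_count=44, mean_duration_ms=100)

# A hypothetical 10,000-word inventory of an assumed 500 ms per word.
word_inventory = inventory_bytes(unit_count=10_000, mean_duration_ms=500)

print(phoneme_inventory)  # 140800 bytes
print(word_inventory)     # 160000000 bytes
```

Under these assumed numbers the phoneme inventory is about three orders of magnitude smaller than the word inventory, which is the kind of saving the text refers to.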
Phoneme based Synthesis
In this synthesis technique, a sequential combination of phonemes is used to
synthesize the desired continuous speech signal. To extract the phonemes,
different words are recorded that together contain all possible phonemes of the
desired TTS system language. From these recorded word utterances, phonemes of
specified duration are extracted, which creates a database of extracted phoneme
sounds. Whenever a word is to be synthesized, the corresponding phonemes are
fetched from the database and concatenated to obtain the required word sound.
The following figure illustrates phoneme based synthesis.
Fig. 1. Phoneme based Speech Synthesis
Fig. 2. Flow Chart showing the Methodology of Text to Speech System
English words are recorded by a single speaker using the Voice Recorder
application for Android phones. The words are selected so that they cover all
the phonemes present in the English language.
The 44 phonemes of the English language are considered as speech units for
concatenation. These phonemes are taken from Orchestrating Success in Reading
by Dawn Reithaug (2002) [1]. The sounds pertaining to these 44 phonemes form a
database for creating any English word in a standard lexicon. These 44 phoneme
sounds are extracted from the recorded words using the existing PRAAT tool.
This tool can be used to segment recorded words into their constituents, such
as syllables and phonemes. The TextGrid editor of the PRAAT tool is used for
segmenting recorded sounds and labeling the segments [3]. With the help of this
tool, words are segmented and annotated to obtain phonemes as shown in the
following examples:
Table 1. English Phonemes with Examples
Hat, Map, Cat
Train, Eight, Day
Fig. 3. Extraction of phoneme /k/ from word Cat using PRAAT
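Once a word has been annotated, cutting out a phoneme amounts to slicing the recorded samples between the labeled boundaries. The following Python sketch illustrates that step only (the paper's implementation is in MATLAB, and the segmentation itself is done in PRAAT). The function name `extract_segments`, the interval times, the label `a` for the vowel and the synthetic ramp used in place of real audio are all hypothetical.

```python
# Sketch: cut labeled phoneme segments out of a recorded word.
# The interval times are hypothetical, not measurements from the paper.

def extract_segments(samples, sample_rate, intervals):
    """Return {label: samples} for each (start_s, end_s, label) interval."""
    segments = {}
    for start_s, end_s, label in intervals:
        first = int(round(start_s * sample_rate))
        last = int(round(end_s * sample_rate))
        segments[label] = samples[first:last]
    return segments


# Stand-in for a 0.5 s recording of "cat" at 8 kHz (a ramp, not real audio).
sample_rate = 8000
recording = list(range(4000))

# Hypothetical TextGrid-style annotation: (start in s, end in s, label).
cat_intervals = [
    (0.00, 0.10, "k"),
    (0.10, 0.35, "a"),
    (0.35, 0.50, "t"),
]

phonemes = extract_segments(recording, sample_rate, cat_intervals)
for label, segment in phonemes.items():
    print(label, len(segment))
```

Because the intervals are contiguous, the three segments together cover the whole recording; in practice the precision of the boundaries determines how clean each extracted phoneme is.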
MATLAB provides a containers package with a Map class. The system uses a Map
object, which is an instance of the MATLAB containers.Map class. A Map object
is a data structure that allows values to be retrieved using a corresponding
key. Keys can be real numbers or character vectors, which gives more
flexibility for data access than array indices, which must be positive
integers. Values can be scalar or non-scalar arrays [2]. In this data
structure, the extracted phoneme sounds are the values and their labels are the
keys, so every unique label or annotation corresponds to a particular phoneme
sound. This forms key-value pairs of phoneme labels and their respective sounds.
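The label-to-sound mapping can be pictured with a small example. The sketch below uses a Python dictionary as an analogue of the MATLAB containers.Map object described above; it is not the paper's code. The sample values are placeholders rather than recorded audio, and the helper `lookup` is a hypothetical name.

```python
# Python analogue of a label -> sound map (the paper uses MATLAB's containers.Map).
# The sample values are placeholders, not real phoneme recordings.

phoneme_db = {
    "k": [0.0, 0.2, -0.1],
    "oy": [0.3, 0.5, 0.4, 0.1],
    "n": [0.1, -0.2],
}


def lookup(label):
    """Return the stored samples for a label, or None if the label is not a key."""
    return phoneme_db.get(label)


print(sorted(phoneme_db))      # the keys are the annotation labels
print(lookup("oy"))            # the value is the stored sample vector
print(lookup("zz") is None)    # an unknown label has no entry
```

Keying by label rather than by numeric index means the synthesis step can ask for a phoneme by name without knowing where it sits in the database.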
Grapheme to Phoneme Conversion
This process is used to generate a pronunciation for a word using certain
rules. The job of a grapheme to phoneme algorithm is to convert a letter string
like ‘Toy’ into a phone string like /t/ /oy/. The position of a letter in the
given word is taken into account when designing the rules. The input sequence
is processed sequentially, i.e., from left to right. For each input word, a
sequence of phoneme labels is selected. Whenever a match occurs between an
input letter (or group of letters) and a phoneme label, the phonemic
representation is stored in another variable. The decision for every letter is
taken before proceeding to the next letter; this is a technique of local
classification. It avoids the need for a search algorithm, which is generally
required to find the globally optimal solution.
Table 2. Phoneme and Grapheme Representation
sh, ss, ch
Ship, Mission, Chef
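The left-to-right, decide-as-you-go conversion described above can be sketched as a greedy longest-match over a rule table. The Python code below is an illustration under stated assumptions, not the rule set used in this work: the `RULES` table is a tiny hypothetical sample, the phoneme labels are assumed, and position-dependent rules are not modeled.

```python
# Hypothetical grapheme -> phoneme-label rules. A real rule set would be far
# larger and, as the text notes, would also depend on letter position.
RULES = {
    "sh": "sh",
    "oy": "oy",
    "oi": "oy",
    "c": "k",
    "k": "k",
    "t": "t",
    "n": "n",
    "p": "p",
    "i": "i",
}
MAX_GRAPHEME_LEN = max(len(g) for g in RULES)


def graphemes_to_phonemes(word):
    """Scan left to right, committing to the longest matching grapheme each step."""
    word = word.lower()
    labels = []
    pos = 0
    while pos < len(word):
        for size in range(MAX_GRAPHEME_LEN, 0, -1):
            chunk = word[pos:pos + size]
            if chunk in RULES:
                labels.append(RULES[chunk])
                pos += len(chunk)
                break
        else:
            # No rule matched at this position.
            raise ValueError(f"no rule for {word[pos]!r} in {word!r}")
    return labels


print(graphemes_to_phonemes("Toy"))   # ['t', 'oy']
print(graphemes_to_phonemes("Coin"))  # ['k', 'oy', 'n']
```

Each decision is final once made, so no backtracking or global search is involved; the cost is that a purely local rule table cannot resolve spellings whose pronunciation depends on wider context.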
After grapheme to phoneme conversion of the input text, the phonemic
representation is compared with the keys (recorded phoneme labels) of the map
data structure. If the labels in this representation are present as keys, the
values (phoneme sounds) corresponding to the respective phoneme labels are
fetched. Since all these speech units (phonemes) are just column vectors, their
constituent elements are placed one after another and stored in another vector.
This is how concatenation is done. In this way, all the words of the input text
are played by selecting the phonemes and placing the phoneme vectors one after
another.
Input text: Coin
Phoneme sequence: /k/ /oy/ /n/
For the input text ‘Coin’, its phoneme sequence /k/ /oy/ /n/ is used to obtain
the corresponding phoneme sound files. These sound files are concatenated to
obtain the sound file for the word ‘Coin’.
Fig. 4. Waveform of Phoneme /k/
Fig. 5. Waveform of Phoneme /oy/
Fig. 6. Waveform of Phoneme /n/
Fig. 7. Waveform of word ‘Coin’ after Concatenation
Fig. 8. Waveform of Coin utterance
The waveforms of the concatenated word (Fig. 7) and of the original utterance
(Fig. 8) of Coin are compared, and some similarities are found. The
concatenated sound is close to the original sound. The degree of similarity
increases with the precision with which the 44 phonemes are extracted.
In this work, an English text to speech synthesis system using phoneme based
concatenative synthesis is developed. The system is implemented using the
MATLAB map data structure and simple matrix operations. Hence, this method is
simpler to implement than methods that involve complex algorithms and
techniques. As English phonemes are used as speech units, less memory is
required than with longer units. To bring more naturalness to the speech
output, text analysis and prosody need to be improved.
1. Dawn Reithaug, Orchestrating Success in Reading (2002).
2. MATLAB and Simulink for Technical Computing, www.mathworks.com
3. Boersma & David, Praat: doing phonetics by computer [Computer program],
http://www.praat.org/
4. Dr. Shaila D. Apte, Speech and Audio Processing,
5. Narendra, N.P., Rao, K.S., Ghosh, K. et al. Int J
Speech Technol (2011) 14: 167. https://doi.org/10.1007/s10772-011-9094-4.
6. Panda, S.P.
& Nayak, A.K. Int J Speech Technol (2017) 20: 959. https://doi.org/10.1007/s10772-017-9463-8.
7. Mrs. S. D. Suryawanshi, Mrs. R. R. Itkarkar and Mr. D. T. Mane, “High
Quality Text to Speech Synthesizer using Phonetic Integration”, International
Journal of Advanced Research in Electronics and Communication Engineering
(IJARECE) Volume 3, Issue 2, February 2014.
8. Bisani, M., Ney, H., “Joint-Sequence Models for Grapheme-to-Phoneme
Conversion”, Speech Communication (2008), doi: 10.1016/j.specom.2008.01.002.
9. Kumar Patra, Biplab Patra and Puspanjali Mohapatra, “Text to Speech
Conversion with Phonematic Concatenation”, International Journal of Electronics
Communication and Computer Technology (IJECCT) Volume 2 Issue 5 (September
10. R. Shantha selva kumari, R. Sangeetha, “Conversion of English text to
speech (TTS) using Indian speech signal”, IJSET, Vol. 4, issue No. 8, pp:
447-450, Aug 2015.
11. Mr. S. D. Shirbahadurkar and Dr. D. S. Bormane, “Marathi Language Speech
Synthesizer Using Concatenative Synthesis Strategy (Spoken in Maharashtra,
India)”, 2009 Second International Conference on Machine Vision.
12. Deepshikha Mahanta, Bidisha Sharma, Priyankoo Sarmah, S R Mahadeva
to Speech Synthesis System in Indian English”, Region 10
Conference (TENCON), 2016 IEEE.
13. Hari Krishnan, Sree & Thomas, Samuel & Bommepally, Kartik &
Jayanthi, Karthik & Raghavan, Hemant & Murarka, Suket & Murthy,
Hema & Group, Tenet, “Design and Development of a Text-To-Speech Synthesizer
for Indian Languages”.
14. S. D. Shirbahadurkar and D. S. Bormane, “Marathi Language Speech
Synthesizer Using Concatenative Synthesis Strategy (Spoken in Maharashtra,
India)”, Second International Conference on Machine Vision, pp. 181-185, (2009).
15. Vinodh M. V., Ashwin Bellur, Badri Narayan K., Deepali M. Thakare, Anila
Susan, Suthakar N. M., Hema A. Murthy, “Using Polysyllabic units for Text to
Speech Synthesis in Indian languages”, 2010 National Conference on
Communications (NCC).
16. Joshi, Deepa Chabbi, Suman M and Suprita Kulkarni, “Text To Speech System
For Kannada Language”, 2015 International Conference on Communications and Signal