English Text to Speech Synthesizer using Concatenation Technique

Sai Sawant¹, M. S. Deshpande²
Department of Electronics and Telecommunication Engineering
Vishwakarma Institute of Technology, Pune, India
[email protected]@vit.edu

Abstract.
A text to speech synthesis (TTS) system is used to produce artificial human speech. Text in any language can be converted into a speech signal using a TTS system. This paper presents a method to design a text to speech synthesis system for the English language using MATLAB. Simple matrix operations and the container map data structure available in MATLAB are used to design this system. Phoneme concatenation is performed to obtain the speech signal for the input text. Initially, some words are recorded that contain all the phonemes of the English language. Phonemes are extracted from these recorded words using the PRAAT tool. The extracted phonemes are compared with the input text phonemes and then concatenated sequentially to reconstruct the desired words. Implementation of this method is simple and requires little memory.
Keywords: Text to speech, English, Phonetic concatenation, MATLAB, PRAAT Tool

1 Introduction

A text to speech system transforms linguistic information present in the form of data or text into a speech signal. TTS acts as an interface between digital content and a wider population, such as people with literacy difficulties, learning disabilities, or reduced vision, and those learning a language. It is helpful for people who are looking for simple ways to access digital content, and it can be used in telecommunication, industrial, and educational applications.

Synthetic speech can be formed by concatenating recorded speech units that are stored in a database. Systems using the concatenation technique for synthesis differ in the size of the stored speech units. Phones, diphones, syllables, etc.
can be used as speech units. A system that uses phones or diphones provides the largest output range. For some domains, the use of entire words or sentences allows a high quality speech signal. A synthesizer can also incorporate a model of the vocal tract and other human voice characteristics to form a completely synthetic voice output.

The overall work is summarized as follows: Section 2 gives a brief description of concatenative synthesis and its subtypes. Section 3 provides the flow diagram of the implemented TTS. Sections 4 and 5 describe the implementation steps and experimental results, respectively.
Section 6 concludes the discussion by summarizing the findings and explaining the future direction of the work.

2 Concatenative Synthesis

Concatenative synthesis is the concatenation of segments of recorded speech. This synthesis technique is simple to implement, as it does not involve any mathematical model. Speech is produced using natural, human speech: concatenation of prerecorded speech utterances produces understandable and natural-sounding synthesized speech. Concatenation can be done using different sizes of stored speech units. There are four subtypes of this synthesis method, depending on the speech unit size and use [4]:

1. Unit selection synthesis
2. Domain specific synthesis
3. Diphone synthesis
4. Phoneme based synthesis

The most important aspect of concatenative synthesis is to select the correct unit length.
With selection of a longer speech unit, high naturalness and fewer concatenation points are achievable, but the number of required units and the memory needed increase. For shorter units, less memory is needed, but the sample collection and labeling procedures become difficult and complex [10]. The present system is implemented using phonemes as speech units.

2.1 Phoneme based Synthesis

In this synthesis technique, a sequential combination of phonemes is used to synthesize the desired continuous speech signal. For extraction of phonemes, different words need to be recorded that contain all possible phonemes of the desired TTS system language. From these recorded word utterances, phonemes of specified duration are extracted. This creates a database of extracted phoneme sounds. Whenever a word is to be synthesized, the corresponding phonemes are fetched from the database and concatenated to obtain the required word sound. The following figure shows how phoneme based synthesis is performed.

Fig. 1. Phoneme based Speech Synthesis

3 Methodology

Fig. 2. Flow Chart showing the Methodology

4 Implementation of Text to Speech System

4.1 Recording of Words

Different English words are recorded by a single speaker using a Voice Recorder application for Android phones. Word selection is done in such a way that the words cover all the phonemes present in the English language.

4.2 Extraction of Phonemes

The 44 phonemes of the English language are considered as speech units for concatenation.
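The slicing step itself is mechanical once segment boundaries have been marked in PRAAT. As an illustration only (the paper's own extraction is done interactively in the PRAAT TextGrid editor), a Python sketch with made-up sample rate and boundary times might look like:

```python
import numpy as np

# Illustrative sketch: slice one phoneme out of a recorded word,
# given segment boundaries marked during annotation.
# The sample rate, signal, and boundary times below are made-up examples.
fs = 16000                 # assumed sampling rate of the recording (Hz)
word = np.zeros(fs)        # stand-in for a 1-second recording of "cat"

def extract_segment(signal, fs, t_start, t_end):
    """Return the samples between t_start and t_end (in seconds)."""
    return signal[round(t_start * fs):round(t_end * fs)]

# e.g. /k/ annotated from 0.05 s to 0.15 s in the TextGrid
k_sound = extract_segment(word, fs, 0.05, 0.15)   # 1600 samples
```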
These phonemesare selected from the source, Orchestrating Success in Reading by Dawn Reithaug(2002) 1. The sounds pertaining to these 44 phonemes form a database forcreating any English word in a standard lexicon. So these 44 phoneme sounds areextracted from recorded words using existing PRAAT tool. This tool can be usedto segment recorded words into its constituents such as syllables, phonemesetc.
The TextGrid editor of PRAAT tool is used for segmenting recorded soundsand labeling the segments 3. Hence, with the help of this tool, words are segmentedand annotated to obtain phonemes as shown in the following examples:Table1. English Phonemes with Examples Phonemes Example Words a Hat, Map, Cat ae Train, Eight, Day ee Key, Sweet oy Toy, Coin Fig. 3. Extraction of phoneme /k/ from wordCat using PRAAT4.3 Creation ofPhoneme DatabaseMATLAB has Containers package with a Map class. Mapobject which is an instance of the MATLAB containers.
Map class is used. AMapobject is a data structure that allows retrieving values using a correspondingkey. Keys can be real numbers or character vectors and provide more flexibilityfor data access than array indices and must be positive integers.
Values can bein the form of scalar or non-scalar arrays 2. Using this data structure,extracted phoneme sounds are taken as values and keys are nothing but theirlabels. So every unique label or annotation corresponds to particular phonemesound. This forms the key-value pair of phonemes and their respectiveannotations.4.4 Grapheme to Phoneme ConversionThis process isused to generate a pronunciation for a word using certain rules.
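As background for the lookup performed after conversion, the phoneme database of Section 4.3 (a MATLAB containers.Map from label to waveform) can be mimicked with a Python dictionary. The arrays below are stand-ins, not real recordings, and the lengths are arbitrary:

```python
import numpy as np

# Python analogue of the MATLAB containers.Map of Section 4.3:
# keys are phoneme labels, values are the extracted phoneme waveforms.
# Stand-in arrays replace the PRAAT-segmented recordings here.
phoneme_db = {
    "k":  np.full(800, 0.1),    # illustrative lengths in samples
    "oy": np.full(2400, 0.2),
    "n":  np.full(1200, 0.3),
}

def lookup(label):
    """Fetch a phoneme waveform by its label (key)."""
    return phoneme_db[label]

print(len(lookup("oy")))   # prints 2400
```

As in containers.Map, each unique label retrieves exactly one stored waveform, so the database stays small: one short vector per phoneme rather than one per word.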
The job of a grapheme to phoneme algorithm is to convert a letter string like 'Toy' into a phone string like /t/ /oy/. The position of a letter in the given word is considered when designing the rules. The input sequence is processed sequentially, i.e., from left to right. For each input word, a sequence of phoneme labels is selected. Every time a match occurs between an input letter (or group of letters) and a phoneme label, the phonemic representation is stored in another variable. The decision for every letter is taken before proceeding to the next letter; this is a technique of local classification. It avoids the need for a search algorithm, which is generally required to find the globally optimal solution.

Table 2. Phoneme and Grapheme Representation

  Phoneme   Grapheme      Example Words
  /b/       b, bb         Bag, Rubber
  /sh/      sh, ss, ch    Ship, Mission, Chef
  /e/       e, ea         Bed, Head
  /ch/      ch, tch       Chip, Match
  /ow/      ow, ou        Cow, Out

4.5 Concatenation

After grapheme to phoneme conversion of the input text, the phonemic representation is compared with the keys (recorded phoneme labels) of the map data structure. If the representation matches the given keys, the values (phoneme sounds) corresponding to the respective phoneme labels are fetched.
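The left-to-right local-classification scan described above, followed by the fetch-and-join step, can be sketched in Python. The rule table and waveform lengths are illustrative assumptions, not the authors' full rule set:

```python
import numpy as np

# Sketch of left-to-right grapheme-to-phoneme conversion (Section 4.4),
# deciding each letter group before moving on (local classification),
# then concatenating the fetched phoneme vectors.
# Tiny illustrative rule table: letter group -> phoneme label.
G2P_RULES = {"oi": "oy", "oy": "oy", "c": "k", "n": "n", "t": "t"}

def grapheme_to_phoneme(word):
    """Scan left to right, preferring the longer matching letter group."""
    word = word.lower()
    phonemes, i = [], 0
    while i < len(word):
        if word[i:i + 2] in G2P_RULES:        # try a two-letter group first
            phonemes.append(G2P_RULES[word[i:i + 2]])
            i += 2
        else:                                  # fall back to a single letter
            phonemes.append(G2P_RULES[word[i]])
            i += 1
    return phonemes

# Stand-in phoneme database (label -> waveform vector), lengths arbitrary
db = {"k": np.zeros(800), "oy": np.zeros(2400), "n": np.zeros(1200)}

labels = grapheme_to_phoneme("Coin")               # ['k', 'oy', 'n']
speech = np.concatenate([db[p] for p in labels])   # vectors placed end to end
```

Because each decision is made locally, no search over alternative segmentations is needed, which is what keeps the method simple.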
Since all these speech units (phonemes) are just column vectors, their constituent elements are placed one after another and stored in another vector. This is how concatenation is done. In this way, all the words from the input text are played by selecting the phonemes and placing the phoneme vectors one after another.

5 Experimental Results

Input text: Coin
Phoneme sequence: /k/ /oy/ /n/

For the input text 'Coin', its phoneme sequence /k/ /oy/ /n/ is used to obtain the corresponding phoneme sound files. These sound files are concatenated to obtain the sound file for the word 'Coin'.

Fig. 4. Waveform of Phoneme /k/
Fig. 5. Waveform of Phoneme /oy/
Fig. 6. Waveform of Phoneme /n/
Fig. 7. Waveform of word 'Coin' after Concatenation
Fig. 8. Waveform of Coin utterance

The waveforms for the originally uttered and the concatenated word 'Coin' (Fig. 8 and Fig. 7) are compared, and clear similarities are found. The concatenated sound is close to the original sound, and the degree of similarity increases with the precision of extracting the 44 phonemes.

6 Conclusion

In this work, an English text to speech synthesis system using phoneme based concatenative synthesis is developed.
The system is implemented using the MATLAB map data structure and simple matrix operations. Hence, this method is simple and efficient to implement, unlike other methods that involve complex algorithms and techniques. As English phonemes are used as speech units, less memory is required. In order to bring more naturalness to the speech output, text analysis and prosody need to be improved.

7 References

1. Dawn Reithaug, Orchestrating Success in Reading, 2002.
2. MathWorks: MATLAB and Simulink for Technical Computing, www.mathworks.com
3. Paul Boersma & David Weenink (2013), Praat: doing phonetics by computer [Computer program], http://www.praat.org/
4. Dr. Shaila D. Apte, Speech and Audio Processing, Wiley-India, 2012.
5. Narendra, N. P., Rao, K. S., Ghosh, K., et al., Int J Speech Technol (2011) 14: 167.
6. Panda, S. P. & Nayak, A. K., Int J Speech Technol (2017) 20: 959. https://doi.org/10.1007/s10772-017-9463-8
7. S. D. Suryawanshi, R. R. Itkarkar, and D. T. Mane, "High Quality Text to Speech Synthesizer using Phonetic Integration", International Journal of Advanced Research in Electronics and Communication Engineering (IJARECE), Volume 3, Issue 2, February 2014.
8. Bisani, M., Ney, H., "Joint-Sequence Models for Grapheme-to-Phoneme Conversion", Speech Communication (2008), doi: 10.1016/j.specom.2008.01.002
9. Tapas Kumar Patra, Biplab Patra, and Puspanjali Mohapatra, "Text to Speech Conversion with Phonematic Concatenation", International Journal of Electronics Communication and Computer Technology (IJECCT), Volume 2, Issue 5, September 2012.
10. R. Shantha Selva Kumari, R. Sangeetha, "Conversion of English text to speech (TTS) using Indian speech signal", IJSET, Vol. 4, Issue 8, pp. 447-450, August 2015.
11. S. D. Shirbahadurkar and D. S. Bormane, "Marathi Language Speech Synthesizer Using Concatenative Synthesis Strategy (Spoken in Maharashtra, India)", Second International Conference on Machine Vision, 2009.
12. Deepshikha Mahanta, Bidisha Sharma, Priyankoo Sarmah, S. R. Mahadeva Prasanna, "Text to Speech Synthesis System in Indian English", Region 10 Conference (TENCON), IEEE, 2016.
13. Hari Krishnan, Sree & Thomas, Samuel & Bommepally, Kartik & Karthik & Raghavan, Hemant & Murarka, Suket & Murthy, Hema & Group, Tenet, "Design and Development of a Text-To-Speech Synthesizer for Indian Languages".
14. S. D. Shirbahadurkar and D. S. Bormane, "Marathi Language Speech Synthesizer Using Concatenative Synthesis Strategy (Spoken in Maharashtra, India)", Second International Conference on Machine Vision, pp. 181-185, 2009.
15. Vinodh M. V., Ashwin Bellur, Badri Narayan K., Deepali M. Thakare, Anila Susan, Suthakar N. M., Hema A. Murthy, "Using Polysyllabic units for Text to Speech Synthesis in Indian languages", 2010 National Conference on Communications (NCC).
16. Anusha Joshi, Deepa Chabbi, Suman M, and Suprita Kulkarni, "Text To Speech System For Kannada Language", 2015 International Conference on Communications and Signal Processing (ICCSP).