KazakhTTS: An Open-Source Kazakh Text-to-Speech Synthesis Dataset
Saida Mussakhojayeva, Aigerim Janaliyeva, Almas Mirzakhmetov, Yerbolat Khassanov, Huseyin Atakan Varol
Date: 2021-04-17
DOI: 10.21437/interspeech.2021-2124

This paper introduces a high-quality open-source speech synthesis dataset for Kazakh, a low-resource language spoken by over 13 million people worldwide. The dataset consists of about 93 hours of transcribed audio recordings spoken by two professional speakers (female and male). It is the first publicly available large-scale dataset developed to promote Kazakh text-to-speech (TTS) applications in both academia and industry. In this paper, we share our experience by describing the dataset development procedures and the challenges faced, and we discuss important future directions. To demonstrate the reliability of our dataset, we built baseline end-to-end TTS models and evaluated them using the subjective mean opinion score (MOS) measure. Evaluation results show that the best TTS models trained on our dataset achieve MOS above 4 for both speakers, which makes them applicable for practical use. The dataset, training recipe, and pretrained TTS models are freely available.

Text-to-speech (TTS) systems are essential in many applications, such as navigation, announcements, smart assistants, and other speech-enabled devices. For any language, TTS is also necessary to ensure accessibility for the visually impaired and to enable human-machine interaction without requiring visual and tactile interfaces [1]. To build a robust TTS system, a sufficiently large and high-quality speech dataset is required. To address this need, we developed a large-scale open-source speech dataset for the Kazakh language. We named our dataset KazakhTTS; it is primarily geared towards building TTS systems.

Kazakh is the official language of Kazakhstan, and it is spoken by over 13 million people worldwide 2, including in countries such as China and Russia. It is an agglutinative language with vowel harmony, belonging to the Turkic language family. Kazakh is a low-resource language and is considered to be endangered due to multiple factors, primarily the dominance of the Russian and English languages in education and administration [2]. Therefore, there is a growing awareness of the importance of increasing the number of Kazakh speakers, reflected in many language rescue initiatives launched by the government.

Currently, there is no Kazakh speech dataset of sufficient size and quality for building TTS systems, especially the recently proposed end-to-end (E2E) neural architectures [3, 4, 5, 6, 7, 8, 9]. This work aims to fill this gap by introducing the KazakhTTS dataset. To the best of our knowledge, it is the first large-scale open-source dataset developed for building Kazakh TTS systems. Our dataset contains around 93 hours of high-quality speech data read by two professional speakers (36 hours by the female speaker and 57 hours by the male speaker). The dataset was carefully annotated by native transcribers and covers most of the daily-use Kazakh words. KazakhTTS is freely available for both academic and commercial use under the Creative Commons Attribution 4.0 International License 3.

1 https://github.com/IS2AI/Kazakh_TTS
2 https://www.ethnologue.com/language/kaz
With the help of KazakhTTS, we plan to promote the use of the Kazakh language in speech-based digital technologies and to advance research in Kazakh speech processing. We believe that our dataset will be a valuable resource for the TTS research community, and that our experience will benefit other researchers planning to develop speech datasets for low-resource languages. Although the primary application domain of KazakhTTS is speech synthesis, it can also be used to aid other related applications, such as automatic speech recognition (ASR) and speech-to-speech translation.

To demonstrate the reliability of KazakhTTS, we built two baseline Kazakh E2E-TTS systems based on the Tacotron 2 [7] and Transformer [9] architectures for each speaker. We evaluated these systems using the subjective mean opinion score (MOS) measure. The experimental results showed that the best TTS models built using our dataset achieve 4.5 and 4.1 in MOS for the female and male speakers, respectively, which confirms their usability in practical applications. In addition, we performed a manual analysis of synthesized sentences to identify the most frequent error types. The dataset, reproducible recipe, and pretrained models are publicly available 1.

The rest of the paper is organized as follows. Section 2 briefly reviews related work on TTS dataset construction. Section 3 describes the KazakhTTS construction procedures and reports the dataset specifications. Section 4 explains the TTS experimental setup and discusses the obtained results. Section 5 concludes the paper and highlights future research directions.

The recent surge of speech-enabled applications, such as virtual assistants for smart devices, has attracted substantial attention from both academia and industry to TTS research [3, 4, 5]. Consequently, many large-scale datasets have been collected [10, 11, 12, 13], and challenges have been organized to systematically compare different TTS technologies [14]. However, these datasets and competitions are mostly restricted to resource-rich languages, such as English and Mandarin.

To create a speech corpus suitable for developing TTS systems in low-resource settings, methods based on unsupervised [15], semi-supervised [16], and cross-lingual transfer learning [17] have been developed. In these approaches, raw audio files are automatically annotated using other systems, such as ASR, or resources from other languages are utilized, especially from phonologically close languages. Although TTS systems produced using these approaches have achieved good results, their quality is usually insufficient for practical applications. Furthermore, these approaches require some in-domain data or other systems, such as ASR, speech segmentation, and speaker diarisation, which might be unavailable for low-resource languages. An alternative approach is to record and manually annotate audio recordings. This is considered costly, since cumbersome manual work is required. Nevertheless, the resulting dataset is more reliable.

Recently, several Kazakh speech corpora have been developed to accelerate speech processing research in this language. For example, the first open-source Kazakh speech corpus (KSC), containing over 332 hours of transcribed audio recordings, was presented in [18]. However, the KSC was mainly constructed for ASR applications and was thus crowdsourced from many different speakers, with various background noises and speech disfluencies kept to resemble real-world scenarios.
As such, the quality of the audio recordings in the KSC is insufficient for building robust TTS models. Additionally, in the KSC, the amount of recordings contributed by any single speaker is small. Similarly, the other existing Kazakh speech datasets are either unsuitable or publicly unavailable [19, 20].

The KazakhTTS project was conducted with the approval of the Institutional Research Ethics Committee of Nazarbayev University. Each speaker participated voluntarily and was informed of the data collection and use protocols through a consent form.

We started the dataset construction process with textual data collection. In particular, we manually extracted articles in chronological order from news websites to diversify the topic coverage (e.g., politics, business, sports, entertainment, and so on) and to eliminate defects peculiar to web crawlers. The extracted articles were manually filtered to exclude inappropriate content (e.g., sensitive political issues, user privacy, and violence) and stored in DOC format, which was convenient for the professional speakers (i.e., font size, line spacing, and typeface could be adjusted to each speaker's preferences). In total, we collected over 2,000 articles of different lengths.

To narrate the collected articles, we first conducted an audition among multiple candidates, from which one female and one male professional speaker were chosen. The speaker details, including age, working experience as a narrator, and recording device information, are provided in Table 1. Due to the COVID-19 pandemic, we could not invite the speakers to our laboratory for data collection. Therefore, the speakers were allowed to record audio in their own studios at home or at the office. The speakers were instructed to read the texts in a quiet indoor environment at their own natural pace and style. Additionally, they were asked to follow orthoepic rules, pause at commas, and use appropriate intonation for sentences ending with a question mark. The female speaker recorded at her office studio; her recordings were sampled at 44.1 kHz and stored using 16 bits per sample. The male speaker recorded at a home studio; his recordings were sampled at 48 kHz and stored using 24 bits per sample. In total, the female and male speakers read around 1,000 and 1,250 articles, respectively, of which around 300 overlap.

We hired five native Kazakh transcribers to segment the recorded audio files into sentences and align them with the text using the Praat toolkit [21]. The texts were represented using the Cyrillic alphabet consisting of 42 letters 4. In addition to letters, the transcripts also contain punctuation marks, such as the period ('.'), comma (','), hyphen ('-'), question mark ('?'), exclamation mark ('!'), and so on. The transcribers were instructed to remove segments with incorrect pronunciation or background noise, trim long pauses at the beginning and end of the segments, and convert numbers and special characters (e.g., '%', '$', '+', and so on) into their written forms. To ensure uniform quality of work among the transcribers, we assigned a linguist to randomly check the completed tasks and to organize regular "go through errors" sessions.

To guarantee high quality, we additionally inspected the segmented sentences using our ASR system trained on the KSC dataset [18]. Specifically, we used the ASR system to transcribe the segments and compared the recognized transcripts with the corresponding manually annotated transcripts. Segments with a high character error rate (CER) were regarded as incorrectly transcribed and were thus rechecked by the linguist.
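To make this quality-control step concrete, the following Python sketch illustrates how segments can be flagged by comparing ASR hypotheses against the manual transcripts using CER. The asr_transcribe callable and the 0.10 threshold are hypothetical placeholders for illustration only; they are not the exact components or values used in our pipeline.

```python
# Minimal sketch of the CER-based transcript check described above.
# asr_transcribe() and the 0.10 threshold are illustrative assumptions.

def char_error_rate(reference: str, hypothesis: str) -> float:
    """Character-level Levenshtein distance, normalized by the reference length."""
    ref, hyp = list(reference), list(hypothesis)
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)


def flag_suspicious_segments(segments, asr_transcribe, cer_threshold=0.10):
    """Return IDs of segments whose ASR hypothesis disagrees strongly with the manual transcript."""
    flagged = []
    for seg_id, wav_path, manual_text in segments:
        hypothesis = asr_transcribe(wav_path)  # hypothetical wrapper around an ASR system
        if char_error_rate(manual_text, hypothesis) > cer_threshold:
            flagged.append(seg_id)
    return flagged
```

Flagged segments can then be routed back to the linguist for manual re-checking, as done in our workflow.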
Lastly, we filtered out all segments containing international words written using a non-Cyrillic alphabet, because speakers usually do not know how to pronounce such words correctly.

The overall statistics of the constructed dataset are provided in Table 2. In total, the dataset contains around 93 hours of audio, consisting of over 42,000 segments. The distributions of segment lengths and durations are shown in Figure 1. The whole dataset creation process took around five months, and the uncompressed dataset size is around 15 GB.

The KazakhTTS dataset is organized as follows. The resources of the two professional speakers are stored in two separate folders. Each folder contains a single metadata file and two subfolders containing the audio recordings and their transcripts. The audio and corresponding transcript filenames are the same, except that the audio recordings are stored as WAV files, whereas the transcripts are stored as TXT files using UTF-8 encoding. The naming convention for both WAV and TXT files is as follows: source articleID segmentID. The audio recordings of both speakers have been downsampled to 22.05 kHz, with samples stored as 16-bit signed integers. The metadata contains speaker information, such as age, gender, working experience, and recording device.

To demonstrate the utility of the constructed dataset, we built the first Kazakh E2E-TTS systems and evaluated them using the subjective MOS measure. We used the ESPnet-TTS toolkit [22] to build the E2E-TTS models based on the Tacotron 2 [7] and Transformer [9] architectures. Specifically, we followed the LJ Speech [10] recipe and used the latest ESPnet-TTS developments to configure our model building recipe. The input to each model is a sequence of characters 5 consisting of 42 letters and 5 symbols ('.', ',', '-', '?', '!'), and the output is a sequence of acoustic features (80-dimensional log Mel-filter bank features). To transform these acoustic features into time-domain waveform samples, we tried different approaches, such as the Griffin-Lim algorithm [23], WaveNet [24], and WaveGAN [25] vocoders. We found WaveGAN to perform best in our case, and it was used in our final E2E-TTS systems. We did not apply any additional speech preprocessing, such as filtering, normalization, and so on.

5 Due to the strong correspondence between word spelling and pronunciation in Kazakh, we did not convert graphemes to phonemes.

In the Tacotron 2 system, the encoder module was modelled as a single bi-directional LSTM layer with 512 units (256 units in each direction), and the decoder module was modelled as a stack of two unidirectional LSTM layers with 1,024 units. The parameters were optimized using the Adam algorithm [26] with an initial learning rate of 10^-3 for 200 epochs. To regularize the parameters, we set the dropout rate to 0.5.

The Transformer system was modelled using six encoder and six decoder blocks. The number of heads in the self-attention layer was set to 8, with 512-dimensional hidden states, and the feed-forward network dimension was set to 1,024. The model parameters were optimized using the Adam algorithm with an initial learning rate of 1.0 for 200 epochs. The dropout rate was set to 0.1. For each speaker, we trained separate E2E-TTS models (i.e., single-speaker models). All models were trained on Tesla V100 GPUs running on an NVIDIA DGX-2 server.
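For readability, the two configurations described above can be summarized in the following Python sketch. The dictionary keys are informal names chosen for this summary and do not correspond to exact ESPnet configuration fields; the exact recipes are in our GitHub repository.

```python
# Informal summary of the hyper-parameters reported in the text.
# Key names are illustrative and not literal ESPnet configuration fields.
TACOTRON2_CONFIG = {
    "encoder": "1 x BiLSTM, 512 units (256 per direction)",
    "decoder": "2 x unidirectional LSTM, 1024 units",
    "optimizer": "Adam",
    "initial_lr": 1e-3,
    "epochs": 200,
    "dropout": 0.5,
}

TRANSFORMER_CONFIG = {
    "encoder_blocks": 6,
    "decoder_blocks": 6,
    "attention_heads": 8,
    "hidden_dim": 512,
    "feedforward_dim": 1024,
    "optimizer": "Adam",
    "initial_lr": 1.0,  # as reported in the text
    "epochs": 200,
    "dropout": 0.1,
}

# Shared front end: character inputs (42 Cyrillic letters + 5 punctuation symbols)
# mapped to 80-dimensional log Mel-filter bank targets at 22.05 kHz.
INPUT_SYMBOLS = 42 + 5
N_MELS = 80
SAMPLE_RATE = 22_050
```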
More details on the model specifications and training procedures are provided in our GitHub repository 1.

To assess the quality of the synthesized recordings, we conducted a subjective evaluation using the MOS measure. We performed a separate evaluation session for each speaker. In each session, the following three systems were compared: 1) Ground truth (i.e., natural speech), 2) Tacotron 2, and 3) Transformer. As an evaluation set, we randomly selected 50 sentences of various lengths from each speaker. These sentences were not used to train the models. The selected sentences were manually checked to ensure that each of them is a single complete sentence and that the speaker read it well (i.e., without disfluencies, mispronunciations, or background noise). The number of listeners was 50 in both evaluation sessions 6. The listeners were instructed to assess the overall quality, use headphones, and sit in a quiet environment 7.

The evaluation sessions were conducted through the instant messaging platform Telegram [27], since it is difficult to find native Kazakh listeners on other well-known platforms, such as Amazon Mechanical Turk [28]. We developed two separate Telegram bots, one for each speaker. The bots first presented a welcome message with the instructions and then started the evaluation process. During the evaluation, the bots sent a sentence recording together with its transcript to a listener and received the corresponding evaluation score (see Figure 2). The recordings were rated using a 5-point Likert scale: 5 for excellent, 4 for good, 3 for fair, 2 for poor, and 1 for bad. Note that, to send audio recordings in Telegram, we had to convert them into MP3 format.

We attracted listeners to the evaluation sessions by advertising the project in social media, news, and open messaging communities on WhatsApp and Telegram. The listeners were allowed to listen to a recording several times, but they were not allowed to alter previous ratings once submitted. Additionally, the Telegram bots kept track of the listeners' status and ID. As a result, the listeners could take a break and continue at a later time, and they were prevented from participating in an evaluation session more than once. For all listeners, the evaluation recordings were presented in the same order and one at a time. However, at each step, the bots randomly decided from which system to pick the recording. As a result, each listener heard each recording only once, and all systems were exposed to all listeners. Each recording was rated at least 8 and 10 times for the female and male speakers, respectively. At the end of the evaluation, the bots thanked the listeners and invited them to fill in an optional questionnaire asking about their age, region (where the listener grew up and learned the Kazakh language), and gender. The questionnaire results showed that the listeners varied in gender and region, but not in age (most of them were under 20).

The subjective evaluation results are given in Table 3. According to the results, the best performance is achieved by the Ground truth, as expected, followed by Tacotron 2 and then the Transformer system, for both speakers. Importantly, the best-performing models for both speakers achieved above 4 in the MOS measure and are not far from the Ground truth, i.e., a 4% and 5% relative reduction for the female and male speakers, respectively. These results demonstrate the utility of our KazakhTTS dataset for TTS applications.
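The MOS values in Table 3 are arithmetic means of the collected 1-5 ratings per system. Below is a minimal aggregation sketch; the (listener, system, sentence, score) record format is a hypothetical stand-in for the actual Telegram bot logs, which are not released in that form.

```python
# Minimal MOS aggregation sketch over hypothetical rating records.
from collections import defaultdict
from statistics import mean, stdev

def mean_opinion_scores(ratings):
    """Average the 1-5 Likert ratings per system; also report spread and count."""
    per_system = defaultdict(list)
    for listener_id, system, sentence_id, score in ratings:
        per_system[system].append(score)
    return {
        system: (mean(scores),
                 stdev(scores) if len(scores) > 1 else 0.0,
                 len(scores))
        for system, scores in per_system.items()
    }

# Example usage with toy data:
toy_ratings = [
    ("l1", "ground_truth", "s01", 5), ("l1", "tacotron2", "s02", 4),
    ("l2", "transformer", "s01", 4), ("l2", "ground_truth", "s03", 5),
    ("l1", "tacotron2", "s03", 5), ("l2", "transformer", "s02", 3),
]
print(mean_opinion_scores(toy_ratings))
```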
We did not carry out intensive hyper-parameter tuning for our E2E-TTS models, since it is outside the scope of this work; therefore, we speculate that the quality of the models can be improved further. For example, our model training recipes are based on LJ Speech, which is tuned for a dataset containing around 24 hours of audio, whereas the amount of audio per speaker in our dataset is larger. We leave the exploration of optimal hyper-parameter settings and a detailed comparison of different TTS architectures for the Kazakh language to future work.

To identify the error types made by the E2E-TTS systems, we manually analysed the 50-sentence evaluation sets for both speakers. This analysis was conducted only for the Tacotron 2 system, which achieved a better MOS score than the Transformer. Specifically, we counted the sentences containing each error type. The analysis results are provided in Table 4, showing that the most frequent error types for both speakers are mispronunciations and incomplete words. The mispronunciation errors are mostly due to incorrect stress, and the incomplete-word errors mostly occur at the last word of a sentence, where the final letters of the word are trimmed. Interestingly, the total number of errors in the male speaker's recordings is considerably higher than in the female speaker's recordings. This might be one of the reasons for the lower MOS score achieved for the male speaker. This analysis indicates that there is still room for improvement, and future work should focus on eliminating these errors.

This paper introduced the first open-source Kazakh speech dataset for TTS applications. The KazakhTTS dataset contains over 93 hours of speech data (36 hours by the female speaker and 57 hours by the male speaker), consisting of around 42,000 recordings. We released our dataset under the Creative Commons Attribution 4.0 International License, which permits both academic and commercial use. We shared our experience by describing the dataset construction and TTS evaluation procedures, which might benefit other researchers planning to collect speech data for other low-resource languages. Furthermore, the presented dataset can also be used to study and build (e.g., pretrain) TTS systems for other Turkic languages.

To demonstrate the use of our dataset, we built E2E-TTS models based on the Tacotron 2 and Transformer architectures. The subjective evaluation results suggest that the E2E-TTS models trained on KazakhTTS are suitable for practical use. We also shared the pretrained TTS models and the ESPnet training recipes for both speakers 1. In future work, we plan to further extend our dataset by collecting more data from different domains, such as Wikipedia articles and books, and to introduce new speakers. We also plan to explore optimal hyper-parameter settings for Kazakh E2E-TTS models, compare different TTS architectures, and conduct additional analyses.
Text-to-speech synthesis
Tacotron: Towards end-to-end speech synthesis
Char2Wav: End-to-end speech synthesis
Deep Voice: Real-time neural text-to-speech
Towards end-to-end prosody transfer for expressive speech synthesis with Tacotron
Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions
FastSpeech: Fast, robust and controllable text to speech
Neural speech synthesis with Transformer network
The LJ Speech dataset
LibriTTS: A corpus derived from LibriSpeech for text-to-speech
AISHELL-3: A multi-speaker Mandarin TTS corpus and the baselines
Superseded-CSTR VCTK corpus: English multi-speaker corpus for CSTR voice cloning toolkit
The Blizzard Challenge 2005: Evaluating corpus-based speech synthesis on common datasets
Almost unsupervised text to speech and automatic speech recognition
Semi-supervised training for improving data efficiency in end-to-end speech synthesis
End-to-end text-to-speech for low-resource languages by cross-lingual transfer learning
A crowdsourced open-source Kazakh speech corpus and initial speech recognition baseline
Assembling the Kazakh language corpus
End-to-end speech recognition in agglutinative languages
Praat, a system for doing phonetics by computer
ESPnet-TTS: Unified, reproducible, and integratable open source end-to-end text-to-speech toolkit
A fast Griffin-Lim algorithm
WaveNet: A generative model for raw audio
Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram
Adam: A method for stochastic optimization
Amazon Mechanical Turk (MTurk)

We would like to thank our senior moderator Aigerim Boranbayeva for helping to train and monitor the transcribers. We also thank our technical editor Rustem Yeshpanov, PR manager Kuralay Baimenova, project coordinator Yerbol Absalyamov, and administration manager Gibrat Kurmanov for helping with other administrative and technical tasks.