Free human like text to speech
8/4/2023

Released by Alibaba Cloud Research Center

What is speech synthesis? As the name implies, it means transforming text into speech, which is why it is also called text-to-speech (TTS). Nowadays, self-media publishers need video dubbing, and virtual characters need voices to go with their 2D images (and even 3D models) so that they can communicate with people. With these emerging requirements, TTS must not only turn text into the corresponding speech but also make the synthesized speech more expressive, with rhythm, sound quality, and emotion closer to a real person's voice.

How can you customize a voice with human-like expression? Let's start with Dong Dong, the virtual host during the 2022 Beijing Winter Olympics.

Step 1: Speaker Selection

First of all, it is necessary to be clear about what kind of voice you need. The DAMO Academy Speech Laboratory calls this a voice portrait. “The voice sounds like an 18 or 19-year-old girl, who speaks standard Mandarin with a sweet tone and is sporty, fashionable, and lively, such as Zhang Zifeng, a Chinese actress.” Then, multiple speakers are selected to audition one or two specific sentences. Unlike the traditional approach of selecting speakers directly, the DAMO Academy Speech Laboratory uses personalized text-to-speech (PTTS) to evaluate the synthesis effect: audio for other specific texts is synthesized from the one or two audition sentences mentioned previously. Then, a comprehensive evaluation based on the original audio and the preliminary synthesis effect determines the target speaker.

The DAMO Academy Speech Laboratory invited the target speaker who met the recording requirements for Dong Dong to the studio to ensure that the audio is stable and of high quality. Then, based on the hosting material of the Winter Olympics and the content of general scenarios, the text that the speaker of Dong Dong needs to record is designed by calculating phoneme coverage. Audio recorded by the same speaker in different states may vary significantly, so recording instructions are required during the recording process. In addition, data quality inspection is required after the recording.

The audio quality determines the upper limit of the customized voice. Unlike traditional manual quality inspection, the DAMO Academy Speech Laboratory performs automated style-consistency screening based on audio features to ensure that the recorded audio is highly consistent. The amount of audio that remains after filtering is enough to ensure phoneme coverage. The filtered audio can be labeled automatically by Automatic Speech Recognition (ASR), but it also needs manual inspection and adjustment.

Step 4: Model Training, Tuning, and Deployment

To achieve a customized effect with human-like expression, the DAMO Academy Speech Laboratory developed the KAN-TTS model to make the rhythm more natural. Moreover, the HIFI-TTS model improves the sound quality of the synthesized audio. In addition, emotional TTS enriches the speaker's emotion and speaking style. Finally, by combining the three models, Dong Dong can switch her style between a formal host and a talk show host. In the comprehensive evaluation of naturalness, Dong Dong is 98% close to the original target speaker.

After the tuned TTS model is obtained, the voice of Dong Dong relies on Alibaba Cloud and can be used for all-weather voice broadcasting during the Winter Olympics.

Any refined voice that users want can be customized through these four steps, following the same process used for Dong Dong. The voice of Dong Dong is an exclusive copyright held by the customization side. However, the DAMO Academy Speech Laboratory also provides about 100 customized voices on Alibaba Cloud, made by speech synthesis and covering various scenes, voices, and accents, but without such restrictions. These customized voices include Ai Xia for customer service, Ai Yuan for literature, and Xiao Xian for live broadcast scenarios. The ultra-high-definition column offers the same voice from A Bite of China, a documentary about Chinese cuisine.
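The article says the recording text is designed "through the calculation of phoneme coverage" but does not describe the calculation itself. One common way to approach this kind of problem is a greedy selection: keep picking the candidate sentence that adds the most phonemes not yet covered. The sketch below is only an illustration under that assumption, not the DAMO Academy Speech Laboratory's actual method. The function names, the sentence IDs, and the toy phoneme lists are all hypothetical.

```python
from typing import Dict, List, Set


def select_sentences(candidates: Dict[str, List[str]], target: Set[str]) -> List[str]:
    """Greedily pick sentence IDs until the target phonemes are covered.

    `candidates` maps a sentence ID to its phoneme sequence (hypothetical data;
    a real pipeline would get this from a pronunciation lexicon or G2P tool).
    """
    if not candidates:
        return []
    uncovered = set(target)
    chosen: List[str] = []
    while uncovered:
        # Choose the sentence that covers the most still-uncovered phonemes.
        best = max(candidates, key=lambda s: len(uncovered & set(candidates[s])))
        gain = uncovered & set(candidates[best])
        if not gain:
            break  # the remaining phonemes occur in no candidate sentence
        chosen.append(best)
        uncovered -= gain
    return chosen


def coverage(chosen: List[str], candidates: Dict[str, List[str]], target: Set[str]) -> float:
    """Return the fraction of target phonemes covered by the chosen sentences."""
    if not target:
        return 1.0
    covered: Set[str] = set()
    for sentence_id in chosen:
        covered |= set(candidates[sentence_id])
    return len(covered & target) / len(target)


# Toy example: four candidate sentences with made-up phoneme sequences.
demo_candidates = {
    "s1": ["d", "o", "ng"],
    "s2": ["b", "ei", "j", "i", "ng"],
    "s3": ["d", "o", "ng", "ao"],
    "s4": ["h", "ao"],
}
demo_target = {"d", "o", "ng", "b", "ei", "j", "i", "ao", "h"}

demo_chosen = select_sentences(demo_candidates, demo_target)
print(demo_chosen)                                          # ['s2', 's3', 's4']
print(coverage(demo_chosen, demo_candidates, demo_target))  # 1.0
```

The same `coverage` function could be re-run on whatever audio survives the quality screening, to check that the filtered set still covers the target phonemes. A greedy pick is not guaranteed to find the smallest possible script, but it is simple and usually close enough for drafting a recording list.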