2 Related Work
2.1 Neural Codec Language Models and 2.2 Non-autoregressive Models
2.3 Diffusion Models and 2.4 Zero-shot Voice Cloning
3 HierSpeech++ and 3.1 Speech Representations
3.2 Hierarchical Speech Synthesizer
4 Speech Synthesis Tasks
4.1 Voice Conversion and 4.2 Text-to-Speech
5 Experiment and Result, and 5.1 Dataset
5.2 Preprocessing and 5.3 Training
5.6 Zero-shot Voice Conversion
5.7 High-diversity but High-fidelity Speech Synthesis
5.9 Zero-shot Text-to-Speech with 1s Prompt
5.11 Additional Experiments with Other Baselines
7 Conclusion, Acknowledgement and References
3 HierSpeech++

In this study, we propose HierSpeech++, a human-level zero-shot speech synthesis model in terms of naturalness and voice similarity. We present a novel and efficient hierarchical speech synthesis framework that consists of a hierarchical speech synthesizer, a text-to-vec (TTV) model, and speech super-resolution (SpeechSR), as illustrated in Fig. 1. This framework facilitates scaling up the training of each model, in that we can simply utilize large-scale low-resolution speech data for voice cloning. The details are described in the following subsections.
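To make the flow between the three components concrete, the following is a minimal sketch of the inference pipeline, assuming hypothetical wrappers `ttv`, `hier_synthesizer`, and `speech_sr` for the TTV model, the hierarchical speech synthesizer, and SpeechSR; it mirrors the description above rather than the authors' actual code.

```python
# Minimal sketch of the HierSpeech++ inference flow (hypothetical wrappers).
import numpy as np

def synthesize(text: str,
               prosody_prompt_mel: np.ndarray,
               voice_prompt_mel: np.ndarray,
               ttv,               # text + prosody style -> semantic representation
               hier_synthesizer,  # semantic representation + voice style -> 16 kHz waveform
               speech_sr):        # 16 kHz waveform -> 48 kHz waveform
    # 1) TTV maps text to a self-supervised semantic representation,
    #    conditioned on the prosody style of the prosody prompt.
    semantic = ttv(text, prosody_prompt_mel)
    # 2) The hierarchical speech synthesizer generates a 16 kHz waveform,
    #    conditioned on the voice style of the voice prompt.
    wav_16k = hier_synthesizer(semantic, voice_prompt_mel)
    # 3) SpeechSR upsamples 16 kHz -> 48 kHz as post-processing.
    return speech_sr(wav_16k)
```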
3.1 Speech Representations

We utilize audio downsampled to 16 kHz for the speech synthesizer because the human voice frequency band mostly lies below 4 kHz, and reconstructing the voice signal requires a sampling rate of at least twice its highest frequency component. Furthermore, low-resolution ASR datasets can be used to train the speech synthesizer. For better perceptual quality, we upsample the audio from 16 kHz to 48 kHz as a post-processing step using SpeechSR. For the acoustic and semantic representations, we utilize low-resolution representations at a frame rate of 50 Hz for efficient training.
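The sketch below illustrates the resampling and frame-rate bookkeeping described above using torchaudio (not the authors' code); the hop size of 320 samples is an assumption chosen so that 16,000 / 320 = 50 frames per second.

```python
import torch
import torchaudio.functional as F

sr_in, sr_synth = 48_000, 16_000
# 1 s dummy 48 kHz tone standing in for a real recording.
wav_48k = torch.sin(2 * torch.pi * 220.0 * torch.arange(sr_in) / sr_in).unsqueeze(0)

# Downsample to 16 kHz for the synthesizer: a 4 kHz voice band needs at least
# 8 kHz sampling (Nyquist), so 16 kHz keeps the full band with margin.
wav_16k = F.resample(wav_48k, orig_freq=sr_in, new_freq=sr_synth)

# A 50 Hz representation corresponds to one frame every 20 ms of 16 kHz audio.
hop = sr_synth // 50                 # 320 samples per frame (assumed hop size)
n_frames = wav_16k.shape[-1] // hop
print(wav_16k.shape, n_frames)       # about 50 frames for 1 s of audio
```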
3.1.1 Acoustic Representation
For the acoustic representation, conventional TTS systems utilize the Mel-spectrogram as an intermediate acoustic feature, which is transformed from the waveform using a short-time Fourier transform (STFT). Recently, neural audio codecs have been investigated for TTS models, wherein the Mel-spectrogram is replaced with a learned audio codec that can be decoded back into a waveform signal. However, acoustic features comprise various attributes, including semantic information, such as pronunciation and context, as well as voice information, such as timbre, intonation, and recording environment. Hence, it is difficult to directly infer these rich representations from text because doing so exacerbates the one-to-many mapping problem, which may result in mispronunciation, over-smoothed speech, and a lack of similarity. To reduce the aforementioned problems, HierSpeech [48] adopted a self-supervised speech representation as an additional semantic representation to bridge the gap between text and acoustic features, as described in the following subsection.
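For reference, this is a minimal example of computing the Mel-spectrogram acoustic feature discussed above with torchaudio; the exact STFT and mel settings (n_fft, hop length, 80 bins) are illustrative assumptions, not the paper's configuration.

```python
import torch
import torchaudio

# STFT-based Mel-spectrogram extractor for 16 kHz audio (assumed settings).
mel_fn = torchaudio.transforms.MelSpectrogram(
    sample_rate=16_000, n_fft=1024, win_length=1024,
    hop_length=320, n_mels=80)

wav = torch.randn(1, 16_000)              # 1 s of dummy 16 kHz audio
mel = mel_fn(wav)                          # [1, 80, T] acoustic representation
log_mel = torch.log(mel.clamp(min=1e-5))   # log compression, as is common in TTS
print(log_mel.shape)
```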
3.1.2 Semantic Representation
We utilize Wav2Vec 2.0 [2] to extract a continuous semantic representation from a waveform without any label. In recent years, many studies have adopted self-supervised speech representations as semantic representations, given that the representations from the middle layers of these models contain linguistic information learned in a self-supervised manner. Meanwhile, phonetic posteriorgrams (PPG) extracted from ASR models [56] or phoneme information [49] could be a good alternative for a rich-resourced monolingual model. However, this decreases expressiveness and robustness in zero-shot voice cloning and multilingual speech synthesis scenarios. Unlike HierSpeech [48], which used XLS-R [1], we utilize massively multilingual speech (MMS) [65], a Wav2Vec 2.0 model pre-trained at a massive scale. MMS was trained on speech data from 1,406 languages, and it has been observed to perform better than XLS-R on many downstream tasks. For zero-shot cross-lingual speech synthesis, we extract the semantic representation from a middle layer of MMS.
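A hedged sketch of extracting continuous semantic features from a middle Transformer layer of a Wav2Vec 2.0-style model with HuggingFace transformers follows; the checkpoint id "facebook/mms-300m" and the layer index 7 are illustrative assumptions, not necessarily the paper's exact configuration.

```python
import torch
from transformers import Wav2Vec2Model

model_id = "facebook/mms-300m"                 # assumed MMS (Wav2Vec 2.0) checkpoint
model = Wav2Vec2Model.from_pretrained(model_id).eval()

wav = torch.randn(1, 16_000)                   # 1 s of dummy 16 kHz audio
wav = (wav - wav.mean()) / (wav.std() + 1e-7)  # zero-mean/unit-variance, standing in
                                               # for the Wav2Vec 2.0 feature extractor

with torch.no_grad():
    out = model(input_values=wav, output_hidden_states=True)

semantic = out.hidden_states[7]                # middle-layer features at roughly 50 Hz
print(semantic.shape)                          # [1, ~49, hidden_dim]
```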
3.1.3 Style Representation
We use two global style representations for the prosody and voice styles. For the prosody style representation, we extract a global prosody representation from the Mel-spectrogram of the reference prosody prompt, which conditions the TTV model. For the voice style representation, we extract a global voice representation from the Mel-spectrogram of the reference voice prompt, which conditions the hierarchical speech synthesizer. As we disentangle semantic and acoustic modeling, we can transfer the prosody and voice styles separately in the TTV model and the hierarchical speech synthesizer. Although we extract both representations from the same Mel-spectrogram, each representation is trained to reflect its corresponding characteristic because the respective target representations contain semantic and acoustic information.
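The following is a minimal sketch of a global style encoder of the kind described above: it maps a variable-length Mel-spectrogram prompt to a single style vector via temporal pooling. The specific architecture (a Conv1d stack with mean pooling) is an illustrative assumption, not the authors' exact style encoder.

```python
import torch
import torch.nn as nn

class GlobalStyleEncoder(nn.Module):
    def __init__(self, n_mels: int = 80, hidden: int = 256, style_dim: int = 256):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv1d(n_mels, hidden, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=5, padding=2), nn.ReLU())
        self.proj = nn.Linear(hidden, style_dim)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: [B, n_mels, T] -> global style vector: [B, style_dim]
        h = self.convs(mel)        # [B, hidden, T]
        h = h.mean(dim=-1)         # temporal average pooling over the prompt
        return self.proj(h)

# Two separate encoders: one for prosody style (conditions the TTV model) and
# one for voice style (conditions the hierarchical speech synthesizer).
prosody_enc, voice_enc = GlobalStyleEncoder(), GlobalStyleEncoder()
style_p = prosody_enc(torch.randn(1, 80, 120))   # prosody prompt Mel-spectrogram
style_v = voice_enc(torch.randn(1, 80, 120))     # voice prompt Mel-spectrogram
print(style_p.shape, style_v.shape)              # torch.Size([1, 256]) each
```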
This paper is available on arXiv under the CC BY-NC-SA 4.0 DEED license.
Authors:
(1) Sang-Hoon Lee, Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;
(2) Ha-Yeong Choi, Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;
(3) Seung-Bin Kim, Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;
(4) Seong-Whan Lee, Fellow, IEEE, Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea (Corresponding author).