
HierSpeech++: A Fast and Strong Zero-Shot Speech Synthesizer for Text-to-Speech


Too Long; Didn't Read

This paper proposes HierSpeech++, a fast and strong zero-shot speech synthesizer for text-to-speech (TTS) and voice conversion (VC). We verified that hierarchical speech synthesis frameworks could significantly improve the robustness and expressiveness of the synthetic speech.

Abstract and 1 Introduction

2 Related Work

2.1 Neural Codec Language Models and 2.2 Non-autoregressive Models

2.3 Diffusion Models and 2.4 Zero-shot Voice Cloning

3 Hierspeech++ and 3.1 Speech Representations

3.2 Hierarchical Speech Synthesizer

3.3 Text-to-Vec

3.4 Speech Super-resolution

3.5 Model Architecture

4 Speech Synthesis Tasks

4.1 Voice Conversion and 4.2 Text-to-Speech

4.3 Style Prompt Replication

5 Experiment and Result, and Dataset

5.2 Preprocessing and 5.3 Training

5.4 Evaluation Metrics

5.5 Ablation Study

5.6 Zero-shot Voice Conversion

5.7 High-diversity but High-fidelity Speech Synthesis

5.8 Zero-shot Text-to-Speech

5.9 Zero-shot Text-to-Speech with 1s Prompt

5.10 Speech Super-resolution

5.11 Additional Experiments with Other Baselines

6 Limitation and Quick Fix

7 Conclusion, Acknowledgement and References

Abstract

Speech synthesis based on large language models (LLMs) has been widely adopted for zero-shot speech synthesis. However, LLM-based models require large-scale data and carry the same limitations as previous autoregressive speech models, including slow inference and a lack of robustness. This paper proposes HierSpeech++, a fast and strong zero-shot speech synthesizer for text-to-speech (TTS) and voice conversion (VC). We verified that hierarchical speech synthesis frameworks could significantly improve the robustness and expressiveness of synthetic speech. Furthermore, we significantly improve the naturalness and speaker similarity of synthetic speech even in zero-shot speech synthesis scenarios. For text-to-speech, we adopt the text-to-vec framework, which generates a self-supervised speech representation and an F0 representation from text representations and prosody prompts. HierSpeech++ then generates speech from the generated vector, F0, and voice prompt. We further introduce a highly efficient speech super-resolution framework that upsamples from 16 kHz to 48 kHz. The experimental results demonstrate that a hierarchical variational autoencoder can be a strong zero-shot speech synthesizer, as it outperforms LLM-based and diffusion-based models. Moreover, we achieved the first human-level-quality zero-shot speech synthesis. Audio samples and source code are available at https://github.com/sh-lee-prml/HierSpeechpp.


Index Terms - Text-to-Speech, Voice Conversion, Speech Super-resolution, Zero-shot Speech Synthesis, Zero-shot Voice Cloning.

1 INTRODUCTION

The advent of large language models (LLMs) has facilitated the widespread adoption of LLM-based models in speech synthesis and audio generation tasks. Conventional speech synthesis frameworks have advanced significantly, driven by the integration of new components such as neural audio codecs for discrete speech or audio units. Although LLM-based speech models continue to improve, they possess four major limitations: 1) their autoregressive generation results in slow inference and a lack of robustness, leading to repeated, skipped, and mispronounced words; 2) they depend heavily on a pre-trained neural audio codec or discrete speech units; 3) their audio quality regresses to a level below that of the strong end-to-end speech synthesis framework proposed in [35]; 4) they require a large-scale dataset for training.


VITS [35] successfully introduced an end-to-end (E2E) speech synthesis framework by adopting a variational autoencoder (VAE) augmented with normalizing flows and adversarial training. Because it generates high-quality waveform audio within a fully end-to-end training pipeline, its perceptual quality is significantly better than that of two-stage speech synthesis models such as conventional text-to-speech (TTS) models and recent codec-based speech synthesis models. HierSpeech [48] further improved reconstruction quality by adopting a hierarchical conditional VAE built on self-supervised speech representations. It also proposed a novel TTS framework that can be trained without any text transcripts by leveraging self-supervised speech representations and a hierarchical VAE. However, E2E models have limitations in zero-shot voice cloning: although they can synthesize high-quality audio, their synthetic speech still shows limited speaker similarity in zero-shot voice cloning scenarios, and their training is computationally expensive.
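To make the E2E recipe concrete, the following is a minimal, illustrative sketch of the prior-matching KL term that VITS-style conditional VAEs optimize, following the publicly available VITS formulation; it is not the authors' implementation, and the tensor names are placeholders.

```python
import torch

def kl_prior_matching_loss(z_p, logs_q, m_p, logs_p, z_mask):
    """Masked KL-style prior-matching term used in VITS-like models.

    z_p         : posterior sample mapped through the normalizing flow, f(z)
    logs_q      : log-std of the acoustic posterior q(z | audio)
    m_p, logs_p : mean / log-std of the conditional (e.g., text-side) prior
    z_mask      : 1 for valid frames, 0 for padding
    """
    kl = logs_p - logs_q - 0.5
    kl = kl + 0.5 * ((z_p - m_p) ** 2) * torch.exp(-2.0 * logs_p)
    return torch.sum(kl * z_mask) / torch.sum(z_mask)
```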


Meanwhile, diffusion-based speech synthesis models have also shown strengths in speaker adaptation. Diff-VC [64] introduced a conditional diffusion probabilistic model with a data-dependent prior for zero-shot voice conversion, and the effectiveness of diffusion models for speaker adaptation has also been demonstrated by DDDM-VC [10], Diff-HierVC [11], and UnitSpeech [32]. Although diffusion models exhibit good adaptation performance, they have several limitations: 1) their iterative generation process makes inference slow; 2) they are vulnerable to noisy data during speaker adaptation: given a noisy speech prompt, their strong adaptation ability may reproduce the noise and further degrade perceptual audio quality; 3) although diffusion models show strong generative performance, their audio quality remains lower owing to the train-inference mismatch of two-stage generation between ground-truth and generated Mel-spectrograms.


In this paper, we present HierSpeech++, a fast and strong zero-shot speech synthesis model that uses a hierarchical speech synthesis framework. We adopt the E2E speech synthesis framework to take advantage of high-quality waveform generation, and we address its weakness in style adaptation by using a self-supervised speech representation as a semantic representation and bridging the gap between semantic and acoustic representations hierarchically. We propose a novel speech synthesis framework consisting of a hierarchical speech synthesizer, text-to-vec (TTV), and speech super-resolution (SpeechSR).


Based on HierVST [45], we introduce an improved hierarchical speech synthesizer using a hierarchical conditional VAE. To improve audio quality beyond perceptual quality, we adopt a dual-audio acoustic encoder to enhance the acoustic posterior and utilize a BigVGAN-based hierarchical adaptive generator with conditional and unconditional generation for better out-of-distribution generalization (zero-shot voice cloning). In addition, we adopt a source-filter-theory-based multi-path semantic encoder to disentangle speech components and enhance the semantic prior with speaker-agnostic and speaker-related semantic information. Using a hierarchical VAE, we connect and learn these representations hierarchically and infer the waveform audio by progressively adapting to the target voice style. For better adaptation and reduced train-inference mismatch, we introduce bidirectional normalizing flow Transformer networks using AdaLN-Zero. Because no text-speech paired dataset is required, we can simply scale up the dataset to train the hierarchical speech synthesizer for zero-shot voice cloning.
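To illustrate how these components might fit together, here is a highly simplified structural sketch; every class name, layer choice, and dimension below is a placeholder assumption and does not reflect the released architecture.

```python
import torch
import torch.nn as nn

class HierarchicalSynthesizerSketch(nn.Module):
    """Toy sketch of the described data flow, not the actual HierSpeech++ model."""

    def __init__(self, ssl_dim=1024, acoustic_dim=513, latent_dim=192, style_dim=256):
        super().__init__()
        # Dual-audio acoustic encoder: acoustic posterior from two audio views.
        self.acoustic_enc = nn.Conv1d(2 * acoustic_dim, 2 * latent_dim, 1)
        # Source-filter multi-path semantic encoder: SSL features + F0 -> semantic prior.
        self.semantic_enc = nn.Conv1d(ssl_dim + 1, 2 * latent_dim, 1)
        # Stand-in for the bidirectional normalizing flow that bridges the levels.
        self.flow = nn.Conv1d(latent_dim, latent_dim, 1, bias=False)
        # Stand-in for the BigVGAN-based hierarchical adaptive generator.
        self.generator = nn.Conv1d(latent_dim + style_dim, 1, 1)

    def forward(self, acoustic_feats, ssl_feats, f0, style):
        # 1) Acoustic posterior q(z | audio): mean / log-std, then reparameterize.
        m_q, logs_q = self.acoustic_enc(acoustic_feats).chunk(2, dim=1)
        z_a = m_q + torch.randn_like(m_q) * torch.exp(logs_q)
        # 2) Semantic prior conditioned on speaker-agnostic content (SSL features, F0).
        m_p, logs_p = self.semantic_enc(torch.cat([ssl_feats, f0], dim=1)).chunk(2, dim=1)
        # 3) Flow maps the posterior toward the prior space (style-adaptive in the paper).
        z_p = self.flow(z_a)
        # 4) Generator decodes the latent, conditioned on the voice-prompt style vector.
        style_exp = style.unsqueeze(-1).expand(-1, -1, z_a.size(-1))
        wav = self.generator(torch.cat([z_a, style_exp], dim=1))
        return wav, (z_p, logs_q, m_p, logs_p)
```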


For text-to-speech, we introduce a TTV module that generates a semantic representation and an F0 contour from text sequences. Because the semantic information is extracted with self-supervised learning, we can transfer prosody information independently of voice style. By connecting the TTV and the hierarchical speech synthesizer, we can synthesize high-quality speech from text by hierarchically adapting the prosody and voice style, even in zero-shot scenarios. We also propose a simple speech super-resolution framework that upsamples waveform audio from 16 kHz to 48 kHz. This improves data accessibility for scaling up datasets, since low-resolution speech data such as automatic speech recognition (ASR) corpora can be used to train the speech synthesizer and TTV models.
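Putting the pieces together, a rough sketch of the intended inference flow is shown below; the stage callables are hypothetical stand-ins for the trained TTV, hierarchical synthesizer, and SpeechSR models, not the project's actual API.

```python
def synthesize(text, prosody_prompt_wav, voice_prompt_wav, ttv, hier_synth, speechsr):
    """Hypothetical wiring of the three stages described above."""
    # Stage 1: text-to-vec predicts a self-supervised semantic representation
    # and an F0 contour from text plus a prosody prompt.
    semantic_vec, f0 = ttv(text, prosody_prompt_wav)

    # Stage 2: the hierarchical speech synthesizer renders a 16 kHz waveform
    # in the style of the voice prompt.
    wav_16k = hier_synth(semantic_vec, f0, voice_prompt_wav)

    # Stage 3: SpeechSR upsamples 16 kHz -> 48 kHz for high-resolution output.
    wav_48k = speechsr(wav_16k)
    return wav_48k
```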


The main contributions of this study are as follows:


• For fast and strong zero-shot speech synthesis, we present HierSpeech++, a novel fully parallel hierarchical speech synthesis framework.


• Prosody and voice style can be transferred and controlled using a hierarchical speech synthesis framework.


• We also present SpeechSR, which can upsample waveform audio from 16 kHz to 48 kHz for high-resolution speech synthesis and data scalability.


• HierSpeech++ achieved the first human-level quality for zero-shot text-to-speech and voice conversion tasks.


• Audio samples and source code are available at https://sh-lee-prml.github.io/HierSpeechpp-demo/


This paper is available on arxiv under CC BY-NC-SA 4.0 DEED license.

Authors:

(1) Sang-Hoon Lee, Fellow, IEEE with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;

(2) Ha-Yeong Choi, Fellow, IEEE with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;

(3) Seung-Bin Kim, Fellow, IEEE with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;

(4) Seong-Whan Lee, Fellow, IEEE with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea, and the corresponding author.