Style Prompt Replication: A Simple Trick That Helped Us In Our Journey

Written by fewshot | Published 2024/12/19

TL;DR: We found a simple trick to transfer style even with a one-second speech prompt by introducing style prompt replication (SPR).

Table of Links

Abstract and 1 Introduction

2 Related Work

2.1 Neural Codec Language Models and 2.2 Non-autoregressive Models

2.3 Diffusion Models and 2.4 Zero-shot Voice Cloning

3 HierSpeech++ and 3.1 Speech Representations

3.2 Hierarchical Speech Synthesizer

3.3 Text-to-Vec

3.4 Speech Super-resolution

3.5 Model Architecture

4 Speech Synthesis Tasks

4.1 Voice Conversion and 4.2 Text-to-Speech

4.3 Style Prompt Replication

5 Experiment and Result, and 5.1 Dataset

5.2 Preprocessing and 5.3 Training

5.4 Evaluation Metrics

5.5 Ablation Study

5.6 Zero-shot Voice Conversion

5.7 High-diversity but High-fidelity Speech Synthesis

5.8 Zero-shot Text-to-Speech

5.9 Zero-shot Text-to-Speech with 1s Prompt

5.10 Speech Super-resolution

5.11 Additional Experiments with Other Baselines

6 Limitation and Quick Fix

7 Conclusion, Acknowledgement and References

4.3 Style Prompt Replication

We found a simple trick that transfers style even from a one-second speech prompt: style prompt replication (SPR). Similar to DNA replication, we copy the same prompt sequence, as shown in Fig. 7. The prompt, replicated n times, is fed to the style encoder to extract the style representation. The motivation is that the style encoder usually encounters long prompts of more than 3 s, so speech synthesized from a short prompt may be generated incorrectly. SPR makes a short prompt look like a long one to the style encoder, so we can synthesize speech even with a 1 s speech prompt.
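The replication step can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the frame-list representation of the prompt, and the rule for choosing n (repeat until the prompt covers roughly 3 s, taken from the "over 3 s" remark above) are all assumptions made for the example.

```python
import math
from typing import List, Sequence

# One feature frame of the prompt, e.g. one spectrogram column (assumed layout).
Frame = Sequence[float]


def replicas_needed(prompt_seconds: float, target_seconds: float = 3.0) -> int:
    """Hypothetical rule for n: smallest repeat count that reaches target_seconds."""
    if prompt_seconds <= 0:
        raise ValueError("prompt_seconds must be positive")
    return max(1, math.ceil(target_seconds / prompt_seconds))


def replicate_style_prompt(frames: Sequence[Frame], n: int) -> List[Frame]:
    """Concatenate n copies of the prompt along the time axis."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return list(frames) * n


# Toy 1 s prompt with two frames; a real prompt would have many more.
prompt = [[0.1, 0.2], [0.3, 0.4]]
n = replicas_needed(prompt_seconds=1.0)
long_prompt = replicate_style_prompt(prompt, n)
```

The replicated sequence `long_prompt` is what would be passed to the style encoder in place of the original short prompt. Prompts that already exceed the target length get n = 1 and pass through unchanged.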

This paper is available on arXiv under the CC BY-NC-SA 4.0 DEED license.

Authors:

(1) Sang-Hoon Lee, Fellow, IEEE with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;

(2) Ha-Yeong Choi, Fellow, IEEE with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;

(3) Seung-Bin Kim, Fellow, IEEE with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;

(4) Seong-Whan Lee, Fellow, IEEE with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea (corresponding author).


Published by HackerNoon on 2024/12/19