Style Prompt Replication: A Simple Trick That Helped Us In Our Journey

We found a simple trick for transferring style even from a one-second speech prompt: style prompt replication (SPR). Much like DNA replication, we copy the same prompt sequence, as shown in Fig. 7. The prompt, replicated n times, is then fed to the style encoder to extract the style representation. The motivation is that the style encoder usually encounters long prompts of over 3 s, so speech synthesized from a short prompt may be generated incorrectly. With SPR, the replicated prompt looks like a long prompt to the style encoder, so we can synthesize speech even from a 1 s speech prompt.
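The replication step can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: it assumes the prompt is a mel-spectrogram held in a NumPy array with time on the last axis, and the names `replicate_style_prompt` and `replications_needed`, the 80 mel bins, and the frame counts are all hypothetical. Choosing n so that the replicated prompt reaches a minimum length is likewise an assumption; the text only states that the prompt is replicated n times.

```python
import math

import numpy as np


def replicate_style_prompt(prompt: np.ndarray, n: int) -> np.ndarray:
    """Repeat a prompt n times along its last (time) axis."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Tile only the time axis; leave every other axis unchanged.
    reps = (1,) * (prompt.ndim - 1) + (n,)
    return np.tile(prompt, reps)


def replications_needed(prompt_frames: int, min_frames: int) -> int:
    """Smallest n such that n * prompt_frames >= min_frames (assumed heuristic)."""
    if prompt_frames < 1:
        raise ValueError("prompt must contain at least one frame")
    return max(1, math.ceil(min_frames / prompt_frames))


# Hypothetical short prompt: 80 mel bins, 100 frames.
rng = np.random.default_rng(0)
mel = rng.standard_normal((80, 100))

# Hypothetical target: make the prompt at least 300 frames long.
n = replications_needed(mel.shape[-1], min_frames=300)
replicated = replicate_style_prompt(mel, n)

# The replicated prompt would then be passed to the style encoder
# in place of the original short prompt.
print(n, replicated.shape)
```

Because the copies are exact, the replicated prompt carries no new style information; it only changes the sequence length that the style encoder sees.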

This paper is available on arXiv under the CC BY-NC-SA 4.0 DEED license.

Authors:

(1) Sang-Hoon Lee, Fellow, IEEE with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;

(2) Ha-Yeong Choi, Fellow, IEEE with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;

(3) Seung-Bin Kim, Fellow, IEEE with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;

(4) Seong-Whan Lee, Fellow, IEEE with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea (corresponding author).