
SEAMLESSEXPRESSIVELM Transforms Speech Translation by Preserving Semantics and Speaker Vocal Style

by Speech Synthesis, January 29th, 2025

Too Long; Didn't Read

SEAMLESSEXPRESSIVELM introduces a novel approach to expressive speech-to-speech translation, combining semantic accuracy with speaker vocal style preservation. Unlike earlier models requiring style-aligned data, this method leverages semantic tokens and chain-of-thought prompting, outperforming traditional cascaded models in efficiency and quality.

Authors:

(1) Hongyu Gong, Meta AI ([email protected]);

(2) Bandhav Veluri, Meta AI ([email protected]).

Abstract and 1. Introduction

  2. Related Work
  3. Model
  4. Experiments
  5. Ablation Study
  6. Conclusion, Limitations, and Risks

Abstract

Expressive speech-to-speech translation (S2ST) is a key research topic in seamless communication, focusing on the preservation of semantics and speaker vocal style in translated speech. Early works synthesized speaker-style-aligned speech in order to directly learn the mapping from source speech to the target speech spectrogram. Without relying on style-aligned data, recent studies leverage advances in language modeling (LM) and build cascaded LMs over semantic and acoustic tokens. This work proposes SEAMLESSEXPRESSIVELM, a single speech language model for expressive S2ST. We decompose the complex source-to-target speech mapping into intermediate generation steps with chain-of-thought prompting. The model is first guided to translate the target semantic content and then to transfer the speaker style to multi-stream acoustic units. Evaluated on Spanish-to-English and Hungarian-to-English translation, SEAMLESSEXPRESSIVELM outperforms cascaded LMs in both semantic quality and style transfer, while also achieving better parameter efficiency.

1. Introduction

Translation is crucial for breaking down language barriers between diverse linguistic backgrounds, and speech-to-speech translation has attracted great interest from the research community for fostering cross-lingual spoken communication. Early approaches use an encoder-decoder architecture and focus on semantic preservation when translating from source to target speech (Jia et al., 2019, 2022; Lee et al., 2022). The need for seamless communication has raised awareness of preserving acoustic information, as speech carries richer information than semantics alone. For example, the way speech is uttered conveys the speaker's emotions. This motivates research on S2ST that preserves both semantics and speaker vocal style (Barrault et al., 2023).


Language models (LMs) with in-context learning capabilities demonstrate powerful language understanding and generation. Recent progress in discrete speech representations has bridged the gap between continuous speech waveforms and discrete text tokens. Semantic tokens such as HuBERT (Hsu et al., 2021) and w2v-BERT units (Rubenstein et al., 2023) are learned to capture semantic information from speech, while acoustic tokens such as EnCodec (Défossez et al., 2022) and SoundStream (Zeghidour et al., 2022) encode fine-grained acoustic information with multiple codebooks. Advances in text language models and speech tokens lay the foundation for speech language modeling, which has been successfully applied to speech generation tasks (Wang et al., 2023).
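To make the two token types concrete, the sketch below shows how a single utterance could be turned into a single-stream semantic sequence and a multi-codebook acoustic sequence. It is a minimal illustration, assuming torchaudio's HuBERT checkpoint, the open-source EnCodec model, and a k-means quantizer (here a placeholder fitted on the fly) stand in for the tokenizers used in the paper.

```python
# Minimal sketch: one utterance -> semantic units (single codebook) and
# acoustic units (multiple codebooks). The k-means quantizer below is a
# placeholder; in practice it would be fitted offline on HuBERT features.
import torch
import torchaudio
from encodec import EncodecModel
from sklearn.cluster import MiniBatchKMeans

wav, sr = torchaudio.load("utterance.wav")                    # [channels, samples]

# Semantic tokens: HuBERT features quantized into a single unit stream.
wav16 = torchaudio.functional.resample(wav, sr, 16_000).mean(0, keepdim=True)
hubert = torchaudio.pipelines.HUBERT_BASE.get_model().eval()
with torch.no_grad():
    layers, _ = hubert.extract_features(wav16)
features = layers[8].squeeze(0)                               # [frames, dim], layer 9
kmeans = MiniBatchKMeans(n_clusters=1000)
semantic_units = kmeans.fit_predict(features.numpy())         # [frames], one codebook

# Acoustic tokens: EnCodec residual-VQ codes, several codebooks per frame.
codec = EncodecModel.encodec_model_24khz().eval()
codec.set_target_bandwidth(6.0)                               # 8 codebooks at 6 kbps
wav24 = torchaudio.functional.resample(wav, sr, 24_000).mean(0, keepdim=True)
with torch.no_grad():
    frames = codec.encode(wav24.unsqueeze(0))                 # list of (codes, scale)
acoustic_units = torch.cat([codes for codes, _ in frames], dim=-1)  # [1, 8, frames]

print(semantic_units.shape, acoustic_units.shape)
```

The contrast in shapes is the point: the semantic stream is one integer per frame, while the acoustic stream stacks several codebook indices per frame.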


Recent expressive S2ST approaches such as VALL-E X (Zhang et al., 2023), PolyVoice (Dong et al., 2023) and AudioPaLM (Rubenstein et al., 2023) build a cascade of multiple speech LMs for speech-to-semantic translation and semantic-to-acoustic generation, respectively. One LM translates source speech into target semantic tokens such as HuBERT units, and the other LMs generate acoustic units conditioned on the semantic units and an acoustic prompt. We note that multi-LM modeling faces the common weaknesses of cascaded models, such as computational inefficiency and error propagation.
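For contrast, here is a rough sketch of that cascaded control flow. The two model objects and their generate methods are hypothetical interfaces invented for illustration; they are not the API of any of the systems cited above.

```python
# Hedged sketch of a two-stage cascade (hypothetical interfaces, illustration only).
from typing import Any, Sequence

def cascaded_expressive_s2st(source_semantic_units: Sequence[int],
                             acoustic_prompt: Sequence[Sequence[int]],
                             translation_lm: Any,
                             acoustic_lm: Any) -> list[list[int]]:
    # Stage 1: a translation LM maps source semantic units (e.g. HuBERT units)
    # to target-language semantic units.
    target_semantic_units = translation_lm.generate(source_semantic_units)

    # Stage 2: separate LM(s) generate multi-codebook acoustic units conditioned
    # on those semantics plus an acoustic prompt carrying the speaker style.
    # Any error from stage 1 propagates here, and each stage adds its own
    # inference cost -- the cascade weaknesses noted above.
    return acoustic_lm.generate(semantic_units=target_semantic_units,
                                prompt=acoustic_prompt)
```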


The design choice of separate LMs for semantic and acoustic generation can be explained by the lack of speaker-style-aligned speech data for model training. An expressive S2ST model essentially learns the mapping from source to target acoustic units, and end-to-end training would require speech pairs aligned in both semantics and style (Jia et al., 2022). Despite research efforts in large-scale data mining to create semantically aligned data (Duquenne et al., 2023), style-aligned speech is generally much scarcer and more costly to collect (Barrault et al., 2023). Training cascaded LMs relaxes the requirement for cross-lingual style alignment, as style preservation can be learned from monolingual speech.



Another factor behind cascading semantic and acoustic generation is the heterogeneity of semantic and acoustic tokens. The heterogeneity comes from the complementary information carried by single-codebook semantic tokens and multi-codebook acoustic tokens, and it is challenging to model the generation of both token types in a single LM.


This work proposes SEAMLESSEXPRESSIVELM, a single speech LM that models both semantic and acoustic generation for end-to-end expressive S2ST. We address the challenge of heterogeneous speech tokens with chain-of-thought (CoT) prompting: the model adopts multi-step generation, moving from semantic translation toward acoustic translation. As for training data, we use only semantically aligned speech, without relying on speaker-style-aligned data. The trick is to randomly crop a segment of the target speech as the acoustic prompt for style preservation.
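As a rough illustration of that recipe, the snippet below assembles one training example in which the model is prompted to produce target semantics first and target acoustics second, with a random crop of the target utterance serving as the style prompt. The field names, segment ordering, and crop length are assumptions made for illustration; the exact prompt format is defined in the paper's Model section.

```python
# Hedged sketch: building a chain-of-thought training sequence from a
# semantically aligned speech pair. Segment names and ordering are assumptions.
import random
from dataclasses import dataclass

@dataclass
class SpeechPair:
    source_semantic: list[int]        # single-stream units of the source utterance
    target_semantic: list[int]        # single-stream units of the target utterance
    target_acoustic: list[list[int]]  # multi-codebook units, one list per frame

def build_cot_example(pair: SpeechPair, prompt_frames: int = 75) -> dict:
    # Style supervision without style-aligned data: a random crop of the *target*
    # acoustic stream acts as the speaker-style prompt.
    start = random.randrange(max(1, len(pair.target_acoustic) - prompt_frames))
    style_prompt = pair.target_acoustic[start:start + prompt_frames]

    return {
        "step_0_source_semantics": pair.source_semantic,   # given as the prompt
        "step_1_target_semantics": pair.target_semantic,   # intermediate CoT target
        "step_2_style_prompt":     style_prompt,           # random crop of target
        "step_2_target_acoustics": pair.target_acoustic,   # final generation target
    }
```

Because the style prompt is cropped from the target utterance itself, the model learns style transfer from the same semantically aligned pairs used for translation, with no separately collected style-aligned corpus.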


Table 1 compares existing S2ST models with the proposed approach, and our contributions are summarized below:


• We propose a novel model to support end-to-end S2ST with speaker style preservation. SEAMLESSEXPRESSIVELM demonstrates improved translation quality and parameter efficiency compared with cascaded LMs.


• The model is trained with semantically aligned speech, and does not need aligned speech-text data or speaker-style-aligned speech, which existing approaches require.


• We show that semantic and acoustic tokens can be modeled by the same language model despite their heterogeneity.


• Our ablation study provides insights into how prompt design affects the speech language model on the translation task.


This paper is available on arxiv under CC BY 4.0 DEED license.