Why Our Tiny Training Set Beat Giants in Cross-Lingual Speech Retrieval

Too Long; Didn't Read

LLMs take the lead in speech-text retrieval, outperforming larger models with smarter training and cross-lingual adaptability.


Authors:

(1) Frank Palma Gomez, Boston University; work done by him and Ramon during their internships at Google Research and Google DeepMind, respectively;

(2) Ramon Sanabria, The University of Edinburgh; work done by him and Frank during their internships at Google Research and Google DeepMind, respectively;

(3) Yun-hsuan Sung, Google Research;

(4) Daniel Cer, Google Research;

(5) Siddharth Dalmia, Google DeepMind and Equal Advising Contributions;

(6) Gustavo Hernandez Abrego, Google Research and Equal Advising Contributions.

Abstract and 1 Introduction

2 Method

3 Data and Tasks

4 Model

5 Experiments

6 Related Work

7 Conclusion

8 Acknowledgements and References

A Appendix

5 Experiments

We train our DE model to perform S2T, where the task is to retrieve the corresponding transcription given a speech sample. We train on the 21 languages from CoVoST-2 and evaluate our model using the S2T portion of FLEURS in 102 languages.
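The dual-encoder (DE) training described above can be illustrated with a minimal in-batch contrastive loss sketch. This is an assumption-laden illustration, not the paper's actual implementation: the function name `in_batch_contrastive_loss`, the temperature value, and the use of NumPy embeddings are all hypothetical stand-ins for the real speech and text encoder outputs.

```python
import numpy as np

def in_batch_contrastive_loss(speech_emb, text_emb, temperature=0.05):
    """In-batch softmax contrastive loss: each speech embedding should
    score highest against its own transcription among the batch.
    speech_emb, text_emb: arrays of shape (batch, dim)."""
    # L2-normalize so dot products become cosine similarities
    s = speech_emb / np.linalg.norm(speech_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = s @ t.T / temperature  # (batch, batch) similarity matrix
    # log-softmax over each row; the diagonal holds the positive pairs
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))
```

Every other transcription in the batch serves as a negative for a given speech sample, which is what lets retrieval quality emerge without explicit negative mining.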

5.1 Speech-to-Text Retrieval

Table 1 shows the average R@1 and WER for S2T for 102 languages from FLEURS. We compare against the mSLAM DE model from Conneau et al. (2023), a model trained on 426k hours of S2T data in 51 languages and fine-tuned on FLEURS training data. Our model significantly outperforms the mSLAM DE baseline in R@1 and WER metrics despite being trained with only 1/10 of the data and having been initialized from a text-only LLM. More importantly, our model was only trained on the 21 languages in CoVoST-2 and never fine-tuned on the FLEURS training data.
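The R@1 metric reported here can be computed as follows; this is a generic sketch of recall-at-1 for retrieval, assuming precomputed embedding matrices, not the paper's evaluation code.

```python
import numpy as np

def recall_at_1(speech_emb, text_emb):
    """Fraction of speech samples whose nearest text embedding
    (by cosine similarity) is the matching transcription,
    where row i of each matrix is a matched speech-text pair."""
    s = speech_emb / np.linalg.norm(speech_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    nearest = np.argmax(s @ t.T, axis=1)  # index of top-ranked text per sample
    return float(np.mean(nearest == np.arange(len(s))))
```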


5.1.1 Seen-Unseen Breakdown


In Figure 2 we break down the R@1 scores by languages seen and unseen during training. We find that our model performs best on the 20 languages present in both the training and evaluation data, but still performs well on the remaining 82 unseen languages. We hypothesize this is due to the vast multilingual text data our backbone LLM saw during pre-training.


Figure 2: R@1 transcription retrieval for seen and unseen languages in the training set.


Table 2: FLEURS S2T (R@1) performance broken down by language groups. Bold represents better performance. Numbers in parentheses represent the number of languages within each language group.


5.1.2 Language Group Breakdown


Table 2 shows the R@1 language group breakdown for S2T on FLEURS. We find that although we only trained on 21 languages, our model significantly outperforms mSLAM DE in 13 of the 15 language groups. These results are consistent with the experiments in Hassid et al. (2023) which explore the effect of initializing speech language models from pre-trained LLMs.

5.2 Evaluating on Cross-Modal and Cross-Lingual Tasks

We evaluate on S2TT to gauge the cross-modal and cross-lingual capabilities of our model. We show that we can further improve S2TT by simply training on a mixture of S2T and translation data without using any S2TT training data.


Figure 3: BLEU scores for FLEURS zero-shot S2TT when training on Transcripts or Transcripts + Translations for PaLM 2 DE. Combining transcripts and translation data improves zero-shot S2TT retrieval.


5.2.1 Zero-Shot S2TT


Given the multilingual capabilities of our backbone language model, we explore whether these capabilities transfer after training our model contrastively on the S2T task. We hypothesize that our model should showcase cross-lingual and cross-modal capabilities due to the cross-modal training task and the cross-lingual capabilities of the backbone LLM. We evaluate S2TT in a zero-shot setting to assess our model’s performance retrieving English translations given a speech sample in another language. Using the FLEURS S2TT portion, we evaluate S2TT X → En in 4 languages: German, Polish, French, and Dutch.


Figure 3 shows zero-shot S2TT BLEU performance after training on CoVoST-2 S2T data in 21 languages; we call this setup Transcripts in Figure 3. Our results demonstrate that even when training our model only on speech and transcriptions, we achieve some zero-shot S2TT performance. We also find that S2TT BLEU scores are considerably higher for languages present in the S2T training data. For example, Polish was not in the S2T training data, and therefore its BLEU scores are the lowest.


5.2.2 Improving S2TT with MT Data


To further improve our model’s cross-lingual performance, we add readily available translation data from Schwenk et al. (2019). For each batch, we combine 25% translation and 75% S2T data. Figure 3 shows a comparison of training only on S2T (Transcripts) and combining S2T and translation data (Transcripts + Translations). We find that combining S2T and translation data significantly improves S2TT BLEU scores in all 4 languages without training on any S2TT data. This finding demonstrates that we can improve our model’s cross-lingual performance with highly accessible translation data, without needing scarce and often expensive speech-to-text translation training data.
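The 25%/75% batch mixture described above can be sketched as a simple sampling routine. The mixture ratios come from the paper; the function name, data representation, and uniform sampling strategy are assumptions for illustration only.

```python
import random

def mixed_batch(s2t_pairs, translation_pairs, batch_size=32,
                translation_frac=0.25):
    """Sample one training batch: ~25% text-to-text translation pairs
    and ~75% speech-to-transcription (S2T) pairs, per the paper's mixture.
    Each element of the input lists is one training pair."""
    n_trans = int(batch_size * translation_frac)
    batch = (random.sample(translation_pairs, n_trans)
             + random.sample(s2t_pairs, batch_size - n_trans))
    random.shuffle(batch)  # avoid ordering the two data sources
    return batch
```

Because the translation pairs are text-to-text, the same contrastive objective applies to them unchanged, which is what makes this mixing trick cheap to adopt.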


6 Related Work

The success of pre-trained LLMs has motivated the application of these models in different modalities. Lakhotia et al. (2021) transformed speech into pseudo-text units to introduce the task of generative spoken language modeling. Borsos et al. (2023) introduced a framework to generate audio with long-term consistency. Consequently, Hassid et al. (2023) showed that SpeechLMs benefit from being initialized from pre-trained LLMs, while Rubenstein et al. (2023) demonstrated that pre-trained LLMs can be adapted to various tasks that require text and speech understanding.


On the other hand, several works aim to build joint speech and text representations. Chung et al. (2021) introduced w2v-bert, which combines masked language modeling and contrastive learning to create speech representations. Bapna et al. (2022) jointly pre-train on unsupervised speech and text data. Recently, Duquenne et al. (2023) employed separate speech and text encoders to generate embeddings in over 200 languages. Nevertheless, it remains unclear whether joint speech and text representations can be built from a single encoder. We fill this gap by using pre-trained LLMs to jointly train on speech samples and their transcriptions, showing that our approach is capable of speech-text matching in 102 languages.


This paper is available on arxiv under CC BY 4.0 DEED license.

