
SpeechVerse Unites Audio Encoder and LLM for Superior Spoken QA

by Phonology Technology | February 6th, 2025

Too Long; Didn't Read

This section details the experimental setup for SpeechVerse, our unified speech language model. It describes using a 24-layer Conformer audio encoder paired with two LLMs—Flan-T5-XL and Mistral-7B variants—for spoken QA. The setup covers both two-stage (ASR pre-adaptation followed by cross-modal instruction fine-tuning with LoRA adapters) and single-stage training paradigms, all implemented with PyTorch and PyTorch-Lightning.

Part 1: Abstract & Introduction

Part 2: Background

Part 3: Attacks & Countermeasures

Part 4: Experimental Setup

Part 5: Datasets & Evaluation

Part 6: Attack, Countermeasure Parameters, & Baseline: Random Perturbations

Part 7: Results & Discussion

Part 8: Transfer Attacks & Countermeasures

Part 9: Conclusion, Limitations, & Ethics Statement

Part 10: Appendix: Audio Encoder Pre-training & Evaluation

Part 11: Appendix: Cross-prompt attacks, Training Data Ablations, & Impact of random noise on helpfulness

Part 12: Appendix: Adaptive attacks & Qualitative Examples

4. EXPERIMENTAL SETUP

4.1 Models

We illustrate our unified SLM architecture, SpeechVerse, in Figure 3. It consists of two main components: an audio encoder and a large language model.


Audio Encoder We use a pre-trained 24-layer Conformer-based audio encoder to extract representations from the input speech[2].


Large Language Model We employ two types of publicly available pre-trained LLMs in our study: (1) the encoder-decoder Flan-T5-XL (Chung et al., 2022) with 3 billion parameters, and (2) the decoder-only Mistral-7B-Instruct (Jiang et al., 2023) with 7 billion parameters. While both models can follow instructions, only the latter matches or exceeds the performance of a 13-billion-parameter model like Llama-2 (Touvron et al., 2023). Notably, neither of the two LLMs is explicitly trained to be safe or harmless, so we safety-align their SLM counterparts and refer to them as S-FlanT5 and S-Mistral in this work. We also fine-tune Mistral explicitly on safety-aligned textual instruction data and refer to its SLM counterpart as S-Mistral-FT.
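
For concreteness, here is a minimal sketch of initializing the two backbone LLMs with Hugging Face Transformers; the exact checkpoint identifiers are our assumption and are not specified in the paper.

```python
# Minimal sketch: load the two backbone LLMs via Hugging Face Transformers.
# The exact checkpoint names below are assumptions for illustration.
from transformers import AutoModelForSeq2SeqLM, AutoModelForCausalLM, AutoTokenizer

# Encoder-decoder backbone (~3B parameters).
flan_t5 = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-xl")
flan_t5_tok = AutoTokenizer.from_pretrained("google/flan-t5-xl")

# Decoder-only backbone (~7B parameters); instruct checkpoint revision assumed.
mistral = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")
mistral_tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")

# The pre-trained backbones stay frozen during SLM training (see Section 4.2).
for backbone in (flan_t5, mistral):
    for p in backbone.parameters():
        p.requires_grad = False
```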


We note that popular LLMs such as ChatGPT[3] and Claude 2.1[4] do not support audio inputs off the shelf. Enabling these text-only LLMs to comprehend audio requires further tuning on paired audio-text data, which in turn requires access to the model's gradients. We therefore rely on the open-source Flan and Mistral models in this work. Nevertheless, we also showcase jailbreaking attacks in both black-box (SpeechGPT) and white-box settings.

4.2 Training

To enable SLMs to better comprehend the input audio, a two-stage training paradigm is commonly adopted: modality pre-adaptation followed by cross-modal instruction fine-tuning (Zhang et al., 2023; Shu et al., 2023). In this work, we study SLMs trained with the two-stage paradigm as well as with a single-stage paradigm that directly performs cross-modal instruction fine-tuning for the Spoken QA application. We utilize Automatic Speech Recognition (ASR) as the modality pre-adaptation task. To the best of our knowledge, ours is the first study comparing the efficacies of the two paradigms.
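
The two paradigms can be summarized by which parameter groups are trained on which task. The sketch below is purely illustrative; the group names are hypothetical labels for the trainable pieces described later in this section.

```python
# Illustrative summary of the two training schedules (names are hypothetical).
TWO_STAGE = [
    # Stage 1: modality pre-adaptation on ASR.
    {"task": "asr", "trainable": ["conv_downsampler", "encoder_layer_weights"]},
    # Stage 2: cross-modal instruction fine-tuning for Spoken QA, adding LoRA.
    {"task": "spoken_qa",
     "trainable": ["conv_downsampler", "encoder_layer_weights", "lora_adapters"]},
]

SINGLE_STAGE = [
    # All trainable parameters tuned directly on Spoken QA from random init.
    {"task": "spoken_qa",
     "trainable": ["conv_downsampler", "encoder_layer_weights", "lora_adapters"]},
]
```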



We reduce the computational cost associated with the long sequence length of the audio modality by applying 1-D convolutional layers to the outputs of the audio encoder (see Figure 3). In the two-stage training paradigm, the first stage trains the convolutional layers and the audio encoder layer-combination weights on ASR data, and the second stage further tunes them together with randomly initialized LoRA adapters (Hu et al., 2021; Mangrulkar et al., 2022) on the LLM. In the single-stage training paradigm, we simply tune all of the aforementioned trainable parameters from random initialization directly for the Spoken QA application. At all times, the pre-trained audio encoder and LLM parameters are kept frozen. Overall, the total number of trainable parameters is approximately 27 million with Flan-T5-XL (3B) and 66 million with Mistral-7B as the backbone LLM.
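
A sketch of the trainable components is shown below: the 1-D convolutional downsampler over the audio encoder outputs, and LoRA adapters attached to the frozen LLM via the PEFT library. The feature dimensions, padding, and LoRA target modules are assumptions; the kernel sizes, strides, LoRA rank/alpha, and dropout values follow the implementation details given later in this section.

```python
# Sketch of the trainable components; dimensions and LoRA target modules are
# assumptions, while kernel sizes, strides, rank/alpha, and dropouts follow the paper.
import torch
import torch.nn as nn
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

class AudioDownsampler(nn.Module):
    """Two 1-D convolutions (kernel 3, strides 2 and 1) over encoder outputs."""

    def __init__(self, enc_dim: int, llm_dim: int, dropout: float = 0.2):
        super().__init__()
        self.conv1 = nn.Conv1d(enc_dim, llm_dim, kernel_size=3, stride=2, padding=1)
        self.conv2 = nn.Conv1d(llm_dim, llm_dim, kernel_size=3, stride=1, padding=1)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, enc_dim) -> (batch, ~time / 2, llm_dim)
        x = x.transpose(1, 2)
        x = self.dropout(torch.relu(self.conv1(x)))
        x = self.dropout(torch.relu(self.conv2(x)))
        return x.transpose(1, 2)

# Attach LoRA adapters to the frozen LLM backbone (target modules assumed).
backbone = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")
lora_cfg = LoraConfig(r=16, lora_alpha=10, lora_dropout=0.1,
                      target_modules=["q_proj", "v_proj"])
llm = get_peft_model(backbone, lora_cfg)  # base LLM weights stay frozen

def count_trainable(module: nn.Module) -> int:
    return sum(p.numel() for p in module.parameters() if p.requires_grad)

downsampler = AudioDownsampler(enc_dim=512, llm_dim=4096)  # dims assumed
print(count_trainable(downsampler) + count_trainable(llm))  # total trainable params
```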


Although the primary focus of this work is to understand the safety robustness of SLMs, fine-tuning SLMs on safety-aligned instruction data alone can lead to catastrophic forgetting of the LLM's pre-trained capabilities, in particular hurting the SLM's helpfulness on harmless instructions (Zhao et al., 2023a). We address this problem by adopting the experience replay technique (Wu et al., 2024) and incorporating general instruction-tuning data during cross-modal instruction fine-tuning.
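
In practice, experience replay here amounts to mixing general instruction-tuning examples back into the safety-aligned fine-tuning data. The sketch below is a hypothetical illustration; the paper does not specify the mixing ratio, which is an assumption.

```python
# Hypothetical sketch of experience replay: mix general instruction-tuning data
# into the safety-aligned cross-modal fine-tuning set. The ratio is an assumption.
import random

def build_replay_mixture(safety_examples, general_examples,
                         replay_ratio=0.3, seed=0):
    """Return a shuffled training list containing a fraction of general data."""
    rng = random.Random(seed)
    n_replay = min(int(len(safety_examples) * replay_ratio), len(general_examples))
    mixture = list(safety_examples) + rng.sample(list(general_examples), k=n_replay)
    rng.shuffle(mixture)
    return mixture
```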


Implementation Details All the models in this work are trained using the PyTorch (Paszke et al., 2019) and PyTorch-Lightning (William Falcon, 2019) libraries. We use Huggingface Transformers (Wolf et al., 2020) to initialize the pre-trained LLMs. We stack two 1-D convolutional layers, each with a kernel length of 3 and with strides of 2 and 1 respectively. During training, we use a batch size of 512 and the AdamW (Loshchilov and Hutter, 2019) optimizer with a learning rate of 5e-3. We employ LoRA adapters with rank 16 and alpha 10. We apply Dropout (Srivastava et al., 2014) of 0.2 and 0.1 to the convolutional layers and LoRA adapters respectively.
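
To make these hyperparameters concrete, here is a minimal PyTorch-Lightning training skeleton using the stated optimizer and learning rate; the module structure and loss interface are assumptions for illustration, not the authors' implementation.

```python
# Minimal PyTorch-Lightning skeleton with the stated optimizer settings
# (AdamW, lr 5e-3). The module and loss interface are assumptions.
import torch
import pytorch_lightning as pl

class SpeechVerseTrainer(pl.LightningModule):
    def __init__(self, slm):
        super().__init__()
        self.slm = slm  # downsampler + frozen audio encoder + LLM with LoRA

    def training_step(self, batch, batch_idx):
        loss = self.slm(**batch).loss  # assumes a HF-style output with .loss
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        trainable = [p for p in self.slm.parameters() if p.requires_grad]
        return torch.optim.AdamW(trainable, lr=5e-3)

# A DataLoader with an (effective) batch size of 512 feeds pl.Trainer(...).fit(...).
```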



Authors:

(1) Raghuveer Peri, AWS AI Labs, Amazon (equal contribution) ([email protected]);

(2) Sai Muralidhar Jayanthi, AWS AI Labs, Amazon (equal contribution);

(3) Srikanth Ronanki, AWS AI Labs, Amazon;

(4) Anshu Bhatia, AWS AI Labs, Amazon;

(5) Karel Mundnich, AWS AI Labs, Amazon;

(6) Saket Dingliwal, AWS AI Labs, Amazon;

(7) Nilaksh Das, AWS AI Labs, Amazon;

(8) Zejiang Hou, AWS AI Labs, Amazon;

(9) Goeric Huybrechts, AWS AI Labs, Amazon;

(10) Srikanth Vishnubhotla, AWS AI Labs, Amazon;

(11) Daniel Garcia-Romero, AWS AI Labs, Amazon;

(12) Sundararajan Srinivasan, AWS AI Labs, Amazon;

(13) Kyu J Han, AWS AI Labs, Amazon;

(14) Katrin Kirchhoff, AWS AI Labs, Amazon.


This paper is available on arXiv under a CC BY 4.0 DEED license.

[2] We refer the reader to Appendix A.1 for more details on the audio encoder pre-training.


[3] https://openai.com/index/chatgpt


[4] https://www.anthropic.com/news/claude-2-1