paint-brush
Adversarial Settings and Random Noise Reveal Speech LLM Vulnerabilitiesby@phonology

Adversarial Settings and Random Noise Reveal Speech LLM Vulnerabilities

by Phonology TechnologyFebruary 6th, 2025
Read on Terminal Reader
Read this story w/o Javascript
tldt arrow

Too Long; Didn't Read

This section details the attack and countermeasure settings for SpeechVerse. Using a step size of 0.00001 and up to 100 iterations with early-stopping on unsafe responses, adversarial attacks are run with cross-entropy loss. Countermeasures use time-domain noise flooding (TDNF) at SNRs of 24, 30, 48, and 60 dB. Additionally, random WGN perturbations are applied as a baseline, repeating the process three times to assess model robustness.
featured image - Adversarial Settings and Random Noise Reveal Speech LLM Vulnerabilities
Phonology Technology HackerNoon profile picture
0-item

Part 1: Abstract & Introduction

Part 2: Background

Part 3: Attacks & Countermeasures

Part 4: Experimental Setup

Part 5: Datasets & Evaluation

Part 6: Attack, Countermeasure Parameters, & Baseline: Random Perturbations

Part 7: Results & Discussion

Part 8: Transfer Attacks & Countermeasures

Part 9: Conclusion, Limitations, & Ethics Statement

Part 10: Appendix: Audio Encoder Pre-training & Evaluation

Part 11: Appendix: Cross-prompt attacks, Training Data Ablations, & Impact of random noise on helpfulness

Part 12: Appendix: Adaptive attacks & Qualitative Examples

4.5 Attack and countermeasure parameters

We use a step size of α = 0.00001 (Eq. 1), as we empirically found this setting leads to stable attack convergence. We experiment with only unconstrained attacks (without the Πx,ϵ operation in Equation 1) as we observed that even without them, the attacks were successful at high SNRs (rendering any constraints ineffective). We run the attack for a maximum of T=100 iterations using cross-entropy loss objective. We employ early-stopping at the first occurrence of an unsafe and relevant response, further using a human preference model[11] to filter out gibberish responses produced by the model during attacks. For the countermeasures, we


Table 1: Examples of model responses to both harmful and benign questions with corresponding safety, relevance and helpfulness labels.


experiment with several settings of TDNF by using four different SNR values: 24, 30, 48 and 60 dB.

4.6 Baseline: Random perturbations

We apply random perturbations at varying SPRs to understand if non-adversarial perturbations break the safety alignment of the LLMs. This serves as a simple baseline to characterize the robustness of the safety alignment of the models we consider. In particular, we apply WGN at 2 different SNRs for each of the audio files. We repeat this process 3 times and consider an audio jailbroken if any 1 of the 3 responses is unsafe and relevant.


Authors:

(1) Raghuveer Peri, AWS AI Labs, Amazon and with Equal Contributions ([email protected]);

(2) Sai Muralidhar Jayanthi, AWS AI Labs, Amazon and with Equal Contributions;

(3) Srikanth Ronanki, AWS AI Labs, Amazon;

(4) Anshu Bhatia, AWS AI Labs, Amazon;

(5) Karel Mundnich, AWS AI Labs, Amazon;

(6) Saket Dingliwal, AWS AI Labs, Amazon;

(7) Nilaksh Das, AWS AI Labs, Amazon;

(8) Zejiang Hou, AWS AI Labs, Amazon;

(9) Goeric Huybrechts, AWS AI Labs, Amazon;

(10) Srikanth Vishnubhotla, AWS AI Labs, Amazon;

(11) Daniel Garcia-Romero, AWS AI Labs, Amazon;

(12) Sundararajan Srinivasan, AWS AI Labs, Amazon;

(13) Kyu J Han, AWS AI Labs, Amazon;

(14) Katrin Kirchhoff, AWS AI Labs, Amazon.


This paper is available on arxiv under CC BY 4.0 DEED license.

[11] https://huggingface.co/OpenAssistant/ reward-model-electra-large-discriminator