SLMs Outperform Competitors Yet Suffer Rapid Adversarial Jailbreaks

by Phonology Technology, February 6th, 2025

Too Long; Didn't Read

This section discusses the results and insights from evaluating our SpeechVerse SLMs. Our models outperform competitors like SpeechGPT, showing over 40% better safety and 20% improved helpfulness, thanks to effective ASR pre-adaptation. However, sample-specific white-box adversarial attacks reveal a striking vulnerability, achieving about 90% success—even with minimal perturbations—and requiring as few as 20 iterations for some models, highlighting the need for stronger defenses.

Part 1: Abstract & Introduction

Part 2: Background

Part 3: Attacks & Countermeasures

Part 4: Experimental Setup

Part 5: Datasets & Evaluation

Part 6: Attack, Countermeasure Parameters, & Baseline: Random Perturbations

Part 7: Results & Discussion

Part 8: Transfer Attacks & Countermeasures

Part 9: Conclusion, Limitations, & Ethics Statement

Part 10: Appendix: Audio Encoder Pre-training & Evaluation

Part 11: Appendix: Cross-prompt attacks, Training Data Ablations, & Impact of random noise on helpfulness

Part 12: Appendix: Adaptive attacks & Qualitative Examples

5. Results & Discussion

In this section, we first analyze the safety alignment of several SLMs, then present the results of sample-specific and transfer-based attacks, and finally show the effectiveness of the TDNF defense.

5.1 Safety-aligned SLMs

We compare the efficacies of different SLMs trained using the SpeechVerse architecture against the public SLM SpeechGPT (Zhang et al., 2023) in Table 2. In addition, we compare the performance of text-only pre-trained LLMs out of the box, as well as fine-tuned Flan-T5-XL (3B) and Mistral-7B LLMs safety-aligned with the textual form of the Spoken QA data.
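The evaluation protocol itself is described in Part 5; purely as an illustration, the snippet below sketches one way a Table 2-style comparison could be tabulated. The `generate_response` and `judge` functions are hypothetical placeholders, not the paper's actual evaluation code.

```python
from statistics import mean

# Hypothetical sketch of a Table 2-style benchmark loop; `generate_response`
# and `judge` are placeholders, not the paper's actual harness.
def evaluate_model(model_name, eval_prompts, generate_response, judge):
    """Score a model's responses to spoken prompts for safety and helpfulness.

    `judge(response)` is assumed to return a dict with a boolean `safe`
    flag and a numeric `helpfulness` rating (e.g., from an LLM judge).
    """
    safe_flags, helpfulness_scores = [], []
    for audio_prompt in eval_prompts:
        response = generate_response(model_name, audio_prompt)
        verdict = judge(response)
        safe_flags.append(verdict["safe"])
        helpfulness_scores.append(verdict["helpfulness"])
    return {
        "safety_rate": mean(safe_flags),          # fraction judged safe
        "helpfulness": mean(helpfulness_scores),  # mean judged helpfulness
    }
```

A Table 2-style comparison would then collect `evaluate_model(...)` outputs for each SLM and each text-only baseline over the same spoken-QA evaluation set.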


Our results demonstrate the superior performance of our SLMs compared to public models, closely matching the performance of the best text-only LLMs on safety and relevance. As hypothesized, SLMs pre-adapted with ASR match or outperform their counterparts on all metrics, demonstrating better recognition of the speech modality. We observe that the helpfulness of the SLMs is limited by the abilities of the pre-trained LLM, even though they are tuned with general instruction data during cross-modal adaptation. Furthermore, with our training mechanisms, we can retain almost all of the helpfulness of the pre-trained LLMs while additionally infusing spoken-instruction understanding and safety alignment into the SLMs.[12] Compared to SpeechGPT (Zhang et al., 2023), our best model shows more than 40% improvement in safety and 20% in helpfulness, demonstrating better recognition quality and speech-instruction-following capability. Although other public models such as LLASM (Shu et al., 2023) and Pengi (Deshmukh et al., 2023) can also perceive speech instructions, we found them to be insufficiently safety-aligned and therefore excluded them from our benchmarking.

5.2 Sample-specific white-box attacks

In Table 3, we present the results of random noise perturbations at two SNR values, along with sample-specific adversarial attacks, on four in-house trained SLM models. Out of the 360 audios considered, we report results only on the samples that were originally found to be safe for each model (as reported in Table 2). Random perturbations demonstrate limited effectiveness in jailbreaking most models, with attack success rates below 8% for all models. In contrast, adversarial perturbations achieve a high success rate (∼90%) in all cases at ∼60dB SNR. This shows that carefully crafted perturbations, even at small magnitudes, can cause the models to produce unsafe responses[13]. Therefore, more sophisticated speech-specific attacks that have been proposed to produce imperceptible perturbations (Schönherr et al., 2018) are not necessary.
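The attack formulation appears earlier in the paper (Part 3); the following is only a minimal sketch of an SNR-constrained, PGD-style white-box perturbation loop consistent with the numbers above, assuming PyTorch tensors and a differentiable `model_loss` (e.g., the negative log-likelihood of an unsafe target response). All names and hyperparameters are illustrative, not the authors' implementation.

```python
import torch

def snr_constrained_attack(waveform, model_loss, target_snr_db=60.0,
                           step_size=1e-4, max_iters=500):
    """PGD-style sketch: perturb `waveform` to minimize `model_loss`
    while keeping the perturbation power within a target SNR budget.

    `model_loss(adv_waveform)` is a placeholder for the white-box loss on
    the SLM; it must be differentiable with respect to its input.
    """
    signal_power = waveform.pow(2).mean()
    # SNR_dB = 10 * log10(P_signal / P_noise)  =>  P_noise_max = P_signal / 10^(SNR_dB / 10)
    max_noise_power = signal_power / (10.0 ** (target_snr_db / 10.0))

    delta = torch.zeros_like(waveform, requires_grad=True)
    for _ in range(max_iters):
        loss = model_loss(waveform + delta)
        loss.backward()
        with torch.no_grad():
            # Signed gradient-descent step on the perturbation (as in PGD).
            delta -= step_size * delta.grad.sign()
            # Project back onto the SNR budget by rescaling if the noise is too loud.
            noise_power = delta.pow(2).mean()
            if noise_power > max_noise_power:
                delta *= (max_noise_power / noise_power).sqrt()
        delta.grad.zero_()
    return (waveform + delta).detach()
```

In practice such a loop would also check after each iteration whether the model's response has become unsafe and stop early, which is what makes the iteration counts in Figure 4 meaningful.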


In Figure 4, we plot the cumulative proportion of successful attacks as a function of the number of attack iterations. Different models exhibit varying levels of susceptibility to adversarial jailbreaking attacks. For example, 80% of the successful attacks require fewer than 20 iterations for the Mistral-based models, whereas attacks on the Flan-T5-based models require up to 40 iterations.
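As an illustration of how a Figure 4-style curve can be computed, the sketch below converts hypothetical per-sample "iteration of first success" counts into a cumulative proportion of successful attacks; the inputs are placeholder bookkeeping, not the paper's data.

```python
import numpy as np

def cumulative_success_curve(first_success_iters, max_iters=500):
    """Cumulative proportion of successful attacks completed within each
    iteration budget, given the iteration at which each successful attack
    first produced an unsafe response (a Figure 4-style curve).
    """
    iters = np.asarray(first_success_iters)
    budgets = np.arange(1, max_iters + 1)
    proportions = np.array([(iters <= b).mean() for b in budgets])
    return budgets, proportions

# Illustrative check: if 80% of successful attacks need fewer than 20
# iterations, the returned curve crosses 0.8 before a budget of 20.
```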




Authors:

(1) Raghuveer Peri, AWS AI Labs, Amazon, Equal Contribution ([email protected]);

(2) Sai Muralidhar Jayanthi, AWS AI Labs, Amazon, Equal Contribution;

(3) Srikanth Ronanki, AWS AI Labs, Amazon;

(4) Anshu Bhatia, AWS AI Labs, Amazon;

(5) Karel Mundnich, AWS AI Labs, Amazon;

(6) Saket Dingliwal, AWS AI Labs, Amazon;

(7) Nilaksh Das, AWS AI Labs, Amazon;

(8) Zejiang Hou, AWS AI Labs, Amazon;

(9) Goeric Huybrechts, AWS AI Labs, Amazon;

(10) Srikanth Vishnubhotla, AWS AI Labs, Amazon;

(11) Daniel Garcia-Romero, AWS AI Labs, Amazon;

(12) Sundararajan Srinivasan, AWS AI Labs, Amazon;

(13) Kyu J Han, AWS AI Labs, Amazon;

(14) Katrin Kirchhoff, AWS AI Labs, Amazon.


This paper is available on arXiv under the CC BY 4.0 DEED license.

[12] We study the effect of excluding general instruction tuning data for SLM training in Appendix A.4.