Audio Encoder Pre-training and Evaluation Enhance SLM Safety

by Phonology Technology, February 6th, 2025

Too Long; Didn't Read

This appendix details the pre-training and evaluation of our audio encoder for speech language models (SLMs). The encoder is a 24-layer Conformer with 300M parameters pre-trained using the BEST-RQ method on 300K hours of English audio. It uses random-projection quantization with 16 codebooks and a 40% masking ratio. For evaluation, Claude 2.1 automatically annotates model responses for safety, relevance, and helpfulness, achieving F1 scores above 80% when compared to manually labeled data.

Part 1: Abstract & Introduction

Part 2: Background

Part 3: Attacks & Countermeasures

Part 4: Experimental Setup

Part 5: Datasets & Evaluation

Part 6: Attack, Countermeasure Parameters, & Baseline: Random Perturbations

Part 7: Results & Discussion

Part 8: Transfer Attacks & Countermeasures

Part 9: Conclusion, Limitations, & Ethics Statement

Part 10: Appendix: Audio Encoder Pre-training & Evaluation

Part 11: Appendix: Cross-prompt attacks, Training Data Ablations, & Impact of random noise on helpfulness

Part 12: Appendix: Adaptive attacks & Qualitative Examples


APPENDIX

A.1 Audio Encoder Pre-training

Our audio encoder is a 24-layer Conformer with a feature dimension of 768 and 8 attention heads, for a total of 300M parameters. We adopt the BEST-RQ method (Chiu et al., 2022), which pre-trains the model to predict masked speech signals using labels generated by a random-projection quantizer. The quantizer projects the speech inputs with a randomly initialized matrix and performs a nearest-neighbor lookup in a randomly initialized codebook; neither the projection matrix nor the codebook is updated during pre-training. We build an internal pre-training dataset containing 300K hours of English audio. Pre-training uses a mask span of 10 frames with a total effective masking ratio of about 40%. The learning rate follows the Transformer schedule with a peak value of 0.0005 and a warm-up of 50K steps, and we use the AdamW optimizer with a weight decay of 0.01. Since the encoder reduces the temporal dimension by a factor of 4, the random-projection quantizer stacks every 4 frames before projection. We use 16 individual codebooks, each with a vocabulary size of 8192 and a dimension of 16. The model is pre-trained for 500K steps in total.
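The frozen random-projection quantizer described above can be sketched in a few lines of NumPy. This is an illustrative toy, not the paper's implementation: the input feature size `FEAT` is an assumption (80-dim log-mel features are common but not stated here), while the frame stacking (4), codebook count (16), vocabulary (8192), and codebook dimension (16) follow the appendix.

```python
import numpy as np

# Hyperparameters from the appendix; FEAT is an assumed input feature size.
STACK, NUM_BOOKS, VOCAB, DIM = 4, 16, 8192, 16
FEAT = 80

rng = np.random.default_rng(0)
# Frozen random projection and codebooks -- never updated during pre-training.
projection = rng.standard_normal((STACK * FEAT, DIM))
codebooks = rng.standard_normal((NUM_BOOKS, VOCAB, DIM))

def quantize(frames: np.ndarray) -> np.ndarray:
    """Map speech frames (T, FEAT) to target labels (T // STACK, NUM_BOOKS).

    Stacks every 4 frames (matching the encoder's 4x temporal reduction),
    projects with the fixed random matrix, then takes the nearest-neighbor
    index in each frozen codebook as the pre-training label.
    """
    T = frames.shape[0] // STACK * STACK
    stacked = frames[:T].reshape(-1, STACK * FEAT)   # (N, STACK*FEAT)
    projected = stacked @ projection                 # (N, DIM)
    labels = np.empty((projected.shape[0], NUM_BOOKS), dtype=np.int64)
    for b in range(NUM_BOOKS):
        # Squared Euclidean distance to every entry of codebook b.
        d = ((projected[:, None, :] - codebooks[b][None]) ** 2).sum(-1)
        labels[:, b] = d.argmin(axis=1)
    return labels

labels = quantize(rng.standard_normal((100, FEAT)))
print(labels.shape)  # (25, 16): one label per codebook per stacked frame
```

Because the projection and codebooks stay fixed, the quantizer acts purely as a cheap, stable labeling function; only the Conformer encoder learns during pre-training.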

A.2 Evaluation

We used Claude 2.1 to automatically annotate model responses for safety and relevance. To determine the accuracy of the generated labels, the first four authors manually annotated 100 question-response pairs (obtained from our internal models) for safety and relevance, and used this as a “ground-truth” labelled set.


We use the following prompt template with Claude 2.1 to obtain safety annotations for SLMs.



Similarly, we use the following prompt template to obtain relevance annotations.



Table 7: Effectiveness of cross-prompt attacks. Metrics are averaged over the set of questions originally found to be safe for each model.


We experimented with several prompts, separately for the safety and relevance annotation tasks, using in-context examples, and chose the prompts that achieved reasonable annotation performance (F1 score above 80%) against the aforementioned ground-truth labels. We follow a similar strategy to obtain the helpfulness annotations.
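The prompt-selection check above reduces to computing F1 between the automatic labels and the 100 manual ones. A minimal sketch, assuming binary "safe"/"unsafe" string labels (the exact label format is not given in the text):

```python
def f1_score(truth, pred, positive="safe"):
    """F1 of predicted labels against ground truth for one positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(truth, pred))
    fp = sum(t != positive and p == positive for t, p in zip(truth, pred))
    fn = sum(t == positive and p != positive for t, p in zip(truth, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy example: 100 pairs would be used in practice, per the appendix.
truth = ["safe", "safe", "unsafe", "safe", "unsafe"]
pred  = ["safe", "unsafe", "unsafe", "safe", "unsafe"]
print(round(f1_score(truth, pred), 3))  # 0.8
```

A candidate prompt is kept only if this score exceeds 0.8 on the manually labelled set.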


Given these prompt templates to automatically obtain the safety, relevance and helpfulness labels, we define the evaluation metrics as follows:


Safety rate: The proportion of questions for which the generated response is labelled as safe. Higher values indicate better safety alignment of the models.


Relevance rate: The proportion of questions for which the generated response is labelled as relevant to the question. Higher values indicate better alignment between the question and response.


Helpfulness rate: The proportion of questions for which the model produces useful responses. Higher values indicate better utility of the models.


Table 8: Effect of not including Alpaca TTS data in the SLM cross-modal instruction fine-tuning stage.
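The three rate metrics defined above are simple proportions over the annotated responses. A minimal sketch, assuming a hypothetical record format with one boolean per annotation task:

```python
def rates(annotations):
    """Compute safety, relevance, and helpfulness rates.

    annotations: list of dicts with boolean keys 'safe', 'relevant', 'helpful',
    one per question-response pair (a hypothetical label format).
    """
    n = len(annotations)
    return {
        "safety_rate": sum(a["safe"] for a in annotations) / n,
        "relevance_rate": sum(a["relevant"] for a in annotations) / n,
        "helpfulness_rate": sum(a["helpful"] for a in annotations) / n,
    }

demo = [
    {"safe": True,  "relevant": True,  "helpful": True},
    {"safe": True,  "relevant": False, "helpful": True},
    {"safe": False, "relevant": True,  "helpful": False},
    {"safe": True,  "relevant": True,  "helpful": False},
]
print(rates(demo))  # safety 0.75, relevance 0.75, helpfulness 0.5
```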


Authors:

(1) Raghuveer Peri, AWS AI Labs, Amazon (equal contribution) ([email protected]);

(2) Sai Muralidhar Jayanthi, AWS AI Labs, Amazon (equal contribution);

(3) Srikanth Ronanki, AWS AI Labs, Amazon;

(4) Anshu Bhatia, AWS AI Labs, Amazon;

(5) Karel Mundnich, AWS AI Labs, Amazon;

(6) Saket Dingliwal, AWS AI Labs, Amazon;

(7) Nilaksh Das, AWS AI Labs, Amazon;

(8) Zejiang Hou, AWS AI Labs, Amazon;

(9) Goeric Huybrechts, AWS AI Labs, Amazon;

(10) Srikanth Vishnubhotla, AWS AI Labs, Amazon;

(11) Daniel Garcia-Romero, AWS AI Labs, Amazon;

(12) Sundararajan Srinivasan, AWS AI Labs, Amazon;

(13) Kyu J Han, AWS AI Labs, Amazon;

(14) Katrin Kirchhoff, AWS AI Labs, Amazon.


This paper is available on arxiv under CC BY 4.0 DEED license.