How Anchor Tokens Transform Sequence Information Compression in LLMs

by Anchoring, October 10th, 2024

Too Long; Didn't Read

This section highlights how our anchor-based approach builds on prior work in in-context learning and prompt compression. Unlike task-specific methods like gist tokens, our model universally condenses sequence information into anchor tokens, improving efficiency across tasks. It also contrasts with memory-efficient attention mechanisms like FlashAttention and PagedAttention by targeting sequence compression rather than computational optimization.

Authors:

(1) Jianhui Pang, University of Macau (work done while Jianhui Pang and Fanghua Ye were interning at Tencent AI Lab);

(2) Fanghua Ye, University College London (work done while Jianhui Pang and Fanghua Ye were interning at Tencent AI Lab);

(3) Derek F. Wong, University of Macau;

(4) Longyue Wang, Tencent AI Lab, and corresponding author.

Abstract and 1 Introduction

2 Related Work

3 Anchor-based Large Language Models

3.1 Background

3.2 Anchor-based Self-Attention Networks

3.3 Anchor-based Inference

4 Experiments and 4.1 Our Implementation

4.2 Data and Training Procedure

4.3 Evaluation

5 Results

6 Analysis

7 Conclusion, Limitations, Ethics Statement, and References


A More Experimental Results

B Data Settings

Our research is inspired by Wang et al. (2023), who investigate how in-context learning (ICL) works within LLMs. Their study examines the underlying mechanisms of ICL, emphasizing the influence of label words in demonstration examples on information flow. They show that these label words act as anchors: semantic information converges into them during inference and subsequently directs the LLM's final predictions. Motivated by these findings, we aim to extend this property to natural language modeling by guiding the compression of sequence information into manually designed anchor tokens, rather than relying solely on label words. This matters because natural language text does not always contain an explicit label.


The method most relevant to ours in the existing literature is learning to compress prompts with gist tokens (Mu et al., 2023). That approach compresses task-specific prompts by fine-tuning the model with the proposed gist masking, thereby enforcing prompt compression. However, our study diverges from theirs in several crucial ways. Whereas they focus on compressing a task prompt, our objective is to train the LLM to condense sequence information into the anchor tokens. Because the anchor tokens are incorporated into the model's language modeling, our approach can be applied to a range of tasks without task-specific training, a feature that gist tokens do not share. Furthermore, our anchor-based attention masks account for both information compression within a sequence and information interaction between sequences, extending beyond the compression of task prompts alone.
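
To make the idea of an anchor-based attention mask more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation. It assumes that each sequence ends with one anchor token, that a non-anchor token may attend to earlier tokens of its own sequence and to the anchor tokens of earlier sequences, and that an anchor token attends only within its own sequence. The function name `build_anchor_mask` and the `is_anchor` input are hypothetical.

```python
def build_anchor_mask(is_anchor):
    """Return mask[i][j] == True when query token i may attend to key token j.

    is_anchor[j] is True when token j is the (assumed) anchor token that
    closes its sequence. All rules below are illustrative assumptions.
    """
    n = len(is_anchor)

    # Give every token a segment id; an anchor token closes its segment.
    segment = []
    current = 0
    for flag in is_anchor:
        segment.append(current)
        if flag:
            current += 1

    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):  # causal: only keys at or before the query
            same_segment = segment[i] == segment[j]
            earlier_anchor = is_anchor[j] and segment[j] < segment[i]
            if same_segment:
                # Compression within a sequence: full causal attention.
                mask[i][j] = True
            elif earlier_anchor and not is_anchor[i]:
                # Interaction between sequences: only through earlier anchors.
                mask[i][j] = True
    return mask
```

For example, with seven tokens where positions 2 and 5 are anchors, `build_anchor_mask([False, False, True, False, False, True, False])` lets the last token attend to itself and to the two anchors, while the non-anchor tokens of earlier sequences stay hidden from it.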


FlashAttention (Dao et al., 2022) and PagedAttention (Kwon et al., 2023), by contrast, both present memory-efficient attention mechanisms for LLMs. While they focus on optimizing attention computation and subdividing attention processing, our method specifically targets the compression of sequence information into anchor tokens, making it orthogonal to these works.


This paper is available on arXiv under the CC BY 4.0 DEED license.