Where does In-context Translation Happen in Large Language Models: Inference Efficiency

Written by computational | Published 2024/08/30
Tech Story Tags: large-language-models | context-masking-experiments | in-context-learning | machine-translation | translation-models | supervised-neural-mt-models | gpt-models | fine-tuning-llms

TL;DR: In this study, researchers attempt to characterize the region where large language models transition from in-context learners to translation models.

Authors:

(1) Suzanna Sia, Johns Hopkins University;

(2) David Mueller;

(3) Kevin Duh.

Table of Links

5. Inference Efficiency

Speeding up transformer inference is of great interest to the community (Fournier et al., 2023). We highlight the potential for faster inference as a direct consequence of identifying where task recognition occurs in the model and where self-attention processing becomes redundant. Our results indicate that we can achieve significant inference speedups by removing the processing of context tokens altogether after a certain layer, with little to no impact on downstream performance.
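To make this concrete, here is a minimal sketch (not the authors' code; the function and variable names are our own illustrative choices) of the masking idea: a per-layer attention mask in which the in-context example tokens are hidden from every query once a chosen layer is reached.

```python
# Minimal sketch of layer-wise context masking (illustrative, not the paper's code).
import torch

def layer_attention_mask(seq_len, context_len, layer_idx, mask_from_layer):
    """Return a [seq_len, seq_len] boolean mask; True = position may be attended to.

    Positions [0, context_len) hold the in-context examples. For layers at or
    beyond `mask_from_layer`, those positions are hidden from all queries.
    """
    causal = torch.tril(torch.ones(seq_len, seq_len)).bool()
    if layer_idx >= mask_from_layer:
        causal[:, :context_len] = False  # stop attending to the example tokens
    return causal

# Example: 10 prompt-example tokens followed by 4 test-source tokens.
mask_early = layer_attention_mask(seq_len=14, context_len=10, layer_idx=3, mask_from_layer=14)
mask_late = layer_attention_mask(seq_len=14, context_len=10, layer_idx=20, mask_from_layer=14)
print(mask_early[-1].sum().item())  # last token attends to all 14 positions
print(mask_late[-1].sum().item())   # last token attends to only the 4 test tokens
```

In practice, the compute and memory savings come from skipping the key/value computation for the context tokens entirely in the later layers, rather than merely masking them out.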

Suppose the k in-context examples no longer need to be processed after layer r. Then, for a model with n_ℓ layers, the amount of processing saved in terms of speed and memory is approximately ((n_ℓ − r) / n_ℓ) × (k / (k + 1)).

Using the example of LLaMA 7B (32 layers), we see from Figure 2 that the model is very close to its ceiling score after processing the examples at layer 14 (ℓ = 14). If we no longer need to process examples after ℓ = 14, then under a prompt size of 5 the savings are approximately 45%.
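As a quick check of the arithmetic (an illustrative calculation, not code from the paper), the snippet below plugs the LLaMA 7B numbers into the expression above.

```python
# Back-of-the-envelope calculation of the savings expression above (illustrative).
def context_processing_savings(n_layers, stop_layer, k_examples):
    """Fraction of self-attention work saved if the k in-context examples
    are dropped after `stop_layer` in an `n_layers`-deep model."""
    layer_fraction = (n_layers - stop_layer) / n_layers   # layers that skip the examples
    token_fraction = k_examples / (k_examples + 1)        # share of the prompt that is examples
    return layer_fraction * token_fraction

# LLaMA 7B: 32 layers, stop processing examples after layer 14, 5-shot prompt.
print(context_processing_savings(32, 14, 5))  # ~0.47, i.e. roughly the 45% quoted above
```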

For instruction-tuned models, which are typically the ones deployed in production, the savings can be non-trivial even if no examples are provided, since very long instructions are often given to the model in an attempt to control its behavior (prompt engineering).
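The same arithmetic extends to the instruction-only setting. The numbers below are purely hypothetical, chosen only to illustrate why long system prompts make the potential savings non-trivial.

```python
# Hypothetical instruction-prompt scenario (numbers are illustrative, not from the paper).
def instruction_savings(n_layers, stop_layer, instruction_tokens, user_tokens):
    """Fraction of self-attention work saved if the instruction tokens are
    dropped after `stop_layer`, with the user query processed throughout."""
    layer_fraction = (n_layers - stop_layer) / n_layers
    token_fraction = instruction_tokens / (instruction_tokens + user_tokens)
    return layer_fraction * token_fraction

# E.g. a 32-layer model, stopping at layer 14, with a 400-token instruction
# and a 50-token user query.
print(instruction_savings(32, 14, 400, 50))  # = 0.5625 * 0.888... ≈ 0.50
```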

This paper is available on arxiv under CC BY 4.0 DEED license.
