
Large Language Models on Memory-Constrained Devices Using Flash Memory: Load From Flash

by Knapsack Technology, July 31st, 2024

Too Long; Didn't Read

Efficiently run large language models on devices with limited DRAM by optimizing flash memory use, reducing data transfer, and enhancing throughput.

Authors:

(1) Keivan Alizadeh;

(2) Iman Mirzadeh, Major Contribution;

(3) Dmitry Belenko, Major Contribution;

(4) S. Karen Khatamifard;

(5) Minsik Cho;

(6) Carlo C Del Mundo;

(7) Mohammad Rastegari;

(8) Mehrdad Farajtabar.

Abstract and 1. Introduction

2. Flash Memory & LLM Inference and 2.1 Bandwidth and Energy Constraints

2.2 Read Throughput

3 Load From Flash

3.1 Reducing Data Transfer

3.2 Improving Transfer Throughput with Increased Chunk Sizes

3.3 Optimized Data Management in DRAM

4 Results

4.1 Results for OPT 6.7B Model

4.2 Results for Falcon 7B Model

5 Related Works

6 Conclusion and Discussion, Acknowledgements and References

3 Load From Flash

This section addresses the challenge of conducting inference on devices where the available DRAM is substantially smaller than the size of the model. This necessitates storing the full model weights in flash memory. Our primary metric for evaluating various flash loading strategies is latency, dissected into three distinct components: the I/O cost of loading from flash, the overhead of managing memory with newly loaded data, and the compute cost for inference operations.
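
To make this breakdown concrete, the sketch below models per-token latency as the sum of these three components. It is purely illustrative: the function name, parameters, and example numbers are assumptions, not measurements from the paper.

```python
# Illustrative latency model: per-token latency is the sum of flash I/O time,
# memory-management overhead, and compute time. All names and numbers are
# assumptions for illustration, not measurements from the paper.

def per_token_latency(bytes_loaded: float,
                      flash_bandwidth_bps: float,
                      mem_mgmt_s: float,
                      compute_s: float) -> float:
    io_s = bytes_loaded / flash_bandwidth_bps  # time spent loading from flash
    return io_s + mem_mgmt_s + compute_s

# Example: 100 MB loaded per token over a 1 GB/s flash link, plus 5 ms of
# memory-management overhead and 20 ms of compute.
print(per_token_latency(100e6, 1e9, 0.005, 0.020))  # -> 0.125 seconds
```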


Our proposed solutions for reducing latency under memory constraints fall into three strategic areas, each targeting a specific component of the overall latency:


Reducing Data Load: Aiming to decrease latency associated with flash I/O operations by loading less data[1].


Optimizing Data Chunk Size: Enhancing flash throughput by increasing the size of the data chunks loaded, thereby mitigating latency (illustrated in the sketch after this list).


Efficient Management of Loaded Data: Streamlining the management of data once it is loaded into memory to minimize overhead.
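
As a rough illustration of the second strategy, the following sketch measures effective read throughput for a few chunk sizes; larger contiguous reads generally amortize per-read overhead. The file path and chunk sizes are placeholder assumptions, and this is not code from the paper (measured numbers will also depend on OS caching and the storage device).

```python
# Rough illustration: effective read throughput tends to rise with chunk size
# because per-read overhead is amortized over more bytes. The file path and
# chunk sizes below are placeholder assumptions.
import os
import time

def read_throughput(path: str, chunk_size: int) -> float:
    """Read the whole file in `chunk_size` chunks; return bytes per second."""
    total_bytes = os.path.getsize(path)
    start = time.perf_counter()
    with open(path, "rb", buffering=0) as f:
        while f.read(chunk_size):
            pass
    return total_bytes / (time.perf_counter() - start)

if __name__ == "__main__":
    # Hypothetical weight file; replace with any large local file.
    for size in (4 * 1024, 64 * 1024, 1024 * 1024):  # 4 KiB, 64 KiB, 1 MiB
        mbps = read_throughput("weights.bin", size) / 1e6
        print(f"{size:>8} B chunks: {mbps:.1f} MB/s")
```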


It is important to note that our focus is not on the compute aspect of the process, as it is orthogonal to the core concerns of our work. This delineation allows us to concentrate on optimizing flash memory interactions and memory management to achieve efficient inference on memory-constrained devices.


Finally, we will elaborate on the implementation of these strategies in subsequent sections.


This paper is available on arXiv under the CC BY-SA 4.0 DEED license.


[1] Note that by data we mean the weights of the neural network. However, our techniques can easily be generalized to other data transferred and used for LLM inference, such as activations or the KV cache, as suggested by Sheng et al. (2023).