Rethinking AI Quantization: The Missing Piece in Model Efficiency

Written by disproportionate | Published 2025/03/06
Tech Story Tags: llm-quantization | parameter-heterogeneity | ai-model-optimization | mixed-precision-training | cherryq-algorithm | llm-performance | ai-efficiency | low-bit-quantization

TL;DR: Quantization helps reduce LLM memory demands, but existing methods overlook parameter heterogeneity. Learn how new approaches like CherryQ address this issue for better efficiency.

Authors:

(1) Wanyun Cui, Shanghai University of Finance and Economics, with equal contribution;

(2) Qianle Wang, Shanghai University of Finance and Economics, with equal contribution.

Table of Links

Abstract and 1 Introduction

2 Related Work

3 Quantifying the Impact of Parameters on Model Performance & 4. Unified Mixed-Precision Training

5 Prevalence of Parameter Heterogeneity in LLMs

6 Quantization Experiments and 6.1 Implementation Details

6.2 Effect of Base LLM Quantization

6.3 Effect of Chat LLM Quantization

6.4 Comparison of Parameter Selection Criteria, Conclusion, & References

2. Related Work

Quantization Strategies for LLMs

Various quantization strategies have been proposed in the literature to reduce the precision of weights and activations while maintaining acceptable accuracy. These strategies can be broadly categorized into post-training quantization and quantization-aware training [14]. Post-training quantization methods, such as OBD, OBS, and GPTQ, directly quantize the pre-trained model without fine-tuning [15, 10, 8]. On the other hand, quantization-aware training methods, such as LLM-QAT [18], incorporate quantization operations into the training process to jointly optimize the quantized model. Some works also explore mixed-precision quantization [13] and adaptive quantization bins [7] to achieve a better trade-off between accuracy and efficiency.
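To make the distinction concrete, here is a minimal PyTorch sketch contrasting the two families: post-training quantization applies round-to-nearest to frozen weights once, while quantization-aware training keeps a fake-quantization step in the forward pass and uses a straight-through estimator so gradients still reach the full-precision weights. The quantize_rtn and FakeQuantLinear helpers, the 4-bit setting, and the layer sizes are illustrative assumptions, not code from any of the cited methods.

```python
import torch

def quantize_rtn(w: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Round-to-nearest uniform quantization of a weight tensor (per-tensor scale)."""
    qmax = 2 ** (bits - 1) - 1               # e.g. 7 for signed 4-bit
    scale = w.abs().max() / qmax
    w_q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return w_q * scale                        # dequantized ("fake-quant") weights

class FakeQuantLinear(torch.nn.Linear):
    """Linear layer with fake quantization in the forward pass (QAT-style sketch)."""
    def forward(self, x):
        w_q = quantize_rtn(self.weight, bits=4)
        # Straight-through estimator: forward uses w_q, backward treats it as identity,
        # so gradients update the underlying full-precision weights.
        w_ste = self.weight + (w_q - self.weight).detach()
        return torch.nn.functional.linear(x, w_ste, self.bias)

# Post-training quantization: quantize frozen weights once, no fine-tuning.
layer = torch.nn.Linear(512, 512)
layer.weight.data = quantize_rtn(layer.weight.data, bits=4)

# Quantization-aware training: the fake-quant layer is trained end to end.
qat_layer = FakeQuantLinear(512, 512)
loss = qat_layer(torch.randn(8, 512)).pow(2).mean()
loss.backward()                               # gradients reach qat_layer.weight
```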

Outliers in Language Model Quantization

The idea of modeling parameter outliers in LLM quantization is not new. Prior work explores outliers primarily from the perspectives of magnitude [18, 7] and activations [4, 6]. From the magnitude perspective, for example, QLoRA assumes that parameters follow a Gaussian distribution [7] and designs information-theoretically optimal quantization bins based on this assumption, while [18] keeps outlier parameters in 16-bit precision. From the activation perspective, [17] migrates the outlier amplifier to subsequent modules through an equivalent transformation. SqueezeLLM also measures outliers from the perspective of parameter impact [13]. To the best of our knowledge, our work is the first to systematically reveal the outliers (heterogeneity) of parameter impact across different models, and we show that the imbalance in parameter impacts is far more pronounced than the imbalance in magnitudes (§ 6.4). Furthermore, we propose a method that unifies the optimization of outlier (cherry) parameters and normal parameters, addressing the optimization challenges posed by heterogeneous parameters.
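For intuition, the sketch below shows one way a mixed-precision scheme of this kind can be wired up: rank parameters by a per-parameter impact score, keep the top fraction (the "cherry" parameters) in 16-bit, and quantize the rest to a low bit width. The impact score is left as an input because the paper derives its own impact metric (Section 3); the magnitude-based score in the usage example, the 1% cherry fraction, the 3-bit setting, and the helper names are illustrative assumptions, not the paper's actual algorithm.

```python
import torch

def cherry_mask_from_impact(impact: torch.Tensor, frac: float = 0.01) -> torch.Tensor:
    """Boolean mask marking the top `frac` fraction of parameters by impact score."""
    k = max(1, int(frac * impact.numel()))
    threshold = impact.flatten().topk(k).values.min()
    return impact >= threshold

def mixed_precision_quantize(w: torch.Tensor, impact: torch.Tensor,
                             bits: int = 3, frac: float = 0.01) -> torch.Tensor:
    """Quantize normal parameters to low bit width; keep cherry parameters in FP16."""
    cherry = cherry_mask_from_impact(impact, frac)
    qmax = 2 ** (bits - 1) - 1
    scale = w[~cherry].abs().max() / qmax     # scale fit on the non-cherry bulk
    w_q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
    # Cherry entries round-trip through FP16 to mimic 16-bit storage.
    return torch.where(cherry, w.half().float(), w_q)

# Example: use |weight| as a stand-in impact score (a simple magnitude criterion).
w = torch.randn(4096, 4096)
w_mixed = mixed_precision_quantize(w, impact=w.abs(), bits=3, frac=0.01)
```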

This paper is available on arxiv under CC BY 4.0 DEED license.

