Authors:
(1) Wanyun Cui, Shanghai University of Finance and Economics, with equal contribution;
(2) Qianle Wang, Shanghai University of Finance and Economics, with equal contribution.
3 Quantifying the Impact of Parameters on Model Performance & 4 Unified Mixed-Precision Training
5 Prevalence of Parameter Heterogeneity in LLMs
6 Quantization Experiments and 6.1 Implementation Details
6.2 Effect of Base LLM Quantization
6.3 Effect of Chat LLM Quantization
6.4 Comparison of Parameter Selection Criteria, Conclusion, & References
In the experimental section, we demonstrate the effectiveness of CherryQ for both base LLMs and chat LLMs. We also compare different cherry parameter selection criteria to highlight the effectiveness of the impact-based criterion.
Parameter Representation: Because cherry parameters constitute only a tiny fraction of the total, for each row of each parameter matrix we treat only the top 1/256 of parameters with the highest impact as cherry parameters and retain them in FP16 precision. For example, the parameter matrices of LLaMA2-7B are of size 4096 × 4096, so we select the 16 highest-impact parameters per row, yielding 4096 × 16 cherry parameters per matrix. Additionally, to reconstruct the complete parameter matrix, an INT16 column index must be stored for each cherry parameter. Each cherry parameter therefore requires 32 bits in total (16 for the FP16 value and 16 for the index).
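For concreteness, the per-row selection described above can be sketched as follows. This is an illustrative reconstruction, not the authors' released code: the `impact` tensor is assumed to hold precomputed per-parameter impact scores (Section 3), and all function names are placeholders.

```python
import torch

def select_cherry_parameters(weight: torch.Tensor, impact: torch.Tensor):
    """Pick the top 1/256 highest-impact parameters per row as cherry
    parameters, storing an FP16 value plus an INT16 column index each."""
    rows, cols = weight.shape          # e.g., 4096 x 4096 for LLaMA2-7B
    k = cols // 256                    # 4096 // 256 = 16 cherries per row

    # Column indices of the k highest-impact entries in each row.
    _, idx = impact.topk(k, dim=1)                          # (rows, k)

    cherry_vals = weight.gather(1, idx).to(torch.float16)   # 16-bit value
    cherry_idx = idx.to(torch.int16)                        # 16-bit index
    # => 32 bits of storage per cherry parameter in total.

    # Boolean mask of cherry positions, used later to exclude them
    # from low-bit quantization.
    mask = torch.zeros_like(weight, dtype=torch.bool)
    mask.scatter_(1, idx, True)
    return cherry_vals, cherry_idx, mask
```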
Quantization Datasets: For the quantization of base LLMs, we follow [8] and use C4 [20] as the training data. We select the first 4 partitions of C4 and keep only documents with a length of ≥ 2048 tokens, resulting in a total of 50k samples of 2048 tokens each. For chat LLMs, since Vicuna-1.5 [5] is obtained by supervised fine-tuning on ShareGPT [5], we also use the ShareGPT dataset for training. We use a total of 20k ShareGPT training samples for QAT and CherryQ.
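A minimal sketch of this C4 filtering, assuming the Hugging Face `datasets` and `transformers` libraries and the public `allenai/c4` shards; the specific shard names, tokenizer, and truncation policy are our assumptions, not the paper's released pipeline.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# First 4 English C4 shards (assumed stand-in for "first 4 partitions").
shards = [f"en/c4-train.0000{i}-of-01024.json.gz" for i in range(4)]
c4 = load_dataset("allenai/c4", data_files={"train": shards}, split="train")

samples, target = [], 50_000
for example in c4:
    ids = tokenizer(example["text"])["input_ids"]
    if len(ids) >= 2048:            # keep only sufficiently long documents
        samples.append(ids[:2048])  # truncate to exactly 2048 tokens
    if len(samples) >= target:
        break
```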
Baselines: We compare our method with various quantization methods, including QAT [18], GPTQ [8], SqueezeLLM [13], OmniQuant [21], and AWQ [17]. For OmniQuant and AWQ, we use the results reported in [21]. For SqueezeLLM, we use the results from its original paper [13]. For GPTQ, its 4-bit model is obtained from the open-source release [1]. Since no 3-bit GPTQ model is available, we quantize the model ourselves using the AutoGPTQ implementation [2]. Since CherryQ is based on QAT, for a fair comparison the QAT implementation is identical to CherryQ, except that it does not handle cherry parameters.
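As a rough illustration of this QAT-vs-CherryQ setup, the sketch below fake-quantizes all non-cherry parameters with a generic per-row min-max quantizer and a straight-through estimator, while cherry parameters pass through in FP16. The quantizer and function names are assumptions for illustration; the paper's exact quantization scheme may differ.

```python
import torch

def fake_quantize(w: torch.Tensor, bits: int = 3) -> torch.Tensor:
    """Generic per-row min-max fake quantization with a straight-through
    estimator (a stand-in quantizer, not the paper's exact formulation)."""
    qmax = 2 ** bits - 1
    wmin = w.min(dim=1, keepdim=True).values
    wmax = w.max(dim=1, keepdim=True).values
    scale = (wmax - wmin).clamp(min=1e-8) / qmax
    w_q = ((w - wmin) / scale).round().clamp(0, qmax) * scale + wmin
    # Straight-through estimator: forward uses w_q, gradients flow to w.
    return w + (w_q - w).detach()

def cherry_qat_forward(x, weight, cherry_mask, bits=3):
    """Mixed-precision forward pass: cherry parameters stay in FP16,
    all remaining parameters are fake-quantized during training.
    Plain QAT corresponds to an all-False cherry_mask."""
    w_q = fake_quantize(weight, bits)
    w_mixed = torch.where(cherry_mask, weight, w_q)
    return x @ w_mixed.t()
```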
This paper is available on arxiv under CC BY 4.0 DEED license.
[1] https://huggingface.co/TheBloke
[2] https://github.com/AutoGPTQ/AutoGPTQ