Can ChatGPT-Style Models Survive Quantization?

Too Long; Didn't Read

Applying quantization to chat-based LLMs comes with challenges. See how different techniques impact conversational AI and what methods preserve the best response quality.

Authors:

(1) Wanyun Cui, Shanghai University of Finance and Economics, with equal contribution;

(2) Qianle Wang, Shanghai University of Finance and Economics, with equal contribution.

Abstract and 1 Introduction

2 Related Work

3 Quantifying the Impact of Parameters on Model Performance & 4 Unified Mixed-Precision Training

5 Prevalence of Parameter Heterogeneity in LLMs

6 Quantization Experiments and 6.1 Implementation Details

6.2 Effect of Base LLM Quantization

6.3 Effect of Chat LLM Quantization

6.4 Comparison of Parameter Selection Criteria, Conclusion, & References

6.3 Effect of Chat LLM Quantization

We conduct experiments on Vicuna-1.5 [5]. We apply 3-bit quantization with a group size of 128 for CherryQ and all baselines.
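To make the setup concrete, below is a minimal sketch of group-wise round-to-nearest quantization at 3 bits with group size 128, the basic scheme the compared methods share. The function name and tensor shapes are illustrative, not from the paper; CherryQ additionally keeps a small fraction of high-impact "cherry" parameters in high precision, which this sketch omits.

```python
import torch

def quantize_groupwise(weight: torch.Tensor, bits: int = 3, group_size: int = 128):
    """Asymmetric round-to-nearest quantization applied per group of
    `group_size` consecutive weights along each row (a common baseline;
    hypothetical helper, not CherryQ's full method)."""
    rows, cols = weight.shape
    assert cols % group_size == 0
    w = weight.reshape(rows, cols // group_size, group_size)

    qmax = 2 ** bits - 1                      # 3 bits -> integer levels 0..7
    wmin = w.amin(dim=-1, keepdim=True)
    wmax = w.amax(dim=-1, keepdim=True)
    scale = (wmax - wmin).clamp(min=1e-8) / qmax
    zero = torch.round(-wmin / scale)         # zero point per group

    q = torch.clamp(torch.round(w / scale) + zero, 0, qmax)
    dequant = (q - zero) * scale              # what the model sees at inference
    return q.reshape(rows, cols), dequant.reshape(rows, cols)

# Example: quantize a 4096x4096 projection matrix with group size 128.
w = torch.randn(4096, 4096)
q, w_hat = quantize_groupwise(w, bits=3, group_size=128)
print(f"max abs error: {(w - w_hat).abs().max():.4f}")
```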


Evaluation To assess the performance of quantized open-ended chat models, we employ pairwise comparison on Vicuna-bench [26], which consists of 80 test samples. We compare the responses generated by the quantized models against those generated by the original 16-bit Vicuna-1.5. The evaluation is performed by GPT-4, which automatically classifies each quantized model's response as a "win", "tie", or "lose" relative to the FP16 model's response. To eliminate ordering effects in the evaluation, we follow [17] and compare each pair of responses in both orders, yielding 160 trials.
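The judging protocol can be summarized in a short loop. In this sketch, `generate` and `judge` are hypothetical helpers (the actual judging prompt and GPT-4 call from [26] are omitted); it shows how judging each pair in both orders turns 80 questions into 160 trials and cancels the judge's position bias.

```python
from collections import Counter

def judge(question: str, answer_a: str, answer_b: str) -> str:
    """Ask GPT-4 which answer is better; returns 'A', 'B', or 'tie'.
    (Stub: in practice this wraps a chat-completion call with a
    pairwise judging prompt, omitted here.)"""
    raise NotImplementedError

def evaluate_pairwise(bench, quant_model, fp16_model):
    """Compare a quantized model against its FP16 counterpart on each
    benchmark question, judging both response orders to cancel the
    judge's position bias: 80 questions x 2 orders = 160 trials."""
    tally = Counter()
    for question in bench:                    # e.g. 80 Vicuna-bench prompts
        resp_q = quant_model.generate(question)
        resp_fp = fp16_model.generate(question)

        # Order 1: quantized response shown first.
        verdict = judge(question, resp_q, resp_fp)
        tally["win" if verdict == "A" else "lose" if verdict == "B" else "tie"] += 1

        # Order 2: FP16 response shown first.
        verdict = judge(question, resp_fp, resp_q)
        tally["win" if verdict == "B" else "lose" if verdict == "A" else "tie"] += 1
    return tally  # counts of win / tie / lose for the quantized model
```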


Figure 3 presents the results of the pairwise comparison for each quantized model against its FP16 counterpart. CherryQ consistently outperforms the other quantization baselines in preserving chat-model quality: it achieves the most wins and ties against the FP16 models while incurring the fewest losses.


Table 3: Performance of different 3-bit quantization methods on Huggingface OpenLLM for LLaMA2-7B and LLaMA2-13B.


Figure 3: Comparison of 3-bit quantized models to FP16 Vicuna-1.5. (Left) Comparisons to Vicuna-1.5-7B. (Right) Comparisons to Vicuna-1.5-13B. CherryQ even shows competitive quality compared to the 16-bit counterpart.


Notably, 3-bit CherryQ achieves a slightly better win-tie-lose ratio against the FP16 Vicuna model, indicating that the 3-bit quantized model performs on par with, or even marginally better than, the FP16 model. Since a quantized model intuitively cannot surpass its 16-bit target, we interpret this result as evidence that CherryQ retains nearly all of its performance even at 3 bits, to the point where GPT-4 can hardly distinguish the quality of the low-bit and FP16 responses.


This paper is available on arXiv under a CC BY 4.0 DEED license.