Can ChatGPT-Style Models Survive Quantization?

Too Long; Didn't Read

Applying quantization to chat-based LLMs comes with challenges. See how different techniques impact conversational AI and what methods preserve the best response quality.

Authors:

(1) Wanyun Cui, Shanghai University of Finance and Economics, with equal contribution;

(2) Qianle Wang, Shanghai University of Finance and Economics, with equal contribution.

Abstract and 1 Introduction

2 Related Work

3 Quantifying the Impact of Parameters on Model Performance & 4 Unified Mixed-Precision Training

5 Prevalence of Parameter Heterogeneity in LLMs

6 Quantization Experiments and 6.1 Implementation Details

6.2 Effect of Base LLM Quantization

6.3 Effect of Chat LLM Quantization

6.4 Comparison of Parameter Selection Criteria, Conclusion, & References

6.3 Effect of Chat LLM Quantization

We conduct experiments on Vicuna-1.5 [5]. We apply 3-bit quantization with a group size of 128 for CherryQ and all baselines.
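To make the setup concrete, below is a minimal sketch of group-wise round-to-nearest quantization at 3 bits with group size 128, the basic scheme the compared methods share. The function name and tensor shapes are illustrative, not from the paper; CherryQ additionally keeps a small fraction of high-impact "cherry" parameters in high precision, which this sketch omits.

```python
import torch

def quantize_groupwise(weight: torch.Tensor, bits: int = 3, group_size: int = 128):
    """Asymmetric round-to-nearest quantization applied per group of
    `group_size` consecutive weights along each row (a common baseline;
    hypothetical helper, not CherryQ's full method)."""
    rows, cols = weight.shape
    assert cols % group_size == 0
    w = weight.reshape(rows, cols // group_size, group_size)

    qmax = 2 ** bits - 1                      # 3 bits -> integer levels 0..7
    wmin = w.amin(dim=-1, keepdim=True)
    wmax = w.amax(dim=-1, keepdim=True)
    scale = (wmax - wmin).clamp(min=1e-8) / qmax
    zero = torch.round(-wmin / scale)         # zero point per group

    q = torch.clamp(torch.round(w / scale) + zero, 0, qmax)
    dequant = (q - zero) * scale              # what the model sees at inference
    return q.reshape(rows, cols), dequant.reshape(rows, cols)

# Example: quantize a 4096x4096 projection matrix with group size 128.
w = torch.randn(4096, 4096)
q, w_hat = quantize_groupwise(w, bits=3, group_size=128)
print(f"max abs error: {(w - w_hat).abs().max():.4f}")
```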


Evaluation To assess the performance of quantized open-ended chat models, we employ pairwise comparison on Vicuna-bench [26], which consists of 80 test samples. We compare the responses generated by the quantized models against those generated by the original 16-bit Vicuna-1.5. The evaluation is performed by GPT-4, which automatically classifies each quantized model's response as a "win", "tie", or "lose" relative to the FP16 model's response. To eliminate ordering effects in the evaluation, we follow [17] and compare each pair of responses in both orders, yielding 160 trials.
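The judging protocol can be summarized in a short loop. In this sketch, `generate` and `judge` are hypothetical helpers (the actual judging prompt and GPT-4 call from [26] are omitted); it shows how judging each pair in both orders turns 80 questions into 160 trials and cancels the judge's position bias.

```python
from collections import Counter

def judge(question: str, answer_a: str, answer_b: str) -> str:
    """Ask GPT-4 which answer is better; returns 'A', 'B', or 'tie'.
    (Stub: in practice this wraps a chat-completion call with a
    pairwise judging prompt, omitted here.)"""
    raise NotImplementedError

def evaluate_pairwise(bench, quant_model, fp16_model):
    """Compare a quantized model against its FP16 counterpart on each
    benchmark question, judging both response orders to cancel the
    judge's position bias: 80 questions x 2 orders = 160 trials."""
    tally = Counter()
    for question in bench:                    # e.g. 80 Vicuna-bench prompts
        resp_q = quant_model.generate(question)
        resp_fp = fp16_model.generate(question)

        # Order 1: quantized response shown first.
        verdict = judge(question, resp_q, resp_fp)
        tally["win" if verdict == "A" else "lose" if verdict == "B" else "tie"] += 1

        # Order 2: FP16 response shown first.
        verdict = judge(question, resp_fp, resp_q)
        tally["win" if verdict == "B" else "lose" if verdict == "A" else "tie"] += 1
    return tally  # counts of win / tie / lose for the quantized model
```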


Figure 3 presents the results of the pairwise comparison for each quantized model against its FP16 counterpart. CherryQ consistently outperforms the other quantization baselines in preserving chat-model quality: it achieves the most wins and ties against the FP16 models while incurring the fewest losses.


Table 3: Performance of different 3-bit quantization methods on Huggingface OpenLLM for LLaMA2-7B and LLaMA2-13B.


Figure 3: Comparison of 3-bit quantized models to FP16 Vicuna-1.5. (Left) Comparisons to Vicuna-1.5-7B. (Right) Comparisons to Vicuna-1.5-13B. CherryQ even shows competitive quality compared to the 16-bit counterpart.


Notably, 3-bit CherryQ achieves a slightly better win-tie-lose ratio against the FP16 Vicuna model, indicating that the 3-bit quantized model performs on par with, or even marginally better than, the FP16 model. Since a quantized model intuitively cannot surpass its 16-bit target, we interpret this result as evidence that CherryQ retains nearly all of its performance even at 3 bits, to the point where GPT-4 can hardly distinguish the quality of the low-bit and FP16 responses.


This paper is available on arXiv under a CC BY 4.0 DEED license.