
2-Bit VPTQ: 6.5x Smaller LLMs While Preserving 95% Accuracy

Running 2-bit 70B LLMs on a 24 GB GPU

Image generated with ChatGPT

Recent advances in low-bit quantization for LLMs, such as AQLM and AutoRound, now show acceptable levels of degradation on downstream tasks, especially for large models. That said, 2-bit quantization still introduces visible accuracy losses in most cases.

One promising algorithm for low-bit quantization is VPTQ (MIT license), proposed by Microsoft. It was introduced in October 2024 and has since shown excellent performance and efficiency in quantizing large models.
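
At its core, VPTQ relies on vector quantization: instead of rounding each weight to a low-bit scalar, it groups weights into short vectors and stores, for each vector, only the index of its nearest centroid in a codebook. The toy sketch below illustrates just this core idea with a random codebook; the actual VPTQ algorithm learns the codebook and refines the assignments, which this sketch does not do.

```python
# Toy illustration of the vector-quantization idea behind VPTQ (not the
# actual algorithm): each weight vector is replaced by the index of its
# nearest centroid in a codebook, so only indices plus the shared
# codebook need to be stored.
import torch

torch.manual_seed(0)

weights = torch.randn(1024, 8)   # 1024 weight vectors of dimension 8
codebook = torch.randn(256, 8)   # 256 centroids -> one 8-bit index per vector

# Assign each weight vector to its nearest centroid (Euclidean distance).
distances = torch.cdist(weights, codebook)   # shape (1024, 256)
indices = distances.argmin(dim=1)            # shape (1024,)

# "Dequantize" by looking the centroids back up.
reconstructed = codebook[indices]

# Effective bit-width: 8 bits per 8-dim vector = 1 bit per weight,
# ignoring the (shared, amortized) codebook storage.
print("mean squared error:", torch.mean((weights - reconstructed) ** 2).item())
```

In the real algorithm the codebook is learned from the model's weights rather than drawn at random, which is what keeps the reconstruction error low at such extreme bit-widths.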

In this article, we will:

  1. Review the VPTQ quantization algorithm.
  2. Demonstrate how to use VPTQ models, many of which are already available; for instance, low-bit variants of Llama 3.3 70B, Llama 3.1 405B, and Qwen2.5 72B are easy to find (see the loading sketch after this list).
  3. Benchmark these models and discuss the results to understand when VPTQ models can be a good choice for LLMs in production.
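
As a concrete starting point, here is a minimal loading-and-generation sketch following the usage pattern shown in the VPTQ project repository. The model ID below is one of the VPTQ-community checkpoints published on Hugging Face, and the `vptq` package API may evolve, so verify both against the project README before relying on them.

```python
# Minimal sketch: load a pre-quantized VPTQ checkpoint and generate text.
# Requires: pip install vptq transformers
# The model ID is an assumption based on the VPTQ-community checkpoints on
# the Hugging Face hub; check the hub for the exact repository name.
import transformers
import vptq

model_id = "VPTQ-community/Meta-Llama-3.1-70B-Instruct-v8-k65536-0-woft"

tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
model = vptq.AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("Vector quantization is", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Because the checkpoints are already quantized, no calibration step is needed on the user's side; `device_map="auto"` lets Accelerate place the (much smaller) weights across the available GPU memory.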

Remarkably, 2-bit quantization with VPTQ comes close to the accuracy of the original 16-bit model on tasks such as MMLU. Moreover, it makes it possible to run Llama 3.1 405B on a single GPU, using less memory than a 16-bit 70B model!
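
A quick back-of-the-envelope calculation makes the memory claim plausible. It ignores codebook storage, layers kept at higher precision, and the KV cache, so real footprints are somewhat larger:

```python
# Rough memory for storing model weights: params * bits / 8 bytes.
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    return n_params * bits_per_weight / 8 / 1e9

print(f"Llama 3.1 405B @ 2-bit : {weight_memory_gb(405e9, 2):.0f} GB")   # ~101 GB
print(f"Llama 3.3 70B  @ 16-bit: {weight_memory_gb(70e9, 16):.0f} GB")   # ~140 GB
```

At 2 bits per weight, the 405B model's weights fit in roughly 101 GB, comfortably below the ~140 GB a 16-bit 70B model requires.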
