Large Language Models (LLMs) are all the rage these days! They have revolutionized areas like natural language processing and speech recognition, and lately computer vision and generative AI, with models such as Sora enabling machines to understand and generate outputs – text, images and video – with unprecedented accuracy and fluency. However, one of the most critical challenges in deploying LLMs is their enormous memory requirement, both for training and for inference. Training these LLMs requires huge computing and energy resources, costing hundreds of thousands of dollars, which puts them out of reach of everyone except the large corporations that can afford such resources.
Some amazing recent developments in compression methods have now changed the situation dramatically. One can now create small versions of LLMs that can be deployed on cheap hardware and achieve performance not far from that of the large models.
One of the key ingredients is a family of recent methods for quantizing models. Previous quantization methods needed calibration data, which is problematic: the data may be hard to access, biased or of poor quality, and calibration takes a long time – from several hours to days. A new method called HQQ performs quantization without any calibration data and runs orders of magnitude faster – up to 100 times faster – so huge networks can be quantized in minutes. The magic behind HQQ is to bring in some classic techniques from convex optimization. The authors formulate quantization as an optimization problem involving a sparsity-inducing norm, and then decompose it into two sub-problems that are solved in alternating fashion. The nice thing is that both sub-problems have direct closed-form solutions, which are classics in the convex optimization literature. The algorithm therefore needs no gradient iterations; it simply applies the two closed-form solutions alternately, and the whole procedure converges in very few steps, which explains the huge gain in speed.
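To make the alternating structure concrete, here is a rough NumPy sketch under simplifying assumptions: per-tensor affine quantization, a generalized soft-thresholding step for the sparsity sub-problem, and a closed-form zero-point update. The function names, the norm parameter p and the penalty weight beta are illustrative choices of ours, not the authors' exact implementation.

```python
import numpy as np

def quantize(W, scale, zero):
    # Round-to-nearest affine quantization (per-tensor, for illustration)
    return np.round(W / scale + zero)

def dequantize(W_q, scale, zero):
    return scale * (W_q - zero)

def shrink_lp(x, beta, p=0.7):
    """Generalized soft-thresholding: closed-form solution of the
    sparsity sub-problem for an l_p penalty with p < 1."""
    x_abs = np.maximum(np.abs(x), 1e-8)  # numerical safety near zero
    return np.sign(x) * np.maximum(x_abs - (p / beta) * x_abs ** (p - 1), 0.0)

def hqq_like(W, scale, zero, iters=20, beta=10.0, p=0.7):
    """Alternate between the two closed-form sub-problems; no gradients."""
    for _ in range(iters):
        W_q = quantize(W, scale, zero)
        residual = W - dequantize(W_q, scale, zero)
        W_e = shrink_lp(residual, beta, p)        # sub-problem 1: error update
        zero = np.mean(W_q - (W - W_e) / scale)   # sub-problem 2: zero-point update
    return quantize(W, scale, zero), scale, zero

# Toy usage: quantize a random weight matrix to a ~4-bit range
W = np.random.randn(256, 256).astype(np.float32)
scale = (W.max() - W.min()) / 15.0
W_q, scale, zero = hqq_like(W, scale, zero=-W.min() / scale)
```

Because each step only applies a shrinkage formula and an averaging formula, a single linear layer can be processed in a handful of cheap passes over its weights, which is where the speed comes from.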
While such post-training quantization works well at INT8 precision, going to extremely low precision (e.g., 2-bit or 1-bit) can really degrade performance. One would now like to do some fine-tuning to regain performance, but fine-tuning the huge model is prohibitively expensive. Moreover, the gradient descent used for model training sees zero gradients nearly everywhere at low precision, so it cannot make meaningful updates to the quantized weights. Here a technique called QLoRA has been very effective. LoRA (Low-Rank Adaptation of Large Language Models) does not train the whole language model at all; instead it adds "adapters" – very small trainable matrices (generally less than 1% of the full model) – while keeping the rest of the model frozen. QLoRA is LoRA applied to a quantized model.
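Here is a minimal PyTorch sketch of the idea: a frozen base linear layer plus a trainable low-rank adapter. In QLoRA the frozen weight would additionally be stored in quantized form and dequantized on the fly; that part is omitted here, and the rank and scaling values are illustrative.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base layer plus a trainable low-rank update B @ A."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                 # keep the big model constant
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x):
        # Frozen base path + tiny trainable low-rank path
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(4096, 4096), rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable fraction: {trainable / total:.2%}")   # well under 1%
```

Because only A and B receive gradients, training touches a tiny fraction of the parameters and sidesteps the zero-gradient problem of the quantized weights.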
The results are quite dramatic! The authors of HQQ and their collaborators report that directly applying 1-bit quantization to small models like Llama2-7B yields suboptimal results. However, when the model is fine-tuned, its output quality improves substantially. Remarkably, the fine-tuned 1-bit base model surpasses the performance of QuIP# at 2-bit, despite being trained on only ~2.8K samples with a context window of 1024.
At 2-bit precision, the picture is even better: when given more specialized data, a 2-bit model can perform very well. In fact, the base Llama2-7B 2-bit model quantized with a version of HQQ and fine-tuned with QLoRA outperforms the full-precision model on wikitext, and the chat model outperforms its full-precision version on GSM8K when given enough math and reasoning data.
QLoRA is part of a larger toolkit of methods called Parameter-Efficient Fine-Tuning (PEFT), which train only a very small part of the full network (often less than 1% of the parameters). With PEFT methods, it becomes possible to fine-tune LLMs on smaller compute resources and in more reasonable time frames – minutes instead of hours or days – and recover performance almost to the level of the original models. A sketch of how this looks in practice is given below.
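As an illustration of how little code this can take, the sketch below uses the Hugging Face transformers and peft libraries to wrap a base model with LoRA adapters; the model name and the LoRA hyperparameters are placeholders, not a recommended recipe.

```python
# Illustrative only: assumes the Hugging Face `transformers` and `peft` libraries.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Placeholder model id (large and gated; substitute any causal LM you have access to)
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

config = LoraConfig(
    r=8,                                   # adapter rank
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],   # which layers get adapters
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, config)
model.print_trainable_parameters()  # reports the (tiny) trainable fraction
```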
These are very exciting days when a combination of such techniques promises to deliver LLMs that can be used on consumer hardware, and greatly expand the range of innovations they can enable. Embedl is excited to be part of enabling this transformation!