Professor Devdatt Dubhashi, Chief Scientific Officer and co-founder of Embedl, is Professor in the Data Science and AI Division of the Department of Computer Science and Engineering at Chalmers University of Technology. He received his Ph.D. in Computer Science from Cornell University, USA, and was a postdoctoral fellow at the Max Planck Institute for Computer Science in Saarbruecken, Germany. He was with BRICS (Basic Research in Computer Science, a centre of the Danish National Research Foundation) at the University of Aarhus and then on the faculty of the Indian Institute of Technology (IIT) Delhi before joining Chalmers in 2000. He has led several national projects in machine learning and has been associated with several EU projects. He was an external expert for the OECD report on "Data-Driven Innovation". He has published regularly in premier machine learning and AI venues such as NIPS, ICML, and AAAI.
Large Language Models (LLMs) are all the rage these days! They have revolutionized areas like natural language processing and speech recognition, and lately computer vision and Generative AI systems such as Sora, enabling machines to understand and generate text, images and video with unprecedented accuracy and fluency. However, one of the most critical challenges in deploying LLMs is their enormous memory requirement, for both training and inference. Training these LLMs requires huge computing and energy resources, costing hundreds of thousands of dollars, which puts them out of reach of everyone except the large corporations that can afford such resources.
Some amazing recent developments in compression methods have changed the situation dramatically. It is now possible to create small versions of LLMs that can be deployed on cheap hardware and achieve performance not much worse than that of the full-size models.
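To make the idea concrete, here is a minimal, illustrative sketch of the simplest form of weight quantization (round-to-nearest with a single scale per tensor) in PyTorch. This is not the algorithm used by methods like HQQ, which optimize the quantization parameters far more carefully, but it shows how a 16- or 32-bit weight matrix can be stored in far fewer bits at the cost of some reconstruction error:

```python
import torch

def quantize(w: torch.Tensor, nbits: int = 4):
    # Round-to-nearest quantization with a single scale shared by the whole tensor.
    qmax = 2 ** (nbits - 1) - 1
    scale = w.abs().max() / qmax
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Recover an approximation of the original float weights.
    return q.to(torch.float32) * scale

w = torch.randn(4096, 4096)        # a typical LLM weight matrix (~67 MB in fp32)
q, scale = quantize(w, nbits=4)    # stored as int8 here; real kernels pack two 4-bit values per byte
w_hat = dequantize(q, scale)
print((w - w_hat).abs().mean())    # the reconstruction error the model must tolerate
```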
The results are quite dramatic! The authors of HQQ (Half-Quadratic Quantization) and their collaborators report that directly applying 1-bit quantization to small models like Llama2-7B yields suboptimal results. However, when the model is fine-tuned, its output quality improves substantially. Remarkably, the fine-tuned 1-bit base model surpasses the performance of QuIP# at 2 bits, despite being trained on only ~2.8K samples with a context window of 1024.
At 2 bits, when given more specialized data, a quantized model can perform very well. In fact, the base Llama2-7B 2-bit model, quantized with a version of HQQ and fine-tuned with QLoRA, outperforms the full-precision model on WikiText. The chat model outperforms its full-precision version on GSM8K when given enough math and reasoning data.
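For readers who want to experiment, the sketch below shows how a Llama2-7B checkpoint could be loaded with low-bit HQQ quantization through the Hugging Face transformers integration. This is a sketch under the assumption that a recent transformers version exposing HqqConfig is installed; parameter names such as nbits and group_size reflect that integration and may differ across versions, and the model name assumes access to the Llama2 weights:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, HqqConfig

model_id = "meta-llama/Llama-2-7b-hf"   # requires access to the Llama2 weights

# Quantize the linear layers to 2 bits, with small groups sharing a scale/zero-point.
quant_config = HqqConfig(nbits=2, group_size=64)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
    quantization_config=quant_config,
)

# The quantized model can then be used for generation as usual.
inputs = tokenizer("Quantization makes LLMs", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```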
QLoRA is part of a larger toolkit of methods called Parameter-Efficient Fine-Tuning (PEFT), which train only a very small part of the full network (often less than 1% of the parameters). With PEFT methods, it becomes possible to fine-tune LLMs on smaller compute resources and in more reasonable time frames (minutes instead of hours or days) and recover performance almost to the level of the original models.
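As a rough illustration of how little needs to be trained, here is a sketch of a QLoRA-style setup using the Hugging Face peft and bitsandbytes libraries: the base model is loaded with 4-bit weights and only the small LoRA adapter matrices are made trainable. Hyperparameters such as the rank r and the target modules are illustrative choices, not the exact settings behind the results above:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_id = "meta-llama/Llama-2-7b-hf"

# Load the frozen base model with 4-bit (NF4) quantized weights.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)

# Attach small low-rank adapters to the attention projections; only these are trained.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()   # typically well under 1% of the full model
```

The adapted model can then be fine-tuned with a standard Hugging Face Trainer, and the resulting adapter weights amount to only a few tens of megabytes.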
These are very exciting days, when a combination of such techniques promises to deliver LLMs that run on consumer hardware and greatly expands the range of innovations they can enable. Embedl is excited to be part of enabling this transformation!