The wondrous ChatGPT continues to amaze – the new multimodal version could figure out that a picture was taken in Gothenburg by reasoning about the text in the image and the presence of a tram. It can hold conversations in multiple languages, switching easily between them; pass exams like the SAT, performing better than most human test takers; and explain a cartoon joke about Einstein’s theory of relativity. More seriously, there are plenty of reports about how ChatGPT is increasing productivity in writing, summarizing and coding with co-pilots.
However, all these applications still rely on a system hosted in one place by a BigTech company with the resources to do so. The benefits of the technology could be much more widespread if the underlying LLMs could be deployed on small devices within applications running in homes and offices. People want to interact directly with an AI application on a mobile device, in their vehicles, or in a doctor’s office. Latency would be lower, giving the user a better experience; the application could be customized to their needs; and it would keep their personal information secure and private. Another major benefit is that it would not guzzle resources on a planet-threatening scale! As Mazzucato writes: “Google’s global datacentre and Meta’s ambitious plans for a new AI Research SuperCluster (RSC) further underscore the industry’s energy-intensive nature, raising concerns that these facilities could significantly increase energy consumption.”
To increase access to the power of LLMs, BigTech has started to offer products on consumer devices: Microsoft has introduced Office 365 Co-pilot, which uses AI hardware in both the cloud and locally, where possible, to help users across the Windows OS. Google has launched the Gecko version of the PaLM 2 model, which is so lightweight that it can work on mobile devices and is fast enough for interactive applications on-device, even offline. Meta has released the Llama family of generative AI models, which includes a version with only 7B parameters intended for edge devices.
Even the microcomputer company Raspberry Pi plans to sell an AI chip. It’s integrated with Raspberry Pi’s camera software and can run AI-based applications like chatbots natively on the tiny computer.
Raspberry Pi partnered with chipmaker Hailo for its AI Kit, an add-on for the Raspberry Pi 5 microcomputer that runs Hailo’s Hailo-8L M.2 accelerator. The kits will be available “soon from the worldwide network of Raspberry Pi-approved resellers” for $70. Hailo CEO and co-founder Orr Danon says that the accelerator’s “power consumption is below 2W and is passively cooled.” The accelerator offers 13 tera operations per second (TOPS), which is lower than chips planned for AI laptops like Intel’s 40 TOPS Lunar Lake processors.
There are, however, some challenges that must be addressed to bring the power of LLMs to edge devices. First and foremost are the devices’ computational and memory constraints. This is the biggest hurdle to running large AI applications at the edge: such devices have significantly less computational power and memory than cloud servers. The largest of the Llama models released by Meta contains 70B parameters, which cannot be run directly on a commercially available NVIDIA gaming GPU (with 24 GB of memory). This means that AI models need to be optimized for smaller devices.
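To make the memory constraint concrete, here is a back-of-the-envelope calculation of weight storage alone (activations and the KV cache add further overhead); the model sizes and precisions are illustrative and this is only a rough sketch, not a profiling result.

```python
# Rough weight-memory estimate for LLMs at different numeric precisions.
# Weights only -- activations and the KV cache add further overhead.

def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Approximate weight storage in gigabytes."""
    return num_params * bytes_per_param / 1e9

for name, params in [("Llama 7B", 7e9), ("Llama 70B", 70e9)]:
    for precision, nbytes in [("BF16", 2), ("INT8", 1), ("INT4", 0.5)]:
        print(f"{name:>10} @ {precision:<4}: {weight_memory_gb(params, nbytes):6.1f} GB")

# A 70B model needs roughly 140 GB for its weights in BF16 -- far beyond a
# 24 GB consumer GPU -- and even at INT4 (~35 GB) it still does not fit on one card.
```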
Heterogeneity is another big issue: edge devices come in various shapes and sizes, with different capabilities and limitations. Along with the AI revolution, there has also been a hardware revolution, with amazing innovation in new chips, many specially designed for AI algorithms. This is a great opportunity to explore the price-performance tradeoff for different applications, but it also makes it difficult for application developers to deliver AI solutions that run across many devices. A robust AI solution that lets models run efficiently across a wide range of devices is key, and tools that help develop such solutions for the edge are indispensable if the masses are to harness the power of LLMs.
Embedl develops tools and algorithms to compress deep neural networks. These tools and techniques are directly applicable to the compression of LLMs, so naturally, we explored optimization for LLMs using our tools. However, optimizing such models comes with a few caveats.
LLMs take a long time to train, so any compression technique that drastically changes the model and requires full retraining, e.g., Neural Architecture Search (NAS), would be infeasible. Post-training compression techniques such as quantization therefore emerge as the lowest-hanging fruit when it comes to optimizing LLMs for the edge. There is very little drop in performance (as measured by the HellaSwag benchmark score) when quantizing a Llama model from BF16 to INT8 or even INT4, yet the quantized model is small enough to fit in the memory of the latest Raspberry Pi 5 and run on its CPU (although inference is quite slow). It also runs much faster on a GPU, but only if efficient implementations of the quantized operations exist for the particular hardware.
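As an illustration of how accessible post-training quantization has become, the sketch below loads an openly available Llama-family checkpoint in 4-bit precision using the Hugging Face transformers and bitsandbytes libraries. The checkpoint name and settings are assumptions for the example, and this is a generic open-source recipe, not the Embedl toolchain described above.

```python
# Minimal post-training quantization sketch using transformers + bitsandbytes.
# The checkpoint and settings are illustrative; Embedl's own tools are not shown here.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-hf"  # assumed checkpoint; requires access approval

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights in 4-bit NF4 format
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls still run in BF16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",                      # place layers on GPU/CPU as memory allows
)

prompt = "Edge devices can run language models when"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

The 4-bit weights cut the memory footprint by roughly a factor of four relative to BF16, which is what makes small boards and single consumer GPUs viable targets in the first place.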
It is still possible to compress a larger model, prune parts of it, and recover its performance with fine-tuning. We observed that a 13B-parameter model could be compressed and quantized to get a ~3x speedup in inference with a small drop in performance. Smaller versions of LLMs are therefore feasible, as shown in the sketch below. This is also reflected in the recent push by OpenAI, NVIDIA and Hugging Face towards so-called Small Language Models (SLMs) with their releases of GPT-4o mini, Mistral NeMo and SmolLM, respectively. However, we observed that simply compressed larger models such as Llama 13B were consistently outperformed by their smaller scaled variants, e.g., Llama 7B, when both were fine-tuned and quantized. Such comparisons were missing from several of the research articles that we explored on the topic of LLM compression.
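The general prune-then-fine-tune pattern can be sketched with PyTorch’s built-in pruning utilities. This is a toy illustration of unstructured magnitude pruning on linear layers, under assumed layer sizes and dummy data; it is not the specific compression pipeline used in the experiments described above.

```python
# Generic prune-then-finetune pattern using PyTorch's pruning utilities.
# Illustrative only: toy layers, dummy data, unstructured magnitude pruning.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Toy stand-in for a transformer block's linear layers.
model = nn.Sequential(nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512))

# 1) Prune 30% of the smallest-magnitude weights in every Linear layer.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")  # bake the pruning mask into the weights

# 2) Fine-tune briefly on task data to recover the lost accuracy (dummy data here).
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
x, target = torch.randn(8, 512), torch.randn(8, 512)
for _ in range(10):
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(x), target)
    loss.backward()
    optimizer.step()

print(f"final toy loss: {loss.item():.4f}")
```

In practice, the fine-tuning step is what makes the comparison in the paragraph above meaningful: both the pruned large model and the natively smaller model get the same recovery budget before being quantized and benchmarked.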
Nevertheless, our exploration of LLMs reveals some exciting opportunities for the future: in principle, compressing LLMs and adapting them for the edge is feasible and can be made simple with the right tools. This can significantly reduce the cost and computational requirements of running them on edge devices. In this regard, tools that make it simple to compress such models, modify them, and implement the latest ideas from the research community could be the key to proliferating the use of LLMs on edge devices and accelerating the development of SLMs that tackle specific problems.
Many enterprises are rushing to embrace LLMs, but most get a shock when they realize the costs involved, which can be mind-boggling given the energy and other resource demands. The LLM explosion may not reap benefits for society at large unless these costs can be constrained and reduced. Optimizing LLMs for edge devices holds the key to bringing their power to everyone and revolutionizing a wide range of applications.