Mixture-of-Experts (MoE) has emerged as one of the most promising approaches for scaling up large generative AI models. By selectively activating only a subset of parameters at each step, MoE offers a clever way to increase model capacity without a proportional rise in computational cost [1]. This is the architectural shift that sets the recently released – and remarkably powerful – Llama 4 apart from its predecessor, Llama 3 [2].
Despite these advances, a significant gap remains between state-of-the-art generative AI and the ability to deploy such models directly on constrained edge devices [1]. While the smallest version of Llama 4 can reportedly run on a single H100 GPU [2] (with aggressive quantization), that is still far from what is needed to run locally on edge devices, given the architectural challenges involved.
With models like Llama 4 now available publicly, the question naturally follows: how can such highly capable MoE models be used on edge devices?
On April 5, 2025, Meta introduced Llama 4, its most advanced suite of models in the Llama ecosystem – positioning it as a leading force in the field of generative AI. The release introduces the first open-weight natively multimodal models, Llama 4 Maverick and Llama 4 Scout, the latter of which stands out for its industry-leading context length of 10 million tokens. The launch also features a preview of Llama 4 Behemoth, which is expected to be Meta's most powerful model to date and one of the most intelligent LLMs in the world [2].
Llama 4 was pre-trained on a diverse dataset of over 30 trillion tokens, more than twice the volume used for Llama 3 [2]. This included data from 200 languages, with over 100 languages each exceeding 1 billion tokens, significantly improving the model's multilingual capabilities [2]. Perhaps most notably, Llama 4 is Meta's first open-weight release to use an MoE architecture. This marks a major architectural shift from Llama 3, enabling higher compute efficiency during training and inference and delivering higher output quality [2].
A typical deep learning model works like this: a single model receives an input and produces an output. This makes deployment at the edge relatively straightforward – as long as that single model can be compressed to run efficiently on the available edge device, it can be deployed.
MoE models work differently. They follow a divide-and-conquer principle, where specialized sub-models, known as experts, handle different types of inputs [1]. Each input is first sent to a gating mechanism, also called a "router", which acts as a coordinator. The router analyzes the input, divides it into subtasks, and activates only a relevant subset of experts, depending on the input. The selected experts process the subtasks, and their outputs are weighted and combined based on relevance [1].
This means that not all the knowledge of the model is used at all times. Unlike traditional LLMs, where every token activates all of the model's parameters, MoE models only activate the subset of parameters needed for a given token [1]. Llama 4 Maverick, for example, has 400 billion total parameters, but only 17 billion are active during inference [2]. This gives access to the intelligence of a much larger model at a fraction of the computational cost. Since only the relevant parameters are activated for each input, both computation cost and latency are significantly reduced [1]. As a consequence, MoE models can scale in size and capacity without inference cost increasing proportionally. Furthermore, overall precision and adaptability are improved, as experts are trained for specific tasks [1].
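To make the routing mechanism concrete, here is a minimal sketch of a top-k MoE layer in PyTorch. The layer sizes, the number of experts, and the simple linear router are illustrative assumptions on our part; real models such as Llama 4 use far larger and more sophisticated variants of the same idea.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Illustrative top-k MoE layer: a router scores experts per token and
    only the k best experts run; their outputs are combined by router weight."""

    def __init__(self, d_model=512, d_hidden=2048, num_experts=8, k=2):
        super().__init__()
        self.k = k
        # The "router": a linear layer that scores every expert for each token.
        self.router = nn.Linear(d_model, num_experts)
        # The experts: independent feed-forward sub-networks.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                       # x: (num_tokens, d_model)
        scores = self.router(x)                 # (num_tokens, num_experts)
        weights, idx = scores.topk(self.k, dim=-1)
        weights = F.softmax(weights, dim=-1)    # normalize over the k chosen experts
        out = torch.zeros_like(x)
        # Only the k selected experts run for each token; the rest stay idle.
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

tokens = torch.randn(4, 512)
print(TopKMoE()(tokens).shape)  # torch.Size([4, 512])
```

Even in this toy version the key property is visible: the total parameter count grows with the number of experts, while the compute per token only depends on k.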
MoE models are powerful and efficient when deployed in data centers or cloud environments, but porting this architecture to edge devices is significantly more challenging. One of the biggest challenges is that, even though not all parameters in an MoE model are active at the same time, the full set of expert parameters still needs to be stored and accessible during inference [4]. This places heavy demands on both storage and memory footprint, and most edge devices simply lack sufficient RAM and storage to handle it [4]. For instance, Mixtral-8x7B, developed by the French AI company Mistral AI, uses only 12.9 billion active parameters per token [3], but the full model occupies 87 GB of memory [4]. As a comparison, the smallest version of Llama 4 can run on a single H100 GPU with 80 GB of memory using int4 quantization [2]. NVIDIA's Jetson AGX Orin, a common edge platform, has 64 GB of memory – meaning even the smallest Llama 4 version would require further compression to fit [4].
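A quick back-of-the-envelope calculation illustrates the scale of the problem. Taking Meta's reported parameter counts at face value (roughly 109 billion total for Scout and 400 billion for Maverick), the weights alone occupy the following footprints at different bit-widths; the KV cache and runtime buffers come on top of that.

```python
# Weight-only memory footprint at different bit-widths. Parameter counts are
# approximations based on Meta's release notes; actual deployments also need
# memory for the KV cache, activations, and runtime overhead.
GIB = 1024 ** 3

def weight_footprint_gib(num_params: float, bits_per_param: int) -> float:
    """Memory needed to hold the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / GIB

for name, params in [("Llama 4 Scout", 109e9), ("Llama 4 Maverick", 400e9)]:
    for bits in (16, 8, 4):
        print(f"{name:>16} at {bits:>2}-bit: "
              f"{weight_footprint_gib(params, bits):7.1f} GiB")
```

At 4-bit, Scout's weights alone land around 50 GiB, which is why it fits on an 80 GB H100 but leaves little headroom on a 64 GB Jetson AGX Orin once the KV cache and activations are accounted for.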
The limited memory of edge devices creates another major challenge: loading and running MoE architectures becomes both more complex and slower [4]. Since there isn't enough memory to keep all experts loaded simultaneously, the system must dynamically load and unload experts during inference [4]. When the expert parameters are not resident in memory, they must be fetched on demand, which adds data-transfer time even when the computation itself is fast. Given the limited memory bandwidth of edge hardware, this loading step introduces significant latency into the inference pipeline [4], making the interaction slow and clunky. In the case of Mixtral-8x7B, expert loading dominates total inference time: approximately 85.5% on an RTX 4090 and as much as 94.5% on a Jetson Orin, while the actual computation accounts for only a small fraction [4].
To mitigate this latency without sacrificing accuracy, one could try prefetching likely experts in advance – but this strategy is unreliable: it is difficult to accurately predict which experts will be needed, and the penalty for loading the wrong expert is higher than not prefetching at all [4]. As a result, latency remains a major bottleneck on edge devices [4].
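The sketch below shows the kind of on-demand loading such a system ends up doing: a small LRU cache keeps a few experts resident in memory and fetches the rest from storage when the router asks for them. The file layout, cache size, and class are hypothetical, not taken from any specific runtime; the point is simply to show where the transfer latency enters the pipeline.

```python
# Hypothetical on-demand expert loader with an LRU cache. On edge hardware,
# the cache-miss path (loading weights from storage) is what dominates latency.
from collections import OrderedDict
import torch

class ExpertCache:
    def __init__(self, expert_paths, max_resident=2, device="cuda"):
        self.expert_paths = expert_paths      # expert id -> weight file on disk
        self.max_resident = max_resident      # how many experts fit in memory at once
        self.device = device
        self.resident = OrderedDict()         # expert id -> loaded weights (LRU order)

    def get(self, expert_id):
        if expert_id in self.resident:
            # Cache hit: no data transfer needed, just refresh the LRU order.
            self.resident.move_to_end(expert_id)
            return self.resident[expert_id]
        # Cache miss: fetch the expert's weights from storage. This transfer,
        # not the matrix math, is the expensive step on edge devices.
        weights = torch.load(self.expert_paths[expert_id], map_location=self.device)
        if len(self.resident) >= self.max_resident:
            self.resident.popitem(last=False)  # evict the least recently used expert
        self.resident[expert_id] = weights
        return weights
```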
Quantization – the method of representing numerical values with fewer bits – is an effective way to reduce the size of model parameters and thereby lower memory usage. Applying quantization uniformly across a model risks degrading accuracy, but MoE models offer a better opportunity for selective quantization [4]. Since not all experts in an MoE model contribute equally, the less important ones can be quantized more aggressively with minimal impact on accuracy. In one study, quantizing fewer than 20% of the experts led to less than a 1% drop in accuracy [4]. Selectively quantizing experts with lower importance in offloading scenarios can therefore significantly reduce loading costs and memory usage – without compromising overall model quality [4]. At Embedl, we leverage such mixed-precision quantization for performance improvements across different edge hardware and toolchains.
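As a simple illustration, the sketch below ranks experts by how often the router selects them and fake-quantizes only the least-used fraction. The importance metric (routing frequency) and the 4-bit uniform fake-quantization are simplifications of what production methods, including our own, actually use.

```python
# Selective expert quantization sketch: quantize only the least-used experts.
import torch

def fake_quantize(w: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Uniform symmetric fake-quantization of a weight tensor."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max() / qmax
    return (w / scale).round().clamp(-qmax, qmax) * scale

def quantize_least_used_experts(expert_weights, routing_counts, fraction=0.2, bits=4):
    """expert_weights: list of weight tensors, one per expert.
    routing_counts: how often each expert was selected on calibration data."""
    order = sorted(range(len(expert_weights)), key=lambda e: routing_counts[e])
    to_quantize = set(order[: int(len(expert_weights) * fraction)])
    return [fake_quantize(w, bits) if e in to_quantize else w
            for e, w in enumerate(expert_weights)]

experts = [torch.randn(256, 256) for _ in range(8)]
counts = [120, 5, 340, 80, 15, 500, 60, 210]   # hypothetical routing statistics
compressed = quantize_least_used_experts(experts, counts)
```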
Another solution is pruning, where less important weights are removed to reduce the model's size. There is a pruning method designed specifically for MoE architectures. Each expert in an MoE model consists of layers of neural network weights, and every weight is evaluated on three factors: how much it contributes to the output (weight magnitude), how strongly it is activated for a given input (input activation strength), and how often that particular expert is selected by the model (router weights) [5]. Each weight is assigned a score based on these criteria, which determines whether it can be safely removed. This MoE pruning process is one-shot and requires no retraining [5]. At Embedl, we explore such structured pruning methods to optimize performance for constrained edge environments while preserving model accuracy.
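The sketch below captures the spirit of that scoring: each weight's magnitude is combined with a per-channel input activation norm and the expert's router weight, and the lowest-scoring weights are masked out in one shot. The exact formulation in the cited work [5] differs in detail; this is an illustrative approximation.

```python
# One-shot MoE pruning sketch: score = |weight| * input activation norm * router weight.
import torch

def pruning_mask(weight, input_norm, router_weight, sparsity=0.5):
    """weight: (out_features, in_features) tensor of one expert layer.
    input_norm: per-input-channel activation norm, shape (in_features,),
    collected from calibration data. router_weight: scalar indicating how
    strongly/often the router selects this expert."""
    score = weight.abs() * input_norm.unsqueeze(0) * router_weight
    threshold = torch.quantile(score.flatten(), sparsity)
    return score >= threshold          # True = keep, False = prune

w = torch.randn(512, 512)
act_norm = torch.rand(512)             # stand-in for measured activation statistics
mask = pruning_mask(w, act_norm, router_weight=0.12, sparsity=0.5)
pruned_w = w * mask                    # one-shot pruning, no retraining
```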
These optimization methods can be complemented by a third technique: expert-wise knowledge distillation. To recover the minor performance loss introduced by pruning or quantization, the compressed model is trained on the outputs of a large, powerful pretrained MoE model – typically one that runs in the cloud rather than at the edge [5]. The larger model acts as a teacher that guides the compressed student, which learns from the teacher's "expertise" while remaining light enough to run on edge devices [5]. One study showed that pruning combined with this approach enabled up to 50% sparsity while preserving up to 99% of the original model's accuracy [5]. Llama 4 Behemoth, for example, serves as a teacher model for Llama 4 Maverick and Scout, making it possible for those smaller models to operate with far fewer parameters [2]. Meta has also used a combination of pruning and knowledge distillation to release mobile-friendly Llama 3.2 models [6]. At Embedl, we have leveraged these techniques to further compress Llama 3.2 3B Instruct models for faster deployment on Qualcomm devices.
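A minimal version of such a distillation objective looks like the sketch below: the compressed student is trained to match the teacher's softened output distribution while still seeing the ground-truth labels. The temperature and loss weighting are illustrative choices, not the recipe Meta used for Llama 4 or Llama 3.2.

```python
# Classic knowledge-distillation loss: soft targets from the teacher (KL term)
# plus hard targets from the labels (cross-entropy term).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: match the teacher's softened output distribution.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: still learn from the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

student = torch.randn(8, 32000, requires_grad=True)  # student logits over the vocab
teacher = torch.randn(8, 32000)                       # frozen teacher logits
labels = torch.randint(0, 32000, (8,))
loss = distillation_loss(student, teacher, labels)
loss.backward()
```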
To conclude, model compression techniques such as quantization and pruning, combined with knowledge distillation, can shrink large models such as Llama 4 while maintaining accuracy. If we can compress the smallest Llama 4 model by another 20%, it could potentially fit on a Jetson AGX Orin edge device. We have successfully compressed and deployed Llama 3.2 3B Instruct models in the past, reaching compression rates well beyond 20%. We are therefore excited about the possibilities of compressing Llama 4 for the edge with the Embedl Model Optimization SDK.
References
[1] https://ieeexplore.ieee.org/document/10707053
[2] https://ai.meta.com/blog/llama-4-multimodal-intelligence/
[3] https://mistral.ai/news/mixtral-of-experts
[4] https://arxiv.org/pdf/2411.01433
[5] https://arxiv.org/html/2410.12013v1
[6] https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/