The battle of the small language models (SLMs) just heated up with the release of the Gemma 3 models by the Gemma team at Google DeepMind on March 12. Half a year ago, Meta released the Llama 3.2 family, whose 1B and 3B models were state of the art at those scales. The newly released Gemma 3 models span the 1B to 27B range and are trained specifically for improved mathematics, reasoning, and chat abilities. They are also designed to run on standard consumer-grade hardware such as phones, laptops, and low-end GPUs.
Embedl engineers rolled up their sleeves to see how the two families match up.
Architecture
First, let’s look at the architecture. Note that the parameter counts are reported differently: the Llama 3.2 figure counts only the model body, while the Gemma 3 figure includes the whole head. Including the head, Llama 3.2 is a 1.26B model.
Gemma 3 favors a thin and deep model, while Llama 3.2 favors a wide and shallow one. This has an impact on inference acceleration: as the on-device benchmarks below show, Llama 3.2, despite having more parameters, is more parallelizable.
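To see this difference for yourself, here is a minimal sketch that reads the two model configurations and counts parameters with and without the embedding/head. The Hugging Face repo ids are assumptions, and the "body only" count assumes both models tie the input embedding to the LM head.

```python
# Sketch: compare depth vs. width and count parameters with/without the head.
import torch
from transformers import AutoConfig, AutoModelForCausalLM

MODELS = {
    "Llama 3.2 1B": "meta-llama/Llama-3.2-1B-Instruct",  # assumed repo ids
    "Gemma 3 1B": "google/gemma-3-1b-it",
}

for name, repo in MODELS.items():
    cfg = AutoConfig.from_pretrained(repo)
    print(f"{name}: {cfg.num_hidden_layers} layers, hidden size {cfg.hidden_size}")

    model = AutoModelForCausalLM.from_pretrained(repo, torch_dtype=torch.bfloat16)
    total = sum(p.numel() for p in model.parameters())
    # Assuming the input embedding is tied to the LM head, subtracting it once
    # approximates a "body only" count.
    embed = model.get_input_embeddings().weight.numel()
    print(f"  {total / 1e9:.2f}B incl. head, {(total - embed) / 1e9:.2f}B body only")
```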
Post-training innovations
Pre-trained Gemma models are turned into instruction-tuned models with an improved post-training approach that uses an improved version of knowledge distillation from a large instruction-tuned (IT) teacher, along with an RL fine-tuning phase based on improved versions of best-of-n and averaged distillation methods. They also use a variety of reward functions for RL to improve helpfulness, math, coding, reasoning, instruction following, and multilingual abilities. This includes learning from weight-averaged reward models trained with human feedback data, code execution feedback, and ground-truth rewards for solving math problems (with some techniques borrowed from DeepSeek). Data filtering is done to minimize toxic behaviour and hallucinations.
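The details of these methods are beyond the scope of this post, but as a conceptual illustration, here is a minimal sketch of plain best-of-n sampling against a stand-in reward function. The repo id and `reward_fn` below are placeholders, and this is the vanilla idea, not the improved distillation-based variants described in the Gemma 3 technical report.

```python
# Generic best-of-n sketch: draw n candidate completions and keep the one a
# reward function scores highest. reward_fn is a toy placeholder, not an
# actual learned reward model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "google/gemma-3-1b-it"  # assumed Hugging Face repo id
tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo, torch_dtype=torch.bfloat16)

def reward_fn(prompt: str, completion: str) -> float:
    # Placeholder: in practice this is a learned reward model or a
    # ground-truth check (e.g. verifying a math answer).
    return -len(completion)  # toy rule: prefer shorter answers

def best_of_n(prompt: str, n: int = 4, max_new_tokens: int = 128) -> str:
    inputs = tok(prompt, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        do_sample=True,
        temperature=0.8,
        num_return_sequences=n,
        max_new_tokens=max_new_tokens,
    )
    prompt_len = inputs["input_ids"].shape[1]
    completions = [
        tok.decode(seq[prompt_len:], skip_special_tokens=True) for seq in outputs
    ]
    return max(completions, key=lambda c: reward_fn(prompt, c))

print(best_of_n("What is 17 * 23?"))
```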
Performance Comparison: Llama 3.2 1B vs. Gemma 3 1B
We compared the two models on standard evaluation benchmarks for output quality and on-device performance.
Official sources
Despite several benchmarks being released for Gemma 3, there is little overlap between the benchmarks reported in the official sources. We were only able to identify two benchmarks that evaluate the two models under equal settings; see Table 1. No comparable official numbers for generation speed exist; see the section on Embedl-generated benchmarks for on-device measurements.
| Model | Size | MMLU (5-shot) | GSM8K (8-shot, CoT) |
|---|---|---|---|
| Llama 3.2 1B | 1.26B | 49.3% | 44.4% |
| Gemma 3 1B | 1B | 38.8% | 62.8% |
Table 1: Comparison of Llama 3.2 1B and Gemma 3 1B in model size and performance: knowledge and reasoning (MMLU) and mathematical problem solving (GSM8K). CoT stands for Chain-of-Thought prompting. Sources: the model cards on huggingface.co (Table: Instruction Tuned Models) and the technical report on storage.googleapis.com (Table 18).
Table 1 suggests that:
- Llama 3.2 1B has an edge in broad knowledge/reasoning tasks, while
- Gemma 3 1B exhibits stronger math reasoning capability, outperforming Llama 3.2 1B on GSM8K. This highlights Gemma’s strength in math, likely due to its targeted post-training for reasoning.
Embedl Generated Benchmarks
Since there are few published sources for direct comparisons, we compared the two models fairly under the following conditions:
- Batch size is limited to one, as is the case in many edge applications
- The inference environment is PyTorch and the Hugging Face transformers library, with efficient CUDA kernels, KV cache, etc.
- Throughput is measured as the number of tokens decoded per second (tokens/s); a minimal measurement sketch is shown below
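As a rough illustration of this setup, here is a minimal sketch of a tokens-per-second measurement with Hugging Face transformers at batch size one. The repo id and prompt are placeholders, and a real harness would average over multiple prompts and runs.

```python
# Minimal tokens/s measurement: batch size 1, Hugging Face transformers, KV cache.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "meta-llama/Llama-3.2-1B-Instruct"  # assumed repo id; swap for google/gemma-3-1b-it
device = "cuda" if torch.cuda.is_available() else "cpu"

tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo, torch_dtype=torch.bfloat16).to(device)
model.eval()

prompt = "Explain the difference between depth and width in a transformer."
inputs = tok(prompt, return_tensors="pt").to(device)  # batch size 1

max_new_tokens = 256
with torch.inference_mode():
    # Warm-up run so kernel initialization does not skew the timing.
    model.generate(**inputs, max_new_tokens=16, use_cache=True)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    out = model.generate(**inputs, max_new_tokens=max_new_tokens, use_cache=True)
    if device == "cuda":
        torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

decoded_tokens = out.shape[1] - inputs["input_ids"].shape[1]
print(f"{decoded_tokens / elapsed:.1f} tokens/s")
```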
When deploying a model to edge devices, it is critical to leverage quantization. We note that both models perform similarly in float32 precision, whereas this first release of Gemma 3 does not yet fully leverage the bfloat16 support available in the transformers library. This is likely to change as support is expanded. For reference, we also include a version of Llama 3.2 optimized with the Embedl SDK, which applies 4-bit quantization along with additional optimization techniques proprietary to the SDK, pushing performance well beyond the other models.
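The Embedl SDK optimizations are proprietary, but as a rough illustration of why low-bit quantization matters, here is a hedged sketch of loading a model in 4-bit via the bitsandbytes integration in transformers. The repo id is an assumption, and this generic pipeline is not the Embedl SDK.

```python
# Sketch: load a model with 4-bit weight quantization via bitsandbytes.
# Generic illustration only, not the Embedl SDK's proprietary 4-bit pipeline.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

repo = "meta-llama/Llama-3.2-1B-Instruct"  # assumed repo id

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4 weight format
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute/activations in bf16
)

tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(
    repo,
    quantization_config=quant_config,
    device_map="auto",
)

inputs = tok("What is 7 * 8?", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=32)
print(tok.decode(out[0], skip_special_tokens=True))
```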
As mentioned, these numbers are based on an environment using Hugging Face transformers. For a more comprehensive overview of the on-device LLM landscape, we recommend going to the Embedl Hub (https://hub.embedl.com/), where we offer quality benchmarks for a wide variety of edge devices and inference toolchains.
Conclusion
In summary, Llama 3.2 1B shows stronger performance on broad knowledge benchmarks, but lags behind on math and coding tasks. Meanwhile, Gemma 3 1B demonstrates strengths in math and coding, outperforming Llama 3.2 1B on GSM8K and HumanEval. These results reflect each model’s training focus: Gemma 3 1B’s post-training enhancements give it an edge in reasoning and code generation, whereas Llama 3.2 1B retains higher general-knowledge accuracy at the cost of those specialized skills. In terms of on-device performance, both models perform similarly in float32 precision, while this first release of Gemma 3 does not yet fully leverage bfloat16. The version of Llama 3.2 optimized with the Embedl SDK is the fastest model, and we are looking forward to exploring Gemma 3 in more detail in the coming weeks!