Since the release of ChatGPT in late 2022, OpenAI quickly established itself as the front-runner in the race to build the most capable large language model (LLM). Early versions of Google's Gemini lagged behind in both reasoning and performance, but with the release of Gemini 2.5 Pro on March 25, 2025, the tables have turned.
After years of research into reinforcement learning, chain-of-thought prompting, and refined post-training, Google launched Gemini 2.5 Pro. The model is designed to reason before responding and delivers stronger performance and higher accuracy on complex tasks [1]. It builds on previous Gemini strengths, such as native multimodality and an extended context window supporting up to 1 million tokens – equivalent to roughly 30,000 lines of code or a 700,000-word book [2]. Gemini 2.5 Pro also marks the first time an LLM is genuinely usable for long-context writing [3]. With this release, Google has overtaken OpenAI on several major industry AI benchmarks, especially in coding, mathematics, and science. The figures below compare Gemini 2.5 Pro with OpenAI's comparable multimodal models, ChatGPT-4o and o1. OpenAI advertises ChatGPT-4o as a high-intelligence model for complex tasks [4] and o1 as a model that "thinks before it answers," optimized for programming and science [5].
LLM Arena Overall Score
Figure 1: A benchmark that rates LLMs through blind user tests to assess their cumulative performance across various tasks, including instruction following, mathematics, creative writing, coding, and language comprehension, based on human judgment. Source: https://lmarena.ai/?leaderboard
Context Length
Figure 2: Comparison of the number of tokens accepted in the context window by each model. Source: https://aistudio.google.com/prompts/new_chat, https://openai.com/api/pricing/
Livebench.AI Index Categories
Figure 3: A benchmark suite designed to evaluate AI models while limiting test-set contamination, using questions regularly refreshed from recent information sources to keep assessments objective. Source: https://livebench.ai/#/
Humanity's Last Exam
Figure 4: A multimodal benchmark designed to evaluate AI models at the frontier of human knowledge. Accuracy measures the proportion of correct responses; higher numbers are desired. Calibration error measures the alignment between the model's predicted confidence and the actual correctness of its responses; lower numbers are desired. Source: https://agi.safe.ai/
As reflected in the benchmark results, Google has not only caught up in the race against OpenAI – it is arguably leading by a considerable margin. One likely explanation is that training better LLMs requires larger quantities of compute and access to the highest-quality data. In this regard, no other tech company comes close to Google, thanks to the comprehensive web data it has collected through Google Search. Google has also developed its own custom hardware, Tensor Processing Units (TPUs) [6], optimized specifically for both training and inference of large-scale AI models, whereas OpenAI relies heavily on NVIDIA's increasingly scarce supply of Graphics Processing Units (GPUs) [7].
While benchmark scores reflect model performance and capabilities, another often overlooked dimension in the race between top LLMs is what these models cost in terms of energy and carbon emissions. One of the key factors determining how energy-intensive an LLM is lies in the hardware [8]. More powerful and efficient chips can perform the same number of floating-point operations (FLOPs) using significantly less energy [9]. In this regard, Google also has a clear advantage: its TPU v4 consumes approximately 3.5 times less energy than NVIDIA's H100 GPU [10, 11], which OpenAI uses to develop and deploy its advanced generative AI models [12].
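To make the hardware argument concrete, the back-of-envelope sketch below expresses a chip's efficiency as energy per FLOP (average power divided by sustained FLOP/s) and multiplies it by a workload's total FLOP count. The power, throughput, and workload figures are illustrative placeholders chosen so the ratio happens to match the 3.5x figure cited above; they are not measured specifications for TPU v4 or the H100.

```python
# Back-of-envelope: energy cost of a fixed workload on two hypothetical accelerators.
# All numbers below are illustrative assumptions, not vendor specifications.

WORKLOAD_FLOP = 1e15  # total floating-point operations for one batch of requests (assumed)

chips = {
    # name: (average power draw in watts, sustained throughput in FLOP/s) -- assumed values
    "chip_A": (400.0, 2.0e14),
    "chip_B": (700.0, 1.0e14),
}

for name, (power_w, flops_per_s) in chips.items():
    joules_per_flop = power_w / flops_per_s      # energy per floating-point operation
    energy_j = WORKLOAD_FLOP * joules_per_flop   # energy for the whole workload
    energy_wh = energy_j / 3600.0                # convert joules to watt-hours
    print(f"{name}: {joules_per_flop:.2e} J/FLOP, {energy_wh:.1f} Wh for the workload")
```

With these placeholder numbers, the more efficient chip uses roughly 3.5 times less energy for the same workload; the point is simply that a workload's energy scales linearly with the chip's joules per FLOP, whatever the absolute figures turn out to be.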
Even as chips become more energy-efficient, the total energy consumption of LLMs continues to rise exponentially – mainly because the inference phase runs constantly and at massive scale [9]. When ChatGPT was released in 2022, it reportedly consumed more energy in just one month of use than during its entire training phase [8]. As models grow smarter and more capable, user demand and inference load increase accordingly. Furthermore, the smarter these cutting-edge models become – with advanced reasoning and multimodality – the more FLOPs they require, and those FLOPs are executed every single time the model is used [9]. Even with efficient hardware, GPUs and TPUs quickly reach full utilization, so per-chip efficiency gains plateau while the total energy load keeps accelerating with large-scale inference [8].
An additional concern is that actual hardware utilization is often far from ideal. In one study, the average hardware utilization of the tested models fell below 50% [8]. At Embedl, we have observed even lower utilization in practical deployments – adding yet another source of energy inefficiency to the downside of the LLM race.
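A similar back-of-envelope calculation illustrates both points above: why inference at scale dominates, and why low utilization compounds the cost. The sketch estimates daily inference energy from an assumed query volume, FLOPs per query, and chip efficiency, then shows how the effective energy per query grows when utilization drops, assuming the fleet keeps drawing roughly full power while delivering fewer useful FLOPs. Every figure is a hypothetical assumption chosen purely for illustration.

```python
# Back-of-envelope: how inference volume and hardware utilization drive energy use.
# All numbers are hypothetical assumptions, not measurements of any real deployment.

QUERIES_PER_DAY = 1e8     # assumed daily request volume at large scale
FLOP_PER_QUERY = 2e14     # assumed compute per response (grows with reasoning/multimodality)
JOULES_PER_FLOP = 2e-12   # assumed chip efficiency at ideal utilization

ideal_daily_energy_j = QUERIES_PER_DAY * FLOP_PER_QUERY * JOULES_PER_FLOP
print(f"Ideal daily inference energy: {ideal_daily_energy_j / 3.6e9:.1f} MWh")

# Below 100% utilization the hardware still draws (roughly) full power while
# delivering fewer useful FLOPs, so the effective energy per query rises.
for utilization in (1.0, 0.5, 0.3):
    effective_j_per_query = FLOP_PER_QUERY * JOULES_PER_FLOP / utilization
    print(f"Utilization {utilization:.0%}: {effective_j_per_query:.0f} J per query")
```

Under these assumptions, a single month of inference already adds up to hundreds of MWh, and halving utilization roughly doubles the effective energy per query; the absolute numbers are made up, but the scaling behaviour is the point.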
As the energy consumption of large language models continues to grow, the demand for smarter models shows no sign of slowing. This presents a paradox: the pursuit of better accuracy and reasoning overshadows the significant energy costs associated with LLMs. There are currently no rules that tech giants are required to follow in the AI race: no binding standards or regulations exist to measure, cap, or compare the energy consumption of large AI models fairly across organizations. The major benchmarks focus primarily on output quality, leaving energy usage in the shadows.
Yet research shows that performance gains in AI don't have to come at the cost of higher energy usage. When development efforts focus on algorithmic optimization rather than scaling compute, LLMs can maintain high inference performance with stable or even reduced energy usage [9]. In fact, models developed just behind the cutting edge have been shown to achieve nearly the same accuracy while requiring significantly fewer FLOPs [9]. This suggests that energy-efficient AI is fully achievable, but as long as the focus remains solely on scaling capability, these efforts are left on the shelf.
Here is where Embedl steps in. Our innovations target the most energy-critical phase of AI: inference. Embedl enables efficient inference on hardware with limited computational resources by shrinking model size and improving hardware utilization, which boosts computational efficiency. In this way, Embedl helps dramatically reduce the energy cost of deploying state-of-the-art AI models at scale – without sacrificing performance.
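As one illustration of what "shrinking model size" can mean in practice, the sketch below applies post-training dynamic quantization in PyTorch to a toy stand-in model, storing linear-layer weights in int8 instead of fp32. This is a minimal example of a standard, generic compression technique, not Embedl's SDK or its actual optimization pipeline, and the layer sizes are arbitrary.

```python
import os
import torch
import torch.nn as nn

# Toy stand-in for an LLM's linear layers (arbitrary sizes, illustration only).
model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 4096))

# Post-training dynamic quantization: weights are stored in int8 and
# activations are quantized on the fly at inference time.
quantized = torch.ao.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

def size_mb(m: nn.Module) -> float:
    """Approximate serialized size of a model's weights in megabytes."""
    torch.save(m.state_dict(), "_tmp.pt")
    size = os.path.getsize("_tmp.pt") / 1e6
    os.remove("_tmp.pt")
    return size

print(f"fp32 model: {size_mb(model):.0f} MB")      # two 4096x4096 weight matrices in fp32
print(f"int8 model: {size_mb(quantized):.0f} MB")  # weights take roughly 4x less memory
```

Smaller weights can translate into less memory traffic and lower energy per generated token; combined with better hardware utilization, this is the kind of lever that reduces the energy cost of inference without retraining a model from scratch.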
Interested in Reading More?
As a practical example of what efficient inference can look like, our recent comparison of Llama 3.2 and Gemma 3 shows how Embedl’s SDK optimization accelerated on-device performance. For further insights, see our full benchmark analysis here.
References
[1] https://blog.google/technology/google-deepmind/gemini-model-thinking-updates-march-2025/#gemini-2-5-thinking
[2] https://blog.google/technology/ai/google-gemini-next-generation-model-february-2024/#sundar-note
[3] https://fiction.live/stories/Fiction-liveBench-Mar-25-2025/oQdzQvKHw8JyXbN87
[4] https://openai.com/api/pricing/
[5] https://openai.com/index/learning-to-reason-with-llms/
[6] https://cloud.google.com/tpu
[7] https://techcrunch.com/2025/02/27/openai-ceo-sam-altman-says-the-company-is-out-of-gpus/
[8] https://ieeexplore.ieee.org/document/10549890
[9] https://doi.org/10.1016/j.suscom.2023.100857
[10] https://cloud.google.com/blog/topics/systems/tpu-v4-enables-performance-energy-and-co2e-efficiency-gains
[11] https://www.nvidia.com/en-us/data-center/h100/
[12] https://nvidianews.nvidia.com/news/nvidia-hopper-gpus-expand-reach-as-demand-for-ai-grows