Did xAI lie about Grok 3’s benchmarks?

MT HANNACH
23 Min Read

Debates over AI benchmarks, and how AI labs report them, are spilling out into public view.

This week, an OpenAI employee accused Elon Musk's AI company, xAI, of publishing misleading benchmark results for its latest AI model, Grok 3. One of xAI's co-founders, Igor Babuschkin, insisted that the company was in the right.

The truth lies somewhere in between.

In a post on xAI's blog, the company published a graph showing Grok 3's performance on AIME 2025, a collection of challenging math questions from a recent invitational mathematics exam. Some experts have questioned AIME's validity as an AI benchmark. Nevertheless, AIME 2025 and older versions of the test are commonly used to probe a model's math ability.

xAI's graph showed two variants of Grok 3, Grok 3 Reasoning Beta and Grok 3 mini Reasoning, beating OpenAI's best-performing available model, o3-mini-high, on AIME 2025. But OpenAI employees on X were quick to point out that xAI's graph did not include o3-mini-high's AIME 2025 score at "cons@64."

What is cons@64, you might ask? It is short for "consensus@64": the model gets 64 tries to answer each problem in a benchmark, and the answer it generates most frequently is taken as its final answer. As you can imagine, cons@64 tends to boost models' benchmark scores quite a bit, and omitting it from a graph can make it appear that one model surpasses another when in reality that is not the case.
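To make the difference between the two scoring modes concrete, here is a minimal sketch in Python. It assumes each problem has a single exact-match answer (as AIME's integer answers do) and that every attempt has already been sampled from the model. The function names and the toy data are hypothetical illustrations, not code or results from xAI or OpenAI.

```python
from collections import Counter


def consensus_answer(samples):
    """Return the answer that appears most often among the sampled attempts."""
    return Counter(samples).most_common(1)[0][0]


def score_at_1(attempts_per_problem, reference):
    """Fraction of problems where the first attempt matches the reference."""
    correct = sum(
        attempts[0] == ref
        for attempts, ref in zip(attempts_per_problem, reference)
    )
    return correct / len(reference)


def score_cons_at_k(attempts_per_problem, reference, k=64):
    """Fraction of problems where the majority answer over the first k
    attempts matches the reference."""
    correct = sum(
        consensus_answer(attempts[:k]) == ref
        for attempts, ref in zip(attempts_per_problem, reference)
    )
    return correct / len(reference)


# Toy data (invented for illustration): 3 problems, 4 attempts each.
attempts = [
    [42, 17, 42, 42],  # first attempt right, majority right
    [10, 7, 7, 7],     # first attempt wrong, majority right
    [3, 3, 5, 9],      # first attempt wrong, majority wrong
]
reference = [42, 7, 5]

print("@1:    ", score_at_1(attempts, reference))
print("cons@4:", score_cons_at_k(attempts, reference, k=4))
```

On this toy data the consensus score is higher than the first-attempt score, because the second problem is only solved once the majority vote is taken. That gap is why a chart mixing "@1" bars for one model with "cons@64" bars for another is not a like-for-like comparison.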

Grok 3 Reasoning Beta's and Grok 3 mini Reasoning's scores for AIME 2025 at "@1" (meaning the first score the models got on the benchmark) fall below o3-mini-high's score. Grok 3 Reasoning Beta also trails ever so slightly behind OpenAI's o1 model set to "medium" computing. Yet xAI is advertising Grok 3 as the "world's smartest AI."

Babuschkin argued on X that OpenAI has published similarly misleading benchmark charts in the past, albeit charts comparing the performance of its own models. A more neutral party in the debate put together a more "accurate" graph showing nearly every model's performance at cons@64.

But as AI researcher Nathan Lambert pointed out in a post, perhaps the most important metric remains a mystery: the computational (and monetary) cost it took for each model to achieve its best score. That just goes to show how little most AI benchmarks communicate about models' limitations, and their strengths.

