Debates over AI benchmarks, and how AI labs report them, are spilling out into public view.
This week, an OpenAI employee accused Elon Musk's AI company, xAI, of publishing deceptive benchmark results for its latest AI model, Grok 3. One of xAI's co-founders, Igor Babuschkin, insisted that the company was in the right.
The truth is somewhere between the two.
In a post on xAI's blog, the company published a graph showing Grok 3's performance on AIME 2025, a collection of challenging math questions from a recent invitational mathematics exam. Some experts have questioned AIME's validity as an AI benchmark. Nevertheless, AIME 2025 and older versions of the test are commonly used to probe a model's math ability.
xAI's graph showed two variants of Grok 3, Grok 3 Reasoning Beta and Grok 3 mini Reasoning, beating OpenAI's best-performing available model, o3-mini-high, on AIME 2025. But OpenAI employees on X quickly pointed out that xAI's graph didn't include o3-mini-high's AIME 2025 score at "cons@64."
What's that, you might ask? Well, it's short for "consensus@64," and it essentially gives a model 64 tries to answer each problem in a benchmark and takes the answers it generates most frequently as the final answers. As you can imagine, cons@64 tends to boost models' benchmark scores quite a bit, and omitting it from a graph could make it look like one model surpasses another when in reality it doesn't.
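To make the distinction concrete, here is a minimal Python sketch of cons@64 versus single-attempt "@1" scoring. The `sample_fn` stub, the helper names, and the toy accuracy numbers are illustrative assumptions, not xAI's or OpenAI's actual evaluation harness.

```python
import random
from collections import Counter

def cons_vote(answers: list[str]) -> str:
    # Consensus voting: the answer generated most often across
    # all attempts becomes the model's final answer.
    return Counter(answers).most_common(1)[0][0]

def score(sample_fn, problems, attempts: int = 64):
    # sample_fn(problem) stands in for one model generation (hypothetical stub).
    # Returns two accuracies: "@1" counts only the first attempt per problem,
    # while cons@64 counts the majority vote over all 64 attempts.
    at_1_hits = cons_hits = 0
    for problem, correct in problems:
        answers = [sample_fn(problem) for _ in range(attempts)]
        at_1_hits += answers[0] == correct
        cons_hits += cons_vote(answers) == correct
    n = len(problems)
    return at_1_hits / n, cons_hits / n

# Toy model: produces the correct answer "42" about 40% of the time,
# otherwise guesses among a few wrong answers.
problems = [("some AIME-style question", "42")] * 100
def noisy_model(_problem):
    return "42" if random.random() < 0.4 else random.choice(["41", "43", "44"])

at_1, cons64 = score(noisy_model, problems)
```

Under these toy numbers, the same model scores roughly 40% at @1 but close to 100% at cons@64, since the correct answer wins the plurality vote almost every time. That gap is exactly what an omitted cons@64 bar can hide in a comparison chart.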
Grok 3 Reasoning Beta's and Grok 3 mini Reasoning's scores for AIME 2025 at "@1", meaning the first score the models got on the benchmark, fall below o3-mini-high's score. Grok 3 Reasoning Beta also trails, if only slightly, behind OpenAI's o1 model set to "medium" computing. Yet xAI is advertising Grok 3 as the "world's smartest AI."
Babuschkin argued on X that OpenAI has published similarly deceptive benchmark charts in the past, albeit charts comparing the performance of its own models. A more neutral party in the debate put together a more "accurate" graph showing nearly every model's performance at cons@64:
Hilarious how some people see my plot as an attack on OpenAI and others as an attack on Grok, when in reality it's DeepSeek propaganda

(I actually believe that Grok looks good there, and OpenAI's TTC chicanery behind o3-mini-*high*-pass@"1" deserves more scrutiny.) https://t.co/djqljpcjh8 pic.twitter.com/3wh8foufic

– Teortaxes ▶️ (DeepSeek 推特🐋铁粉 2023 – ∞) (@teortaxestex) February 20, 2025
But as AI researcher Nathan Lambert pointed out in a post, perhaps the most important metric remains a mystery: the computational (and monetary) cost it took each model to achieve its best score. That just goes to show how little most AI benchmarks communicate about models' limitations, and about their strengths.