Debates over AI benchmarks, and how AI labs report them, are spilling out into public view.
This week, an OpenAI employee accused Elon Musk's AI company, xAI, of publishing deceptive benchmark results for its latest AI model, Grok 3. One of xAI's co-founders, Igor Babuschkin, insisted that the company was in the right.
The truth is somewhere between the two.
In a post on xAI's blog, the company published a graph showing Grok 3's performance on AIME 2025, a collection of challenging math questions from a recent invitational mathematics exam. Some experts have questioned AIME's validity as an AI benchmark. Nevertheless, AIME 2025 and older versions of the test are commonly used to probe a model's math ability.
xAI's graph showed two variants of Grok 3, Grok 3 Reasoning Beta and Grok 3 mini Reasoning, beating OpenAI's best-performing available model, o3-mini-high, on AIME 2025. But OpenAI employees on X quickly pointed out that xAI's graph didn't include o3-mini-high's AIME 2025 score at "cons@64."
What's that, you might ask? Well, it's short for "consensus@64," and it essentially gives a model 64 tries to answer each problem in a benchmark and takes the answers it generates most frequently as the final answers. As you can imagine, cons@64 tends to boost models' benchmark scores quite a bit, and omitting it from a graph could make it look like one model surpasses another when in reality it doesn't.
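To make the distinction concrete, here is a minimal Python sketch of cons@64 versus single-attempt "@1" scoring. The `sample_fn` stub, the helper names, and the toy accuracy numbers are illustrative assumptions, not xAI's or OpenAI's actual evaluation harness.

```python
import random
from collections import Counter

def cons_vote(answers: list[str]) -> str:
    # Consensus voting: the answer generated most often across
    # all attempts becomes the model's final answer.
    return Counter(answers).most_common(1)[0][0]

def score(sample_fn, problems, attempts: int = 64):
    # sample_fn(problem) stands in for one model generation (hypothetical stub).
    # Returns two accuracies: "@1" counts only the first attempt per problem,
    # while cons@64 counts the majority vote over all 64 attempts.
    at_1_hits = cons_hits = 0
    for problem, correct in problems:
        answers = [sample_fn(problem) for _ in range(attempts)]
        at_1_hits += answers[0] == correct
        cons_hits += cons_vote(answers) == correct
    n = len(problems)
    return at_1_hits / n, cons_hits / n

# Toy model: produces the correct answer "42" about 40% of the time,
# otherwise guesses among a few wrong answers.
problems = [("some AIME-style question", "42")] * 100
def noisy_model(_problem):
    return "42" if random.random() < 0.4 else random.choice(["41", "43", "44"])

at_1, cons64 = score(noisy_model, problems)
```

Under these toy numbers, the same model scores roughly 40% at @1 but close to 100% at cons@64, since the correct answer wins the plurality vote almost every time. That gap is exactly what an omitted cons@64 bar can hide in a comparison chart.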
Grok 3 Reasoning Beta's and Grok 3 mini Reasoning's scores for AIME 2025 at "@1", meaning the first score the models got on the benchmark, fall below o3-mini-high's score. Grok 3 Reasoning Beta also trails, if only slightly, behind OpenAI's o1 model set to "medium" computing. Yet xAI is advertising Grok 3 as the "world's smartest AI."
Babuschkin argued on X that OpenAI has published similarly deceptive benchmark charts in the past, albeit charts comparing the performance of its own models. A more neutral party in the debate put together a more "accurate" graph showing nearly every model's performance at cons@64:
Hilarious how some people see my plot as an attack on OpenAI and others as an attack on Grok, when in reality it's DeepSeek propaganda

(I actually believe that Grok looks good there, and OpenAI's TTC chicanery behind o3-mini-*high*-pass@"1" deserves more scrutiny.) https://t.co/djqljpcjh8 pic.twitter.com/3wh8foufic

– Teortaxes ▶️ (DeepSeek 推特🐋铁粉 2023 – ∞) (@teortaxestex) February 20, 2025
But as AI researcher Nathan Lambert pointed out in a post, perhaps the most important metric remains a mystery: the computational (and monetary) cost it took each model to achieve its best score. That just goes to show how little most AI benchmarks communicate about models' limitations, and about their strengths.