OpenAI’s latest model, o3, achieved a breakthrough that surprised the AI research community: an unprecedented score of 75.7% on the ultra-difficult ARC-AGI benchmark under standard compute conditions, with a high-compute version reaching 87.5%.
Impressive as the ARC-AGI result is, it does not yet prove that the code to artificial general intelligence (AGI) has been cracked.
The Abstraction and Reasoning Corpus
The ARC-AGI benchmark is based on the Abstraction and Reasoning Corpus (ARC), which tests an AI system’s ability to adapt to novel tasks and demonstrate fluid intelligence. ARC is composed of visual puzzles that require understanding of basic concepts such as objects, boundaries, and spatial relationships. While humans can easily solve ARC puzzles with very few demonstrations, current AI systems struggle with them. ARC has long been considered one of the most difficult benchmarks in AI.
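To make the task format concrete, here is a minimal sketch of how an ARC-style puzzle is typically represented: a few input/output grid demonstrations plus a test input whose output the solver must infer. The grids and the hidden rule below are illustrative toys, not an actual ARC task.

```python
# A toy task in ARC's JSON-style format: each grid is a list of rows,
# and each cell is an integer 0-9 denoting a color.
# (Illustrative example only, not a real ARC puzzle.)
toy_task = {
    "train": [  # demonstration pairs the solver must infer the rule from
        {"input": [[1, 0], [0, 0]], "output": [[0, 1], [0, 0]]},
        {"input": [[0, 0], [1, 0]], "output": [[0, 0], [0, 1]]},
    ],
    "test": [  # the solver must predict the output for this input
        {"input": [[1, 0], [1, 0]]},
    ],
}

def mirror(grid):
    """The hidden rule in this toy task: mirror each row left-to-right."""
    return [list(reversed(row)) for row in grid]

# A candidate rule is accepted only if it explains every demonstration pair.
assert all(mirror(pair["input"]) == pair["output"] for pair in toy_task["train"])

# Then it is applied to the test input.
prediction = mirror(toy_task["test"][0]["input"])
print(prediction)  # [[0, 1], [0, 1]]
```

The point of the format is that each task has its own hidden rule, so memorizing past puzzles does not help with new ones.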

ARC was designed so that it cannot be gamed by training models on millions of examples in the hope of covering every possible combination of puzzles.
The benchmark comprises a public training set of 400 simple examples, complemented by a public evaluation set of 400 more difficult puzzles that assess the generalizability of AI systems. The ARC-AGI Challenge also includes private and semi-private test sets of 100 puzzles each, which are not shared with the public; they are used to evaluate candidate AI systems without the risk of leaking the data and contaminating future systems with prior knowledge. Additionally, the competition caps the amount of compute participants can use, to ensure that puzzles are not solved through brute-force methods.
A breakthrough in solving new tasks
OpenAI’s o1-preview and o1 models scored a maximum of 32% on ARC-AGI. Another method, developed by researcher Jeremy Berman, used a hybrid approach combining Claude 3.5 Sonnet with genetic algorithms and a code interpreter to reach 53%, the highest score before o3.
In a blog post, François Chollet, the creator of ARC, described o3’s performance as “a surprising and significant increase in AI capabilities, demonstrating a new ability to adapt to tasks never before seen in GPT-family models.”
It is important to note that throwing more compute at previous generations of models did not achieve these results. For context, it took four years for models to go from 0% with GPT-3 in 2020 to just 5% with GPT-4o in early 2024. Although we don’t know much about o3’s architecture, we can be reasonably sure it is not orders of magnitude larger than its predecessors.

“This is not just an incremental improvement, but a real breakthrough, marking a qualitative shift in AI’s capabilities compared to the previous limitations of LLMs,” Chollet wrote. “o3 is a system capable of adapting to tasks it has never encountered before, arguably approaching human-level performance in the ARC-AGI domain.”
It should be noted that o3’s performance on ARC-AGI comes at a steep cost. On the low-compute budget, the model spends between $17 and $20 and 33 million tokens to solve each puzzle, while on the high-compute budget it uses approximately 172 times more compute and billions of tokens per problem. However, as inference costs continue to fall, we can expect these numbers to become more reasonable.
A new paradigm in LLM reasoning?
The key to solving novel problems is what Chollet and other scientists call “program synthesis.” A thinking system should be able to develop small programs to solve very specific problems, then combine these programs to tackle more complex ones. Classic language models have absorbed a lot of knowledge and contain a rich set of internal programs, but they lack compositionality, which prevents them from solving puzzles beyond their training distribution.
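The idea of composing small programs can be sketched with a toy brute-force synthesizer: given a handful of primitive grid operations, it searches compositions of them until one explains every demonstration pair. The primitives and demonstrations below are invented for illustration; real ARC solvers use far richer domain-specific languages and smarter search.

```python
from itertools import product

# A few primitive grid programs (illustrative; real solvers use richer DSLs).
def rotate90(g):  return [list(r) for r in zip(*g[::-1])]
def mirror_h(g):  return [r[::-1] for r in g]
def recolor(g):   return [[2 if c == 1 else c for c in r] for r in g]

PRIMITIVES = [rotate90, mirror_h, recolor]

def synthesize(pairs, max_depth=3):
    """Brute-force program synthesis: try compositions of primitives of
    increasing length until one maps every demo input to its output."""
    for depth in range(1, max_depth + 1):
        for combo in product(PRIMITIVES, repeat=depth):
            def program(g, combo=combo):
                for f in combo:
                    g = f(g)
                return g
            if all(program(i) == o for i, o in pairs):
                return program, [f.__name__ for f in combo]
    return None, None

# Demonstrations whose hidden rule is "mirror each row, then recolor 1 -> 2".
pairs = [([[1, 0]], [[0, 2]]),
         ([[0, 1]], [[2, 0]])]

program, steps = synthesize(pairs)
print(steps)                 # ['mirror_h', 'recolor']
print(program([[1, 1, 0]]))  # [[0, 2, 2]]
```

The found composition generalizes to unseen grids because it is a program, not a memorized mapping, which is the property Chollet argues pure LLMs lack.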
Unfortunately, there is very little information about how o3 works under the hood, and scientists’ opinions differ. Chollet speculates that o3 uses a type of program synthesis that combines chain-of-thought (CoT) reasoning and a search mechanism with a reward model that evaluates and refines solutions as the model generates tokens. This is similar to what open-source reasoning models have explored in recent months.
Other scientists, such as Nathan Lambert of the Allen Institute for AI, suggest that “o1 and o3 may actually be just advanced passes of a single language model.” On the day o3 was announced, Nat McAleese, a researcher at OpenAI, posted that o1 was “only an LLM trained with RL. o3 is powered by a further extension of RL beyond o1.”

The same day, Denny Zhou of Google DeepMind called the combination of current search and reinforcement learning approaches a “dead end.”
“The most beautiful thing about LLM reasoning is that the thought process is generated in an autoregressive way, rather than relying on search (e.g. mcts) over the generation space, whether by a well-tuned model or a carefully crafted prompt,” he posted.

Although the details of how o3 reasons may seem minor compared to its advance on ARC-AGI, they may well define the next paradigm shift in LLM training. There is currently a debate over whether the laws of scaling LLMs via training data and compute have hit a wall. Whether test-time scaling depends on better training data or on different inference architectures could determine the next path forward.
No AGI
The name ARC-AGI is misleading, and some have equated solving it with achieving AGI. However, Chollet emphasizes that “ARC-AGI is not a litmus test for AGI.”
“Passing ARC-AGI is not the same as achieving AGI and, in fact, I don’t think o3 is AGI yet,” he writes. “o3 still fails at some very simple tasks, indicating fundamental differences with human intelligence.”
Additionally, he notes that o3 cannot learn these skills autonomously and relies on external verifiers during inference and human-labeled chains of reasoning during training.
Other scientists have pointed out flaws in the results reported by OpenAI. For example, the model was fine-tuned on the ARC training set to achieve state-of-the-art results. “The solver should not need much specific ‘training,’ either on the domain itself or on each specific task,” writes scientist Melanie Mitchell.
To test whether these models possess the kind of abstraction and reasoning the ARC benchmark was created to measure, Mitchell suggests “seeing whether these systems can adapt to variations on specific tasks or to reasoning tasks using the same concepts, but in areas other than ARC.”
Chollet and his team are currently working on a new benchmark that is challenging for o3, potentially reducing its score to under 30% even with a high compute budget, while humans would reportedly be able to solve 95% of its puzzles without any training.
“You will know AGI is there when the exercise of creating tasks that are easy for ordinary humans but difficult for AI becomes simply impossible,” Chollet writes.