Every Sunday, NPR host Will Shortz, The New York Times' crossword puzzle guru, gets to quiz thousands of listeners in a long-running segment called the Sunday Puzzle. While the brainteasers are written to be solvable without too much prior knowledge, they are usually challenging even for skilled contestants.
That is why some experts believe they are a promising way to test the limits of AI's problem-solving abilities.
In a recent study, a team of researchers from Wellesley College, Oberlin College, the University of Texas at Austin, Northeastern University, Charles University, and the startup Cursor created an AI benchmark using riddles from Sunday Puzzle episodes. The team says its test uncovered surprising insights, such as that reasoning models, including OpenAI's o1, sometimes "give up" and provide answers they know aren't correct.
"We wanted to develop a benchmark with problems that humans can understand with only general knowledge," Arjun Guha, a computer science faculty member at Northeastern University and one of the study's co-authors, told TechCrunch.
The AI industry is in a bit of a benchmarking quandary at the moment. Most of the tests commonly used to evaluate AI models probe for skills, like competency on PhD-level math and science questions, that aren't relevant to the average user. Meanwhile, many benchmarks, even ones released relatively recently, are quickly approaching the saturation point.
The advantage of a public radio quiz game like the Sunday Puzzle is that it doesn't test for esoteric knowledge, and the challenges are phrased so that models can't rely on "rote memory" to solve them, Guha said.
"I think what makes these problems hard is that it's really difficult to make meaningful progress on a problem until you solve it; that's when everything clicks together all at once," Guha said. "That requires a combination of insight and a process of elimination."
No benchmark is perfect, of course. The Sunday Puzzle is U.S.-centric and English-only. And because the quizzes are publicly available, it's possible that models trained on them can "cheat" in a sense, although Guha says he hasn't seen evidence of this.
"New questions are released every week, and we can expect the latest questions to be genuinely unseen," he added. "We intend to keep the benchmark fresh and track how model performance changes over time."
On the researchers' benchmark, which consists of around 600 Sunday Puzzle riddles, reasoning models such as OpenAI's o1 and DeepSeek's R1 outperform the rest. Reasoning models check their own work thoroughly before giving results, which helps them avoid some of the pitfalls that normally trip up AI models. The trade-off is that reasoning models take a little longer to arrive at solutions, typically seconds to minutes longer.
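To make the evaluation setup concrete, here is a minimal sketch of how a model could be scored against a riddle set like this one. It is not the researchers' actual harness: the puzzle file name and format, the exact-match grading, and the model choice are illustrative assumptions.

```python
# Minimal sketch of scoring a model on a set of riddles.
# Assumptions: a local JSON file of {"question", "answer"} pairs and
# naive exact-match grading; not the study's actual harness.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask_model(riddle: str, model: str = "o1") -> str:
    """Send one Sunday Puzzle-style riddle to a reasoning model."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": f"Solve this riddle. Reply with only the answer.\n{riddle}"}],
    )
    return response.choices[0].message.content.strip()

def score(puzzles: list[dict]) -> float:
    """Return the fraction of puzzles answered correctly."""
    correct = sum(
        ask_model(p["question"]).lower() == p["answer"].lower()
        for p in puzzles
    )
    return correct / len(puzzles)

if __name__ == "__main__":
    with open("sunday_puzzles.json") as f:  # hypothetical local copy of ~600 riddles
        puzzles = json.load(f)
    print(f"accuracy: {score(puzzles):.1%}")
```

In practice, free-form riddle answers usually need fuzzier grading (or a judge model) than the exact string comparison used in this sketch.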
At least one model, DeepSeek's R1, gives solutions it knows to be wrong for some of the Sunday Puzzle questions. R1 will state verbatim that it is "giving up," followed by an incorrect answer seemingly chosen at random, a behavior humans can certainly relate to.
The models make other bizarre choices, too, like giving a wrong answer only to retract it immediately, trying to tease out a better one, and failing again. They also get stuck "thinking" forever, give nonsensical explanations for their answers, or arrive at a correct answer right away but then go on to consider alternatives for no obvious reason.
"On hard problems, R1 literally says that it's getting 'frustrated,'" Guha said. "It was funny to see how a model emulates what a human might say. It remains to be seen how 'frustration' in reasoning can affect the quality of model results."

The best-performing model on the benchmark is o1, with a score of 59%, followed by the recently released o3-mini set to high "reasoning effort" (47%). (R1 scored 35%.) As a next step, the researchers plan to expand their testing to additional reasoning models, which they hope will help identify areas where these models could be improved.
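The "reasoning effort" setting mentioned above is exposed as a request parameter on OpenAI's API. The hedged example below shows how it can be requested for o3-mini; the prompt is a placeholder, and the article does not describe the researchers' exact configuration.

```python
# Illustrative request only; the riddle text is a placeholder, not a real
# Sunday Puzzle question, and this is not the study's exact setup.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="o3-mini",
    reasoning_effort="high",  # "low", "medium", or "high"; higher effort spends more reasoning tokens
    messages=[{"role": "user", "content": "<Sunday Puzzle-style riddle goes here>"}],
)
print(response.choices[0].message.content)
```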

"You don't need a PhD to be good at reasoning, so it should be possible to design reasoning benchmarks that don't require PhD-level knowledge," Guha said. "A benchmark with broader access allows a wider set of researchers to understand and analyze the results, which may in turn lead to better solutions in the future. Furthermore, as state-of-the-art models are increasingly deployed in settings that affect everyone, we believe everyone should be able to intuit what these models are, and aren't, capable of."