Not every AI prompt deserves multiple seconds of thinking: how Meta is teaching models to prioritize

MT HANNACH



Reasoning models like OpenAI o1 and DeepSeek-R1 have a problem: they overthink. Ask them a simple question such as "What is 1 + 1?" and they will think for several seconds before answering.

Ideally, like humans, AI models should be able to tell when to give a direct answer and when to spend extra time and resources reasoning before responding. A new technique presented by researchers at Meta AI and the University of Illinois Chicago trains models to allocate inference budgets based on the difficulty of the query. This results in faster responses, reduced costs, and better allocation of compute resources.

A reasoning model working through "1 + 1" in depth

Expensive reasoning

Large language models (LLMs) can improve their performance on reasoning problems when they produce longer chains of reasoning, often referred to as "chain-of-thought" (CoT). The success of CoT has led to an entire range of inference-time scaling techniques that encourage the model to "think" longer about the problem, produce and review several answers, and choose the best one.

One of the main approaches used in reasoning models is to generate several answers and choose the one that appears most often, also known as "majority voting" (MV). The problem with this approach is that the model adopts a uniform behavior, treating every prompt as a hard reasoning problem and spending unnecessary resources on generating multiple answers.
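As a rough illustration, here is a minimal sketch of majority voting in Python. The `generate_answer` callable is a hypothetical stand-in for a call to a reasoning model that returns a final answer string; it is not an API from the paper.

```python
from collections import Counter

def majority_vote(generate_answer, prompt, num_samples=8):
    """Sample several answers and return the most frequent one (MV).

    `generate_answer` is a hypothetical stand-in for a reasoning-model call
    that returns the final answer as a string.
    """
    answers = [generate_answer(prompt) for _ in range(num_samples)]
    # Every prompt pays for all `num_samples` generations, even "1 + 1".
    best_answer, _count = Counter(answers).most_common(1)[0]
    return best_answer
```

The inefficiency is visible in the sampling loop: the number of generations is fixed regardless of how easy the prompt is.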

Intelligent reasoning

The new paper proposes a series of training techniques that make reasoning models more efficient at responding. The first step is "sequential voting" (SV), where the model stops the reasoning process as soon as an answer appears a certain number of times. For example, the model is prompted to generate a maximum of eight answers and to choose the answer that appears at least three times. If the model receives the simple query mentioned above, the first three answers will probably be identical, which triggers the early stop and saves time and compute resources.
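Continuing the sketch above (same hypothetical `generate_answer` stand-in), sequential voting only changes the stopping rule: sampling ends as soon as any answer reaches the threshold.

```python
from collections import Counter

def sequential_vote(generate_answer, prompt, max_samples=8, threshold=3):
    """Stop sampling as soon as any answer appears `threshold` times (SV)."""
    counts = Counter()
    for _ in range(max_samples):
        answer = generate_answer(prompt)
        counts[answer] += 1
        if counts[answer] >= threshold:
            return answer  # early exit: the remaining samples are never generated
    # No answer reached the threshold: fall back to the most frequent one.
    return counts.most_common(1)[0][0]
```

For an easy prompt like "What is 1 + 1?", the loop typically exits after three samples instead of eight.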

Their experiments show that SV outperforms classic MV on math competition problems when it generates the same number of answers. However, SV requires additional prompting and token generation, which puts it roughly on par with MV in terms of token-to-accuracy ratio.

SV outperforms MV for a given number of answers but matches it in number of tokens (source: arXiv)

The second technique, "adaptive sequential voting" (ASV), improves on SV by prompting the model to examine the problem and generate several answers only when the problem is difficult. For simple problems (such as the 1 + 1 prompt), the model simply generates a single answer without going through the voting process. This makes the model much more efficient at handling both simple and complex problems.
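A minimal way to picture ASV, reusing the `sequential_vote` sketch above: a difficulty judgment gates the voting step. In the paper that judgment comes from the model itself; the `looks_difficult` function here is a hypothetical placeholder for that decision.

```python
def adaptive_sequential_vote(generate_answer, looks_difficult, prompt,
                             max_samples=8, threshold=3):
    """Answer directly for easy prompts; run sequential voting for hard ones (ASV)."""
    if not looks_difficult(prompt):
        # Easy query (e.g. "What is 1 + 1?"): a single direct answer, no voting.
        return generate_answer(prompt)
    # Hard query: fall back to sequential voting with early stopping.
    return sequential_vote(generate_answer, prompt, max_samples, threshold)
```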

Reinforcement learning

While SV and ASV improve the model's efficiency, they require a lot of hand-labeled data. To mitigate this problem, the researchers propose "inference budget-constrained policy optimization" (IBPO), a reinforcement learning algorithm that teaches the model to adjust the length of its reasoning traces based on the difficulty of the query.

IBPO is designed to allow LLMs to optimize their responses while remaining within an inference budget constraint. The RL algorithm enables the model to surpass the gains obtained from training on manually labeled data by continuously generating ASV traces, evaluating the responses, and choosing outcomes that provide the correct answer with the optimal inference budget.
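The article does not spell out the exact objective, but the general shape of a budget-aware scoring step can be sketched as follows. The reward rule and the trace fields are illustrative assumptions, not the paper's actual constrained-optimization formulation.

```python
def score_asv_traces(traces, token_budget):
    """Score self-generated ASV traces for an RL update (illustrative only).

    Each trace is assumed to be a dict with an `is_correct` flag and a
    `tokens` count. Correct answers that stay within the inference budget
    get full credit, which pushes the policy toward short traces for easy
    prompts and longer reasoning only where it pays off.
    """
    scored = []
    for trace in traces:
        reward = 1.0 if trace["is_correct"] and trace["tokens"] <= token_budget else 0.0
        scored.append((reward, trace))
    return scored
```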

Their experiments show that IBPO improves the Pareto front, meaning that for a fixed inference budget, a model trained with IBPO outperforms the other baselines.

IBPO (green circles) outperforms the other baselines on the Pareto front (source: arXiv)

The findings come as researchers warn that current AI models are hitting a wall. Companies are struggling to find quality training data and are exploring alternative methods to improve their models.

One promising solution is reinforcement learning, where the model is given an objective and allowed to find its own solutions, as opposed to supervised fine-tuning (SFT), where the model is trained on manually labeled examples.

Surprisingly, the model often finds solutions that humans have not thought of. This is a formula that seems to have worked well for DeepSeek-R1, which has challenged the dominance of U.S.-based AI labs.

The researchers note that "prompting-based and SFT-based methods struggle with both absolute improvement and efficiency, supporting the conjecture that SFT alone does not enable self-correction capabilities. This observation is also partially supported by concurrent work, which suggests that such self-correction behavior emerges automatically during RL rather than being manually created by prompting or SFT."
