A new paper by researchers from Google Research and the University of California, Berkeley, demonstrates that a surprisingly simple test-time scaling approach can boost the reasoning abilities of large language models (LLMs). The key? Sampling-based search, a technique that relies on generating multiple responses and using the model itself to verify them.
The core finding is that even a minimalist implementation of sampling-based search, using random sampling and self-verification, can lift the reasoning performance of models like Gemini 1.5 Pro beyond that of o1-Preview on popular benchmarks. The results could have important implications for enterprise applications and challenge the assumption that highly specialized training or complex architectures are always necessary to achieve top-tier performance.
The limits of current test-time compute scaling
The current popular method for test-time scaling in LLMs is to train the model through reinforcement learning to generate longer responses with chain-of-thought (CoT) traces. This approach is used in models such as OpenAI o1 and DeepSeek-R1. While beneficial, these methods usually require substantial investment in the training phase.
Another test-time scaling method is "self-consistency," where the model generates multiple responses to the query and chooses the answer that appears most often. Self-consistency reaches its limits on complex problems, as in these cases the most repeated answer is not necessarily the correct one.
Sampling-based search offers a simpler and highly scalable alternative to test-time scaling: let the model generate multiple responses and select the best one through a verification mechanism. Sampling-based search can complement other test-time compute scaling strategies and, as the researchers write in their paper, "it also has the unique advantage of being embarrassingly parallel and allowing for arbitrarily scaling: simply sample more responses."
More importantly, sampling-based search can be applied to any LLM, including those that have not been explicitly trained for reasoning.
How sampling-based search works
The researchers focus on a minimalist implementation of sampling-based search, using a language model both to generate candidate responses and to verify them. This is a "self-verification" process, where the model assesses its own outputs without relying on external ground-truth answers or symbolic verification systems.

The algorithm works in a few simple steps:
1. The algorithm begins by generating a set of candidate solutions to the given problem using a language model. This is done by giving the model the same prompt multiple times and using a non-zero temperature setting to create a diverse set of responses.
2. Each candidate response undergoes a verification process in which the LLM is prompted multiple times to determine whether the answer is correct. The verification results are then averaged to create a final verification score for the response.
3. The algorithm selects the highest-scored response as the final answer. If several candidates are close to each other, the LLM is prompted to compare them pairwise and choose the best one. The response that wins the most pairwise comparisons is chosen as the final answer.
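The three steps above can be sketched in a few dozen lines. This is a minimal illustration, not the paper's actual implementation: `llm(prompt, temperature)` stands in for any language-model call, and the prompt wording, tie-break threshold, and yes/no parsing are all assumptions.

```python
from collections import Counter

def sampling_based_search(llm, problem, k_samples=4, k_verify=3):
    """Minimal sketch of sampling-based search with self-verification.
    `llm(prompt, temperature)` is a stand-in for any model call; prompt
    formats here are illustrative, not the paper's exact ones."""
    # Step 1: sample candidate solutions at non-zero temperature.
    candidates = [llm(f"Solve: {problem}", temperature=0.8)
                  for _ in range(k_samples)]

    # Step 2: score each candidate by averaging repeated yes/no verifications.
    def score(answer):
        votes = [llm(f"Problem: {problem}\nProposed answer: {answer}\n"
                     "Is this answer correct? Reply yes or no.",
                     temperature=0.8)
                 for _ in range(k_verify)]
        return sum(v.strip().lower().startswith("yes") for v in votes) / k_verify

    scored = sorted(((score(c), c) for c in candidates), reverse=True)

    # Step 3: if top candidates are nearly tied, break ties with pairwise
    # comparisons and return the answer that wins the most matchups.
    top = [c for s, c in scored if scored[0][0] - s < 0.05]
    if len(top) == 1:
        return top[0]
    wins = Counter()
    for i, a in enumerate(top):
        for b in top[i + 1:]:
            choice = llm(f"Problem: {problem}\nAnswer A: {a}\nAnswer B: {b}\n"
                         "Which answer is better? Reply A or B.",
                         temperature=0.0)
            wins[a if choice.strip().upper().startswith("A") else b] += 1
    return wins.most_common(1)[0][0]
```

Because every candidate and every verification call is independent, all of the sampling and scoring can run in parallel, which is the "embarrassingly parallel" property the researchers highlight.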
The researchers examined two key axes for scaling test-time computation:
Sampling: the number of responses the model generates for each input problem.
Verification: the number of verification scores computed for each generated solution.
How sampling-based search compares to other techniques
The study found that reasoning performance continues to improve with sampling-based search, even when test-time compute is scaled far beyond the point where self-consistency saturates.
At sufficient scale, this minimalist implementation significantly boosts reasoning accuracy on reasoning benchmarks such as AIME and MATH. For example, Gemini 1.5 Pro's performance surpassed that of o1-Preview, which was explicitly trained on reasoning problems, and Gemini 1.5 Flash surpassed Gemini 1.5 Pro.

"This not only highlights the importance of sampling-based search for scaling capability, but also suggests the utility of sampling-based search as a simple baseline on which to compare other test-time scaling strategies and measure genuine improvements in models' search capabilities," the researchers write.
It is worth noting that while the results of sampling-based search are impressive, the costs can also become prohibitive. For example, with 200 samples and 50 verification steps per sample, a query from AIME will generate around 130 million tokens, which costs $650 with Gemini 1.5 Pro. However, this is a very minimalist approach to sampling-based search, and it is compatible with the optimization techniques proposed in other studies. With smarter sampling and verification methods, inference costs can be reduced considerably by using smaller models and generating fewer tokens. For example, by using Gemini 1.5 Flash to perform the verification, costs drop to $12 per question.
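A quick back-of-the-envelope check shows how these figures relate. The per-call token count below is an assumption chosen to land near the quoted ~130 million total; only the sample and verification counts and the $650 figure come from the article.

```python
# Rough cost arithmetic for 200 samples x 50 verifications per sample.
n_samples = 200
verifications_per_sample = 50
tokens_per_call = 12_750  # assumed average tokens per generation/verification call

total_calls = n_samples * (1 + verifications_per_sample)   # 10,200 LLM calls
total_tokens = total_calls * tokens_per_call               # ~130 million tokens
implied_price = 650 / (total_tokens / 1e6)                 # implied $/million tokens

print(f"{total_tokens / 1e6:.0f}M tokens, implied ${implied_price:.2f}/M tokens")
```

Since verification calls dominate the total (10,000 of the 10,200 calls), routing just the verification step to a cheaper model like Gemini 1.5 Flash is what collapses the bill from $650 to $12.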
Effective self-verification strategies
There is an ongoing debate over whether LLMs can verify their own answers. The researchers identified two key strategies for improving self-verification using test-time compute:
Directly comparing response candidates: Disagreements between candidate solutions strongly indicate potential errors. By providing the verifier with multiple responses to compare, the model can better identify mistakes and hallucinations, addressing a core LLM weakness. The researchers describe this as an instance of "implicit scaling."
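One way to apply this idea is to show the verifier the other sampled answers alongside the candidate it is checking, so that disagreements surface the exact spot to scrutinize. The prompt format below is hypothetical, not the paper's actual one:

```python
def verification_prompt(problem, candidate, other_answers):
    """Illustrative verifier prompt exposing disagreement between sampled
    answers ("implicit scaling"). Hypothetical format; the paper's exact
    prompts may differ."""
    alternatives = "\n".join(f"- {a}" for a in other_answers)
    return (
        f"Problem: {problem}\n"
        f"Candidate answer: {candidate}\n"
        f"Other sampled answers:\n{alternatives}\n"
        "Where the answers disagree, re-check the candidate's reasoning at "
        "the point of disagreement. Reply 'correct' or 'incorrect'."
    )
```

The more answers are sampled, the more points of disagreement the verifier sees, which is why the effect scales implicitly with the sampling budget.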
Task-specific rewriting: The researchers propose that the optimal output style of an LLM depends on the task. Chain-of-thought is effective for solving reasoning tasks, but responses are easier to verify when written in a more formal, mathematically conventional style. Verifiers can rewrite candidate responses into a more structured format (for example, theorem-lemma-proof) before evaluation.
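This rewriting step can be sketched as a two-call pipeline: first recast the candidate into a structured form, then verify the structured version. Again, `llm(prompt)` is a stand-in for any model call and the prompt wording is an assumption:

```python
def verify_with_rewrite(llm, problem, answer):
    """Sketch of task-specific rewriting before verification: recast the
    candidate in theorem-lemma-proof style, then check it. Prompts are
    illustrative; `llm(prompt)` stands in for any model call."""
    rewritten = llm(
        f"Rewrite this solution to '{problem}' in theorem-lemma-proof form, "
        f"keeping every claim unchanged:\n{answer}"
    )
    verdict = llm(
        f"Problem: {problem}\nStructured solution:\n{rewritten}\n"
        "Verify each lemma in turn. Is the final answer correct? Reply yes or no."
    )
    return verdict.strip().lower().startswith("yes")
```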
"We anticipate model self-verification capabilities to improve rapidly in the short term, as models learn to leverage the principles of implicit scaling and output style suitability, and drive improved scaling rates for sampling-based search," the researchers write.
Implications for real-world applications
The study shows that a relatively simple technique can achieve impressive results, potentially reducing the need for complex and costly model architectures or training regimes.
It is also a scalable technique, enabling enterprises to increase performance by allocating more compute resources to sampling and verification. It also enables developers to push frontier language models beyond their limits on complex tasks.
"Given that it complements other test-time compute scaling strategies, is parallelizable and allows for arbitrarily scaling, and admits simple implementations that are demonstrably effective, we expect sampling-based search to play a crucial role as language models are tasked with solving increasingly complex problems with increasingly large compute budgets," the researchers write.