Small language models (SLMs) can outperform much larger models on reasoning tasks, according to a new study by Shanghai AI Laboratory. The authors show that with the right tools and test-time scaling techniques, an SLM with 1 billion parameters can outperform a 405B LLM on complicated math benchmarks.
The ability to deploy SLMs on complex reasoning tasks can be very useful, as enterprises are looking for new ways to use these models in different environments and applications.
Test-time scaling, explained
Test-time scaling (TTS) is the process of giving LLMs extra compute cycles during inference to improve their performance on various tasks. Leading reasoning models, such as OpenAI o1 and DeepSeek-R1, use “internal TTS,” which means they are trained to “think” slowly by generating a long string of chain-of-thought (CoT) tokens.
An alternative approach is “external TTS,” where (as the name implies) model performance is enhanced with outside help. External TTS is suitable for repurposing existing models for reasoning tasks without fine-tuning them further. An external TTS setup is usually composed of a “policy model,” which is the main LLM generating the answer, and a process reward model (PRM) that evaluates the policy model’s answers. These two components are coupled together through a sampling or search method.
The simplest setup is “best-of-N,” where the policy model generates multiple answers and the PRM selects one or more of the best answers to compose the final response. More advanced external TTS methods use search. In “beam search,” the model breaks the answer down into multiple steps.
For each step, it samples multiple answers and runs them through the PRM. It then chooses one or more suitable candidates and generates the next step of the answer. And, in “diverse verifier tree search” (DVTS), the model generates several branches of answers to create a more diverse set of candidate responses before synthesizing them into a final answer.
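To make the external TTS loop concrete, here is a minimal best-of-N sketch in Python. The policy_generate and prm_score callables are hypothetical stand-ins for calls to a policy model and a process reward model, and the toy implementations exist only so the example runs end to end; none of this is code from the study.

```python
import random
from typing import Callable, List

def best_of_n(
    question: str,
    policy_generate: Callable[[str], str],   # hypothetical: samples one answer from the policy model
    prm_score: Callable[[str, str], float],  # hypothetical: scores a full answer with the PRM
    n: int = 8,
) -> str:
    """Sample N candidate answers and return the one the PRM scores highest."""
    candidates: List[str] = [policy_generate(question) for _ in range(n)]
    scores = [prm_score(question, answer) for answer in candidates]
    best_index = max(range(n), key=lambda i: scores[i])
    return candidates[best_index]

# Toy stand-ins so the sketch runs; a real setup would call an SLM and a trained PRM.
def toy_policy(question: str) -> str:
    return f"candidate answer #{random.randint(0, 99)}"

def toy_prm(question: str, answer: str) -> float:
    return random.random()

if __name__ == "__main__":
    print(best_of_n("What is 17 * 24?", toy_policy, toy_prm, n=4))
```

Beam search and DVTS follow the same basic pattern, but they apply the PRM after each intermediate step rather than once per completed answer.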

What is the right scaling strategy?
Choosing the right TTS strategy depends on several factors. The study’s authors carried out a systematic investigation of how different policy models and PRMs affect the efficiency of TTS methods.
Their findings show that efficiency is highly dependent on the policy and PRM models. For example, for small policy models, search-based methods outperform best-of-N. However, for large policy models, best-of-N is more effective, because the models have better reasoning capabilities and don’t need a reward model to verify every step of their reasoning.
Their findings also show that the right TTS strategy depends on the difficulty of the problem. For example, for small policy models with fewer than 7B parameters, best-of-N works better for easy problems, while beam search works better for harder problems. For policy models that have between 7B and 32B parameters, diverse verifier tree search performs well for easy and medium problems, and beam search works best for hard problems. But for large policy models (72B parameters and more), best-of-N is the optimal method across all difficulty levels.
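As a rough illustration, those size and difficulty buckets can be written out as a simple selection rule. The function below is an illustrative sketch of that logic, not code from the paper, and it makes one assumption the article doesn’t spell out: models between 32B and 72B fall through to the large-model rule.

```python
def pick_tts_strategy(policy_params_b: float, difficulty: str) -> str:
    """Choose an external TTS method from the policy model's size (in billions of
    parameters) and the problem difficulty, following the trends reported in the study."""
    if policy_params_b < 7:
        # Small policy models: best-of-N for easy problems, beam search for harder ones.
        return "best-of-N" if difficulty == "easy" else "beam search"
    if policy_params_b <= 32:
        # Mid-sized policy models: DVTS for easy and medium problems, beam search for hard ones.
        return "DVTS" if difficulty in ("easy", "medium") else "beam search"
    # Large policy models (72B and up): best-of-N across all difficulty levels.
    # Assumption: the 32B-72B gap left open by the study's buckets is treated as "large" here.
    return "best-of-N"

print(pick_tts_strategy(3, "hard"))     # beam search
print(pick_tts_strategy(14, "medium"))  # DVTS
print(pick_tts_strategy(72, "hard"))    # best-of-N
```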
Why small models can beat large models

Based on these findings, developers can create compute-optimal TTS strategies that take into account the policy model, the PRM and problem difficulty to make the best use of the compute budget when solving reasoning problems.
For example, the researchers found that a Llama-3.2-3B model with the compute-optimal TTS strategy outperforms Llama-3.1-405B on MATH-500 and AIME24, two complicated math benchmarks. This shows that an SLM can outperform a model that is 135X larger when using the compute-optimal TTS strategy.
In other experiments, they found that a Qwen2.5 model with 500 million parameters can outperform GPT-4o with the right compute-optimal TTS strategy. Using the same strategy, the 1.5B distilled version of DeepSeek-R1 outperformed o1-preview and o1-mini on MATH-500 and AIME24.
When accounting for both training and inference compute budgets, the findings show that with compute-optimal scaling strategies, SLMs can outperform larger models with 100-1000X fewer FLOPS.
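For a sense of where numbers like that can come from, here is a back-of-the-envelope calculation using the common approximations of ~6 × parameters × training tokens for training FLOPS and ~2 × parameters per generated token for inference. The model sizes, token counts and sample counts below are illustrative assumptions, not figures from the paper.

```python
def train_flops(params: float, train_tokens: float) -> float:
    """Common approximation: ~6 * parameters * training tokens."""
    return 6 * params * train_tokens

def infer_flops(params: float, out_tokens: float, samples: int = 1) -> float:
    """Common approximation: ~2 * parameters per generated token, times the number of samples."""
    return 2 * params * out_tokens * samples

# Illustrative assumptions (not from the paper): a 3B policy model trained on ~9T tokens versus
# a 405B model trained on ~15T tokens; each answer is ~1,024 tokens, and the small model
# samples 64 candidates for test-time scaling while the large model generates once.
slm_total = train_flops(3e9, 9e12) + infer_flops(3e9, 1024, samples=64)
llm_total = train_flops(405e9, 15e12) + infer_flops(405e9, 1024, samples=1)

print(f"3B + TTS total FLOPS: {slm_total:.2e}")
print(f"405B total FLOPS:     {llm_total:.2e}")
print(f"Ratio: {llm_total / slm_total:.0f}x")  # lands in the low hundreds, within the 100-1000X range
```

Training compute dominates the totals here, which is why the gap remains large even though test-time scaling multiplies the small model’s inference cost.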
The researchers’ findings show that compute-optimal TTS significantly enhances the reasoning capabilities of language models. However, as the policy model grows larger, the improvement from TTS gradually decreases.
“This suggests that the effectiveness of TTS is directly related to the reasoning ability of the policy model,” the researchers write. “Specifically, for models with weak reasoning abilities, scaling test-time compute leads to a substantial improvement, whereas for models with strong reasoning abilities, the gain is limited.”
The study validates that SLMs can perform better than larger models when applying compute-optimal test-time scaling methods. While this study focuses on math benchmarks, the researchers plan to extend their work to other reasoning tasks, such as coding and chemistry.