In a new case study, Hugging Face researchers demonstrated how small language models (SLMs) can be configured to outperform much larger models. Their results show that a Llama 3 model with 3B parameters can outperform the 70B version of the model on complex math problems.
Hugging Face has fully documented the entire process and provides a roadmap for companies that want to create their own custom reasoning models.

Scaling compute at test time
The work is inspired by OpenAI o1, which uses extra “thinking” to solve complex math, coding, and reasoning problems.
The key idea behind models like o1 is to scale “test-time compute,” which effectively means using more compute cycles during inference to test and verify different answers and reasoning paths before producing the final answer. Scaling test-time compute is particularly useful when there is not enough memory to run a large model.
Since o1 is a private model and OpenAI has remained tight-lipped about its inner workings, researchers have speculated about how it works and attempted to reverse engineer the process. There are already several open alternatives to o1.
Hugging Face’s work is based on a DeepMind study published in August, which explores the trade-offs between inference-time and pre-training compute. The study provides comprehensive guidelines on how to balance training and inference compute to get the best results for a fixed budget.
In addition to using extra inference-time compute, the success of the technique hinges on two key elements: a reward model that evaluates the SLM’s answers, and a search algorithm that optimizes the path it takes to refine those answers.

Different reasoning algorithms
The simplest way to use test-time scaling is “majority voting,” in which the same prompt is sent to the model multiple times and the most frequent answer is chosen. On simple problems, majority voting can be useful, but its gains quickly plateau on complex reasoning problems or on tasks where errors are consistent across generations.
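To make the idea concrete, here is a minimal sketch of majority voting in Python. The `generate_answer` function is a hypothetical stand-in for whatever sampling call your inference stack exposes; it is not part of Hugging Face’s code.

```python
from collections import Counter


def generate_answer(prompt: str, seed: int) -> str:
    """Hypothetical stand-in for one sampled completion from the SLM."""
    # In practice this would call your inference stack with temperature > 0.
    return ["42", "42", "41", "42"][seed % 4]


def majority_vote(prompt: str, n_samples: int = 16) -> str:
    """Sample the model n_samples times and return the most frequent answer."""
    answers = [generate_answer(prompt, seed) for seed in range(n_samples)]
    answer, _count = Counter(answers).most_common(1)[0]
    return answer


if __name__ == "__main__":
    print(majority_vote("What is 6 * 7?"))  # prints "42"
```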
A more advanced reasoning method is “Best-of-N.” In this technique, the SLM generates multiple answers, but instead of a majority vote, a reward model is used to evaluate the answers and choose the best one. “Weighted Best-of-N,” a more nuanced version of this method, takes consistency into account to choose answers that are both confident and occur more frequently than others.
The researchers used a “process reward model” (PRM) that evaluates the SLM’s response not only on the final answer, but also on the multiple steps it goes through to reach it. Their experiments showed that weighted Best-of-N and PRMs brought Llama-3.2 1B close to the level of Llama-3.1 8B on the difficult MATH-500 benchmark.
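The sketch below illustrates the shape of weighted Best-of-N, assuming a `score_with_prm` helper that rates each reasoning step; the scores here are placeholders, whereas Hugging Face used a trained process reward model for this role.

```python
import math
from collections import defaultdict


def score_with_prm(steps: list[str]) -> float:
    """Hypothetical PRM: rate each reasoning step in [0, 1] and aggregate.
    The product is used here; min or last-step score are common alternatives."""
    fake_step_scores = [0.9 for _ in steps]  # placeholder values
    return math.prod(fake_step_scores)


def weighted_best_of_n(candidates: list[list[str]]) -> str:
    """Each candidate is a list of reasoning steps ending in a final answer.
    Scores of candidates that share a final answer are summed, so answers
    that are both well-rated and frequent win."""
    totals: dict[str, float] = defaultdict(float)
    for steps in candidates:
        totals[steps[-1]] += score_with_prm(steps)
    return max(totals, key=totals.get)


if __name__ == "__main__":
    candidates = [
        ["3 + 4 = 7", "answer: 7"],
        ["3 + 4 = 8", "answer: 8"],
        ["add 3 and 4", "3 + 4 = 7", "answer: 7"],
    ]
    print(weighted_best_of_n(candidates))  # prints "answer: 7"
```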

Adding a search
To further improve the model’s performance, the researchers added search algorithms to the model’s reasoning process. Instead of generating the answer in a single pass, they used “beam search,” an algorithm that guides the model’s response process step by step.
At each step, the SLM generates several partial answers. The search algorithm uses the reward model to evaluate them and chooses a subset worth exploring further. The process is repeated until the model exhausts its inference budget or reaches the correct answer. This way, the inference budget is concentrated on the most promising answers.
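A rough sketch of PRM-guided beam search is shown below. The `extend` and `prm_score` functions are invented toy stand-ins for the SLM’s step generator and the process reward model, not the actual components used in the study.

```python
import random
from dataclasses import dataclass, field

random.seed(0)


@dataclass
class Beam:
    steps: list[str] = field(default_factory=list)
    done: bool = False


def extend(beam: Beam, n: int) -> list[Beam]:
    """Hypothetical stand-in for the SLM proposing n candidate next steps."""
    children = []
    for i in range(n):
        step = f"step {len(beam.steps) + 1}.{i}"
        finished = len(beam.steps) >= 2  # pretend solutions take about 3 steps
        children.append(Beam(steps=beam.steps + [step], done=finished))
    return children


def prm_score(beam: Beam) -> float:
    """Hypothetical stand-in for a process reward model score in [0, 1]."""
    return random.random()


def beam_search(beam_width: int = 4, expansions: int = 4, max_rounds: int = 8) -> Beam:
    """At each round, expand every unfinished beam, score all partial
    solutions with the PRM, and keep only the top `beam_width`, so the
    remaining inference budget is spent on the most promising paths."""
    beams = [Beam()]
    for _ in range(max_rounds):
        candidates: list[Beam] = []
        for beam in beams:
            if beam.done:
                candidates.append(beam)
            else:
                candidates.extend(extend(beam, expansions))
        candidates.sort(key=prm_score, reverse=True)
        beams = candidates[:beam_width]
        if all(b.done for b in beams):
            break
    return beams[0]


if __name__ == "__main__":
    print(beam_search().steps)
```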
The researchers found that while beam search improves model performance on complex problems, it tends to underperform other techniques on simple problems. To address this challenge, they added two more elements to their inference strategy.
The first was Diverse Verifier Tree Search (DVTS), a variant of beam search that ensures the SLM does not get stuck in false reasoning paths and diversifies its answer branches. Second, they developed a “compute-optimal scaling strategy,” as suggested in the DeepMind paper, which dynamically chooses the best test-time scaling strategy based on the difficulty of the input problem.
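As a toy illustration of difficulty-based routing, the sketch below picks a strategy and a budget per problem. The `estimate_difficulty` heuristic and the thresholds are invented for illustration and do not come from the DeepMind or Hugging Face work; in practice, difficulty could be estimated from PRM scores or the spread of a few pilot generations.

```python
def estimate_difficulty(problem: str) -> float:
    """Hypothetical difficulty estimate in [0, 1]; a length-based placeholder."""
    return min(len(problem) / 200.0, 1.0)


def choose_strategy(problem: str, budget: int) -> tuple[str, int]:
    """Toy compute-optimal router: easy problems get cheap majority voting,
    mid-range problems get weighted Best-of-N, and hard problems get
    search-based methods (beam search / DVTS) with the largest budget share."""
    difficulty = estimate_difficulty(problem)
    if difficulty < 0.3:
        return "majority_voting", min(budget, 8)
    if difficulty < 0.7:
        return "weighted_best_of_n", min(budget, 32)
    return "dvts_beam_search", budget


if __name__ == "__main__":
    print(choose_strategy("What is 2 + 2?", budget=64))
    print(choose_strategy("Prove that the sum of the interior angles of a "
                          "convex polygon with n sides is (n - 2) * 180 deg, "
                          "then compute it for n = 1000 and explain each step.",
                          budget=64))
```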
The combination of these techniques allowed Llama-3.2 1B to punch above its weight and far outperform the 8B model. They also found the strategy to be scalable: when applied to Llama-3.2 3B, it outperformed the much larger 70B model.

Not yet a perfect solution
Scaling test-time compute changes the cost dynamics of serving models. Enterprises now have the ability to choose where to allocate their compute resources. For example, if you are short on memory or can tolerate slower response times, you can use a small model and spend more inference-time cycles to generate more accurate answers.
However, test-time scaling also has its limitations. For example, in the experiments conducted by Hugging Face, the researchers used a specially trained Llama-3.1-8B model as the PRM, which requires running two models in parallel (even though this is far more resource-efficient than the 70B model). The researchers acknowledge that the holy grail of test-time scaling is “self-verification,” where the original model verifies its own answers instead of relying on an external verifier. This remains an open area of research.
The test-time scaling technique presented in this study is also limited to problems whose answers can be clearly evaluated, such as coding and math. Creating reward models and verifiers for subjective tasks such as creative writing and product design requires further research.
But what is clear is that test-time scaling has generated a lot of interest and activity, and we can expect more tools and techniques to emerge in the coming months. Enterprises would do well to keep an eye on the evolving landscape.