Microsoft is boosting the potential of small language models (SLMs) with the unveiling of rStar-Math, a new reasoning technique that can be applied to small models to improve their performance on mathematical problems, reaching results similar to, and in some cases better than, those of OpenAI's o1-preview model.
Although it is still in the research phase (as described in a paper published on the preprint site arXiv.org, credited to eight authors from Microsoft, Peking University and Tsinghua University in China), the technique has been applied to several smaller open-source models, including Microsoft's Phi-3 mini, Alibaba's Qwen-1.5B (a 1.5-billion-parameter model) and Qwen-7B (a 7-billion-parameter model). It improved performance on all of them, even surpassing OpenAI's previously most advanced model on the third-party MATH benchmark of 12,500 word problems covering branches such as geometry and algebra, at all difficulty levels.
![](https://venturebeat.com/wp-content/uploads/2025/01/Screenshot-2025-01-09-at-1.21.02%E2%80%AFPM.png)
Ultimately, according to a post on Hugging Face, the researchers plan to make their code and data available on GitHub at https://github.com/microsoft/rStar, although one of the paper's authors, Li Lyna Zhang, wrote in the comments on the Hugging Face post that the team "is still undergoing the internal review process for an open source release." As such, "the repository remains private at this time. Please stay tuned!"
Community members expressed excitement, calling the innovations "impressive" and praising the blend of Monte Carlo Tree Search (MCTS) and step-by-step reasoning. One commenter highlighted the simplicity and usefulness of using Q-values to score reasoning steps, while others speculated about future applications in geometric proofs and symbolic reasoning.
This news closely follows the open-source release of Microsoft's Phi-4 model, a smaller 14-billion-parameter AI system now available on Hugging Face under the permissive MIT license.
While the Phi-4 version expanded access to small, high-performance models, rStar-Math presents a specialized approach: using smaller AI systems to achieve cutting-edge results in mathematical reasoning.
rStar-Math works by using several different models and components to help a small target model “self-evolve”
The key to rStar-Math is that it leverages Monte Carlo Tree Search (MCTS), a method that mimics human “deep thinking” by iteratively refining, step-by-step, solutions to mathematical problems.
The researchers used MCTS because it “decomposes complex math problems into simpler, single-step generation tasks, thereby reducing the difficulty” for smaller models.
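To make the idea concrete, here is a minimal, self-contained sketch of MCTS applied to a step-by-step search problem. It is purely illustrative: the toy domain (building a target number from single arithmetic steps), the reward function and all names are invented stand-ins for the single-step reasoning moves the paper describes, not the researchers' implementation.

```python
import math
import random

# Toy domain: reach TARGET from 0 using single steps "+3" or "*2", an
# illustrative stand-in for single reasoning steps in a math solution.
TARGET = 24
STEPS = ["+3", "*2"]
MAX_DEPTH = 6

def apply_steps(steps):
    """Evaluate a sequence of steps starting from 0."""
    v = 0
    for s in steps:
        v = v + 3 if s == "+3" else v * 2
    return v

class Node:
    def __init__(self, steps, parent=None):
        self.steps, self.parent = steps, parent
        self.children, self.visits, self.value = [], 0, 0.0

    def expand(self):
        for s in STEPS:
            self.children.append(Node(self.steps + [s], self))

def uct(child, parent_visits, c=1.4):
    """Upper Confidence bound for Trees: balances exploitation and exploration."""
    if child.visits == 0:
        return float("inf")
    return child.value / child.visits + c * math.sqrt(math.log(parent_visits) / child.visits)

def rollout(steps):
    """Complete a partial solution with random steps and score the result."""
    steps = list(steps)
    while len(steps) < MAX_DEPTH:
        steps.append(random.choice(STEPS))
    # Reward is 1.0 for hitting the target exactly, decaying with distance.
    return 1.0 / (1 + abs(TARGET - apply_steps(steps)))

def mcts(iterations=300):
    root = Node([])
    for _ in range(iterations):
        node = root
        # Selection: descend by UCT until reaching a leaf.
        while node.children:
            node = max(node.children, key=lambda ch: uct(ch, node.visits))
        # Expansion: grow the tree below a visited leaf.
        if node.visits > 0 and len(node.steps) < MAX_DEPTH:
            node.expand()
            node = node.children[0]
        # Simulation and backpropagation.
        reward = rollout(node.steps)
        while node:
            node.visits += 1
            node.value += reward
            node = node.parent
    # Recommend the most-visited first step.
    return max(root.children, key=lambda ch: ch.visits).steps[0]
```

The same selection/expansion/simulation/backpropagation loop underlies MCTS in rStar-Math, except that there a language model proposes the candidate steps and a learned model scores them instead of random rollouts.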
However, they did not simply apply MCTS as other researchers have done. Instead, in a stroke of genius, they also trained the model to always output its "chain of thought" reasoning steps both as natural-language descriptions and as Python code.
They required that the model embed its natural-language reasoning as comments inside the Python code, and only outputs whose Python code executed successfully were used to train the model.
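A hypothetical example of what such a "code-augmented chain of thought" might look like is sketched below, together with an execution-based filter. The word problem, variable names and helper function are all invented for illustration; the paper's actual prompts and filtering pipeline differ.

```python
# Illustrative code-augmented chain of thought: each reasoning step is a
# natural-language comment paired with the Python that carries it out.
def solve():
    # Step 1: A shirt costs $15 and is discounted by 20%.
    price = 15
    discount_rate = 0.20
    # Step 2: The discount amount is 20% of $15.
    discount_amount = price * discount_rate
    # Step 3: The final price is the original price minus the discount.
    return price - discount_amount

def runs_and_matches(trace_fn, expected):
    """Execution-based filtering (hypothetical helper): keep a reasoning
    trace for training only if its code runs and matches the known answer."""
    try:
        return trace_fn() == expected
    except Exception:
        return False
```

Because the code must actually execute and reproduce the known answer, traces with faulty reasoning are filtered out of the training data automatically.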
![](https://venturebeat.com/wp-content/uploads/2025/01/Screenshot-2025-01-09-at-1.35.40%E2%80%AFPM.png?w=800)
The researchers also trained a "policy model" to generate mathematical reasoning steps and a "process preference model" (PPM) to select the most promising steps for solving problems, then improved them both over the course of four rounds of "self-evolution," with each model improving the other.
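The shape of that loop can be sketched schematically. The two classes below are toy stand-ins (a number-valued "policy" and a distance-based "preference" score), not the paper's neural models; only the structure (propose, select with the PPM, retrain, repeat for four rounds) mirrors the description above.

```python
import random

class PolicyModel:
    """Toy stand-in for the policy model that proposes candidate steps."""
    def __init__(self):
        self.quality = 0.5  # toy proxy for model skill; 1.0 is "perfect"

    def propose(self, problem, k=4):
        # Candidates scatter around the model's current skill level.
        return [random.gauss(self.quality, 0.2) for _ in range(k)]

    def train(self, preferred_steps):
        # Retrain on the steps the preference model favored (toy update).
        self.quality = sum(preferred_steps) / len(preferred_steps)

class PreferenceModel:
    """Toy stand-in for the PPM: higher score means a more promising step."""
    def score(self, step):
        return -abs(1.0 - step)  # toy: steps near 1.0 count as "correct"

def self_evolve(rounds=4, problems=100):
    policy, ppm = PolicyModel(), PreferenceModel()
    for _ in range(rounds):
        preferred = []
        for p in range(problems):
            candidates = policy.propose(p)
            # The PPM selects the most promising candidate step.
            preferred.append(max(candidates, key=ppm.score))
        # The selected data retrains the policy (in the paper, the PPM is
        # also retrained), so each model improves the other round by round.
        policy.train(preferred)
    return policy.quality
```

Run over four rounds, the toy policy's quality drifts toward the "correct" value, mimicking how each round's filtered data lifts the next round's models.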
For their starting data, the researchers said they used “747,000 math word problems from publicly available sources,” along with their solutions, but generated new steps to solve them with the two models described above.
Record results
After four cycles of self-evolution, rStar-Math has reached important milestones:
• On the MATH benchmark, the accuracy of the Qwen2.5-Math-7B model increased from 58.8% to 90.0%, outperforming OpenAI's o1-preview.
• On the American Invitational Mathematics Examination (AIME), it solved 53.3% of problems, placing it in the top 20% of high school competitors.
These results highlight the power of SLMs in handling complex mathematical reasoning, a domain traditionally dominated by larger systems.
Smaller is better?
In recent years, AI innovation has been largely driven by the scaling of language models, with increasing parameters seen as a way to improve performance. Yet the high costs associated with these massive models, from computing resources to power consumption, have raised questions about their scalability.
Microsoft offers an alternative path, focused on efficiency. The release of rStar-Math further underscores this commitment by demonstrating how SLMs can rival – and in some cases exceed – the capabilities of their larger counterparts.
Microsoft’s dual releases of Phi-4 and the rStar-Math paper suggest that compact, specialized models can provide powerful alternatives to the industry’s largest systems.
Additionally, by outperforming larger competitors in key benchmarks, these models challenge the idea that bigger is always better. They open the door for mid-sized organizations and academic researchers to access cutting-edge capabilities without the financial or environmental burden of massive models.