OpenAI is gradually inviting selected users to test a brand-new set of reasoning models named o3 and o3-mini, successors to the o1 and o1-mini models that entered full release earlier this month.
OpenAI o3, so named to avoid copyright issues with phone company O2 and because CEO Sam Altman says the company “has a history of being really bad at names,” was announced today on the final day of the “12 Days of OpenAI” live broadcasts.
Altman said the two new models would initially be released to selected third-party researchers for safety testing, with o3-mini expected by the end of January 2025 and o3 “soon after.”
“We see this as the beginning of the next phase of AI, where you can use these models to do increasingly complex tasks that require a lot of reasoning,” Altman said. “For the last day of this event, we thought it would be fun to move from one frontier model to another.”
The announcement comes just one day after Google unveiled and made publicly available its new Gemini 2.0 Flash Thinking model, another rival “reasoning” model that, unlike the OpenAI o1 series, lets users see the steps of its “thinking” process documented in text bullet points.
The release of Gemini 2.0 Flash Thinking and now the o3 announcement show that competition between OpenAI and Google, and across the broader field of AI model providers, is entering a new and intense phase as they offer not only LLMs and multimodal models but also advanced reasoning models. These may be better suited to harder problems in science, math, technology, physics and more.
Best performance on third-party benchmarks to date
Altman also said the o3 model is “incredible at coding,” and benchmarks shared by OpenAI back this up, showing the model surpassing even o1’s performance on programming tasks.

• Outstanding coding performance: o3 outperforms o1 by 22.8 percentage points on SWE-Bench Verified and achieves a Codeforces score of 2727, surpassing the OpenAI Chief Scientist’s score of 2665.
• Mastery of mathematics and science: o3 scores 96.7% on the AIME 2024 exam, missing only one question, and scores 87.7% on the GPQA Diamond, far surpassing the performance of human experts.
• Frontier benchmarks: The model sets new records on challenging evaluations such as EpochAI’s Frontier Math, solving 25.2% of problems where no other model exceeds 2%. On the ARC-AGI test, o3 more than triples o1’s score, exceeding 85% (as verified live by the ARC Prize team), a significant milestone in conceptual reasoning.
Deliberative alignment
Alongside these advancements, OpenAI reaffirmed its commitment to safety and alignment.
The company presented new research on deliberative alignment, a technique that helped make o1 its most robust and aligned model to date.
The technique embeds human-written safety specifications into models, enabling them to reason explicitly about these policies before generating responses.
The approach aims to address common safety failures in LLMs, such as vulnerability to jailbreak attacks and excessive refusal of benign prompts, by equipping models with chain-of-thought (CoT) reasoning that lets them recall and apply safety specifications dynamically during inference.
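To make the idea concrete, here is a minimal, hypothetical sketch in Python of what spec-conditioned chain-of-thought reasoning could look like at inference time. The `SAFETY_SPEC` rules and the `build_prompt` helper are illustrative assumptions, not OpenAI’s actual specification or implementation; in deliberative alignment the policy reasoning is trained into the model itself, whereas this toy version merely approximates the concept by injecting a spec into a prompt.

```python
# Illustrative sketch only: deliberative alignment trains the model to recall
# its safety specification in chain-of-thought; this toy version approximates
# the idea by injecting a (hypothetical) spec into the prompt instead.

SAFETY_SPEC = """\
1. Refuse requests that facilitate serious harm (weapons, malware, etc.).
2. Answer benign requests fully; do not over-refuse.
3. For dual-use topics, give high-level information without operational detail.
"""

def build_prompt(user_request: str) -> str:
    """Assemble a prompt that asks the model to reason over the spec first."""
    return (
        "You are given a safety specification:\n"
        f"{SAFETY_SPEC}\n"
        "Before answering, think step by step about which rules apply to the\n"
        "request below, then produce a final answer consistent with them.\n\n"
        f"Request: {user_request}\n"
        "Reasoning:"
    )

if __name__ == "__main__":
    # The assembled prompt would be sent to a reasoning model; it is printed
    # here so the sketch runs without any API dependency.
    print(build_prompt("How do I secure my home Wi-Fi network?"))
```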
Deliberative alignment improves on earlier methods such as reinforcement learning from human feedback (RLHF) and Constitutional AI, which use safety specifications only to generate training labels rather than embedding the policies directly in the models.
By fine-tuning LLMs on safety-related prompts and their associated specifications, the approach produces models that can reason about policy without relying heavily on human-labeled data.
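As a rough illustration of what such fine-tuning data might look like, the sketch below writes a single training example pairing a prompt with spec-citing chain-of-thought and a final completion. The JSONL schema, field names, and rule reference are assumptions made for illustration; OpenAI’s paper describes the method, but this is not its actual data format.

```python
# Hypothetical sketch of a deliberative-alignment-style training example:
# a prompt, a chain of thought that cites the safety spec, and a completion.
# Field names and schema are illustrative assumptions, not OpenAI's format.
import json

example = {
    "prompt": "Explain how phishing attacks work.",
    "chain_of_thought": (
        "Rule 3 applies: this is a dual-use security topic. "
        "A conceptual explanation is allowed; step-by-step attack "
        "instructions are not. I will answer at a high level."
    ),
    "completion": (
        "Phishing tricks people into revealing credentials by imitating "
        "trusted senders. Defenses include verifying sender domains and "
        "enabling multi-factor authentication."
    ),
}

# Append the example to a JSONL file, a common format for fine-tuning data.
with open("deliberative_alignment_examples.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(example) + "\n")
```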
Results shared by OpenAI researchers in a new, non-peer-reviewed paper indicate that the method improves performance on safety benchmarks, reduces harmful outputs, and improves adherence to content and style guidelines.
The key findings highlight the o1 model’s advances over predecessors such as GPT-4o and other state-of-the-art models. Deliberative alignment enables the o1 series to excel at jailbreak resistance and deliver safe completions while minimizing excessive refusals of benign prompts. The method also aids out-of-distribution generalization, demonstrating robustness in multilingual and encoded-jailbreak scenarios. These improvements align with OpenAI’s goal of making AI systems safer and more interpretable as their capabilities grow.
This research will also play a key role in aligning o3 and o3-mini, ensuring their capabilities are both powerful and responsibly deployed.
How to apply for access to test o3 and o3-mini
Applications for early access are now open on the OpenAI website and will close on January 10, 2025.
Applicants must fill out an online form that asks for a variety of information, including research focus, past experience, and links to previously published papers and code repositories on GitHub, and must select which of the models (o3 or o3-mini) they wish to test, as well as what they plan to use them for.
Selected researchers will have access to o3 and o3-mini to explore their capabilities and contribute to security assessments, although OpenAI’s form warns that o3 will not be available for several weeks.

Researchers are encouraged to develop robust evaluations, create controlled demonstrations of high-risk capabilities, and test models in scenarios that aren’t possible with widely adopted tools.
This initiative builds on the company’s established practices, including rigorous internal safety testing, collaborations with organizations such as the US and UK AI Safety Institutes, and its Preparedness Framework.
OpenAI will review applications on a rolling basis, with selections beginning immediately.
A new leap forward?
The introduction of o3 and o3-mini marks a quantum leap in AI performance, particularly in areas requiring advanced reasoning and problem-solving capabilities.
With their exceptional results in coding, mathematics, and conceptual benchmarks, these models highlight the rapid progress being made in AI research.
By inviting the broader research community to collaborate on security testing, OpenAI aims to ensure that these capabilities are deployed responsibly.