Join our daily and weekly newsletters for the latest updates and exclusive content on AI coverage. Learn more
Typically, developers focus on reducing inference time – the period between when the AI receives a prompt and provides a response – to obtain insights faster.
But when it comes to adversarial robustness, the Openai researchers say: not so fast. They propose that increasing the time a model must “think” – calculating inference time – can help develop defenses against adversarial attacks.
The company used its own O1-Preview and O1-Mini models to test this theory, launching a variety of static and adaptive attack methods – image-based manipulations, intentionally providing incorrect answers to math problems and models overwhelming with information (“Many- shot in jailbreak”). They then measured the probability of attack success based on the amount of computation the model used in inference.
“We see that in many cases this probability decays – often to near zero – as the inference time calculation increases,” the researchers Write in a blog post. “Our claim is not that these particular models are unbreakable – we know they are – but that scaling the inference time calculation gives improved robustness for a variety of settings and attacks .”
From simple Q/A to complex mathematics
Large Language Models (LLMs) are becoming increasingly sophisticated and autonomous – in some cases essentially take back computers For humans to browse the web, run code, make appointments, and perform other tasks autonomously – and as they do, their attack surface gets wider and bigger.
Yet adversarial robustness continues to be a stubborn problem, with progress in solving it, the Openai researchers point out – even as it is increasingly critical as models take more actions with real impacts.
“Ensuring that agentic models work reliably when browsing the web, sending emails, or uploading code to repositories can be seen as analogous to ensuring that self-driving cars drive without accidents,” they write. they in a New research paper. “As with self-driving cars, an agent transmitting the wrong email or creating security vulnerabilities could well have far-reaching real-world consequences.”
To test the robustness of O1-MINI and O1-Preview, the researchers tried a number of strategies. First, they examined the models’ ability to solve both simple (basic addition and multiplication) and more complex math problems of the Mathematics Dataset (which features 12,500 math competition questions).
They then set “goals” for the opponent: bring out pattern 42 instead of the correct answer; to produce the correct answer plus one; or post the correct answer Hours Seven. Using a neural network to score, the researchers found that an increase in “thinking” time allowed the models to calculate the correct answers.
They also adapted the SimpleQA Factuality Benchmarka dataset of questions intended to be difficult to solve for models without navigation. The researchers injected adversarial prompts into web pages that the AI crawled and found that, with higher computational times, they could detect inconsistencies and improve factual accuracy.
Ambiguous nuances
In another method, researchers used conflicting images to confuse the models; Again, more “thinking” time improved recognition and reduction of errors. Finally, they tried a series of “overuse prompts” from the StrongReject Benchmarkdesigned so that victim models must respond with specific, harmful information. This helped test the Model Membership content policy. However, although increasing inference time improved resistance, some prompts were able to bypass defenses.
Here, the researchers call the differences between the tasks “ambiguous” and “unambiguous.” Mathematics, for example, is undoubtedly unambiguous – for every problem x, there is a corresponding fundamental truth. However, for more ambiguous tasks like abuse prompts, “even human evaluators often have difficulty agreeing that the output is harmful and/or violates the content policies that the model is intended to follow,” they emphasize.
For example, if an abusive prompt seeks advice on how to plagiarize without detection, it is unclear whether an output simply providing general information about plagiarism methods is in fact detailed enough to support harmful actions.
“In the case of ambiguous tasks, there are settings where the attacker successfully finds ‘gaps’ and its success rate does not decay with the amount of inference time computation,” the researchers concede.
Defending Against Jailbreaking and Red Teaming
In conducting these tests, OpenAI researchers explored a variety of attack methods.
One is multi-shot jailbreaking or exploiting a template’s layout to follow a few example hits. Adversaries “make” the context with a large number of examples, each demonstrating an instance of a successful attack. Models with higher calculation times were able to detect and mitigate them more frequently and successfully.
Soft tokens, on the other hand, allow adversaries to directly manipulate integration vectors. Although increasing inference time helped here, the researchers point out that there is a need for better mechanisms to defend against sophisticated vector-based attacks.
The researchers also conducted human red team attacks, with 40 expert testers looking for prompts to cause policy violations. Red teams executed attacks at five levels of inference time, specifically targeting erotic and extremist content, illicit behavior, and self-harm. To ensure unbiased results, they did blind and randomized testing and also rotated trainers.
In a newer method, the researchers performed an adaptive language model program (LMP) attack, which emulates the behavior of human red teams that rely heavily on iterative trial and error. In a looping process, attackers received feedback on previous failures and then used this information for subsequent attempts and rapid rephasing. This continued until they finally got a successful attack or completed 25 iterations without any attacks.
“Our setup allows the attacker to adapt his strategy over the course of multiple attempts, based on descriptions of the defender’s behavior in response to each attack,” the researchers write.
Exploiting inference time
During their research, Openai found that attackers also actively exploit inference time. One of these methods they have dubbed “think less” – opponents essentially say that the models reduce the calculation, thereby increasing their sensitivity to error.
Similarly, they identified a failure mode in reasoning models that they called “nerd shooters.” As the name suggests, this happens when a model spends significantly more time reasoning than a given task requires. With these “aberrant” thought chains, models essentially become trapped in unproductive thought loops.
The researchers note: “Like the ‘think less’ attack, this is a new approach to attacking[ing] Models of reasoning, and that must be taken into account to ensure that the attacker cannot do either reasoning, or spend their computational reasoning in an unproductive manner. »