ByteDance’s UI-TARS can take over your computer, outperforms GPT-4o and Claude

MT HANNACH


A new AI agent has emerged from TikTok’s parent company to take control of your computer and perform complex workflows.

A bit like Anthropic’s Computer Use, ByteDance’s new UI-TARS understands graphical user interfaces (GUIs), applies reasoning and takes autonomous actions, step by step.

Trained on approximately 50 billion tokens and offered in 7B and 72B parameter versions, the PC/MacOS agent achieves state-of-the-art (SOTA) performance on more than 10 GUI benchmarks spanning performance, perception, grounding and overall agent capabilities, consistently beating OpenAI’s GPT-4o, Claude and Google’s Gemini.

“Through iterative training and reflection tuning, UI-TARS continually learns from its mistakes and adapts to unforeseen situations with minimal human intervention,” write researchers from ByteDance and Tsinghua University in a new research paper.


How UI-TARS explains its thinking

UI-TARS works across desktop, mobile, and web applications, using multimodal input (text, images, interactions) to understand visual environments.

Its user interface has two tabs: one on the left showing its step-by-step “thinking,” and a larger one on the right where it pulls up files, websites and applications and automatically takes action.

For example, in a demo video released today, the model is asked to “Search for round trip flights from SEA to New York on the 5th and back on the 10th of next month and filter by price in ascending order.”

In response, UI-TARS goes to the Delta Airlines website, fills in the “from” and “to” fields, clicks on the relevant dates, and sorts and filters by price, explaining each step in its think box before acting.

In another scenario, it is asked to install an autoDocstring extension in VS Code. Here is its step-by-step reasoning as it completes this task (a simplified sketch of this kind of agent loop follows the list):

  • It first reports that “to proceed with installing the autoDocstring extension in VS Code, I must first open the VS Code application.”
  • Acknowledging that the VS Code window is not fully loaded, the model notes that “it is necessary to wait until the application has finished initializing before proceeding with the installation of the autoDocstring extension.” This ensures that all required functionality is available for the next steps.
  • It then explains that it needs to access the Extensions view in VS Code, which can be done by clicking on a tab in the left sidebar.
  • Running into a small problem, it explains: “The previous attempt to click on the ‘Extensions’ tab may not have been precise enough, so I’m going to try clicking on it again to make sure the Extensions view opens correctly.”
  • Once in the Extensions view, it types “autoDocstring” before concluding: “Since the installation process has already started, no further action is required at this time. I will wait for the installation to complete and confirm its success once the process is complete.”
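
The walkthrough above is, in essence, a loop: capture the screen, reason about the next move in the thinking pane, act, and repeat. The Python sketch below illustrates that general pattern only; it is not ByteDance’s code, and the names (Step, capture_screen, next_step, execute) are hypothetical stand-ins.

```python
from dataclasses import dataclass


@dataclass
class Step:
    thought: str   # what appears in the left-hand "thinking" tab
    action: str    # e.g. 'click("Extensions tab")' or 'type("autoDocstring")'


def run_task(instruction: str, model, env, max_steps: int = 20) -> list[Step]:
    """Perceive the screen, reason about the next step, act, and repeat."""
    history: list[Step] = []
    for _ in range(max_steps):
        screenshot = env.capture_screen()                          # perceive
        step = model.next_step(instruction, screenshot, history)   # reason
        history.append(step)
        if step.action == "finished":                              # model decides the task is done
            break
        env.execute(step.action)                                   # act on the GUI
    return history
```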

Outperforming the competition

Across a variety of benchmarks, the researchers report that UI-TARS consistently outperformed OpenAI’s GPT-4o; Anthropic’s Claude-3.5-Sonnet; Gemini-1.5-Pro and Gemini-2.0; four Qwen models; and numerous academic models.

For example, on VisualWebBench – which measures a model’s ability to ground web elements, including web page question answering and optical character recognition – UI-TARS 72B scored 82.8%, outperforming GPT-4o (78.5%) and Claude 3.5 (78.2%).

It also performed significantly better on the WebSRC (understanding semantic content and layout in web contexts) and ScreenQA-short (understanding complex mobile screen layouts and web structure) benchmarks. UI-TARS-7B achieved a leading score of 93.6% on WebSRC, while UI-TARS-72B achieved 88.6% on ScreenQA-short, outperforming Qwen, Gemini, Claude 3.5 and GPT-4o.

“These results demonstrate the superior perception and comprehension capabilities of UI-TARS in web and mobile environments,” the researchers write. “Such perceptual capability lays the foundation for agent tasks, where an accurate understanding of the environment is crucial for task execution and decision-making.”

UI-TARS also showed impressive results on ScreenSpot Pro and ScreenSpot v2, which evaluate a model’s ability to understand and locate elements in GUIs. Additionally, the researchers tested its ability to plan multi-step actions and low-level tasks in mobile environments, and benchmarked it on OSWorld (which evaluates open-ended computer tasks) and AndroidWorld (which scores autonomous agents on 116 programmatic tasks across 20 mobile applications).


Under the hood

To help it take step-by-step actions and recognize what it sees, UI-TARS was trained on a large-scale dataset of screenshots with parsed metadata, including element description and type, visual description, bounding boxes (positional information), element function, and text from various websites, applications and operating systems. This allows the model to provide a complete and detailed description of a screenshot, capturing not only the elements, but also their spatial relationships and overall layout.
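
To make that concrete, here is a hypothetical example of what one annotated element in such a screenshot dataset might look like. The field names and values are invented for illustration; the paper’s actual schema is not reproduced here.

```python
# Invented example of a per-element screenshot annotation: element type, visual
# description, bounding box (position), function, and text, plus a screenshot-level caption.
element_annotation = {
    "element_type": "button",
    "visual_description": "blue rounded button with white label",
    "bounding_box": {"x": 412, "y": 198, "width": 96, "height": 32},  # pixel coordinates
    "function": "submits the flight search form",
    "text": "Search flights",
}

screenshot_record = {
    "platform": "web",              # website, application or operating system
    "screenshot": "screen_0001.png",
    "elements": [element_annotation],
    "caption": "Flight search page with origin/destination fields and a search button",
}
```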

The model also uses state transition captioning to identify and describe the differences between two consecutive screenshots and determine whether an action, such as a mouse click or keyboard entry, has occurred. Meanwhile, Set-of-Marks (SoM) prompting allows it to overlay distinct marks (letters, numbers) on specific regions of an image.
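
As an illustration of the SoM idea, the short Python sketch below (assuming the Pillow imaging library is installed) overlays numbered boxes on regions of an image so a model can refer to “mark 1” or “mark 2” rather than raw pixel coordinates. The regions and filenames are made up; this is not the UI-TARS implementation.

```python
from PIL import Image, ImageDraw


def overlay_marks(image, regions):
    """Draw a numbered red rectangle over each (left, top, right, bottom) region."""
    marked = image.copy()
    draw = ImageDraw.Draw(marked)
    for i, box in enumerate(regions, start=1):
        draw.rectangle(box, outline="red", width=3)
        draw.text((box[0] + 4, box[1] + 4), str(i), fill="red")
    return marked


# A blank canvas stands in for a real screenshot here
screenshot = Image.new("RGB", (800, 600), "white")
marked = overlay_marks(screenshot, [(40, 40, 200, 90), (40, 120, 200, 170)])
marked.save("som_overlay.png")
```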

The model is equipped with short- and long-term memory to manage tasks at hand while retaining historical interactions to improve subsequent decision-making. The researchers trained the model to perform both System 1 (fast, automatic, and intuitive) and System 2 (slow and deliberate) reasoning. This allows for multi-step decision-making, reflective thinking, milestone recognition and error correction.
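
A rough way to picture that memory split is a small buffer of recent steps for the task at hand alongside a growing log of past interactions. The sketch below is a simplification under that assumption, not the system’s actual design; the class and method names are invented.

```python
from collections import deque


class AgentMemory:
    """Hypothetical split between short-term (task-at-hand) and long-term memory."""

    def __init__(self, short_term_window: int = 5):
        self.short_term = deque(maxlen=short_term_window)  # recent observations/actions
        self.long_term: list[dict] = []                    # full history of interactions

    def record(self, step: dict) -> None:
        self.short_term.append(step)
        self.long_term.append(step)

    def context(self) -> dict:
        """What would be handed back to the model at each decision point."""
        return {"recent_steps": list(self.short_term), "history_size": len(self.long_term)}
```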

The researchers emphasized that it is essential that the model be able to maintain consistent goals and use trial and error to hypothesize, test and evaluate potential actions before completing a task. They introduced two types of data to support this: error correction data and post-reflection data. For error correction, they identified errors and labeled corrective actions; for post-reflection, they simulated the recovery steps.
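
For illustration, the two data types might be represented along these lines; the structure and field names here are assumptions for clarity, not the paper’s actual format.

```python
# Hypothetical error-correction sample: an identified mistake paired with a labeled fix.
error_correction_sample = {
    "observation": "Extensions view did not open after the click",
    "erroneous_action": 'click(target="Extensions tab", x=18, y=240)',   # imprecise click
    "corrective_action": 'click(target="Extensions tab", x=24, y=244)',  # labeled correction
}

# Hypothetical post-reflection sample: a simulated recovery step after the error.
post_reflection_sample = {
    "thought": "The previous click may not have been precise enough; I will retry.",
    "recovery_action": 'click(target="Extensions tab")',
    "expected_outcome": "Extensions view opens and the search box becomes available",
}
```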

“This strategy ensures that the agent not only learns to avoid errors, but also dynamically adapts when they occur,” the researchers write.

Clearly, UI-TARS has some impressive capabilities, and it will be interesting to see its use cases evolve in the increasingly competitive AI agent space. As the researchers note: “Looking ahead, while native agents represent a significant step forward, the future lies in the integration of active and lifelong learning, where agents autonomously drive their own learning through continuous, real-world interactions.”

The researchers point out that Claude Computer Use “performs strongly in web-based tasks but struggles significantly with mobile scenarios, indicating that Claude’s GUI operation ability has not been effectively transferred to the mobile domain.”

In contrast, “UI-TARS exhibits excellent performance in both website and mobile domains.”
