UltraCUA: A Foundation Computer-Use Agent Model That Bridges the Gap Between General-Purpose GUI Agents and Specialized API-Based Agents

Computer-use agents have been limited to primitives. They click, they type, they scroll. Long action chains amplify grounding errors and waste steps. Apple researchers introduce UltraCUA, a foundation model that builds a hybrid action space, letting an agent interleave low-level GUI actions with high-level programmatic tool calls. The model chooses the cheaper and more reliable move at each step. The approach improves success rates and reduces steps on OSWorld, and it transfers to WindowsAgentArena without Windows-specific training.

What does hybrid action change?
Hybrid action treats tools as first-class actions. A tool call encapsulates a multi-step operation as a single function with a clear signature and a docstring. A click or a key press is still available when no programmatic path exists. The agent learns to alternate between both modes. The goal is to reduce cascading errors and cut the number of steps. The research team positions this as a bridge between GUI-only computer-use agents and tool-centric agent frameworks.
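
To make the idea concrete, here is a minimal sketch of a hybrid action space in Python. It is not the paper's implementation: the class names, the `execute` dispatcher, and the `set_font_size` tool are all hypothetical and exist only to show how one trajectory can mix GUI primitives with tool calls.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple, Union


@dataclass(frozen=True)
class GuiAction:
    """A low-level primitive such as a click, key press, or scroll."""
    kind: str
    args: Tuple = ()


@dataclass(frozen=True)
class ToolCall:
    """A high-level call that stands in for a multi-step GUI sequence."""
    name: str
    kwargs: Tuple[Tuple[str, object], ...] = ()


Action = Union[GuiAction, ToolCall]


def execute(action: Action, tools: Dict[str, Callable[..., str]]) -> str:
    """Dispatch one step either to a registered tool or to the GUI backend."""
    if isinstance(action, ToolCall):
        if action.name not in tools:
            raise KeyError(f"unknown tool: {action.name}")
        return tools[action.name](**dict(action.kwargs))
    # Hypothetical GUI backend: this sketch only describes the primitive.
    return f"gui:{action.kind}{action.args}"


def set_font_size(size: int) -> str:
    """Hypothetical tool: set the font size of the current selection."""
    return f"font size set to {size}"


TOOLS = {"set_font_size": set_font_size}

trajectory = [
    GuiAction("click", (120, 340)),              # no tool covers this step
    ToolCall("set_font_size", (("size", 14),)),  # one call replaces several clicks
]
results = [execute(step, TOOLS) for step in trajectory]
```

Both action types pass through the same dispatcher, so a policy can emit either kind at any step without changing the control loop.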


Scaled tool acquisition
UltraCUA builds its tool library with an automated pipeline. The system extracts keyboard shortcuts and commands from software documentation. It integrates open-source implementations from agent toolkits. It also uses coding agents to synthesize new tools. Each tool is a callable interface that hides a long GUI sequence. The research team reports coverage across 10 desktop domains with 881 tools. The largest buckets include VS Code with 135 tools and LibreOffice Writer with 123 tools. Thunderbird and GIMP also have deep coverage.
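
One plausible way to organize such a library is a registry that records each tool's domain, signature, and docstring so they can be shown to the model. The sketch below is an assumption for illustration, not the authors' pipeline: the `tool` decorator, the `REGISTRY` dictionary, and the two example tools are all hypothetical.

```python
import inspect
from collections import Counter
from typing import Callable, Dict, NamedTuple


class ToolSpec(NamedTuple):
    domain: str
    name: str
    signature: str
    doc: str
    fn: Callable


REGISTRY: Dict[str, ToolSpec] = {}


def tool(domain: str):
    """Register a function as a tool, capturing its signature and docstring."""
    def decorator(fn: Callable) -> Callable:
        REGISTRY[fn.__name__] = ToolSpec(
            domain=domain,
            name=fn.__name__,
            signature=str(inspect.signature(fn)),
            doc=inspect.getdoc(fn) or "",
            fn=fn,
        )
        return fn
    return decorator


@tool("vs_code")
def open_file(path: str) -> str:
    """Open a file in the editor by path."""
    return f"opened {path}"  # hypothetical stand-in for the real action


@tool("libreoffice_writer")
def export_pdf(output_path: str) -> str:
    """Export the current document as a PDF."""
    return f"exported {output_path}"  # hypothetical stand-in


def render_prompt(domain: str) -> str:
    """List the tools of one domain as 'name(signature): docstring' lines."""
    return "\n".join(
        f"{s.name}{s.signature}: {s.doc}"
        for s in REGISTRY.values()
        if s.domain == domain
    )


tools_per_domain = Counter(spec.domain for spec in REGISTRY.values())
```

Filtering by domain in `render_prompt` is one simple way to limit how many tools are exposed at once, which matters when the context budget is fixed.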


Verifiable tasks and trajectories
Training requires grounded supervision and stable rewards. UltraCUA uses a dual synthetic data engine. One pipeline composes atomic verifiers for browsers, files, images, and system state, then generates tasks that satisfy those checks. The other pipeline explores the OS and proposes context-aligned tasks, which are then verified. The result is 17,864 verifiable tasks across 10 domains such as Chrome, LibreOffice, GIMP, VS Code, and other applications. Chrome has 2,826 tasks. The LibreOffice suite accounts for 5,885 tasks. Multi-app tasks reach 2,113.
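
The verifier-first pipeline can be pictured as small boolean checks that are composed into a task-level check. The following sketch assumes a toy dictionary as the environment state; the verifier names, the state keys, and the example task are hypothetical and do not come from the paper.

```python
from typing import Callable, Dict

State = Dict[str, object]
Verifier = Callable[[State], bool]


def file_exists(path: str) -> Verifier:
    """Atomic check: the given path is present in the file listing."""
    return lambda state: path in state.get("files", [])


def tab_open(url: str) -> Verifier:
    """Atomic check: the browser has a tab open at the given URL."""
    return lambda state: url in state.get("open_tabs", [])


def all_of(*checks: Verifier) -> Verifier:
    """Compose atomic checks; the task passes only if every check passes."""
    return lambda state: all(check(state) for check in checks)


# A task pairs an instruction with the composed verifier it must satisfy.
task = {
    "instruction": "Open example.com and save the report to ~/report.pdf",
    "verify": all_of(tab_open("https://example.com"), file_exists("~/report.pdf")),
}


def outcome_reward(state: State) -> float:
    """Sparse reward: 1.0 if the final state passes verification, else 0.0."""
    return 1.0 if task["verify"](state) else 0.0


done = {"open_tabs": ["https://example.com"], "files": ["~/report.pdf"]}
partial = {"open_tabs": ["https://example.com"], "files": []}
```

Because the checks are programmatic, the same verifier can both filter rollouts for supervised data and supply the reward during reinforcement learning.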


A multi-agent rollout produces successful hybrid trajectories. The planner uses OpenAI o3 for decision making. The grounder uses GTA1-7B for accurate visual grounding. The rollout yields about 26.8K successful trajectories that show when to use a tool and when to act in the GUI. These trajectories are the core of the supervised phase.

Training method
Training has two phases. Phase 1 is supervised fine-tuning. The models train for three epochs at a learning rate of 2e-5 on successful trajectories. The loss is applied turn-wise to avoid over-weighting early steps. Phase 2 is online reinforcement learning. The models train for 150 steps at a learning rate of 1e-6 on verified tasks sampled by difficulty. Policy optimization follows a GRPO variant with clip-higher, and removes KL regularization and format rewards. The reward combines a sparse task outcome with a tool-use term. Experiments use NVIDIA H100 GPUs. The context is kept near 32K by controlling the number of exposed tools.
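
As a rough illustration of the reinforcement learning phase, the sketch below shows group-relative advantages and an asymmetric ("clip-higher") clipped objective with no KL term. It is not the authors' code. The clip bounds of 0.2 and 0.28, the `tool_bonus` value, and the rule that the bonus applies only on success are assumptions chosen for the example.

```python
from statistics import mean, pstdev
from typing import List


def total_reward(success: bool, used_tool: bool, tool_bonus: float = 0.1) -> float:
    """Sparse outcome reward plus a tool-use term (form and weight assumed)."""
    outcome = 1.0 if success else 0.0
    return outcome + (tool_bonus if (success and used_tool) else 0.0)


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward against the other rollouts of the same task."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_objective(
    ratio: float,
    advantage: float,
    clip_low: float = 0.2,    # assumed lower bound
    clip_high: float = 0.28,  # assumed, larger upper bound ("clip-higher")
) -> float:
    """Per-token surrogate with asymmetric clipping and no KL penalty."""
    clipped = min(max(ratio, 1.0 - clip_low), 1.0 + clip_high)
    return min(ratio * advantage, clipped * advantage)


# Four rollouts of one task: (success, used_tool)
rollouts = [(True, True), (False, False), (False, True), (True, False)]
rewards = [total_reward(s, t) for s, t in rollouts]
advantages = group_advantages(rewards)
```

With a wider upper bound, a token whose probability ratio rises to 1.25 under a positive advantage is left unclipped, whereas a symmetric bound of 0.2 would cap it at 1.2.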

Results on OSWorld
UltraCUA improves success at both the 7B and 32B scales. Under a 15-step budget, UltraCUA-32B achieves a success rate of 41.0 percent. OpenCUA-32B reaches 29.7 percent. That is an absolute gain of 11.3 points. UltraCUA-7B reaches 28.9 percent. UI-TARS-1.5-7B reaches 23.4 percent. Gains persist under a 50-step budget. A per-domain breakdown shows consistent lifts across Chrome, Writer, VS Code, and cross-application tasks. Average steps decrease against the baselines. These shifts indicate better action selection rather than simply more attempts.
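
The gaps between the quoted 15-step scores can be checked with a few lines of Python. The helper names are ours; the relative figures are computed against the compared models listed above and are a different quantity from the paper's reported average improvement over base models.

```python
def absolute_gain(new: float, base: float) -> float:
    """Difference in percentage points, rounded to one decimal."""
    return round(new - base, 1)


def relative_gain(new: float, base: float) -> float:
    """Improvement as a percentage of the baseline score."""
    return round(100.0 * (new - base) / base, 1)


# 15-step success rates quoted in the text, in percent.
gain_32b = absolute_gain(41.0, 29.7)  # UltraCUA-32B vs OpenCUA-32B -> 11.3
gain_7b = absolute_gain(28.9, 23.4)   # UltraCUA-7B vs UI-TARS-1.5-7B -> 5.5
rel_32b = relative_gain(41.0, 29.7)   # -> 38.0
rel_7b = relative_gain(28.9, 23.4)    # -> 23.5
```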




Cross-platform transfer to WindowsAgentArena
UltraCUA trains only on Ubuntu-based OSWorld data. The model is then evaluated on WindowsAgentArena. UltraCUA-7B achieves a success rate of 21.7 percent. This exceeds UI-TARS-1.5-7B at 18.1 percent and a Qwen2 baseline trained on Windows data at 13.5 percent. The result suggests that hybrid action strategies learned on one platform transfer to other platforms. The paper highlights this as zero-shot platform generalization.


Key takeaways
- UltraCUA formalizes a hybrid action space that lets a single agent alternate between GUI primitives and programmatic tool calls, which reduces long, error-prone action chains.
- The research team scales a reusable tool library through an automated pipeline and pairs it with a synthetic data engine, yielding more than 17,000 verifiable tasks for training and evaluation.
- Training follows a two-phase recipe: supervised fine-tuning on successful hybrid trajectories, then online reinforcement learning on verifiable tasks, which optimizes when to call tools and when to act in the GUI.
- On OSWorld, UltraCUA reports an average relative improvement of 22 percent over base models and 11 percent fewer steps, showing gains in reliability and efficiency.
- The 7B model achieves a success rate of 21.7 percent on WindowsAgentArena without Windows-specific training, which shows cross-platform generalization of the hybrid action policy.

UltraCUA moves computer-use agents from brittle primitive action chains to a hybrid action policy that combines GUI primitives with programmatic tool calls. It scales tools through an automated pipeline and pairs them with a synthetic data engine that yields more than 17,000 verifiable tasks, enabling supervised fine-tuning and online reinforcement learning on grounded signals. The reported results include a relative improvement of 22 percent on OSWorld with 11 percent fewer steps, and 21.7 percent success on WindowsAgentArena without Windows-specific training, which indicates cross-platform transfer of the policy.

Michal Sutter is a data scientist with a Master of Science in Data Science from the University of Padova. With a strong foundation in statistical analysis, machine learning, and data engineering, Michal excels at turning complex data into actionable insights.



