What Are 'Computer-Use Agents'? From Web to OS: A Technical Explainer

TL;DR: Computer-use agents are VLM-driven UI agents that act like users on unmodified software. Baselines on OSWorld started at 12.24% (human: 72.36%); Claude Sonnet 4.5 now reports 61.4%. Gemini 2.5 Computer Use leads several web benchmarks (Online-Mind2Web 69.0%, WebVoyager 88.9%) but is not yet optimized for OS-level control. The next steps center on OS-level robustness, sub-second action loops, and hardened safety policies, with transparent training and evaluation recipes emerging from the open community.
Definition
Computer-use agents (also called GUI agents) are vision-language models that observe the screen, ground UI elements, and execute bounded UI actions (click, type, scroll, key combos) to complete tasks in unmodified applications and browsers. Public implementations include Anthropic's Computer Use, Google's Gemini 2.5 Computer Use, and OpenAI's Computer-Using Agent.
Control Loop
Typical runtime loop: (1) capture a screenshot plus state, (2) plan the next action with spatial/semantic grounding, (3) act through a constrained action schema, (4) verify the result and retry on failure. Vendors document standardized action sets and guardrails; audited harnesses normalize comparisons.
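The four steps above can be sketched as a small driver loop. This is a minimal illustration, not any vendor's API: `Action`, `run_agent`, `demo`, and the `observe`/`plan`/`act`/`verify` callbacks are hypothetical names, and the "screen" is a plain dictionary standing in for a screenshot. On a failed verification the sketch re-observes and re-plans rather than blindly repeating the action.

```python
from dataclasses import dataclass, field


@dataclass
class Action:
    verb: str                                  # e.g. "click_at", "type"
    args: dict = field(default_factory=dict)   # verb-specific arguments


def run_agent(observe, plan, act, verify, max_steps=20, max_failures=2):
    """Observe -> plan -> act -> verify, giving up after repeated failures.

    All four callbacks are supplied by the caller (hypothetical interface).
    Returns a list of (action, verified_ok) pairs.
    """
    history = []
    failures = 0
    for _ in range(max_steps):
        before = observe()                 # (1) capture screenshot + state
        action = plan(before)              # (2) plan the next action
        if action is None:                 # planner signals task complete
            break
        act(action)                        # (3) act via the action schema
        after = observe()
        ok = verify(before, after, action)  # (4) verify; retry by re-planning
        history.append((action, ok))
        failures = 0 if ok else failures + 1
        if failures > max_failures:
            break
    return history


def demo():
    """Toy task: type "hi" into a field whose first keystroke is dropped."""
    screen = {"text": ""}
    calls = {"n": 0}
    target = "hi"

    def observe():
        return dict(screen)

    def plan(obs):
        typed = obs["text"]
        if typed == target:
            return None
        return Action("type", {"text": target[len(typed)]})

    def act(action):
        calls["n"] += 1
        if calls["n"] == 1:
            return                         # simulate a lost input event
        screen["text"] += action.args["text"]

    def verify(before, after, action):
        return after["text"] == before["text"] + action.args["text"]

    history = run_agent(observe, plan, act, verify)
    return screen["text"], [ok for _, ok in history]
```

In the toy run, the first keystroke is lost, verification fails, and the loop recovers by planning the same keystroke again from a fresh observation.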
Benchmark Landscape
- OSWorld (HKU, Apr 2024): 369 real desktop/web tasks spanning OS file I/O and multi-app workflows. At launch: human 72.36%, best model 12.24%.
- State of play (2025): Anthropic's Claude Sonnet 4.5 reports 61.4% on OSWorld (sub-human, but a large jump from the previous 42.2%).
- Live-web benchmarks: Google's Gemini 2.5 Computer Use reports 69.0% on Online-Mind2Web (official leaderboard), 88.9% on WebVoyager, and 69.7% on AndroidWorld; the current model is optimized for browsers and not yet for OS-level control.
- Online-Mind2Web spec: 300 tasks across 136 live websites; results are verified by Princeton's HAL and a public Hugging Face space.
Architecture Components
- Perception & grounding: periodic screenshots, OCR/text extraction, element localization, coordinate inference.
- Planning: a multi-step policy with recovery; often post-trained or RL-tuned for UI control.
- Action schema: bounded verbs (click_at, type, key_combo, open_app), with benchmark-specific exclusions to prevent tool shortcuts.
- Evaluation harness: live-web/VM sandboxes with third-party auditing and reproducible execution scripts.
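A bounded action schema can be enforced by validating every model-proposed action before it reaches the OS or browser. The sketch below uses the four verbs named above; the argument names (`x`, `y`, `text`, `keys`, `name`), the `SCHEMA` table, and `validate_action` are assumptions made for illustration and do not reproduce any vendor's actual schema.

```python
# Hypothetical schema: verb -> {argument name: required Python type}.
SCHEMA = {
    "click_at": {"x": int, "y": int},
    "type": {"text": str},
    "key_combo": {"keys": list},
    "open_app": {"name": str},
}


def validate_action(raw: dict, screen_w: int, screen_h: int) -> dict:
    """Return a cleaned copy of `raw`, or raise ValueError if it is out of schema."""
    verb = raw.get("verb")
    if verb not in SCHEMA:
        raise ValueError(f"unknown verb: {verb!r}")

    args = raw.get("args", {})
    expected = SCHEMA[verb]
    if set(args) != set(expected):
        raise ValueError(f"{verb}: expected args {sorted(expected)}, got {sorted(args)}")

    for name, typ in expected.items():
        # Exact type match, so that True/False are not accepted as integers.
        if type(args[name]) is not typ:
            raise ValueError(f"{verb}.{name}: expected {typ.__name__}")

    # Clicks must land inside the captured screen.
    if verb == "click_at":
        if not (0 <= args["x"] < screen_w and 0 <= args["y"] < screen_h):
            raise ValueError("click_at: coordinates outside the screen")

    return {"verb": verb, "args": dict(args)}
```

For example, `validate_action({"verb": "click_at", "args": {"x": 10, "y": 20}}, 1280, 800)` passes, while an unknown verb such as `"run_shell"` or a click at `x=5000` is rejected before execution.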
Enterprise Snapshot
- Anthropic: Computer Use API; Sonnet 4.5 at 61.4% on OSWorld; docs emphasize pixel-accurate grounding, retries, and safety confirmations.
- Google DeepMind: Gemini 2.5 Computer Use API plus a model card with Online-Mind2Web 69.0%, WebVoyager 88.9%, AndroidWorld 69.7%, latency measurements, and safety mitigations.
- OpenAI: Operator research preview for U.S. Pro users, powered by a Computer-Using Agent (CUA); a separate system card and a developer surface via the Responses API; availability is limited/preview.

Where They're Headed: Web → OS
- Few-/one-shot workflow cloning: the near-term direction is robust task imitation from a single demonstration (such as a screen capture). Treat this as an active research claim, not a fully solved product feature.
- Latency budgets for collaboration: to preserve the feel of direct manipulation, actions should land within the 0.1-1 s HCI thresholds; current stacks often exceed this because of vision and planning overhead. Expect engineering work on incremental vision (frame diffs), cache-aware OCR, and action batching.
- OS-level breadth: file dialogs, multi-window focus, non-DOM UIs, and system policies add failure modes that browser-only agents never encounter. Gemini's current “browser-optimized, not OS-optimized” status underscores this next step.
- Safety: the main risks are prompt injection from web content, dangerous actions, and data exfiltration. Model cards describe allow/deny lists, confirmations, and blocked domains; expect typed action contracts and “consent gates” for irreversible steps.
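The safety item above combines two checks: a domain allow list and a consent gate in front of irreversible steps. A minimal sketch of that combination follows. The domain set, the verb names in `IRREVERSIBLE_VERBS`, and the `gate` function are all hypothetical; a real deployment would take these from its own policy.

```python
from urllib.parse import urlparse

# Hypothetical policy values, for illustration only.
ALLOWED_DOMAINS = {"example.com"}
IRREVERSIBLE_VERBS = {"submit_payment", "delete_file", "send_email"}


def gate(action: dict, current_url: str, confirm) -> str:
    """Decide whether an action may run: "allow", "deny", or "block".

    `confirm` is a caller-supplied callback that asks the user for consent
    and returns True or False.
    """
    host = urlparse(current_url).hostname or ""
    on_allow_list = any(
        host == domain or host.endswith("." + domain)
        for domain in ALLOWED_DOMAINS
    )
    if not on_allow_list:
        return "block"                       # page is outside the allow list

    if action["verb"] in IRREVERSIBLE_VERBS:
        return "allow" if confirm(action) else "deny"   # consent gate

    return "allow"                           # reversible action on a trusted page
```

Matching on the exact host or a dot-prefixed suffix keeps look-alike hosts such as `notexample.com` off the allow list.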
Practical Build Notes
- Start with a browser-first agent that uses a documented action schema and a verified harness (e.g., Online-Mind2Web).
- Add recoverability: explicit post-conditions, on-screen verification, and rollback plans for long workflows.
- Treat metrics with skepticism: prefer audited leaderboards or third-party harnesses over self-reported scripts; OSWorld uses execution-based evaluation for reproducibility.
Open Research and Tooling
Hugging Face's Smol2Operator provides an open post-training recipe that upgrades a small VLM into a GUI-grounded operator, which is useful for labs and startups that prioritize reproducible training over leaderboard records.
Key Takeaways
- Computer-use (GUI) agents are VLM-driven systems that perceive screens and emit bounded UI actions (click/type/scroll) to operate unmodified apps; current public implementations include Anthropic Computer Use, Google Gemini 2.5 Computer Use, and OpenAI's Computer-Using Agent.
- OSWorld (HKU) benchmarks 369 real desktop/web tasks with execution-based evaluation; at launch humans achieved 72.36% while the best model reached 12.24%, highlighting gaps in grounding and procedural skill.
- Anthropic's Claude Sonnet 4.5 reports 61.4% on OSWorld: sub-human, but a large jump from the previous Sonnet's results.
- Gemini 2.5 Computer Use leads several live-web benchmarks (Online-Mind2Web 69.0%, WebVoyager 88.9%, AndroidWorld 69.7%) and is optimized for browsers, not yet for OS-level control.
- OpenAI Operator is a research preview powered by the Computer-Using Agent (CUA) model, which uses screenshots to interact with GUIs; availability remains limited.
- Open-source trajectory: Hugging Face's Smol2Operator offers a reproducible post-training pipeline that turns a small VLM into a GUI-grounded operator, standardizing action schemas and datasets.

Michal Sutter is a data science professional with a Master of Science in Data Science from the University of Padova. With a solid foundation in mathematics, machine learning, and data engineering, Michal excels at transforming complex information into actionable insights.



