How to build a fully functional computer-use agent that thinks, plans, and executes virtual actions using a local open-source model

In this tutorial, we build an advanced computer-use agent from scratch that reasons about its environment and performs virtual actions using an open-source model. We create a miniature virtual desktop, equip it with a tool interface, and design an intelligent agent that can analyze its surroundings, decide on actions such as clicking or typing, and execute them step by step. By the end, we see how the agent pursues goals such as opening mail and summarizing the inbox, and how a local language model can simulate active thinking and task execution.
!pip install -q transformers accelerate sentencepiece nest_asyncio
import torch, asyncio, uuid
from transformers import pipeline
import nest_asyncio
nest_asyncio.apply()
We set up our environment by installing the essential libraries, including Transformers, Accelerate, and nest_asyncio, which let us load local models and run asynchronous operations seamlessly in Colab. We prepare the runtime so that the subsequent components of our agent run smoothly without external dependencies.
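As an optional sanity check (a minimal sketch, not part of the agent itself), we can confirm that a local Flan-T5 pipeline loads and responds before wiring it into the agent:

# Optional sanity check: confirm device selection and that the local
# text2text pipeline responds before it powers the agent's reasoning.
print("CUDA available:", torch.cuda.is_available())
check = pipeline("text2text-generation", model="google/flan-t5-small", device=0 if torch.cuda.is_available() else -1)
print(check("Answer briefly: what is the capital of France?", max_new_tokens=8)[0]["generated_text"])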
class LocalLLM:
    """Thin wrapper around a local text2text model that serves as the agent's reasoning engine."""
    def __init__(self, model_name="google/flan-t5-small", max_new_tokens=128):
        self.pipe = pipeline("text2text-generation", model=model_name, device=0 if torch.cuda.is_available() else -1)
        self.max_new_tokens = max_new_tokens

    def generate(self, prompt: str) -> str:
        # Greedy decoding (do_sample=False) keeps the agent's plans deterministic.
        out = self.pipe(prompt, max_new_tokens=self.max_new_tokens, do_sample=False)[0]["generated_text"]
        return out.strip()
class VirtualComputer:
    """A miniature text-based desktop with a browser, a notes app, and a read-only mail inbox."""
    def __init__(self):
        self.apps = {"browser": "", "notes": "", "mail": ["Welcome to CUA", "Invoice #221", "Weekly Report"]}
        self.focus = "browser"
        self.screen = "Browser open.\nAddress bar focused."
        self.action_log = []

    def screenshot(self):
        # Render the current state as plain text, standing in for a real screenshot.
        return f"FOCUS:{self.focus}\nSCREEN:\n{self.screen}\nAPPS:{list(self.apps.keys())}"

    def click(self, target: str):
        if target in self.apps:
            self.focus = target
            if target == "browser":
                self.screen = f"Browser tab: {self.apps['browser']}\nAddress bar focused."
            elif target == "notes":
                self.screen = f"Notes App\nCurrent notes:\n{self.apps['notes']}"
            elif target == "mail":
                inbox = "\n".join(f"- {s}" for s in self.apps['mail'])
                self.screen = f"Mail App Inbox:\n{inbox}\n(Read-only preview)"
        else:
            self.screen += f"\nClicked '{target}'."
        self.action_log.append({"type": "click", "target": target})

    def type(self, text: str):
        if self.focus == "browser":
            self.apps["browser"] = text
            self.screen = f"Browser tab now at {text}\nPage headline: Example Domain"
        elif self.focus == "notes":
            self.apps["notes"] += ("\n" + text)
            self.screen = f"Notes App\nCurrent notes:\n{self.apps['notes']}"
        else:
            self.screen += f"\nTyped '{text}' but no editable field."
        self.action_log.append({"type": "type", "text": text})
We define the core components: a lightweight local model and a virtual computer. We use Flan-T5 as our reasoning engine and build a simulated desktop that can open applications, render its screen as text, and respond to click and type actions.
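Before handing the desktop to the agent, we can drive it by hand to see how its screen state changes; this is a small illustrative sketch, not part of the original walkthrough:

# Quick manual test of the virtual desktop on its own.
vc = VirtualComputer()
vc.click("mail")                  # switch focus to the mail app
print(vc.screenshot())            # the inbox subjects appear in the rendered screen
vc.click("notes")
vc.type("Summarize the inbox")    # appends a line to the notes app
print(vc.action_log)              # every click and keystroke is recorded here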
class ComputerTool:
    def __init__(self, computer: VirtualComputer):
        self.computer = computer

    def run(self, command: str, argument: str = ""):
        if command == "click":
            self.computer.click(argument)
            return {"status": "completed", "result": f"clicked {argument}"}
        if command == "type":
            self.computer.type(argument)
            return {"status": "completed", "result": f"typed {argument}"}
        if command == "screenshot":
            snap = self.computer.screenshot()
            return {"status": "completed", "result": snap}
        return {"status": "error", "result": f"unknown command {command}"}
We introduce the ComputerTool interface, which acts as a bridge between the agent's reasoning and the virtual desktop. It defines the high-level operations, click, type, and screenshot, that let the agent interact with the environment in a structured way.
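For example, each call returns a small status dictionary that the agent can inspect; the snippet below is a brief illustrative sketch using the classes defined above:

# Each tool call returns a structured result the agent can reason over.
tool = ComputerTool(VirtualComputer())
print(tool.run("click", "browser"))            # {'status': 'completed', 'result': 'clicked browser'}
print(tool.run("type", "example.com"))         # {'status': 'completed', 'result': 'typed example.com'}
print(tool.run("screenshot")["result"][:80])   # first part of the rendered screen text
print(tool.run("scroll"))                      # {'status': 'error', 'result': 'unknown command scroll'}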
class ComputerAgent:
    """Agentic loop: observe the screen, ask the LLM for an action, execute it, and record events."""
    def __init__(self, llm: LocalLLM, tool: ComputerTool, max_trajectory_budget: float = 5.0):
        self.llm = llm
        self.tool = tool
        self.max_trajectory_budget = max_trajectory_budget

    async def run(self, messages):
        user_goal = messages[-1]["content"]
        steps_remaining = int(self.max_trajectory_budget)
        output_events = []
        total_prompt_tokens = 0
        total_completion_tokens = 0
        while steps_remaining > 0:
            # Observe the current screen and ask the model for the next action.
            screen = self.tool.computer.screenshot()
            prompt = (
                "You are a computer-use agent.\n"
                f"User goal: {user_goal}\n"
                f"Current screen:\n{screen}\n\n"
                "Think step-by-step.\n"
                "Reply with: ACTION <action> ARG <argument> THEN <message>.\n"
            )
            thought = self.llm.generate(prompt)
            total_prompt_tokens += len(prompt.split())
            total_completion_tokens += len(thought.split())
            # Defaults in case the model's reply cannot be parsed.
            action = "screenshot"; arg = ""; assistant_msg = "Working..."
            for line in thought.splitlines():
                if line.strip().startswith("ACTION "):
                    after = line.split("ACTION ", 1)[1]
                    action = after.split()[0].strip()
                if "ARG " in line:
                    part = line.split("ARG ", 1)[1]
                    if " THEN " in part:
                        arg = part.split(" THEN ")[0].strip()
                    else:
                        arg = part.strip()
                if "THEN " in line:
                    assistant_msg = line.split("THEN ", 1)[1].strip()
            output_events.append({"summary": [{"text": assistant_msg, "type": "summary_text"}], "type": "reasoning"})
            # Execute the chosen action through the tool interface and log the result.
            call_id = "call_" + uuid.uuid4().hex[:16]
            tool_res = self.tool.run(action, arg)
            output_events.append({"action": {"type": action, "text": arg}, "call_id": call_id, "status": tool_res["status"], "type": "computer_call"})
            snap = self.tool.computer.screenshot()
            output_events.append({"type": "computer_call_output", "call_id": call_id, "output": {"type": "input_image", "image_url": snap}})
            output_events.append({"type": "message", "role": "assistant", "content": [{"type": "output_text", "text": assistant_msg}]})
            if "done" in assistant_msg.lower() or "here is" in assistant_msg.lower():
                break
            steps_remaining -= 1
        usage = {"prompt_tokens": total_prompt_tokens, "completion_tokens": total_completion_tokens, "total_tokens": total_prompt_tokens + total_completion_tokens, "response_cost": 0.0}
        yield {"output": output_events, "usage": usage}
We build the ComputerAgent, which acts as the intelligent controller of the system. It reasons about the goal, decides which action to take, executes it through the tool interface, and records each interaction as a step in its decision-making trajectory.
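To make the control loop concrete, here is a standalone sketch of how a single reply in the expected ACTION/ARG/THEN format gets parsed; the reply text is hypothetical, but the parsing rules mirror those in the agent above:

# Hypothetical model reply following the ACTION <action> ARG <argument> THEN <message> format.
reply = "ACTION click ARG mail THEN Opening the mail app to read the inbox."
action, arg, assistant_msg = "screenshot", "", "Working..."
for line in reply.splitlines():
    if line.strip().startswith("ACTION "):
        action = line.split("ACTION ", 1)[1].split()[0].strip()
    if "ARG " in line:
        part = line.split("ARG ", 1)[1]
        arg = part.split(" THEN ")[0].strip() if " THEN " in part else part.strip()
    if "THEN " in line:
        assistant_msg = line.split("THEN ", 1)[1].strip()
print(action, "|", arg, "|", assistant_msg)
# -> click | mail | Opening the mail app to read the inbox.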
async def main_demo():
    computer = VirtualComputer()
    tool = ComputerTool(computer)
    llm = LocalLLM()
    agent = ComputerAgent(llm, tool, max_trajectory_budget=4)
    messages = [{"role": "user", "content": "Open mail, read inbox subjects, and summarize."}]
    async for result in agent.run(messages):
        print("==== STREAM RESULT ====")
        for event in result["output"]:
            if event["type"] == "computer_call":
                a = event.get("action", {})
                print(f"[TOOL CALL] {a.get('type')} -> {a.get('text')} [{event.get('status')}]")
            if event["type"] == "computer_call_output":
                snap = event["output"]["image_url"]
                print("SCREEN AFTER ACTION:\n", snap[:400], "...\n")
            if event["type"] == "message":
                print("ASSISTANT:", event["content"][0]["text"], "\n")
        print("USAGE:", result["usage"])

# nest_asyncio lets us re-enter the notebook's already-running event loop.
loop = asyncio.get_event_loop()
loop.run_until_complete(main_demo())
We bring everything together in a demo function, where the agent interprets the user's request and performs tasks on the virtual computer. We watch it express its reasoning, execute commands, update the virtual screen, and work toward its goal in a clear, step-by-step manner.
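If we want to exercise the agent on a different task, we can reuse the same components with another goal; this is an illustrative variation (the goal string is hypothetical) that assumes the classes and event loop defined above:

async def notes_demo():
    # Same wiring as main_demo, but with a note-taking goal.
    agent = ComputerAgent(LocalLLM(), ComputerTool(VirtualComputer()), max_trajectory_budget=4)
    messages = [{"role": "user", "content": "Open notes and type a to-do list for today."}]
    async for result in agent.run(messages):
        for event in result["output"]:
            if event["type"] == "message":
                print("ASSISTANT:", event["content"][0]["text"])

loop.run_until_complete(notes_demo())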
In conclusion, we have implemented a core computer-use agent capable of autonomous reasoning and interaction. We show that a local language model such as Flan-T5 can simulate desktop-level automation within a safe, text-based sandbox. This project helps us understand the architecture behind computer-use agents, which bridge natural language reasoning and virtual tool control, and it lays a solid foundation for extending these capabilities toward real-world, multimodal, and fully automated workflows.



