Building Visual Agents That Can Navigate the Web Automatically | by Luís Roque | January, 2025
A step-by-step guide to creating visual agents that can navigate the web automatically
This post was written in collaboration with Rafael Guedes.
In an era of rapid growth in artificial intelligence, a central topic is the rise of agentic AI. These AI systems use large language models (LLMs) to make decisions, plan, and interact with other agents or humans.
When we wrap an LLM with a role, a set of tools, and a specific goal, we create what we call an agent. By focusing on a well-defined goal and accessing appropriate APIs or external tools (such as search engines, databases, or browser access – more on this later), agents can autonomously explore ways to achieve their goals. Agentic AI thus opens a new paradigm in which multiple agents can tackle complex, multi-step workflows.
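To make this definition concrete, here is a minimal sketch of the role–tools–goal wrapping described above. The `Agent` class, the `search` stub, and the dispatch logic are all illustrative assumptions; a real agent would let the LLM itself choose which tool to call.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class Agent:
    """A minimal agent: an LLM wrapped with a role, tools, and a goal."""
    role: str
    goal: str
    tools: Dict[str, Callable[[str], str]] = field(default_factory=dict)

    def act(self, tool_name: str, query: str) -> str:
        # In a real system, the LLM would decide which tool to invoke;
        # here we dispatch directly to keep the sketch self-contained.
        if tool_name not in self.tools:
            raise ValueError(f"Unknown tool: {tool_name}")
        return self.tools[tool_name](query)


# Hypothetical tool: a stand-in for a real search-engine API.
def search(query: str) -> str:
    return f"results for '{query}'"


agent = Agent(
    role="web researcher",
    goal="answer questions using web search",
    tools={"search": search},
)
print(agent.act("search", "agentic AI"))  # → results for 'agentic AI'
```

Swapping the stub for real APIs (a browser driver, a database client) is what turns this skeleton into the kind of agent discussed in the rest of the article.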
John Carmack and Andrej Karpathy recently discussed the topic on X (formerly Twitter), which inspired this article. Carmack pointed out that AI-powered assistants could push apps to expose their features through text-based communication. In this world, LLMs talk to a command-line interface underlying the graphical user interface (GUI), sidestepping the complexity of vision-based navigation (which exists because we humans need it). Karpathy raises a valid point that advanced AI systems can be better…