Machine Learning

From Local LLM to Agent Using Tool

nimda 2 hours ago

0 1 7 minutes read

local LLM. Good.

But after the first few conversations, you may wonder: what else can I do with it?

However, what about making a local LLM agent using a specific tool?

In this post, we'll explore how to turn a local LLM into a tool-based agent. Specifically, we will use

Gemma 4 model (a user-friendly alternative) like our local LLM
Ollama by working for a local LLM
OpenAI Agents SDK for the runtime of the agent
Tavily web search MCP as one example of an external tool

We will build a deep research agent that can search the web, collect evidence, and compile an answer with quotes, given a user query.

By the end of the post, you'll have a working deep space exploration agent and a reusable implementation pattern for turning a space model into a space AI agent.

Figure 1. Structure of local agent. (Photo by author)

If you're interested in a local code agent setup, I previously covered Gemma 4 + OpenCode. In this post, we focus on the most common pattern of connecting the local model to the agent runtime and external tools.

1. Setup the Local Agent Stack

We need to prepare 4 pieces before we write the code: Ollama, Gemma 4 (specifically Gemma 4 E4B model), OpenAI Agents SDK, and Tavily MCP.

First, let's install Ollama.

For Windows, you can download the installer from the official Ollama website:

Or use it winget in PowerShell:

winget install Ollama.Ollama

On Linux, Ollama can be installed with:

"curl -fsSL  | sh"

After installation, please check:

ollama --version

On Windows, remember to launch Ollama from the Start menu. Once it's up and running, the API endpoint is available.

Next, we draw the local model. Here, we use the Gemma 4 E4B variant:

ollama pull gemma4:e4b

Gemma 4 has several variations. The E4B model is perfect for our purpose, as it is designed with edge/local agent workflow in mind. My machine has an NVIDIA RTX 2000 Ada Laptop GPU with about 8 GB VRAM. If your machine is very hardwired, you can try a simple E2B variant:

ollama pull gemma4:e2b

Next, we need the agent runtime library. For that, we use the OpenAI Agents SDK:

pip install openai-agents

You will also need an OpenAI compatible client:

pip install openai

Something to note here: later, we'll point the client to Ollama's local endpoint, so this doesn't mean we're sending model calls to OpenAI.

Finally, we need a Tavily MCP repository. In case you haven't used it before, Tavily is a search API designed for LLM applications. In this post, we use its own MCP server so that the agent can search the web.

You will need to first create a Tavily account and obtain an API key. On the Tavily platform, you can directly generate an MCP link with the following format:

Now we are ready.

Using Tavily here is not a sponsored option; is used here as a single simple MCP tool, the same pattern can work with other MCP compatible tools as well.

In fact, the entire stack here is not the only option. Instead of using Ollama, you can use the local model with LM Studio or llama.cpp. Instead of Gemma 4 models, you can try other models from, eg, the Qwen family. For the agent framework, and we have options from Google or Anthropic. You can also connect different MCP tools instead of Tavily. I use this combination simply because I am familiar with that stack. But the main takeaway from this lesson is the typical example of a local agent.

2. Configure the Local Survey Agent

With the OpenAI Agents SDK, this is the latter Agent something we need to name:

from agents import Agent

agent = Agent(
    name="Local Research Agent",
    instructions=RESEARCH_AGENT_INSTRUCTIONS,
    model=model,
    mcp_servers=[tavily_server],
    mcp_config={"include_server_in_tool_names": True},
)

Let's break down each part.

2.1 Model

First, the model.

from openai import AsyncOpenAI
from agents import OpenAIChatCompletionsModel

MODEL_NAME = "gemma4:e4b"
OLLAMA_BASE_URL = "

client = AsyncOpenAI(
    api_key="ollama",
    base_url=OLLAMA_BASE_URL,
)

model = OpenAIChatCompletionsModel(
    model=MODEL_NAME,
    openai_client=client,
)

We start by building a client that points to an OpenAI compatible Ollama endpoint.

Then, we use OpenAIChatCompletionsModel wrapping a Gemma model into a model object. This allows the agent SDK to use that model inside the agent loop.

Note that i api_key="ollama" the value is just a placeholder. Ollama doesn't really need a real OpenAI API key. We use it because the client expects this field.

2.2 Command

Next, we describe the instructions for the agent with the desired behavior of the study:

from datetime import datetime

CURRENT_DATE = datetime.now().strftime("%B %d, %Y")

# Note that this instruction is iterated with AI
RESEARCH_AGENT_INSTRUCTIONS = f"""
[Role]
You are a concise research assistant.

[Task]
Answer the user's question by turning it into a small web research task. 
Use the current date when interpreting time-sensitive questions: {CURRENT_DATE}.

[Research behavior]
Start with one targeted search query.
For recommendation or comparison questions, complete this research loop before answering: 
first identify the main options, then search for comparison context, then synthesize a recommendation.

Use follow-up searches when the first results are insufficient, conflicting, or only cover part of the question.

Prefer relevant and credible sources, and track which source supports each important claim.

Before answering, check whether the gathered evidence is enough to support the conclusion.

[Expected output]
Give a direct answer first, then briefly explain the evidence behind it. 
Include source links for key factual claims.

[Rules]
Do not rely on memory for facts that may have changed.
Do not invent missing details.
Keep the answer concise.
""".strip()

2.3 Tools

We now equip the agent with a web search tool. In this case, we use the Tavily search engine with MCP:

from agents import Agent, Runner
from agents.mcp import MCPServerStreamableHttp

TAVILY_MCP_URL = "YOUR_TAVILY_MCP_URL"

async with MCPServerStreamableHttp(
    name="tavily",
    params={"url": TAVILY_MCP_URL},
) as tavily_server:
    tools = await tavily_server.list_tools()

    print("Available Tavily tools:")
    for tool in tools:
        description = (tool.description or "").replace("n", " ")
        print(f"- {tool.name}: {description[:120]}")

    agent = Agent(
        name="Local Research Agent",
        instructions=RESEARCH_AGENT_INSTRUCTIONS,
        model=model,
        mcp_servers=[tavily_server],
        mcp_config={"include_server_in_tool_names": True},
    )

    result = await Runner.run(agent, RESEARCH_QUESTION, max_turns=MAX_TURNS)

This code block does three things:

Opens a connection to Tavily's MCP server with async with MCPServerStreamableHttp(...) as tavily_server: Once connected, Tavily will expose its available tools in the Agent SDK.
We create an agent object inside the MCP context. Note that we have mcp_servers=[tavily_server]which attaches Tavily's MCP tools to the agent.
Finally we have an agent result = await Runner.run(agent, RESEARCH_QUESTION, max_turns=MAX_TURNS). The content manager is important here because MCP communication only works internally async with block.

mcp_config={"include_server_in_tool_names": True} mostly read in line. Without it, the name of the tool will only appear as tavily_search. With it, the name of the tool will appear as mcp_tavily__tavily_search. This makes it more clear that the tool call originated through the Tavily MCP server.

3. Develop a Research Question

Now that the agent is configured, let's test it with one practical question:

“Which June 23, 2026 World Cup match had the biggest share of the group stage, and why?”

To check what happened, I print the combined trace:

def compact(value: object, limit: int = 220) -> str:
    text = str(value).replace("n", " ")
    return text if len(text) <= limit else text[:limit] + "..."


for step, item in enumerate(result.new_items, start=1):
    raw_item = getattr(item, "raw_item", None)
    raw_type = getattr(raw_item, "type", "")
    raw_name = getattr(raw_item, "name", "")
    raw_output = getattr(raw_item, "output", "")

    print(
        f"{step:02d} | {type(item).__name__} | "
        f"{raw_type or raw_name} | {compact(raw_output or raw_item)}"
    )

In my run, the trace looked like this:

01 | ToolCallItem | function_call | ResponseFunctionToolCall(arguments='{"query":"World Cup 2026 group stage matches June 23, 2026 stakes"}', name='mcp_tavily__tavily_search', ...)
02 | ToolCallOutputItem |  | {'call_id': ..., 'output': ...}
03 | MessageOutputItem | message | ResponseOutputMessage(... final answer ...)

This allows us to see the behavior of the agent directly. In this case, Gemma's local model decided to call the Tavily search tool, the Agent SDK made that tool call, and passed the results to the model. Then, the model produced the final answer.

To see the final answer, we can print:

print(result.final_output)

Here is what the agent produced:

The match with the biggest group-stage stakes on June 23, 2026, was Colombia vs. DR Congo.

Why:
According to FIFA reporting, this specific match was highlighted as a critical moment where Colombia advanced into the knockout phase of the tournament. 
The article notes that Daniel Munoz scored the first goal for Colombia during this Group K fixture, which directly contributed to their progression in the competition.

Evidence
- FIFA: An article titled "Colombia v Congo DR Group K FIFA World Cup 2026" specifically reports on a key moment from this match, stating that Munoz's goal helped fire Colombia into the knockout phase.
  Source: 

- Yahoo Sports: Confirms the fixture and result for that date: Colombia defeated DR Congo.
  Source:

Note that the agent performed only one search round this time, as the search results already contain enough evidence for the model to respond. For complex queries, multiple rounds of search and reasoning will be required, and our current framework naturally supports that.

4. Wrapping up

The local LLM does not have to remain as an interview model.

In this post, we have generated the Gemma 4 E4B model locally with Ollama, and then put the model into the agent runtime provided by the OpenAI Agents SDK, and provide the agent with a web search tool to find information on the Internet to answer user questions.

From here, you can easily extend this pattern with more rigorous research instructions or build a clearer workflow for planning, if you want to continue working towards deeper research, or you can connect the agent to additional MCP tools for many other use cases.

Fun layout!