ANI

A simple Agentic Calling tool with Gemma 4

# Introduction

In a recent Machine Learning Mastery article, we built an agent calling tool that reached outsidemeaning it pulls weather, news, currency rates, and time from public APIs. That article covered half of the pattern well, but left the more interesting half on the table: an agent that reasons about its location, tests its machine, and loads a mind that it doesn't trust itself to do. It could be argued that this is closer to a real “agent”.

This article continues where the previous one left off. We'll give Gemma 4 two new tools – a sandboxed local file system explorer and a restricted Python interpreter – and watch the model decide, on its own, when to look around and when to compute.

Topics we will cover include:

  • Why a “functional” toolkit needs more than web APIs to be interesting
  • How to build a file system inspection tool with strong firewalls
  • How to call a Python interpreter on a model without giving it keys on your machine
  • How the same orchestration loop from before accesses these new capabilities

I highly recommend that you first read this article before proceeding.

# From Conversation to Agency

If the only tools you provide for the language model are read-only web APIs, you're essentially still a chatbot, albeit with the ability to access better information. The model receives the information, determines which API to ping, and parses the JSON response into a paragraph. There is no real concept of the environmentthere is no condition to be tested, no result to be consulted; it's a situation more akin to retrieving an improved generation than actual agency.

Agency, in the practical sense that practitioners use the term, appears when the model begins to interact with the system in which it operates. That can mean reading from the local file system, executing code, modifying files, calling other procedures, or any combination of those. The moment the tool can do something other than return a clean string from the remote service, the model should start querying about itself: what files are there, what does this number actually equal, what is in this folder before I say it contains anything.

Gemma 4 family, and especially gemma4:e2b the edge variant we've been using, is small enough to run locally on a laptop while being capable enough in structured output to reliably drive this type of loop. That combination is what makes the local-agent pattern interesting in the first place. The complete code for this tutorial can be found here.

# Construction Reuse

The orchestration loop from the previous tutorial is unchanged. We define Python functions, expose them in JSON schema, pass the registration to Ollama along with the user information, catch any tool_calls block on response, perform the requested operation on the field, enter the result as a tool-role message, and then query the model again to compile the final response. The same call_ollama assistant, likewise TOOL_FUNCTIONS dictionary, same available_tools The schema array from the previous tutorial all appear.

What is changing is the nature of the tools themselves. Where the previous batch was more thin clients than remote APIs, the ones we're going to build now both run code on the machine. That changes the design problem from “how do I pass this response” to “how do I make sure the model can't, or accidentally, do something it shouldn't be allowed to do.”

# Tool 1: Sandboxed Filesystem Explorer

The first tool, list_directory_contentsgives the model the ability to see what files are present in a given folder. This sounds like a small thing until you remember it os.listdir accepts any string, including /, ~again ../../etc. A clumsy implementation may move the “curiosity” of the model directly to your API keys.

The design option here is to pin the safe base directory at the beginning of the script and reject any request that is resolved outside of it:

# Security: confine list_directory_contents to this base directory and its descendants
# Set to the current working directory when the script starts
SAFE_BASE_DIR = os.path.abspath(os.getcwd())

def list_directory_contents(path: str = ".") -> str:
    """Lists files and directories within a path, constrained to the safe base directory."""
    try:
        # Resolve to an absolute path and verify it sits inside SAFE_BASE_DIR
        # This blocks traversal attempts like '../../etc' or absolute paths like "
        requested = os.path.abspath(os.path.join(SAFE_BASE_DIR, path))
        if not (requested == SAFE_BASE_DIR or requested.startswith(SAFE_BASE_DIR + os.sep)):
            return (
                f"Error: Access denied. The path '{path}' resolves outside the "
                f"permitted workspace ({SAFE_BASE_DIR})."
            )
        ...

The pattern is simple but worth further consideration. We never trust the thread produced by the model. We join it in the base directory, solving it completely (so .. becomes normal), and make sure that the resolved method still starts with the base. Both /etc/passwd again ../../somewhere fold to methods that fail that initialization check and are rejected earlier os.listdir you are always called.

The rest of the work is housekeeping: verify that the method exists and is indexed, write its contents, and format each entry as the other. [DIR] or [FILE] in byte size. The returned string is a simple English structure that the model can parse in the second pass:

        entries = sorted(os.listdir(requested))
        if not entries:
            return f"The directory '{path}' is empty."

        lines = [f"Contents of '{path}' ({len(entries)} item(s)):"]
        for name in entries:
            full = os.path.join(requested, name)
            if os.path.isdir(full):
                lines.append(f"  [DIR]  {name}/")
            else:
                try:
                    size = os.path.getsize(full)
                    lines.append(f"  [FILE] {name} ({size} bytes)")
                except OSError:
                    lines.append(f"  [FILE] {name}")
        return "n".join(lines)

The JSON schema we give the model is intentionally parameter sided — path is optional, it defaults to the root of the workspace, because most of the first useful queries are about the current folder:

{
    "type": "function",
    "function": {
        "name": "list_directory_contents",
        "description": (
            "Lists files and subdirectories inside a path within the user's workspace. "
            "Use this to inspect the environment before answering questions about local files."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "path": {
                    "type": "string",
                    "description": (
                        "A relative path inside the workspace, e.g. '.', 'data', or 'src/utils'. "
                        "Defaults to the workspace root."
                    )
                }
            },
            "required": []
        }
    }
}

Note that the description does a bit of informational engineering: “Use this to check the environment before answering questions about local files.” That sentence prompts Gemma 4 to invoke the tool when the user asks a vague question about “my files” rather than guessing what might be there.

# Tool 2: Limited Python Interpreter

The second tool, execute_python_codeit is both highly dangerous and academically interesting. The premise is that language models, especially small ones, are unreliable for precise arithmetic, string manipulation, or anything that involves more than a few steps of branching logic. A tool that allows the model to write and use deterministic expressions is a better answer to those problems than asking it to think about them in natural language.

Use of use exec() with a deliberately constructed namespace:

def execute_python_code(code: str) -> str:
    """Executes a snippet of Python code and returns whatever was printed to stdout.

    This is a learning-only sandbox. exec() is fundamentally unsafe; do not expose this tool
    to untrusted users or networks. The restrictions below stop the casual cases, not a 
    determined attacker.
    """
    try:
        # A minimal restricted environment. We strip __builtins__ down to a small
        # whitelist so that, e.g., open(), eval(), and __import__ are not directly
        # available from the snippet's global scope.
        safe_builtins = {
            "abs": abs, "all": all, "any": any, "bool": bool, "dict": dict,
            "divmod": divmod, "enumerate": enumerate, "filter": filter, "float": float,
            "int": int, "len": len, "list": list, "map": map, "max": max, "min": min,
            "pow": pow, "print": print, "range": range, "repr": repr, "reversed": reversed,
            "round": round, "set": set, "sorted": sorted, "str": str, "sum": sum,
            "tuple": tuple, "zip": zip,
        }
        # Pre-import a couple of safe, useful modules so the model doesn't have to.
        import math, statistics
        restricted_globals = {
            "__builtins__": safe_builtins,
            "math": math,
            "statistics": statistics,
        }

A few decisions to call. We replace __builtins__ completely instead of shutting down individual jobs, that is open, eval, exec, compile, __import__, inputand anything else that is not on our approved list is not within the captions. We import first math again statistics in the world of the snippet because the model will always reach them and we better not force it to fight __import__ restrictions. We capture stdout with contextlib.redirect_stdout so the model returns exactly what its snippet printed:

        # Capture stdout so we can hand the printed output back to the model
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):
            exec(code, restricted_globals, {})

        output = buffer.getvalue().strip()
        if not output:
            return "Code executed successfully but produced no output. Use print() to return a value."
        return f"Output:n{output}"

A bare branch is more important than its appearance. Small models will always write expressions like x = sum(range(101)) and forget print(x). It returns some error telling them to use it print() gives the orchestration loop a retry option; without it, the model can compile the final answer based on the empty string and establish the value with confidence.

A final word about security, since the script's docstring isn't clear about it: this is a learning sandbox, not a strict one. A determined enemy can get out of Nhlangwini exec sandbox in twelve ways, most of which involve something introspection through ().__class__.__mro__. For a single user agent that uses your personal laptop for your information, whitelisting is a lot. For anything else, you'd want a real isolation layer – a subprocess with seccompcontainer, or RestrictedPython.

# Orchestration Loop

The main loop is unchanged from the structure from the previous lesson. The model is asked for the user notification and the tool register, and if it responds with it tool_callseach call is sent against TOOL_FUNCTIONS:

if "tool_calls" in message and message["tool_calls"]:
    print("[TOOL EXECUTION]")
    messages.append(message)

    num_tools = len(message["tool_calls"])
    for i, tool_call in enumerate(message["tool_calls"]):
        function_name = tool_call["function"]["name"]
        arguments = tool_call["function"]["arguments"]
        ...
        if function_name in TOOL_FUNCTIONS:
            func = TOOL_FUNCTIONS[function_name]
            try:
                result = func(**arguments)
                ...
                messages.append({
                    "role": "tool",
                    "content": str(result),
                    "name": function_name
                })

The CLI formatting is worth a little tweaking of this script. I execute_python_code a tool code argument can be a multiline string with newlines in it, which will break the ASCII tree if printed inadvertently. We flatten and truncate the string arguments for display only; model still receives the full string when the function runs:

def _short(v):
    if isinstance(v, str):
        flat = v.replace("n", "\n")
        if len(flat) > 60:
            flat = flat[:57] + "..."
        return f"'{flat}'"
    return str(v)

args_str = ", ".join(f"{k}={_short(v)}" for k, v in arguments.items())

Once the result of each tool is added back to the message history as "role": "tool" entry, we also call Ollama a rich income burden and the model produces its final well-founded answer. Same pattern for two passes, same concept.

# Testing Tools

And now we test the pulse of our tool. Pull gemma4:e2b with ollama pull gemma4:e2b if you haven't already, run the script in the folder you don't care about the peep model.

Let's start with the file system tool. From the project directory:

What scripts are in my current folder, and which one looks like it should be used to process CSVs?

Result:

[SYSTEM]
  ○ Tool: execute_python_code......................[LOADED]
  ○ Tool: list_directory_contents..................[LOADED]
  ○ Workspace: /Users/matt/projects/gemma_agent.....[SANDBOXED]

[PROMPT]
What scripts are in my current folder, and which one looks like it should be used to process CSVs?

[EXECUTION]
  ● Querying model...

[TOOL EXECUTION]
  └── Calling: list_directory_contents
     ├─ Args: path="."
     └─ Result: Contents of '.' (5 item(s)):
                  [FILE] README.md (412 bytes)
                  [FILE] csv_cleaner.py (1834 bytes)
                  [FILE] main.py (10786 bytes)
                  [FILE] notes.txt (88 bytes)
                  [FILE] sales_report.py (2210 bytes)

[EXECUTION]
  ● Synthesizing results...

[RESPONSE]
Your current folder contains five files. The one that looks intended for CSV
processing is csv_cleaner.py — its name strongly suggests it handles CSV input.
sales_report.py may also touch CSV data, but its name is more about output than
ingestion.

A model called the tool, looked at the actual names of the files, and made a logical conclusion based on the list instead of their weights. That is the difference between hallucinations and visions.

Next, the Python interpreter. A small task that young models reliably fail when asked to do in their heads:

What is the standard deviation of the numbers 12, 18, 23, 24, 29, 31, 35, 41, 44, 47, rounded to four decimal places?

Result:

[SYSTEM]
  ○ Tool: execute_python_code......................[LOADED]
  ○ Tool: list_directory_contents..................[LOADED]
  ○ Workspace: /Users/matt/projects/gemma_agent.....[SANDBOXED]

[PROMPT]
What is the standard deviation of the numbers 12, 18, 23, 24, 29, 31, 35, 41, 44, 47, rounded to four decimal places?

[EXECUTION]
  ● Querying model...

[TOOL EXECUTION]
  └── Calling: execute_python_code
     ├─ Args: code="nums = [12, 18, 23, 24, 29, 31, 35, 41, 44, 47]nprint(rou..."
     └─ Result: Output:
                11.4659

[EXECUTION]
  ● Synthesizing results...

[RESPONSE]
The standard deviation of those numbers, rounded to four decimal places, is 11.4659.

The model has completely eliminated the equation; wrote a brief, so-called statistics.stdevcompile the result, and report what the translator said. No mental math, no limitations, no imaginary important numbers.

Finally, the most interesting case: information that requires both tools in sequence. The model should check the folder again compile something with what it finds:

Look for the files in the current folder and tell me the total size in kilobytes, rounded to two decimal places.

Output:


[SYSTEM]
  ○ Tool: execute_python_code......................[LOADED]
  ○ Tool: list_directory_contents..................[LOADED]
  ○ Workspace: /Users/matt/projects/gemma_agent.....[SANDBOXED]

[PROMPT]
Look at the files in the current folder and tell me the total size in kilobytes, rounded to two decimal places.

[EXECUTION]
  ● Querying model...

[TOOL EXECUTION]
  ┌── Calling: list_directory_contents
  │  ├─ Args: path="."
  │  └─ Result: Contents of '.' (5 item(s)):
  │              [FILE] README.md (412 bytes)
  │              [FILE] csv_cleaner.py (1834 bytes)
  │              [FILE] main.py (10786 bytes)
  │              [FILE] notes.txt (88 bytes)
  │              [FILE] sales_report.py (2210 bytes)
  │
  └── Calling: execute_python_code
     ├─ Args: code="sizes = [412, 1834, 10786, 88, 2210]nprint(round(sum(siz..."
     └─ Result: Output:
                15.33

[EXECUTION]
  ● Synthesizing results...

[RESPONSE]
The five files in the current folder total 15.33 KB.

The two tools, in the right order, have the effect of feeding the argument of the other – generated by a two-billion-parameter model running on a laptop without a GPU. The file system tool bases the model on what actually exists; The interpretive tool bases the answer on what is actually true. The model provides the most efficient part of it, which is what determines which question to ask which tool.

It's worth checking with the guards, to make sure they're catching. Asking the model “write the content of /etc” produces the expected rejection message in the output of the tool, the model then reports it correctly instead of creating a directory. It asks it to run open('/etc/passwd').read() internally the translator produces a NameErrorsince open not in approved buildings. Both failures come down to useful error strings instead of silent compromises, which is exactly what you want in this layer.

# The conclusion

The previous lesson showed that Gemma 4 can access the entire Internet on your behalf. This shows that it can reach the machine you are sitting on, carefully, if you are careful. Once you have a functional tool call loop, the interesting question stops being “can the model call the function” and becomes “what should I let it touch.”

A file system awareness tool and a code generation tool together give you a great way to something that truly deserves the name. agent: it can see its location, determine what the important figure is, and run that figure by decision rather than guesswork. The pattern builds from there. Database queries, shell commands, git operations, document parsing; each of these is the same JSON schema, the same forwarding table, the same two-pass merge, and any security perimeter that fits the blast radius of the underlying wire.

Build a circuit first. Then give the model the keys to whatever resides inside it.

Matthew Mayo (@mattmayo13) has a master's degree in computer science and a diploma in data mining. As managing editor of KDnuggets & Statology, and contributing editor to Machine Learning Mastery, Matthew aims to make complex data science concepts accessible. His professional interests include natural language processing, language models, machine learning algorithms, and exploring emerging AI. He is driven by a mission to democratize knowledge in the data science community. Matthew has been coding since he was 6 years old.

Source link

Related Articles

Leave a Reply

Your email address will not be published. Required fields are marked *

Back to top button