
5 Small Language Models for Driving Agentic Tools

# Introduction

Agentic AI systems rely on the model's ability to drive tools reliably: selecting the correct tool for the task, formatting arguments correctly, and integrating results into multi-step workflows. Larger frontier models like ChatGPT, Claude, and Gemini handle this well, but come with trade-offs in cost, latency, and hardware requirements that make them impractical for many real-world applications. Small language models have done well to bridge that gap, and several compact, open-source options now offer first-class tool-calling support without needing a data center to run them.

And now, in no particular order, here are 5 small language models for driving agentic tools. Note that, for simplicity and consistency, all model links point to model repositories hosted on Hugging Face.

# 1. SmolLM3-3B

| Technical Feature | Details |
|---|---|
| Parameters | 3B |
| Architecture | Decoder-only transformer (GQA + NoPE, 3:1 ratio) |
| Context Length | 64K native; up to 128K with YaRN extrapolation |
| Training Tokens | 11.2T |
| Multilingual Support | 6 languages (EN, FR, ES, DE, IT, PT) |
| Reasoning Mode | Dual mode (thinking / non-thinking toggle) |
| Tool Calling | Yes: JSON/XML (xml_tools) and Python (python_tools) |
| License | Apache 2.0 |

SmolLM3 is a 3B-parameter language model designed to push the boundaries of small models, supporting dual-mode reasoning, 6 languages, and long context. It is a decoder-only transformer using Grouped Query Attention (GQA) and No Positional Embeddings (NoPE) at a 3:1 ratio, pre-trained on 11.2T tokens with a staged curriculum of web, code, math, and reasoning data. Post-training included a mid-training phase on 140 billion reasoning tokens, followed by supervised fine-tuning and alignment with Anchored Preference Optimization (APO), Hugging Face's off-policy preference alignment method. The model supports two different interfaces for tool calling, JSON/XML blocks with xml_tools and Python-style function calls with python_tools, making it flexible for agent pipelines and RAG systems. As a fully open-source release, including weights, datasets, and training code, SmolLM3 is well suited to chat assistants, RAG applications, and code assistants on constrained hardware such as edge devices or low-VRAM machines.
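To make the tool-calling loop concrete, here is a minimal sketch of the consumer side: extracting and dispatching a tool call from model output. It assumes Hermes-style JSON blocks wrapped in tool_call tags, a common convention for this interface; check the tokenizer's chat template for the exact format. The get_weather tool and its schema are illustrative, not from the model card.

```python
import json
import re

# Hypothetical tool the model may call
def get_weather(city: str) -> str:
    return f"Sunny in {city}"

TOOLS = {"get_weather": get_weather}

def dispatch_tool_calls(model_output: str) -> list:
    """Extract <tool_call>{...}</tool_call> blocks and invoke the matching Python function."""
    results = []
    for block in re.findall(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", model_output, re.DOTALL):
        call = json.loads(block)
        fn = TOOLS[call["name"]]           # look up the registered tool by name
        results.append(fn(**call["arguments"]))
    return results

output = '<tool_call>{"name": "get_weather", "arguments": {"city": "Paris"}}</tool_call>'
print(dispatch_tool_calls(output))  # ['Sunny in Paris']
```

In a real agent loop, each result would be appended back to the conversation as a tool-role message before the next generation step.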

# 2. Qwen3-4B-Instruct-2507

| Technical Feature | Details |
|---|---|
| Parameters | 4.0B (3.6B non-embedding) |
| Architecture | Causal LM, 36 layers, GQA (32 Q heads / 8 KV heads) |
| Context Length | 262,144 tokens (native) |
| Reasoning Mode | Non-thinking only (no think blocks) |
| Multilingual Support | 100+ languages |
| Tool Calling | Yes: native, via Qwen-Agent / MCP |
| License | Apache 2.0 |

Qwen3-4B-Instruct-2507 is an updated version of Qwen3-4B in non-thinking mode, showing significant improvements in general capabilities, including instruction following, logical reasoning, text comprehension, math, science, coding, and tool use. It also offers substantially better coverage of long-tail knowledge across multiple languages. Both the Instruct and Thinking variants share 4 billion parameters (3.6B non-embedding) spread across 36 transformer layers, using GQA with 32 query heads and 8 key/value heads to keep memory use efficient over very long contexts. This non-thinking variant is optimized for direct, fast use cases, delivering short responses without explicit reasoning traces, which makes it well suited to chatbots, customer support, and tool-calling agents where low latency matters. Qwen3 excels at tool calling, and Alibaba recommends the Qwen-Agent framework, which bundles tool-calling templates and built-in parsers to reduce code complexity, with support for MCP server configuration files.
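Qwen3's chat template and Qwen-Agent both consume OpenAI-style function schemas. A lightweight sketch of defining one and sanity-checking a model-proposed call against it before execution; the search_docs tool is hypothetical:

```python
import json

# OpenAI-style function schema, the format Qwen3's chat template and
# Qwen-Agent consume (the tool itself is illustrative)
TOOLS = [{
    "type": "function",
    "function": {
        "name": "search_docs",
        "description": "Search internal documentation.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string"},
                "top_k": {"type": "integer"},
            },
            "required": ["query"],
        },
    },
}]

def validate_call(call_json: str) -> dict:
    """Check a model-proposed tool call against the schema before executing it."""
    call = json.loads(call_json)
    schema = next(t["function"] for t in TOOLS if t["function"]["name"] == call["name"])
    missing = [p for p in schema["parameters"]["required"] if p not in call["arguments"]]
    if missing:
        raise ValueError(f"missing required arguments: {missing}")
    return call

call = validate_call('{"name": "search_docs", "arguments": {"query": "GQA"}}')
print(call["arguments"]["query"])  # GQA
```

Validating calls before dispatch is cheap insurance: even strong small models occasionally drop a required argument, and catching that in code beats a cryptic downstream failure.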

# 3. Phi-3-mini-4k-instruct

| Technical Feature | Details |
|---|---|
| Parameters | 3.8B |
| Architecture | Decoder-only transformer |
| Context Length | 4K tokens |
| Vocabulary Size | 32,064 tokens |
| Training Data | Synthetic + filtered public web data |
| Post-training | SFT + DPO |
| Tool Calling | Yes: via chat template (requires HF transformers ≥ 4.41.2) |
| License | MIT |

Phi-3-mini-4k-instruct is a 3.8B-parameter, lightweight, state-of-the-art open model trained on the Phi-3 datasets, which include both synthetic data and filtered publicly available web data, with a focus on high-quality, reasoning-dense content. The model underwent a post-training process combining Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) for instruction following and safety. Microsoft's "small but smart" model, Phi-3-mini was notable at launch for its ability to run on-device, including on smartphones, while competing with GPT-3.5 on benchmarks. It is primarily intended for memory- and compute-constrained environments, latency-bound scenarios, and tasks that require strong reasoning, especially math and logic. Although it is older than the other models on this list and limited to a 4K context window, its MIT license makes it one of the most permissively licensed options available, and its strong general reasoning has made it a popular base for fine-tuning in commercial applications.
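Because Phi-3-mini has no dedicated tool tokens, tool use rides on the dialog template: you describe the tool in a system turn and ask the model to reply with JSON. A minimal sketch of that prompt assembly, assuming the documented special-token layout; the convert_units tool and the instruction wording are illustrative:

```python
import json

def build_phi3_prompt(system: str, user: str) -> str:
    """Assemble a Phi-3-style dialog prompt from system and user turns.
    In practice, tokenizer.apply_chat_template produces this layout for you."""
    return (
        f"<|system|>\n{system}<|end|>\n"
        f"<|user|>\n{user}<|end|>\n"
        f"<|assistant|>\n"
    )

# Hypothetical tool, described to the model as plain JSON in the system turn
tool_spec = json.dumps({"name": "convert_units", "arguments": {"value": "number", "unit": "string"}})
system = "You can call this tool by replying with JSON only: " + tool_spec
prompt = build_phi3_prompt(system, "Convert 3 miles to km.")
print(prompt.startswith("<|system|>"))  # True
```

The trade-off versus models with native tool tokens is that parsing the reply is on you, so constraining the model to "JSON only" in the instruction matters more here.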

# 4. Gemma-3n-E2B-it

| Technical Feature | Details |
|---|---|
| Effective Parameters | 2.3B (5.1B total with embeddings) |
| Architecture | Dense, mixed attention (sliding window + global) + PLE |
| Layers | 35 |
| Sliding Window | 512 tokens |
| Context Length | 32K tokens |
| Vocabulary Size | 262K |
| Modalities | Text, image, audio (≤30 sec), video (as frames) |
| Multilingual Support | 35+ languages natively; trained on 140+ |
| Tool Calling | Yes: native function calling |
| License | Apache 2.0 |

Gemma-3n-E2B is part of Google DeepMind's Gemma 3n family, which combines a hybrid attention mechanism: local sliding-window attention interleaved with full global attention. This design delivers the processing speed and low memory footprint of a lightweight model without sacrificing the deep context awareness needed for complex, long-context tasks. The "E" in E2B stands for "effective" parameters, enabled by an important architectural innovation called Per-Layer Embeddings (PLE), which gives each decoder layer its own embedding parameters that can be kept out of accelerator memory. This allows E2B to run inference in under 2 GB of memory while still producing strong results. The model supports native function calling, enabling agentic workflows, and is optimized for mobile and IoT devices, handling text, image, audio, and video input. Released under Apache 2.0 (a change from the earlier custom Gemma license, which carried more restrictions), Gemma 3n E2B is an attractive option for developers building multimodal agentic applications that run entirely at the edge.
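The mixed attention pattern is easy to picture with toy masks: sliding-window layers restrict each token to a short local span, while the interleaved global layers see the full causal prefix. A minimal sketch, using a window of 3 instead of the model's 512 for readability:

```python
def causal_sliding_mask(seq_len: int, window: int) -> list:
    """mask[q][k] == 1 when query position q may attend to key position k.
    Each token sees itself plus the (window - 1) previous positions."""
    return [
        [1 if q - window < k <= q else 0 for k in range(seq_len)]
        for q in range(seq_len)
    ]

def causal_global_mask(seq_len: int) -> list:
    """Full causal mask: every token sees the entire prefix."""
    return [[1 if k <= q else 0 for k in range(seq_len)] for q in range(seq_len)]

local = causal_sliding_mask(6, window=3)
print(local[5])                   # [0, 0, 0, 1, 1, 1] -> token 5 sees only the last 3 positions
print(causal_global_mask(6)[5])   # [1, 1, 1, 1, 1, 1] -> a global layer sees everything
```

The memory win comes from the local layers: their KV cache is bounded by the window size rather than the full sequence length, and only the occasional global layers pay the full-context cost.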

# 5. Mistral-7B-Instruct-v0.3

| Technical Feature | Details |
|---|---|
| Parameters | 7.25B |
| Architecture | Transformer, GQA + SWA |
| Context Length | 32,768 tokens |
| Vocabulary Size | 32,768 tokens (extended from v0.2) |
| Tokenizer | v3 Mistral tokenizer |
| Function Calling | Yes: via TOOL_CALLS / AVAILABLE_TOOLS / TOOL_RESULTS tokens |
| License | Apache 2.0 |

Mistral-7B-Instruct-v0.3 is an instruction fine-tuned version of Mistral-7B-v0.3, which introduces three important changes over v0.2: a vocabulary extended to 32,768 tokens, support for the v3 tokenizer, and function calling support. The model uses Grouped Query Attention (GQA) for fast inference and Sliding Window Attention (SWA) to handle long sequences efficiently, and function calling is made possible by the extended vocabulary, which includes the dedicated tokens TOOL_CALLS, AVAILABLE_TOOLS, and TOOL_RESULTS. As the largest model on this list at 7B parameters, Mistral-7B-Instruct-v0.3 offers excellent instruction-following performance for its size class and has become an industry workhorse, widely available through Ollama, vLLM, and many inference platforms.
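A rough sketch of what that control-token layout looks like on the wire, and how to recover a call on the way back. This is a plain-string approximation for illustration; in practice the mistral-common library or the Hugging Face chat template should produce the exact encoding:

```python
import json

def build_prompt(tools: list, user: str) -> str:
    """Approximate v3 layout: tools declared before the instruction turn."""
    return (
        f"[AVAILABLE_TOOLS] {json.dumps(tools)}[/AVAILABLE_TOOLS]"
        f"[INST] {user} [/INST]"
    )

def parse_tool_calls(completion: str) -> list:
    """On tool-use turns the model leads with the TOOL_CALLS marker,
    followed by a JSON list of calls; otherwise return no calls."""
    marker = "[TOOL_CALLS]"
    if not completion.startswith(marker):
        return []
    return json.loads(completion[len(marker):])

completion = '[TOOL_CALLS] [{"name": "get_time", "arguments": {"tz": "UTC"}}]'
calls = parse_tool_calls(completion)
print(calls[0]["name"])  # get_time
```

Each parsed call would then be executed and fed back to the model inside a TOOL_RESULTS turn, closing the loop.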

# Wrapping up

The five models covered here – SmolLM3-3B, Qwen3-4B-Instruct-2507, Phi-3-mini-4k-instruct, Gemma-3n-E2B-it, and Mistral-7B-Instruct-v0.3 – span a range of architectures, parameter counts, and context windows, but they all share one key trait: open-weight releases with first-class tool calling.

From Hugging Face's fully transparent SmolLM3 to Google DeepMind's multimodal, edge-optimized Gemma 3n E2B, the selection shows that capable agentic models no longer require frontier models or data-center infrastructure to deploy. Whether your priority is on-device reasoning, long-context handling, multilingual use, or the most permissive license possible, there's a model on this list worth checking out.

Keep in mind that these are not the only small language models with tool-calling capabilities. They are, however, a good representative set of the ones I have direct experience with and feel comfortable recommending based on my own results.

Matthew Mayo (@mattmayo13) has a master's degree in computer science and a diploma in data mining. As managing editor of KDnuggets & Statology, and contributing editor to Machine Learning Mastery, Matthew aims to make complex data science concepts accessible. His professional interests include natural language processing, language models, machine learning algorithms, and exploring emerging AI. He is driven by a mission to democratize knowledge in the data science community. Matthew has been coding since he was 6 years old.
