5 Small Language Models for Agentic Tool Calling

# Introduction
Agentic AI systems rely on a model's ability to reliably drive tools: selecting the correct tool for the task, formatting arguments correctly, and integrating results into multi-step workflows. Larger frontier models like ChatGPT, Claude, and Gemini handle this well, but they come with trade-offs in cost, latency, and hardware requirements that make them impractical for many real-world applications. Smaller language models have done a lot to close that gap, and several compact, open-source options now offer first-class tool-calling support without the need for a data center to run them.
And now, in no particular order, here are five small language models for agentic tool calling. Note that, for simplicity and consistency, all model links point to the models hosted on Hugging Face.
# 1. SmolLM3-3B
| Technical Feature | Details |
|---|---|
| Parameters | 3B |
| Architecture | Decoder-only transformer (GQA + NoPE, 3:1 ratio) |
| Context Length | 64K native; up to 128K with YaRN extrapolation |
| Training Tokens | 11.2T |
| Multilingual Support | 6 languages (EN, FR, ES, DE, IT, PT) |
| Reasoning Mode | Dual mode (thinking / non-thinking toggle) |
| Tool Calling | Yes: JSON/XML (xml_tools) and Python (python_tools) |
| License | Apache 2.0 |
SmolLM3 is a 3B parameter language model designed to push the boundaries of small models, supporting dual-mode reasoning, 6 languages, and long context. It is a decoder-only transformer using Grouped Query Attention (GQA) and No Positional Embeddings (NoPE) in a 3:1 ratio, pre-trained on 11.2T tokens with a staged curriculum of web, code, math, and reasoning data. Post-training included a mid-training phase on 140B reasoning tokens, followed by supervised fine-tuning and alignment with Anchored Preference Optimization (APO), Hugging Face's off-policy preference alignment method. The model supports two different interfaces for tool calling, JSON/XML blocks with xml_tools and Python-style function calls with python_tools, making it flexible for agent pipelines and RAG systems. As a fully open release, including weights, datasets, and training code, SmolLM3 is well suited for chat, RAG, and code-assistant applications on constrained hardware such as edge devices or low-VRAM machines.
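To make the JSON/XML interface concrete, here is a minimal sketch using the Hugging Face transformers library. The xml_tools and enable_thinking arguments follow the usage described above, and the get_weather schema is a made-up example; check the SmolLM3 model card for the exact template arguments supported by your release.

```python
# Minimal sketch: SmolLM3 tool calling via the chat template's xml_tools path.
# Assumption: the SmolLM3 chat template accepts `xml_tools` and `enable_thinking`
# keyword arguments as described above; the tool schema is a made-up example.
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "HuggingFaceTB/SmolLM3-3B"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")

# One tool described as a JSON schema (hypothetical example tool).
tools = [{
    "name": "get_weather",
    "description": "Get the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}]

messages = [{"role": "user", "content": "What's the weather in Lagos right now?"}]

# Render the prompt with the tool definitions injected by the chat template.
input_ids = tokenizer.apply_chat_template(
    messages,
    xml_tools=tools,            # JSON/XML tool interface named above
    enable_thinking=False,      # non-thinking mode for a direct tool call
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=256)
# The model should emit a structured tool-call block for the agent loop to parse.
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=False))
```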
# 2. Qwen3-4B-Instruct-2507
| Technical Feature | Details |
|---|---|
| Parameters | 4.0B (3.6B non-embedding) |
| Architecture | Causal LM, 36 layers, GQA (32 Q heads / 8 KV heads) |
| Context Length | 262,144 tokens (native) |
| Reasoning Mode | Non-thinking only (no <think> blocks) |
| Multilingual Support | 100+ languages |
| Tool Calling | Yes: native, via Qwen-Agent / MCP |
| License | Apache 2.0 |
Qwen3-4B-Instruct-2507 is an updated version of Qwen3-4B in non-thinking mode, showing significant improvements in general capabilities, including instruction following, logical reasoning, text comprehension, math, science, coding, and tool use. It also offers substantially better coverage of long-tail knowledge across multiple languages. Both the Instruct and Thinking variants share 4 billion parameters (3.6B non-embedding) across 36 transformer layers, using GQA with 32 query heads and 8 key/value heads, which keeps memory usage manageable over very long contexts. The non-thinking variant is optimized for fast, direct use cases, delivering concise responses without explicit reasoning traces, making it well suited for chatbots, customer support, and tool-calling agents where low latency matters. Qwen3 stands out for its tool-calling capabilities, and Alibaba recommends the Qwen-Agent framework, which bundles tool-calling templates and parsers internally to reduce coding complexity, with support for MCP server configuration files.
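As an illustration of the recommended Qwen-Agent path, the sketch below wires the model (served behind an OpenAI-compatible endpoint such as vLLM) to an MCP server plus a built-in tool. The endpoint URL, MCP server command, and tool list are assumptions for illustration only; consult the Qwen-Agent documentation for the configuration your version expects.

```python
# Minimal sketch: driving Qwen3-4B-Instruct-2507 through Qwen-Agent with MCP tools.
# The endpoint, MCP server, and tool list below are illustrative assumptions.
from qwen_agent.agents import Assistant

llm_cfg = {
    "model": "Qwen3-4B-Instruct-2507",
    "model_server": "http://localhost:8000/v1",  # e.g. a local vLLM server
    "api_key": "EMPTY",
}

# Tools: one MCP server config plus Qwen-Agent's built-in code interpreter.
tools = [
    {"mcpServers": {
        "time": {"command": "uvx", "args": ["mcp-server-time"]},
    }},
    "code_interpreter",
]

bot = Assistant(llm=llm_cfg, function_list=tools)

messages = [{"role": "user", "content": "What time is it in Berlin right now?"}]

# Qwen-Agent streams intermediate steps: tool calls, tool results, final answer.
responses = []
for responses in bot.run(messages=messages):
    pass
print(responses)  # full message trace, ending with the model's final reply
```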
# 3. Phi-3-mini-4k-instruct
| Technical Feature | Details |
|---|---|
| Parameters | 3.8B |
| Architecture | Decoder-only transformer |
| Context Length | 4K tokens |
| Vocabulary Size | 32,064 tokens |
| Training Data | Synthetic + filtered public web data |
| Post-training | SFT + DPO |
| Tool Calling | Yes: via chat template (requires HF transformers ≥ 4.41.2) |
| License | MIT |
Phi-3-Mini-4K-Instruct is a 3.8B parameter, lightweight, state-of-the-art open model trained on the Phi-3 datasets, which include both synthetic data and filtered publicly available web data, with a focus on high-quality, reasoning-dense properties. The model underwent a post-training process that included both Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) for instruction following and safety. Microsoft positioned Phi-3-mini as small but mighty, and it was notable at launch for its ability to run on-device, including on smartphones, while competing with GPT-3.5 on benchmarks. The model is intended primarily for memory- and compute-constrained environments, latency-bound scenarios, and tasks that require strong reasoning, especially math and logic. Although it is older than the other models on this list and limited to a 4K context window, the MIT license makes it one of the most permissively licensed options available, and its strong general reasoning has made it a popular base for fine-tuning in commercial applications.
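Phi-3-mini has no dedicated tool-call tokens, so in practice tool use goes through the chat template plus prompt conventions: describe the tool in the system message and parse a structured call out of the reply. The sketch below shows that generic pattern with transformers (4.41.2 or later for the chat template); the get_stock_price tool and the JSON reply convention are assumptions for illustration, not an official Phi-3 interface.

```python
# Minimal sketch: prompt-driven tool calling with Phi-3-mini-4k-instruct.
# The system-prompt convention and JSON parsing are a generic pattern,
# not an official Phi-3 interface; get_stock_price is a hypothetical tool.
import json
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="microsoft/Phi-3-mini-4k-instruct",
    device_map="auto",
)

system = (
    "You can call one tool: get_stock_price(ticker: str). "
    "To call it, reply with only a JSON object such as "
    '{"tool": "get_stock_price", "args": {"ticker": "NVDA"}}.'
)
messages = [
    {"role": "system", "content": system},
    {"role": "user", "content": "How is NVIDIA's stock doing today?"},
]

# With chat-format input, the pipeline returns the conversation with the
# model's reply appended as the final message.
result = generator(messages, max_new_tokens=128)
reply = result[0]["generated_text"][-1]["content"]

try:
    call = json.loads(reply.strip())
    print("Tool requested:", call["tool"], "with args", call["args"])
except json.JSONDecodeError:
    print("Model answered directly:", reply)
```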
# 4. Gemma-3n-E2B-it
| Technical Feature | Details |
|---|---|
| Effective Parameters | 2.3B (5.1B total with embeddings) |
| Architecture | Dense, mixed attention (sliding window + global) + PLE |
| Layers | 35 |
| Sliding Window | 512 tokens |
| Context Length | 128K tokens |
| Vocabulary Size | 262K |
| Modalities | Text, Image, Audio (≤30 sec), Video (as frames) |
| Multilingual Support | 35+ languages natively; trained on 140+ |
| Tool Calling | Yes: native function calling |
| License | Apache 2.0 |
Gemma 3n E2B is part of Google DeepMind's Gemma 3n family, which combines local sliding-window attention with full global attention in a hybrid attention mechanism. This design delivers the processing speed and low memory footprint of a lightweight model without sacrificing the long-range awareness needed for complex, long-context tasks. The "E" in E2B stands for "effective" parameters, enabled by an architectural innovation called Per-Layer Embeddings (PLE), which gives each decoder layer its own dedicated embeddings that can be kept out of accelerator memory. This approach allows E2B to run in less than 1.5 GB of memory at inference time while still producing strong results. The model supports native function calling, enabling agentic workflows, and is optimized for on-device use on mobile and IoT hardware, handling text, image, audio, and video input. Released under Apache 2.0 (a change from the more restrictive custom Gemma license used for earlier releases), Gemma 3n E2B is an attractive option for developers building multimodal agent applications that run entirely at the edge.
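To show the multimodal side, here is a minimal sketch that sends a mixed image-and-text prompt through the transformers image-text-to-text pipeline. The checkpoint id, image URL, and pipeline arguments are assumptions based on how recent Gemma releases are typically exposed in transformers; check the model card for the exact snippet and hardware guidance.

```python
# Minimal sketch: a multimodal prompt to Gemma 3n E2B via transformers.
# The checkpoint id and image URL below are illustrative assumptions.
import torch
from transformers import pipeline

pipe = pipeline(
    "image-text-to-text",
    model="google/gemma-3n-E2B-it",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/receipt.jpg"},  # placeholder image
            {"type": "text", "text": "List the line items on this receipt as JSON."},
        ],
    }
]

# The pipeline returns the conversation with the model's reply appended last.
output = pipe(text=messages, max_new_tokens=256)
print(output[0]["generated_text"][-1]["content"])
```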
# 5. Mistral-7B-Instruct-v0.3
| Technical Feature | Details |
|---|---|
| Parameters | 7.25B |
| Architecture | Transformer, GQA + SWA |
| Context Length | 32,768 tokens |
| Vocabulary Size | 32,768 tokens (extended from v0.2) |
| Tokenizer | v3 Mistral tokenizer |
| Tool Calling | Yes: with TOOL_CALLS / AVAILABLE_TOOLS / TOOL_RESULTS tokens |
| License | Apache 2.0 |
Mistral-7B-Instruct-v0.3 is an instruction fine-tuned version of Mistral-7B-v0.3, which introduces three important changes over v0.2: a vocabulary extended to 32,768 tokens, support for the v3 tokenizer, and support for function calling. The model uses Grouped Query Attention (GQA) for faster inference and Sliding Window Attention (SWA) to handle long sequences efficiently, and function calling is made possible by the extended vocabulary, which includes the dedicated tokens TOOL_CALLS, AVAILABLE_TOOLS, and TOOL_RESULTS. As the largest model on this list at 7B parameters, Mistral-7B-Instruct-v0.3 offers excellent instruction-following performance for its class and has become an industry workhorse, widely available through Ollama, vLLM, and many other inference platforms.
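A minimal sketch of that flow with transformers is shown below: a plain Python function is handed to the chat template's tools argument, and the model is expected to answer with a [TOOL_CALLS] payload that an agent loop would parse and execute. The get_current_weather stub is illustrative; the Hugging Face model card documents the full round trip, including feeding tool results back to the model.

```python
# Minimal sketch: Mistral-7B-Instruct-v0.3 function calling via the chat template.
# get_current_weather is an illustrative stub standing in for a real tool.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-Instruct-v0.3"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")


def get_current_weather(location: str, unit: str = "celsius") -> str:
    """Get the current weather for a location.

    Args:
        location: The city and country, e.g. "Paris, France".
        unit: Temperature unit, either "celsius" or "fahrenheit".
    """
    return "22 degrees"  # stub; a real agent would call a weather API here


conversation = [{"role": "user", "content": "What's the weather like in Paris?"}]

# The chat template serializes the tool schema into the [AVAILABLE_TOOLS] section.
inputs = tokenizer.apply_chat_template(
    conversation,
    tools=[get_current_weather],
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=256)
# Expect a [TOOL_CALLS] block naming get_current_weather with its arguments.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=False))
```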
# Wrapping up
The five models covered here (SmolLM3-3B, Qwen3-4B-Instruct-2507, Phi-3-mini-4k-instruct, Gemma-3n-E2B-it, and Mistral-7B-Instruct-v0.3) span a range of architectures, parameter counts, and context windows, but they all share a key trait: openly released weights with first-class tool-calling support.
From Hugging Face's fully transparent SmolLM3 to Google DeepMind's multimodal, edge-optimized Gemma 3n E2B, the selection shows that capable agentic models no longer require frontier-scale infrastructure to deploy. Whether your priority is on-device reasoning, long-context handling, multilingual use, or the most permissive license possible, there is a model on this list worth checking out.
Keep in mind that these are not the only small language models with tool-calling capabilities. They do, however, represent the ones I have direct experience with and feel comfortable making generalizations about based on my own results.
Matthew Mayo (@mattmayo13) has a master's degree in computer science and a diploma in data mining. As managing editor of KDnuggets & Statology, and contributing editor to Machine Learning Mastery, Matthew aims to make complex data science concepts accessible. His professional interests include natural language processing, language models, machine learning algorithms, and exploring emerging AI. He is driven by a mission to democratize knowledge in the data science community. Matthew has been coding since he was 6 years old.



