StepFun AI Introduces Step-DeepResearch: A Cost-Efficient Deep Research Agent Model Built on Atomic Capabilities

StepFun has launched Step-DeepResearch, a 32B-parameter end-to-end deep research agent that aims to turn web search into a real research workflow with long-horizon reasoning, tool use and structured reporting. The model is built on Qwen2.5-32B-Base and is trained to act as a single agent that plans, checks sources, verifies evidence and writes cited reports, while keeping inference cost low.
From Research to Deep Research
Most existing web agents are evaluated on multi-hop question-answering benchmarks, where they try to match ground-truth answers to short questions. This is closer to targeted retrieval than to actual research. Deep research tasks are different: they involve implicit intent recognition, long-horizon decision making, flexible tool use, structured reasoning and multi-source verification under uncertainty.
Step-DeepResearch recasts this as sequential decision making over a compact set of atomic capabilities. The research team describes 4 atomic skills: planning and task decomposition, deep information seeking, reflection and verification, and professional report generation. Instead of orchestrating multiple external agents, the system internalizes this loop into a single model that decides the next action at each step.
Data Synthesis Around Atomic Capabilities
To teach these atomic skills, the research team created separate data pipelines for each skill. For planning, they start from high-quality technical reports, research papers and financial analysis documents, reverse-engineer realistic research plans and task trees from their topics, outlines and structure, and generate trajectories that follow these plans. This exposes the model to long-horizon project structures, not just short-hop query templates.
For deep information seeking, they build graph-based queries on top of knowledge graphs such as Wikidata5m and CN-DBpedia. They sample subgraphs, expand them with search, and compose queries that require multi-hop reasoning across entities and documents. A separate pipeline uses Wiki-style link traversal to force retrieval across distinct documents and combination of evidence. Simple questions that a strong model can already solve with a plain ReAct-style strategy are filtered out, so training focuses on hard search problems.
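As a rough illustration of this graph-based synthesis, the sketch below samples a relation path from a toy triple store and composes a nested question that hides the bridge entities, so answering requires multi-hop retrieval. The triples, relation names and helper functions are illustrative stand-ins, not StepFun's pipeline.

```python
import random

# Toy knowledge graph: (head, relation, tail) triples, standing in for
# Wikidata5m / CN-DBpedia. All entities and relations are illustrative.
TRIPLES = [
    ("Marie Curie", "born_in", "Warsaw"),
    ("Warsaw", "capital_of", "Poland"),
    ("Marie Curie", "awarded", "Nobel Prize in Physics"),
    ("Nobel Prize in Physics", "first_awarded", "1901"),
]

def sample_path(triples, start, hops):
    """Walk up to `hops` edges from `start`, collecting the triples on the path."""
    path, node = [], start
    for _ in range(hops):
        out = [t for t in triples if t[0] == node]
        if not out:
            break
        t = random.choice(out)
        path.append(t)
        node = t[2]
    return path

def compose_question(path):
    """Nest the relations so the intermediate (bridge) entities never appear
    in the question text; the final tail entity is the gold answer."""
    q = path[0][0]
    for _, rel, _ in path:
        q = f"the entity that '{q}' has relation '{rel}' to"
    return f"What is {q}?", path[-1][2]

random.seed(0)
question, answer = compose_question(sample_path(TRIPLES, "Marie Curie", 2))
print(question)
print("gold answer:", answer)
```

A real pipeline would then filter out any question a strong baseline already answers with a single search, keeping only the genuinely multi-hop cases.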
Reflection and verification data are generated through self-correction loops and multi-agent teacher traces. Teacher agents make claims, plan checks, verify facts, re-plan when inconsistencies arise and only then write reports. The resulting trajectories are cleaned and used as single-agent supervision for the learner. Report generation is trained in 2 phases: intermediate training on domain style and depth using query-report pairs, then fine-tuning under strict formatting and citation-consistency constraints.
Continual Training on Qwen2.5-32B-Base
The training pipeline consists of 3 stages: agentic mid-training, supervised fine-tuning and reinforcement learning. In mid-training phase 1, the team injects atomic capabilities without tools, using a context length of up to 32k tokens. The data includes agentic question answering, interleaved reasoning sequences, summarization and reflection. The research team shows continued gains on SimpleQA, TriviaQA and FRAMES as training reaches about 150B tokens, with the largest gains on FRAMES, which stresses structured reasoning.
In phase 2, the context extends to 128k tokens and explicit tool calls are introduced. The model learns tasks such as URL-grounded question answering, deep web search, long-document summarization and long conversational reasoning. This phase aligns the model with real research settings where search, browsing and analysis must be combined in a single process.
During supervised fine-tuning, the 4 atomic capabilities are composed into full deep search and deep research trajectories. Data cleaning keeps trajectories correct and economical in steps and tool calls. The pipeline injects controlled tool errors followed by corrections to improve robustness, and enforces citation formats so that reports are always grounded in returned sources.
Reinforcement learning then refines the agent in a real tool environment. The research team constructs tasks with accompanying checklists and trains a checklist-style Rubrics Judge to score reports against these criteria. The reward design maps the ternary rubric labels into asymmetric binary rewards that capture both positive and negative criteria. The policy is trained with PPO and a learned critic, using a standard return with a discount factor close to 1 so that long trajectories are not heavily discounted.
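The label-to-reward mapping can be pictured with a small sketch. The label names ("met", "unmet", "violated") and the weighting below are assumptions for illustration, not the paper's exact scheme.

```python
def rubric_reward(labels, neg_weight=1.0):
    """Map ternary rubric labels to a scalar reward.

    Each checklist item is judged "met", "unmet" or "violated" (hypothetical
    label names). Met positive criteria add reward; violated negative
    criteria subtract it, so the agent is pushed both toward required
    content and away from forbidden behavior.
    """
    pos = sum(1 for label in labels if label == "met")
    neg = sum(1 for label in labels if label == "violated")
    total = len(labels)
    return (pos - neg_weight * neg) / total if total else 0.0

# 2 met, 1 unmet, 1 violated -> (2 - 1) / 4 = 0.25
print(rubric_reward(["met", "met", "unmet", "violated"]))
```

The asymmetry comes from `neg_weight`: raising it penalizes violations more than it rewards satisfied criteria, which matches the stated goal of capturing both positive and negative targets.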
Single-Agent ReAct Architecture and Search Stack
At inference time, Step-DeepResearch operates as a single ReAct-style agent that interleaves reasoning, tool calls and observations until it decides to emit a report. The toolset includes batched web search, a todo manager, shell commands and file operations. Execution runs in a sandbox with terminal persistence via tmux. A vision-oriented browser deduplicates redundant page captures using visual hashing. Tools for document parsing, audio transcription and image analysis support multi-modal input.
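The control flow described above can be sketched as a minimal single-agent ReAct loop. The `llm` callable, the tool registry and the message format here are hypothetical stand-ins, not StepFun's API.

```python
def react_loop(llm, tools, task, max_steps=50):
    """Interleave model steps, tool calls and observations until the
    agent emits a final report or runs out of steps.

    `llm(history)` is assumed to return a dict like
    {"thought": ..., "action": tool_name, "args": {...}};
    the action "final_report" terminates the loop.
    """
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        step = llm(history)                        # model emits thought + action
        history.append({"role": "assistant", "content": step})
        if step["action"] == "final_report":
            return step["args"]["report"]          # agent decides to finish
        tool = tools[step["action"]]               # e.g. web_search, shell, todo
        observation = tool(**step["args"])
        history.append({"role": "tool", "content": observation})
    return None
```

Every tool result re-enters the history as an observation, so the single model sees the full trajectory and decides the next action at each step, with no external orchestrator.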
Information seeking draws on 2 complementary resources. The StepFun team states that the Search API is backed by more than 20M high-quality articles and 600 premium indexes. The team also describes a selective authority-targeting strategy that curates more than 600 trusted domains, including government, academic and institutional sites. Retrieval operates at the category level and uses authority-aware ranking so that the most trusted domains are preferred when relevance is comparable.
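One plausible reading of authority-aware ranking is a re-rank that buckets results by relevance and breaks ties by domain trust. The tiers, bucketing granularity and example URLs below are assumptions for illustration.

```python
# Hypothetical authority tiers keyed by top-level domain suffix; a real
# system would look up a curated list of 600+ trusted domains instead.
AUTHORITY = {"gov": 3, "edu": 3, "org": 2}

def authority_rank(results):
    """results: list of (url, relevance) pairs.

    Sort by a coarse relevance bucket first, then by the authority tier of
    the domain suffix, so trusted sources win when relevance is comparable.
    """
    def key(item):
        url, relevance = item
        tier = AUTHORITY.get(url.rsplit(".", 1)[-1], 0)
        return (round(relevance, 1), tier)   # bucket relevance, break ties by authority
    return sorted(results, key=key, reverse=True)

ranked = authority_rank([
    ("news.example.com", 0.82),
    ("stats.census.gov", 0.81),
])
print(ranked[0][0])  # the .gov source wins the near-tie
```

Rounding relevance to one decimal place makes 0.82 and 0.81 land in the same bucket, so the authority tier decides the order; finer buckets would make authority matter less.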
File tools support patch-based editing, so the agent can revise only the changed parts of a report. A summarized-information storage mechanism writes full tool results to local files and keeps only concise summaries in the context. This acts as external memory and avoids context overflow in long projects.
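A minimal sketch of this summarized-result storage idea, with hypothetical class and method names:

```python
import pathlib
import tempfile

class SummaryMemory:
    """Dump full tool output to disk; return only a short stub for the
    agent's context. Acts as external memory for long projects."""

    def __init__(self, workdir):
        self.workdir = pathlib.Path(workdir)
        self.count = 0

    def store(self, tool_name, full_output, summarize):
        """Write `full_output` to a file; return the context-sized stub."""
        self.count += 1
        path = self.workdir / f"{tool_name}_{self.count}.txt"
        path.write_text(full_output)
        # Only the summary and the file path enter the agent's context;
        # the agent can re-read the file later if it needs the details.
        return {"summary": summarize(full_output), "file": str(path)}

mem = SummaryMemory(tempfile.mkdtemp())
stub = mem.store("web_search", "x" * 10000, summarize=lambda s: s[:50] + "...")
```

A 10,000-character tool result enters the context as a 53-character summary plus a path, and the file-operation tools let the agent pull the full text back in on demand.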
Evaluation, Costs and Access
To measure deep research behavior, the team introduces ADR-Bench, a Chinese benchmark with 110 realistic tasks across 9 domains. 70 tasks cover general fields such as education, science and engineering, and social life, and are peer-reviewed by experts. The other 40 tasks, in finance and law, come with explicit rubrics that encode atomic-capability and evidence constraints.
On Scale AI Research Rubrics, Step-DeepResearch achieves 61.42% rubric compliance, comparable to OpenAI-DeepResearch and Gemini-DeepResearch, and clearly ahead of many open and proprietary baselines. On ADR-Bench, expert-based Elo ratings show that the 32B model outperforms larger open models such as MiniMax-M2, GLM-4.6 and DeepSeek-V3.2, and competes with systems such as Kimi-Researcher and MiniMax-Agent-Pro.
Key Takeaways
- Single-agent, atomic-capability design: Step-DeepResearch is a 32B-parameter single agent built on Qwen2.5-32B-Base that internalizes 4 atomic skills, planning, deep information seeking, reflection and verification, and professional report generation, instead of relying on multiple external agents.
- Targeted data synthesis per skill: The research team built separate data pipelines for planning, deep information seeking, reflection and report writing, using plans reverse-engineered from real reports, graph-based queries over Wikidata5m and CN-DBpedia, multi-agent teacher traces and strictly formatted report data.
- Three-stage training with long context and RL: Training combines agentic mid-training, supervised fine-tuning and reinforcement learning, with mid-training on about 150B tokens at 32k and then 128k context; SFT covers full deep research trajectories, and PPO-based RL with a Rubrics Judge scores reports against checklist-style rubrics.
- ReAct architecture with selective search and external memory: At runtime the model runs a ReAct loop over tools for batched web search, browsing, shell and file operations, uses a Search API backed by more than 20M documents, 600 premium indexes and 600+ trusted domains, and relies on patch-based editing and summarized result storage as external memory.
- Competitive quality at low cost: On Scale AI Research Rubrics the model achieves 61.42% rubric compliance, competitive with OpenAI-DeepResearch and Gemini-DeepResearch, and on ADR-Bench it reaches a win-or-tie rate of 67.1% against strong baselines.
Check out the Paper and the Repo.


