This AI Paper from Stanford and Harvard Explains Why Most 'Agentic AI' Programs Feel Impressive in Demos and Fail Completely in Real-World Implementation.

Agentic AI systems sit on top of large language models and connect to tools, memory, and external environments. They already support scientific discovery, software development, and clinical research, yet they still struggle with unreliable tool use, weak long-term planning, and poor standardization. A recent survey on agentic AI adaptation from Stanford, Harvard, UC Berkeley, and Caltech proposes a unified view of how these systems should adapt, mapping existing methods into a single, mathematically defined framework.
How does the paper model an agentic AI system?
The survey models an agentic AI system as a foundation-model agent plus three key components. The planning module decomposes goals into a sequence of actions, using static procedures such as Chain-of-Thought and Tree-of-Thought, or dynamic procedures such as ReAct and Reflexion that react to feedback. The tool-use module connects the agent to web search engines, APIs, code execution environments, Model Context Protocol (MCP) servers, and browsers. The memory module stores short-term context and long-term information, which is accessed through retrieval-augmented generation. Adaptation changes the data or parameters of these components using supervised fine-tuning, preference-based methods such as Direct Preference Optimization, reinforcement learning methods such as Proximal Policy Optimization and Group Relative Policy Optimization, and parameter-efficient methods such as Low-Rank Adaptation.
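The three-module decomposition above can be sketched in code. This is a minimal illustration, not the paper's formalism: the class names (`Memory`, `ToolBox`, `Agent`) and the plan-act-observe loop are assumptions chosen to mirror the planning, tool-use, and memory modules described in the text.

```python
from dataclasses import dataclass, field

@dataclass
class Memory:
    """Memory module: short-term context plus a long-term store."""
    short_term: list = field(default_factory=list)   # rolling context
    long_term: dict = field(default_factory=dict)    # queried via retrieval

    def write(self, key, value):
        self.long_term[key] = value

    def read(self, key):
        return self.long_term.get(key)

class ToolBox:
    """Tool-use module: registry mapping tool names to callables
    (search engines, APIs, code execution, MCP servers)."""
    def __init__(self):
        self._tools = {}

    def register(self, name, fn):
        self._tools[name] = fn

    def call(self, name, *args):
        return self._tools[name](*args)

class Agent:
    """Planning module drives a plan -> act -> observe loop."""
    def __init__(self, plan_fn, tools, memory):
        self.plan, self.tools, self.memory = plan_fn, tools, memory

    def run(self, goal):
        results = []
        for tool_name, arg in self.plan(goal):   # static (CoT) or dynamic (ReAct) plan
            out = self.tools.call(tool_name, arg)
            self.memory.short_term.append(out)   # feedback for dynamic planners
            results.append(out)
        return results
```

Adaptation, in the survey's sense, would change either the data these components see or the parameters behind `plan_fn` and the registered tools.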

Four adaptation paradigms
The framework derives four adaptation paradigms by crossing two binary choices. The first dimension is the target: adapting the agent versus adapting the tools. The second dimension is the supervision signal: tool-execution performance versus the agent's final output. This yields A1 and A2 for agent adaptation, and T1 and T2 for tool adaptation.
A1, agent adaptation with tool-execution signals, improves the agent using feedback from how its tool calls perform. A2, agent adaptation with agent-output signals, optimizes the agent using an objective defined only on its final output. T1, agent-agnostic tool adaptation, develops tools without reference to a specific agent. T2, agent-supervised tool adaptation, develops tools under supervision from a frozen agent.
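The 2x2 grid can be written down directly. This is an illustrative encoding of the survey's taxonomy; the dictionary keys are my own labels for the two dimensions, not the paper's notation.

```python
# Adaptation target x supervision signal -> paradigm label.
PARADIGMS = {
    ("agent", "tool_signal"):  "A1",  # agent tuned on tool-execution feedback
    ("agent", "agent_output"): "A2",  # agent tuned on its final output
    ("tool",  "tool_signal"):  "T1",  # tool trained agent-agnostically
    ("tool",  "agent_output"): "T2",  # tool trained under a frozen agent
}

def classify(target: str, signal: str) -> str:
    """Map an adaptation setup onto the survey's four paradigms."""
    return PARADIGMS[(target, signal)]
```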


A1, learning from verifiable tool feedback
In A1, the agent receives an input x, generates a formal tool call a, the tool returns a result y, and the learning objective O_tool measures the tool's success, for example execution accuracy or retrieval quality. The paper covers both supervised imitation of successful tool trajectories and reinforcement learning that uses verified tool results as the reward.
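Both A1 recipes can be sketched with the x, a, y, O_tool notation above. This is a schematic sketch, not any paper's implementation: `policy`, `tool`, and the exact-match form of `o_tool` are placeholder assumptions.

```python
def o_tool(result, gold):
    """Verifiable tool-level objective O_tool, e.g. execution accuracy."""
    return 1.0 if result == gold else 0.0

def collect_sft_data(policy, tool, tasks):
    """Supervised route: keep only trajectories whose tool result
    verifies, then imitate them (Toolformer-style filtering)."""
    kept = []
    for x, gold in tasks:
        a = policy(x)            # formal tool call generated by the agent
        y = tool(a)              # result returned by the tool
        if o_tool(y, gold) == 1.0:
            kept.append((x, a))  # (input, verified tool call) training pair
    return kept

def rl_reward(policy, tool, x, gold):
    """RL route: the verified tool result itself is the scalar reward
    fed to a policy-gradient update."""
    return o_tool(tool(policy(x)), gold)
```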
Toolformer, ToolAlpaca, and Gorilla exemplify A1's supervised methods: each uses the results of real or simulated tool executions to create or filter training traces before fine-tuning. All of them keep the supervision signal defined at the level of tool behavior, not at the level of the final answer.
DeepRetrieval is a representative A1 reinforcement learning example. It frames query rewriting as a Markov decision process where the state is the user's query, the action is the rewritten query, and the reward combines retrieval metrics such as Recall and nDCG, a format term, and, for text-to-SQL, SQL execution accuracy. The policy is trained with standard KL-regularized Proximal Policy Optimization, and the same objective covers literature search, corpus-based question answering, and text-to-SQL.
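The retrieval half of that reward is concrete enough to sketch. Below are standard Recall@k and nDCG@k (binary relevance) implementations; the way they are combined into one scalar, and the equal weights, are illustrative assumptions, not DeepRetrieval's exact formula.

```python
import math

def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant documents found in the top k results."""
    hits = sum(1 for d in retrieved[:k] if d in relevant)
    return hits / len(relevant) if relevant else 0.0

def ndcg_at_k(retrieved, relevant, k):
    """Normalized discounted cumulative gain with binary relevance."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, d in enumerate(retrieved[:k]) if d in relevant)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / ideal if ideal else 0.0

def retrieval_reward(retrieved, relevant, k=10, w_recall=0.5, w_ndcg=0.5):
    """Illustrative scalar reward for the rewritten query's result list."""
    return (w_recall * recall_at_k(retrieved, relevant, k)
            + w_ndcg * ndcg_at_k(retrieved, relevant, k))
```

In the full setup this scalar would be summed with the format term (and, for text-to-SQL, execution accuracy) before the PPO update.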
A2, learning from the agent's final output
A2 covers cases where the optimization objective O_agent depends only on the final output o produced by the agent, even if the agent uses tools internally. The survey notes that supervision on o alone is not enough to teach tool use, because the agent can ignore its tools and still improve the objective. Effective A2 systems therefore combine supervision on tool calls with supervision on final responses, or take sparse rewards such as exact-match accuracy on o and distribute them over the full trajectory.
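The A2 fix described above, spreading a sparse final-answer reward over the trajectory while also supervising tool calls, can be sketched as simple reward shaping. The uniform distribution of credit and the `tool_bonus` weight are illustrative assumptions, not a prescription from the paper.

```python
def exact_match(o, gold):
    """Sparse objective on the final output o only."""
    return 1.0 if o.strip() == gold.strip() else 0.0

def per_step_rewards(trajectory, o, gold, tool_bonus=0.1):
    """Distribute the sparse final reward over every step, plus a
    shaping bonus on tool-call steps so the agent cannot maximize
    the objective while ignoring its tools.

    trajectory: list of (step_kind, payload), step_kind in {'tool', 'text'}.
    """
    final_r = exact_match(o, gold)
    n = len(trajectory)
    rewards = []
    for kind, _payload in trajectory:
        r = final_r / n            # uniform credit for the final answer
        if kind == "tool":
            r += tool_bonus        # direct supervision on tool use
        rewards.append(r)
    return rewards
```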
T1, agent-agnostic tool training
T1 sets the agent aside and develops tools for broad reuse. The objective O_tool depends only on the tools' own output and is measured by metrics such as retrieval accuracy, ranking quality, simulation fidelity, or downstream task success. Search policies trained under A1, such as DeepRetrieval, can later be reused as T1 tools within new agent systems without modifying the host agent.
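The reuse story can be made concrete with a thin adapter. This is a hypothetical sketch: `SearchPolicy` stands in for any tool trained on tool-only metrics (here a trivially "learned" query rewriter over an in-memory index), and `as_tool` shows the agent-agnostic interface, a plain function any agent can call.

```python
class SearchPolicy:
    """Stand-in for a retriever trained against tool-only metrics
    (retrieval accuracy, ranking quality) with no agent in the loop."""
    def __init__(self, rewrite_fn, index):
        self.rewrite = rewrite_fn    # learned query rewriting step
        self.index = index           # toy document collection

    def __call__(self, query):
        q = self.rewrite(query)
        return [doc for doc in self.index if q in doc]

def as_tool(policy):
    """Agent-agnostic adapter: query string in, ranked documents out.
    Any host agent can register this callable without retraining."""
    return lambda query: policy(query)
```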
T2, tools trained under a frozen agent
T2 assumes a capable but frozen agent A, the normal case when the agent is a closed-source base model. The agent issues tool calls, and the tool returns results that the agent uses to generate o. The optimization objective still lives in O_agent, but the trainable parameters belong to the tool. The paper describes supervised, instruction-based, and reinforcement learning variants that all derive learning signals from the final agent output.
The survey treats long-term memory as a special case of T2. Memory is an external store that is written and read by learned operations while the agent remains frozen. Recent T2 systems include s3, which trains a 7-billion-parameter searcher to maximize a Gain Beyond RAG reward defined by a frozen generator, and AgentFlow, which trains a planner to orchestrate frozen Qwen2.5-based modules using Flow-GRPO.
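The Gain Beyond RAG idea credited to s3 can be sketched as a difference of accuracies. This is a hedged paraphrase, not s3's implementation: `frozen_generate` is a placeholder for the frozen generator LLM, and the substring-containment `accuracy` is a toy stand-in for the real evaluator.

```python
def accuracy(answer, gold):
    """Toy answer-quality check; a real system would use an evaluator."""
    return 1.0 if gold in answer else 0.0

def gain_beyond_rag(frozen_generate, question, gold,
                    searcher_docs, naive_rag_docs):
    """Reward for the trainable searcher: how much the frozen
    generator's accuracy improves over a naive-RAG retrieval baseline.
    The generator's parameters never change; only the searcher learns."""
    with_searcher = accuracy(frozen_generate(question, searcher_docs), gold)
    with_naive = accuracy(frozen_generate(question, naive_rag_docs), gold)
    return with_searcher - with_naive
```

Note the T2 signature: the objective is defined entirely by the frozen agent's output, while the gradient flows only into the tool (here, the searcher that produced `searcher_docs`).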


Key Takeaways
- The survey defines a precise framework of four adaptation paradigms for agentic AI by crossing two dimensions: whether adaptation targets the agent or the tools, and whether the supervision signal comes from tool execution or from the agent's final output.
- A1 methods such as Toolformer, ToolAlpaca, Gorilla, and DeepRetrieval adapt the agent directly from verifiable tool feedback, including retrieval metrics, SQL execution accuracy, and code execution results, often optimized with standard KL-regularized Proximal Policy Optimization.
- A2 methods improve the agent from signals on its final output, for example answer accuracy, and the paper shows that systems should still supervise tool calls or distribute sparse rewards along full trajectories, since otherwise the agent can ignore tools while still improving the objective.
- T1 and T2 shift learning into tools and memory: T1 trains generally useful retrievers, plugins, and simulators with no specific agent in mind, while T2 adapts tools under a frozen agent, as in s3 and AgentFlow, where a fixed generator supervises the learned searcher or planner.
- The authors position adaptation choices along axes such as monolithic versus modular control, and argue that future systems will combine occasional A1 or A2 updates of a strong base model with continual T1 and T2 adaptation of components such as retrieval policies, simulators, and long-term memory.
Check out the Paper and the GitHub repo for further details.

Michal Sutter is a data science expert with a Master of Science in Data Science from the University of Padova. With a strong foundation in statistical analysis, machine learning, and data engineering, Michal excels at turning complex data sets into actionable insights.



