Microsoft Releases Agent Lightning: A New AI Framework that Enables Reinforcement Learning (RL)-Based Training of LLMs for Any AI Agent

How do you turn real agent traces into reinforcement learning (RL) transitions that improve policy LLMs without changing your existing agent stack? The Microsoft AI team releases Agent Lightning to help optimize multi-agent systems. Agent Lightning is an open framework that brings RL-based training to any AI agent without rewrites. It separates training from execution, defines a unified trace format, and introduces LightningRL, a hierarchical method that converts complex agent runs into transitions that standard single-turn RL trainers can consume.
What does Agent Lightning do?
It frames agent execution as a decision process. The agent is modeled as a Markov decision process in which the observation is the current input to the policy LLM, the action is a model call, and the reward can be terminal or intermediate. Each rollout extracts only the calls made by the policy model, together with their inputs, outputs, and rewards. This strips away framework noise and yields clean training transitions.
LightningRL performs credit assignment across multi-step episodes, then optimizes the policy with a single-turn RL objective. The research team describes compatibility with single-turn RL methods. In practice, teams can use trainers that implement PPO or GRPO, such as VeRL, which fit this interface.
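To make the decomposition concrete, here is a minimal sketch of how a multi-step episode of model calls could be flattened into single-turn (prompt, response, reward) transitions. This is not Agent Lightning's actual API, and the uniform spreading of the terminal reward is only a stand-in for LightningRL's credit assignment.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ModelCall:
    """One LLM invocation captured from an agent rollout."""
    prompt: str
    response: str

@dataclass
class Transition:
    """A single-turn training example an RL trainer (e.g. PPO/GRPO) can consume."""
    prompt: str
    response: str
    reward: float

def episode_to_transitions(calls: List[ModelCall], final_reward: float) -> List[Transition]:
    """Naive credit assignment: give every policy call the episode's terminal reward.
    LightningRL's actual scheme is more sophisticated; this only illustrates the
    shape of the data handed to a single-turn RL trainer."""
    return [Transition(c.prompt, c.response, final_reward) for c in calls]

# Example: a two-step agent run that ends with reward 1.0
episode = [
    ModelCall("Write SQL for: total sales per region", "SELECT region, SUM(sales) ..."),
    ModelCall("Fix the query; the column is named `amount`", "SELECT region, SUM(amount) ..."),
]
for t in episode_to_transitions(episode, final_reward=1.0):
    print(t.reward, t.prompt[:45], "->", t.response[:30])
```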

System architecture
Agent Lightning uses Training Agent Disaggregation. The Lightning Server handles training and serving, and exposes an OpenAI-like API for the updated model. The Lightning Client runs the agent runtime where it already lives, captures traces of prompts, tool calls, and rewards, and streams them back to the server. This keeps tools, browsers, shells, and other dependencies close to production while GPU training stays on the server side.
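A hedged sketch of what the client side could look like: the agent keeps calling a standard OpenAI-compatible chat completions route, with only the base URL pointed at the training server. The URL, served model name, and the omission of the trace-reporting call are illustrative assumptions, not the framework's documented API.

```python
import requests

SERVER = "http://localhost:8000/v1"  # assumed Lightning Server address (illustrative)

def chat(prompt: str) -> str:
    """Call the policy model through an OpenAI-compatible /chat/completions route."""
    resp = requests.post(
        f"{SERVER}/chat/completions",
        json={
            "model": "current-policy",  # served model name is an assumption
            "messages": [{"role": "user", "content": prompt}],
        },
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

answer = chat("Write a SQL query that counts orders per customer.")
print(answer)
# The client would also report traces and rewards back to the server;
# the exact reporting endpoint is framework-specific and omitted here.
```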


The runtime supports two trace modes. The default path uses OpenTelemetry spans, so you can capture agent telemetry with standard collectors. There is also a lightweight embedded tracer for teams that do not want to set up OpenTelemetry. Both paths feed the same training store.
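Because the default path rides on standard OpenTelemetry, instrumenting a model call looks like ordinary span creation. The attribute keys and the console exporter below are illustrative; the framework defines its own span schema and export targets.

```python
# pip install opentelemetry-sdk
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

# Minimal tracer setup that prints spans; a real deployment would export to a collector.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("agent-demo")

def traced_llm_call(prompt: str) -> str:
    """Wrap a model call in a span carrying prompt, response, and reward metadata."""
    with tracer.start_as_current_span("llm_call") as span:
        span.set_attribute("llm.prompt", prompt)    # attribute keys are assumptions
        response = "SELECT 1;"                      # stand-in for a real model call
        span.set_attribute("llm.response", response)
        span.set_attribute("reward", 0.0)           # intermediate reward placeholder
        return response

traced_llm_call("Write a trivial SQL query.")
```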


Unified data interface
Agent Lightning records each model call and each tool call as a span with input, output, and metadata. The algorithm layer then adapts these spans into ordered (prompt, response, reward) triples. This selective extraction lets you optimize one agent within a multi-agent workflow, or several agents at once, without touching orchestration code. The same mechanism can also drive automatic prompt optimization or supervised fine-tuning.
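A minimal sketch of the selective extraction idea: keep only model-call spans from the agents you want to optimize, then flatten them into (prompt, response, reward) records. The span fields and agent names are made up for illustration and are not the framework's schema.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Span:
    agent: str                  # which agent emitted this call
    kind: str                   # "model_call" or "tool_call"
    input: str
    output: str
    reward: Optional[float] = None

def extract_training_records(spans: List[Span], optimize_agents: set) -> list:
    """Keep model-call spans from the agents being optimized; tool calls and
    other agents stay in the trace but do not become training data."""
    return [
        {"prompt": s.input, "response": s.output, "reward": s.reward or 0.0}
        for s in spans
        if s.kind == "model_call" and s.agent in optimize_agents
    ]

trace_spans = [
    Span("writer", "model_call", "Question: top 5 products", "SELECT name ...", None),
    Span("writer", "tool_call", "EXECUTE SQL", "error: missing GROUP BY", None),
    Span("rewriter", "model_call", "Fix: missing GROUP BY", "SELECT name ... GROUP BY name", 1.0),
    Span("checker", "model_call", "Verify the query", "looks correct", None),
]
# Optimize only the writer and rewriter, as in the text-to-SQL experiment.
print(extract_training_records(trace_spans, {"writer", "rewriter"}))
```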


Experiments and datasets
The research team reports three tasks. For text to SQL, the team uses the Spider benchmark. Spider contains more than 10,000 questions over 200 databases spanning 138 domains. The policy model is Llama 3.2 3B Instruct. The implementation uses LangChain with a SQL writer agent, a rewriter agent, and a checker. The writer and the rewriter are optimized, while the checker is kept fixed. Rewards improve steadily during both training and testing.
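The article does not reproduce the exact reward definition, but a common reward for text to SQL is execution accuracy: run the predicted query and the gold query against the same database and reward a match. A hedged sketch using SQLite, not necessarily the reward used in the Agent Lightning experiments:

```python
import sqlite3

def execution_match_reward(pred_sql: str, gold_sql: str, db_path: str) -> float:
    """Return 1.0 if the predicted query yields the same rows as the gold query,
    0.0 if it differs or fails to execute. A generic Spider-style reward sketch."""
    conn = sqlite3.connect(db_path)
    try:
        try:
            pred_rows = set(map(tuple, conn.execute(pred_sql).fetchall()))
        except sqlite3.Error:
            return 0.0
        gold_rows = set(map(tuple, conn.execute(gold_sql).fetchall()))
        return 1.0 if pred_rows == gold_rows else 0.0
    finally:
        conn.close()

# Tiny self-contained demo database
conn = sqlite3.connect("demo.db")
conn.execute("CREATE TABLE IF NOT EXISTS sales (region TEXT, amount REAL)")
conn.execute("DELETE FROM sales")
conn.executemany("INSERT INTO sales VALUES (?, ?)", [("EU", 10.0), ("US", 20.0)])
conn.commit()
conn.close()

print(execution_match_reward(
    "SELECT region, SUM(amount) FROM sales GROUP BY region",
    "SELECT region, SUM(amount) FROM sales GROUP BY region",
    "demo.db",
))  # 1.0
```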


For retrieval augmented generation, the setup uses the MuSiQue benchmark and a Wikipedia-scale index with about 21 million documents. The retriever uses BGE embeddings with cosine similarity. The agent is built with the OpenAI Agents SDK. The reward is a weighted sum of a format score and an F1 correctness score. Reward curves show stable gains in training and evaluation with the same base model.
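The reward described for the RAG task, a weighted sum of a format score and an answer F1 score, can be sketched as below. The weights and the format check are assumptions for illustration, not values from the paper.

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Standard token-level F1 between predicted and gold answers."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    if not pred or not ref:
        return 0.0
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def rag_reward(output: str, gold_answer: str, w_format: float = 0.1, w_f1: float = 0.9) -> float:
    """Weighted sum of a simple format check and answer F1.
    The actual weights and format criterion in the paper may differ."""
    # Assumed format convention: the agent must emit an 'Answer:' line.
    format_ok = 1.0 if "answer:" in output.lower() else 0.0
    answer = output.split(":", 1)[-1].strip() if format_ok else output
    return w_format * format_ok + w_f1 * token_f1(answer, gold_answer)

print(rag_reward("Answer: the Treaty of Versailles", "Treaty of Versailles"))
```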


For math question answering with tool use, the agent is built with AutoGen and calls a calculator tool. The dataset is Calc-X. The base model is again Llama 3.2 3B Instruct. Training improves the agent's ability to invoke the tool correctly and integrate the results into final answers.
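A hedged sketch of the tool-use pattern in the math task: the agent emits an arithmetic expression, the runtime evaluates it with a calculator tool, and the final numeric answer is compared with the gold answer to produce a reward. The call format, tolerance, and reward shape are assumptions.

```python
import ast
import operator

# Safe evaluator for the simple arithmetic expressions an agent might emit.
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv, ast.Pow: operator.pow}

def calculator(expression: str) -> float:
    """Evaluate +, -, *, /, ** over numbers without using eval()."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        raise ValueError(f"unsupported expression: {expression}")
    return walk(ast.parse(expression, mode="eval"))

def answer_reward(predicted: float, gold: float, tol: float = 1e-6) -> float:
    """Binary reward for numeric agreement with the gold answer."""
    return 1.0 if abs(predicted - gold) <= tol else 0.0

# Example: the agent asks the tool for "(17 + 25) * 3" and answers 126.
result = calculator("(17 + 25) * 3")
print(result, answer_reward(result, gold=126))
```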


Key takeaways
- Agent Lightning uses Training Agent Disaggregation and a unified tracing interface, so existing agents built with LangChain, OpenAI Agents SDK, AutoGen, or CrewAI connect with near zero code changes.
- LightningRL converts trajectories into transitions. It applies credit assignment to multi-step episodes, then optimizes the policy with single-turn RL methods such as PPO or GRPO in standard trainers.
- Automatic Intermediate Rewarding (AIR) provides dense feedback. AIR converts system signals such as tool return status into intermediate rewards to reduce the sparse-reward problem in long-horizon tasks; see the sketch after this list.
- The study evaluates text to SQL on Spider, RAG on MuSiQue with a Wikipedia-scale index using BGE embeddings, and math tool use on Calc-X, all with Llama 3.2 3B Instruct as the base model.
- Runtime traces are captured with OpenTelemetry, streamed to the training server, and served back through an OpenAI-compatible endpoint with updated models, enabling scalable rollouts without moving tools.
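To illustrate the AIR idea referenced in the list above, here is a minimal sketch that maps tool return statuses observed in a trace to small intermediate rewards while keeping the terminal reward for the final answer. The status names and reward magnitudes are assumptions, not values from the paper.

```python
from typing import List, Tuple

# Assumed mapping from tool/system signals to small intermediate rewards.
INTERMEDIATE_REWARDS = {
    "tool_ok": 0.1,        # tool call returned successfully
    "tool_error": -0.1,    # tool call raised or returned an error
    "format_error": -0.2,  # agent output could not be parsed
}

def densify_rewards(events: List[str], terminal_reward: float) -> List[Tuple[str, float]]:
    """Turn a sparse episode (reward only at the end) into a denser one by
    rewarding intermediate system signals. Illustrative only."""
    dense = [(e, INTERMEDIATE_REWARDS.get(e, 0.0)) for e in events]
    dense.append(("terminal", terminal_reward))
    return dense

episode_events = ["tool_error", "tool_ok", "tool_ok"]
for step, r in densify_rewards(episode_events, terminal_reward=1.0):
    print(f"{step:12s} reward={r:+.1f}")
```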
Agent Lightning is a practical bridge between agent execution and reinforcement learning, not another framework rewrite. It formalizes agent runs as a Markov decision process (MDP), introduces LightningRL for credit assignment, and emits clean transitions for single-turn RL trainers. The Training Agent Disaggregation design separates the client that runs the agent from the training server and uses an OpenAI-compatible endpoint, so teams keep their existing stacks. Automatic Intermediate Rewarding converts runtime signals into dense feedback, reducing the sparse-reward problem in long workflows. Overall, Agent Lightning is a clean, minimally invasive path to agents that learn from their own traces.
Check out the paper and the GitHub repo for tutorials, code, and notebooks.

Michal Sutter is a data scientist with a Master of Science in Data Science from the University of Padova. With a strong foundation in statistical analysis, machine learning, and data engineering, Michal excels at turning complex data into actionable insights.



