Google AI introduces Supervised Reinforcement Learning (SRL): a step-wise framework with expert trajectories to teach small language models to tackle hard problems

How can a small model learn to solve tasks it currently fails at, without rote imitation or reliance on correct rollouts? A team of researchers from Google Cloud AI Research and UCLA has released Supervised Reinforcement Learning (SRL), a training framework that lets 7B-scale models learn from hard math and agentic trajectories that standard supervised fine-tuning (SFT) and RL with verifiable rewards (RLVR) cannot learn from.
Small open-source models such as Qwen2.5-7B-Instruct fail on the hardest problems in s1K-1.1 even when correct reasoning traces exist. With outcome-only rewards (RLVR), the model almost never samples a fully correct solution, so there is nothing to reinforce. With supervised fine-tuning on the full DeepSeek-R1-style solutions, the model imitates the traces token by token; the sequences are very long, the dataset holds only about 1,000 examples, and the final scores fall below the base model.
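To make the RLVR failure mode concrete, here is a minimal sketch, assuming an outcome-only verifier and a GRPO-style group-normalized advantage (the function names and rollouts are illustrative, not the paper's code): when every sampled rollout of a hard problem is wrong, all rewards are zero, so all advantages are zero and there is no gradient signal to learn from.

```python
# Minimal sketch (not the paper's code): a GRPO-style group advantage with an
# outcome-only verifiable reward. `outcome_reward` and the rollouts below are
# hypothetical placeholders.

def outcome_reward(rollout_answer: str, reference_answer: str) -> float:
    # RLVR-style reward: 1 if the final answer matches the reference, else 0.
    return 1.0 if rollout_answer.strip() == reference_answer.strip() else 0.0

def group_advantages(rewards: list[float]) -> list[float]:
    # GRPO normalizes each rollout's reward against the group mean and std.
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + 1e-6) for r in rewards]

# On a problem the 7B model cannot solve, every sampled rollout is wrong,
# so all rewards are 0, all advantages are 0, and the policy gets no update.
rewards = [outcome_reward(ans, "42") for ans in ["17", "-3", "x + 1", "40"]]
print(group_advantages(rewards))  # [0.0, 0.0, 0.0, 0.0]
```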

The core idea of Supervised Reinforcement Learning (SRL)
Supervised Reinforcement Learning (SRL) keeps the GRPO-style RL optimization used in RLVR, but moves the supervision into the reward channel instead of the loss. Each expert trajectory from s1K-1.1 is decomposed into a sequence of actions. For each prefix of that sequence, the research team creates a new training example: the model first generates a private inner monologue, then outputs the action for that step, and only this action is compared with the expert's action using a sequence-similarity metric based on difflib. The reward is dense because every step counts, even if the final answer is wrong. The inner monologue itself is not supervised, so the model can search its own reasoning chain without being forced to copy teacher tokens.
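A minimal sketch of this construction, assuming an expert trajectory is represented as a list of action strings and the model wraps its monologue in a <think> span (the exact prompt format and reward shaping in the paper may differ):

```python
import difflib

# Hedged sketch of SRL data construction and the step-wise reward, under
# simplifying assumptions; not the paper's actual code.

def build_step_examples(problem: str, expert_actions: list[str]) -> list[dict]:
    """Expand one expert trajectory into one training example per step.

    The prompt at step t contains the problem plus the expert's earlier
    actions; the target is only the expert action for step t.
    """
    examples = []
    for t in range(len(expert_actions)):
        examples.append({
            "prompt": problem + "\n" + "\n".join(expert_actions[:t]),
            "expert_action": expert_actions[t],
        })
    return examples

def action_reward(model_output: str, expert_action: str) -> float:
    """Reward only the action, not the private monologue before it.

    Assumes the model emits '<think> ... </think>' followed by its action;
    the reward is a difflib sequence-similarity ratio in [0, 1].
    """
    action = model_output.split("</think>")[-1].strip()
    return difflib.SequenceMatcher(None, action, expert_action).ratio()

steps = build_step_examples("Solve 3x + 5 = 20.", ["3x = 15", "x = 5"])
print(action_reward("<think>isolate x</think> 3x = 15", steps[0]["expert_action"]))
```

Because each step yields a similarity score in [0, 1], the model receives a graded signal on partially correct work instead of an all-or-nothing outcome reward.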
Mathematical Results
All models start from Qwen2.5-7B-Instruct, and all are trained on the same s1K-1.1 data reformatted into DeepSeek-R1-style traces, for a clean comparison. The exact numbers in Table 1 are:
- Base Qwen2.5-7B-Instruct: AMC23 greedy 50.0, AIME24 greedy 13.3, AIME25 greedy 6.7.
- SRL: AMC23 greedy 50.0, AIME24 greedy 16.7, AIME25 greedy 13.3.
- SRL then RLVR: AMC23 greedy 57.5, AIME24 greedy 20.0, AIME25 greedy 10.0.


The key point: SRL alone already removes the degradation seen with SFT and improves AIME24 and AIME25, and when RLVR is run after SRL, the system reaches the best open-model scores in the study. The research team is explicit that the strongest pipeline is SRL followed by RLVR, not SRL in isolation.
Software Engineering Results
The research team applies SRL to Qwen2.5-Coder-7B-Instruct, using 5,000 verified agent trajectories generated by Claude 3.7 Sonnet and decomposing them into 134,000 step-wise training instances. Evaluation is on SWE-Bench Verified. The base model scores 5.8 percent in oracle file-edit mode and 3.2 percent end-to-end. SWE-Gym-7B gets 8.4 percent and 4.2 percent. SRL reaches 14.8 percent and 8.6 percent, roughly 2x the base model and well above the SFT baseline.
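To illustrate how 5,000 trajectories can expand into roughly 134,000 step-wise instances, here is a hedged sketch that represents an agent trajectory as a list of observation/action turns; the actual agent scaffold, serialization, and filtering in the paper are more involved.

```python
# Hedged sketch: expanding verified agent trajectories into step-wise SRL
# instances. The trajectory schema below is illustrative, not the paper's.

from typing import TypedDict

class Turn(TypedDict):
    observation: str   # e.g. repository state, test output, file contents
    action: str        # e.g. a shell command or file edit issued by the expert

def expand_trajectory(issue: str, turns: list[Turn]) -> list[dict]:
    """Create one training instance per expert action, conditioned on the history so far."""
    instances = []
    history = ""
    for turn in turns:
        instances.append({
            "prompt": issue + "\n" + history + turn["observation"],
            "expert_action": turn["action"],
        })
        history += turn["observation"] + "\n" + turn["action"] + "\n"
    return instances

# At ~27 expert actions per trajectory, 5,000 trajectories yield ~134,000 instances.
demo = expand_trajectory(
    "Fix the failing test in utils.py",
    [{"observation": "$ pytest -> 1 failed", "action": "open utils.py"},
     {"observation": "def add(a, b): return a - b", "action": "edit: return a + b"}],
)
print(len(demo))  # 2
```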


Key Takeaways
- SRL reformulates hard reasoning as step-wise action generation: the model first produces an internal monologue and then outputs a single action, and it receives a dense, step-level reward even when the final answer is wrong.
- SRL is trained on the same DeepSeek-R1-formatted s1K-1.1 data as SFT and RLVR, but unlike SFT it does not overfit long demonstrations, and unlike RLVR it does not stall when no correct rollout is ever sampled.
- On math, the exact recipe that gives the strongest results in the paper is to first train Qwen2.5-7B-Instruct with SRL, then apply RLVR, which pushes the benchmark scores higher than either method alone.
- The same SRL recipe transfers to agentic software engineering, using 5,000 verified trajectories from Claude 3.7 Sonnet (20250219), and on SWE-Bench Verified it clearly outperforms both the Qwen2.5-Coder-7B-Instruct base and SWE-Gym-7B.
- Compared to process-reward RL methods that require a separate reward model, SRL keeps the GRPO-style objective and only needs the actions from expert trajectories plus lightweight string matching, so it is practical to run on small, hard datasets.
Supervised Reinforcement Learning (SRL) is a practical contribution from the research team. It keeps the GRPO-style training setup but replaces sparse outcome rewards with supervised, step-wise rewards computed directly from expert trajectories, so the model always receives an informative signal, even in the hard regime where RLVR and SFT both stall. It also matters that the research team demonstrates SRL on both math reasoning and SWE-Bench Verified with the same recipe, and that the strongest configuration is SRL followed by RLVR, not either alone. Overall, SRL is a clean bridge between process supervision and RL that open-model groups can adopt quickly for complex tasks.
Check out the Paper for full details.



