Moonshot AI Researchers Introduce Seer: An Online Context Learning System for Fast Synchronous RL Rollouts

How do you keep reinforcement learning for large reasoning models from stalling on a few slow rollouts while GPUs sit idle? A team of researchers from Moonshot AI and Tsinghua University introduces Seer, a new online context learning system that targets a specific bottleneck in synchronous reinforcement learning for large language models. In synchronous RL setups, the rollout phase dominates the cost of each iteration. By reorganizing this phase, Seer reports rollout throughput gains of 74 percent to 97 percent and tail latency reductions of 75 percent to 93 percent compared with a state-of-the-art synchronous RL system, veRL.

Why are synchronous rollouts slow for reasoning models?
Today's RL pipelines for reasoning models rely on long chain-of-thought style generation. In the paper, the researchers run GRPO on three tasks built around different models: Moonlight, Qwen2-VL-72B and Kimi-K2. These tasks run on a cluster of 32 nodes with 8 H800 GPUs per node. The three tasks use 32, 128 and 256 GPUs respectively, with 400, 600 and 800 prompts per iteration and 8 or 16 responses per prompt.
The maximum generation length is large. Moonlight is configured for 65,536 tokens, Qwen2-VL-72B for 40,960 tokens and Kimi-K2 for 98,304 tokens. A single long chain-of-thought request can grow from a few hundred megabytes of KVCache to tens of gigabytes as generation progresses. This memory growth forces instances to lower concurrency or preempt requests, which triggers expensive recomputation.
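The growth from hundreds of megabytes to tens of gigabytes follows from standard KVCache arithmetic. A minimal sketch, where the layer count, KV head count and head dimension are illustrative placeholders rather than figures from the paper:

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Per-request KV cache size: keys + values for every layer, fp16/bf16 = 2 bytes."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Hypothetical dense-model dimensions, not taken from the paper.
short = kv_cache_bytes(seq_len=2_048, n_layers=60, n_kv_heads=8, head_dim=128)
long = kv_cache_bytes(seq_len=98_304, n_layers=60, n_kv_heads=8, head_dim=128)
print(f"{short / 1e9:.2f} GB -> {long / 1e9:.2f} GB")  # 0.50 GB -> 24.16 GB
```

At these assumed dimensions, a request at Kimi-K2's 98,304-token limit holds roughly 48 times the KVCache of a 2,048-token request, which is why a handful of long requests can exhaust an instance's memory budget.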
The research team defines tail latency as the time spent on the last 10 percent of requests to complete in a rollout. For Moonlight and Qwen2-VL-72B, this tail alone can consume 50 percent of total rollout time in the baseline system. Since the rollout phase already dominates the iteration, this tail effect directly slows down RL training.
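This tail metric can be made concrete with a small sketch; the 90/10 split below is a toy example, not data from the paper:

```python
def tail_time_fraction(finish_times, tail_frac=0.10):
    """Fraction of total rollout time spent waiting only on the last `tail_frac` of requests."""
    done = sorted(finish_times)
    cutoff = done[int(len(done) * (1 - tail_frac)) - 1]  # moment when 90% of requests are done
    return (done[-1] - cutoff) / done[-1]

# Toy rollout: 90 requests finish by t=100, the last 10 straggle until t=200.
times = [100.0] * 90 + [200.0] * 10
print(tail_time_fraction(times))  # 0.5, i.e. the stragglers double the rollout
```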


Seer architecture on top of Mooncake and vLLM
Seer keeps the RL algorithm identical to veRL. Each training iteration uses only data from the current rollout iteration, so the system preserves on-policy behavior. The training phase uses Megatron for distributed training. The rollout phase uses vLLM as the inference engine.
To support aggressive request scheduling, Seer relies on a Global KVCache Pool built on Mooncake, the KVCache architecture used in production for Kimi. Mooncake provides a shared, tiered DRAM and SSD KV store across inference nodes, which allows requests to migrate between instances without recomputing their prefixes.
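As a mental model, the pool behaves like a shared store keyed by request, so KVCache written by one instance can be read by another. Below is a minimal in-memory stand-in; the real Mooncake system is a tiered DRAM and SSD store with a far richer interface, and `GlobalKVCachePool`, `put` and `get` are invented names for illustration:

```python
class GlobalKVCachePool:
    """Toy stand-in for a Mooncake-style shared KV store across inference nodes."""

    def __init__(self):
        self._store = {}  # request_id -> accumulated KV blocks

    def put(self, request_id, kv_blocks):
        """Append KV blocks produced for a request (e.g. by the instance running it)."""
        self._store.setdefault(request_id, []).extend(kv_blocks)

    def get(self, request_id):
        """Fetch everything cached for a request, from any instance."""
        return self._store.get(request_id, [])

pool = GlobalKVCachePool()
pool.put("req-7", ["kv(prompt)", "kv(chunk0)"])  # written while instance A ran the request
resumed = pool.get("req-7")                      # read by instance B, no prefix recomputation
print(resumed)
```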
On top of this substrate, Seer introduces three main techniques:
- Divided rollout
- Context-aware scheduling
- Adaptive grouped speculative decoding
These operate through a scheduler with a request buffer, a context manager and inference engines connected to the Global KVCache Pool.


Divided rollout, fine-grained scheduling and migration
A conventional synchronous rollout assigns whole GRPO groups to inference instances. A group is a set of requests that share a single prompt. Once assigned, a group stays on the same instance until all of its responses finish. Because output lengths vary widely, this leads to load imbalance and long stragglers.
Seer breaks groups down in two steps. First, it decomposes each group into individual requests. Then, it divides each request into multiple chunks based on the generation configuration. When the scheduler dispatches a request from the request buffer, it sets a small max token budget, such as 8,000 tokens, for that chunk. After each chunk, the request re-enters the buffer until it reaches an end-of-sequence token or its max tokens limit.
Because KVCache lives in the Global KVCache Pool, divided requests can migrate between instances at chunk boundaries without recomputing from scratch. The scheduler maintains a concurrency level that keeps memory utilization high while avoiding preemption. This reduces wasted work and smooths KVCache usage across each iteration.
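The chunked dispatch loop can be sketched as follows. This is a toy simulation rather than Seer's scheduler: `target_lengths` stands in for output lengths a real system would not know in advance, and the 8,000-token budget is the example value from the paper:

```python
from collections import deque

CHUNK_TOKENS = 8_000  # per-chunk token budget, the paper's example value

def divided_rollout(target_lengths, chunk_tokens=CHUNK_TOKENS):
    """Run every request chunk by chunk through a shared buffer.

    Returns how many chunks each request needed; each dequeue models a free
    instance picking up one chunk, possibly a different instance each time.
    """
    buffer = deque((i, 0) for i in range(len(target_lengths)))  # (request_id, tokens so far)
    chunks = [0] * len(target_lengths)
    while buffer:
        rid, done = buffer.popleft()
        done += min(chunk_tokens, target_lengths[rid] - done)  # generate one chunk
        chunks[rid] += 1
        if done < target_lengths[rid]:
            buffer.append((rid, done))  # back to the buffer; may migrate via the KV pool
    return chunks

print(divided_rollout([5_000, 20_000, 65_536]))  # [1, 3, 9]
```

Because no request holds an instance for longer than one chunk, a 65,536-token straggler interleaves with short requests instead of pinning a whole group to one machine.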
Context-aware scheduling using group length statistics
The research team observes that requests from the same group tend to have similar output lengths. Seer exploits this property as online context. For each prompt group, it designates one request as a speculative request. The scheduler keeps speculative requests in a priority queue and serves them with a smallest-first policy based on the tokens generated so far. Short requests finish quickly and exit. Long requests remain and reveal which groups in the queue are likely to be long.
The context manager maintains a length estimate for each group. It updates this estimate with the maximum generated length among the completed requests in the group. If no request has completed, it uses the original max tokens as the bound. Once speculative requests are in flight or finished, Seer schedules each group's remaining requests with an estimated-longest-first policy. This design approaches the throughput and tail behavior of an omniscient oracle schedule that knows output lengths ahead of time.
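A minimal sketch of the two-phase idea, assuming one speculative probe per group and an estimated-longest-first pass for the rest; all names here are invented, and true lengths are passed in only to simulate which probes finish:

```python
def schedule_order(group_lengths, max_tokens):
    """Toy two-phase schedule over groups of requests.

    group_lengths: group_id -> list of true output lengths per request. A real
    scheduler never sees these; they only simulate probe completion order.
    """
    # Phase 1: one speculative probe per group (request 0), smallest-first.
    probes = sorted(group_lengths, key=lambda g: group_lengths[g][0])
    # Each finished probe sets the group's length estimate; a group with no
    # completed probe would fall back to max_tokens as its bound.
    estimates = {g: group_lengths[g][0] for g in probes}
    # Phase 2: remaining requests, estimated-longest group first.
    order = []
    for g in sorted(group_lengths, key=lambda g: -estimates.get(g, max_tokens)):
        order.extend((g, i) for i in range(1, len(group_lengths[g])))
    return probes, order

print(schedule_order({"a": [100, 120], "b": [900, 950, 980]}, max_tokens=4_096))
```

Launching the long group "b" first means its slow requests overlap with the short ones instead of forming the tail at the end of the rollout.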


Adaptive grouped speculative decoding
To further speed up decoding, especially for long tail requests, Seer adds adaptive grouped speculative decoding on top of the previous two techniques. It introduces a Distributed Grouped Draft Server, or DGDS. DGDS maintains a compressed suffix tree for each group and aggregates the token sequences of all requests in that group. Instances asynchronously append generated tokens to DGDS, periodically fetch updated suffix trees, and perform local drafting based on the shared pattern statistics.
The system adapts the draft length and the number of draft paths based on the model, the batch size and the measured acceptance length. It profiles different speculation settings in advance and applies them to tune the draft depth for each batch. In the late tail phase, concurrency is low, so Seer increases the draft depth and uses multi-path drafting to get more accepted tokens per step.
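To illustrate why shared group sequences make good draft material, here is a toy version of grouped drafting that uses a linear scan in place of DGDS's compressed suffix tree; the function and parameter names are illustrative, not from the paper:

```python
def draft_from_group(prefix, group_sequences, draft_len=4, match_len=3):
    """Toy grouped drafting: find this request's recent suffix inside a sibling
    response from the same group and propose the tokens that followed it there."""
    suffix = prefix[-match_len:]
    for seq in group_sequences:
        for i in range(len(seq) - match_len):
            if seq[i:i + match_len] == suffix:
                return seq[i + match_len:i + match_len + draft_len]
    return []  # no shared pattern found, fall back to normal decoding

# Responses to the same prompt often repeat phrasing, so a sibling's
# continuation is a plausible draft for this request's next tokens.
sibling = [5, 6, 7, 8, 9, 10, 11]
print(draft_from_group([1, 2, 7, 8, 9], [sibling]))  # [10, 11]
```

The drafted tokens are then verified by the target model in one forward pass, as in standard speculative decoding; only the draft source (sibling responses sharing a prompt) is specific to the grouped setting.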
Ablation results show that divided rollout alone yields a 35 percent improvement over the baseline. Adding context-aware scheduling increases this to 47 percent over the baseline. Enabling grouped speculative decoding raises the overall speedup to 77 percent to 87 percent over the baseline on the tested iterations.
End-to-end impact on RL training
The research team evaluates Seer on three RL workloads built on Moonlight, Qwen2-VL-72B and Kimi-K2. They run 10 rollout iterations for each workload and measure output tokens per second and rollout completion time. Seer improves rollout throughput by 74 percent to 97 percent across these workloads relative to veRL with the same RL algorithm and a vLLM-based inference engine.
Tail latency is reduced by 75 percent to 93 percent. For memory-constrained workloads, the baseline system spends up to half of its time on the last 10 percent of requests. Seer removes most of this tail by combining divided rollout, context-aware scheduling and adaptive grouped speculative decoding on top of the Mooncake-based Global KVCache Pool.
Key takeaways
- Rollout bottleneck: Seer targets the synchronous RL rollout phase, which accounts for about 63% to 87% of iteration time and is dominated by long tail requests and KVCache fragmentation.
- Three key techniques: Seer combines divided rollout, context-aware scheduling and adaptive grouped speculative decoding to exploit the output length and pattern similarity among GRPO responses that share a prompt.
- Fine-grained scheduling over a global KVCache: Requests are split into chunks and migrated through a Mooncake-style Global KVCache Pool, which preserves on-policy synchronization while keeping GPU memory utilization high and reducing preemption.
- Online context for tail reduction: Group length statistics from speculative requests drive context-aware scheduling that approaches an oracle length-aware schedule and sharply reduces the time spent on the last 10 percent of requests.
- Consistent end-to-end gains: On production-grade RL tasks with Moonlight, Qwen2-VL-72B and Kimi-K2, Seer improves rollout throughput by 74% to 97% and cuts tail latency by 75% to 93% relative to a state-of-the-art synchronous vLLM-based baseline.
Seer is an important systems contribution because it accelerates the rollout phase of synchronous RL without changing the underlying GRPO algorithm, so it preserves on-policy guarantees and existing infrastructure while fixing the real bottleneck. The combination of divided rollout, context-aware scheduling and adaptive grouped speculative decoding provides a concrete template for other RL stacks that rely on long reasoning traces and large KVCache footprints. Overall, this work shows that online context learning at the systems level is now as important as model-level changes for scaling RL efficiently.
Check out the Paper.
Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for social good. His most recent endeavor is the launch of an AI media platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts over 2 million monthly views, illustrating its popularity among audiences.



