
Sigmoidal Scaling Curves for Reinforcement Learning Post-Training of LLMs

Scaling up reinforcement learning (RL) post-training is now a top priority for reasoning-centric LLMs, but unlike pre-training it has lacked predictive scaling rules. Teams pour tens of thousands of GPU-hours into runs without a principled way to estimate whether a recipe will keep improving with more compute. New research from Meta, UT Austin, UCL, Berkeley, Harvard, and Periodic Labs contributes a compute-performance framework, backed by more than 400,000 GPU-hours of experiments, that models RL progress with a sigmoidal curve and provides a tested recipe, ScaleRL, whose extended runs track those predictions up to 100,000 GPU-hours.

It fits a sigmoid, not a power law

Pre-training tends to follow power laws (loss vs. compute). RL post-training, by contrast, is measured with bounded metrics, and the research group shows that sigmoidal fits of mean pass rate vs. training compute are markedly more robust and more stable than power-law fits, especially when the goal is to extrapolate from small runs to large budgets. They exclude the noisy initial regime (roughly the first 1.5k GPU-hours) and fit the subsequent, predictable portion of the curve. The sigmoid's parameters have interpretable roles: one sets the asymptotic performance (the ceiling), another the efficiency/slope, and another the compute midpoint where gains are fastest.

Why that matters: after roughly 1-2k GPU-hours, you can fit the curve and forecast whether pushing to 10k-100k GPU-hours is worth it, before you burn the budget. The research also shows that power-law fits can produce misleading ceilings unless they are fit at very high compute, which defeats the purpose of early prediction.
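As a concrete illustration, here is a minimal sketch (not the authors' code) of fitting a saturating sigmoid of pass rate against compute and extrapolating it to a larger budget. The functional form and parameter names (A for the asymptote, B for the efficiency/slope, C_mid for the compute midpoint) follow the description above; the paper's exact parameterization may differ, and the data arrays below are placeholders to be replaced with your own measured pass rates.

```python
import numpy as np
from scipy.optimize import curve_fit

def sigmoid_pass_rate(compute, A, B, C_mid, R0=0.0):
    """Saturating sigmoid in compute: rises from R0 toward the ceiling A.
    B controls efficiency (steepness); C_mid is the compute at the curve's midpoint."""
    return R0 + (A - R0) / (1.0 + (C_mid / compute) ** B)

# Measured (GPU-hours, mean pass rate) pairs from small runs, AFTER discarding
# the noisy initial regime (~first 1.5k GPU-hours). Values here are placeholders.
compute_hours = np.array([2000, 3000, 4500, 6000, 8000, 12000, 16000], dtype=float)
pass_rate     = np.array([0.32, 0.38, 0.44, 0.48, 0.52, 0.57, 0.60])

# Fit A (ceiling), B (efficiency), C_mid (midpoint); A is bounded to [0, 1]
# because pass rate is a bounded metric.
params, _cov = curve_fit(
    sigmoid_pass_rate, compute_hours, pass_rate,
    p0=[0.7, 1.0, 5000.0], bounds=([0.0, 0.1, 1e3], [1.0, 10.0, 1e6]),
)
A_hat, B_hat, C_mid_hat = params

# Extrapolate: what does the fitted curve predict at 100k GPU-hours?
predicted_at_100k = sigmoid_pass_rate(1e5, A_hat, B_hat, C_mid_hat)
print(f"ceiling A ~ {A_hat:.2f}, efficiency B ~ {B_hat:.2f}, midpoint ~ {C_mid_hat:.0f} GPU-h")
print(f"predicted pass rate at 100k GPU-hours ~ {predicted_at_100k:.2f}")
```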

ScaleRL: a recipe that scales predictably

ScaleRL is not primarily a new algorithm; it is a combination of design choices that produced the most stable, most predictable scaling in the study:

  • Asynchronous pipeline RL (PipelineRL-style, with generators and trainers split across GPUs) for efficient off-policy training.
  • CISPO (a clipped/truncated importance-sampling policy-gradient objective) as the RL loss.
  • FP32 precision at the logits to avoid numerical mismatch between the generator and the trainer.
  • Prompt-level loss averaging and batch-level advantage normalization.
  • Forced interruptions to cap generation length rather than penalizing truncated rollouts.
  • Zero-variance filtering (drop prompts whose completions provide no learning signal).
  • No-positive-resampling (remove prompts with pass rate ≥ 0.9 from later epochs). Both data filters are sketched in code after this list.
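To make the two data-curation pieces concrete, the sketch below shows one plausible way to implement zero-variance filtering and no-positive-resampling over a batch of prompts. The class, function, and field names are illustrative assumptions, not the paper's code; the 0.9 pass-rate threshold is the one quoted above.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class PromptGroup:
    """Rollout results for one prompt: per-completion rewards from the verifier.
    (Illustrative structure; the paper's actual data layout is not specified here.)"""
    prompt_id: str
    rewards: List[float]          # e.g., 0/1 pass signals for each sampled completion
    historical_pass_rate: float   # pass rate observed over earlier epochs

def zero_variance_filter(groups: List[PromptGroup]) -> List[PromptGroup]:
    """Drop prompts whose completions all received the same reward: with identical
    rewards the group-relative advantage is zero, so the prompt contributes no
    learning signal to the policy gradient."""
    return [g for g in groups if len(set(g.rewards)) > 1]

def no_positive_resampling(groups: List[PromptGroup], threshold: float = 0.9) -> List[PromptGroup]:
    """Remove prompts the policy already solves almost always (pass rate >= threshold)
    from later epochs, so compute is spent on prompts that can still teach something."""
    return [g for g in groups if g.historical_pass_rate < threshold]

# Usage: apply both filters before forming the next training batch.
batch = [
    PromptGroup("p1", rewards=[1, 1, 1, 1], historical_pass_rate=0.95),  # dropped by both filters
    PromptGroup("p2", rewards=[0, 1, 0, 1], historical_pass_rate=0.45),  # kept
    PromptGroup("p3", rewards=[0, 0, 0, 0], historical_pass_rate=0.05),  # zero variance -> dropped
]
kept = no_positive_resampling(zero_variance_filter(batch))
print([g.prompt_id for g in kept])   # -> ['p2']
```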

The research team validates each component with leave-one-out (LOO) ablations at 16k GPU-hours and shows that ScaleRL's fitted curves extrapolate reliably from 8k to 16k GPU-hours, then hold at much larger scale, including one extended run to 100k GPU-hours.

Results and generalization

Two key results stand out:

  1. Prediction by extrapolation: for both an 8B dense model and a Llama-4 17B×16 MoE ("Scout"), extended training closely followed the sigmoid extrapolations derived from smaller-compute segments of the run.
  2. Downstream transfer: improvements in pass rate on the in-distribution validation set track downstream benchmarks (e.g., AIME-24), indicating that the compute-performance curve is not an artifact of the evaluation data.

The study also compares fitted curves for common recipes (e.g., DeepSeek (GRPO), Qwen-2.5 (DAPO), Magistral, MiniMax-M1) and reports higher asymptotic performance and better compute efficiency for ScaleRL in its setting.

Which knobs move the ceiling vs. the efficiency?

The framework allows you to isolate design decisions:

  • Ceiling movers (asymptote): scaling model size (e.g., MoE) and longer generation length (up to 32,768 tokens) raise asymptotic performance, even though they can look worse early in training. A larger global batch size can also raise the final asymptote and stabilize training.
  • Efficiency movers: loss aggregation, advantage normalization, data curriculum, and the off-policy pipeline mostly change how fast you approach the ceiling, not the ceiling itself.

In practice, the research team advises fitting the curves early, prioritizing interventions that raise the ceiling, and then tuning the efficiency knobs to approach that ceiling faster at a fixed compute budget.
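One way to operationalize that advice: fit the curve for a baseline and for a variant, then ask whether the intervention mainly moved the asymptote or mainly moved the efficiency terms. The helper below is a hypothetical illustration of that comparison; the parameter names, tolerance, and example numbers are made-up placeholders, not values reported in the paper.

```python
from typing import NamedTuple

class SigmoidFit(NamedTuple):
    A: float      # asymptotic ceiling (best pass rate the recipe approaches)
    B: float      # efficiency / steepness of the approach
    C_mid: float  # compute (GPU-hours) at which half the headroom is realized

def classify_intervention(base: SigmoidFit, variant: SigmoidFit,
                          ceiling_tol: float = 0.01) -> str:
    """Crudely label an intervention as a 'ceiling mover' or an 'efficiency mover'
    by comparing fitted parameters; the tolerance is an arbitrary illustrative choice."""
    if abs(variant.A - base.A) > ceiling_tol:
        return "ceiling mover (changes asymptote A)"
    if variant.B != base.B or variant.C_mid != base.C_mid:
        return "efficiency mover (changes B / C_mid, same ceiling)"
    return "no measurable effect"

# Made-up placeholder fits, for illustration only.
baseline     = SigmoidFit(A=0.61, B=1.8, C_mid=9_000)
bigger_batch = SigmoidFit(A=0.66, B=1.7, C_mid=11_000)   # e.g., larger global batch
curriculum   = SigmoidFit(A=0.61, B=2.4, C_mid=6_500)    # e.g., data-curriculum tweak

print(classify_intervention(baseline, bigger_batch))  # -> ceiling mover
print(classify_intervention(baseline, curriculum))    # -> efficiency mover
```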

Key takeaways

  • The research team models RL post-training progress with sigmoidal compute-performance curves (pass rate vs. log compute), which enable reliable extrapolation, unlike power-law fits on bounded metrics.
  • A best-practice recipe, ScaleRL, combines PipelineRL-style asynchronous generation and training, the CISPO loss, FP32 logits, prompt-level loss aggregation, batch-level advantage normalization, interruption-based length control, zero-variance filtering, and no-positive-resampling.
  • Using this recipe, the research group extrapolated curves fit on smaller runs to extended runs of 100k GPU-hours (8B dense) and ~50k GPU-hours (17B×16 MoE "Scout"), and the larger runs confirmed the predicted curves.
  • Ablations show that some choices move the asymptotic ceiling (A) (e.g., model scale, longer generation length, larger global batch), while others mainly change compute efficiency (B) (e.g., loss aggregation/normalization, curriculum, off-policy pipeline).
  • The framework provides early predictions of whether a run is worth extending, and improvements on the in-distribution validation set track downstream metrics (e.g., AIME-24), supporting external validity.

This work turns RL post-training from trial and error into something closer to predictable engineering. It fits sigmoidal compute-performance curves (pass rate vs. log compute) to forecast returns and decide when to stop or scale up. It also provides a concrete recipe, ScaleRL, which uses PipelineRL-style asynchronous generation/training, the CISPO loss, and FP32 logits, among other choices. The research reports more than 400,000 GPU-hours of experiments and one extended run to 100,000 GPU-hours. The results support a clean separation: some decisions move the asymptote, while others mainly improve compute efficiency. That classification helps teams prioritize ceiling-raising changes before tuning efficiency knobs.


