QeRL: NVFP4-Quantized Reinforcement Learning (RL) Brings 32B LLM Training to a Single H100, While Improving Exploration

What could you build if you could run reinforcement learning (RL) training on a 32B LLM with its weights in 4-bit NVFP4, on a single H100, at BF16-level accuracy and with 1.2-1.5× faster training steps? NVIDIA researchers (with collaborators from MIT, HKU, and Tsinghua) have open-sourced QeRL (Quantization-enhanced Reinforcement Learning), a training framework that runs RL post-training with the policy quantized to 4-bit FP4 (NVFP4) while keeping gradient computation in higher precision via LoRA. The research team reports >1.5× speedups in the rollout phase, ~1.8× end-to-end speedup vs. QLoRA in one setting, and the first demonstration of RL training of a 32B policy on a single H100-80GB GPU.

What does QeRL change in the reinforcement learning (RL) loop?
Most RLHF/GRPO/DAPO pipelines spend the bulk of their wall-clock time in rollouts (token generation). QeRL quantizes the policy weights to NVFP4 (FP4) with dual-level scaling and keeps logits/gradients in higher precision via LoRA, so backpropagation stays stable while the sampling path runs on hardware-efficient FP4×BF16 kernels (Marlin). The result is faster prefill and decode during rollouts without maintaining a separate full-precision differentiable policy.
Mechanically, the research team integrates Marlin-based FP4 kernels into both rollout and prefill, and LoRA confines the trainable parameters to a small adapter. This targets exactly the phase that dominates RL cost and long-sequence generation latency.
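To make the data flow concrete, here is a minimal PyTorch-style sketch of a weight-only 4-bit linear layer paired with a LoRA adapter. This is not QeRL's implementation: the class name, block size, and toy round-to-nearest quantization are illustrative assumptions, and the real system replaces the dequantize-then-matmul step with fused Marlin FP4×BF16 kernels.

```python
import torch
import torch.nn as nn

# FP4 (E2M1) representable values: a symmetric 15-point grid.
_FP4_GRID = torch.tensor([-6., -4., -3., -2., -1.5, -1., -.5,
                          0., .5, 1., 1.5, 2., 3., 4., 6.])

class FP4LinearWithLoRA(nn.Module):
    """Toy sketch: frozen 4-bit-quantized base weight + trainable LoRA adapter.

    QeRL runs the base matmul through fused FP4 x BF16 kernels (Marlin); this
    sketch simply dequantizes before the matmul to show the data flow.
    """
    def __init__(self, in_features, out_features, rank=16, block=16):
        super().__init__()
        self.block = block
        w = torch.randn(out_features, in_features)              # stand-in pretrained weight
        codes, scales = self._quantize_fp4(w, block)
        self.register_buffer("codes", codes)                    # frozen 4-bit code indices
        self.register_buffer("scales", scales)                  # per-block absmax scales
        # Only the low-rank adapter receives gradients during RL updates.
        self.lora_A = nn.Parameter(torch.randn(rank, in_features) * 0.02)
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))

    @staticmethod
    def _quantize_fp4(w, block):
        wb = w.reshape(-1, block)
        scales = wb.abs().amax(dim=1, keepdim=True).clamp_min(1e-8) / 6.0  # block max -> 6.0
        codes = (wb / scales).unsqueeze(-1).sub(_FP4_GRID).abs().argmin(-1)
        return codes.to(torch.uint8).reshape(w.shape), scales

    def _dequantize(self):
        wb = _FP4_GRID[self.codes.long().reshape(-1, self.block)] * self.scales
        return wb.reshape(self.codes.shape)

    def forward(self, x):
        base = x @ self._dequantize().t()                       # frozen quantized path (no grad)
        lora = (x @ self.lora_A.t()) @ self.lora_B.t()          # trainable high-precision path
        return base + lora

# Rollouts and gradient steps share the same 4-bit base; only lora_A / lora_B are updated.
layer = FP4LinearWithLoRA(in_features=128, out_features=64)
out = layer(torch.randn(4, 128))
```

Because the base weight is frozen and stored once in 4-bit form, the same tensor serves both the sampler and the trainer, which is what removes the usual duplication between a quantized inference copy and a full-precision training copy.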


Quantization noise as exploration, made adaptive
Core empirical finding: FP4 quantization raises the policy's entropy, flattening the token distribution early in training and improving exploration versus 16-bit LoRA and NF4-based QLoRA baselines. To control that effect over time, QeRL introduces Adaptive Quantization Noise (AQN): channel-wise Gaussian perturbations folded into the LayerNorm scale parameters and annealed on an exponential schedule. This keeps kernel fusion intact (no extra weight tensors) while shifting the policy from exploration toward exploitation.
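The following is a short sketch of how such a schedule and noise injection could look in PyTorch. The schedule constants and the multiplicative form of the noise are assumptions layered on the paper's description (exponentially decayed, channel-wise Gaussian noise folded into the norm's scale vector), not its actual code.

```python
import torch

def aqn_sigma(step, total_steps, sigma_start=1e-2, sigma_end=1e-4):
    # Exponentially decaying noise scale: sigma_start at step 0, sigma_end at the last step.
    ratio = sigma_end / sigma_start
    return sigma_start * ratio ** (step / max(total_steps - 1, 1))

@torch.no_grad()
def aqn_noisy_gain(rmsnorm_weight, sigma):
    """Fold channel-wise Gaussian noise into an RMSNorm/LayerNorm scale vector.

    The norm gain multiplies activations element-wise right before the next
    (quantized) linear layer, so perturbing the gain perturbs that layer's
    input channels without adding any extra weight tensors, leaving fused
    FP4 kernels untouched. Multiplicative form assumed for illustration.
    """
    noise = torch.randn_like(rmsnorm_weight) * sigma
    return rmsnorm_weight * (1.0 + noise)

# Example: refresh the noise at each RL step with the scheduled sigma.
gain = torch.ones(4096)                          # stand-in RMSNorm scale vector
for step in (0, 250, 500, 999):
    sigma = aqn_sigma(step, total_steps=1000)
    noisy = aqn_noisy_gain(gain, sigma)
    print(f"step={step:4d}  sigma={sigma:.2e}  gain std={noisy.std().item():.2e}")
```

The key design point is that the perturbation lives inside an existing parameter, so the noise level can be re-sampled and decayed on a schedule without touching the quantized weights or the kernels that read them.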


In the reported training curves, QeRL shows faster reward growth and higher final scores on mathematical reasoning tasks under both GRPO and DAPO, consistent with the hypothesis that structured noise in parameter space can be a useful driver of exploration in RL, even though such noise is usually considered harmful in supervised optimization.
Reported results
Across Qwen2.5 backbone models, the research team shows that NVFP4 + LoRA outperforms vanilla LoRA and QLoRA in rollout throughput throughout training, with >2× rollout throughput over QLoRA on 14B/32B models and the ~1.8× end-to-end speedup vs. QLoRA noted above. They also demonstrate training of a 32B policy with GRPO on a single H100-80GB, enabled by the reduced memory footprint of weight-only FP4.
Accuracy is competitive with higher-precision baselines. On the 7B model, the research team reports GSM8K = 90.8% and MATH 500 = 77.4%, surpassing 16-bit LoRA and QLoRA under their setup and matching full-parameter fine-tuning. Across broader math benchmarks (e.g., BigMath), QeRL maintains parity or an advantage, while converging faster thanks to improved exploration.


Where does this apply, and where doesn't it?
QeRL is weight-only FP4 with LoRA updates; it does not claim FP4 precision for logits or gradients. The benefits concentrate in rollout/prefill throughput and memory footprint, with evidence that quantization-induced entropy aids RL exploration when AQN modulates it during training. Generalization to modalities beyond math-reasoning tasks, or to safety and tool-use RL, depends on reward structure and sequence length.
Key takeaways
- QeRL quantizes policy weights to 4-bit NVFP4 for rollouts while keeping gradient computation in higher precision via LoRA.
- Quantization noise acts as exploration: FP4 raises policy entropy, and Adaptive Quantization Noise (AQN) injects scheduled channel-wise noise through the LayerNorm scale parameters.
- Reported efficiency: >1.5× rollout speedups vs. 16-bit LoRA and ~1.8× end-to-end speedup vs. QLoRA; >2× rollout throughput vs. QLoRA on 14B/32B setups.
- Accuracy holds: Qwen2.5-7B reaches 90.8% on GSM8K and 77.4% on MATH 500, matching full-parameter fine-tuning under the paper's setup.
- NVFP4 is a hardware-optimized 4-bit floating-point format with two-level scaling (see the sketch after this list).
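For readers unfamiliar with the format, the sketch below illustrates how the two scale levels interact, based on the public NVFP4 description rather than QeRL code: values are grouped into 16-element blocks, each block gets an FP8 (E4M3) scale, and one FP32 per-tensor scale keeps the block scales within FP8 range. Rounding of the values themselves to the 4-bit E2M1 grid is omitted, and a PyTorch version with the float8_e4m3fn dtype is assumed.

```python
import torch

FP4_MAX = 6.0          # largest E2M1 magnitude
FP8_E4M3_MAX = 448.0   # largest finite E4M3 value

def nvfp4_two_level_scales(x, block=16):
    """Compute NVFP4-style two-level scales: one FP8 scale per 16-value block,
    plus one FP32 scale per tensor. Illustrative only."""
    xb = x.reshape(-1, block)
    # Per-tensor FP32 scale keeps every block scale within the FP8 E4M3 range.
    tensor_scale = x.abs().max().clamp_min(1e-12) / (FP4_MAX * FP8_E4M3_MAX)
    # Per-block scales, stored in FP8 E4M3 (rounded by the dtype cast).
    block_scales = xb.abs().amax(dim=1) / (FP4_MAX * tensor_scale)
    block_scales = block_scales.to(torch.float8_e4m3fn).float()
    # Effective dequantization scale for block i is tensor_scale * block_scales[i];
    # each stored 4-bit code c reconstructs as c * that effective scale.
    return tensor_scale, block_scales

t_scale, b_scales = nvfp4_two_level_scales(torch.randn(2, 64))
print(t_scale.item(), b_scales.shape)   # one tensor scale, one FP8 scale per 16-element block
```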
QeRL accelerates the RL rollout phase by quantizing the policy weights to NVFP4 while keeping updates and gradients in higher precision via LoRA. It reports >1.5× rollout speedups and can train a 32B policy on a single H100-80GB GPU. It adds Adaptive Quantization Noise to perform controlled exploration during training. Results are shown mainly on math-reasoning tasks using GRPO and DAPO. The gains depend on NVFP4 kernel support such as Marlin.
Check out the Paper and the GitHub page for the full code, tutorials, and notebooks.



