Qwen Researchers Propose QwenLong-L1: A Reinforcement Learning Framework for Long-Context Reasoning in Large Language Models

While large reasoning models (LRMs) have shown impressive capabilities in short-context reasoning through reinforcement learning (RL), these gains do not generalize well to long-context scenarios. Applications such as multi-document QA, research synthesis, and legal or financial analysis require models to process and reason over sequences on the order of 100K tokens. However, RL optimization in such regimes suffers from slower reward convergence, unstable policy updates caused by fluctuations in KL divergence, and reduced exploration resulting from entropy collapse. These bottlenecks reveal a fundamental gap in moving LRMs from short-context proficiency to long-context generalization.
QwenLong-L1: A Structured RL Framework for Long-Context Adaptation
To address these limitations, the Qwen research team introduces QwenLong-L1, a novel RL framework designed to adapt LRMs to long-context reasoning tasks. The framework is structured into three key stages:
- Warm-up supervised fine-tuning (SFT): Provides a stable initialization for the policy model by training on curated question-context-answer triplets, ensuring basic competence in contextual comprehension and answer extraction.
- Curriculum-guided phased reinforcement learning: Introduces a staged training process with gradually increasing context lengths. This progression lets the model acquire long-context reasoning behaviors incrementally without destabilizing policy updates.
- Difficulty-aware retrospective sampling: Enhances exploration by retaining and reusing hard examples from previous stages, weighted by their difficulty, to encourage deeper reasoning and robustness across diverse inputs.
These stages are complemented by a hybrid reward mechanism that combines rule-based verification with judgments from a compact evaluator model.
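The third stage, which reuses hard examples from earlier stages in proportion to their difficulty, can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: the difficulty score (one minus the mean reward an example earned earlier), the function names, and the way carried-over examples are appended to the next stage are all assumptions made for the sketch.

```python
import random

def difficulty(mean_reward: float) -> float:
    """Assumed difficulty score: a low average reward means a hard example."""
    return 1.0 - mean_reward

def sample_hard_examples(history, k, rng=None):
    """Draw up to k distinct examples from earlier stages, weighted by difficulty.

    `history` maps an example id to its mean reward in [0, 1] from a previous stage.
    Examples that were always solved (difficulty 0) are never drawn.
    """
    rng = rng or random.Random()
    pool = {ex: difficulty(r) for ex, r in history.items() if difficulty(r) > 0}
    chosen = []
    while pool and len(chosen) < k:
        ids = list(pool)
        pick = rng.choices(ids, weights=[pool[i] for i in ids], k=1)[0]
        chosen.append(pick)
        del pool[pick]  # sample without replacement
    return chosen

def build_stage_data(new_examples, history, n_retrospective, rng=None):
    """Next stage = fresh examples plus hard examples carried over from before."""
    return list(new_examples) + sample_hard_examples(history, n_retrospective, rng)
```

Weighting by difficulty rather than keeping only the hardest items means moderately hard examples still reappear occasionally, which keeps the carried-over set varied.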
Technical Design and Methodological Advantages
QwenLong-L1 integrates recent advances in group-relative RL optimization, specifically GRPO and DAPO, to reduce the computational overhead associated with long-context value estimation:
- GRPO estimates the advantage by normalizing rewards within sampled groups, eliminating the need for a separate value network and encouraging diverse generation patterns.
- DAPO incorporates dynamic sampling, overlength penalty shaping, and asymmetric clipping thresholds to prevent entropy collapse and mitigate length bias during training.
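The two ideas above, group-normalized advantages (as in GRPO) and asymmetric clipping with dynamic sampling (as in DAPO), can be illustrated with scalar arithmetic. The sketch below is hedged: the function names are hypothetical, the use of the population standard deviation is an assumption, and the clipping thresholds are illustrative defaults rather than values stated in the article.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own sampled group (GRPO-style).

    No value network is involved: the group mean acts as the baseline.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population standard deviation
    return [(r - mu) / (sigma + eps) for r in rewards]

def clipped_objective(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """Per-token surrogate with asymmetric clipping (DAPO-style).

    A larger upper bound leaves more room to raise the probability of
    low-probability tokens, which helps counter entropy collapse.
    """
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped * advantage)

def keep_group(rewards):
    """Dynamic sampling: skip groups whose rewards are all identical,
    since their normalized advantages are all zero."""
    return len(set(rewards)) > 1
```

For a group with rewards `[1.0, 0.0, 0.0, 1.0]`, the correct samples receive positive advantages and the incorrect ones negative advantages of the same size, while a group such as `[1.0, 1.0, 1.0]` would be skipped by `keep_group`.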
The reward function is defined as the maximum of two signals: a deterministic rule-based match and a semantic judgment from a compact evaluator model (e.g., Qwen2.5-1.5B). This hybrid approach avoids overfitting to rigid formats while preserving answer correctness across varied notations and phrasings.
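A hybrid reward of this kind can be sketched as follows, assuming the two signals are combined by taking their maximum. The normalization rule, the function names, and `toy_judge` are hypothetical: in practice the judge would be a call to the compact evaluator model, which is replaced here by a simple numeric comparison so the example runs on its own.

```python
import re
from typing import Callable

def normalize(text: str) -> str:
    """Lowercase, trim, and collapse whitespace before comparing."""
    return re.sub(r"\s+", " ", text.strip().lower())

def rule_based_reward(prediction: str, gold: str) -> float:
    """1.0 on exact match after normalization, else 0.0."""
    return 1.0 if normalize(prediction) == normalize(gold) else 0.0

def hybrid_reward(prediction: str, gold: str,
                  judge: Callable[[str, str], float]) -> float:
    """Maximum of the rule-based signal and the judge's score."""
    return max(rule_based_reward(prediction, gold), judge(prediction, gold))

def toy_judge(prediction: str, gold: str) -> float:
    """Hypothetical stand-in for the evaluator model: accepts numeric equality."""
    try:
        same = float(prediction.rstrip("%")) == float(gold.rstrip("%"))
    except ValueError:
        return 0.0
    return 1.0 if same else 0.0
```

With this setup, `"0.50"` against a gold answer of `"0.5"` fails the string match but is still rewarded through the judge, which is the kind of notational variation the hybrid design is meant to tolerate.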
Moreover, the framework is optimized through progressive context scaling, in which the RL process moves from 20K-token to 60K-token input lengths in controlled phases, stabilizing training dynamics and facilitating policy generalization.
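One way to picture the phased schedule is as a list of input-length caps, with each phase drawing on the examples that fit its cap. The sketch below is an assumption-laden illustration: only the 20K and 60K endpoints come from the article, while the two-phase split, the field name `num_tokens`, and the choice to give each phase only the length band it newly opens are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Phase:
    name: str
    max_input_tokens: int

# Only the 20K and 60K endpoints come from the article; the rest is illustrative.
PHASES = [Phase("phase-1", 20_000), Phase("phase-2", 60_000)]

def examples_for_phase(examples, phase_index, phases=PHASES):
    """Select examples whose length falls in the band newly opened by this phase."""
    upper = phases[phase_index].max_input_tokens
    lower = phases[phase_index - 1].max_input_tokens if phase_index > 0 else 0
    return [ex for ex in examples if lower < ex["num_tokens"] <= upper]
```

For example, with inputs of 5,000, 20,000, 35,000, and 70,000 tokens, the first phase would train on the first two, the second phase on the 35,000-token input, and the 70,000-token input would be left out of training entirely.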
Benchmark Results
QwenLong-L1 was evaluated on seven long-context document QA benchmarks: DocMath, Frames, 2WikiMultihopQA, HotpotQA, Musique, NarrativeQA, and Qasper. The 32B variant, QwenLong-L1-32B, showed strong performance:
- It outperformed baseline models such as R1-Distill-Qwen-32B by 5.1 points and surpassed leading systems such as OpenAI-o3-mini and Qwen3-235B-A22B.
- Its performance was comparable to Claude-3.7-Sonnet-Thinking, indicating competitive reasoning capabilities at extreme context lengths.
- Pass@K analysis revealed consistent improvement with additional samples, reaching an average Pass@2 of 73.7 and surpassing DeepSeek-R1 and OpenAI-o1-preview even at low sampling rates.
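For readers unfamiliar with the metric, Pass@K is the probability that at least one of K sampled answers is correct. The article does not say how the figure was computed; the sketch below uses the widely used unbiased estimator, which takes n samples per question of which c are correct and computes 1 - C(n-c, k) / C(n, k). The function name and argument checks are illustrative.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@K estimate from n samples with c correct ones.

    Returns the probability that a random size-k subset of the n samples
    contains at least one correct answer.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

A benchmark-level score is then the mean of `pass_at_k` over all questions, so increasing K can only keep the score the same or raise it.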

Ablation studies further confirmed the individual contributions of SFT, phased RL, and retrospective sampling. Notably, RL played a decisive role in eliciting emergent reasoning behaviors such as grounding, subgoal setting, verification, and backtracking, which supervised fine-tuning alone did not induce effectively.
Conclusion
QwenLong-L1 represents a systematic approach to equipping LRMs with robust long-context reasoning capabilities through reinforcement learning. Its design bridges the gap between short-context expertise and the demands of information-dense environments by combining supervised initialization, curriculum-driven context scaling, and hybrid evaluation strategies. The framework not only achieves state-of-the-art results across long-context benchmarks but also shows interpretable reasoning patterns emerging during training.
Check out the paper, the model on Hugging Face, and the GitHub page. All credit for this research goes to the researchers of this project.
Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for social good. His most recent endeavor is the launch of an artificial intelligence media platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform draws over 2 million monthly views, illustrating its popularity among audiences.




