
Nanbeige4-3B-Thinking: How a 23T-Token Pipeline Lets a 3B Model Rival Previous-Generation 30B-Class Models

Can a 3B model reach 30B-class quality by scaling the training recipe instead of the parameter count? The Nanbeige LLM Lab at Boss Zhipin has released Nanbeige4-3B, a 3B-parameter small language model trained with a heavy emphasis on data quality, curriculum scheduling, and reinforcement learning.

The research team releases two checkpoints, Nanbeige4-3B-Base and Nanbeige4-3B-Thinking, and evaluates the Thinking variant against open models roughly ten times its 3B parameter count.

Benchmark results

On AIME 2024, Nanbeige4-3B-2511 reports 90.4, while Qwen3-32B-2504 reports 81.4. On GPQA-Diamond, Nanbeige4-3B-2511 reports 82.2, while Qwen3-14B-2504 reports 64.0 and Qwen3-32B-2504 reports 68.7. These are the two benchmarks where the paper's "3B beats 10x" claim is most directly supported.

The research team also shows a clear tool-use advantage on BFCL-V4, where Nanbeige4-3B reports 53.8 versus 47.9 for Qwen3-32B and 48.6 for Qwen3-30B-A3B. On Arena-Hard v2, Nanbeige4-3B reports 60.0, the highest score in that comparison table in the research paper. At the same time, the model does not win every category: on FullStackBench it scores 48.0, below both Qwen3-14B and Qwen3-32B, and on SuperGPQA it scores 53.2, below Qwen3-32B at 54.1.

Training recipe: the components that move a 3B model

Hybrid data filtering, then upsampling at scale

For quality filtering, the research team combines model-based labeling across many attributes in parallel. They distill the scheme down to roughly 20 labels and report two key findings: content-related labels are more predictive of downstream quality than format-related labels, and quality is graded on a 0 to 9 scale. For similarity-based scoring, they build a retrieval database with hundreds of billions of entries that supports hybrid text and vector retrieval.

They curate 12.5T tokens of high-quality data, then select a 6.5T higher-quality subset that is upsampled for 2 or more epochs, producing a final corpus of 23T tokens. This is the first place where the recipe departs from generic pretraining: the pipeline does not just clean data once, it scores documents and reuses the best slices repeatedly.
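A minimal sketch of that score-then-upsample step. The `Document` record, quality thresholds, and epoch count below are hypothetical placeholders, not values from the paper:

```python
from dataclasses import dataclass
import random

@dataclass
class Document:
    text: str
    quality: int  # 0-9 grade from the content-focused quality labeler

def build_corpus(docs: list[Document], keep_floor: int = 4,
                 premium_floor: int = 7, extra_epochs: int = 2) -> list[Document]:
    """Score-then-upsample sketch: keep documents above a quality floor,
    then repeat the highest-scoring subset for extra epochs. The thresholds
    and epoch counts here are assumptions, not the paper's settings."""
    kept = [d for d in docs if d.quality >= keep_floor]        # ~12.5T-token pool
    premium = [d for d in kept if d.quality >= premium_floor]  # ~6.5T-token subset
    corpus = list(kept)                # one pass over everything kept
    for _ in range(extra_epochs):      # extra passes over the premium slice
        corpus.extend(premium)
    random.shuffle(corpus)             # re-randomize order before training
    return corpus
```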

FG-WSD, a data schedule instead of uniform sampling

Most comparable projects treat warmup-stable-decay (WSD) as a learning-rate schedule only. Nanbeige4-3B adds a data curriculum inside the stable phase with FG-WSD, Fine-Grained Warmup-Stable-Decay. Instead of sampling a fixed mixture throughout stable training, they progressively concentrate training on higher-quality data.

In an ablation with a 1B model trained on 1T tokens, the paper reports GSM8K improving from 27.1 under vanilla WSD to 34.3 under FG-WSD, with MMLU-Pro improving as well. For the full 3B run, the research team divides training into warm-up, several stable sub-stages of increasing data quality, and decay, and applies ABF (adjusting the RoPE base frequency) in the decay phase to extend the context length to 64K.
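A sketch of what such a stage-wise mixture schedule could look like. The bucket names, stage fractions, and sampling weights below are illustrative assumptions, not the paper's actual schedule:

```python
import random

# Hypothetical FG-WSD-style data schedule: each stable sub-stage draws from a
# different quality mixture, shifting weight toward the best buckets over time.
STAGES = [
    # (fraction of stable-phase tokens, {quality bucket: sampling weight})
    (0.4, {"mid": 0.6, "high": 0.3, "premium": 0.1}),
    (0.4, {"mid": 0.3, "high": 0.5, "premium": 0.2}),
    (0.2, {"mid": 0.1, "high": 0.4, "premium": 0.5}),
]

def sample_bucket(progress: float) -> str:
    """Pick the quality bucket for the next batch, given stable-phase
    progress in [0, 1). Vanilla WSD would use one fixed mixture instead."""
    cum = 0.0
    for fraction, mix in STAGES:
        cum += fraction
        if progress < cum:
            buckets, weights = zip(*mix.items())
            return random.choices(buckets, weights=weights, k=1)[0]
    return "premium"  # end of stable phase: highest-quality data only
```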

Multi-stage SFT, then refined supervision traces

Post-training starts with a cold-start SFT stage, then general SFT. The cold-start stage uses a multi-million-sample QA set focused on math, science, and code at a 32K context length, with a reported mix of roughly 50% math and 20% code tasks. The research team also reports that scaling the cold-start SFT set from 0.5M to 3.5M samples continues to improve AIME 2025 and GPQA-Diamond without saturating.

The second SFT stage runs at the full 64K context and covers general dialogue and writing, agent-style tool use and programming, critical thinking, and coding tasks. This stage introduces deliberate solution refinement plus chain-of-thought reconstruction. The pipeline runs generate, critique, revise cycles guided by a dynamic checklist, then uses a chain-of-thought completion model to reconstruct a coherent reasoning trace that actually leads to the refined final solution. The goal is to avoid training on disjointed traces where a patched answer follows a flawed draft.
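A sketch of how such a refinement-plus-reconstruction loop could be wired up. All `model.*` methods here are hypothetical stand-ins for the paper's unnamed components, not a real API:

```python
def build_refined_trace(problem: str, model, max_rounds: int = 3) -> dict:
    """Generate-critique-revise with chain-of-thought reconstruction.
    `model.generate`, `model.build_checklist`, `model.critique`,
    `model.revise`, and `model.reconstruct_cot` are assumed interfaces."""
    solution = model.generate(problem)                       # initial draft
    for _ in range(max_rounds):
        checklist = model.build_checklist(problem, solution) # dynamic checklist
        critique = model.critique(problem, solution, checklist)
        if critique.passes_all:                              # nothing left to fix
            break
        solution = model.revise(problem, solution, critique)
    # Rebuild one coherent reasoning trace that leads to the refined answer,
    # instead of training on the disjointed draft -> critique -> patch history.
    reasoning = model.reconstruct_cot(problem, solution)
    return {"prompt": problem, "reasoning": reasoning, "answer": solution}
```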

DPD distillation, then multi-stage RL with verifiers

Distillation uses Dual-level Preference Distillation, DPD. The student learns token-level distributions from the teacher model, while a sequence-level DPO-style objective widens the margin between preferred and rejected answers. Positives are sampled from the teacher, Nanbeige3.5-Pro, negatives are sampled from the 3B student, and token-level distillation is applied to both sample types, which the team reports reduces overconfidence errors and outperforms alternative distillation setups.
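An illustrative PyTorch sketch of a DPD-style objective under these assumptions: a token-level KL term toward the teacher on both the chosen and rejected sequences, plus a standard DPO term. The weighting, shapes, and exact formulation are guesses, not the paper's loss:

```python
import torch
import torch.nn.functional as F

def dpd_loss(student_logits_pos, teacher_logits_pos,   # [batch, seq, vocab]
             student_logits_neg, teacher_logits_neg,
             logp_pos_student, logp_neg_student,        # [batch] sequence log-probs
             logp_pos_ref, logp_neg_ref,
             beta: float = 0.1, alpha: float = 1.0) -> torch.Tensor:
    """Sketch of Dual-level Preference Distillation: token-level KL toward the
    teacher on both the preferred (teacher-sampled) and rejected
    (student-sampled) sequences, plus a DPO-style preference term."""
    # Token-level distillation on both sample types.
    kl_pos = F.kl_div(F.log_softmax(student_logits_pos, dim=-1),
                      F.softmax(teacher_logits_pos, dim=-1),
                      reduction="batchmean")
    kl_neg = F.kl_div(F.log_softmax(student_logits_neg, dim=-1),
                      F.softmax(teacher_logits_neg, dim=-1),
                      reduction="batchmean")
    # Sequence-level DPO term: widen the student's margin between preferred
    # and rejected responses relative to a frozen reference model.
    margin = (logp_pos_student - logp_pos_ref) - (logp_neg_student - logp_neg_ref)
    dpo = -F.logsigmoid(beta * margin).mean()
    return alpha * (kl_pos + kl_neg) + dpo
```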

Reinforcement learning is split by domain and applied per category with on-policy GRPO. The research team describes filtering RL prompts using avg@16 pass rates, keeping samples whose pass rate falls between 10% and 90% to discard tasks that are either trivial or unsolvable. STEM RL uses an agentic verifier that calls a Python interpreter to check answer equivalence rather than exact string matching. Coding RL uses synthesized unit tests, validated in a sandbox, and derives rewards from pass rates on those tests. Human-preference RL uses a generative reward model designed to produce preference judgments with fewer tokens, reducing reward-hacking risk compared to conventional long-form LLM judging.
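A minimal sketch of the avg@k difficulty filter described above. `policy.sample` and `verifier.check` are hypothetical interfaces, and whether the 10%/90% bounds are inclusive is an assumption:

```python
def filter_rl_prompts(prompts, policy, verifier, k: int = 16,
                      low: float = 0.10, high: float = 0.90) -> list:
    """avg@k difficulty filter: sample k rollouts per prompt from the current
    policy, score each with a domain verifier, and keep only prompts whose
    pass rate sits between the bounds, i.e. neither trivial nor unsolvable."""
    kept = []
    for prompt in prompts:
        rollouts = [policy.sample(prompt) for _ in range(k)]
        pass_rate = sum(verifier.check(prompt, r) for r in rollouts) / k
        if low < pass_rate < high:
            kept.append(prompt)
    return kept
```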

Comparison table

Benchmark (metric)        Qwen3-14B-2504   Qwen3-32B-2504   Nanbeige4-3B-2511
AIME 2024 (avg@8)         79.3             81.4             90.4
AIME 2025 (avg@8)         70.4             72.9             85.6
GPQA-Diamond (avg@3)      64.0             68.7             82.2
SuperGPQA (avg@3)         46.8             54.1             53.2
BFCL-V4 (avg@3)           45.4             47.9             53.8
FullStackBench (avg@3)    55.7             58.2             48.0
Arena-Hard v2 (avg@3)     39.9             48.4             60.0

Key takeaways

  • A 3B model can lead much larger open reasoning models, under the paper's evaluation setup. Nanbeige4-3B-Thinking reports AIME 2024 avg@8 of 90.4 versus 81.4 for Qwen3-32B, and GPQA-Diamond avg@3 of 82.2 versus 64.0 for Qwen3-14B and 68.7 for Qwen3-32B.
  • The evaluation protocol matters: these are avg@k results under a specific harness, not single-shot accuracy. AIME uses avg@8, most other benchmarks use avg@3, with temperature 0.6, top-p 0.95, and a long maximum generation length.
  • The gains come from the data curriculum, not just more tokens. FG-WSD schedules the highest-quality data into the later stages of training, and the 1B ablation shows GSM8K rising from 27.1 under vanilla WSD to 34.3 under FG-WSD.
  • Post-training centers on supervision quality plus preference-aware distillation. The pipeline uses deliberate solution refinement with chain-of-thought reconstruction, then DPD combines token-level distribution matching with a sequence-level preference objective.

Check out the Paper and Model Weights. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, follow us on Twitter and don't forget to join our 100K+ ML SubReddit and subscribe to our Newsletter. You can also join us on Telegram.


Michal Sutter is a data scientist with a Master of Science in Data Science from the University of Padova. With a solid foundation in statistical analysis, machine learning, and data engineering, Michal excels at transforming complex datasets into actionable insights.
