VibeThinker-3B: A Dense 3B Reasoning Model Built on Qwen2.5-Coder-3B with a Spectrum-to-Signal Post-Training Pipeline

While recent breakthroughs in AI reasoning have been driven largely by sheer scale, with models growing to hundreds of billions of parameters to push the limits of cognitive complexity, VibeThinker-3B charts a completely different path.

Created by researchers from Sina Weibo Inc (China), this 3 billion parameter model shows that an efficient model can punch far above its weight class. Released under the MIT open source license, VibeThinker-3B matches the performance of models hundreds of times its size on verifiable tasks such as math, coding, and STEM disciplines.

What is VibeThinker-3B?

VibeThinker-3B is a compact dense model built on the Qwen2.5-Coder-3B base. It is post-trained, not trained from scratch. The research team applies supervised fine-tuning, reinforcement learning, and self-distillation on top.

The training framework continues the Spectrum-to-Signal Principle (SSP) from the earlier VibeThinker-1.5B. SFT (Supervised Fine-Tuning) builds a broad range of valid reasoning paths, the 'Spectrum.' RL then amplifies the correct ones, the 'Signal.'

The model targets one kind of task: reasoning where a verifier can check the answer. The research team recommends larger general-purpose models for open-domain knowledge tasks. VibeThinker-3B is a specialist by design.

It works with standard stacks. The model weights require transformers>=4.54.0. For fast inference, vLLM==0.10.1 or SGLang>=0.4.9.post6 is recommended. The BF16 weights take about 6 GB, which is small enough for a single GPU.
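The 6 GB figure follows from simple arithmetic: BF16 stores each parameter in 2 bytes. The sketch below assumes a nominal count of exactly 3 billion parameters and decimal gigabytes; the helper name is illustrative, and the estimate covers weights only, not the KV cache, activations, or framework overhead.

```python
# Rough weight-memory estimate (weights only; ignores KV cache and activations).
# Assumption: a nominal 3e9 parameters, which is not an exact count from the report.
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "int8": 1}


def weight_memory_gb(num_params: float, dtype: str = "bf16") -> float:
    """Return the approximate weight size in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


print(weight_memory_gb(3e9, "bf16"))  # 6.0
```

In practice, a 64K-token context adds a sizeable KV cache on top of this, so the real memory footprint during inference will be higher than the weight size alone.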

Benchmarks

On AIME26, VibeThinker-3B scores 94.3. According to the research paper, this is comparable to DeepSeek V3.2 (671B) and Kimi K2.5 (1T).

On LiveCodeBench v6, it reaches 80.2 Pass@1. On OJBench, another code benchmark, it scores 38.6, below the larger models. On HMMT25 it scores 89.3, and on BruMO25 it reaches 93.8. On IMO-AnswerBench, a set of 400 IMO-level problems, it scores 76.4.

The table below compares it with widely used reasoning models. The '+CLR' row uses test-time scaling. CLR stands for Claim-Level Reliability Testing.

Model	Parameters	AIME26	HMMT25	IMO-Ans	LCBv6	GPQA-D
VibeThinker-3B	3B	94.3	89.3	76.4	80.2	70.2
VibeThinker-3B +CLR	3B	97.1	95.4	80.6	–	72.9
GPT-OSS (top)	120B	93.2	90.0	75.6	81.9	80.1
DeepSeek V3.2	671B	94.2	90.2	78.3	80.8	82.4
GLM-5	744B	95.8	97.9	82.5	85.5	86.0
Kimi K2.5	1T	93.3	95.4	81.8	85.0	87.6

Source: VibeThinker-3B Technical Report, Table 2. GPQA-D is GPQA-Diamond.

The pattern is consistent. On math and code, where answers can be verified, the 3B model sits near the top cluster. On GPQA-Diamond, a knowledge-heavy benchmark, the gap to the larger models remains apparent.

The research team also ran a test on problems the model could not have seen during training. It used recent LeetCode weekly and biweekly contests, from April 25 to May 31, 2026. The model passed 123 of 128 Python submissions on the first attempt. That is a 96.1% acceptance rate on unseen problems.

Inside the Spectrum-to-Signal Pipeline

The post-training pipeline goes through four stages. Each targets a different weakness of small reasoning models.

First comes a two-stage, curriculum-based SFT. Stage 1 covers math, coding, STEM, dialogue, and general instruction following. Stage 2 shifts to rigorous, long-horizon samples sorted by reasoning length and complexity. Diversity-Exploring Distillation preserves many valid solution paths in both stages.

Second comes multi-domain reasoning RL. The research team reuses MaxEnt-Guided Policy Optimization (MGPO). MGPO puts more weight on problems near the model's current capability boundary, where correct and incorrect rollouts are roughly balanced. Training proceeds sequentially across Math, Code, and STEM.
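To make the capability-boundary idea concrete, here is a minimal sketch of one way such a weighting could look. It assumes that a problem's weight decays with the KL divergence between its empirical pass rate and the maximum-entropy value of 0.5. The exact functional form, the `lam` sharpness parameter, and the function names are illustrative assumptions, not details taken from the report.

```python
import math


def kl_to_half(p: float, eps: float = 1e-6) -> float:
    """KL divergence from Bernoulli(p) to Bernoulli(0.5), in nats."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return p * math.log(p / 0.5) + (1.0 - p) * math.log((1.0 - p) / 0.5)


def boundary_weight(num_correct: int, num_rollouts: int, lam: float = 1.0) -> float:
    """Weight a problem by how close its pass rate is to 0.5 (assumed form)."""
    pass_rate = num_correct / num_rollouts
    return math.exp(-lam * kl_to_half(pass_rate))


# Problems the model solves about half the time get the largest weight;
# problems it always or never solves are down-weighted.
for correct in (0, 4, 8, 12, 16):
    print(correct, round(boundary_weight(correct, 16), 3))
```

Under this assumed form, a problem solved in 8 of 16 rollouts keeps full weight, while problems that are always or never solved contribute much less to the update.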

One detail is noteworthy: VibeThinker-3B drops progressive context expansion. The research team found that aggressive early truncation impairs long-horizon reasoning at this scale. So RL uses a single 64K context window throughout.

Math RL adds a Long2Short stage. It redistributes reward among correct trajectories according to length. Short correct answers receive a higher reward, long ones a lower one, and the group mean stays unchanged. The goal is fewer wasted tokens without losing accuracy.
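The mean-preserving redistribution can be sketched as follows. This is only one possible instance: the linear length adjustment, the base reward of 1.0 for a correct answer, and the `alpha` strength parameter are assumptions made for illustration, not the report's formula.

```python
def long2short_rewards(correct, lengths, alpha=0.5):
    """Shift reward from long to short correct trajectories, keeping the group mean.

    correct: list of bool, one per trajectory in the group.
    lengths: list of token counts, same order.
    alpha:   assumed adjustment strength; keep below 1 so correct rewards stay positive.
    """
    rewards = [1.0 if c else 0.0 for c in correct]
    idx = [i for i, c in enumerate(correct) if c]
    if len(idx) < 2:
        return rewards  # nothing to redistribute
    lens = [lengths[i] for i in idx]
    mean_len = sum(lens) / len(lens)
    spread = max(lens) - min(lens)
    if spread == 0:
        return rewards  # all correct trajectories have the same length
    for i in idx:
        # Deviations from the mean length sum to zero, so the mean reward is preserved.
        rewards[i] = 1.0 + alpha * (mean_len - lengths[i]) / spread
    return rewards


print(long2short_rewards([True, True, False, True], [100, 300, 50, 200]))
# [1.25, 0.75, 0.0, 1.0]
```

Because the adjustments cancel out across the correct trajectories, the group's total reward is the same as before; only the ranking among correct answers changes in favor of shorter ones.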

Third, Offline Self-Distillation merges the RL checkpoints back into a single student model. Fourth, instruction RL improves instruction following. That stage explains the 93.4 IFEval and 74.5 IFBench scores. Both suggest that tuning for reasoning did not break controllability.

CLR: Scaling at Test Time, Not Parameter Count

Claim-Level Reliability Testing (CLR) is the report's test-time scaling method. It applies to tasks with verifiable answers and adds no parameters.

The process has two steps. The model first generates K = 32 trajectories for each problem. For each one, it outputs M = 5 claims related to its decisions and its final answer.

The model then acts as its own verifier. It verifies or falsifies each claim, producing binary verdicts. CLR maps these verdicts to a non-linear credibility weight for the trajectory, where a single weak claim significantly reduces the weight.

Trajectories then vote for their answers with these weights, and the answer with the highest total credibility weight wins. The full procedure is run 8 times, and the average Pass@1 is reported. CLR raises AIME26 to 97.1 and BruMO25 to 99.2.
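The aggregation step can be illustrated with a small sketch. It takes the binary claim verdicts as given, rather than calling a model, and assumes a multiplicative penalty per failed claim as the non-linear credibility function. The `penalty` value and the function names are hypothetical; the report's actual weighting may differ.

```python
from collections import defaultdict


def trajectory_weight(verdicts, penalty=0.1):
    """Credibility weight of one trajectory from its binary claim verdicts.

    Assumed form: each falsified claim multiplies the weight by `penalty`,
    so a single failed claim cuts the weight sharply.
    """
    failed = sum(1 for v in verdicts if not v)
    return penalty ** failed


def clr_vote(trajectories, penalty=0.1):
    """Pick the answer with the highest total credibility weight.

    trajectories: list of (answer, verdicts) pairs, one per sampled trajectory.
    """
    totals = defaultdict(float)
    for answer, verdicts in trajectories:
        totals[answer] += trajectory_weight(verdicts, penalty)
    return max(totals, key=totals.get)


# Three trajectories agree on "42" but each has one falsified claim;
# one trajectory answers "17" with all five claims verified.
sampled = [
    ("42", [True, True, False, True, True]),
    ("42", [True, False, True, True, True]),
    ("42", [True, True, True, True, False]),
    ("17", [True, True, True, True, True]),
]
print(clr_vote(sampled))  # 17
```

Plain majority voting would pick "42" here. With claim-level weights, the single fully verified trajectory outweighs three trajectories that each contain a falsified claim, which is the behavior the non-linear weighting is meant to produce.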
