Qwen's Former Lead in Incorrect Integrated Thinking – and Why She Now Supports Agents

0 0 5 minutes read

Qwen's Former Lead in Incorrect Integrated Thinking – and Why She Now Supports Agents

Junyang Lin was the technical lead for Alibaba's Qwen project. He announced that he will step down on March 3, 2026. He now lists himself as an independent researcher on his website.

In a talk titled 'Qwen: Towards a Generalist Model/Agent,' he walks with the Qwen family. It ends with one line: “Training models -> training agents.” He later expanded that line into a detailed post as an independent researcher. This article reads the speech and the detailed post together.

Which actually covers Lin's speech

The speech is a family tour of Qwen models, not a single release. Compatible with QwQ-32B, Qwen2.5-Max, Qwen3, Qwen2.5-VL, and Qwen2.5-Omni. Each stop shows benchmark charts against contemporaries. Innovative frameworks include DeepSeek-R1, Grok 3 Beta, Gemini 2.5 Pro, and the OpenAI series.

The Qwen3 stop holds a lot of information. Lin highlights mixed thinking modes: the thinking mode of step-by-step thinking, and the non-thinking mode of immediate answers. Add dynamic thinking budgets, so callers can cover how much the model reasons. Qwen3 has extended multilingual support from 29 to 119 languages and dialects.

The presentation lists many types of models and sizes from parameters 0.6B to 235B. It also lists a number of formats including GGUF, GPTQ, AWQ, and MLX, all under Apache 2.0. Two demos follow: a Web Dev demo and a Deep Research demo. “Future Work” closing refers to agents. It includes more pre-training, RL with natural feedback, longer context, and more methods. The last key mentioned is “training models -> training agents.”

Qwen3 Architecture, As Illustrated in Speech

The lecture includes tables of Qwen3 properties, which are reproduced below.

Model	Layers	Heads (Q/KV)	Embedding Responsibilities / Professionals (Content/Legal.)	Context
Qwen3-0.6B	28	16/8	Thabiso: Yes	32K
Qwen3-1.7B	28	16/8	Thabiso: Yes	32K
Q3-4B	36	32/8	Thabiso: Yes	32K
Qwen3-8B	36	32/8	Liability: No	128K
Qwen3-14B	40	40/8	Liability: No	128K
Qwen3-32B	64	64/8	Liability: No	128K
Qwen3-30B-A3B	48	32/4	Professionals: 128/8	128K
Qwen3-235B-A22B	94	64/4	Professionals: 128/8	128K

Smaller denser models include input and embedding and use 32K cores. Larger denser models with MoE reduce binding and increase the core to 128K. The two MoE models use 8 experts out of 128 per token.

Hybrid Thinking, and Why Integration is Hard

Lin presents mixed thinking as a pure trait. The post explains why it was difficult to build. Lin writes that thinking mode and teaching mode pull in different directions.

A robust instructional model is rewarded for directness, brevity, and low latency. A strong thinking model is rewarded by spending more tokens on difficult problems. Mix the two carelessly, and both are demeaning. The behavior of thinking is narrowed, and the behavior of teaching becomes less sharp.

Qwen3 tried to merge with the four-stage pipeline after training. That pipeline included a long CoT cold start, RL logic, and a “think mode integration” step. Later in 2025, the 2507 line sent a different version of Instruction and Thinking instead. Lin frames this as a data problem rather than a model problem.

Anthropic took the opposite route, and Lin calls it a useful adjustment. The Claude 3.7 Sonnet is shipped as a hybrid model with a user-set imaging budget. Claude 4 allows thinking to be accompanied by the use of tools, intended for long-term coding and tasks. His point: a long line of thinking does not make a model smarter. Thinking should be saved for a target workload, not a benchmark.

Interactive Descriptor

From 'Thinking' to 'Agentic Thinking'

Lin draws a line between the two eras. The first one was reflective thinking, defined by o1 and DeepSeek-R1. It taught the field that RL needed determinable, verifiable rewards, so math, code, and logic became central. It also turned RL into a system problem for mass production and validation.

The next period, in his formation, is thinking of agency: thinking in order to act. The agent makes plans, decides when to act, uses tools, learns environmental feedback, and updates. It is explained through close-loop interaction with the world, not through long internal monologue.

Lin enumerates what agentic reasoning must handle that pure reasoning can avoid:

Deciding when to stop thinking and take action
Choosing which tool to request, and in which order
It includes loud or small observations from the surrounding area
Reviewing programs after failure
Maintaining consistency across multiple curves and multiple tool calls

The development target changes with the times. The table below summarizes the highlights that Lin draws.

Size	Thinking is thinking	Agent thinking
It was judged by	The quality of internal discussions before feedback	Whether progress is sustained while working
Reward signal	Verifiable answers (maths, code, logic)	Career success in an interactive environment
A key element of training	Model	Model and its location (harness)
Infra bottleneck	Release, validation, stable policy updates	Tool feeders, sandbox, train-service decoupling
Maximum failure mode	Verbose, low-value thought leads	Rewards hack with access to tools and env leaks

Use Cases, and Examples

The difference is how you build:

Coding agents: The logic model extracts one slice from the stack trace. The agent system runs the test harness, reads the actual error, updates, and restarts until the suite passes. Thinking here should help with codebase navigation, error detection, and tool tuning.
Advanced Search: A thinking model memorizes a long answer. The agent system breaks down the query into sub-queries, calls searches, drops weak sources, and returns grounded quotes. Qwen's Deep Research demo sits in this section.
Multi-agent orchestration: Lin expects 'wire engineering' to be very important. Orchestra programs and routes are active. Special small agents perform small tasks and help control context contamination.

Concrete Hook: Qwen3 Thinking Toggle

Mixed reasoning is expressed directly in code. I enable_thinking tag changes methods in the dialog template.

from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Qwen/Qwen3-8B"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(
    name, torch_dtype="auto", device_map="auto"
)

messages = [{"role": "user", "content": "Refactor this function and explain the change."}]

# enable_thinking=True  -> step-by-step thinking mode
# enable_thinking=False -> near-instant, non-thinking mode
text = tok.apply_chat_template(
    messages, tokenize=False,
    add_generation_prompt=True, enable_thinking=True,
)
inputs = tok(text, return_tensors="pt").to(model.device)

# Qwen's recommended sampling for thinking mode
out = model.generate(
    **inputs, max_new_tokens=2048,
    temperature=0.6, top_p=0.95, top_k=20,
)

enable_thinking=True default, and the output wraps thinking about a ... block. Qwen3 also accepts soft switches. It includes /think or /no_think for the user to toggle the mode with each message. That control of each turn is what dynamic thinking budgets build on.

Why Agentic RL Infrastructure is difficult

The main engineering point of the presentation is about infrastructure. In RL logic, abstractions are usually self-contained methods with pure evaluators. In agent RL, policy resides within a harness of tools servers, browsers, terminals, and sandboxes.

Those harnesses force a new requirement: training and explanation must be brought down cleanly. Without it, the output is collapsed. A code agent waiting for a live test release keeps indexing and starving. GPU usage falls well below what RL's rendering achieves.

Lin also illustrates what to consider. In the SFT era, teams have developed a diversity of data. In the agent's time, he says that teams should improve the quality of the environment: stability, realism, coverage, and exploitation of resistance. He cites reward hacking as the most difficult problem, because access to tools increases the attack surface to be successfully executed by mistake.

Key Takeaways

Junyang Lin left Qwen on March 3, 2026, and is now publishing as an independent researcher.
His speech concludes with one thesis: the field is moving from training models to training agents.
An agent's reasoning is judged by ongoing action in the environment, not by internal deliberation.
Agentic RL needs infra and high quality rail infrastructure, not just guaranteed rewards.
Hacking rewards is a big risk when models get access to real tools.

Sources:

The main source – speech

Main source – Junyang Lin's blog

“From 'Thinking' to 'Contemplative' Thinking”:
His home page (independent researcher status):

Technical details of Qwen3 (architecture, 119 languages, mixed logic)

Qwen3 Technical Report (arXiv:2505.09388): · HTML:

Code verification (enable_thinking, /think /no_thinksamples)

Qwen docs Quickstart:
Qwen3-8B model card:
Qwen3-32B model card:

Travel facts (excerpted from the article)

TechCrunch:
Bloomberg:
VentureBeat:

Support navigation/content (used for cross-checking, not all listed in line)

RecodeChinaAI (LatePost Translation):
Simon Willison:
Geopolitechs:
OfficeChai:
MLQ news:
GenAI Assembling (article analysis, used for article priming):

Two X posts

Michal Sutter is a data science expert with a Master of Science in Data Science from the University of Padova. With a strong foundation in statistical analysis, machine learning, and data engineering, Michal excels at turning complex data sets into actionable insights.

Source link

nimda 2 hours ago

0 0 5 minutes read