Trace Length is a Simple Uncertainty Signal in Reasoning Models

Uncertainty estimation for LLMs is an important research direction for addressing hallucinations and other issues that limit their reliable deployment. In this work, we show that reasoning trace length is a simple and useful confidence measure in large reasoning models. Through comprehensive experiments across multiple models, datasets, and prompts, we show that trace length performs in comparable but complementary ways to other non-trivial confidence measures such as verbalized confidence. Our work reveals that reasoning post-training fundamentally changes the relationship between trace length and accuracy, going beyond previous work showing that reasoning post-training causes traces to grow longer in general (e.g., “overthinking”). Investigating the mechanisms behind trace length's performance as a confidence signal, we find that the effect remains even after adjusting for confounders such as problem difficulty and GRPO-induced length bias. We identify high-entropy or “forking” tokens as playing an important role in the mechanism. Our findings show that reasoning post-training improves uncertainty estimation beyond verbalized confidence, and establish trace length as a practical confidence measure for large reasoning models.
- † University of Southern California
- ‡ Stanford University
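
As an illustration of what using trace length as a confidence measure could look like in practice, the sketch below scores each response by its negated trace length and measures how well that score separates correct from incorrect answers with AUROC. This is a minimal sketch under stated assumptions, not the authors' evaluation code: it assumes shorter traces indicate higher confidence (a direction the abstract does not state explicitly), and the function names and toy data are hypothetical.

```python
def auroc(scores, labels):
    """Probability that a randomly chosen correct answer receives a higher
    score than a randomly chosen incorrect one (ties count as 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def length_confidence(trace_lengths):
    """Assumption: a shorter reasoning trace means higher confidence,
    so the confidence score is the negated trace length."""
    return [-float(n) for n in trace_lengths]


# Hypothetical toy data: trace lengths in tokens and answer correctness.
trace_lengths = [120, 340, 95, 800, 410, 150]
is_correct = [True, False, True, False, True, True]

score = auroc(length_confidence(trace_lengths), is_correct)
print(f"AUROC of negated trace length: {score:.3f}")
```

An AUROC of 0.5 would mean trace length carries no information about correctness, while values closer to 1.0 would mean shorter traces reliably accompany correct answers under the assumed direction.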



