
Smarter, Not Harder: How AI’s Self-Doubt Unlocks Peak Performance

Introduction

Large language models (LLMs) are increasingly capable of solving complex reasoning tasks, such as math Olympiad problems, scientific Q&A, and multi-step logical puzzles[3,8]. But are they really great? Yes, but right now they are computationally expensive and inefficient at test time[5,6]. To address this challenge, researchers at Meta AI have proposed a method called “DeepConf,” short for “Deep Think with Confidence”[1].

To understand why, let’s start with the standard approach: self-consistency with majority voting.

I am sure you are wondering what this looks like in practice. Imagine a classroom of 100 students. You give them a complex Olympiad problem and an hour to solve it. At the end, you collect all the answers and vote: the answer with the most votes wins.

(Source: Author)

This is how self-consistency with majority voting works in LLMs[2,3]. Instead of producing only one solution, the model explores hundreds of reasoning paths (for example, 512 different step-by-step solutions) and then chooses the most frequent answer.

On the AIME 2025 math benchmark, a single pass by Qwen3-8B (called pass@1) gets about 68% accuracy; it’s like taking one answer from one student. But if you generate 512 reasoning traces per question and take the majority answer (called cons@512), accuracy jumps to 82%[1,4].

Sounds great, right? The catch is that those extra 511 traces generate nearly 100 million additional tokens, and more traces don’t always help; performance can plateau or even drop when low-quality solutions dominate the vote[1,7,8]. In other words, if the students are guessing randomly, the class vote doesn’t reflect the best thinker in the room[1].
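In code, plain majority voting is just a frequency count over final answers. A minimal sketch (answers here is a hypothetical list of final answers parsed from each reasoning trace):

from collections import Counter

def majority_vote(answers):
    # Tally each distinct final answer and return the most frequent one.
    return Counter(answers).most_common(1)[0][0]

# Three of four traces agree, so the stray guess is outvoted.
print(majority_vote(["42", "42", "17", "42"]))  # -> 42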


What Did the Researchers Do About It: Early Fixes

Researchers tried to solve this problem by looking at the model’s internal uncertainty signals. Now, what is internal uncertainty? It’s like checking in on each student periodically, say every 5 minutes, to see whether they are taking sensible intermediate steps. The model looks at the probability distribution over each token and calculates its confidence or entropy at that point. If the model has high confidence or low entropy (a sharply peaked distribution with little spread), it is certain about that token prediction, and vice versa[1,11].

By aggregating these token-level statistics across a whole reasoning trace, we can estimate how “trustworthy” the solution really is. We can also filter out the low-confidence traces before majority voting, just like ignoring answers from the students who clearly guessed. Fewer bad votes, stronger results[1].

(Source: Author)

However, these methods are still global and don’t fully solve the efficiency problem[1,6,13].

Let’s talk about some of the math here: how token entropy, token confidence, and trace confidence work[1,11].

Token Entropy:

(Source: Author)

Let’s break this entropy thing down. The log Pᵢ(j) term tells how surprising the token prediction is, where Pᵢ(j) is the probability the model assigns to token j at the ith position. When that probability is 1, the model is dead sure: surprise is 0, no drama, no uncertainty, and the model is fully certain about the token prediction. Summing the surprise of every candidate token, weighted by its probability, gives the entropy of each prediction step[1].
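Written out (a reconstruction of the formula the figure shows), the token entropy at position i is:

H_i = -\sum_j P_i(j) \log P_i(j)

where P_i(j) is the probability of candidate token j at position i.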

Token Confidence:

(Source: Author)

Token confidence measures how sharply the model commits to each token prediction (an anti-surprise meter)[1].
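Per the paper’s definition, token confidence is the negative mean log-probability of the top-k candidate tokens at position i:

C_i = -\frac{1}{k} \sum_{j=1}^{k} \log P_i(j)

The sharper the distribution’s peak, the higher C_i.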

Average Trace Confidence:

(Source: Author)

Once we have the confidence of each token, the average of these per-token scores gives the confidence of the whole trace[1].
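As a formula (again reconstructing the figure), for a trace with n tokens:

C_{\text{trace}} = \frac{1}{n} \sum_{i=1}^{n} C_i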


Confidence-Aware Test-Time Scaling: DeepConf

DeepConf takes the idea further. Instead of generating hundreds of solutions and simply voting on them[2,3,12], it looks at the model’s internal confidence signals during and after generation. It filters out low-quality reasoning traces dynamically, either in real time (online mode) or after all the solutions are generated (offline mode). It keeps only the most trusted reasoning paths and reduces wasted computation[1,6].

And the results? On AIME 2025, DeepConf@512 with GPT-OSS-120B hits a jaw-dropping 99.9% accuracy, compared with 97.0% for plain majority voting and only 91.8% for a single attempt (pass@1). At the same time, DeepConf reduces token generation by up to 84.7% compared to brute-force parallel thinking[1,6,7].

With the intuition clear, it’s time to see how these confidence measures actually work under the hood.

Group Confidence:

(Source: Author)

Cₜ is still our token-level confidence. Think of group confidence (C_Gᵢ) as a zoomed-in certainty check, where |Gᵢ| is the number of tokens in an overlapping sliding window over the previous tokens (for example, 1024 or 2048 tokens). This gives us a local snapshot of the certainty[1].
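Reconstructing the figure from that description:

C_{G_i} = \frac{1}{|G_i|} \sum_{t \in G_i} C_t

i.e., the mean token confidence over the window Gᵢ.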

Bottom 10% Group Confidence:

(Source: Author)

When we sort the group confidence scores and zoom in on the bottom 10%, we are shining a light on the weakest links in the chain of reasoning. If those steps look shaky, we can toss the trace out and save computation[1].
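In symbols, with G_{0.1} denoting the 10% of groups with the lowest confidence:

C_{\text{bottom-10\%}} = \frac{1}{|G_{0.1}|} \sum_{G \in G_{0.1}} C_G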

Tail Confidence:

(Source: Author)

Tail confidence is simple: we take a fixed number of final tokens, say the last 2048, and measure how confident the model is over those last few steps (checking the last mile), which is critical for reaching the correct conclusion[1].
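In symbols, with T_tail the set of final tokens:

C_{\text{tail}} = \frac{1}{|T_{\text{tail}}|} \sum_{t \in T_{\text{tail}}} C_t

All three local measures are simple to compute once you have per-token confidences. Here is a minimal Python sketch; token_confs is assumed to be a list of per-token confidence scores, and the default sizes just mirror the paper’s examples:

def group_confidences(token_confs, window=2048):
    # One score per overlapping sliding window: the mean token
    # confidence over `window` consecutive tokens.
    n = len(token_confs)
    if n <= window:
        return [sum(token_confs) / max(n, 1)]
    return [sum(token_confs[i:i + window]) / window
            for i in range(n - window + 1)]

def bottom_10pct_confidence(group_confs):
    # Mean of the weakest 10% of group scores (at least one group).
    k = max(1, len(group_confs) // 10)
    return sum(sorted(group_confs)[:k]) / k

def tail_confidence(token_confs, tail=2048):
    # Mean confidence over the last `tail` tokens of the trace.
    last = token_confs[-tail:]
    return sum(last) / max(len(last), 1)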

We can use DeepConf in two modes: offline and online[1].


Offline Thinking with Confidence

When you are offline, you don’t call the model again and again or fetch extra data. Instead, you are left with traces you’ve already generated.

The challenge is to squeeze the most reliable answers out of them.

In offline mode, we can do plain voting over the completed traces (which can break down when noisy outcomes dominate) or confidence-weighted majority voting, where each trace votes for its final answer with a weight equal to its mean trace confidence, so an answer’s score is the sum of the confidences of the traces that produced it[1,2].

Confidence Filtering and Voting: Before voting, discard the weakest traces. First filter traces by confidence (keep the top η% of traces), then do either plain voting or confidence-weighted voting[1,9,10].

You can use whichever confidence metric suits you: average trace confidence, group confidence, or tail confidence[1,10,11].

Algorithm 1 for Offline Thinking (source: Deep Think with Confidence[1])

Step-by-step explanation:

Inputs:
Prompt P: the question or input you want answered.
Number of traces N: how many reasoning paths you will generate.
Filtering threshold η: the percentage of top-confidence traces to keep.
Confidence measurement C: the metric used to score each trace.

Initialization:
Create an empty trace set T.
Create an empty confidence set C[1].

Generate Traces:
For each iteration i from 1 to N: generate a trace tᵢ for prompt P.
Calculate the confidence score Cᵢ = C(tᵢ).
Store the pair (tᵢ, Cᵢ) in T and C[1].

Filter High-Confidence Traces:
From all N traces, select the top η% based on their confidence scores.
This removes the noisy or low-quality traces, keeping only strongly confident answers[1].

Voting:
Calculate the vote score V(a) for each possible answer a.
This can be plain counting or weighted voting[1].

Select the Final Answer:
Choose the answer â with the highest vote score[1]:

(Source: Author)
Confidence measurements and Offline Thinking with Confidence (source: Deep Think with Confidence[1])
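Putting the offline pieces together, here is a minimal Python sketch of the filter-then-vote pipeline. generate_trace, extract_answer, and conf are hypothetical stand-ins for your model call, answer parser, and chosen confidence measure; only the control flow follows Algorithm 1:

from collections import defaultdict

def offline_deepconf(prompt, generate_trace, extract_answer, conf,
                     n_traces=512, keep_top_pct=10, weighted=True):
    # Generate N traces and score each one with the confidence measure C.
    scored = [(trace, conf(trace))
              for trace in (generate_trace(prompt) for _ in range(n_traces))]
    # Keep only the top eta% most confident traces.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    kept = scored[:max(1, n_traces * keep_top_pct // 100)]
    # Vote: plain counting or confidence-weighted voting.
    votes = defaultdict(float)
    for trace, c in kept:
        votes[extract_answer(trace)] += c if weighted else 1.0
    # Final answer: the candidate with the highest vote score.
    return max(votes, key=votes.get)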

Online Thinking with Confidence

The algorithm generates traces on the fly, measuring confidence dynamically and stopping as soon as there is enough evidence[1,5,14,15].

The Algorithm:

Algorithm 2 for Online Thinking (source: Deep Think with Confidence[1])

Step-by-Step Explanation
1. Inputs
Prompt P: again the question you’re answering.
Trace budget B: the maximum number of traces you are willing to generate.
Initial traces Nᵢₙᵢₜ: a starting pool of traces to warm up with.
Filtering threshold η: the percentage of high-confidence traces to keep.
Consensus threshold τ: the vote share at which you can stop because you’re confident in the majority answer[1].

2. Offline Warmup
Before generating online:
Run Algorithm 1 with Nᵢₙᵢₜ​ traces.
Compute the confidence threshold s:
Take the (100 − η)th percentile of the confidence scores from the initial traces.
This defines the minimum confidence a token/group needs to be considered.
Initialize the trace set T with the initial traces and calculate initial vote values V(a) for all answers[1].

(Source: Author)

Determine the initial majority answer â[1].

3. Online Generation Loop
While two conditions hold:
The current majority answer is not yet confident enough, i.e., its vote share V(â) / Σₐ V(a) is still below τ:

(Source: Author)

And you still haven’t exceeded the trace budget (|T| < B) → keep generating new traces[1]:

4. Generate a Trace Step-by-Step
While generating a trace t: generate token by token.
After each token i, calculate the group confidence C_Gᵢ at that position.
If C_Gᵢ < s: stop generating this trace early. Else: append token i to the trace t[1].

5. Update
Add the completed trace t to the trace set T.
Compute the trace confidence Cₜ​.
Update vote counts V(a) for all answers.
Update the majority answer â[1].

6. Termination
Stop when either:
The majority answer â achieves consensus above the threshold τ.
Or the trace budget B is reached.
Return the final majority answer â[1].

DeepConf during Online Generation (source: Deep Think with Confidence[1])
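For completeness, a compact Python sketch of the online loop, assuming the offline warmup has already produced the stopping threshold s. generate_tokens is a hypothetical generator that yields (token, group_confidence) pairs so a trace can be abandoned mid-generation; plain counting stands in for the vote update:

from collections import defaultdict

def online_deepconf(prompt, generate_tokens, extract_answer,
                    s, budget=512, tau=0.95):
    votes, total = defaultdict(float), 0
    for _ in range(budget):
        # Generate one trace token by token, stopping early the moment
        # the local group confidence drops below the warmup threshold s.
        tokens = []
        for token, group_conf in generate_tokens(prompt):
            if group_conf < s:
                break
            tokens.append(token)
        answer = extract_answer(tokens)
        if answer is None:
            continue  # an aborted trace contributes no vote
        votes[answer] += 1.0
        total += 1
        best = max(votes, key=votes.get)
        # Terminate once the majority answer reaches consensus tau.
        if votes[best] / total >= tau:
            return best
    return max(votes, key=votes.get) if votes else None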

At its heart, this algorithm is the art of early stopping, saving an enormous amount of computation and resources[1,5,6,7,13,14].


Conclusion

So, what is the moral of the story? Even the smartest “students” in the AI classroom sometimes need a little self-doubt to shine. DeepConf shows how powerful that self-doubt is: we can save millions of tokens of computation not by brute force but by smarter, confidence-based selection. It’s like turning a chaotic math contest into a calm team of expert problem-solvers.

As AI keeps learning to think with confidence, we’re moving toward a future where models are not only smarter but also thriftier, spending less compute, making fewer mistakes, and delivering more brainpower per token. And who knows? Maybe one day your favorite model will be your most frugal, self-aware study buddy. Until then, let’s keep thinking smarter, not harder.


References

[1] Fu, Y., Wang, X., Tian, Y., & Zhao, J. (2025). Deep Think with Confidence. arXiv preprint arXiv:2508.15260.

[2] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., & Zhou, D. (2022). Self-consistency improves chain-of-thought reasoning in language models. arXiv preprint arXiv:2203.11171.

[3] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q. V., Zhou, D., & others. (2022). Chain-of-thought prompting elicits reasoning in large language models. In Advances in neural information processing systems (Vol. 35, pp. 24824–24837).

[4] Art of Problem Solving. (2025a). 2025 AIME I. https://artofproblemsolving.com/wiki/index.php/2025_AIME_I. Accessed: 2025.

[5] OpenAI. (2024). OpenAI o1 system card. arXiv preprint arXiv:2412.16720.

[6] Snell, C., Lee, J., Xu, K., & Kumar, A. (2024). Scaling LLM test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314.

[7] Brown, B., Juravsky, J., Ehrlich, R., Clark, R., Le, Q. V., Ré, C., & Mirhoseini, A. (2024). Large language monkeys: Scaling inference computation with repeated sampling. arXiv preprint arXiv:2407.21787.

[8] Chen, L., Davis, J. Q., Hanin, B., Bailis, P., Stoica, I., Zaharia, M., & Zou, J. (2024a). Are more LLM calls all you need? towards scaling laws of compound inference systems.

[9] Aggarwal, P., Madaan, A., Yang, Y., et al. (2023). Let’s sample step by step: Adaptive consistency for efficient reasoning and coding with LLMs. arXiv preprint arXiv:2305.11860.

[10] Geng, J., Cai, F., Wang, Y., Koeppl, H., Nakov, P., & Gurevych, I. (2024). A survey of confidence estimation and calibration in large language models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 6577–6595.

[11] Fadeeva, E., Rubashevskii, A., Shelmanov, A., Petrakov, S., Li, H., Mubarak, H., … & Panov, M. (2024). Fact-Checking the Output of Large Language Models via Token-Level Uncertainty Quantification. arXiv preprint arXiv:2403.04696.

[12] Farquhar, S., Kossen, J., Kuhn, L., & Gal, Y. (2024). Detecting hallucinations in large language models using semantic entropy. Nature, 630(8017), 625–630.

[13] Li, Y., Yuan, P., Feng, S., Pan, B., Wang, X., Sun, B., … & Li, K. (2024). Escape sky-high cost: Early-stopping self-consistency for multi-step reasoning. arXiv preprint arXiv:2401.10480.

[14] Han, Z., Li, Z., Wang, Y., Guo, C., Song, R., He, J., … & Chen, W. (2024). Adaptive Inference-Time Compute: LLMs Can Predict if They Can Do Better, Even Mid-Generation. arXiv preprint arXiv:2410.02725.

[15] Fu, Y., Chen, J., Zhuang, Y., Fu, Z., Stoica, I., & Zhang, H. (2025). Reasoning without self-doubt: More efficient chain-of-thought through certainty probing. In the ICLR 2025 Workshop on Foundation Models in the Wild.

