DiffuJudge-AV: A Diffusion-Inspired Framework for Calibrated AV Video Evaluation

I Treated an LLM Judge Like a Noisy Sensor. It Changed Which Autonomous-Driving Evaluator I Would Ship.
There is a particular kind of result that looks impressive until you ask the wrong second question.
In this project, that result was a Pearson correlation of 0.753 from a text-only Claude judge grading autonomous-driving visual-QA answers. At first glance, that looks like a usable evaluator. It tracks the gold scores, it produces rationales, it is a strong closed model. Good enough to triage model outputs, right?
Then I looked at quadratic-weighted Cohen’s κ. It was 0.057.
That is the moment the project changed. The judge was rank-correlated with the gold labels, but it was not behaving like an ordinal safety evaluator. It had learned the safest-looking failure mode: compress almost everything toward the middle of the 1–5 scale. For ordinary benchmark reporting, that might pass unnoticed. For an autonomous-driving review pipeline that needs to flag bad answers before they gate a software release, it is dangerous.
So I built DiffuJudge-AV, a small evaluation-of-evaluation framework for LLM/VLM judges on driving video. The idea is simple: treat a judge’s score as a noisy observation of a latent true rubric score, deliberately expose the judge to known sources of scoring bias, then denoise the resulting score distribution with a one-step Tweedie posterior mean and report calibrated uncertainty.
Across 28,400 judge evaluations on Wayve’s LingoQA benchmark, the most interesting finding was not that a larger closed model won. It did not. The best judge in the experiment was Qwen2.5-VL-7B, an open 7B vision-language model. It reached:
- Pearson r = 0.857
- Spearman ρ = 0.856
- Quadratic-weighted Cohen’s κ = 0.837
- MAE = 0.57
- Fail-detection F1 = 0.712
Note: The LingoQA benchmark is released under a non-commercial license. The dataset creators at Wayve have granted permission for its use in this article.
For this AV-style evaluation task, an open VLM was not just competitive. It was better on the metrics that actually matter.
Why “evaluation of evaluation”?
When a model answers a question about a driving scene, the obvious evaluation question is:
Did the model answer correctly?
For example:
Question: Are there any parked cars on the side of the road?
Reference: Yes, there are two cars parked on the right.
Candidate answer (model under test): I don’t know.
Gold score: 1.13 (low).
For a human, this is easy. Watch the clip, compare the answer to the scene, assign a score. At scale, though, human evaluation becomes the bottleneck. Modern autonomy stacks generate more perception clips, scenario logs, counterfactual rollouts, and model outputs than any annotation team can score manually. So teams naturally reach for LLM-as-a-Judge or VLM-as-a-Judge: give a model the question, reference, candidate answer, rubric, and sometimes the frames, then ask it to score.
That creates a second-order problem:
If the judge is a model, how do we know the judge is reliable?
This is evaluation of evaluation (eval-of-eval). Instead of only asking whether the AV model is correct, we ask whether the evaluator itself is stable, calibrated, bias-resistant, and useful for downstream decisions. Recent papers (Judging the Judges by Shi et al., IJCNLP-AACL 2025; JETTS by Salesforce, 2025; CALM by Ye et al., ICLR 2025) have catalogued structural failure modes in LLM judges: position bias, verbosity bias, scoring-ID-format bias, self-inconsistency across runs, and severe score compression.
There is also a more uncomfortable claim from Wang Lun’s recent essay, Your Evals Will Break and You Won’t See It Coming: evaluation infrastructure fails silently when models cross capability thresholds, because current benchmarks assume incremental improvement. His proposed remedy is adaptive evals that detect their own obsolescence. DiffuJudge-AV is one concrete step in that direction. By attaching a calibrated uncertainty to every score the judge emits, the framework widens its own confidence interval before the point estimate misleads you.
For autonomous driving this matters operationally. If a learned evaluator decides which failures get escalated to human review, which scenarios enter a regression suite, or which releases deserve more scrutiny, then the evaluator’s failure modes become part of the safety story.
The intuition: a judge score is a noisy sensor reading
An LLM judge score looks clean because it is a number: 1, 2, 3, 4, or 5.
But that number can move for reasons that have nothing to do with the actual quality of the answer. Change the order of options. Paraphrase the rubric. Reorder criteria. Swap score labels from Arabic numerals to Roman. Resample exemplars. Change temperature. Shuffle the video frames you sample. The true answer quality did not change. The judge changed.
That suggests a useful mental model:
Treat the judge like a noisy sensor.
There is a latent true score $s_0$. The judge never observes it directly. Each prompt variant produces a noisy reading
$$\tilde{s}_t = s_0 + \sigma_t\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, 1)$$
Here $t$ is not a diffusion timestep in the image-generation sense. It is a documented source of judge perturbation, drawn directly from the 2024–2025 LLM-as-a-Judge bias literature. Seven canonical sources, each one a controlled noise level:
| Level t | Perturbation | What it tests | Reference |
| 1 | option / order swap | position bias | Shi et al., 2025 |
| 2 | rubric paraphrase | prompt sensitivity | SPUQ, arXiv 2403.02509 |
| 3 | criterion reorder | rubric-order sensitivity | Chen et al., 2025 |
| 4 | score-ID format swap (1–5 / I–V / A–E) | scoring-format bias | Chen et al., 2025 |
| 5 | temperature noise | self-inconsistency | Thakur et al., 2025 |
| 6 | exemplar resample | few-shot variance | classical |
| 7 | frame shuffle (video) | temporal robustness | this work |
For each item, the framework runs the judge across all seven perturbation levels with k = 3 samples each, giving roughly 22 score observations per item instead of one. That is a practical measurement of judge instability. It also lets us run a reverse step.
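To make the bookkeeping concrete, here is a minimal sketch of the call schedule for one item. The level names, the `SCORE_ID_FORMATS` table, and both helper functions are illustrative assumptions, not the repository's actual API. Seven levels times k = 3 gives 21 perturbed calls; a single unperturbed anchor call would bring that to the 22 quoted above, though that accounting is an assumption rather than something the text states.

```python
from itertools import product

# The seven perturbation levels from the table above (names are shorthand).
PERTURBATION_LEVELS = {
    1: "option_order_swap",
    2: "rubric_paraphrase",
    3: "criterion_reorder",
    4: "score_id_format_swap",
    5: "temperature_noise",
    6: "exemplar_resample",
    7: "frame_shuffle",
}
K_SAMPLES = 3  # k = 3 samples per level, as stated in the text

# Hypothetical label sets for the level-4 score-ID format swap.
SCORE_ID_FORMATS = {
    "arabic": ["1", "2", "3", "4", "5"],
    "roman": ["I", "II", "III", "IV", "V"],
    "letter": ["A", "B", "C", "D", "E"],
}


def build_schedule(item_id):
    """Return one (item_id, level, sample_index) tuple per perturbed judge call."""
    return [
        (item_id, level, k)
        for level, k in product(PERTURBATION_LEVELS, range(K_SAMPLES))
    ]


def to_arabic(label, fmt):
    """Map a judge's label in the given format back to an integer 1-5."""
    return SCORE_ID_FORMATS[fmt].index(label) + 1


schedule = build_schedule("clip_0001")  # "clip_0001" is a made-up item id
```

Mapping every label format back to integers before aggregation keeps the level-4 readings comparable with the other six levels.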
The denoising step: Tweedie in one equation
The diffusion analogy becomes useful because there is a classical result behind denoising: Tweedie’s formula (Robbins 1956; revived for modern diffusion by Manor & Michaeli, ICLR 2024).
If a noisy observation $\tilde{s}$ is generated by adding Gaussian noise with variance $\sigma_t^2$ to a latent clean value $s_0$, the posterior mean is:
$$\mathbb{E}[s_0 \mid \tilde{s}] = \tilde{s} + \sigma_t^2\,\frac{d}{d\tilde{s}} \log p(\tilde{s})$$
The framework estimates $p(\tilde{s})$ with a Gaussian KDE over the per-item sampled scores. Within-perturbation-level variance gives $\sigma_t^2$. Pooling across levels is precision-weighted before the Tweedie correction. Two outputs come out of this single reverse step:
- A denoised point estimate
- A per-item posterior uncertainty
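A minimal sketch of that reverse step, using only the standard library, might look like the following. Several pieces are assumptions rather than the repository's implementation: the variance floor `VAR_FLOOR`, the Silverman-style bandwidth, and the use of the second-order Tweedie identity (posterior variance = σ²(1 + σ² · d²/ds̃² log p)) for the uncertainty. The posterior mean is the pooled reading plus σ² times the slope of log p at that reading.

```python
import math
from statistics import mean, variance

VAR_FLOOR = 1e-2  # assumed floor so identical samples do not give infinite precision


def pool_levels(samples_by_level):
    """Precision-weighted mean of per-level means; returns (pooled, pooled_var)."""
    weights, means = [], []
    for scores in samples_by_level.values():
        var_t = max(variance(scores), VAR_FLOOR)  # within-level variance sigma_t^2
        weights.append(1.0 / var_t)
        means.append(mean(scores))
    total = sum(weights)
    pooled = sum(w * m for w, m in zip(weights, means)) / total
    return pooled, 1.0 / total


def kde_log_derivs(x, samples, h):
    """First and second derivative of log p(x) under a Gaussian KDE.

    The KDE normalising constant cancels in both ratios, so it is omitted.
    """
    k = [math.exp(-0.5 * ((x - s) / h) ** 2) for s in samples]
    p = sum(k)
    dp = sum(ki * (s - x) / h**2 for ki, s in zip(k, samples))
    d2p = sum(ki * ((x - s) ** 2 / h**4 - 1.0 / h**2) for ki, s in zip(k, samples))
    score = dp / p
    return score, d2p / p - score**2


def tweedie_step(samples_by_level):
    """One-step denoising: returns (denoised estimate, posterior std)."""
    all_scores = [s for scores in samples_by_level.values() for s in scores]
    pooled, sigma2 = pool_levels(samples_by_level)
    # Silverman-style bandwidth (an assumption), floored to avoid h = 0.
    spread = math.sqrt(max(variance(all_scores), VAR_FLOOR))
    h = 1.06 * spread * len(all_scores) ** -0.2
    score, curvature = kde_log_derivs(pooled, all_scores, h)
    denoised = pooled + sigma2 * score  # Tweedie posterior mean
    post_var = max(sigma2 * (1.0 + sigma2 * curvature), 1e-6)  # clamped at a small positive value
    return denoised, math.sqrt(post_var)
```

When the samples are symmetric around the pooled reading, the slope of log p is zero and the estimate does not move. When most samples sit to one side, the correction pulls the estimate toward that mass.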
The second output is the part I care about more. In a safety-review workflow, I do not only want a number. I want to know whether the judge is confident enough for that number to be actionable.
The denoised estimate is then wrapped in an ordinal-boundary-adjusted split-conformal interval (Sheng et al., EMNLP 2025), studentized by the Tweedie posterior σ, so the interval half-width scales with each item’s own uncertainty.
Because the score is ordinal, the resulting interval is snapped to valid 1–5 boundaries. The judge’s output is no longer “this answer is a 2.1.” It is one of three actions:
- “This answer is likely in the failure region, and the calibrated interval is narrow enough to escalate automatically.”
- “This answer is likely a clean pass and the interval is narrow. Release.”
- “The model is uncertain on this case. Route to a human reviewer.”
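The interval and the three-way routing rule can be sketched as below. This is standard studentized split-conformal prediction under stated assumptions, not necessarily the exact boundary adjustment from Sheng et al.: the nonconformity score is |gold − ŝ| / σ on a held-out calibration split, the interval is snapped outward with floor and ceiling, and the routing thresholds (fail ≤ 2, pass ≥ 4) are borrowed from the fail- and pass-detection definitions used elsewhere in this article. All function names are hypothetical.

```python
import math


def conformal_quantile(cal_scores, alpha=0.10):
    """Finite-sample (1 - alpha) quantile of calibration nonconformity scores."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    return math.inf if k > n else sorted(cal_scores)[k - 1]


def ordinal_interval(s_hat, sigma, q_hat, lo=1, hi=5):
    """Studentized interval s_hat +/- q_hat * sigma, clipped and snapped outward."""
    lower = math.floor(max(lo, s_hat - q_hat * sigma))
    upper = math.ceil(min(hi, s_hat + q_hat * sigma))
    return lower, upper


def route(lower, upper, fail_max=2, pass_min=4):
    """Map a snapped interval to one of the three actions."""
    if upper <= fail_max:
        return "escalate"      # whole interval sits in the failure region
    if lower >= pass_min:
        return "release"       # whole interval sits in the pass region
    return "human_review"      # interval straddles a decision boundary


# Toy calibration scores |gold - s_hat| / sigma (made-up values).
cal_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
q_hat = conformal_quantile(cal_scores)
```

Requiring the whole snapped interval to sit inside a region is one simple way to encode "narrow enough"; a separate width cap would be an equally reasonable design.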

The problem domain and the data
This framework is general-purpose, but the application is intentionally specific: safety-critical autonomous-driving video evaluation. AV systems generate scenario logs and counterfactual rollouts at a volume that no human team can label. Industry now routinely uses LLM/VLM judges to score model answers, prediction quality, planner rationales, and chain-of-thought outputs, and those judges gate release decisions. NHTSA’s pre-crash typology catalogues 37 light-vehicle crash scenarios; ISO 26262 and SOTIF demand calibrated confidence on safety-critical events; CARLA Leaderboard 2.0 generates more validation traffic per day than any annotation budget can absorb.
The benchmark we test on is LingoQA (Marcu et al., ECCV 2024), a visual question-answering dataset for autonomous driving released by Wayve. Each item is a short driving clip (4 seconds, dash-cam, 1 Hz, 5 frames) with a free-form question, reference answer, and a learned-classifier gold score from Lingo-Judge. We use a stratified 200-clip subset of the official evaluation suite and treat the high-confidence Lingo-Judge scores as Tier-1 anchor labels.
A representative item, the same one the judges grade in production:
Question: “Why did the ego vehicle slow down here?”
Reference answer: “Because a pedestrian started crossing the road at the marked crossing.”
Candidate answer (AV-VLM under test): “Because of traffic ahead.”
Gold score: 2.

The qualitative figure above shows three real LingoQA items. In the score-1 clip (red border) the candidate answer hallucinated a motorcycle that is not in the frames. The vision judge’s rationale explicitly calls out that contradiction. The score-5 clip (green border) is one where the candidate correctly verifies a negative claim (“there are no scooters visible”) that a text-only judge cannot check without seeing the scene. The score-3 clip is genuinely ambiguous: the candidate is partially right.
This is the kind of decision the judge has to make, and the kind of decision a text-only judge cannot make well, because the evidence lives in pixels.
Pipeline
The system that produced the numbers in this article fits in one diagram:

From the top:
- Inputs: a driving clip with sampled frames, the question, the reference answer, and the AV-VLM’s candidate answer.
- Forward perturbation cascade: 7 known judge-bias operators applied programmatically to the prompt, producing 22 prompt variants per item.
- Judge ensemble: five configurations evaluated (Claude text-only ensemble, open-source text ensemble, Claude with 3 frames, Qwen2.5-VL-7B with 1 frame, InternVL2-8B with 1 frame). Each emits a scalar score plus a one-sentence rationale.
- Noisy score samples: per-item distribution of scores tagged with perturbation level.
- Tweedie reverse step: single-step denoising with posterior mean and posterior variance.
- Ordinal conformal interval: boundary-snapped, studentized by the Tweedie σ.
- Eval-of-eval report: Cohen’s κ, Krippendorff’s α, ECE, Brier, MAE, fail-F1, stochastic stability, robustness deltas per perturbation source.
Across all five judge configurations this produced 28,400 real judge evaluations on LingoQA.
The full implementation, scripts, run logs, and every figure in this article live in the project repository at github.com/syedhumarahim/diffujudge-av.
Where this slots into NVIDIA’s AV-Eval stack
I built this framework with NVIDIA’s AV-Eval charter in mind: learned evaluation pipelines that replace hand-crafted rules, agentic workflows that chain model inference with retrieval and structured reasoning, and explicit evaluation-of-evaluation methodology. Every primitive in DiffuJudge-AV maps onto that mandate. The 7-level perturbation cascade is the agentic workflow. The Tweedie and conformal layer is the calibration loop. The 12-category behavior taxonomy used internally maps cleanly to NHTSA pre-crash IDs, ASAM OpenSCENARIO 1.x phenomena, and CARLA Leaderboard 2.0 routes, the same scenario vocabulary NVIDIA’s AV training and eval stack already speaks.
The repository also ships a drop-in wrapper for NVILA-8B, NVIDIA’s own efficient VLM (Liu et al., 2024), and a deployment recipe that serves the three-VLM judge ensemble as OpenAI-compatible NVIDIA NIM endpoints. One caveat: NVILA-8B’s architecture is not yet supported by vLLM 0.8.4, so the running numbers in this article use Qwen2.5-VL-7B and InternVL2-8B as the open-VLM substitutes. The integration shape is ready for the day vLLM lands NVILA support.
Result: Pearson correlation hid the failure mode
Here is the full metric table:
| Model | Mode | r | ρ | κ | MAE | ECE | Fail-F1 |
| Claude TEXT-only | text ensemble | 0.753 | 0.702 | 0.057 | 0.85 | 0.111 | 0.041 |
| Open-source TEXT ensemble | Qwen+Llama+DSV3 | 0.803 | 0.717 | 0.701 | 0.92 | 0.207 | 0.526 |
| Claude VISION | 3 frames | 0.708 | 0.703 | 0.632 | 1.05 | 0.252 | 0.612 |
| Qwen2.5-VL-7B VISION ★ | 1 frame | 0.857 | 0.856 | 0.837 | 0.57 | 0.121 | 0.712 |
| InternVL2-8B VISION | 1 frame | 0.766 | 0.753 | 0.738 | 0.60 | 0.084 | 0.511 |

The key column is Cohen’s κ. Text-only Claude had a respectable Pearson correlation, but almost zero ordinal agreement. Why? Because its predictions were squeezed into a narrow middle band. It was directionally aware, but not operationally useful.
That is the Pearson trap:
A judge can preserve ranking while destroying the decision boundary you actually care about.
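A toy example shows how the two metrics can diverge. The data below is made up, and the sketch assumes Pearson is computed on the judge's raw continuous scores while κ is computed on scores rounded to the 1–5 scale; the article's own pipeline may bin differently.

```python
import math


def pearson(x, y):
    """Plain Pearson correlation."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def quadratic_weighted_kappa(gold, pred, labels=(1, 2, 3, 4, 5)):
    """1 - (observed squared disagreement) / (disagreement expected by chance)."""
    n = len(gold)
    observed = sum((g - p) ** 2 for g, p in zip(gold, pred))
    # Expected disagreement if gold and pred were independent with these marginals.
    expected = sum(
        (a - b) ** 2 * gold.count(a) * pred.count(b)
        for a in labels
        for b in labels
    ) / n
    return 1.0 - observed / expected


gold = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
raw = [3 + 0.2 * (g - 3) for g in gold]  # perfectly ordered, squeezed into [2.6, 3.4]
ordinal = [round(r) for r in raw]        # every prediction rounds to 3

r = pearson(gold, raw)
kappa = quadratic_weighted_kappa(gold, ordinal)
```

Here the raw scores are a perfect linear function of gold, so Pearson is essentially 1, yet every rounded prediction is a 3 and κ is 0: agreement no better than chance.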
A safety-review system needs to distinguish:
- Clear failure → route to human or regression suite.
- Partial answer → inspect or keep uncertain.
- Clean pass → allow lower-priority review.
A judge that refuses to use the bottom and top of the scale cannot support that workflow. Text-only Claude’s fail-detection F1 is 0.041. It flags 2% of actual failures. The same model with three frames jumps to 0.612. Qwen2.5-VL goes further to 0.712, with κ = 0.837.
Result: vision changed Claude’s scoring behavior
The surprising discovery was not just that text-only judging performed worse. It was that the same Claude model behaved differently when given frames.
Text-only Claude compressed predictions into approximately [1.3, 3.5]. With three driving frames, the range expanded to approximately [1.0, 5.0].

The second panel above is Claude TEXT-only: roughly 80% of all judgments at score 3 with a tiny sprinkle of 1s and 5s. The third panel is the same Claude model with three frames: scores now spread across the full ordinal scale. Same model, same rubric, same items.
The compression was not a model-family property or a generic RLHF effect. It was input-mode-specific. When the judge only saw text, it hedged. When the judge saw the scene, it was willing to use the full ordinal scale. Many evaluation pipelines still use text-only judge prompts even for visual tasks. They ask the judge to compare a candidate answer to a reference answer, but the judge never sees the underlying evidence. For driving scenes that is a severe limitation. A text-only judge can check semantic similarity; a vision judge can check whether the answer is grounded in the scene.
Same finding shown another way, per-item scatter against gold:

Left panel: every text-only Claude prediction sits inside [1.3, 3.5] regardless of where the gold actually is. Right panel: the same Claude on the same items with three frames. Predictions now climb the y = x line.
Result: vision unlocks safety-threshold decisions
The bottom-line operational metric for an AV-review pipeline is simple: can this judge flag a bad answer when the gold is bad? That is fail-detection at the ordinal threshold gold ≤ 2.

- Claude TEXT-only: precision 1.00 on fail-detection, but recall 0.02. Catches 2% of actual failures, because it almost never says “≤ 2”.
- Claude VISION: 0.45 precision, 0.94 recall. Catches 94% of failures.
- Qwen2.5-VL-7B: 0.43 precision, 1.00 recall.
For pass-detection (gold ≥ 4), text-only Claude has F1 = 0.00. It never says “5,” so it can never confirm a clean pass. Claude-vision reaches 0.76; Qwen-VL reaches 1.00.
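For reference, fail-detection at a threshold reduces to ordinary precision, recall, and F1 on the binary label "score ≤ 2". The sketch below uses made-up counts (50 failures, of which a hedging judge flags one) to show how precision 1.00 and recall 0.02 combine into an F1 of about 0.04, the same regime as the text-only Claude row.

```python
def threshold_prf(gold, pred, threshold=2):
    """Precision, recall, F1 for the 'fail' class: score <= threshold."""
    pairs = list(zip(gold, pred))
    tp = sum(g <= threshold and p <= threshold for g, p in pairs)
    fp = sum(g > threshold and p <= threshold for g, p in pairs)
    fn = sum(g <= threshold and p > threshold for g, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy data: 50 true failures and 50 passes; the judge says "1" once, "3" otherwise.
gold = [1] * 50 + [5] * 50
pred = [1] + [3] * 99
precision, recall, f1 = threshold_prf(gold, pred)
```

Perfect precision is cheap when a judge almost never commits to a low score; recall is the number that tells you whether failures actually reach a reviewer.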
This is what vision grounding plus a calibrated scale buys you in practice.
Result: a single heatmap of judge bias per noise source
One of the most useful artefacts the perturbation cascade gives you is a single picture that shows which of the seven known judge-bias sources each model family is most sensitive to. For each (model, perturbation level) cell, we average the absolute score shift from the anchor across all items:

A few things stand out:
- Text-only Claude is uniformly fragile. Rubric paraphrase, criterion reorder, score-ID swap, and temperature each shift its mean by ~0.4 on a 1–5 scale. That matches the diffusion-framing’s prediction that compressed, hedging judges drift the most when you perturb the prompt.
- The open-source text ensemble is roughly 3× more robust across every column, maxing out at |Δ| = 0.15 for score-ID swap.
- Qwen2.5-VL is dominated by one specific bias: score-ID format swap (Arabic → Roman → A–E) shifts its mean by 0.44. Knowing which bias matters most is itself actionable: lock the score format in production prompts for this judge.
This is exactly the kind of audit-ready, evaluation-of-evaluation artefact a learned-evaluation pipeline needs to ship alongside the headline metrics.
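One row of that heatmap can be sketched as follows. The input layout is hypothetical, and the sketch assumes the per-item shift is |mean of the level's samples − anchor score|; averaging |sample − anchor| per sample instead would be a different but equally defensible definition.

```python
from collections import defaultdict
from statistics import mean


def bias_row(anchor, perturbed):
    """Mean absolute shift from the anchor, per perturbation level, for one judge.

    anchor:    {item_id: unperturbed score}
    perturbed: iterable of (item_id, level, score) readings
    """
    per_item_level = defaultdict(list)
    for item_id, level, score in perturbed:
        per_item_level[(item_id, level)].append(score)

    shifts = defaultdict(list)
    for (item_id, level), scores in per_item_level.items():
        shifts[level].append(abs(mean(scores) - anchor[item_id]))

    return {level: mean(vals) for level, vals in sorted(shifts.items())}


# Toy readings for two items at levels 4 (score-ID swap) and 5 (temperature).
anchor = {"a": 3, "b": 2}
perturbed = [
    ("a", 4, 4), ("a", 4, 3), ("a", 4, 4),
    ("b", 4, 2), ("b", 4, 3), ("b", 4, 2),
    ("a", 5, 3), ("b", 5, 2),
]
row = bias_row(anchor, perturbed)
```

Stacking one such row per judge gives the (model, perturbation level) grid described above.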
Result: does the uncertainty have signal?
The Tweedie reverse step produces a posterior σ at no extra cost. The question is whether that σ has information. Are the items the cascade marks as uncertain the same items the judge gets wrong? For each vision judge we plot per-item std of perturbation samples (a proxy for posterior σ) against the absolute error against gold:

Qwen2.5-VL’s σ has the cleanest signal (r = 0.26 between predicted-σ and observed-|error|). Items with σ near zero almost never have |error| > 1; items with σ > 0.6 are where the judge’s mean was off by 1–3 score points. That is precisely the regime where the safety-gate diagram says route to human review. We now have empirical evidence the framework’s own uncertainty estimate identifies those items.
Result: stochastic stability hit the original target
One goal was to test whether the framework could expose and reduce stochastic instability. I ran Qwen2.5-VL-7B across 5 random seeds at two temperatures on all 100 vision items:
| Temperature | Median per-item std | Mean | Frac items with std ≤ 0.15 |
| T = 0.6 (noisy single-judge baseline) | 0.40 | 0.40 | 31% |
| T = 0 (deterministic floor) | 0.00 | 0.024 | 95% |
The noisy baseline matched the expected instability almost exactly: about 0.40 per-item standard deviation, in line with the literature. At T = 0, 95% of items sat at or below the original design target of 0.15.
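The three columns of that table are straightforward to compute from per-seed scores. A minimal sketch, assuming population standard deviation across seeds (the article does not say whether the sample or population form was used) and a hypothetical `{item_id: [score per seed]}` layout:

```python
from statistics import mean, median, pstdev


def stability_summary(scores_by_item, target=0.15):
    """Summarise per-item score spread across seeds for one judge and temperature."""
    stds = [pstdev(scores) for scores in scores_by_item.values()]
    return {
        "median_std": median(stds),
        "mean_std": mean(stds),
        "frac_within_target": sum(s <= target for s in stds) / len(stds),
    }


# Toy data: four items, five seeds each; only item "b" is unstable.
scores_by_item = {
    "a": [3, 3, 3, 3, 3],
    "b": [2, 3, 2, 3, 2],
    "c": [4, 4, 4, 4, 4],
    "d": [1, 1, 1, 1, 1],
}
summary = stability_summary(scores_by_item)
```

Running the same summary at each temperature gives the two rows of the table side by side.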

In the diffusion framing, temperature is one of the forward noise sources. I am not saying every production judge should run at T = 0 forever. The point is that a judge harness should measure this instability explicitly instead of pretending the score is deterministic, and report a posterior σ alongside the point estimate.
Result: conformal coverage matches the calibration target
The conformal layer aims for empirical coverage ≥ 1 − α with α = 0.10. Across the three runs with enough items for stable split-conformal calibration:
| Run | n_test | Empirical coverage | Target | Mean interval width |
| Claude TEXT-only | 80 | 0.950 | 0.900 | 4.51 |
| Open-source TEXT | 80 | 1.000 | 0.900 | 4.50 |
| Claude VISION | 20 | 1.000 | 0.900 | 3.50 |
All three are above target. Per-bin coverage (fail / mid / pass tiers) is also above 0.92 in every cell. The intervals are 3.5–4.5 score-units wide on a 1–5 scale. That is the price of full coverage on a heterogeneous calibration set.
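Empirical coverage, per-tier coverage, and mean width are simple counts once the intervals exist. The sketch below uses toy data in which 76 of 80 test items fall inside their interval, which is the arithmetic behind a coverage of 0.950 at n_test = 80. The tier thresholds (≤ 2 fail, ≥ 4 pass) are assumed from the fail- and pass-detection definitions earlier in the article.

```python
def tier(gold_score):
    """Fail / mid / pass tiers using the thresholds from the text (<= 2, >= 4)."""
    if gold_score <= 2:
        return "fail"
    return "pass" if gold_score >= 4 else "mid"


def coverage_report(gold, intervals):
    """Overall coverage, per-tier coverage, and mean interval width."""
    hits = [lo <= g <= hi for g, (lo, hi) in zip(gold, intervals)]
    per_tier = {}
    for g, hit in zip(gold, hits):
        per_tier.setdefault(tier(g), []).append(hit)
    return {
        "coverage": sum(hits) / len(hits),
        "per_tier": {t: sum(h) / len(h) for t, h in per_tier.items()},
        "mean_width": sum(hi - lo for lo, hi in intervals) / len(intervals),
    }


# Toy test split: 80 items, 4 of which fall outside their interval.
gold = [3.0] * 76 + [1.0] * 4
intervals = [(2, 4)] * 80
report = coverage_report(gold, intervals)
```

Reporting width next to coverage matters: an interval that spans nearly the whole scale will always cover the gold score while saying little about it.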
Limitations
A few caveats are worth flagging. The gold labels are high-confidence Lingo-Judge classifier outputs, not a Tier-3 human-adjudicated set. The CODA-LM-style corner-case stress split (cut-ins at night, occluded VRUs, ambiguous near-misses) is not included yet. The best ECE measured is 0.084, not the original 0.05 target; a post-hoc isotonic or Platt calibration on the calibration split would likely narrow that gap. The current VLM ensemble has two committed judges (Qwen2.5-VL and InternVL2-8B) rather than the planned three. The vision runs are smaller than the text-only runs (100 items vs 200) because of available frame maps, so cross-modality numbers are model-level summaries rather than strictly per-item comparable. And the denoising step is a single-step analytical Tweedie correction rather than a multi-step learned sampler. These are honest limitations, but they also map directly onto the roadmap below.
Conclusion
The most important lesson from this project is not that one model beat another. It is that the metric you optimize for during eval-of-eval determines the judge you ship.
If I had optimized for Pearson r alone, I would have shipped a text-only Claude judge that barely used the ordinal scale and caught 2% of safety-critical failures. Using the full eval-of-eval table (ordinal κ, fail-detection F1, calibration, stochastic stability) flipped the decision to an open VLM judge with calibrated uncertainty and a routing rule that sends ambiguous cases to humans. Same data, different metric, different judge in production.
That is the difference between an eval that looks good in a benchmark table and an eval that can support real engineering decisions.
For learned evaluation systems, especially in autonomous driving, robotics, and healthcare, we should stop treating evaluator scores as ground truth. They are measurements. Measurements have noise. Noise has structure. If we can measure that structure, we can build better evaluators.
That is what DiffuJudge-AV tries to do: make the failure modes of the evaluator visible before the evaluator becomes part of the production decision loop. Wang Lun’s essay closes with a line worth quoting: “If you can evaluate correctly, you can train correctly.” This work is one small contribution toward that ambition.
Future work
The roadmap follows directly from the limitations above: a small Tier-3 expert-anchored golden set on the 50 hardest LingoQA items, a CODA-LM corner-case stress split, a third specialized VLM judge (LLaVA-Critic-7B, or NVILA-8B once vLLM lands the architecture), a learned Tweedie MLP that replaces the analytical Gaussian KDE with a small denoiser trained on perturbation-level / judge-family / item-embedding features, a post-hoc isotonic calibration layer to close the remaining ECE gap, and a NIM-served production ensemble with A/B comparison tooling and model versioning.
References
- Shi, Y. et al. Judging the Judges: A Systematic Study of Position Bias in LLM-as-a-Judge. IJCNLP-AACL 2025.
- Chen, X. et al. Evaluating Scoring Bias in LLM-as-a-Judge. arXiv 2506.22316, 2025.
- Thakur, A. et al. Rating Roulette: Self-Inconsistency in LLM-as-a-Judge. arXiv 2510.27106, 2025.
- SPUQ: Semantically-Perturbed Uncertainty Quantification. arXiv 2403.02509, 2024.
- Sheng, H. et al. Analyzing Uncertainty of LLM-as-a-Judge via Conformal Prediction. EMNLP 2025 / arXiv 2509.18658.
- Ye, J. et al. Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge (introduces the CALM framework). ICLR 2025.
- Wang, L. Your Evals Will Break and You Won’t See It Coming, blog, 2025.
- Robbins, H. An empirical Bayes approach to statistics. Berkeley Symp. 1956 (Tweedie’s formula).
- Manor, H. & Michaeli, T. On the Posterior Distribution in Denoising: Application to Uncertainty Quantification. ICLR 2024 / arXiv 2309.13598.
- Marcu, A. et al. LingoQA: Visual Question Answering for Autonomous Driving. ECCV 2024 (Wayve).
- Sima, C. et al. DriveLM: Driving with Graph Visual Question Answering. ECCV 2024.
- Liu, Z. et al. NVILA: Efficient Frontier Visual Language Models. NVIDIA, arXiv 2412.04468, 2024.
- Lin, J. et al. VILA: On Pre-training for Visual Language Models. CVPR 2024 / arXiv 2312.07533 (NVIDIA).
- Wang, S. et al. OmniDrive: A Holistic LLM-Agent Framework for Autonomous Driving with 3D Perception, Reasoning, and Planning. NVIDIA, arXiv 2405.01533, 2024.
- Mao, J. et al. A Survey on Multimodal Large Language Models for Autonomous Driving. NVIDIA / Tsinghua, arXiv 2311.12320, 2023.
- Najm, W. G. et al. Pre-Crash Scenario Typology for Crash Avoidance Research. NHTSA / Volpe, 2007.
- ASAM e.V. OpenSCENARIO 1.x specification. 2022–2024.



