Why Your AI Search Test Is Probably Wrong (And How to Fix It)

I've spent almost a decade working on search systems, and I'm often asked, “How do we know our current AI setup is done right?” The honest answer? A lot of testing. Clear benchmarks let you measure improvements, compare vendors, and justify ROI.
Many teams test AI search by asking a few questions and choosing whichever provider “feels” best. Then they spend six months integrating it, only to find that accuracy is actually worse than their previous setup. Here's how to avoid that $500K mistake.
The problem: ad-hoc testing doesn't reflect production behavior, isn't repeatable, and off-the-shelf benchmarks aren't tailored to your use case. A good benchmark is tailored to your domain, covers different query types, produces consistent results, and resolves disagreement between raters. After years of evaluating search quality, here is a process that actually works in production.
The Evaluation Process
Step 1: Define what “good” means in your use case
Before you write a single test question, define what a “correct” answer looks like. Common criteria include factual accuracy, freshness of results, and credibility of sources.
For a financial services client, this might be: “Numerical data must be accurate to within 0.1% of official sources and cited with publication timestamps.” For a developer tools company: “Code examples must run without modification in the specified language version.”
From there, write down your threshold for switching providers. Instead of a vague “5-15% improvement,” tie it to business impact: if a 1% accuracy improvement saves your support team 40 hours/month, and the switch costs $10K in engineering time, you break even on a 2.5% improvement in the first month.
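Here is a minimal sketch of that break-even arithmetic. The fully loaded support cost per hour is my assumption for illustration, not part of the example above; plug in your own numbers.

```python
# Break-even sketch. HOURS_SAVED_PER_POINT and SWITCHING_COST come from the
# example above; SUPPORT_COST_PER_HOUR is an assumed figure for illustration.
HOURS_SAVED_PER_POINT = 40      # support hours saved per month per 1% accuracy gain
SUPPORT_COST_PER_HOUR = 100     # assumed fully loaded cost, USD/hour
SWITCHING_COST = 10_000         # one-time engineering cost, USD

def break_even_improvement_pct(months: int = 1) -> float:
    """Accuracy improvement (percentage points) needed to recoup the switching
    cost within the given number of months."""
    monthly_value_per_point = HOURS_SAVED_PER_POINT * SUPPORT_COST_PER_HOUR
    return SWITCHING_COST / (monthly_value_per_point * months)

print(break_even_improvement_pct(1))  # -> 2.5 percentage points in the first month
```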
Step 2: Build your golden test set
The golden set is a curated collection of questions and expected answers that gets your organization on the same page about quality. Start by mining your production query logs. I recommend filling the golden set with roughly 80% common query patterns and 20% edge cases. For sample size, aim for at least 100-200 questions; this produces confidence intervals of ±2-3%, tight enough to detect meaningful differences between providers.
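A minimal sketch of what a golden-set record and the 80/20 split might look like. The field names and query lists are illustrative, not tied to any real system.

```python
import random
from dataclasses import dataclass

@dataclass
class GoldenQuestion:
    query: str
    expected_answer: str    # what a top-scoring result looks like, per the rubric below
    category: str           # "common" or "edge_case"
    source_of_truth: str    # URL or document the answer must agree with

def build_golden_set(common, edge_cases, size=200, seed=42):
    """Sample ~80% common query patterns and ~20% edge cases from production logs."""
    rng = random.Random(seed)
    n_common = int(size * 0.8)
    sample = rng.sample(common, n_common) + rng.sample(edge_cases, size - n_common)
    rng.shuffle(sample)
    return sample  # version the saved file (e.g. golden_set_v1.json) so runs are reproducible
```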
From there, create a grading rubric for scoring each question. For factual questions, mine reads: “Score 4 if the result contains a direct answer with an authoritative citation. 3 if it is correct but requires interpretation by the user. 2 if it is partially relevant. 1 if it is only tangentially related. 0 if it is unrelated.” Include 5-10 example questions with scored results for each level.
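It helps to encode the rubric once as data, so human raters and the LLM judge in Step 4 score against identical definitions. A minimal sketch:

```python
# The 0-4 relevance rubric above, encoded once and reused by humans and the LLM judge.
RELEVANCE_RUBRIC = {
    4: "Contains a direct answer with an authoritative citation.",
    3: "Correct, but requires interpretation or extra steps by the user.",
    2: "Partially relevant; addresses only part of the question.",
    1: "Only tangentially related to the question.",
    0: "Not related to the question.",
}
```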
Once you have that list, have two domain experts independently score the top 10 results for each question and measure their agreement with Cohen's Kappa. If it comes in below 0.60, the rubric has problems, such as unclear criteria, too few training examples, or genuine differences in judgment, that need to be addressed before you proceed. When you revise, keep a changelog for each new version of the scoring rubric, and version the test set itself so later runs are reproducible.
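A minimal sketch of the inter-rater check, assuming `scores_a` and `scores_b` are the two experts' 0-4 scores for the same (question, result) pairs in the same order:

```python
from sklearn.metrics import cohen_kappa_score

def check_rater_agreement(scores_a, scores_b, threshold=0.60):
    # Quadratic weights treat a 4-vs-3 disagreement as milder than 4-vs-0,
    # which fits an ordinal 0-4 rubric.
    kappa = cohen_kappa_score(scores_a, scores_b, weights="quadratic")
    if kappa < threshold:
        print(f"kappa={kappa:.2f}: revise the rubric or add examples before proceeding")
    else:
        print(f"kappa={kappa:.2f}: acceptable agreement")
    return kappa
```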
Step 3: Run a controlled comparison
Now that you have your test questions and a clear scoring rubric, run the query set across all providers in parallel and collect the top 10 results from each, including rank, title, snippet, URL, and timestamp. Also log query latency, HTTP status codes, API versions, and result counts.
For RAG pipelines or agentic search tests, pass each provider's results to the same LLM with the same instructions and temperature set to 0, so the only variable you are grading is retrieval quality.
Most tests fail because they run each question only once. Search systems are not deterministic: sampling randomness, API variations, and timeout behavior all introduce trial-to-trial variation. To measure this well, run multiple trials per question (I recommend starting with n=8-16 trials for systematic retrieval tasks and n≥32 for complex reasoning tasks).
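A minimal sketch of the collection loop, covering both the repeated trials and the metadata logging described above. `search_provider()` and the response fields are placeholders for whatever SDK or HTTP client your providers expose, not a real API.

```python
import time

N_TRIALS = 8  # 8-16 for systematic retrieval tasks, 32+ for complex reasoning tasks

def run_trials(provider, golden_set, n_trials=N_TRIALS):
    records = []
    for question in golden_set:
        for trial in range(n_trials):
            start = time.time()
            # Placeholder call: swap in your provider's real client.
            response = search_provider(provider, question.query, top_k=10)
            records.append({
                "provider": provider,
                "query": question.query,
                "trial": trial,
                "latency_s": time.time() - start,
                "status": response.status_code,                    # assumed response shape
                "api_version": response.headers.get("x-api-version"),
                "results": [
                    {"rank": i + 1, "title": r["title"], "snippet": r["snippet"],
                     "url": r["url"], "timestamp": r.get("timestamp")}
                    for i, r in enumerate(response.results)
                ],
            })
    return records
```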
Step 4: Compare with LLM judges
Modern LLMs bring far more compute to each judgment than the ranking systems inside search engines. Search engines use lightweight rerankers optimized for millisecond latency, while an LLM judge can spend 100B+ parameters and several seconds on each decision. That asymmetry means LLMs can assess result quality more carefully than the systems that produced the results.
However, this only works if you give the LLM a detailed scoring prompt that uses the same rubric as the human evaluators. Provide example questions with scored results as demonstrations, and require JSON-formatted output with a relevance score (0-4) and a brief justification for each result.
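A minimal sketch of an LLM-judge call. `call_llm()` is a placeholder for your model client (run it with temperature 0); the prompt reuses the RELEVANCE_RUBRIC defined in Step 2.

```python
import json

JUDGE_PROMPT = """You are grading search results for the question below.
Score each result 0-4 using this rubric:
{rubric}

Question: {question}
Results:
{results}

Return JSON only: [{{"rank": <int>, "score": <0-4>, "justification": "<one sentence>"}}, ...]
"""

def judge_results(question, results, rubric):
    rubric_text = "\n".join(f"{score}: {desc}"
                            for score, desc in sorted(rubric.items(), reverse=True))
    results_text = "\n".join(f"{i + 1}. {r['title']} - {r['snippet']} ({r['url']})"
                             for i, r in enumerate(results))
    prompt = JUDGE_PROMPT.format(rubric=rubric_text, question=question, results=results_text)
    raw = call_llm(prompt, temperature=0)   # placeholder client call
    return json.loads(raw)
```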
To validate the LLM judge, have it and two human experts independently score a validation subset of 100 questions spanning easy, medium, and difficult queries. Then calculate inter-rater agreement using Cohen's Kappa (target: κ > 0.70) and Pearson correlation (target: r > 0.80). I have seen Claude Sonnet reach 0.84 agreement with expert raters when the rubric is well specified.
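A minimal sketch of that validation, assuming `llm_scores` and `human_scores` are aligned lists of 0-4 scores for the validation subset:

```python
from scipy.stats import pearsonr
from sklearn.metrics import cohen_kappa_score

def validate_judge(llm_scores, human_scores):
    kappa = cohen_kappa_score(llm_scores, human_scores, weights="quadratic")
    r, _ = pearsonr(llm_scores, human_scores)
    print(f"kappa={kappa:.2f} (target > 0.70), pearson r={r:.2f} (target > 0.80)")
    return kappa > 0.70 and r > 0.80
```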
Step 5: Measure test stability with ICC
Accuracy alone does not tell you whether your test is reliable. You also need to know whether the differences you see between results reflect real differences in query difficulty or just random noise from provider inconsistency.
The Intraclass Correlation Coefficient (ICC) divides the variation into two buckets: between-question variation (some questions are harder than others) and within-question variation (inconsistent results for the same question across runs).
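A minimal sketch of ICC(1,1) computed from that one-way decomposition, where `scores` is a (n_questions, n_trials) array of per-question scores across repeated runs:

```python
import numpy as np

def icc_1_1(scores: np.ndarray) -> float:
    """One-way, single-rater ICC: (MSB - MSW) / (MSB + (k - 1) * MSW)."""
    n_questions, n_trials = scores.shape
    row_means = scores.mean(axis=1)
    grand_mean = scores.mean()
    # Between-question and within-question mean squares
    ms_between = n_trials * np.sum((row_means - grand_mean) ** 2) / (n_questions - 1)
    ms_within = np.sum((scores - row_means[:, None]) ** 2) / (n_questions * (n_trials - 1))
    return (ms_between - ms_within) / (ms_between + (n_trials - 1) * ms_within)
```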
Here's how to interpret ICC when evaluating AI search providers:
- ICC ≥ 0.75: Good reliability. Provider responses are consistent.
- ICC = 0.50-0.75: Moderate reliability. A mixed contribution from query complexity and provider inconsistency.
- ICC < 0.50: Poor reliability. Single-run results cannot be trusted.
Consider two providers, both of which achieve 73% accuracy:
| Accuracy | ICC | Interpretation |
|---|---|---|
| 73% | 0.66 | Consistent behavior across trials. |
| 73% | 0.30 | Unreliable. The same question produces different results on different runs. |
Without ICC, you might pick the second provider expecting a steady 73% accuracy, only to hit reliability problems in production.
In our study evaluating providers on GAIA (reasoning tasks) and FRAMES (retrieval tasks), we found ICC varies widely with task complexity, from 0.30 for complex reasoning with suboptimal models to 0.71 for systematic retrieval. In general, accuracy improvements without ICC improvements reflect random sampling rather than real capability gains.
What Success Really Looks Like
With that validation in place, you can test providers across your entire test set. The results may look like this:
- Provider A: 81.2% ± 2.1% accuracy (95% CI: 79.1-83.3%), ICC=0.68
- Provider B: 78.9% ± 2.8% accuracy (95% CI: 76.1-81.7%), ICC=0.71
The intervals do not overlap, so Provider A's accuracy advantage is statistically significant at p<0.05. However, Provider B's higher ICC means they are more consistent—same question, predictable results. Depending on your use case, consistency may be more important than the 2.3pp accuracy difference.
- Provider C: 83.1% ± 4.8% accuracy (95% CI: 78.3-87.9%), ICC=0.42
- Provider D: 79.8% ± 4.2% accuracy (95% CI: 75.6-84.0%), ICC=0.39
Provider C appears better, but those wide confidence intervals overlap substantially. More importantly, both providers have ICC < 0.50, indicating that most of the variability comes from trial-to-trial randomness rather than question difficulty. When you see results like this, your test methodology itself needs debugging before you can trust the comparison.
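A minimal sketch of the decision logic above: per-question accuracies per provider, a normal-approximation 95% CI, and an ICC sanity check before trusting the ranking. Thresholds and the output format are illustrative.

```python
import numpy as np

def summarize(provider, per_question_accuracy, icc):
    """Report mean accuracy with a 95% CI and flag providers whose ICC < 0.50."""
    acc = np.asarray(per_question_accuracy, dtype=float)
    mean = acc.mean()
    half_width = 1.96 * acc.std(ddof=1) / np.sqrt(len(acc))
    reliable = icc >= 0.50
    note = "" if reliable else "  <- debug the eval before trusting this comparison"
    print(f"{provider}: {mean:.1%} ± {half_width:.1%} (95% CI), ICC={icc:.2f}{note}")
    return mean - half_width, mean + half_width, reliable
```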
This is not the only way to evaluate search quality, but I find it the most effective at measuring both accuracy and reliability. The framework delivers repeatable results that predict production performance and lets you compare providers on an equal footing.
Right now, most teams rely on cherry-picked demos, and most vendor comparisons are meaningless because everyone scores differently. When you're making million-dollar decisions about search infrastructure, you owe it to your team to get it right.



