Why 90% Accuracy in Text-to-SQL is 100% Irrelevant

I have been working in the Analytics space for over 20 years. Back then it wasn't called "analytics"; in the old days it was "Business Intelligence" or "Decision Support Systems". The buzzwords keep changing, from data warehouses to Big Data to data lakehouses and now to AI, but the core and eternal promise of self-service analytics remains the same: extracting truth from data to empower users without relying on someone from the data team. AI without humans in the loop? That still sounds controversial.
With the advent of Large Language Models (LLMs), one use case I find particularly interesting is building conversational interfaces to databases (Text-to-SQL). The potential here is huge, promising to democratize data access across the whole organization.
However, for this specific application, the solution must be binary. Either it works or it doesn't.
80% or even 90% accuracy is, unfortunately, not enough. Shipping your end users an AI analytics application that invents tables or misinterprets filters is no joke. You cannot compromise on accuracy, because it destroys trust quickly. And what happens when the system loses trust? It will not be used. Adoption collapses, to say nothing of the catastrophic risk of business decisions being made on incorrect data.
RAG Pipeline Complexity
I began my research on this topic about a year and a half ago, and it quickly became clear that building a robust Text-to-SQL RAG (Retrieval-Augmented Generation) application is no small feat. You need many parts in your pipeline, all working in perfect harmony:
- An intent classifier to determine the purpose of the question.
- A vector database to store the additional context (such as entity definitions) that the language model needs.
- An embedding model to vectorize that additional context.
- A retrieval mechanism (semantic search) to find the relevant context for a given question.
- Access to the database.
- The ability to generate SQL in the database's specific dialect.
- And a way to check the results.
This last part, testing, is often left out or treated as an afterthought, but I believe it is perhaps the most important piece for achieving the reliability a business environment demands.
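To make those moving parts concrete, here is a minimal sketch of such a pipeline in Python. Everything here is hypothetical scaffolding (the function names, the orchestration are mine); it assumes generic LLM, vector-search, and database callables rather than any specific vendor API:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class PipelineResult:
    intent: str
    sql: str
    rows: list

def text_to_sql_pipeline(
    question: str,
    classify_intent: Callable[[str], str],    # LLM call: what is the user asking?
    retrieve_context: Callable[[str], list],  # vector search over schema docs / definitions
    generate_sql: Callable[[str, list], str], # LLM call: question + context -> dialect SQL
    execute: Callable[[str], list],           # run against the real database
    validate: Callable[[str, list], bool],    # the often-forgotten testing step
) -> PipelineResult:
    intent = classify_intent(question)
    context = retrieve_context(question)      # entity definitions, table docs, etc.
    sql = generate_sql(question, context)
    rows = execute(sql)
    if not validate(sql, rows):
        raise ValueError("Generated SQL failed validation; never show this to the user.")
    return PipelineResult(intent, sql, rows)
```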
BigQuery: A Case Study in Native AI Integration
Managing this complex pipeline often means stitching together multiple platforms. I've recently been impressed with how BigQuery has integrated analytics and generative AI natively into one platform.
You can work with your SQL in the BigQuery IDE and quickly add generative AI without moving to another platform or product. For example, you can query the database and send the returned results straight to Gemini (or, via Vertex AI, to other models). You can use Gemini to extract intent, create embeddings and store them in BigQuery itself, perform semantic searches, and generate SQL.
All that with just one platform, without the hassle of managing multiple subscriptions.
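As an illustration, here is a hedged sketch of calling Gemini on query results from Python via BigQuery ML's ML.GENERATE_TEXT. It assumes you have already created a remote Gemini model over a Vertex AI connection; the `my_ds.gemini_model` and `my_ds.sales` names are placeholders, and exact option names may vary by BigQuery version:

```python
from google.cloud import bigquery

client = bigquery.Client()  # assumes default project credentials

# Placeholder names: `my_ds.gemini_model` must be a remote model created
# beforehand with CREATE MODEL ... REMOTE WITH CONNECTION pointing at Vertex AI.
query = """
SELECT ml_generate_text_llm_result AS summary
FROM ML.GENERATE_TEXT(
  MODEL `my_ds.gemini_model`,
  (
    SELECT CONCAT(
      'Summarize these sales figures in one sentence: ',
      TO_JSON_STRING(ARRAY_AGG(STRUCT(region, revenue)))
    ) AS prompt
    FROM `my_ds.sales`
  ),
  STRUCT(0.2 AS temperature, TRUE AS flatten_json_output)
)
"""
for row in client.query(query).result():
    print(row.summary)  # the query results never left the platform
```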
Yes, like everything in life, it has its problems.
One of the main disadvantages is that BigQuery is not the cheapest database, and I've heard stories of startups where one runaway query blew up the credit card bill. It hasn't happened to me, but I can see how it could. Another catch is that you are completely locked into Google. Maybe that's not such a bad thing; we're all stuck with Gmail in much the same way. Perhaps in the future AI will be a commodity, the way email is now.
Another drawback is the lack of granular token-cost tracking and of an "LLM mock" mode for development; you really don't want to burn an expensive LLM during your development phase.
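BigQuery won't solve that for you, but a common application-side pattern is to hide the model behind a small interface and swap in a canned mock during development. A minimal sketch (all class and method names here are my own, not any library's API):

```python
from typing import Protocol

class LLM(Protocol):
    def complete(self, prompt: str) -> str: ...

class MockLLM:
    """Returns canned answers so the dev loop costs nothing."""
    def __init__(self, canned: dict[str, str], default: str = "SELECT 1"):
        self.canned, self.default = canned, default

    def complete(self, prompt: str) -> str:
        # Match on a keyword in the prompt; fall back to a harmless default.
        for key, answer in self.canned.items():
            if key in prompt:
                return answer
        return self.default

# In tests / local development:
llm: LLM = MockLLM({"revenue by region": "SELECT region, SUM(revenue) FROM sales GROUP BY region"})
print(llm.complete("Question: revenue by region"))
# In production, swap in a real client that implements the same interface.
```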
If you can live with these disadvantages, you get a great product that combines many tools into one cloud platform and can handle very large data.
I created a repo as part of a Kaggle hackathon, where I tested these native BigQuery capabilities further. For more information, visit the repo here:
The Missing Piece: Rigorous Testing
Now, back to evaluation. Platforms like BigQuery make the plumbing easy, but they do not automatically solve the accuracy problem. I see a lot of solutions out there, but most of them lack strong testing capabilities.
If we accept that Text-to-SQL must be binary (right or wrong), we need evaluation techniques that reflect the messy truth of business data, not the clean surface of academic or demo datasets.
Testing a Text-to-SQL application is very difficult because of the declarative nature of SQL and the complexity of real database schemas. Does yours have thousands of tables? Are those tables well documented? Probably not. Are naming conventions consistent across tables? And two queries can look completely different syntactically (e.g., different join orders, aliasing, or use of CTEs) yet produce exactly the same results.
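A self-contained toy example of that last point: two queries that share almost no text, yet are semantically identical, so any string-level comparison is hopeless:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER, customer TEXT, amount REAL);
INSERT INTO orders VALUES (1, 'acme', 10), (2, 'acme', 20), (3, 'beta', 5);
""")

# Two syntactically different queries with identical semantics.
q1 = "SELECT customer, SUM(amount) AS total FROM orders GROUP BY customer"
q2 = """
WITH t AS (SELECT customer, amount FROM orders)
SELECT customer, SUM(amount) AS total FROM t GROUP BY customer
"""

r1 = set(conn.execute(q1).fetchall())
r2 = set(conn.execute(q2).fetchall())
assert q1.strip() != q2.strip()  # exact string match: fails
assert r1 == r2                  # execution-based comparison: passes
```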
To truly measure your RAG application during development and production, you must use the right metrics.
Key Metrics
Going back to BI's promise of self-service analytics: the end user relies 100% on themselves; there is no human in the loop, no data expert to verify the results. Because of this, we need a well-defined evaluation framework with a set of metrics to measure the quality of the generated SQL.
- Execution Accuracy (EX): Earlier benchmarks relied on Exact Match (EM), which compared the predicted SQL string against the ground truth. This was deeply flawed, as it penalized valid syntactic variations. The modern standard is Execution Accuracy (EX): run both the predicted SQL and the "gold" (ground-truth) SQL against the actual database and compare the returned result sets. This validates queries correctly regardless of how they are written.
- Lenient Evaluation: In business settings, a query may return additional, non-essential columns (e.g., the ID column used for a join). Strict Execution Accuracy would mark this as a failure. A more lenient, subset-based comparison checks that the target columns and values are correct while tolerating extra columns and differences in row ordering.
- The "Soft-F1" Metric: To soften the all-or-nothing nature of Execution Accuracy (where one wrong cell fails the entire test), Soft-F1 is increasingly used. It gives partial credit by measuring the overlap between the predicted and gold result sets. If a query returns 99 of 100 correct rows, Soft-F1 reports high performance where EX returns 0, which is invaluable for debugging (see the sketch after this list).
- LLM-as-a-Judge: Sometimes execution is not possible (e.g., the database is private or the environment is unavailable). In these cases, a strong LLM can be prompted to compare the semantic logic of the predicted SQL against the gold SQL. Although less objective than execution, it correlates reasonably well with human judgment.
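Here is a minimal sketch of how Execution Accuracy and Soft-F1 can be computed once both queries have been executed. The helper names are mine, result sets are treated as multisets of rows (order ignored), and the exact Soft-F1 definition varies between benchmarks:

```python
from collections import Counter

def execution_accuracy(pred_rows, gold_rows) -> int:
    """EX: 1 only if the multisets of rows match exactly (row order ignored)."""
    return int(Counter(map(tuple, pred_rows)) == Counter(map(tuple, gold_rows)))

def soft_f1(pred_rows, gold_rows) -> float:
    """Partial credit: F1 over the row overlap between prediction and gold."""
    pred, gold = Counter(map(tuple, pred_rows)), Counter(map(tuple, gold_rows))
    overlap = sum((pred & gold).values())  # multiset intersection
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(gold.values())
    return 2 * precision * recall / (precision + recall)

gold = [(r,) for r in range(100)]
pred = gold[:99]                       # 99 of 100 rows correct
print(execution_accuracy(pred, gold))  # 0     -> binary EX fails the whole test
print(round(soft_f1(pred, gold), 3))   # 0.995 -> high partial credit
```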
Spider 2.0: An Enterprise Reality Check
Currently there are three notable evaluation benchmarks: Spider 2.0, BIRD (BIg Bench for LaRge-scale Database Grounded Text-to-SQL), and SynSQL (based on synthetic data). However, the industry has long been lulled into a false sense of security by outdated benchmarks. For many years, it relied on Spider 1.0, which focuses on small, clean SQLite databases (averaging fewer than 10 tables). Models were achieving 90%+ accuracy, leading many to believe the problem was "solved."
One framework that incorporates these modern metrics and truly assesses business readiness is Spider 2.0.
Spider 2.0 (released in conjunction with ICLR 2025) is a paradigm shift, designed to close this "reality gap" by introducing the kinds of complexity that break LLMs in production:
- Massive scale: Enterprise schemas are big. Spider 2.0 databases average 812 columns, with some exceeding 3,000. This scale often exceeds LLM context limits, forcing models to use "schema linking" (retrieval) techniques to identify the relevant tables before generating SQL (see the sketch after this list).
- Dialect diversity: Real companies use Snowflake, BigQuery, and T-SQL, not just SQLite. Spider 2.0 enforces dialect diversity, requiring models to master vendor-specific syntax (e.g., handling nested JSON data with UNNEST or FLATTEN).
- External knowledge: Business logic (like the definition of "churn rate") lives in documentation or in the project codebase (e.g., dbt), not in the schema. Spider 2.0 mimics this by providing external files (Markdown, YAML) that the model must read to ground its logic.
- Agentic workflows: Most importantly, Spider 2.0 models the workflow of a modern data engineer. It goes beyond one-shot generation, testing the model's ability to explore the file system, read documentation, interact with live database environments, and debug errors iteratively.
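As referenced in the first bullet, schema linking is essentially retrieval: embed each table's description, embed the question, and let only the top-k tables into the prompt. A minimal sketch with toy vectors (in practice these would come from a real embedding model):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / (norm or 1.0)

def link_schema(question_vec, table_vecs: dict, k: int = 5) -> list[str]:
    """Keep only the k tables most similar to the question; the rest never
    reach the prompt, which is how an 812-column schema fits in context."""
    ranked = sorted(table_vecs, key=lambda t: cosine(question_vec, table_vecs[t]), reverse=True)
    return ranked[:k]

# Toy usage: 2-d stand-ins for real embeddings.
tables = {"sales_fact": [0.9, 0.1], "customer_dim": [0.2, 0.8], "audit_log": [0.0, 0.1]}
print(link_schema([0.8, 0.2], tables, k=2))  # ['sales_fact', 'customer_dim']
```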
The difficulty gap is stark. Models that dominated Spider 1.0 saw success rates drop to as low as 10-20% on the full Spider 2.0 benchmark, highlighting the shortcomings of current LLMs when faced with real-world challenges.
Conclusion: Binary Bar for Business Data
The journey from Business Intelligence to AI-driven analytics has been marked by increased abstraction, but the fundamental requirement for data integrity has not changed. While the promise of Text-to-SQL is closer than ever, we must resist the lure of high scores on outdated benchmarks.
Achieving 90% accuracy may be academically interesting, but in business it is of no practical use. The bar is binary: either it works, or it breaks trust.
As platforms like BigQuery make it easier to integrate AI and data, it is important that we simultaneously embrace rigorous testing methods and robust benchmarks like Spider 2.0. Only by testing against the messy truth of business data can we build Text-to-SQL applications reliable enough to bet the business on.
Until next time, I hope you found this article as interesting as I did.
Continuous Learning
Spider 2.0: Evaluating Language Models on Real-World Enterprise Text-to-SQL Workflows. Fangyu Lei, Jixuan Chen, Yuxiao Ye, et al. arXiv, November 2024; accepted at ICLR 2025 (Oral). Link:
Of course, like anything in AI these days, this discussion shouldn't end here. I would love to hear your thoughts and opinions over at www.gyza.org



