Machine Learning

A Cost-Effective Building RAG for Long Text Documents in SQL Tables

a reliable, low-cost, cost-effective SQL table RAG system that stores large documents in long text fields – without changing the existing schema?

This is not a theoretical problem.

In many enterprises, critical business information already resides within traditional relational information. Proposals, reports, contracts, essays – all kept inside TEXT or LONGTEXT columns — designed for keyword matching and aggregation, not semantic retrieval.

With the advent of LLMs, business demands have evolved towards structured analytics, deep semantic understanding and contextual insights in a natural, conversational way.

For example:

  • How many projects over $1M are allowed from 2023 to 2025?
  • Summarize the major trends seen in technology over the past 6 months
  • What has been the difference in winning proposals in 2025?

They require a retrieval strategy that can decide when to count, when to search statistically, and when to combine the two. In this article, I will demonstrate an Agentic RAG architecture that runs directly on top of a standard SQL database – without schema changes – and discuss the design principles needed to make it reliable in production.

System setup

In this example, I used a subset of the Social Animal 10K Articles with NLP dataset, which has a large number of news articles and blog posts and metadata. The created SQL database has the following columns – url, title, authors, date_published, article_section, word_count and full_content.

The title can be considered a unique identifier (primary key) of the content. Article categories are technology, business, sports, travel, health, entertainment, politics and fashion. Articles are distributed almost evenly across categories. LLM used gemini-2.5-flash and FAISS indexing and storing vector embeddings. The design works in any choice of LLM or vector database.

Buildings

Without embedding the raw text, we matched the metadata of the vector store with the same fields that exist in SQL (except for the full content). This allows Filteringas we will see in the results. For long documents, the merge and embed sliding window can be used with metadata attached to each embed.

A snippet of the metadata code is attached
for idx, row in df_sql.iterrows():
    content = str(row['full_content']).strip()
    if not content:
        continue
        
    metadata = {
        "source": row.get('url', ''),
        "title": row.get('title', ''),
        "authors": str(row.get('authors', '')),
        "article_category": str(row.get('article_category', 'unknown')),
        "published_date": str(row.get('published_date', '')),
        "word_count": int(row.get('content_word_count', 0))
    }
    
    doc = Document(page_content=content, metadata=metadata)
    documents.append(doc)

We created two special, smart tools that React agent you can request using the following architecture. The ReAct agent (router) orchestrates the entire query pipeline by intelligently deciding which tool to use based on the query type. It uses metadata and query context to determine whether an SQL tool, a vector tool, or a hybrid method is appropriate. The following figure shows the query decision flow:

Query decision flow

The tools are as follows:

  1. search_database (SQL tool): Handles questions that require calculation, integration or complex reasoning. It uses SQL queries
  2. search_topics (Vector Tool): Handles questions about content, topic or specific organizations. It accepts a natural language query, and optionally, metadata filters to perform a global semantic search (eg: “articles about children“) or search for a subset of the data (eg: “filtha_authors='XYZ', “query”=”articles”).

As seen in the image above, the query can take the following forms:

  1. Because calculations (eg how many articles..), inequality/width (eg: Articles published between Jan and April 2023) or combining (eg What is the average word count..), use the SQL tool only.
  2. Semantic searchwith or without filters, use the Vector tool as described above.
  3. Hybrid question: Hybrid queries are useful when you need both structured data (eg, sorting by date) and unstructured content (eg, semantic search on articles). The query contains metadata filter criteria (usually a category or date range), from which the SQL tool is used to retrieve articles. Then the list of articles is passed to the Vector tool to perform a semantic search on only those articles. An example would be “between March and May 2023 there is an article about mother's day in fashion

Results

Here are the results of a few questions for each type. It would be interesting to note the instrumental approach followed in each case.

1. “what are the topics per year and categories?” — This is a complex computer query, so only an SQL tool will be used.

The result is here
QUERY: what are the articles by year and category ?
TOOL SELECTION:
  - search_database: {'query': "SELECT strftime('%Y', published_date) AS year, article_category, COUNT(*) FROM articles GROUP BY year, article_category"}
DOCUMENTS SELECTED:
  - Tool Output: [('1970', 'business', 1), ('2001', 'politics', 1), ('2013', 'politics', 1), ('2018', 'entertainment', 1), ('2020', 'entertainment', 1), ('2020', 'sports', 1), ('2022', 'fashion', 1), ('2023', 'business', 19), ('2023', 'entertainment', 16), ('2023', 'fashion', 16), ('2023', 'health', 18), ('2023', 'politics', 16), ('2023', 'sports', 16), ('2023', 'technology', 18), ('2023', 'travel', 18)]
RESULT:
Here are the number of articles by year and category:
* 1970: business (1)
* 2001: politics (1)
* 2013: politics (1)
* 2018: entertainment (1)
* 2020: entertainment (1), sports (1)
* 2022: fashion (1)
* 2023: business (19), entertainment (16), fashion (16), health (18), politics (16), sports (16), technology (18), travel (18)

2. “what articles do you have about children?” — This does not match any metadata categories contained in the schema, so the agent decides to use the Vector tool for global semantic search.

The result is here
QUERY: what articles about children do you have
TOOL SELECTION:
  - search_articles: {'query': 'children'}
DOCUMENTS SELECTED:
  - Tool Output: --- Document 1 ---
Source: 
Title: 10 Films That Children Should Watch
Category: entertainment
Date: 2020-11-14 02:30:36
Content: On this children's day, here's a list of stories designed to help young ones understand the world around them, gain life lessons and be dazzled by colourful imagination. It's a great time to be a...
  - 
  - 
  - 
RESULT:
Here are some articles about children:

1.  10 Films That Children Should Watch (entertainment)
2.  Save the Children and THINKMD Expand Partnership to Improve the Lives of Children Globally (health)
3.  St. Lucie County School Board decides to keep challenged books in school libraries (health)

3. “what are the fashion trends?” — The agent finds the category = fashion, and runs a semantic game using a Vector tool with this filtering feature.

The result is here
QUERY: what are the trends in fashion
TOOL SELECTION:
  - search_articles: {'query': 'trends', 'filter_category': 'fashion'}
DOCUMENTS SELECTED:
  - Tool Output: --- Document 1 ---
Source: 
Title: The Best Thing We Saw in Milan Today: India Mahdavi for Gebrüder Thonet Vienna - Sight Unseen
Category: fashion
Date: 2023-04-18 12:00:00
Content: How to Live With Objects

The Best Thing We Saw in Milan Today: India Mahdavi for Gebrüder Thonet Vienna

Sight Unseen is on the ground at the Milan Furniture Fair right now and we’ll be bringing you load...
  - 
  - 
  - 
RESULT:
The search results mention the following trends in fashion: India Mahdavi for Gebrüder Thonet Vienna, clean make up & skincare, and 80's-inspired jewelry.

4. “tell me technical articles about crypto in 2023” — This is a hybrid query where the SQL tool will be used to find the topics in 2023 for the category = technology, then the Vector tool will be used and the query = crypto and the topic list. The result will be found in that subset.

The result is here
QUERY: tell me technology articles about crypto in 2023
TOOL SELECTION:
  - search_database: {'query': "SELECT title FROM articles WHERE article_category = 'technology' AND published_date LIKE '2023%'"}
  - search_articles: {'filter_titles': ['NPR quits Twitter, says Musk-led platform is “undermining our credibility”', 'Crypto.com arena considers rebrand after FTX fallout reignites investor anger', "World's Biggest EV Battery Maker Unveils S", 'What Checks Should You Perform Before Buying a Used Car? - Free Car Mag', 'Preeclampsia Diagnostics Market 2023-2030 Overall Analysis Report | Fortune Business Insights', 'Spanish startup on ‘mission to save planet’s beer’ from climate crisis', 'Best third-party apps for Reddit (2023)', "How Ford's 'driverless' BlueCruise system works", "Save $100 on Apple's M1-Powered iPad Air", 'Media release: Music cassette sales highest in nearly two decades - allmediascotland…media jobs, media release service and media resources for all', 'Software Engineer, Service Platform at Stripe | echojobs.io', 'Driving data to deliver net zero with life cycle assessments', 'The Future of Work: Trends and Predictions for the Next Decade', 'Risks of Outsourcing Software Development to Uruguay', 'Kochie Arrest News Addressed, Affected By Scam', 'Paying attention to employee retention | IRIS', 'Could this foam-filled £7,250 track bike rival the Team GB Hope HB.T?', 'Aave, Chainlink, and Binance are concentrating on enhancing the Shapella, while Big Eyes Coin is aiming for a significant final push'], 'query': 'crypto'}
DOCUMENTS SELECTED:
  - Tool Output: [('NPR quits Twitter, says Musk-led platform is “undermining our credibility”',), ('Crypto.com arena considers rebrand after FTX fallout reignites investor anger',), ("World's Biggest EV Battery Maker Unveils S",), ('What Checks Should You Perform Before Buying a Used Car? - Free Car Mag',), ('Preeclampsia Diagnostics Market 2023-2030 Overall Analysis Report | Fortune Business Insights',), ('Spanish startup on ‘mission to save planet’s beer’ from climate crisis',), ('Best third-party apps for Re...
  - Tool Output: --- Document 1 ---
Source: 
Title: Crypto.com arena considers rebrand after FTX fallout reignites investor anger
Category: technology
Date: 2023-04-19 01:15:52
Content: The Crypto.com Arena, formerly known as the Staples Center, is reportedly considering a rebrand after the recent controversy surrounding FTX, a cryptocurrency exchange. The FTX fallout reignited anger among investors, wh...
  - 
  - 
  - 
RESULT:
Here are some technology articles from 2023 that mention crypto:

1.  **Crypto.com arena considers rebrand after FTX fallout reignites investor anger**: This article discusses the possibility of the Crypto.com Arena rebranding due to the controversy surrounding the FTX cryptocurrency exchange.
2.  **Aave, Chainlink, and Binance are concentrating on enhancing the Shapella, while Big Eyes Coin is aiming for a significant final push**: This article mentions the Shapella upgrade and its implications for cryptocurrencies on the Ethereum network, including Binance, Aave, and Chainlink.
3. **Software Engineer, Service Platform at Stripe | echojobs.io**: This article mentions Stripe, a financial infrastructure platform that allows businesses to accept payments.

Important Considerations

As with any architecture, there are design principles that must be considered to implement a robust application. Here are a few of them:

  • Tool docstrings vs command system: These are two types of instructions that direct the agent's behavior in different ways. It is important to use them for their intended purposes without any overlap or conflict with the reliable operation of the agent. The docstring tool, found in @tool the decorator, he explains what the tool does again how to use. Apart from the name of the tool, it describes the parameters, types and definitions.
Here is an example of the search_articles tool docstring.
@tool
def search_articles(query: str, filter_category: Optional[str] = None, ...):
    """Useful for finding information about specific topics, summaries, or details within articles.
    
    You can filter by metadata for precision:
    - `filter_category`: 'health', 'tech', etc.
    - `filter_titles`: List of exact titles to retrieve (BATCH MODE).
    - `filter_date`: Published date (YYYY-MM-DD) for EXACT or PARTIAL match only.
    ...
    """
  • On the other hand, the system quickly intelligently guides the agent's routing strategy, enabling it to decide when to use an SQL tool, a Vector tool or a combination. It is also the most complex and weak part of the application. It explains how the tools are integrated into a hybrid workflow, provides examples of correct use of the tools, and specifies mandatory rules and constraints. To adequately design the system information, It is important to start with a test case database of expected user questions, provide examples in the system information, and continue to enrich it with deviations from critical situations during operation.
Here is a sample system prompt
system_prompt = (
    "1. **LISTING/BROWSING QUERIES** (e.g., 'what articles are in politics'):n"
    "   - **ALWAYS use [search_database] to list titlesn"
    "   - DO NOT use [search_articles] without a semantic queryn"
    ...
    "### MANDATORY RULESn"
    "1. **DATE RANGES & INEQUALITIES**: Use SQL first, then pass titles to vector tooln"
    ...
)
  • Pre and post filter vector databases: This is a subtle point that may have unintended and difficult-to-explain results for some questions. Consider the following two questions where the only difference is a misspelled word: “summarize articles about Doo ley in politics for April 17, 2023” and “Summary articles about Dooley in politics for 17 apr 2023“. Both of these queries follow the same path, where the SQL tool successfully selects the topics for this section and date (there is only 1 topic that mentions Judge Dooley), and then the Vector tool is called on this list of topics and the query. Surprisingly, in the first query, the Vector tool returns. "Tool Output: No documents found matching the criteria." in this small spelling error even if the list has only 1 article to choose from, and in the second question it returns the correct article.
Here is the result of the first question
QUERY: QUERY: summarize articles about Doo ley in politics on 17th apr 2023
TOOL SELECTION:
  - search_database: {'query': "SELECT title FROM articles WHERE published_date LIKE '2023-04-17%' AND article_category = 'politics'"}
  - search_articles: {'query': 'Doo ley', 'filter_category': 'politics', 'filter_titles': ['Judge Dooley Ends Hartford Police Consent Decree Despite Concerns']}
DOCUMENTS SELECTED:
  - Tool Output: [('Judge Dooley Ends Hartford Police Consent Decree Despite Concerns',)]
  - Tool Output: No documents found matching the criteria.
And the second question
QUERY: summarize articles about Dooley in politics on 17th apr 2023
TOOL SELECTION:
  - search_database: {'query': "SELECT title FROM articles WHERE published_date LIKE '2023-04-17%' AND article_category = 'politics'"}
  - search_articles: {'query': 'Dooley', 'filter_category': 'politics', 'filter_titles': ['Judge Dooley Ends Hartford Police Consent Decree Despite Concerns']}
DOCUMENTS SELECTED:
  - Tool Output: [('Judge Dooley Ends Hartford Police Consent Decree Despite Concerns',)]
  - Tool Output: --- Document 1 ---
Source: 
Title: Judge Dooley Ends Hartford Police Consent Decree Despite Concerns
Category: politics
Date: 2023-04-17 05:36:24
Content: Judge Dooley has ended the nearly 50 years of federal oversight of police in Hartford, despite continued concerns the department still has not hired enough minority officers to reflect the city's large Black and Hispanic populations.

And the reason is not just weak embedding due to incorrect spelling. It's because FAISS (and Chroma etc) do after filtering — first search the world by query, then filter the metadata results (= list of topics). In this case, the correct article does not include top_k = 3 articles after semantic search. A pre-filtering The database, on the other hand, could only perform a semantic search on the articles listed in the article and find the correct article even with the incorrect spelling.

  • Can all metadata filters be removed from Vector Tool?: Yes, it is possible, but a high cost optionlike a simple semantic query with a metadata filter (such as category or author), it will be a hybrid query, requiring two tool calls, adding to token usage and latency. A practical middle ground would be to store dates (and maybe some numeric metadata like word count in this case) in SQL only, and display all text metadata and categories in a vector database.

The conclusion

Building RAG on top of SQL is not compatible with embedded add-ons. It's about designing the right recovery strategy.

When structured metadata and long-form content live in the same table, the real challenge is planning – deciding when to combine with SQL, semantic search, and when to combine the two. Subtle details like filtering metadata and tool routing can make the difference between a reliable system and one that silently fails.

With a well-designed Agentic RAG layer, legacy SQL databases can enable semantic applications without schema changes, costly migrations, or performance trade-offs.

Connect with me and share your comments at www.linkedin.com/in/partha-sarkar-lets-talk-AI

Reference

Social Animal 10K Articles with NLP — Dataset by Alex P (Owner) (CC BY-SA 4.0)

The images used in this article were created using Google Gemini. The data set used below CC-BY-SA 4.0 license. The values ​​and code below are created by me.

Source link

Related Articles

Leave a Reply

Your email address will not be published. Required fields are marked *

Back to top button