How to Perform Agentic Information Retrieval

Information retrieval is an important task to get right, given the vast amount of content available today. You perform information retrieval, for example, every time you Google something or ask ChatGPT a question, whether the information you are looking for lives in a closed set of documents or on the entire internet.
In this article, I will discuss agentic information retrieval. I will cover how information retrieval has changed with the release of LLMs, and especially with the rise of AI agents that can retrieve information on their own. I will first discuss RAG, since it is the building block for agentic retrieval, and then discuss at a higher level how AI agents can be used to retrieve information.
Why do we need agentic information retrieval
Information retrieval is an old task. TF-IDF was one of the first algorithms used to find information in a large corpus of documents. It works by indexing your documents based on how often a word occurs within a given document and how common that word is across all documents.
If a user searches for a word, and that word appears frequently in a few documents but rarely across the corpus as a whole, that signals a strong relevance of those few documents.
Information retrieval is such a critical task because, as humans, we rely on getting information quickly to solve all kinds of problems. These problems can be:
- How to cook a certain dish
- How to use a specific algorithm
- How to get from point A to point B
TF-IDF still works surprisingly well, although we have since found more powerful ways to retrieve information. Retrieval-augmented generation (RAG) is one robust method, which relies on embedding similarity to find useful documents.
Agentic information retrieval mixes different techniques, such as keyword search (TF-IDF, or modern versions of the algorithm such as BM25) and RAG, to find relevant results.
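To make the keyword-search idea concrete, here is a toy example of TF-IDF retrieval using scikit-learn (my own minimal sketch, not a production setup):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# A tiny document corpus
documents = [
    "How to cook pasta carbonara at home",
    "Implementing binary search in Python",
    "Directions from the airport to downtown",
]

# Index the corpus with TF-IDF weights
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)

# Score every document against the user's query
query_vector = vectorizer.transform(["how do I cook pasta"])
scores = cosine_similarity(query_vector, doc_vectors)[0]

# The cooking document gets the highest score
print(documents[scores.argmax()])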
Build your own RAG

Building your own RAG pipeline is surprisingly easy with all the technology and tools available today. There are many packages that help you set up RAG. All of them, however, rely on the same basic steps:
- Embed your document corpus (often after chunking the documents)
- Store the embeddings in a vector database
- A user enters a search query
- Embed the search query
- Find the closest matches between the document corpus and the user's query, and return the most similar documents
This can be set up in just a few hours if you know what you are doing. To embed your documents and user queries, you can use, for example:
- Managed services, such as:
  - OpenAI's text-embedding-3-large
  - Google's gemini-embedding-001
- Open-source options, such as:
  - Alibaba's Qwen3-Embedding-8B
  - Linq-Embed-Mistral
After you embed your documents, you can store them in a vector database (for example, Pinecone, Qdrant, Weaviate, or Chroma).
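Putting the pieces together, a minimal sketch of the whole pipeline could look like this (assuming the OpenAI embeddings API, with a simple in-memory numpy search standing in for a real vector database):
import numpy as np
from openai import OpenAI

client = OpenAI()
documents = ["first document ...", "second document ...", "third document ..."]

# Embed the document corpus
doc_response = client.embeddings.create(
    model="text-embedding-3-large",
    input=documents,
)
doc_matrix = np.array([item.embedding for item in doc_response.data])

# Embed the user's search query with the same model
user_query = "What is the meaning of life?"
query_vec = np.array(
    client.embeddings.create(
        model="text-embedding-3-large",
        input=[user_query],
    ).data[0].embedding
)

# Cosine similarity between the query and every document
scores = doc_matrix @ query_vec / (
    np.linalg.norm(doc_matrix, axis=1) * np.linalg.norm(query_vec)
)

# Return the most similar documents
for i in np.argsort(scores)[::-1][:2]:
    print(round(float(scores[i]), 3), documents[i])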
After that, you are basically ready to run RAG. In the next section, I'll also cover fully managed RAG solutions, where you simply upload a document, and all the chunking, embedding, and searching is handled for you.
Managed RAG Services
If you want an easier route, you can use fully managed RAG solutions. Here are a few options:
- Ragie.ai
- Gemini File Search tool
- OpenAI File Search tool
These services make the RAG process much easier. You can upload documents to any of them, and they automatically handle the chunking, embedding, and retrieval for you. All you have to do is upload your raw documents and provide the search query you want to run. The service then returns documents relevant to the query, which you can feed to an LLM to answer the user's question.
Even though managed RAG simplifies the process a lot, I would like to highlight some downsides below.
If you only have PDFs, you can upload them directly. However, some file types are currently not supported by managed RAG services. Some of them do not support PNG/JPG files, for example, which complicates the process. One workaround is to run OCR on the image and upload the resulting TXT file (which is supported), but this, of course, adds complexity to your application, which is exactly what you want to avoid when using a managed RAG service.
Another downside is that you have to upload your raw documents to these services. When doing this, you need to make sure you stay compliant, for example with GDPR requirements in the EU. This can be a challenge with some managed RAG services, although I know OpenAI at least supports EU data residency.
Below, I will give an example of using the OpenAI File Search tool, which is very easy to use.
First, you create a vector store and upload your files:
from openai import OpenAI

client = OpenAI()

# Create vector store
vector_store = client.vector_stores.create(
    name="",
)

# Upload file and add it to the vector store
client.vector_stores.files.upload_and_poll(
    vector_store_id=vector_store.id,
    file=open("filename.txt", "rb"),
)
After the documents are uploaded and processed, you can query them like this:
user_query = "What is the meaning of life?"

results = client.vector_stores.search(
    vector_store_id=vector_store.id,
    query=user_query,
)
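The returned results contain the matched chunks, which you can pass to an LLM as context. Roughly, reading them might look like this (the exact attribute names are something to double-check against the current SDK documentation):
# Iterate over the matched chunks (attribute names may differ slightly
# depending on the SDK version)
for result in results.data:
    print(result.filename, result.score)
    for chunk in result.content:
        print(chunk.text)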
As you may have noticed, this code is much simpler than setting up embedding models and vector databases yourself.
Information retrieval tools
Now that we have information retrieval tools readily available, we can move on to agentic information retrieval. I'll start with the first way LLMs were used to retrieve information, before moving on to a better, more modern approach.
Retrieve first, then answer
The first method is to start by retrieving the relevant documents and feeding that information to the LLM before it responds to the user's query. This can be done by running keyword search and RAG search, taking the top X documents, and feeding those documents into the LLM.
First, find some documents using RAG:
user_query = "What is the meaning of life?"

results_rag = client.vector_stores.search(
    vector_store_id=vector_store.id,
    query=user_query,
)
After that, find additional documents with keyword search:
def keyword_search(query):
    # keyword search logic ...
    return results

results_keyword_search = keyword_search(user_query)
Then combine these results, remove duplicate documents, and feed the contents of the remaining documents to the LLM so it can answer:
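A sketch of that combination step might look like the following (the id and text fields here are assumptions; adapt them to the actual shape of your RAG and keyword search results):
# Merge RAG and keyword search results, de-duplicating by document id
seen_ids = set()
context_chunks = []
for result in list(results_rag.data) + list(results_keyword_search):
    if result.id in seen_ids:
        continue  # skip documents we have already included
    seen_ids.add(result.id)
    context_chunks.append(result.text)

document_context = "\n\n".join(context_chunks)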
def llm_completion(prompt):
    # llm completion logic
    return response

prompt = f"""
Given the following context: {document_context}

Answer the user query: {user_query}
"""

response = llm_completion(prompt)
In most cases, this works very well and provides high-quality answers. However, there is a better way to do agentic information retrieval.
Information retrieval as a tool
The newest frontier LLMs are all trained with agentic behavior in mind. This means that LLMs are very good at using tools to answer questions. You can provide the LLM with a list of tools, and it decides when to use them and how to apply them to answer the user's question.
A better approach is therefore to provide RAG and keyword search as tools to your LLM. With GPT-5, for example, you can do the following:
# define a custom keyword search function, and provide GPT-5 with both
# keyword search and RAG (file search tool)
def keyword_search(keywords):
    # perform keyword search
    return results
user_input = "What is the meaning of life?"

tools = [
    {
        "type": "function",
        "name": "keyword_search",
        "description": "Search for keywords and return relevant results",
        "parameters": {
            "type": "object",
            "properties": {
                "keywords": {
                    "type": "array",
                    "items": {"type": "string"},
                    "description": "Keywords to search for"
                }
            },
            "required": ["keywords"]
        }
    },
    {
        "type": "file_search",
        "vector_store_ids": [""],
    }
]

response = client.responses.create(
    model="gpt-5",
    input=user_input,
    tools=tools,
)
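GPT-5 may now respond with calls to the keyword_search tool (the file_search tool is executed by OpenAI on the server side). A sketch of how those calls could be handled, following the Responses API function-calling pattern, might look like this:
import json

while True:
    # Collect any keyword_search calls the model wants to make
    calls = [item for item in response.output if item.type == "function_call"]
    if not calls:
        break  # no more tool calls, the model has produced its answer
    tool_outputs = []
    for call in calls:
        if call.name == "keyword_search":
            args = json.loads(call.arguments)
            results = keyword_search(args["keywords"])
            tool_outputs.append({
                "type": "function_call_output",
                "call_id": call.call_id,
                "output": json.dumps(results, default=str),
            })
    # Feed the tool results back so the agent can keep going
    response = client.responses.create(
        model="gpt-5",
        previous_response_id=response.id,
        input=tool_outputs,
        tools=tools,
    )

print(response.output_text)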
This works much better because you are no longer doing a single, one-shot retrieval with RAG / keyword search before answering the user's question. It works well because:
- The agent can decide when to use the tools. Some queries, for example, do not require vector search
- OpenAI's file search automatically rewrites queries, meaning it runs parallel searches with different versions of the user's query (variants it generates itself, based on the original query)
- The agent may decide to issue more RAG queries / keyword searches if it believes it does not yet have enough information
The last point in the list above is the most important one for agentic information retrieval. Sometimes, you don't get the information you want with the first query. The agent (GPT-5) can recognize that this is the case and choose to fire off more RAG / keyword search queries if it deems it necessary. This often leads to better results and makes the agent far more likely to find the information you are looking for.
Conclusion
In this article, I covered the basics of agentic information retrieval. I began by discussing why agentic information retrieval is so important, highlighting how dependent we are on quick access to information. I then covered the tools you can use to retrieve information with keyword search and RAG. I showed that you can run these tools in a fixed pipeline before feeding the results to the LLM, but that the better approach is to expose them to the LLM as tools, making it an agent capable of finding information on its own. I believe agentic information retrieval is, and will remain, highly important, and that using AI agents well will be an essential skill for building powerful AI applications in the coming years.
👉 Find me in the community:
💻 My webinar on visual language models
📩 Subscribe to my newsletter
🧑💻 Get in touch
🔗 LinkedIn
🐦 X / Twitter
✍️ Medium