
A Coding Implementation of a Document Search Agent Using Hugging Face, ChromaDB, and LangChain

In today's information-rich world, finding the right documents quickly is more important than ever. Traditional keyword-based search systems often fall short when it comes to semantic meaning. This tutorial shows how to build a powerful document search engine using:

  1. Hugging Face embedding models to convert text into rich vector representations
  2. Chroma DB as our local vector database
  3. LangChain for intelligent text chunking

This implementation enables semantic search capabilities: finding documents based on meaning rather than simple keyword matching. By the end of this tutorial, you will have a working document search engine that can:

  • Process and embed text documents
  • Store these embeddings efficiently
  • Retrieve the most semantically similar documents for any query
  • Handle different types of texts and search requirements

Please follow the detailed steps below to build and use the document search agent.

First, we need to install the required libraries.

!pip install chromadb sentence-transformers langchain datasets

Let's start by importing the libraries we will use:

import os
import numpy as np
import pandas as pd
from datasets import load_dataset
import chromadb
from chromadb.utils import embedding_functions
from sentence_transformers import SentenceTransformer
from langchain.text_splitter import RecursiveCharacterTextSplitter
import time

In this tutorial, we will use a subset of Wikipedia articles from the Hugging Face datasets library. This gives us a diverse set of documents to work with.

dataset = load_dataset("wikipedia", "20220301.en", split="train[:1000]")
print(f"Loaded {len(dataset)} Wikipedia articles")


documents = []
for i, article in enumerate(dataset):
   doc = {
       "id": f"doc_{i}",
       "title": article["title"],
       "text": article["text"],
       "url": article["url"]
   }
   documents.append(doc)


df = pd.DataFrame(documents)
df.head(3)

Now, let's split our documents into smaller chunks for more effective search:

text_splitter = RecursiveCharacterTextSplitter(
   chunk_size=1000,
   chunk_overlap=200,
   length_function=len,
)


chunks = []
chunk_ids = []
chunk_sources = []


for i, doc in enumerate(documents):
   doc_chunks = text_splitter.split_text(doc["text"])
   chunks.extend(doc_chunks)
   chunk_ids.extend([f"chunk_{i}_{j}" for j in range(len(doc_chunks))])
   chunk_sources.extend([doc["title"]] * len(doc_chunks))


print(f"Created {len(chunks)} chunks from {len(documents)} documents")

We will use a pre-trained sentence transformer model to create our embeddings:

model_name = "sentence-transformers/all-MiniLM-L6-v2"
embedding_model = SentenceTransformer(model_name)


sample_text = "This is a sample text to test our embedding model."
sample_embedding = embedding_model.encode(sample_text)
print(f"Embedding dimension: {len(sample_embedding)}")

Now, let's set up Chroma DB, a lightweight vector database, for our search engine:

chroma_client = chromadb.Client()


embedding_function = embedding_functions.SentenceTransformerEmbeddingFunction(model_name=model_name)


collection = chroma_client.create_collection(
   name="document_search",
   embedding_function=embedding_function
)


batch_size = 100
for i in range(0, len(chunks), batch_size):
   end_idx = min(i + batch_size, len(chunks))
  
   batch_ids = chunk_ids[i:end_idx]
   batch_chunks = chunks[i:end_idx]
   batch_sources = chunk_sources[i:end_idx]
  
   collection.add(
       ids=batch_ids,
       documents=batch_chunks,
       metadatas=[{"source": source} for source in batch_sources]
   )
  
   print(f"Added batch {i//batch_size + 1}/{(len(chunks)-1)//batch_size + 1} to the collection")


print(f"Total documents in collection: {collection.count()}")

Now comes the exciting part. Let's search our documents:

def search_documents(query, n_results=5):
   """
   Search for documents similar to the query.
  
   Args:
       query (str): The search query
       n_results (int): Number of results to return
  
   Returns:
       dict: Search results
   """
   start_time = time.time()
  
   results = collection.query(
       query_texts=[query],
       n_results=n_results
   )
  
   end_time = time.time()
   search_time = end_time - start_time
  
   print(f"Search completed in {search_time:.4f} seconds")
   return results


queries = [
   "What are the effects of climate change?",
   "History of artificial intelligence",
   "Space exploration missions"
]


for query in queries:
   print(f"nQuery: {query}")
   results = search_documents(query)
  
   for i, (doc, metadata) in enumerate(zip(results['documents'][0], results['metadatas'][0])):
       print(f"nResult {i+1} from {metadata['source']}:")
       print(f"{doc[:200]}...") 

Let's build a simple interactive search interface to provide a better user experience:

def interactive_search():
   """
   Interactive search interface for the document search engine.
   """
   while True:
       query = input("nEnter your search query (or 'quit' to exit): ")
      
       if query.lower() == 'quit':
           print("Exiting search interface...")
           break
          
       n_results = int(input("How many results would you like? "))
      
       results = search_documents(query, n_results)
      
       print(f"nFound {len(results['documents'][0])} results for '{query}':")
      
       for i, (doc, metadata, distance) in enumerate(zip(
           results['documents'][0],
           results['metadatas'][0],
           results['distances'][0]
       )):
           relevance = 1 - distance  
           print(f"n--- Result {i+1} ---")
           print(f"Source: {metadata['source']}")
           print(f"Relevance: {relevance:.2f}")
           print(f"Excerpt: {doc[:300]}...")  
           print("-" * 50)


interactive_search()
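
A note on the relevance score above: it is computed as 1 - distance, which is only a rough heuristic because Chroma collections use L2 distance by default. If you want distances that translate directly into cosine similarity, you can request the cosine space when creating the collection; the sketch below shows the idea (you would then add the chunks to this new collection before querying it):

# Optional: a collection using cosine distance, so 1 - distance behaves like cosine similarity
cosine_collection = chroma_client.create_collection(
    name="document_search_cosine",
    embedding_function=embedding_function,
    metadata={"hnsw:space": "cosine"}
)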

Let's add the ability to filter our search results:

def filtered_search(query, filter_source=None, n_results=5):
   """
   Search with optional filtering by source.
  
   Args:
       query (str): The search query
       filter_source (str): Optional source to filter by
       n_results (int): Number of results to return
  
   Returns:
       dict: Search results
   """
   where_clause = {"source": filter_source} if filter_source else None
  
   results = collection.query(
       query_texts=[query],
       n_results=n_results,
       where=where_clause
   )
  
   return results


unique_sources = list(set(chunk_sources))
print(f"Available sources for filtering: {len(unique_sources)}")
print(unique_sources[:5])  


if len(unique_sources) > 0:
   filter_source = unique_sources[0]
   query = "main concepts and principles"
  
   print(f"nFiltered search for '{query}' in source '{filter_source}':")
   results = filtered_search(query, filter_source=filter_source)
  
   for i, doc in enumerate(results['documents'][0]):
       print(f"nResult {i+1}:")
       print(f"{doc[:200]}...") 

In conclusion, we showed how to build a semantic document search engine using Hugging Face sentence embedding models and ChromaDB. The system retrieves documents based on meaning rather than keywords alone by converting text into vector representations. The pipeline loads a subset of Wikipedia articles, splits them into overlapping chunks, embeds the chunks with a sentence transformer model, and stores them in a vector database for retrieval. The final product includes interactive search, metadata filtering, and relevance ranking.




