A Step-by-Step Guide to Building a Semantic Search Engine with Sentence Transformers, FAISS, and all-MiniLM-L6-v2

Semantic search goes beyond traditional keyword matching by understanding the meaning behind search queries. Instead of simply matching exact words, semantic search systems capture the intent and contextual meaning of the query and return relevant results even when they don't contain the same keywords.
In this tutorial, we will implement a semantic search system using Sentence Transformers, a powerful library built on top of Hugging Face Transformers that provides pre-trained models optimized for generating sentence embeddings. These embeddings are numerical representations of text that capture semantic meaning, allowing us to find similar content through vector similarity. We will create a practical application: a semantic search engine for scientific abstracts that can answer research queries with relevant papers, even when the terminology differs between the query and the documents.
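To make the idea concrete, here is a minimal standalone sketch (using the same all-MiniLM-L6-v2 model we install and load later in this tutorial; the two example sentences are just for illustration) showing that two sentences with almost no keyword overlap still receive a high cosine similarity because they mean the same thing:
from sentence_transformers import SentenceTransformer, util

# Load a small pre-trained embedding model (the same one used in the rest of this tutorial)
sketch_model = SentenceTransformer("all-MiniLM-L6-v2")

# Two sentences with different wording but similar meaning
emb = sketch_model.encode([
    "How do I fix a flat bicycle tire?",
    "Steps for repairing a punctured bike wheel",
])

# Cosine similarity close to 1.0 indicates similar meaning
print(util.cos_sim(emb[0], emb[1]))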
First, let's install the required libraries in our Colab notebook:
!pip install sentence-transformers faiss-cpu numpy pandas matplotlib datasets
Now, let's import the libraries we will need:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sentence_transformers import SentenceTransformer
import faiss
from typing import List, Dict, Tuple
import time
import re
import torch
For our demonstration, we will use a collection of scientific paper abstracts. Let's create a small dataset of abstracts from various fields:
abstracts = [
{
"id": 1,
"title": "Deep Learning for Natural Language Processing",
"abstract": "This paper explores recent advances in deep learning models for natural language processing tasks. We review transformer architectures including BERT, GPT, and T5, and analyze their performance on various benchmarks including question answering, sentiment analysis, and text classification."
},
{
"id": 2,
"title": "Climate Change Impact on Marine Ecosystems",
"abstract": "Rising ocean temperatures and acidification are severely impacting coral reefs and marine biodiversity. This study presents data collected over a 10-year period, demonstrating accelerated decline in reef ecosystems and proposing conservation strategies to mitigate further damage."
},
{
"id": 3,
"title": "Advancements in mRNA Vaccine Technology",
"abstract": "The development of mRNA vaccines represents a breakthrough in immunization technology. This review discusses the mechanism of action, stability improvements, and clinical efficacy of mRNA platforms, with special attention to their rapid deployment during the COVID-19 pandemic."
},
{
"id": 4,
"title": "Quantum Computing Algorithms for Optimization Problems",
"abstract": "Quantum computing offers potential speedups for solving complex optimization problems. This paper presents quantum algorithms for combinatorial optimization and compares their theoretical performance with classical methods on problems including traveling salesman and maximum cut."
},
{
"id": 5,
"title": "Sustainable Urban Planning Frameworks",
"abstract": "This research proposes frameworks for sustainable urban development that integrate renewable energy systems, efficient public transportation networks, and green infrastructure. Case studies from five cities demonstrate reductions in carbon emissions and improvements in quality of life metrics."
},
{
"id": 6,
"title": "Neural Networks for Computer Vision",
"abstract": "Convolutional neural networks have revolutionized computer vision tasks. This paper examines recent architectural innovations including residual connections, attention mechanisms, and vision transformers, evaluating their performance on image classification, object detection, and segmentation benchmarks."
},
{
"id": 7,
"title": "Blockchain Applications in Supply Chain Management",
"abstract": "Blockchain technology enables transparent and secure tracking of goods throughout supply chains. This study analyzes implementations across food, pharmaceutical, and retail industries, quantifying improvements in traceability, reduction in counterfeit products, and enhanced consumer trust."
},
{
"id": 8,
"title": "Genetic Factors in Autoimmune Disorders",
"abstract": "This research identifies key genetic markers associated with increased susceptibility to autoimmune conditions. Through genome-wide association studies of 15,000 patients, we identified novel variants that influence immune system regulation and may serve as targets for personalized therapeutic approaches."
},
{
"id": 9,
"title": "Reinforcement Learning for Robotic Control Systems",
"abstract": "Deep reinforcement learning enables robots to learn complex manipulation tasks through trial and error. This paper presents a framework that combines model-based planning with policy gradient methods to achieve sample-efficient learning of dexterous manipulation skills."
},
{
"id": 10,
"title": "Microplastic Pollution in Freshwater Systems",
"abstract": "This study quantifies microplastic contamination across 30 freshwater lakes and rivers, identifying primary sources and transport mechanisms. Results indicate correlation between population density and contamination levels, with implications for water treatment policies and plastic waste management."
}
]
papers_df = pd.DataFrame(abstracts)
print(f"Dataset loaded with {len(papers_df)} scientific papers")
papers_df[["id", "title"]]
Next, we'll load a pre-trained Sentence Transformer model from Hugging Face. We will use the all-MiniLM-L6-v2 model, which provides a good balance between quality and speed:
model_name = "all-MiniLM-L6-v2"
model = SentenceTransformer(model_name)
print(f"Loaded model: {model_name}")
Next, we will convert our abstract texts into dense embedding vectors:
documents = papers_df['abstract'].tolist()
# Normalize the embeddings so that L2 distance corresponds directly to cosine similarity
document_embeddings = model.encode(documents, show_progress_bar=True, normalize_embeddings=True)
print(f"Generated {len(document_embeddings)} embeddings with dimension {document_embeddings.shape[1]}")
FAISS (Facebook AI Similarity Search) is a library for efficient similarity search over dense vectors. We will use it to index our document embeddings:
dimension = document_embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)
index.add(np.array(document_embeddings).astype('float32'))
print(f"Created FAISS index with {index.ntotal} vectors")
Let's now write a function that takes a query, converts it into an embedding, and retrieves the most similar documents:
def semantic_search(query: str, top_k: int = 3) -> List[Dict]:
    """
    Search for documents similar to the query.

    Args:
        query: Text to search for
        top_k: Number of results to return

    Returns:
        List of dictionaries containing document info and similarity score
    """
    # Encode the query with the same normalization applied to the documents
    query_embedding = model.encode([query], normalize_embeddings=True)
    distances, indices = index.search(np.array(query_embedding).astype('float32'), top_k)
    results = []
    for i, idx in enumerate(indices[0]):
        results.append({
            'id': papers_df.iloc[idx]['id'],
            'title': papers_df.iloc[idx]['title'],
            'abstract': papers_df.iloc[idx]['abstract'],
            # For unit-normalized vectors, squared L2 distance = 2 - 2*cosine,
            # so 1 - distance / 2 recovers the cosine similarity
            'similarity_score': 1 - distances[0][i] / 2
        })
    return results
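The similarity_score conversion relies on a simple identity: for unit-length vectors a and b, the squared Euclidean distance that IndexFlatL2 returns equals 2 - 2·cos(a, b), so 1 - distance / 2 recovers the cosine similarity. Here is a quick standalone numeric check of that identity:
a = np.array([0.6, 0.8])        # a unit-length vector
b = np.array([1.0, 0.0])        # another unit-length vector
sq_l2 = np.sum((a - b) ** 2)    # squared L2 distance, as returned by IndexFlatL2
cosine = np.dot(a, b)           # cosine similarity equals the dot product for unit vectors
print(sq_l2, 2 - 2 * cosine)    # both print 0.8, confirming the identity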
Let's test our semantic search with a variety of queries that demonstrate its ability to understand meaning rather than just keywords:
test_queries = [
"How do transformers work in natural language processing?",
"What are the effects of global warming on ocean life?",
"Tell me about COVID vaccine development",
"Latest algorithms in quantum computing",
"How can cities reduce their carbon footprint?"
]
for query in test_queries:
    print("\n" + "="*80)
    print(f"Query: {query}")
    print("="*80)
    results = semantic_search(query, top_k=3)
    for i, result in enumerate(results):
        print(f"\nResult #{i+1} (Score: {result['similarity_score']:.4f}):")
        print(f"Title: {result['title']}")
        print(f"Abstract snippet: {result['abstract'][:150]}...")
Let's visualize the document embeddings to see how they cluster by topic:
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
reduced_embeddings = pca.fit_transform(document_embeddings)
plt.figure(figsize=(12, 8))
plt.scatter(reduced_embeddings[:, 0], reduced_embeddings[:, 1], s=100, alpha=0.7)
for i, (x, y) in enumerate(reduced_embeddings):
    plt.annotate(papers_df.iloc[i]['title'][:20] + "...",
                 (x, y),
                 fontsize=9,
                 alpha=0.8)
plt.title('Document Embeddings Visualization (PCA)')
plt.xlabel('Component 1')
plt.ylabel('Component 2')
plt.grid(True, linestyle="--", alpha=0.7)
plt.tight_layout()
plt.show()
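As a complementary view, you can plot the pairwise cosine similarities between all abstracts as a heatmap to see which papers the model considers related (a minimal sketch; the colormap and tick labels are arbitrary choices):
# Pairwise cosine similarities between all document embeddings
# (the embeddings are unit-normalized, so a dot product gives cosine similarity)
emb_matrix = np.array(document_embeddings)
similarity_matrix = emb_matrix @ emb_matrix.T
plt.figure(figsize=(8, 6))
plt.imshow(similarity_matrix, cmap='viridis')
plt.colorbar(label='Cosine similarity')
plt.xticks(range(len(papers_df)), papers_df['id'])
plt.yticks(range(len(papers_df)), papers_df['id'])
plt.title('Pairwise Document Similarity')
plt.tight_layout()
plt.show()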
Finally, let's build an interactive search interface with ipywidgets:
from IPython.display import display, HTML, clear_output
import ipywidgets as widgets
def run_search(query_text):
    clear_output(wait=True)
    display(HTML(f"<h2>Query: {query_text}</h2>"))
    start_time = time.time()
    results = semantic_search(query_text, top_k=5)
    search_time = time.time() - start_time
    display(HTML(f"<p>Found {len(results)} results in {search_time:.4f} seconds</p>"))
    for i, result in enumerate(results):
        html = f"""
        <h3>{i+1}. {result['title']} (Score: {result['similarity_score']:.4f})</h3>
        <p>{result['abstract']}</p>
        """
        display(HTML(html))
search_box = widgets.Text(
    value="",
    placeholder="Type your search query here...",
    description='Search:',
    layout=widgets.Layout(width="70%")
)
search_button = widgets.Button(
    description='Search',
    button_style="primary",
    tooltip='Click to search'
)
def on_button_clicked(b):
    run_search(search_box.value)
search_button.on_click(on_button_clicked)
display(widgets.HBox([search_box, search_button]))
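Run the cell, type a query into the box (for example, one of the test queries above), and click Search to run a live semantic search over the abstracts.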
In this tutorial, we built a complete semantic search system using Sentence Transformers and FAISS. The system understands the meaning of user queries and returns relevant documents even when they don't share keywords with the query, showing how embedding-based search provides more intelligent results than traditional keyword approaches.



