Meet the PyVersity Library: How to Improve Retrieval Pipelines by Diversifying Results with PyVersity

PyVersity is a fast, lightweight Python library designed to improve the diversity of results from retrieval systems. Retrieval often returns very similar items, which leads to redundancy. PyVersity re-ranks these results so that they remain relevant while being less redundant.
It provides a clear, unified API for several popular diversification techniques, including Maximal Marginal Relevance (MMR), Max Sum of Distances (MSD), Determinantal Point Processes (DPP), and Cover. Its only dependency is NumPy, which keeps it very lightweight.
In this tutorial, we will focus on the MMR and MSD techniques using a practical example. Check out the full codes here.
Re-ranking for diversity is necessary because ranking methods that optimize only for relevance to the user's query tend to produce a set of top results that are nearly identical to one another.
This high similarity creates a poor user experience by limiting exploration and wasting screen space on near-duplicate items. Diversification techniques address this by comparing each candidate against the items already selected, ensuring that newly chosen results contribute novel information rather than repeating what has already been shown.
This matters across many domains: in e-commerce, it surfaces different product styles; in news search, it surfaces different viewpoints or sources; and in RAG/LLM pipelines, it prevents feeding the model repeated, duplicated text passages, improving the quality of the final response.
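To see concretely how relevance-only ranking crowds the top with near-duplicates, consider the following toy sketch. The 2-D vectors are illustrative stand-ins for real embeddings, not output from any actual model:

```python
import numpy as np

# Toy 2-D "embeddings": three near-duplicate items and one distinct item.
query = np.array([1.0, 0.0])
items = np.array([
    [0.99, 0.10],   # near-duplicate A
    [0.98, 0.12],   # near-duplicate B
    [0.97, 0.14],   # near-duplicate C
    [0.60, 0.80],   # a genuinely different item
])

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

scores = [cosine(query, v) for v in items]
ranking = np.argsort(scores)[::-1]  # descending by relevance
print(ranking)  # the three near-duplicates crowd the top; the distinct item ranks last
```

Pure relevance ranking places all three near-duplicates ahead of the distinct item, which is exactly the redundancy problem diversification addresses.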

pip install openai numpy pyversity scikit-learn
import os
from openai import OpenAI
from getpass import getpass
os.environ['OPENAI_API_KEY'] = getpass('Enter OpenAI API Key: ')
client = OpenAI()
In this step, we simulate the kind of search results you might retrieve from a vector database (such as Pinecone, Weaviate, or FAISS) after searching for a query like “smart family dogs.”
These results intentionally contain redundant entries – multiple descriptions of similar breeds such as Golden Retrievers, Labradors, and German Shepherds – each characterized by qualities such as loyalty, intelligence, and family-friendliness.
This redundancy mirrors what often happens in real-world retrieval pipelines, where near-duplicate items all receive very high scores. We will use this data to show how diversification techniques can reduce duplication and produce a balanced, diverse set of search results.
import numpy as np
search_results = [
"The Golden Retriever is the perfect family companion, known for its loyalty and gentle nature.",
"A Labrador Retriever is highly intelligent, eager to please, and makes an excellent companion for active families.",
"Golden Retrievers are highly intelligent and trainable, making them ideal for first-time owners.",
"The highly loyal Labrador is consistently ranked number one for US family pets due to its stable temperament.",
"Loyalty and patience define the Golden Retriever, one of the top family dogs globally and easily trainable.",
"For a smart, stable, and affectionate family dog, the Labrador is an excellent choice, known for its eagerness to please.",
"German Shepherds are famous for their unwavering loyalty and are highly intelligent working dogs, excelling in obedience.",
"A highly trainable and loyal companion, the German Shepherd excels in family protection roles and service work.",
"The Standard Poodle is an exceptionally smart, athletic, and surprisingly loyal dog that is also hypoallergenic.",
"Poodles are known for their high intelligence, often exceeding other breeds in advanced obedience training.",
"For herding and smarts, the Border Collie is the top choice, recognized as the world's most intelligent dog breed.",
"The Dachshund is a small, playful dog with a distinctive long body, originally bred in Germany for badger hunting.",
"French Bulldogs are small, low-energy city dogs, known for their easy-going temperament and comical bat ears.",
"Siberian Huskies are energetic, friendly, and need significant cold weather exercise due to their running history.",
"The Beagle is a gentle, curious hound known for its excellent sense of smell and a distinctive baying bark.",
"The Great Dane is a very large, gentle giant breed; despite its size, it's known to be a low-energy house dog.",
"The Australian Shepherd (Aussie) is a medium-sized herding dog, prized for its beautiful coat and sharp intellect."
]
def get_embeddings(texts):
    """Fetches embeddings from the OpenAI API."""
    print("Fetching embeddings from OpenAI...")
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=texts
    )
    return np.array([data.embedding for data in response.data])
embeddings = get_embeddings(search_results)
print(f"Embeddings shape: {embeddings.shape}")
In this step, we compute how closely each search result matches the user's query using cosine similarity. This produces a relevance-ranked list of results based on semantic similarity, showing which documents are closest in meaning to the query. In practice, this simulates what a search or retrieval engine would return before applying any diversification technique, which often results in many similar or redundant entries.
from sklearn.metrics.pairwise import cosine_similarity
query_text = "Smart and loyal dogs for family"
query_embedding = get_embeddings([query_text])[0]
scores = cosine_similarity(query_embedding.reshape(1, -1), embeddings)[0]
print("\n--- Initial Relevance-Only Ranking (Top 5) ---")
initial_ranking_indices = np.argsort(scores)[::-1]  # Sort descending
for i in initial_ranking_indices[:5]:
    print(f"Score: {scores[i]:.4f} | Result: {search_results[i]}")


As seen in the output above, the top results are dominated by multiple mentions of Labradors and Golden Retrievers, each described by similar characteristics such as loyalty, intelligence, and family-friendliness. This is typical of relevance-only ranking, where the top results are rarely identical but often offer little variation in content. While these results are all relevant to the query, they provide little benefit to users who want a broader view or different perspectives.
MMR works by striking a balance between relevance and diversity. Instead of simply selecting the results most similar to the query, it incrementally selects items that are still relevant but not too similar to what has already been selected.
In simple terms, imagine you are building a list of dog breeds for the query “smart family dogs.” The first result might be a Labrador – a very good match. For the next selection, MMR avoids choosing another description of a Labrador and instead picks something like a Golden Retriever or a German Shepherd.
In this way, MMR ensures your final results are both useful and diverse, reducing repetition while keeping everything highly relevant to what the user is looking for.
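Under the hood, greedy MMR scores each remaining candidate by a weighted trade-off between its relevance and its maximum similarity to the items already picked. The following is a minimal from-scratch sketch of that idea in NumPy – it is not pyversity's actual implementation, and the `diversity` weighting shown is just one common formulation:

```python
import numpy as np

def mmr_select(embeddings, scores, k, diversity=0.5):
    """Greedy MMR sketch: trade off relevance against similarity
    to items already selected. Not pyversity's internals."""
    # Normalize rows so plain dot products are cosine similarities.
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    selected = [int(np.argmax(scores))]  # seed with the most relevant item
    candidates = set(range(len(scores))) - set(selected)
    while len(selected) < k and candidates:
        best, best_val = None, -np.inf
        for i in candidates:
            # Penalize similarity to the closest already-selected item.
            max_sim = max(float(emb[i] @ emb[j]) for j in selected)
            val = (1 - diversity) * scores[i] - diversity * max_sim
            if val > best_val:
                best, best_val = i, val
        selected.append(best)
        candidates.remove(best)
    return selected

# Tiny demo: two near-duplicate relevant items plus one distinct item.
emb = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
rel = np.array([1.0, 0.99, 0.5])
print(mmr_select(emb, rel, k=2))  # → [0, 2]: the duplicate is skipped
```

With `diversity=0.0` the same function degenerates to pure relevance ranking and would pick the duplicate instead.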
from pyversity import diversify, Strategy
# MMR: Focuses on novelty against already picked items.
mmr_result = diversify(
embeddings=embeddings,
scores=scores,
k=5,
strategy=Strategy.MMR,
diversity=0.5 # 0.0 is pure relevance, 1.0 is pure diversity
)
print("\n\n--- Diversified Ranking using MMR (Top 5) ---")
for rank, idx in enumerate(mmr_result.indices):
    print(f"Rank {rank+1} (Original Index {idx}): {search_results[idx]}")


After applying the MMR (Maximal Marginal Relevance) strategy, the results change noticeably. While the top items, such as the Labrador and German Shepherd, stay highly relevant to the query, the following entries include different breeds such as the Siberian Husky and French Bulldog. This shows how MMR minimizes repetition by avoiding several near-identical results – instead, it balances relevance and variety, giving users a comprehensive and informative set that stays on topic.
The MSD (Max Sum of Distances) strategy focuses on choosing results that are not only relevant to the query but also as different from each other as possible. Instead of considering similarity to previously selected items one at a time (as MMR does), MSD looks at the overall spread of the selected results.
In simple words, it tries to select results that cover a wide range of ideas or topics, ensuring strong diversity across the whole set. With the same dog example, MSD might include the Labrador, German Shepherd, and Husky – each quite different – to give a broad picture of “smart and loyal.”
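As a rough sketch of the idea (again, not pyversity's exact implementation), a greedy MSD-style selector can reward each candidate by its average cosine distance to everything already selected, rather than penalizing only the single closest neighbor as MMR does:

```python
import numpy as np

def msd_select(embeddings, scores, k, diversity=0.5):
    """Greedy Max-Sum-of-Distances sketch: reward items that are far,
    on average, from *all* items already selected."""
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    selected = [int(np.argmax(scores))]  # seed with the most relevant item
    candidates = set(range(len(scores))) - set(selected)
    while len(selected) < k and candidates:
        best, best_val = None, -np.inf
        for i in candidates:
            # Average cosine *distance* to everything picked so far.
            avg_dist = np.mean([1.0 - float(emb[i] @ emb[j]) for j in selected])
            val = (1 - diversity) * scores[i] + diversity * avg_dist
            if val > best_val:
                best, best_val = i, val
        selected.append(best)
        candidates.remove(best)
    return selected

# Same toy data as before: a duplicate relevant item plus a distinct one.
emb = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
rel = np.array([1.0, 0.99, 0.5])
print(msd_select(emb, rel, k=2))  # → [0, 2]: spread wins over the duplicate
```

On this tiny example MMR and MSD agree; they diverge on larger candidate sets, where summing (or averaging) distances over the whole selection pushes MSD toward a wider overall spread.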
# MSD: Focuses on strong spread/distance across all candidates.
msd_result = diversify(
embeddings=embeddings,
scores=scores,
k=5,
strategy=Strategy.MSD,
diversity=0.5
)
print("\n\n--- Diversified Ranking using MSD (Top 5) ---")
for rank, idx in enumerate(msd_result.indices):
    print(f"Rank {rank+1} (Original Index {idx}): {search_results[idx]}")


The results produced by the MSD (Max Sum of Distances) strategy show a strong focus on spread and coverage. While the Labrador and German Shepherd remain relevant to the query, the inclusion of breeds such as the French Bulldog, Siberian Husky, and Dachshund highlights MSD's tendency to choose results that differ strongly from one another.
This approach ensures that users see a wider mix of options rather than near-duplicate entries. Essentially, MSD emphasizes high diversity across the whole result set, offering a broad perspective while still maintaining strong relevance to the search intent.

I am a civil engineering student (2022) from Jamia Millia Islamia, New Delhi, and I am very interested in data science, especially neural networks and their application in various fields.



