Retrieval Augmented Classification: Improving text classification with external information

Text classification is one of the most basic yet most important tasks in natural language processing. It plays a central role in many real-world applications, from filtering unwanted emails such as spam, to categorizing products, to detecting user intent in a chatbot. The default methodology is to build a large labeled dataset, meaning documents and their corresponding labels, and then train a machine learning model on it. Things changed a bit as LLMs grew more capable: you can often get good performance from a general-purpose model through techniques such as zero-shot prompting. However, custom models can still beat LLMs on accuracy and cost for specialized tasks. In this blog post, we aim to bridge the gap between custom ML classifiers and general-purpose LLMs, while reducing the effort required to keep the LLM prompt up to date.
LLMs vs. custom ML models for text classification

Let's first consider the pros and cons of each approach to text classification.

Large Language Models as general-purpose classifiers:

Pros:
- Strong generalization thanks to the large pre-training corpora and sheer scale of LLMs.
- A single general-purpose LLM can handle multiple classification tasks without the need to train a model for each one.
- Since LLMs keep improving, you can potentially boost accuracy with little effort by switching to newer, more powerful models as they become available.
- The availability of many LLMs as managed services greatly reduces the engineering knowledge required to get started.
- LLMs often outperform smaller custom ML models in the low-data regime, where labeled data is limited or expensive to obtain.
- LLMs perform reasonably well in many languages.
- LLMs can be cheaper at low or irregular volumes, since you pay per token.
- Class definitions can be changed without any retraining, simply by editing the prompt.
Cons:
- LLMs are prone to hallucination.
- LLMs can be slower and more expensive than small custom ML models.
- They require prompt engineering effort.
- High request volumes against an LLM-as-a-service can quickly run into rate limits.
- This approach becomes impractical with a very large number of classes because of context-size limits: enumerating all the classes would consume a significant share of the input tokens.
- LLMs usually have worse accuracy than custom models in the high-data regime.
Custom machine learning models:

Pros:
- Fast and efficient to run.
- More flexibility in the choice of architecture, training, and inference procedures.
- Ability to add interpretability and uncertainty-quantification features to the model.
- Higher accuracy in the high-data regime.
- You keep full control over your model and serving infrastructure.
Cons:
- Require frequent retraining to adapt to new data or distribution shifts.
- May require a significant amount of labeled data.
- Limited generalization.
- Sensitive to out-of-vocabulary words and domain-specific constructs.
- Require MLOps knowledge for deployment.
Bridging the gap between custom text classifiers and LLMs:

Let's work out an approach that keeps the benefits of LLM-based classification while mitigating some of its drawbacks. We will take inspiration from RAG and use a prompting technique called few-shot prompting.

Let's define both:
RAG

Retrieval Augmented Generation is a popular technique for augmenting the LLM's context with external information before asking it a question. It lowers the likelihood of hallucination and improves the quality of the answers.
Few-shot prompting

For each classification task, we show the LLM examples of inputs and expected outputs as part of the prompt, to help it understand the task.
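As a quick illustration, here is a minimal sketch of a few-shot classification prompt assembled as a chat-style message list (the example texts, labels, and the `build_few_shot_messages` helper are invented for illustration):

```python
def build_few_shot_messages(examples, query):
    """Build a chat-style message list: a system instruction, then one
    (user, assistant) pair per labeled example, then the query to classify."""
    messages = [
        {"role": "system", "content": "Classify the document into one category."}
    ]
    for text, label in examples:
        messages.append({"role": "user", "content": f"Document: {text}"})
        # The assistant turn shows the expected output for that input.
        messages.append({"role": "assistant", "content": label})
    messages.append({"role": "user", "content": f"Document: {query}"})
    return messages


examples = [
    ("Apple unveils a new chip", "Technology"),
    ("The team wins the cup final", "Sports"),
]
messages = build_few_shot_messages(examples, "New GPU architecture announced")
# 1 system message + 2 (user, assistant) pairs + 1 final user message = 6 messages
```

The model sees the pattern "input → expected output" a few times before being asked about the real query, which is usually enough for it to imitate the format and the labeling behavior.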
Now, the main idea of this project is to combine the two. We retrieve examples that are similar to the query text to be classified and inject them into the prompt as few-shot examples. We also restrict the set of possible classes to those that appear among these nearest neighbors. This frees up a significant number of tokens in the input context when working on a classification problem with a large number of possible classes.
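To make the class-restriction part concrete, here is a small self-contained sketch (the neighbor documents and categories are invented) of how the candidate class list shrinks to only the categories observed among the retrieved neighbors, ranked by frequency:

```python
from collections import Counter

# Invented retrieval results: (document text, category) pairs,
# as a nearest-neighbor search might return them.
neighbors = [
    ("doc a", "PrimeMinister"),
    ("doc b", "PrimeMinister"),
    ("doc c", "Senator"),
    ("doc d", "PrimeMinister"),
    ("doc e", "Judge"),
]

# Candidate classes are only those present among the neighbors,
# ranked by how often they occur (most frequent first).
category_counts = Counter(category for _, category in neighbors)
ranked_categories = [category for category, _ in category_counts.most_common()]

# With 200+ total classes, the prompt now only needs to enumerate these few.
print(ranked_categories)  # ['PrimeMinister', 'Senator', 'Judge']
```

Instead of spending tokens enumerating every class in the taxonomy, the prompt lists just a handful of plausible candidates, which also makes the LLM's choice easier.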
Here is how that can work:
Let's go over the practical steps needed to get this method running:
- Build a vector store from the labeled documents/categories. This will be the knowledge base of our LLM. We will be using ChromaDB.
from typing import List
from uuid import uuid4

from langchain_core.documents import Document
from chromadb import PersistentClient
from langchain_chroma import Chroma
from langchain_community.embeddings import HuggingFaceBgeEmbeddings
import torch
from tqdm import tqdm
from chromadb.config import Settings

from retrieval_augmented_classification.logger import logger


class DatasetVectorStore:
    """ChromaDB vector store for PublicationModel objects with SentenceTransformers embeddings."""

    def __init__(
        self,
        db_name: str = "retrieval_augmented_classification",  # Using db_name as collection name in Chroma
        collection_name: str = "classification_dataset",
        persist_directory: str = "chroma_db",  # Directory to persist ChromaDB
    ):
        self.db_name = db_name
        self.collection_name = collection_name
        self.persist_directory = persist_directory

        # Determine if CUDA is available
        device = "cuda" if torch.cuda.is_available() else "cpu"
        logger.info(f"Using device: {device}")

        self.embeddings = HuggingFaceBgeEmbeddings(
            model_name="BAAI/bge-small-en-v1.5",
            model_kwargs={"device": device},
            encode_kwargs={
                "device": device,
                "batch_size": 100,
            },  # Adjust batch_size as needed
        )

        # Initialize Chroma vector store
        self.client = PersistentClient(
            path=self.persist_directory, settings=Settings(anonymized_telemetry=False)
        )
        self.vector_store = Chroma(
            client=self.client,
            collection_name=self.collection_name,
            embedding_function=self.embeddings,
            persist_directory=self.persist_directory,
        )

    def add_documents(self, documents: List) -> None:
        """
        Add multiple documents to the vector store.

        Args:
            documents: List of dictionaries containing document data. Each dict needs a "text" key.
        """
        local_documents = []
        ids = []
        for doc_data in documents:
            if not doc_data.get("id"):
                doc_data["id"] = str(uuid4())
            local_documents.append(
                Document(
                    page_content=doc_data["text"],
                    metadata={k: v for k, v in doc_data.items() if k != "text"},
                )
            )
            ids.append(doc_data["id"])

        batch_size = 100  # Adjust batch size as needed
        for i in tqdm(range(0, len(documents), batch_size)):
            batch_docs = local_documents[i : i + batch_size]
            batch_ids = ids[i : i + batch_size]
            # Chroma's add_documents doesn't directly support pre-defined IDs. Upsert instead.
            self._upsert_batch(batch_docs, batch_ids)

    def _upsert_batch(self, batch_docs: List[Document], batch_ids: List[str]):
        """Upsert a batch of documents into Chroma. If the ID exists, it updates; otherwise, it creates."""
        texts = [doc.page_content for doc in batch_docs]
        metadatas = [doc.metadata for doc in batch_docs]
        self.vector_store.add_texts(texts=texts, metadatas=metadatas, ids=batch_ids)
This class handles creating a collection and embedding each document before inserting it into the vector index. We use BAAI/bge-small-en-v1.5, but any embedding model would work, including those available as-a-service from Gemini, OpenAI, or Nebius.
- Retrieve the nearest neighbors of the query text
    def search(self, query: str, k: int = 5) -> List[Document]:
        """Search documents by semantic similarity."""
        results = self.vector_store.similarity_search(query, k=k)
        return results
This method returns the documents of the vector store that are most similar to our query.
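Under the hood, a similarity search boils down to comparing embedding vectors. Here is a brute-force sketch with toy 3-dimensional vectors (real embeddings have hundreds of dimensions, and the index entries here are invented; vector databases also use approximate indexes rather than an exhaustive scan):

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def search(query_vec, index, k=2):
    """Return the k entries of `index` most similar to `query_vec`."""
    scored = sorted(
        index,
        key=lambda item: cosine_similarity(query_vec, item["vector"]),
        reverse=True,
    )
    return scored[:k]


# Toy index: in practice these vectors come from the embedding model.
index = [
    {"text": "politician biography", "vector": [0.9, 0.1, 0.0]},
    {"text": "football match report", "vector": [0.0, 0.9, 0.4]},
    {"text": "election results", "vector": [0.8, 0.2, 0.1]},
]
top = search([1.0, 0.0, 0.0], index, k=2)
# The two politics-related entries rank above the sports one.
```

The vector store does exactly this comparison, just at scale and with an index structure that avoids scoring every document.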
- Build the Retrieval Augmented Classifier
from typing import Optional
from collections import Counter

from pydantic import BaseModel, Field
from tenacity import retry, stop_after_attempt, wait_exponential
from langchain_core.messages import AIMessage, HumanMessage, SystemMessage

from retrieval_augmented_classification.vector_store import DatasetVectorStore


class PredictedCategories(BaseModel):
    """
    Pydantic model for the predicted categories from the LLM.
    """

    reasoning: str = Field(description="Explain your reasoning")
    predicted_category: str = Field(description="Category")


class RAC:
    """
    A hybrid classifier combining K-Nearest Neighbors retrieval with an LLM for multi-class prediction.
    Finds top K neighbors, uses top few-shot for context, and uses all neighbor categories
    as potential prediction candidates for the LLM.
    """

    def __init__(
        self,
        vector_store: DatasetVectorStore,
        llm_client,
        knn_k_search: int = 30,
        knn_k_few_shot: int = 5,
    ):
        """
        Initializes the classifier.

        Args:
            vector_store: An instance of DatasetVectorStore with a search method.
            llm_client: An instance of the LLM client capable of structured output.
            knn_k_search: The number of nearest neighbors to retrieve from the vector store.
            knn_k_few_shot: The number of top neighbors to use as few-shot examples for the LLM.
                Must be less than or equal to knn_k_search.
        """
        self.vector_store = vector_store
        self.llm_client = llm_client
        self.knn_k_search = knn_k_search
        self.knn_k_few_shot = knn_k_few_shot

    @retry(
        stop=stop_after_attempt(3),  # Retry LLM call a few times
        wait=wait_exponential(multiplier=1, min=2, max=5),  # Shorter waits for demo
    )
    def predict(self, document_text: str) -> Optional[str]:
        """
        Predicts the relevant category for a given document text using KNN retrieval and an LLM.

        Args:
            document_text: The text content of the document to classify.

        Returns:
            The predicted category
        """
        neighbors = self.vector_store.search(document_text, k=self.knn_k_search)

        # Keep only the neighbors that carry a category in their metadata
        valid_neighbors = [
            neighbor
            for neighbor in neighbors
            if hasattr(neighbor, "metadata")
            and isinstance(neighbor.metadata, dict)
            and "category" in neighbor.metadata
        ]
        if not valid_neighbors:
            return None

        # Rank candidate categories by how often they appear among the neighbors
        category_counts = Counter(
            neighbor.metadata["category"] for neighbor in valid_neighbors
        )
        ranked_categories = [
            category for category, count in category_counts.most_common()
        ]

        few_shot_neighbors = valid_neighbors[: self.knn_k_few_shot]

        messages = []
        system_prompt = f"""You are an expert multi-class classifier. Your task is to analyze the provided document text and assign the most relevant category from the list of allowed categories.
You MUST only return a category that is present in the following list: {ranked_categories}.
If none of the allowed categories are relevant, return an empty string as the category.
Output your prediction as a JSON object matching the Pydantic schema: {PredictedCategories.model_json_schema()}.
"""
        messages.append(SystemMessage(content=system_prompt))

        # Build a fake conversation history where the LLM already classified the neighbors correctly
        for neighbor in few_shot_neighbors:
            messages.append(HumanMessage(content=f"Document: {neighbor.page_content}"))
            expected_output_json = PredictedCategories(
                reasoning="Your reasoning here",
                predicted_category=neighbor.metadata["category"],
            ).model_dump_json()
            # Simulate the structure often used with tool calling/structured output
            messages.append(AIMessage(content=expected_output_json))

        # Final user message: the document text to classify
        messages.append(HumanMessage(content=f"Document: {document_text}"))

        # Configure the client for structured output with the Pydantic schema
        structured_client = self.llm_client.with_structured_output(PredictedCategories)
        llm_response: PredictedCategories = structured_client.invoke(messages)

        predicted_category = llm_response.predicted_category
        return predicted_category if predicted_category in ranked_categories else None
The first part of the code defines the structure of the output we expect from the LLM. The Pydantic class has two fields: reasoning, used for chain-of-thought prompting, and predicted_category, the predicted class.

The predict method first finds the nearest neighbors and uses them as few-shot examples, by building a message history as if the LLM had already classified them correctly, then appends the query text as the final message.

We check that the predicted category is valid and, if so, return it.
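As a side note, the @retry decorator from tenacity wraps the LLM call with exponential backoff, so transient API failures are retried instead of surfacing immediately. Conceptually it behaves like this minimal pure-Python sketch (the `retry_with_backoff` helper is an illustrative stand-in, not tenacity's actual implementation):

```python
import time


def retry_with_backoff(fn, attempts=3, base_delay=0.01, factor=2.0):
    """Call fn(); on failure, wait base_delay * factor**i, then try again.
    Re-raise the last exception once all attempts are exhausted."""
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise
            time.sleep(base_delay * factor**i)


# A flaky function that fails twice before succeeding, to exercise the retries.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"


result = retry_with_backoff(flaky)
```

With that in place, we can instantiate the classifier and try it out: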
_rac = RAC(
    vector_store=store,
    llm_client=llm_client,
    knn_k_search=50,
    knn_k_few_shot=10,
)
print(
    f"Initialized RAC with knn_k_search={_rac.knn_k_search}, knn_k_few_shot={_rac.knn_k_few_shot}."
)
text = """Ivanoe Bonomi [iˈvaːnoe boˈnɔːmi] (18 October 1873 – 20 April 1951) was an Italian politician and statesman before and after World War II. Bonomi was born in Mantua. He was elected to the Italian Chamber of Deputies in ...
"""
category = _rac.predict(text)
print(text)
print(category)
text = """Michel Rocard, né le 23 août 1930 à Courbevoie et mort le 2 juillet 2016 à Paris, est un haut fonctionnaire et ...
"""
category = _rac.predict(text)
print(text)
print(category)
Both inputs yield the prediction “PrimeMinister”, even though the second example is in French while the training dataset is entirely in English. This illustrates how well the method generalizes across languages.
We use the DBpedia Classes dataset's L3 categories (license CC BY-SA 3.0) for our evaluation. This dataset has more than 200 categories and 240,000 training samples.

We benchmark the retrieval augmented classifier against a KNN-only classifier using majority vote and against an LLM-only baseline, and get the following results on the DBpedia dataset's L3 categories:
|  | Accuracy | Latency | Throughput (multi-threaded) |
| --- | --- | --- | --- |
| KNN classifier | 87% | ~24ms | 108 predictions/s |
| LLM only | 88% | ~600ms | 47 predictions/s |
| RAC | 96% | ~1s | 27 predictions/s |
For reference, the best accuracy I found in Kaggle notebooks for this dataset's L3 categories is around 94%, using custom ML models.

We note that combining KNN search with the reasoning abilities of the LLM gains us 9 accuracy points over the KNN classifier, at the cost of lower throughput and higher latency.
Conclusion

In this project we built a “retrieval augmented” text classifier that leverages the power of LLMs to predict the correct category of an input text. This approach offers several benefits over traditional ML classifiers, including the ability to change the training data without retraining, better generalization thanks to the LLM's pre-trained knowledge, and the ability to handle multiple classification tasks with a single LLM. It comes at the cost of higher latency, lower throughput, and a dependence on an external LLM service.

This approach shouldn't be your go-to for every text classification task, but it can, for example, help you get a classification service up and running very quickly when a deadline is closing in 😃.
Sources:
- [1] G. Yu, l
- [2] A. Long, W. Yin, T. Ajanthan, V. Nguyen, P. Purkait, R. Garg, C. Shen and A. van den Hengel, Retrieval Augmented Classification for Long-Tail Visual Recognition (2022)
Code:



