How to Measure the Trustworthiness of a Large Language Model's Response

The basic principle behind large language models (LLMs) is very simple: predict the next word (or token) in a sequence based on statistical patterns in their training data. However, this seemingly simple capability turns out to be remarkably sophisticated, enabling a number of impressive tasks such as summarization, idea generation, writing code, answering questions over documents, and creating content. That said, at their core LLMs still do nothing more than predict the next word.
The process of predicting the next word is probabilistic. The LLM has to select each word from a probability distribution. In the process, it often generates false, fabricated, or inconsistent content in an attempt to produce coherent responses, filling knowledge gaps with plausible-looking but wrong information. This phenomenon is called hallucination, an inevitable trait of LLMs that makes verification and validation of their outputs necessary.
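To make this concrete, here is a minimal, purely illustrative sketch of sampling the next token from a softmax distribution over a tiny vocabulary; the vocabulary and logits are invented for demonstration and are not taken from any real model.
import numpy as np
# Hypothetical next-token logits for a tiny made-up vocabulary
vocab = ["Paris", "London", "Rome", "banana"]
logits = np.array([3.2, 1.1, 0.8, -2.0])
def sample_next_token(logits, temperature=1.0):
    # Softmax turns logits into a probability distribution
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    # The model samples from this distribution rather than always taking
    # the most likely token, which is one source of variability in answers
    return np.random.choice(len(probs), p=probs), probs
idx, probs = sample_next_token(logits, temperature=0.7)
print(dict(zip(vocab, probs.round(3))), "->", vocab[idx])
Because each token is drawn from a distribution, re-running the same prompt can yield different, and sometimes incorrect, completions.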
Retrieval-augmented generation (RAG), which makes an LLM work with external knowledge, does mitigate hallucinations to a certain extent, but it cannot eliminate them completely. Although advanced RAGs can provide in-text citations and URLs, verifying these references can be hectic and time-consuming. Therefore, we need an objective criterion for assessing the reliability or trustworthiness of an LLM's response, whether it is generated from the model's own knowledge or from an external knowledge base (RAG).
In this article, we will discuss how an LLM's output can be assessed for trustworthiness using a trustworthy language model, which assigns a trustworthiness score to the LLM's response. We will first discuss how we can use a trustworthy language model to assign scores to an LLM's responses and explain trustworthiness. Subsequently, we will build an example RAG with LlamaParse and LlamaIndex that evaluates the RAG's answers for trustworthiness.
The entire code of this article is available in a Jupyter Notebook on GitHub.
Assigning Trustworthiness Scores to an LLM's Responses
To demonstrate how to assign a trustworthiness score to an LLM's response, I will use Cleanlab's Trustworthy Language Model (TLM). TLM uses a combination of uncertainty quantification and consistency analysis to compute trustworthiness scores and explanations for LLM responses.
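Cleanlab does not disclose its exact scoring algorithm, but the consistency-analysis part of the idea can be illustrated with a toy sketch: sample several answers to the same prompt and measure how strongly they agree; low agreement suggests low trustworthiness. The function and sample answers below are purely illustrative and are not Cleanlab's implementation.
from collections import Counter
def consistency_score(answers):
    # Fraction of sampled answers that agree with the most common answer.
    # This is only an illustration of consistency analysis, not Cleanlab's algorithm.
    counts = Counter(a.strip().lower() for a in answers)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(answers)
# Example: five hypothetical samples for "How many vowels are in 'Abracadabra'?"
samples = ["5", "5", "6", "5", "5"]
print(consistency_score(samples))  # 0.8 -> reasonably, but not fully, consistent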
Cleanlab offers a free trial API, which can be obtained by creating an account on their website. First, we need to install Cleanlab's Python client:
pip install --upgrade cleanlab-studio
Cleanlab supports several models, including 'gpt-4o', 'gpt-4o-mini', 'o1-preview', 'claude-3-sonnet', 'claude-3.5-sonnet', and 'claude-3.5-sonnet-v2', among others. Here is how TLM assigns a trustworthiness score to gpt-4o's answer. The trustworthiness score ranges from 0 to 1, where higher values indicate greater trust.
from cleanlab_studio import Studio
studio = Studio("") # Get your API key from above
tlm = studio.TLM(options={"log": ["explanation"], "model": "gpt-4o"}) # GPT, Claude, etc
#set the prompt
out = tlm.prompt("How many vowels are there in the word 'Abracadabra'.?")
#the TLM response contains the actual output 'response', trustworthiness score and explanation
print(f"Model's response = {out['response']}")
print(f"Trustworthiness score = {out['trustworthiness_score']}")
print(f"Explanation = {out['log']['explanation']}")
The above code tests gpt-4o's response to the question "How many vowels are there in the word 'Abracadabra'?". TLM's output contains the model's answer (response), the trustworthiness score, and an explanation. Here is the output of this code.
Model's response = The word "Abracadabra" contains 6 vowels. The vowels are: A, a, a, a, a, and a.
Trustworthiness score = 0.6842228802750124
Explanation = This response is untrustworthy due to a lack of consistency in possible responses from the model. Here's one inconsistent alternate response that the model considered (which may not be accurate either):
5.
It may seem surprising that a state-of-the-art language model mishandles such a simple task and produces the wrong result. Here are the response and the trustworthiness score of claude-3.5-sonnet-v2 for the same question.
Model's response = Let me count the vowels in 'Abracadabra':
A-b-r-a-c-a-d-a-b-r-a
The vowels are: A, a, a, a, a
There are 5 vowels in the word 'Abracadabra'.
Trustworthiness score = 0.9378276048845285
Explanation = Did not find a reason to doubt trustworthiness.
claude-3.5-sonnet-v2 produces the correct result. Let's compare the two models' responses to another question.
from cleanlab_studio import Studio
from IPython.display import display, Markdown
# Initialize the Cleanlab Studio with API key
studio = Studio("") # Replace with your actual API key
# List of models to evaluate
models = ["gpt-4o", "claude-3.5-sonnet-v2"]
# Define the prompt
prompt_text = "Which one of 9.11 and 9.9 is bigger?"
# Loop through each model and evaluate
for model in models:
    tlm = studio.TLM(options={"log": ["explanation"], "model": model})
    out = tlm.prompt(prompt_text)
    md_content = f"""
## Model: {model}
**Response:** {out['response']}
**Trustworthiness Score:** {out['trustworthiness_score']}
**Explanation:** {out['log']['explanation']}
---
"""
    display(Markdown(md_content))
Here are the responses of the two models:
We can also compute trustworthiness scores for open-source LLMs. Let's check the recent, much-hyped open-source LLM DeepSeek-R1. I will use DeepSeek-R1-Distill-Llama-70B, which is based on Meta's Llama-3.3-70B-Instruct model and was distilled from DeepSeek's larger 671-billion-parameter mixture-of-experts (MoE) model. Knowledge distillation is a machine learning technique that aims to transfer the learnings of a large, pre-trained "teacher model" to a smaller "student model."
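As a toy illustration of the distillation idea (this is not DeepSeek's actual training code; the tensors below are random stand-ins for real logits), a student model is typically trained to match the teacher's softened output distribution via a KL-divergence loss:
import torch
import torch.nn.functional as F
# Random stand-ins for teacher/student logits over a vocabulary of 8 tokens
teacher_logits = torch.randn(4, 8)   # batch of 4 positions
student_logits = torch.randn(4, 8, requires_grad=True)
T = 2.0  # temperature softens both distributions
teacher_probs = F.softmax(teacher_logits / T, dim=-1)
student_log_probs = F.log_softmax(student_logits / T, dim=-1)
# KL divergence between teacher and student distributions (scaled by T^2 by convention)
distill_loss = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (T ** 2)
distill_loss.backward()  # gradients would update the student model's weights
print(distill_loss.item())
The snippet below then queries DeepSeek-R1-Distill-Llama-70B through Groq and scores its answer with TLM.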
import os
import streamlit as st
from langchain_groq.chat_models import ChatGroq
from cleanlab_studio import Studio
from IPython.display import display, Markdown
os.environ["GROQ_API_KEY"] = st.secrets["GROQ_API_KEY"]
# Initialize the Groq Llama Instant model
groq_llm = ChatGroq(model="deepseek-r1-distill-llama-70b", temperature=0.5)
prompt = "Which one of 9.11 and 9.9 is bigger?"
# Get the response from the model
response = groq_llm.invoke(prompt)
#Initialize Cleanlab's studio
studio = Studio("226eeab91e944b23bd817a46dbe3c8ae")
cleanlab_tlm = studio.TLM(options={"log": ["explanation"]}) #for explanations
#Get the output containing trustworthiness score and explanation
output = cleanlab_tlm.get_trustworthiness_score(prompt, response=response.content.strip())
md_content = f"""
## Model: deepseek-r1-distill-llama-70b
**Response:** {response.content.strip()}
**Trustworthiness Score:** {output['trustworthiness_score']}
**Explanation:** {output['log']['explanation']}
---
"""
display(Markdown(md_content))
Here is the output of the DeepSeek-R1-Distill-Llama-70B model.

Developing a Trustworthy RAG
We will now develop a RAG to demonstrate how to measure the trustworthiness of an LLM's response in a RAG setting. The RAG will be developed by scraping data from the given links, converting it to Markdown format, and creating a vector store.
The following libraries need to be installed for the next code:
pip install llama-parse llama-index-core llama-index-embeddings-huggingface llama-index-llms-cleanlab requests beautifulsoup4 pdfkit nest-asyncio
To convert HTML to PDF, we also need to install the wkhtmltopdf command-line tool, which is available on its website.
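As an example, on Debian/Ubuntu systems wkhtmltopdf can typically be installed from the package manager (Windows and macOS installers are available on the project's website); the exact command depends on your platform:
sudo apt-get install wkhtmltopdf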
The following libraries will be imported:
from llama_parse import LlamaParse
from llama_index.core import VectorStoreIndex
import requests
from bs4 import BeautifulSoup
import pdfkit
from llama_index.readers.docling import DoclingReader
from llama_index.core import Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.llms.cleanlab import CleanlabTLM
from typing import Dict, List, ClassVar
from llama_index.core.instrumentation.events import BaseEvent
from llama_index.core.instrumentation.event_handlers import BaseEventHandler
from llama_index.core.instrumentation import get_dispatcher
from llama_index.core.instrumentation.events.llm import LLMCompletionEndEvent
import nest_asyncio
import os
The next steps involve scraping the data from the given URLs using Python's BeautifulSoup library, saving the scraped data to PDF file(s) using pdfkit, and parsing the data from the PDF(s) into Markdown file(s) using LlamaParse, a GenAI-native document parsing platform built with LLMs and for LLM use cases.
We will first configure the LLM to be used by CleanlabTLM and the embedding model (Hugging Face embedding model BAAI/bge-small-en-v1.5), which will be used to compute the embeddings of the scraped data to create the vector store.
options = {
    "model": "gpt-4o",
    "max_tokens": 512,
    "log": ["explanation"]
}
llm = CleanlabTLM(api_key="", options=options)  # Get your free trial API key from Cleanlab's website
Settings.llm = llm
Settings.embed_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-small-en-v1.5"
)
Now we will define a custom event handler, GetTrustworthinessScore, which is derived from a base event handler class. This handler is triggered at the end of an LLM completion and extracts the trustworthiness score from the response metadata. A helper function, display_response, displays the LLM's response along with its trustworthiness score.
# Event Handler for Trustworthiness Score
class GetTrustworthinessScore(BaseEventHandler):
    events: ClassVar[List[BaseEvent]] = []
    trustworthiness_score: float = 0.0

    @classmethod
    def class_name(cls) -> str:
        return "GetTrustworthinessScore"

    def handle(self, event: BaseEvent) -> Dict:
        if isinstance(event, LLMCompletionEndEvent):
            self.trustworthiness_score = event.response.additional_kwargs.get("trustworthiness_score", 0.0)
            self.events.append(event)
        return {}

# Helper function to display LLM's response
def display_response(response):
    response_str = response.response
    trustworthiness_score = event_handler.trustworthiness_score
    print(f"Response: {response_str}")
    print(f"Trustworthiness score: {round(trustworthiness_score, 2)}")
Now we will generate the PDF(s) by scraping data from the given URLs. For demonstration, we will scrape data only from this Wikipedia article about large language models (Creative Commons Attribution-ShareAlike 4.0 license).
Disclaimer: Readers are advised to always check the status of the content/data they are about to scrape and make sure they are allowed to do so.
The next piece of code scrapes the data from the given URL(s) by making an HTTP request and using the BeautifulSoup Python library to parse the HTML content. The HTML content is cleaned by converting protocol-relative URLs to absolute ones. Subsequently, the scraped content is converted into a PDF file using pdfkit.
##########################################
# PDF Generation from Multiple URLs
##########################################
# Configure wkhtmltopdf path
wkhtml_path = r'C:\Program Files\wkhtmltopdf\bin\wkhtmltopdf.exe'
config = pdfkit.configuration(wkhtmltopdf=wkhtml_path)
# Define URLs and assign document names
urls = {
    "LLMs": "https://en.wikipedia.org/wiki/Large_language_model"
}
# Directory to save PDFs
pdf_directory = "PDFs"
os.makedirs(pdf_directory, exist_ok=True)
pdf_paths = {}
for doc_name, url in urls.items():
    try:
        print(f"Processing {doc_name} from {url} ...")
        response = requests.get(url)
        soup = BeautifulSoup(response.text, "html.parser")
        main_content = soup.find("div", {"id": "mw-content-text"})
        if main_content is None:
            raise ValueError("Main content not found")
        # Replace protocol-relative URLs with absolute URLs
        html_string = str(main_content).replace('src="//', 'src="https://').replace('href="//', 'href="https://')
        pdf_file_path = os.path.join(pdf_directory, f"{doc_name}.pdf")
        pdfkit.from_string(
            html_string,
            pdf_file_path,
            options={'encoding': 'UTF-8', 'quiet': ''},
            configuration=config
        )
        pdf_paths[doc_name] = pdf_file_path
        print(f"Saved PDF for {doc_name} at {pdf_file_path}")
    except Exception as e:
        print(f"Error processing {doc_name}: {e}")
After generating the PDF(s) from the scraped data, we parse these PDFs using LlamaParse. We set the parsing instructions to extract the content in Markdown format and parse the document(s) page-wise along with the document name and page number. These extracted entities (pages) are referred to as nodes. The parser iterates over the extracted nodes and updates each node's metadata by appending a citation header, which facilitates referencing later on.
##########################################
# Parse PDFs with LlamaParse and Inject Metadata
##########################################
# Define parsing instructions (if your parser supports it)
parsing_instructions = """Extract the document content in markdown.
Split the document into nodes (for example, by page).
Ensure each node has metadata for document name and page number."""
# Create a LlamaParse instance
parser = LlamaParse(
api_key="", #Replace with your actual key
parsing_instructions=parsing_instructions,
result_type="markdown",
premium_mode=True,
max_timeout=600
)
# Directory to save combined Markdown files (one per PDF)
output_md_dir = os.path.join(pdf_directory, "markdown_docs")
os.makedirs(output_md_dir, exist_ok=True)
# List to hold all updated nodes for indexing
all_nodes = []
for doc_name, pdf_path in pdf_paths.items():
    try:
        print(f"Parsing PDF for {doc_name} from {pdf_path} ...")
        nodes = parser.load_data(pdf_path)  # Returns a list of nodes
        updated_nodes = []
        # Process each node: update metadata and inject citation header into the text.
        for i, node in enumerate(nodes, start=1):
            # Copy existing metadata (if any) and add our own keys.
            new_metadata = dict(node.metadata) if node.metadata else {}
            new_metadata["document_name"] = doc_name
            if "page_number" not in new_metadata:
                new_metadata["page_number"] = str(i)
            # Build the citation header.
            citation_header = f"[{new_metadata['document_name']}, page {new_metadata['page_number']}]\n\n"
            # Prepend the citation header to the node's text.
            updated_text = citation_header + node.text
            new_node = node.__class__(text=updated_text, metadata=new_metadata)
            updated_nodes.append(new_node)
        # Save a single combined Markdown file for the document using the updated node texts.
        combined_texts = [node.text for node in updated_nodes]
        combined_md = "\n\n---\n\n".join(combined_texts)
        md_filename = f"{doc_name}.md"
        md_filepath = os.path.join(output_md_dir, md_filename)
        with open(md_filepath, "w", encoding="utf-8") as f:
            f.write(combined_md)
        print(f"Saved combined markdown for {doc_name} to {md_filepath}")
        # Add the updated nodes to the global list for indexing.
        all_nodes.extend(updated_nodes)
        print(f"Parsed {len(updated_nodes)} nodes from {doc_name}.")
    except Exception as e:
        print(f"Error parsing {doc_name}: {e}")
Now we create the vector store and the query engine. We define a custom prompt template to guide the LLM's behavior in answering the questions. Finally, we create a query engine with the created index to answer the queries. For each query, we retrieve the top 3 nodes from the vector store based on their semantic similarity to the query. The LLM uses these retrieved nodes to generate the final answer.
##########################################
# Create Index and Query Engine
##########################################
# Create an index from all nodes.
index = VectorStoreIndex.from_documents(documents=all_nodes)
# Define a custom prompt template that forces the inclusion of citations.
prompt_template = """
You are an AI assistant with expertise in the subject matter.
Answer the question using ONLY the provided context.
Answer in well-formatted Markdown with bullets and sections wherever necessary.
If the provided context does not support an answer, respond with "I don't know."
Context:
{context_str}
Question:
{query_str}
Answer:
"""
# Create a query engine with the custom prompt.
query_engine = index.as_query_engine(similarity_top_k=3, llm=llm, prompt_template = prompt_template)
print("Combined index and query engine created successfully!")
Let's now test the RAG with some queries and look at their corresponding trustworthiness scores.
query = "When is mixture of experts approach used?"
response = query_engine.query(query)
display_response(response)

query = "How do you compare Deepseek model with OpenAI's models?"
response = query_engine.query(query)
display_response(response)

Assigning a trustworthiness score to an LLM's response, whether it is generated through direct inference or through RAG, helps to define the reliability of an AI's output and prioritize human verification where needed. This is particularly important for critical domains where a wrong or unreliable response could have serious consequences.
That's all folks! If you liked the article, please follow me on Medium and LinkedIn.



