
LLM Monitors: Measuring AI 'Hallucination' and Verbosity

# Introduction

Large language models (LLMs) have a taste for “flowery”, sometimes overly verbose language in their responses. Ask a simple question, and you're likely to be met with paragraphs of detailed, enthusiastic, and complex prose. This behavior stems from their training: they are developed to be as helpful and conversational as possible.

Unfortunately, verbosity is a factor worth keeping on the radar, because it is often associated with increased odds of a bigger problem: hallucinations. The more words a response contains, the more likely the model is to drift away from grounded knowledge and into outright fabrication.

Strong guardrails are therefore needed to keep this problem in check, starting with a verbosity check. This article shows how to use Textstat, a Python library for measuring readability, to catch overly complex answers before they reach the end user and force the model to refine its response.

# Setting Up a Complexity Budget with Textstat

The Textstat Python library can calculate scores such as the Automated Readability Index (ARI), which estimates the grade (reading) level required to understand a piece of text, such as a model's answer. If this complexity metric exceeds a budget or threshold, say 10.0, which corresponds to a 10th-grade reading level, a feedback loop can automatically be triggered to request a shorter, simpler response. This strategy not only curbs flowery language but may also help reduce the risk of hallucinations, because the model is nudged to stick more tightly to the basic facts.
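To make the budget check concrete, here is a minimal sketch of the measurement step in isolation. The sample text and threshold are illustrative; the formula in the comment is the standard ARI definition that Textstat implements:

import textstat

COMPLEXITY_BUDGET = 10.0  # roughly a 10th-grade reading level

text = "The ramifications of this multifaceted phenomenon are extraordinarily consequential."

# ARI is derived from character, word, and sentence counts:
# ARI = 4.71 * (characters/words) + 0.5 * (words/sentences) - 21.43
ari = textstat.automated_readability_index(text)

if ari > COMPLEXITY_BUDGET:
    print(f"ARI {ari:.2f} exceeds the budget: trigger a simplification pass.")
else:
    print(f"ARI {ari:.2f} is within budget.")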

# Using the LangChain Pipeline

Let's put the strategy described above to work in a LangChain pipeline that can easily be run in a Google Colab notebook. You will need a Hugging Face API token, which you can generate for free from your Hugging Face account settings. Create a new secret called HF_TOKEN in Colab's left menu by clicking the “Secrets” icon (it looks like a key), paste the generated API token into the “Value” field, and you're all set!

To begin, install the required libraries:

!pip install textstat langchain_huggingface langchain_community

The following code is specific to Google Colab; you may need to modify it if you are working in a different environment. It retrieves the API token stored in your Colab session's Secrets:

from google.colab import userdata

# Obtain Hugging Face API token saved in your Colab session's Secrets
HF_TOKEN = userdata.get('HF_TOKEN')

# Verify token recovery
if not HF_TOKEN:
    print("WARNING: The token 'HF_TOKEN' wasn't found. This may cause errors.")
else:
    print("Hugging Face Token loaded successfully.")

The next piece of code performs several actions. First, it sets up local text generation with a pre-trained Hugging Face model, specifically distilgpt2. Then, the model is wrapped so it can be used in a LangChain pipeline.

import textstat
import torch
from langchain_core.prompts import PromptTemplate
# Importing the classes needed for local Hugging Face pipelines
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from langchain_community.llms import HuggingFacePipeline

# Initializing a small, local-friendly LLM for text generation
model_id = "distilgpt2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Creating a text-generation pipeline
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=100,
    device=0 if torch.cuda.is_available() else -1  # GPU if available, else CPU
)

# Wrapping the pipeline in HuggingFacePipeline so LangChain can use it
llm = HuggingFacePipeline(pipeline=pipe)
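Before wiring the model into prompts, it's worth a quick sanity check that the wrapped pipeline responds. The prompt here is arbitrary, and distilgpt2's raw completions will be rough:

# Quick sanity check: invoke the wrapped pipeline directly
test_output = llm.invoke("Large language models are")
print(test_output)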

Our core approach to measuring and managing verbosity comes next. The following function generates a summary of the text passed in (assumed to be an LLM response) and tries to ensure that the summary does not exceed the target difficulty level. Note that with a suitable prompt template, general-purpose generative models like distilgpt2 can be used to summarize text, although the quality of those summaries will not match that of summarization-oriented models. We chose this model because it runs reliably in a resource-constrained local environment.

def safe_summarize(text_input, complexity_budget=10.0):
    print("n--- Starting Summary Process ---")
    print(f"Input text length: {len(text_input)} characters")
    print(f"Target complexity budget (ARI score): {complexity_budget}")

    # Step 1: Initial Summary Generation
    print("Generating initial comprehensive summary...")
    base_prompt = PromptTemplate.from_template(
        "Provide a comprehensive summary of the following: {text}"
    )
    chain = base_prompt | llm
    summary = chain.invoke({"text": text_input})
    print("Initial Summary generated:")
    print("-------------------------")
    print(summary)
    print("-------------------------")

    # Step 2: Measure Readability
    ari_score = textstat.automated_readability_index(summary)
    print(f"Initial ARI Score: {ari_score:.2f}")

    # Step 3: Enforce Complexity Budget
    if ari_score > complexity_budget:
        print("Budget exceeded! Initial summary is too complex.")
        print("Triggering simplification guardrail...")
        simplification_prompt = PromptTemplate.from_template(
            "The following text is too verbose. Rewrite it concisely "
            "using simple vocabulary, stripping away flowery language:nn{text}"
        )
        simplify_chain = simplification_prompt | llm
        simplified_summary = simplify_chain.invoke({"text": summary})

        new_ari = textstat.automated_readability_index(simplified_summary)
        print("Simplified Summary generated:")
        print("-------------------------")
        print(simplified_summary)
        print("-------------------------")
        print(f"Revised ARI Score: {new_ari:.2f}")
        summary = simplified_summary
    else:
        print("Initial summary is within complexity budget. No simplification needed.")

    print("--- Summary Process Finished ---")
    return summary

Note also in the code above that ARI scores are calculated both before and after simplification to measure text difficulty.

The last part of the code example tests the function we just defined, passing in some sample text and a complexity budget of 10.0, and printing the final result.

# 1. Providing some highly verbose, complex sample text
sample_text = """
The inextricably intertwined permutations of cognitive computational arrays within the 
realm of Large Language Models often precipitate a cascade of unnecessarily labyrinthine 
lexical structures. This propensity for circumlocution, whilst seemingly indicative of 
profound erudition, frequently obfuscates the foundational semantic payload, thereby 
rendering the generated discourse significantly less accessible to the quintessential layperson.
"""

# 2. Calling the function
print("Running summarizer pipeline...n")
final_output = safe_summarize(sample_text, complexity_budget=10.0)

# 3. Printing the final result
print("n--- Final Guardrailed Summary ---")
print(final_output)

The resulting printed messages can be quite long, but you will see a modest decrease in the ARI score after the model is called to condense the text. Don't expect spectacular results, though: the chosen model, while lightweight, is not particularly good at summarizing text, so the ARI reduction is small. You can try other models such as google/flan-t5-small to see how they fare at summarization, but be warned that swapping them in requires a slightly different setup, as sketched below.
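If you do try google/flan-t5-small, keep in mind that it is a sequence-to-sequence model, so it needs a different model class and pipeline task than distilgpt2. A minimal sketch of the swap, under those assumptions, might look like this:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, pipeline
from langchain_community.llms import HuggingFacePipeline

# flan-t5 is a sequence-to-sequence model, so it uses the
# "text2text-generation" task instead of "text-generation"
t5_id = "google/flan-t5-small"
t5_tokenizer = AutoTokenizer.from_pretrained(t5_id)
t5_model = AutoModelForSeq2SeqLM.from_pretrained(t5_id)

t5_pipe = pipeline(
    "text2text-generation",
    model=t5_model,
    tokenizer=t5_tokenizer,
    max_new_tokens=100,
)

# Drop-in replacement for the distilgpt2-based llm defined earlier
llm = HuggingFacePipeline(pipeline=t5_pipe)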

# Wrapping up

This article showed how to measure and control overly verbose LLM responses by calling an auxiliary model to summarize them and then validating the result's complexity level. In many cases, hallucinations are a by-product of verbosity. Although the implementation shown here focuses on a verbosity check, there are dedicated tests for detecting hallucinations themselves, such as semantic consistency checks, natural language inference (NLI) cross-encoders, and LLM-as-a-judge approaches.
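As a pointer toward those hallucination-oriented checks, here is a brief, illustrative sketch of an NLI-style consistency test using a cross-encoder from the sentence-transformers library. The model name and the label order in the comment reflect that library's published NLI models, but treat them as assumptions to verify in your own environment:

from sentence_transformers import CrossEncoder

# NLI cross-encoder; for this model family the three output scores
# are assumed to correspond to [contradiction, entailment, neutral]
nli_model = CrossEncoder("cross-encoder/nli-deberta-v3-base")

source = "The company reported revenue of 2 million dollars in 2023."
summary = "The company earned 2 million dollars in revenue in 2023."

scores = nli_model.predict([(source, summary)])[0]
labels = ["contradiction", "entailment", "neutral"]
verdict = labels[scores.argmax()]

# "entailment" suggests the summary is consistent with its source;
# "contradiction" would flag a likely hallucination
print(f"NLI verdict: {verdict}")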

Iván Palomares Carrascosa is a leader, author, speaker, and consultant in AI, machine learning, deep learning and LLMs. He trains and guides others in using AI in the real world.
