Generative AI

Hands-On Tutorial: Build a Modular LLM Evaluation Pipeline with Google Generative AI and LangChain

Evaluating LLMs has emerged as a pivotal challenge in improving the reliability and usefulness of artificial intelligence in both academic and industrial settings. As the capabilities of these models expand, so does the need for rigorous, multidimensional evaluation methods. In this tutorial, we tackle one of the field's most pressing problems: systematically assessing the strengths and limitations of LLMs across several dimensions of performance. Using Google's Generative AI models as the systems under test and LangChain as our orchestration library, we build a robust, reproducible evaluation pipeline designed to run on Google Colab. The framework combines criterion-based scoring, covering correctness, relevance, coherence, and conciseness, with pairwise model comparisons and rich visual analytics to deliver clear, actionable insights. Results are exported to structured CSV files, giving researchers and practitioners a convenient, extensible toolkit for benchmarking API-served LLMs.

!pip install langchain langchain-google-genai ragas pandas matplotlib

This command installs the core Python libraries for the tutorial: LangChain for orchestrating LLM workflows (with langchain-google-genai providing the Google Generative AI integration), Ragas for additional evaluation utilities, and Pandas plus Matplotlib for data manipulation and visualization.

import os
import pandas as pd
import matplotlib.pyplot as plt
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from langchain.evaluation import load_evaluator
from langchain.schema import HumanMessage

We import the core Python utilities: os for environment management, Pandas for tabular data handling, and matplotlib.pyplot for plotting, alongside the LangChain components we need: ChatGoogleGenerativeAI as the model client, PromptTemplate and LLMChain for prompting, load_evaluator for criteria-based scoring, and HumanMessage for constructing chat inputs.

os.environ["GOOGLE_API_KEY"] = "Use Your API Key"

Here, we configure the environment by storing the Google API key in the GOOGLE_API_KEY environment variable, which allows the ChatGoogleGenerativeAI client to authenticate securely.
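
If you prefer not to keep the key in the notebook text, the minimal sketch below (an optional alternative, not part of the original pipeline) reads it at runtime using the standard-library getpass module.

# Optional alternative: prompt for the API key at runtime instead of hard-coding it.
import os
from getpass import getpass

if "GOOGLE_API_KEY" not in os.environ:
    os.environ["GOOGLE_API_KEY"] = getpass("Enter your Google API key: ")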

def create_evaluation_dataset():
    """Create a simple dataset for evaluation."""
    questions = [
        "Explain the concept of quantum computing in simple terms.",
        "How does a neural network learn?",
        "What are the main differences between SQL and NoSQL databases?",
        "Explain how blockchain technology works.",
        "What is the difference between supervised and unsupervised learning?"
    ]
   
    ground_truth = [
        "Quantum computing uses quantum bits or qubits that can exist in multiple states simultaneously, unlike classical bits. This allows quantum computers to process certain types of information much faster than classical computers for specific problems.",
        "Neural networks learn through a process called backpropagation where they adjust the weights between neurons based on the error between predicted and actual outputs, gradually minimizing this error through many iterations of training data.",
        "SQL databases are relational with structured schemas, fixed tables, and use SQL for queries. NoSQL databases are non-relational, schema-flexible, and designed for specific data models like document, key-value, wide-column, or graph formats.",
        "Blockchain is a distributed ledger technology where data is stored in blocks that are linked cryptographically. Each block contains transaction data and a timestamp, creating an immutable chain. Consensus mechanisms verify transactions without central authority.",
        "Supervised learning uses labeled data where the algorithm learns to predict outputs based on input-output pairs. Unsupervised learning works with unlabeled data to find patterns or structures without predefined outputs."
    ]
   
    return pd.DataFrame({"question": questions, "ground_truth": ground_truth})

We build a small Pandas DataFrame of five questions covering AI and database topics, paired with reference ground-truth answers, so that each model's responses can later be scored against known correct answers.
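
As a quick extension, the illustrative sketch below (the extra question/answer pair is hypothetical, not part of the original dataset) appends a custom row to the DataFrame returned by create_evaluation_dataset().

dataset = create_evaluation_dataset()
# Hypothetical extra evaluation item; edit freely to suit your domain.
extra = pd.DataFrame({
    "question": ["What is overfitting in machine learning?"],
    "ground_truth": [
        "Overfitting occurs when a model fits noise in the training data "
        "and therefore generalizes poorly to unseen data."
    ],
})
dataset = pd.concat([dataset, extra], ignore_index=True)  # now six rows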

def setup_models():
    """Set up different Google Generative AI models for comparison."""
    models = {
        "gemini-2.0-flash-lite": ChatGoogleGenerativeAI(model="gemini-2.0-flash-lite", temperature=0),
        "gemini-2.0-flash": ChatGoogleGenerativeAI(model="gemini-2.0-flash", temperature=0)
    }
    return models

This function instantiates two ChatGoogleGenerativeAI clients at temperature 0, one for "gemini-2.0-flash-lite" and one for "gemini-2.0-flash", and returns them in a dictionary keyed by model name so they can be compared side by side.
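
To benchmark a third model, you can extend the returned dictionary as in the hedged sketch below; "gemini-1.5-pro" is only an example model name, so substitute whatever your API key actually has access to.

models = setup_models()
# Assumed extra candidate; replace with any model available to your key.
models["gemini-1.5-pro"] = ChatGoogleGenerativeAI(model="gemini-1.5-pro", temperature=0)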

def generate_responses(models, dataset):
    """Generate responses from each model for the questions in the dataset."""
    responses = {}
   
    for model_name, model in models.items():
        model_responses = []
        for question in dataset["question"]:
            try:
                response = model.invoke([HumanMessage(content=question)])
                model_responses.append(response.content)
            except Exception as e:
                print(f"Error with model {model_name} on question: {question}")
                print(f"Error: {e}")
                model_responses.append("Error generating response")
       
        responses[model_name] = model_responses
   
    return responses

This function loops over every configured model and every question in the dataset, invokes the model to generate a response, catches and logs any errors (substituting the placeholder string "Error generating response"), and returns a dictionary mapping each model name to its list of responses.

def evaluate_responses(models, dataset, responses):
    """Evaluate model responses using different evaluation criteria."""
    evaluator_model = ChatGoogleGenerativeAI(model="gemini-2.0-flash-lite", temperature=0)
   
    reference_criteria = ["correctness"]
    reference_free_criteria = [
        "relevance",  
        "coherence",    
        "conciseness"  
    ]
   
    results = {model_name: {criterion: [] for criterion in reference_criteria + reference_free_criteria}
               for model_name in models.keys()}
   
    for criterion in reference_criteria:
        evaluator = load_evaluator("labeled_criteria", criteria=criterion, llm=evaluator_model)
       
        for model_name in models.keys():
            for i, question in enumerate(dataset["question"]):
                ground_truth = dataset["ground_truth"][i]
                response = responses[model_name][i]
               
                if response != "Error generating response":
                    eval_result = evaluator.evaluate_strings(
                        prediction=response,
                        reference=ground_truth,
                        input=question
                    )
                    normalized_score = float(eval_result.get('score', 0)) * 2
                    results[model_name][criterion].append(normalized_score)
                else:
                    results[model_name][criterion].append(0)  
   
    for criterion in reference_free_criteria:
        evaluator = load_evaluator("criteria", criteria=criterion, llm=evaluator_model)
       
        for model_name in models.keys():
            for i, question in enumerate(dataset["question"]):
                response = responses[model_name][i]
               
                if response != "Error generating response":
                    eval_result = evaluator.evaluate_strings(
                        prediction=response,
                        input=question
                    )
                    normalized_score = float(eval_result.get('score', 0)) * 2
                    results[model_name][criterion].append(normalized_score)
                else:
                    results[model_name][criterion].append(0)  
    return results

This function scores every model response using a "gemini-2.0-flash-lite" instance as the judge. The labeled "correctness" criterion compares each response against its ground-truth answer, while the reference-free criteria (relevance, coherence, conciseness) are judged from the question and response alone. Scores are normalized, errors receive 0, and everything is collected in a nested dictionary keyed by model and criterion.
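
Before running the full loop, it can help to inspect a single judgment. The sketch below evaluates one response against one reference-free criterion; the assumption (worth verifying against your installed LangChain version) is that the result dict contains 'score', 'value', and 'reasoning' keys.

judge = ChatGoogleGenerativeAI(model="gemini-2.0-flash-lite", temperature=0)
conciseness_eval = load_evaluator("criteria", criteria="conciseness", llm=judge)
sample = conciseness_eval.evaluate_strings(
    prediction="Qubits can be in superposition, which enables certain speedups.",
    input="Explain the concept of quantum computing in simple terms.",
)
print(sample)  # expected keys (assumed): 'score', 'value', 'reasoning'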

def calculate_average_scores(evaluation_results):
    """Calculate average scores for each model and criterion."""
    avg_scores = {}
   
    for model_name, criteria in evaluation_results.items():
        avg_scores[model_name] = {}
       
        for criterion, scores in criteria.items():
            if scores:
                avg_scores[model_name][criterion] = sum(scores) / len(scores)
            else:
                avg_scores[model_name][criterion] = 0
               
        all_scores = [score for criterion_scores in criteria.values() for score in criterion_scores if score is not None]
        if all_scores:
            avg_scores[model_name]["overall"] = sum(all_scores) / len(all_scores)
        else:
            avg_scores[model_name]["overall"] = 0
           
    return avg_scores

This function processes the raw evaluation results to compute, for each model, the average score per criterion across all questions. It also computes an "overall" score by averaging every individual metric value, and returns a dictionary mapping each model to its per-criterion averages plus that combined overall figure.
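
The toy example below uses hand-made scores (illustrative only) to show the expected shape of evaluation_results and how the per-criterion and overall averages are derived.

toy_results = {
    "model-a": {"correctness": [2, 0, 2], "relevance": [2, 2, 2]},
}
print(calculate_average_scores(toy_results))
# {'model-a': {'correctness': 1.33..., 'relevance': 2.0, 'overall': 1.66...}}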

def visualize_results(avg_scores):
    """Visualize evaluation results with bar charts."""
    models = list(avg_scores.keys())
    criteria = list(avg_scores[models[0]].keys())
   
    plt.figure(figsize=(14, 8))
   
    bar_width = 0.8 / len(models)
   
    positions = range(len(criteria))
   
    for i, model in enumerate(models):
        model_scores = [avg_scores[model][criterion] for criterion in criteria]
        plt.bar([p + i * bar_width for p in positions], model_scores,
                width=bar_width, label=model)
   
    plt.xlabel('Evaluation Criteria', fontsize=12)
    plt.ylabel('Average Score (0-10)', fontsize=12)
    plt.title('LLM Model Comparison by Evaluation Criteria', fontsize=14)
    plt.xticks([p + bar_width * (len(models) - 1) / 2 for p in positions], criteria)
    plt.legend()
    plt.grid(axis="y", linestyle="--", alpha=0.7)
   
    plt.tight_layout()
    plt.show()
   
    plt.figure(figsize=(10, 8))
   
    categories = [c for c in criteria if c != 'overall']
    N = len(categories)
   
    angles = [n / float(N) * 2 * 3.14159 for n in range(N)]
    angles += angles[:1]  
   
    plt.polar(angles, [0] * (N + 1))
    plt.xticks(angles[:-1], categories)
   
    for model in models:
        values = [avg_scores[model][c] for c in categories]
        values += values[:1]  
        plt.polar(angles, values, label=model)
   
    plt.legend(loc="upper right")
    plt.title('LLM Model Comparison - Radar Chart', fontsize=14)
    plt.tight_layout()
    plt.show()

This function draws a grouped bar chart comparing the models' average scores across all evaluation criteria, then renders a radar chart of the same per-criterion profiles, making relative strengths and weaknesses easy to spot at a glance.
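
The sketch below drives the plotting code with hand-made averages (illustrative values roughly in the 0-2 range that the normalization above produces), which makes it easy to smoke-test the charts without issuing any API calls.

toy_avg = {
    "gemini-2.0-flash-lite": {"correctness": 1.6, "relevance": 2.0,
                              "coherence": 1.8, "conciseness": 1.4, "overall": 1.7},
    "gemini-2.0-flash": {"correctness": 1.8, "relevance": 2.0,
                         "coherence": 2.0, "conciseness": 1.2, "overall": 1.75},
}
visualize_results(toy_avg)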

def main():
    print("Creating evaluation dataset...")
    dataset = create_evaluation_dataset()
   
    print("Setting up models...")
    models = setup_models()
   
    print("Generating responses...")
    responses = generate_responses(models, dataset)
   
    print("Evaluating responses...")
    evaluation_results = evaluate_responses(models, dataset, responses)
   
    print("Calculating average scores...")
    avg_scores = calculate_average_scores(evaluation_results)
   
    print("Average scores:")
    for model, scores in avg_scores.items():
        print(f"n{model}:")
        for criterion, score in scores.items():
            print(f"  {criterion}: {score:.2f}")
   
    print("nVisualizing results...")
    visualize_results(avg_scores)
   
    print("Saving results to CSV...")
    results_df = pd.DataFrame(columns=["Model", "Criterion", "Score"])
    for model, criteria in avg_scores.items():
        for criterion, score in criteria.items():
            results_df = pd.concat([results_df, pd.DataFrame([{"Model": model, "Criterion": criterion, "Score": score}])],
                                  ignore_index=True)
   
    results_df.to_csv("llm_evaluation_results.csv", index=False)
    print("Results saved to llm_evaluation_results.csv")
   
    detailed_df = pd.DataFrame(columns=["Question", "Ground Truth"] + list(models.keys()))
   
    for i, question in enumerate(dataset["question"]):
        row = {
            "Question": question,
            "Ground Truth": dataset["ground_truth"][i]
        }
       
        for model_name in models.keys():
            row[model_name] = responses[model_name][i]
       
        detailed_df = pd.concat([detailed_df, pd.DataFrame([row])], ignore_index=True)
   
    detailed_df.to_csv("llm_response_comparison.csv", index=False)
    print("Detailed responses saved to llm_response_comparison.csv")

The main function orchestrates the entire pipeline end to end: it builds the dataset, sets up the models, generates and evaluates responses, computes and prints average scores, renders the visualizations, and finally exports both the summary scores (llm_evaluation_results.csv) and the detailed per-question responses (llm_response_comparison.csv) to CSV files.

def pairwise_model_comparison(models, dataset, responses):
    """Compare two models side by side using an LLM as judge."""
    evaluator_model = ChatGoogleGenerativeAI(model="gemini-2.0-flash-lite", temperature=0)
   
    pairwise_template = """
    Question: {question}
   
    Response A: {response_a}
   
    Response B: {response_b}
   
    Which response better answers the user's question? Consider factors like accuracy,
    helpfulness, clarity, and completeness.
   
    First, analyze each response point by point. Then conclude with your choice of either:
    A is better, B is better, or They are equally good/bad.
   
    Your analysis:
    """
   
    pairwise_prompt = PromptTemplate(
        input_variables=["question", "response_a", "response_b"],
        template=pairwise_template
    )
   
    pairwise_chain = LLMChain(llm=evaluator_model, prompt=pairwise_prompt)
   
    model_names = list(models.keys())
   
    pairwise_results = {f"{model_a} vs {model_b}": [] for model_a in model_names for model_b in model_names if model_a != model_b}
   
    for i, question in enumerate(dataset["question"]):
        for j, model_a in enumerate(model_names):
            for model_b in model_names[j+1:]:  
                response_a = responses[model_a][i]
                response_b = responses[model_b][i]
               
                if response_a != "Error generating response" and response_b != "Error generating response":
                    comparison_result = pairwise_chain.run(
                        question=question,
                        response_a=response_a,
                        response_b=response_b
                    )
                   
                    key_ab = f"{model_a} vs {model_b}"
                    pairwise_results[key_ab].append({
                        "question": question,
                        "result": comparison_result
                    })
   
    return pairwise_results

This function performs head-to-head comparisons: for every question and every pair of models, it asks a "gemini-2.0-flash-lite" judge to analyze both responses side by side and declare which one better answers the question.
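
Newer LangChain releases mark LLMChain as deprecated in favor of runnable composition. As a hedged alternative sketch, the same pairwise judgment can be expressed as prompt | llm, assuming the imports used earlier and a LangChain version that supports this syntax; the example responses are made up for illustration.

judge = ChatGoogleGenerativeAI(model="gemini-2.0-flash-lite", temperature=0)
prompt = PromptTemplate(
    input_variables=["question", "response_a", "response_b"],
    template=(
        "Question: {question}\n\nResponse A: {response_a}\n\n"
        "Response B: {response_b}\n\nWhich response better answers the question, and why?"
    ),
)
pairwise_runnable = prompt | judge  # drop-in replacement for LLMChain(llm=judge, prompt=prompt)
verdict = pairwise_runnable.invoke({
    "question": "How does a neural network learn?",
    "response_a": "By adjusting weights via backpropagation to reduce prediction error.",
    "response_b": "By storing all training examples verbatim.",
})
print(verdict.content)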

def enhanced_main():
    """Enhanced main function with additional evaluations."""
    print("Creating evaluation dataset...")
    dataset = create_evaluation_dataset()
   
    print("Setting up models...")
    models = setup_models()
   
    print("Generating responses...")
    responses = generate_responses(models, dataset)
   
    print("Evaluating responses...")
    evaluation_results = evaluate_responses(models, dataset, responses)
   
    print("Calculating average scores...")
    avg_scores = calculate_average_scores(evaluation_results)
   
    print("Average scores:")
    for model, scores in avg_scores.items():
        print(f"n{model}:")
        for criterion, score in scores.items():
            print(f"  {criterion}: {score:.2f}")
   
    print("nVisualizing results...")
    visualize_results(avg_scores)
   
    print("nPerforming pairwise model comparison...")
    pairwise_results = pairwise_model_comparison(models, dataset, responses)
   
    print("nPairwise comparison results:")
    for comparison, results in pairwise_results.items():
        print(f"n{comparison}:")
        for i, result in enumerate(results[:2]):
            print(f"  Question {i+1}: {result['question']}")
            print(f"  Analysis: {result['result'][:100]}...")
   
    print("nSaving all results...")
    results_df = pd.DataFrame(columns=["Model", "Criterion", "Score"])
    for model, criteria in avg_scores.items():
        for criterion, score in criteria.items():
            results_df = pd.concat([results_df, pd.DataFrame([{"Model": model, "Criterion": criterion, "Score": score}])],
                                  ignore_index=True)
   
    results_df.to_csv("llm_evaluation_results.csv", index=False)
   
    detailed_df = pd.DataFrame(columns=["Question", "Ground Truth"] + list(models.keys()))
   
    for i, question in enumerate(dataset["question"]):
        row = {
            "Question": question,
            "Ground Truth": dataset["ground_truth"][i]
        }
       
        for model_name in models.keys():
            row[model_name] = responses[model_name][i]
       
        detailed_df = pd.concat([detailed_df, pd.DataFrame([row])], ignore_index=True)
   
    detailed_df.to_csv("llm_response_comparison.csv", index=False)
   
    pairwise_df = pd.DataFrame(columns=["Comparison", "Question", "Analysis"])
   
    for comparison, results in pairwise_results.items():
        for result in results:
            pairwise_df = pd.concat([pairwise_df, pd.DataFrame([{
                "Comparison": comparison,
                "Question": result["question"],
                "Analysis": result["result"]
            }])], ignore_index=True)
   
    pairwise_df.to_csv("llm_pairwise_comparison.csv", index=False)
   
    print("All results saved to CSV files.")

The enhanced_main function extends the core pipeline by adding the pairwise model comparison, printing progress and sample analyses at each stage, and exporting three CSV files (summary scores, detailed responses, and pairwise analyses), so a complete record of the evaluation run is preserved.

if __name__ == "__main__":
    enhanced_main()

Finally, the __main__ guard ensures that enhanced_main() runs the full pipeline only when the script is executed directly rather than imported as a module.

In conclusion, this tutorial presents a versatile framework for evaluating and comparing LLMs built on Google's Generative AI models. Rather than relying on a single aggregate metric, the approach shown here combines criterion-based scoring, granular per-question analysis, pairwise model comparison, and clear visualizations. By capturing key qualities such as correctness, relevance, coherence, and conciseness, the evaluation pipeline helps practitioners spot subtle differences in model performance. The outputs, including CSV summaries, radar plots, and bar charts, support not only further data analysis but also informed, data-driven decisions about model selection and deployment.

