Getting Started with MLflow for LLM Evaluation

MLflow is an open-source platform for managing the machine learning lifecycle. While it is traditionally used for tracking model experiments and managing deployments, MLflow has recently introduced support for evaluating large language models (LLMs).
In this tutorial, we explore how to use MLflow to evaluate the performance of an LLM — in our case, Google's Gemini model — on a set of fact-based prompts. We will generate responses to these prompts using Gemini and assess their quality using several metrics supported directly by MLflow.
Setting up the dependencies
For this tutorial, we will use both the OpenAI and Gemini APIs. MLflow's built-in generative AI evaluation metrics rely on an OpenAI model (e.g., GPT-4) to act as a judge for metrics such as answer similarity, so an OpenAI API key is required in addition to your Google API key. You can create both keys from the respective provider dashboards.
Installing libraries
pip install mlflow openai pandas google-genai
Setting up the OpenAI and Google API keys as environment variables
import os
from getpass import getpass
os.environ["OPENAI_API_KEY"] = getpass('Enter OpenAI API Key:')
os.environ["GOOGLE_API_KEY"] = getpass('Enter Google API Key:')
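As a small safeguard, you can fail fast if a key was accidentally left unset. The `require_env` helper below is our own addition for illustration, not part of MLflow or either SDK:

```python
import os

def require_env(*names):
    """Raise early if any required environment variable is missing or empty."""
    missing = [n for n in names if not os.environ.get(n)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))

# After setting the keys above, you could check:
# require_env("OPENAI_API_KEY", "GOOGLE_API_KEY")
```

Failing at startup gives a clearer error than a mid-run authentication failure from one of the APIs.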
Preparing the evaluation data and fetching outputs from Gemini
import mlflow
import openai
import os
import pandas as pd
from google import genai
Creating the evaluation data
In this step, we define a small evaluation dataset containing factual prompts along with their correct ground-truth answers. The prompts span topics such as science, health, web development, and programming. This structured format allows us to objectively compare the responses generated by Gemini against known correct answers using the various evaluation metrics in MLflow.
eval_data = pd.DataFrame(
    {
        "inputs": [
            "Who developed the theory of general relativity?",
            "What are the primary functions of the liver in the human body?",
            "Explain what HTTP status code 404 means.",
            "What is the boiling point of water at sea level in Celsius?",
            "Name the largest planet in our solar system.",
            "What programming language is primarily used for developing iOS apps?",
        ],
        "ground_truth": [
            "Albert Einstein developed the theory of general relativity.",
            "The liver helps in detoxification, protein synthesis, and production of biochemicals necessary for digestion.",
            "HTTP 404 means 'Not Found' -- the server can't find the requested resource.",
            "The boiling point of water at sea level is 100 degrees Celsius.",
            "Jupiter is the largest planet in our solar system.",
            "Swift is the primary programming language used for iOS app development.",
        ],
    }
)
eval_data
Generating Gemini responses
This code block defines a helper function, gemini_completion(), that sends a prompt to the Gemini 1.5 Flash model via the Google Gen AI SDK and returns the generated response as plain text. We then apply this function to each prompt in our evaluation dataset to produce the model's predictions, storing them in a new "predictions" column. These predictions will later be evaluated against the ground-truth answers.
client = genai.Client()

def gemini_completion(prompt: str) -> str:
    response = client.models.generate_content(
        model="gemini-1.5-flash",
        contents=prompt
    )
    return response.text.strip()

eval_data["predictions"] = eval_data["inputs"].apply(gemini_completion)
eval_data
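Calling a hosted model once per prompt can fail transiently (rate limits, timeouts). A minimal, generic retry wrapper with exponential backoff is sketched below; `with_retry` is a hypothetical helper of our own, not part of the Gemini SDK:

```python
import time

def with_retry(fn, attempts=3, base_delay=1.0):
    """Return a wrapped version of fn that retries on any exception,
    sleeping base_delay, 2*base_delay, ... seconds between attempts."""
    def wrapped(*args, **kwargs):
        for attempt in range(attempts):
            try:
                return fn(*args, **kwargs)
            except Exception:
                if attempt == attempts - 1:
                    raise  # out of retries; surface the original error
                time.sleep(base_delay * (2 ** attempt))
    return wrapped

# Hypothetical usage with the helper defined earlier:
# robust_completion = with_retry(gemini_completion)
# eval_data["predictions"] = eval_data["inputs"].apply(robust_completion)
```

For a six-row dataset this is overkill, but on larger evaluation sets a single dropped request would otherwise abort the whole `apply` call.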
Evaluating Gemini outputs with MLflow
In this step, we start an MLflow run to evaluate the responses generated by the Gemini model against a set of factual ground-truth answers. We use the mlflow.evaluate() method with four lightweight metrics: answer_similarity (measuring the semantic similarity between the model's output and the ground truth), exact_match (checking for word-for-word matches), latency (tracking response generation time), and token_count (logging the number of output tokens).
It is important to note that the answer_similarity metric internally uses an OpenAI GPT model to judge the semantic closeness between answers, which is why access to the OpenAI API is required. This setup provides an efficient way to evaluate LLM outputs without writing custom evaluation logic. The final evaluation results are printed and also saved to a CSV file for later inspection or visualization.
mlflow.set_tracking_uri("mlruns")
mlflow.set_experiment("Gemini Simple Metrics Eval")

with mlflow.start_run():
    results = mlflow.evaluate(
        model_type="question-answering",
        data=eval_data,
        predictions="predictions",
        targets="ground_truth",
        extra_metrics=[
            mlflow.metrics.genai.answer_similarity(),
            mlflow.metrics.exact_match(),
            mlflow.metrics.latency(),
            mlflow.metrics.token_count()
        ]
    )
    print("Aggregated Metrics:")
    print(results.metrics)

    # Save the detailed per-row results table
    results.tables["eval_results_table"].to_csv("gemini_eval_results.csv", index=False)
To view the detailed evaluation results, we load the saved CSV file into a DataFrame and adjust the display settings to ensure each response is shown in full. This lets us inspect the individual prompts, Gemini's predictions, the ground-truth answers, and the associated metric scores, which is especially helpful in notebook environments such as Colab or Jupyter.
results = pd.read_csv('gemini_eval_results.csv')
pd.set_option('display.max_colwidth', None)
results
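Once the per-row table is loaded, ordinary pandas filtering applies. The sketch below uses a hand-built stand-in for the results table; the metric column names (e.g., "answer_similarity/v1/score") follow MLflow's usual naming convention, but you should verify them against results.columns in your own CSV:

```python
import pandas as pd

# Stand-in for the loaded results table (column names are assumptions
# based on MLflow's metric-naming convention; check your actual CSV).
rows = pd.DataFrame({
    "inputs": ["Name the largest planet in our solar system."],
    "ground_truth": ["Jupiter is the largest planet in our solar system."],
    "predictions": ["Jupiter"],
    "exact_match/v1/score": [0],
    "answer_similarity/v1/score": [4],
})

# Flag "near misses": semantically close answers that were not exact matches.
near_misses = rows[(rows["exact_match/v1/score"] == 0)
                   & (rows["answer_similarity/v1/score"] >= 4)]
print(near_misses[["inputs", "predictions"]])
```

Slicing the table this way quickly separates genuine model errors from answers that are correct but phrased differently from the ground truth.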
I am a Civil Engineering graduate (2022) from Jamia Millia Islamia, New Delhi, and I have a keen interest in Data Science, especially neural networks and their application in various areas.




