
How to Use the LLM Arena-as-a-Judge Method to Evaluate Language Model Outputs

In this lesson, we will examine how to use the LLM Arena-as-a-Judge method to evaluate large language model outputs. Instead of assigning isolated numerical scores to each response, this approach performs a head-to-head comparison between outputs to determine which one is better, based on criteria you define, such as clarity, helpfulness, or professionalism.

We will use OpenAI's GPT-4.1 and Google's Gemini 2.5 Pro to generate responses, with GPT-5 acting as the judge that evaluates their outputs. For the demonstration, we will work with a simple email-support scenario, where the context is as follows:

Dear Support,  
I ordered a wireless mouse last week, but I received a keyboard instead.  
Can you please resolve this as soon as possible?  
Thank you,  
John 

Installing the dependencies

pip install deepeval google-genai openai

For this lesson, you will need API keys from OpenAI and Google.

Since we use DeepEval to run the evaluation, an OpenAI API key is required:

import os
from getpass import getpass
os.environ["OPENAI_API_KEY"] = getpass('Enter OpenAI API Key: ')
os.environ['GOOGLE_API_KEY'] = getpass('Enter Google API Key: ')
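Before making any API calls, it can help to verify that both keys are actually set so that failures surface early with a clear message. The helper below is an illustrative addition, not part of the original lesson:

```python
import os

def missing_api_keys(env) -> list:
    """Return which of the required API keys are absent from the mapping."""
    required = ("OPENAI_API_KEY", "GOOGLE_API_KEY")
    return [k for k in required if not env.get(k)]

# Fail fast with a readable warning instead of a cryptic client error later.
absent = missing_api_keys(os.environ)
if absent:
    print("Warning: set", ", ".join(absent), "before running the evaluation.")
```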

Defining the context

Next, we will define the context for our test. In this example, we simulate a customer-support scenario in which a user reports receiving the wrong product. We build the context from the customer's initial email and then construct a prompt based on that context.

from deepeval.test_case import ArenaTestCase, LLMTestCase, LLMTestCaseParams
from deepeval.metrics import ArenaGEval

context_email = """
Dear Support,
I ordered a wireless mouse last week, but I received a keyboard instead. 
Can you please resolve this as soon as possible?
Thank you,
John
"""

prompt = f"""
{context_email}
--------

Q: Write a response to the customer email above.
"""

OpenAI model response

from openai import OpenAI
client = OpenAI()

def get_openai_response(prompt: str, model: str = "gpt-4.1") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "user", "content": prompt}
        ]
    )
    return response.choices[0].message.content

openAI_response = get_openai_response(prompt=prompt)
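API calls like the one above can fail transiently (rate limits, network hiccups). A generic retry wrapper such as the sketch below, an illustrative addition rather than part of the lesson, can make the calls more robust:

```python
import time

def with_retries(fn, attempts=3, delay=1.0):
    """Call fn(); on exception, retry with simple exponential backoff."""
    last_err = None
    for i in range(attempts):
        try:
            return fn()
        except Exception as e:  # in practice, catch the client's specific errors
            last_err = e
            time.sleep(delay * (2 ** i))
    raise last_err
```

You could then call `with_retries(lambda: get_openai_response(prompt))` instead of invoking the function directly.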

Gemini model response

from google import genai
client = genai.Client()

def get_gemini_response(prompt, model="gemini-2.5-pro"):
    response = client.models.generate_content(
        model=model,
        contents=prompt
    )
    return response.text
geminiResponse = get_gemini_response(prompt=prompt)

Defining the Arena test case

Here, we set up an ArenaTestCase to compare the outputs of two models, GPT-4 and Gemini, on the same input. Both models receive the same context_email context, and their generated responses are stored in openAI_response and geminiResponse.

a_test_case = ArenaTestCase(
    contestants={
        "GPT-4": LLMTestCase(
            input="Write a response to the customer email above.",
            context=[context_email],
            actual_output=openAI_response,
        ),
        "Gemini": LLMTestCase(
            input="Write a response to the customer email above.",
            context=[context_email],
            actual_output=geminiResponse,
        ),
    },
)

Setting up the evaluation metric

Here, we define an ArenaGEval metric named "Support Email Quality". The evaluation focuses on balancing empathy, professionalism, and clarity, favoring responses that sound understanding, polite, and succinct. The metric examines the context, the input, and each model's output, using GPT-5 as the judge with verbose mode enabled for better insight into its reasoning.

metric = ArenaGEval(
    name="Support Email Quality",
    criteria=(
        "Select the response that best balances empathy, professionalism, and clarity. "
        "It should sound understanding, polite, and be succinct."
    ),
    evaluation_params=[
        LLMTestCaseParams.CONTEXT,
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
    ],
    model="gpt-5",  
    verbose_mode=True
)

Running the evaluation

metric.measure(a_test_case)
**************************************************
Support Email Quality [Arena GEval] Verbose Logs
**************************************************
Criteria:
Select the response that best balances empathy, professionalism, and clarity. It should sound understanding, 
polite, and be succinct. 
 
Evaluation Steps:
[
    "From the Context and Input, identify the user's intent, needs, tone, and any constraints or specifics to be 
addressed.",
    "Verify the Actual Output directly responds to the Input, uses relevant details from the Context, and remains 
consistent with any constraints.",
    "Evaluate empathy: check whether the Actual Output acknowledges the user's situation/feelings from the 
Context/Input in a polite, understanding way.",
    "Evaluate professionalism and clarity: ensure respectful, blame-free tone and concise, easy-to-understand 
wording; choose the response that best balances empathy, professionalism, and succinct clarity."
] 
 
Winner: GPT-4
 
Reason: GPT-4 delivers a single, concise, and professional email that directly addresses the context (acknowledges 
receiving a keyboard instead of the ordered wireless mouse), apologizes, and clearly outlines next steps (send the 
correct mouse and provide return instructions) with a polite verification step (requesting a photo). This best 
matches the request to write a response and balances empathy and clarity. In contrast, Gemini includes multiple 
options with meta commentary, which dilutes focus and fails to provide one clear reply; while empathetic and 
detailed (e.g., acknowledging frustration and offering prepaid labels), the multi-option format and an over-assertive claim of already locating the order reduce professionalism and succinct clarity compared to GPT-4.
======================================================================

The test results indicate that GPT-4 outperformed the competing model at creating a support email that balances empathy, professionalism, and clarity. GPT-4's response was concise, respectful, and actionable: it directly addressed the situation with an apology for the error, outlined the next steps, and provided return instructions. Its tone was polite and understanding, fully aligned with the request for a clear and sympathetic reply. In contrast, Gemini's response, while empathetic and detailed, included multiple response options and unnecessary meta commentary, which diluted its clarity and professionalism. This result highlights GPT-4's ability to deliver a focused, customer-centric reply.
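If you want to consume the winner programmatically rather than reading the verbose logs, one option is to parse the "Winner:" line from the log text. The helper below is a hypothetical convenience that assumes the log format shown above:

```python
def parse_winner(verbose_log: str):
    """Extract the contestant name from the 'Winner: ...' line, if present."""
    for line in verbose_log.splitlines():
        if line.strip().startswith("Winner:"):
            return line.split(":", 1)[1].strip()
    return None
```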
