
Evaluating Evals with the NeMo Agent Toolkit

After a decade of working in analytics, I firmly believe that observability and evaluation are essential for any LLM application running in production. Monitoring and metrics aren't just nice-to-haves. They ensure that your product works as expected and that each new release actually takes you in the right direction.

In this article, I want to share my experience with the observability and evaluation features of the NeMo Agent Toolkit (NAT). If you haven't read my previous article on NAT, here's a quick refresher: NAT is Nvidia's framework for building production-ready LLM applications. Think of it as the glue that connects LLMs, tools, and workflows, while also providing observability and evaluation capabilities out of the box.

Using NAT, we built a Happiness Agent that can answer questions about World Happiness Report data and perform calculations based on its actual metrics. Our focus was on building an agentic flow, integrating agents from other frameworks as tools (in our example, a LangGraph-based calculation agent), and exposing the application as both a REST API and a user-friendly web interface.

In this article, I will dig into my favorite topics: observability and evaluation. After all, as the saying goes, you can't improve what you don't measure. So, without further ado, let's dive in.

Observability

Let's start with observability: the ability to track what's happening inside your app, including all intermediate steps, tools used, sessions, and token usage. The NeMo Agent Toolkit integrates with various observability platforms such as Phoenix, W&B Weave, and Catalyst. You can always check the latest list of supported platforms in the documentation.

In this article, we will try Phoenix, an open-source platform for tracing and evaluating LLM applications. Before we can start using it, we first need to install the required packages.

uv pip install arize-phoenix
uv pip install "nvidia-nat[phoenix]"

Next, we can start the Phoenix server.

phoenix serve

Once started, the tracing UI will be available at http://localhost:6006 (Phoenix's default port). At this point, you will only see the default project, since we haven't sent any data yet.

Image by author: the Phoenix UI showing only the default project

Now, with the Phoenix server running, let's see how we can start using it. Since NAT is based on a YAML configuration, all we need to do is add a telemetry section to our configuration. You can find the full configuration and implementation of the agent on GitHub. If you want to learn more about the NAT framework, check out my previous article.

general:                                             
  telemetry:                                          
    tracing:                                          
      phoenix:                                        
        _type: phoenix                               
        endpoint: http://localhost:6006/v1/traces
        project: happiness_report
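
For context, the phoenix exporter speaks OpenTelemetry under the hood. If you ever need to send traces to the same Phoenix instance outside of NAT, a minimal sketch using Phoenix's own phoenix.otel helper (assuming the local server and project name above) could look like this:

# A minimal sketch of pointing OpenTelemetry at Phoenix directly,
# outside of NAT. Assumes arize-phoenix is installed and the server
# is running locally on the default port.
from phoenix.otel import register

tracer_provider = register(
    project_name="happiness_report",  # same project as in the YAML config
    endpoint="http://localhost:6006/v1/traces",
)
tracer = tracer_provider.get_tracer(__name__)

with tracer.start_as_current_span("smoke-test"):
    pass  # any work inside this span will show up in the Phoenix UI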

With the configuration in place, let's set up the environment and run our agent.

export ANTHROPIC_API_KEY=
source .venv_nat_uv/bin/activate
cd happiness_v3 
uv pip install -e . 
cd .. 
nat run \
  --config_file happiness_v3/src/happiness_v3/configs/config.yml \
  --input "How much happier in percentages are people in Finland compared to the United Kingdom?"

Let's run a few more queries to see what kind of data Phoenix can track.

nat run \
  --config_file happiness_v3/src/happiness_v3/configs/config.yml \
  --input "Are people overall getting happier over time?"

nat run \
  --config_file happiness_v3/src/happiness_v3/configs/config.yml \
  --input "Is Switzerland in first place?"

nat run \
  --config_file happiness_v3/src/happiness_v3/configs/config.yml \
  --input "What is the main contributor to the happiness in the United Kingdom?"

nat run \
  --config_file happiness_v3/src/happiness_v3/configs/config.yml \
  --input "Are people in France happier than in Germany?"
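
By the way, if you'd rather run a whole batch of queries than type each nat run command by hand, a small wrapper script can help. Here is a sketch that simply shells out to the CLI (assuming nat is on your PATH and the virtual environment is activated):

# Hypothetical helper for sending a batch of queries through the agent.
# Each query becomes a separate `nat run` invocation and a separate trace.
import subprocess

CONFIG = "happiness_v3/src/happiness_v3/configs/config.yml"
QUESTIONS = [
    "Are people overall getting happier over time?",
    "Is Switzerland in first place?",
    "What is the main contributor to the happiness in the United Kingdom?",
    "Are people in France happier than in Germany?",
]

for question in QUESTIONS:
    subprocess.run(
        ["nat", "run", "--config_file", CONFIG, "--input", question],
        check=True,  # fail fast if any run errors out
    )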

After running these queries, you will notice a new project in Phoenix (happiness_report, as we specified in the config) and all the LLM calls we just made. This gives you a clear view of what's going on under the hood.

Image by author: the happiness_report project in Phoenix with the traced queries

We can zoom in on one of the questions, for example, “Are people overall getting happier over time?”

Image by author: the detailed trace for this query in Phoenix

This query takes quite a long time (about 25 seconds) because it includes five tool calls, one for each year. If we expect a lot of similar questions about trends over time, it might make sense to give our agent a new tool that can return summary statistics for all years at once.
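
As an illustration, such a tool (with the NAT tool registration omitted) might boil down to a single pandas aggregation. The column names below are assumptions about the underlying dataset:

# Hypothetical summary-statistics tool: computes all per-year averages
# in one call instead of querying each year separately.
import pandas as pd

def happiness_trend_summary(df: pd.DataFrame) -> dict:
    """Summarize the happiness trend across all years in a single call."""
    yearly = df.groupby("year")["happiness_score"].mean()
    return {
        "mean_score_by_year": yearly.round(4).to_dict(),
        "overall_change": round(float(yearly.iloc[-1] - yearly.iloc[0]), 4),
    }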

This is where observability shines: by revealing bottlenecks and inefficiencies, it helps you reduce costs and deliver a smoother user experience.

Evaluations

Observability is all about tracking how your application performs in production. This information is useful, but it's not enough to tell whether the quality of the answers is good enough or whether a new version performs better than the old one. To answer such questions, we need evaluations. Fortunately, the NeMo Agent Toolkit can help us with evals as well.

First, let's put together a small eval set. We need to specify only three fields: id, question, and answer.

[
  {
    "id": "1",
    "question": "In what country was the happiness score highest in 2021?",
    "answer": "Finland"
  }, 
  {
    "id": "2",
    "question": "What contributed most to the happiness score in 2024?",
    "answer": "Social Support"
  }, 
  {
    "id": "3",
    "question": "How UK's rank changed from 2019 to 2024?",
    "answer": "The UK's rank dropped from 13th in 2019 to 23rd in 2024."
  },
  {
    "id": "4",
    "question": "Are people in France happier than in Germany based on the latest report?",
    "answer": "No, Germany is at 22nd place in 2024 while France is at 33rd place."
  },
  {
    "id": "5",
    "question": "How much in percents are people in Poland happier in 2024 compared to 2019?",
    "answer": "Happiness in Poland increased by 7.9% from 2019 to 2024. It was 6.1863 in 2019 and 6.6730 in 2024."
  }
]
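
Since malformed records tend to fail in confusing ways later on, it's worth validating the file up front. A quick sanity check might look like this (the path matches the dataset location used in the config below):

# Quick sanity check: every eval record must contain the three required fields.
import json

with open("src/happiness_v3/data/evals.json") as f:
    dataset = json.load(f)

for item in dataset:
    missing = {"id", "question", "answer"} - item.keys()
    assert not missing, f"record {item.get('id')} is missing {missing}"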

Next, we need to update our YAML configuration to define where we store the eval results and where to find the eval dataset. I set up a dedicated eval_llm for evaluation purposes to keep the solution modular, and I'm using Sonnet 4.5 for it.

# Evaluation configuration
eval:
  general:
    output:
      dir: ./tmp/nat/happiness_v3/eval/evals/
      cleanup: false  
    dataset:
      _type: json
      file_path: src/happiness_v3/data/evals.json

  evaluators:
    answer_accuracy:
      _type: ragas
      metric: AnswerAccuracy
      llm_name: eval_llm
    groundedness:
      _type: ragas
      metric: ResponseGroundedness
      llm_name: eval_llm
    trajectory_accuracy:
      _type: trajectory
      llm_name: eval_llm

I have defined a few evaluators here. We will focus on Answer Accuracy and Response Groundedness from Ragas (an open-source framework for evaluating end-to-end LLM workflows), as well as trajectory evaluation. Let's break them down.

Answer Accuracy measures how well the model's response matches the reference ground truth. It uses two “LLM-as-a-Judge” prompts, each of which returns a rating of 0, 2, or 4. These ratings are then converted to a [0, 1] scale and averaged. Higher scores indicate that the model's response closely matches the reference.

  • 0 → The answer is incorrect or off-topic,
  • 2 → The answer is partially correct,
  • 4 → The answer fully matches the reference.

Response Groundedness checks whether the answer is supported by the retrieved context, that is, whether each claim can be derived (fully or partially) from the given data. It works similarly to Answer Accuracy, using two independent “LLM-as-a-Judge” prompts with ratings of 0, 1, or 2, which are then normalized to a [0, 1] scale (see the sketch after this list).

  • 0 → Not grounded at all,
  • 1 → Partially grounded,
  • 2 → Fully grounded.
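
To make the normalization concrete, here is my reading of how these judge ratings map to a [0, 1] score (a sketch of the idea, not the actual Ragas implementation):

# Each judge rating is divided by its maximum value, and the two judge
# scores are then averaged into the final [0, 1] metric.
def normalize(ratings: list[int], max_rating: int) -> float:
    return sum(r / max_rating for r in ratings) / len(ratings)

# Answer Accuracy: two judges rating on a 0/2/4 scale.
print(normalize([4, 2], max_rating=4))  # 0.75
# Response Groundedness: two judges rating on a 0/1/2 scale.
print(normalize([2, 1], max_rating=2))  # 0.75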

Trajectory evaluation examines the intermediate steps and tool calls made by the LLM, which helps assess the agent's reasoning process. An LLM judge evaluates the trace produced by the workflow, taking into account the tools available during execution. It returns a float between 0 and 1, where 1 represents a perfect trajectory.

Let's run an evaluation to see how it works in practice.

nat eval --config_file src/happiness_v3/configs/config.yml

After the run completes, we find several files in the output directory we specified earlier. One of the most useful is workflow_output.json. This file contains the execution results for each sample in our eval set, including the original question, the LLM-generated answer, the expected answer, and a detailed breakdown of all intermediate steps. It can help you trace how the system behaved in each case.

Here is an abbreviated example of the first sample.

{
  "id": 1,
  "question": "In what country was the happiness score highest in 2021?",
  "answer": "Finland",
  "generated_answer": "Finland had the highest happiness score in 2021 with a score of 7.821.",
  "intermediate_steps": [...],
  "expected_intermediate_steps": []
}
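
Rather than reading this file by eye, you can scan it programmatically. Here is a small sketch that flags samples where the reference answer doesn't appear in the generated one (assuming workflow_output.json sits directly in the output directory from the config):

# Rough first-pass scan of eval results: flag samples whose generated
# answer doesn't contain the reference string. Substring matching is
# crude, so treat hits as candidates for manual review, not failures.
import json

with open("./tmp/nat/happiness_v3/eval/evals/workflow_output.json") as f:
    results = json.load(f)

for item in results:
    if item["answer"].lower() not in item["generated_answer"].lower():
        print(f"check sample {item['id']}: {item['question']}")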

For Answer Accuracy and Response Groundedness, we scored remarkably high (1.0 out of 1.0 on average), which is always nice to see. Here is the returned file.

{
  "average_score": 1.0,
  "eval_output_items": [
    {
      "id": 1,
      "score": 1.0,
      "reasoning": {
        "user_input": "In what country was the happiness score highest in 2021?",
        "reference": "Finland",
        "response": "Finland had the highest happiness score in 2021 with a score of 7.821.",
        "retrieved_contexts": [...]
      }
    },
    {
      "id": 2,
      "score": 1.0,
      "reasoning": {
        "user_input": "What contributed most to the happiness score in 2024?",
        "reference": "Social Support",
        "response": "In 2024, **Social Support** contributed most to the happiness score, with an average impact of 1.333 points. This was followed very closely by **GDP per capita** with an impact of 1.319 points. These two factors were significantly more influential than other contributors like life expectancy (0.551), freedom (0.749), generosity (0.113), and perceptions of corruption (0.147). This suggests that having strong social connections and relationships, along with economic prosperity, were the primary drivers of happiness across countries in 2024.",
        "retrieved_contexts": [...]
      }
    },
    {
      "id": 3,
      "score": 1.0,
      "reasoning": {
        "user_input": "How UK's rank changed from 2019 to 2024?",
        "reference": "The UK's rank dropped from 13th in 2019 to 23rd in 2024.",
        "response": "The UK's rank in the World Happiness Report changed from 13th place in 2019 to 23rd place in 2024, representing a decline of 10 positions. The happiness score also decreased from 7.1645 in 2019 to 6.7280 in 2024.",
        "retrieved_contexts": [...]
      }
    },
    {
      "id": 4,
      "score": 1.0,
      "reasoning": {
        "user_input": "Are people in France happier than in Germany based on the latest report?",
        "reference": "No, Germany is at 22nd place in 2024 while France is at 33rd place.",
        "response": "Based on the latest 2024 World Happiness Report, people in Germany are happier than people in France. Germany has a happiness score of 6.753 (ranked 22nd globally), while France has a happiness score of 6.593 (ranked 33rd globally). The difference is 0.16 points in Germany's favor.",
        "retrieved_contexts": [...]
      }
    },
    {
      "id": 5,
      "score": 1.0,
      "reasoning": {
        "user_input": "How much in percents are people in Poland happier in 2024 compared to 2019?",
        "reference": "Happiness in Poland increased by 7.9% from 2019 to 2024. It was 6.1863 in 2019 and 6.6730 in 2024.",
        "response": "People in Poland are approximately 7.87% happier in 2024 compared to 2019. The happiness score increased from 6.1863 in 2019 to 6.6730 in 2024, representing an increase of 0.4867 points or about 7.87%.",
        "retrieved_contexts": [...]
      }
    }
  ]
}

For trajectory evaluation, we achieved an average score of 0.95. To understand where the model fell short, let's look at one imperfect example. On the fifth question, the judge correctly pointed out that the agent followed a suboptimal path: it took 8 steps to reach the final answer, although the same result could have been achieved in 4-5 steps. As a result, this trajectory received a score of 0.75 out of 1.0.

Let me evaluate this AI language model's performance step by step:

## Evaluation Criteria:
**i. Is the final answer helpful?**
Yes, the final answer is clear, accurate, and directly addresses the question. 
It provides both the percentage increase (7.87%) and explains the underlying 
data (happiness scores from 6.1863 to 6.6730). The answer is well-formatted 
and easy to understand.

**ii. Does the AI language model use a logical sequence of tools to answer the question?**
Yes, the sequence is logical:
1. Query country statistics for Poland
2. Retrieve the data showing happiness scores for multiple years including 
2019 and 2024
3. Use a calculator to compute the percentage increase
4. Formulate the final answer
This is a sensible approach to the problem.

**iii. Does the AI language model use the tools in a helpful way?**
Yes, the tools are used appropriately:
- The `country_stats` tool successfully retrieved the relevant happiness data
- The `calculator_agent` correctly computed the percentage increase using 
the proper formula
- The Python evaluation tool performed the actual calculation accurately

**iv. Does the AI language model use too many steps to answer the question?**
This is where there's some inefficiency. The model uses 8 steps total, which 
includes some redundancy:
- Steps 4-7 appear to involve multiple calls to calculate the same percentage 
(the calculator_agent is invoked, which then calls Claude Opus, which calls 
evaluate_python, and returns through the chain)
- Step 7 seems to repeat what was already done in steps 4-6
While the answer is correct, there's unnecessary duplication. The calculation 
could have been done more efficiently in 4-5 steps instead of 8.

**v. Are the appropriate tools used to answer the question?**
Yes, the tools chosen are appropriate:
- `country_stats` was the right tool to get happiness data for Poland
- `calculator_agent` was appropriate for computing the percentage change
- The underlying `evaluate_python` tool correctly performed the mathematical 
calculation

## Summary:
The model successfully answered the question with accurate data and correct 
calculations. The logical flow was sound, and appropriate tools were selected. 
However, there was some inefficiency in the execution with redundant steps 
in the calculation phase.

All in all, this is a surprisingly comprehensive assessment of the entire LLM workflow. Most importantly, it works out of the box and does not require ground truth data. I would definitely recommend using this evaluator for your applications.

Comparing different versions

Evaluation becomes especially powerful when you need to compare different versions of your application. Imagine a team focused on reducing costs that is considering switching from the expensive sonnet model to haiku. With NAT, changing the model takes less than a minute, but doing so without quality checks can be dangerous. This is exactly where evaluation shines.

In this comparison, we will also introduce another observability tool: W&B Weave. It provides particularly nice visualizations and side-by-side comparisons across different versions of your workflow.

To get started, you'll need to register on the W&B website and get an API key. W&B is free to use for personal projects.

export WANDB_API_KEY=

Next, install the necessary packages and plugins.

uv pip install wandb weave
uv pip install "nvidia-nat[weave]"

We also need to update our YAML configuration. This includes adding Weave to the telemetry section and introducing a workflow alias so we can clearly differentiate between different versions of the app.

general:                                             
  telemetry:                                          
    tracing:                                          
      phoenix:                                        
        _type: phoenix                               
        endpoint: http://localhost:6006/v1/traces
        project: happiness_report
      weave: # specified Weave
        _type: weave
        project: "nat-simple"

eval:
  general:
    workflow_alias: "nat-simple-sonnet-4-5" # added alias
    output:
      dir: ./.tmp/nat/happiness_v3/eval/evals/
      cleanup: false  
    dataset:
      _type: json
      file_path: src/happiness_v3/data/evals.json

  evaluators:
    answer_accuracy:
      _type: ragas
      metric: AnswerAccuracy
      llm_name: chat_llm
    groundedness:
      _type: ragas
      metric: ResponseGroundedness
      llm_name: chat_llm
    trajectory_accuracy:
      _type: trajectory
      llm_name: chat_llm

For the haiku version, I created a separate configuration where both chat_llm and calculator_llm use haiku instead of sonnet.

Now we can run the evaluation on both versions.

nat eval --config_file src/happiness_v3/configs/config.yml
nat eval --config_file src/happiness_v3/configs/config_simple.yml

Once the evaluations are complete, we can move to the W&B interface and check the full comparison report. I really like the radar chart visualization, as it makes the trade-offs visible at a glance.

Images by author: the W&B Weave comparison report and radar chart

With sonnet, we see higher token usage (and a higher cost per token) and slower response times (24.8 seconds vs. 16.9 seconds with haiku). However, despite haiku's clear advantages in speed and cost, I would not recommend switching models. The drop in quality is too large: trajectory accuracy falls from 0.85 to 0.55, and answer accuracy falls from 0.95 to 0.45. In this case, evaluation helped us avoid degrading the user experience in the name of cost optimization.

You can find the full implementation on GitHub.

Summary

In this article, we explored the observability and evaluation capabilities of the NeMo Agent Toolkit.

  • We worked with two observability tools (Phoenix and W&B Weave), both of which integrate seamlessly with NAT and let us capture what is happening inside our system in production, as well as record evaluation results.
  • We also walked through setting up evals in NAT and used W&B Weave to compare the performance of two different versions of the same application. This made it easier to reason about the trade-offs between cost, latency, and response quality.

The NeMo Agent Toolkit delivers robust, production-ready solutions for observability and evaluation, the foundational pieces of any serious LLM application. The tool that stood out to me most was W&B Weave, whose evaluation visualizations make comparing models and their trade-offs remarkably easy.

Thanks for reading. I hope this article was informative. Remember Einstein's advice: “The important thing is not to stop questioning. Curiosity has its own reason for existing.” May your curiosity lead you to your next great discovery.

Reference

This article was inspired by the “Nvidia's NeMo Agent Toolkit: Making Agents Trusted” short course from DeepLearning.AI.
