
How we test our AI agents

Why is testing agents so hard?

Making sure an AI agent behaves as expected is not easy. Even small tweaks to components like your prompts, agent instructions, and models can have large and unexpected impacts.

Some of the top issues include:

Non-deterministic results

The underlying challenge is that agents are non-deterministic. The same input can go in, and two different results can come out.

How do you test for an expected outcome when you don't know what the expected outcome will be? Simply put, a strictly defined assertion test doesn't work.

Unstructured outputs

A second, and much discussed, challenge of evaluating agentic systems is that the outputs are often unstructured. Agentic systems are built on large language models, after all.

It is easy to define a test for structured data: for example, an ID field should never be null, or should always be an integer. But how do you define the quality of a large free-text field?

Cost and scale

LLM-as-judge is the most common way to assess the quality or reliability of AI agents. However, it is an expensive workload, and each user interaction (trace) can contain hundreds of interactions (spans).

So we have evolved our agent testing strategy accordingly. In this post we will share our findings, including a new key concept that has proven pivotal to ensuring reliability at scale.


Meet our agents

We have two agents in production that are used by over 30,000 users. A troubleshooting agent combs through hundreds of signals to find the cause of a data reliability incident, while a monitoring agent makes intelligent quality control recommendations.

For our troubleshooting agent we test three main dimensions: semantic distance, grounding, and tool usage. Here's how we test each.

Semantic distance

We favor deterministic tests where appropriate, as they are clear, explainable, and inexpensive. For example, it is very easy to add a test to verify that one of the subagent's outputs is in valid JSON format.
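A deterministic check like that is cheap to write and cheap to run. A minimal sketch, with `assert_valid_json` as an assumed helper name:

```python
import json


def assert_valid_json(output: str) -> dict:
    """Deterministic test: the subagent's raw output must parse as a JSON object."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError as exc:
        raise AssertionError(f"output is not valid JSON: {exc}") from exc
    if not isinstance(parsed, dict):
        raise AssertionError("expected a JSON object at the top level")
    return parsed
```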

However, there are times when deterministic testing does not get the job done. For example, we experimented with embedding the expected and new outputs as vectors and comparing them with cosine similarity tests. We thought this would be a cheap and quick way to test the semantic distance (whether they mean the same thing) between the observed and expected results.

However, we found there were too many cases where the words were similar but the meaning was different.
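For reference, the embedding-based check we experimented with boils down to a cosine similarity between two vectors. A minimal sketch:

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
```

The weakness is exactly what the metric measures: closeness in embedding space tracks word overlap more than it tracks whether two answers reach the same conclusion.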

Instead, we now give an LLM judge the expected result for the current configuration and ask it to score the similarity of the new output from 0 to 1.
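A sketch of that judge call; `call_llm` is a hypothetical stand-in for whatever model client you use, and the prompt wording is illustrative, not our production prompt:

```python
import json

# Illustrative prompt only; a production prompt would be tuned and versioned.
JUDGE_PROMPT = """You are grading an AI agent's output.
Expected result: {expected}
Actual result: {actual}
Score the semantic similarity of the two from 0 to 1 and explain the score.
Respond as JSON: {{"score": <float>, "explanation": "<string>"}}"""


def judge_similarity(expected: str, actual: str, call_llm) -> tuple[float, str]:
    """Ask an LLM judge for a 0-1 similarity score plus an explanation.

    `call_llm` is a hypothetical callable that sends a prompt to a model and
    returns its raw text response.
    """
    raw = call_llm(JUDGE_PROMPT.format(expected=expected, actual=actual))
    parsed = json.loads(raw)
    return float(parsed["score"]), parsed["explanation"]
```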

Grounding

To test grounding, we check that key context is cited where it should be, and that the agent refuses to answer when the key context is missing or the question is out of scope.

This is important because LLMs are eager to please and will hallucinate if not given good context.
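In practice we grade grounding with an LLM judge, but a toy deterministic proxy illustrates the contract: refuse when context is missing, answer when it is present. The refusal markers below are assumptions for the sketch:

```python
# Assumed refusal phrasings; a real agent needs a more robust refusal signal.
REFUSAL_MARKERS = ("i don't have enough context", "cannot answer")


def check_grounding(answer: str, context_present: bool) -> bool:
    """The agent must refuse when key context is missing, and answer otherwise.

    A toy deterministic proxy; in production an LLM judge grades this.
    """
    refused = any(marker in answer.lower() for marker in REFUSAL_MARKERS)
    return refused if not context_present else not refused
```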

Tool usage

For tool usage, we have an LLM-as-judge verify that the agent behaved as expected against a previously defined rubric:

  • No tool was expected and no tool was called
  • The expected tool was called
  • No required tools were skipped
  • No prohibited tools were used
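When tool calls are logged as structured data, a rubric like the one above can also be checked deterministically. A sketch with hypothetical set-based inputs:

```python
def check_tool_usage(
    called: set[str], expected: set[str], forbidden: set[str]
) -> list[str]:
    """Evaluate the tool calls in a trace against a predefined rubric.

    Returns human-readable violations; an empty list means the rubric passed.
    """
    violations = []
    if not expected and called:
        violations.append(f"no tool was expected but {sorted(called)} were called")
    missing = expected - called
    if missing:
        violations.append(f"required tools skipped: {sorted(missing)}")
    banned = called & forbidden
    if banned:
        violations.append(f"prohibited tools used: {sorted(banned)}")
    return violations
```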

The real magic is not in these tests themselves, but in how they are used. Here's our current setup, arrived at through painful trial and error.

Agent testing best practices

It is important to keep in mind that not only your agents but also your LLM judges are non-deterministic! These best practices are specifically designed to combat those natural flaws.

Soft failures

Hard thresholds can be noisy with non-deterministic tests for obvious reasons. So we coined the concept of 'soft failure.'

Each test returns a score between 0 and 1. Anything below 0.5 is a hard fail, while anything above 0.8 is a pass. A soft failure is any score between 0.5 and 0.8.

Changes can be merged with soft failures present. However, if a certain threshold of soft failures is exceeded, it becomes a hard failure and the process is terminated.

For our agent, it is currently set so that if 33% of the tests result in a soft failure, or if there are more than 2 soft failures in total, it is considered a hard failure. This prevents the change from merging.
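Put together, the score banding and merge gate can be sketched as follows (the function names and defaults are illustrative):

```python
def classify(score: float) -> str:
    """Map a 0-1 judge score onto pass / soft-fail / hard-fail bands."""
    if score < 0.5:
        return "hard_fail"
    if score <= 0.8:
        return "soft_fail"
    return "pass"


def gate_merge(scores: list[float], soft_ratio: float = 0.33, max_soft: int = 2) -> bool:
    """Return True if a change may merge under the soft-failure policy."""
    results = [classify(s) for s in scores]
    if "hard_fail" in results:
        return False
    soft = results.count("soft_fail")
    # Too many soft failures escalate to a hard failure and block the merge.
    return soft <= max_soft and soft / len(scores) < soft_ratio
```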

Re-running soft failures

A soft failure can be the canary in the coal mine, or in some cases it can be noise. About 10% of soft failures are the result of judge hallucinations. When a soft failure occurs, the test is re-run. If the re-run passes, we assume the original result was wrong.
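A minimal sketch of that re-run logic, assuming each test exposes a zero-argument callable that returns a 0-1 judge score:

```python
def resolve_soft_failure(run_test, retries: int = 1, pass_threshold: float = 0.8) -> str:
    """Re-run a soft-failing test; a passing re-run overrides the original.

    `run_test` is a hypothetical zero-argument callable that re-executes the
    test and returns its 0-1 judge score.
    """
    for _ in range(retries):
        if run_test() > pass_threshold:
            # The judge likely hallucinated on the original run.
            return "pass"
    return "soft_fail"
```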

Explanations

When a test fails, you need to understand why it failed. We now ask all LLM judges to provide not only a score but an explanation. It's not perfect, but it helps build confidence in the tests and often speeds up debugging.

Removing flaky tests

You should test your tests. Especially with LLM-as-judge tests, the way a check is worded can have a big impact on the results. We run each test many times, and if the delta across the results is too big we immediately update or remove the flaky test.
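A sketch of such a stability check; the run count and allowed delta are assumed values, not our production settings:

```python
def is_flaky(run_test, runs: int = 5, max_delta: float = 0.3) -> bool:
    """Run an LLM-as-judge test several times and flag it when scores diverge.

    `run_test` returns a 0-1 score per run; `max_delta` is the widest spread
    tolerated before the test gets updated or removed.
    """
    scores = [run_test() for _ in range(runs)]
    return max(scores) - min(scores) > max_delta
```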

Monitoring in production

Agent testing is new and challenging, but it's a walk in the park compared to monitoring agent performance and results in production. The inputs are messy, there is no expected output to compare against, and everything runs at a larger scale.

Not to mention the stakes are higher! Agent trust problems quickly become business problems.

This is our focus now. We are building agent observability tooling to tackle these challenges and will share new learnings in a future post.

The troubleshooting agent was one of the most impactful things we've ever shipped. Developing reliable agents has been a defining journey and we are excited to share it with you.


Michael Segner is a Product Strategist at Monte Carlo and author of the O'Reilly report, "Improving Data + AI Reliability Through Observability." This post was reviewed by Elor Arieli and Alik Peltinovich.
