Advanced Planning for AI Project Testing

Here's a scene that's easy to find in businesses right now: a product or feature is proposed that could involve AI, such as an LLM-based agent, and discussions begin about how to scope the project and build it. Product and engineering will have great ideas about how useful this tool could be, and how much excitement it could bring to the business. However, when I'm in that room, the first thing I want to know after a project is proposed is "how are we going to test this?" Sometimes this leads to questions about whether AI testing is really important or necessary, or whether it can wait until later (or indefinitely).
Here's the truth: you only need to test your AI if you want to know whether it works. If you're comfortable building and shipping without knowing the impact on your business or your customers, you can skip testing – but most businesses aren't actually okay with that. No one wants to think of themselves as the kind of organization that builds things without making sure they work.
So, let's talk about what you need before you start building an AI product, so that you're ready to test it.
The purpose
This may sound obvious, but what should your AI do? What is its purpose, and what will it look like when it works?
You'd be surprised how many people get into building AI products without an answer to this question. But it's very important to stop and think hard about it, because knowing what success looks like for the project is necessary in order to set the parameters of that success.
It is also important to spend time on this question before you start, because you may find that you and your colleagues or leaders actually disagree on the answer. Organizations often decide to add AI to their product in some way without clearly defining the scope of the project, because AI is seen as valuable in its own right. Then, as the project progresses, internal conflict about what constitutes success emerges when one person's expectations are met and another's are not. This can be a real mess, and it only comes to light after a ton of time, energy, and effort has already been spent. The only way to prevent it is to be explicit, early, about what you're trying to achieve.
KPIs
It's not just a matter of coming up with a mental picture of what this AI product or feature looks like when it's working, though. That vision needs to be broken down into measurable form, such as KPIs, so that we can later build the evaluation tooling to calculate them. Qualitative or ad hoc data can be a great help for adding color or doing a "sniff test", but if people just try an AI tool ad hoc, without a systematic plan and process, it won't produce enough information to judge the success of the product.
Relying on vibes, "seems okay", or "no one is complaining" to evaluate the results of a project is lazy and ineffective. Collecting enough data to get a statistically meaningful picture of a project's results can be expensive and time-consuming, but the alternative is pseudoscientific guessing about how things went. You can't trust that spot checks or volunteered feedback are truly representative of the broader experience people will have. Most people don't bother to share their feedback, good or bad, so you need to ask them in a structured way. Furthermore, your test cases for an LLM-based tool can't just be made up on the fly – you need to decide which scenarios you care about, define tests that will capture those, and run them enough times to be confident about the range of results. Defining and implementing the tests will come later, but you need to identify the use cases and start planning for them now.
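To make "systematic" concrete, here is a minimal sketch of what an evaluation harness for an LLM tool can look like: named scenarios you care about, a pass/fail check for each, and repeated runs so a single lucky output doesn't mislead you. Everything here is a hypothetical illustration – `run_model` is a stand-in for your actual model call, and the scenarios and checks would be your own.

```python
# Hypothetical stand-in for a real LLM call; in practice this would
# query your model and return its response text.
def run_model(prompt: str) -> str:
    return "Your order #1234 ships Tuesday."

# Each scenario pairs an input you care about with a pass/fail check.
SCENARIOS = {
    "order_status": {
        "prompt": "Where is my order?",
        "passes": lambda out: "order" in out.lower(),
    },
    "refund_policy": {
        "prompt": "Can I get a refund?",
        "passes": lambda out: "refund" in out.lower(),
    },
}

def evaluate(n_runs: int = 20) -> dict:
    """Run each scenario repeatedly and report its pass rate.

    One run tells you little about a nondeterministic model; a pass
    rate over many runs starts to describe the range of results.
    """
    results = {}
    for name, scenario in SCENARIOS.items():
        outcomes = [
            scenario["passes"](run_model(scenario["prompt"]))
            for _ in range(n_runs)
        ]
        results[name] = sum(outcomes) / n_runs
    return results

if __name__ == "__main__":
    for name, rate in evaluate().items():
        print(f"{name}: {rate:.0%} pass rate")
```

The structure matters more than the specifics: scenarios are defined up front, checks are explicit rather than "seems okay", and the output is a number you can track over time.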
Set the Goals Before the Game
It's also important to think about testing and measurement before you start so that you and your teams aren't tempted, explicitly or implicitly, to game the numbers. Figuring out your KPIs after the project is built, or after it's released, may lead to choosing metrics that are easier to measure, easier to achieve, or both. In social science research, there is a concept that distinguishes between what you can measure and what actually matters, known as "measurement validity".
For example, if you want to measure the health of people in a research study, and decide whether your intervention has improved their health, you need to define what you mean by "health" in this context, break it down, and take several measurements of the different components that make it up. If, instead of doing all that work and spending the time and money, you just measured height and weight and calculated BMI, you would lack measurement validity. BMI may, depending on your point of view, have some relationship to health, but it is certainly not a complete measure of the concept. Health cannot be captured by something like BMI alone, even though height and weight are cheap and easy to measure.
For this reason, after figuring out what your vision of success is in practical terms, you need to formalize it and break it down into measurable goals. The KPIs you define may eventually need to be broken down further, or made more granular – until the work of developing your AI tool actually begins, there will be a certain amount you simply don't know. But before you start, do your best to set your goalposts, and then stick to them.
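One lightweight way to formalize goalposts is to write them down as data before development starts, so they can't quietly drift later. This is a hypothetical illustration – the metric names, descriptions, and targets are invented for the example, not recommendations:

```python
from dataclasses import dataclass

@dataclass(frozen=True)  # frozen: goalposts shouldn't be mutated mid-project
class KPI:
    """A success criterion, agreed on before the build begins."""
    name: str
    description: str
    target: float  # the goalpost, set in advance
    unit: str

# Hypothetical goalposts for an LLM-based support agent.
GOALPOSTS = [
    KPI("resolution_rate", "Share of tickets resolved without a human",
        0.60, "fraction"),
    KPI("factual_error_rate", "Share of responses containing a factual error",
        0.02, "fraction"),
    KPI("median_response_time", "Median time to first response",
        5.0, "seconds"),
]

for kpi in GOALPOSTS:
    print(f"{kpi.name}: target {kpi.target} {kpi.unit}")
```

Even a simple artifact like this forces the conversation about what success means, and gives the later evaluation work something concrete to measure against.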
Think About Risk
Especially with LLM-based technology, I think it's very important to have an honest conversation within your organization about risk tolerance before you get started. I recommend discussing risk early in the process because, like defining success, it may reveal differences in thinking between the people involved in the project, and those differences need to be resolved for the AI project to proceed. Risk tolerance can affect how you define success, and it will also affect the kinds of tests you create later in the process.
LLMs are nondeterministic, meaning that given the same input they may respond differently at different times. In business terms, this means you accept the risk that the LLM's response to certain inputs may be novel, unpleasant, or just plain weird. You cannot, of course, guarantee that an AI agent or LLM will behave as you expect. Even if it behaves as expected 99 times out of 100, you need to find out what that hundredth case looks like, understand the failure and error paths, and decide whether that level of risk is acceptable for what you're building – this is part of what AI testing is for.
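That "hundredth case" has a practical consequence for test planning: if a failure mode occurs with probability p on any given call, the chance of seeing it at least once in n independent runs is 1 − (1 − p)^n, so rare failures require many runs to surface. A short calculation (the failure rates here are hypothetical) shows the scale involved:

```python
import math

def runs_needed(p_failure: float, confidence: float = 0.95) -> int:
    """Smallest number of independent runs needed to observe, at least
    once, a failure that occurs with probability p_failure per run,
    at the given confidence. Solves 1 - (1 - p)^n >= confidence for n."""
    return math.ceil(math.log(1 - confidence) / math.log(1 - p_failure))

# A failure showing up once per hundred calls needs ~300 runs to be
# 95% sure of seeing it at least once.
print(runs_needed(0.01))   # → 299
# A one-in-a-thousand failure needs roughly ten times as many runs.
print(runs_needed(0.001))  # → 2995
```

This assumes runs are independent, which is an approximation for real LLM traffic, but it's a useful lower bound when deciding how many test iterations your risk tolerance demands.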
The conclusion
This may sound like a lot, I realize – a whole to-do list before anyone writes a line of code! However, evaluation matters more for AI projects than for many other kinds of software project, because of the nondeterministic nature of LLMs I've described. Producing an AI project that generates value and improves the business requires careful consideration, planning, and honest self-evaluation about what you hope to achieve and how you will handle the unexpected. As you go on to build the AI tests themselves, you'll think about what kinds of problems may occur (bad outputs, misuse of tools, etc.) and how to respond when they do – both to reduce their frequency and to be prepared when they happen.
Read more about my work at www.stephaniekirmer.com



