Automate Writing Your LLM Prompts

0 1 16 minutes read

Image created by Serj Smorodinsky, co-author of Building LLM Applications with DSPy

We’ve probably all had the experience of getting responses from an LLM that weren’t quite what we wanted. Usually we’ll try rewording the prompts a few times until we get something reasonable. We sometimes have to be more clear, more precise, give examples, describe why we need the response, present a persona, or otherwise provide enough context and information that the LLM is able to provide a suitable response.

This can be fine when we’re working directly with the LLM. However, it’s quite different when we’re writing an LLM-based application: software that will execute on its own and that, in doing so, will interact with one or more LLMs. Here, the software will work with predefined prompts and will pass these to the LLMs. If it doesn’t go well, we’re not there to reword the prompts and try again. This means the prompts have to be written in a way that’s robust and reliable in the first place; we need prompts that we can be confident will work consistently well in production.

Creating such a prompt can be tricky. In this article, we’ll go over why that is, and also how a Python tool called DSPy can support creating prompts that will be reliable. DSPy not only generates prompts automatically for you, it also evaluates them thoroughly, so you can be confident of how well they’ll likely work in production.
I’ll also provide an excerpt from my most recent book with Manning Publications, Building LLM Applications with DSPy, co-authored with Serj Smorodinsky. The book provides a complete description of DSPy and how to use it to create LLM-based applications.

Book cover image

The trick of creating a prompt that can work reliably in production

Part of what makes it difficult to create a reliable prompt is that we can’t fully predict the input we’ll have for the prompt. Say, for example, we’re creating a software application that will process documents. The documents may be found online, or possibly submitted by users of the software. As part of processing the documents, the application may ask an LLM to summarize them, translate them, extract key pieces of information, or to perform some other such task. For this example, let’s say the software will ask the LLM to critique how plausible the content in the documents appears to be. To do that we may write a prompt such as:

prompt_text = f"Assess how plausible the following text is: {document_text}"

That uses a Python f-string to form the prompt, with a slot for the text of the document. Other prompts may have multiple slots for the inputs, but for simplicity, we’ll assume here that each prompt has just one input — the piece of content you’ll want the LLM to process (which is the part that’s unpredictable).

This prompt may work sufficiently well, but it also may not. There are any number of ways the LLM may respond in a way we don’t like, at least occasionally. We may find that the LLM picks up on irrelevant details in the documents. Or may have a different sense of ‘plausible’ than we intended. Or it may indicate almost every document is fully plausible (or the opposite, that almost none are). Or the responses may not be formatted as we wish.

We may need to tweak the prompt to consistently get the responses we would expect. To get started, we can try this and a few other simple prompts, but the final prompt may end up being considerably longer and more detailed than this.

Usually, as we test with more inputs (in this case, more documents), we’ll find more cases where the current prompt doesn’t handle the input well, so we’ll tweak the prompt to handle these cases better. Sometimes we may reword the prompt to be more clear, and other times add some sentences to the prompt to handle these specific cases. For example, “If the document makes claims that are metaphorical, assess the general intent and not the literal meaning.” We can end up with any number of extra instructions like this in the prompt, which can help the prompt work well for these cases, but, of course, can also cause the prompt to work worse for other inputs.

And, as the prompts get longer and more complicated, they can get harder to tweak. It can get less and less clear what the effect of adding, removing, re-ordering, or re-wording phrases in the prompt will be.

Other LLM-based applications may work with other types of text data: text messages, emails, essays, journal articles, patent applications, and so on. Others may process image, audio, video, or other modalities. But, regardless of the type of input, for a non-trivial application, the specific input the application encounters (and passes on to the LLM) will be at least somewhat unpredictable. This means we’ll need a robust, well-specified prompt to handle a wide range of realistic input.

To take the example of email, if an LLM-based application is processing a collection of emails (that it will encounter in production, and that we can’t fully predict), there can be emails that are unusually: long, complex, nuanced, confusing, meandering, or otherwise not as we expected when forming the prompt. The only way to test that your application will work reliably in production is to test with a large, diverse, and realistic set of inputs (in this case, a large, diverse collection of realistic emails).

And for each test case, we need to carefully examine the LLM’s response and check that it is suitable. In some cases, this is straightforward. For example, we may pass some text to an LLM and ask it to classify the text in some way. The LLM may classify the text by language (English, French, etc.), sentiment, toxicity, and so on. In these cases, there’s a true class for each input, and there’s the class the LLM returns. We just have to check they’re the same: if the text is in Spanish and the LLM predicts Spanish, it’s correct; otherwise not. Many other LLM tasks produce output that’s easy to evaluate as well.
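For a classification task, that check is easy to automate. Here is a minimal sketch, assuming we already have the true labels and the LLM’s answers as two lists. The function name and the sample data are made up for illustration.

```python
def exact_match_accuracy(true_labels, predicted_labels):
    """Return the fraction of predictions that match the true label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("Both lists must have the same length")
    if not true_labels:
        raise ValueError("Need at least one example")
    # Compare case-insensitively and ignore stray whitespace in the LLM's answer.
    matches = sum(
        t.strip().lower() == p.strip().lower()
        for t, p in zip(true_labels, predicted_labels)
    )
    return matches / len(true_labels)


# Hypothetical data: the true language of four texts and the LLM's answers.
true_labels = ["Spanish", "English", "French", "Spanish"]
predicted_labels = ["Spanish", "English", "Spanish", "spanish "]

accuracy = exact_match_accuracy(true_labels, predicted_labels)
print(accuracy)
```

Three of the four answers match here, so the score is 0.75.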

In some cases, though, evaluating the responses is not so straightforward. An example is where we ask the LLM to generate a longer response, such as a summary, translation, critique, suggestions for follow-up steps, or any other such long-form output based on the input. If you’ve ever looked at two or more different responses from an LLM (where both are one or more full sentences long, and possibly much longer) and tried to assess which is better, you know this is time consuming. And error prone. Some may be more succinct, others more nuanced, others more clear. Nevertheless — as hard as these are to evaluate — we do need to evaluate them in order to assess how well each prompt we try is working. One of the nice things about DSPy is, it lets you automate this evaluation.

Prompt Engineering

To see the value of tools like DSPy, it’s good to look at the alternative, and at the problem that DSPy is solving. Normally, we work with LLMs using a technique known as prompt engineering. Doing this, we write one prompt, test it (usually with just a few inputs and simply eyeballing the outputs), write another prompt, test it in a similar way, and continue.

In simpler cases, this can work, but it does have a number of limitations. One is: it’s very time-consuming to test each candidate prompt with more than a small number of inputs. So in practice, we normally test each prompt far less than we should. Which can cause problems — testing each prompt with very few inputs can give us a poor sense of which prompts work better.

Making this more complicated, with each input we really should test the prompt multiple times (and not just once), since LLMs are stochastic. If given the same prompt (including the same values in the slots) multiple times, an LLM may return different responses each time. And some may be better than others. If we have, say, 20 documents to test with (in the example where the LLM will be used to estimate the plausibility of each document), ideally we’d test each several times. If we test each 3 times, that means 60 tests in total. Which, realistically, we won’t actually do. Probably not even close.
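To make the arithmetic concrete, here is a small sketch of what repeated testing looks like when it is scripted rather than done by hand. The `fake_llm_score` function is a made-up stand-in for a real LLM call (it simply returns a noisy score), so the numbers themselves mean nothing; the point is the shape of the loop.

```python
import random


def fake_llm_score(document, rng):
    """Stand-in for a real LLM call: returns a noisy 0-10 plausibility score."""
    base = len(document) % 11  # arbitrary, but fixed for a given document
    return max(0, min(10, base + rng.choice([-1, 0, 1])))


def run_repeated_tests(documents, runs_per_document, seed=0):
    """Score every document several times and keep all the scores."""
    rng = random.Random(seed)
    return [
        [fake_llm_score(doc, rng) for _ in range(runs_per_document)]
        for doc in documents
    ]


documents = [f"Sample document {i}" for i in range(20)]
results = run_repeated_tests(documents, runs_per_document=3)

total_calls = sum(len(scores) for scores in results)
# Count documents where the repeated runs did not all agree.
unstable = sum(1 for scores in results if max(scores) != min(scores))
print(total_calls, "calls;", unstable, "documents got inconsistent scores")
```

Twenty documents at three runs each gives the 60 calls mentioned above, for a single candidate prompt.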

And, as indicated, this is even harder where the LLMs return longer outputs, as it’s time-consuming to read them, and almost impossible to be consistent in how we evaluate them.

So, testing each candidate prompt is time consuming. Testing many candidate prompts is much more so. And it’s not clear we can really compare them fairly.

All this means that, in general, prompt engineering has the interesting quality of being both time-consuming and unreliable. It’s a very slow, tedious, and error-prone process. Experienced developers can often spend hours, or even days, on a single prompt. And in the end, can’t be certain the one they chose is really the strongest.

Is there a better way?

If we step back for a minute, we can look at how we handle a similar situation when working with machine learning. If we’re building a neural network, Random Forest, XGBoost model (or anything along those lines), each time we train it, we don’t manually test each element in the test set one at a time. In fact, the idea of doing that feels a bit silly. The process is automated; testing is quite simple. We simply run each element in the test set through the model, get a prediction for each, and execute a function to generate an overall score.

For example, we may use Mean Squared Error or R Squared for a regression problem, and possibly F1 Score, MCC, or AUROC for a classification problem. Using a tool such as scikit-learn, we can take the model’s predictions for the test set and the corresponding ground truth values, and simply pass these to a function to calculate the overall score. We then have a single number indicating how well that model worked.
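As a rough illustration of “one function call, one number”, here are hand-written versions of two of these metrics in plain Python. In practice you would normally use the tested equivalents in scikit-learn (`mean_squared_error` and `f1_score` in `sklearn.metrics`). The data below is invented.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between truth and prediction."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Regression: ground truth vs. model predictions on a tiny test set.
mse = mean_squared_error([3.0, 5.0, 2.0, 8.0], [2.0, 5.0, 4.0, 7.0])

# Binary classification: 1 is the positive class.
f1 = f1_score([1, 0, 1, 1, 0], [1, 0, 0, 1, 1])

print(mse, f1)
```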

We can next, if we wish, try again with different features, different hyperparameters, different training data (or some other such change from the previous model), re-train, and re-execute the testing — getting another score.

So, with ML projects, we have a process that’s clean and efficient. But when working with LLMs, we tend to do something quite different, something closer to prompt engineering — working without a framework to ensure consistency, repeatability, and efficiency. We essentially ignore decades of experience developing best practices for software development.

However, that’s not necessary. Working with LLMs, there are a number of tools that let us work in a similar way as we do when creating machine learning models — in a way that’s efficient, thorough, and repeatable. DSPy is likely the state of the art of these, at least at the moment. Using it, we specify our test data and a method to evaluate how good a response is. There is some time required to do that, but once that is done, pretty much everything else is handled for us.

In the example where we ask an LLM to estimate the plausibility of documents, we could gather a set of documents (possibly 10 or 20 or 30, though more is better) to be our test set. And for each, we could provide a ground truth for its plausibility. This could be a numeric value, let’s say, on a scale from 0 to 10.

We also have to provide a way for DSPy to assess how strong each LLM response is — in the form of a Python function. This will be a function that accepts the input to the LLM and the LLM’s response, and that returns either: 1) a numeric value (indicating how good the response is); or 2) a boolean value (indicating simply if the response is good or bad). In this example, the function can be fairly simple, along the lines of:

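def evaluate_answer(test_instance, model_prediction):
    # Absolute error between the human-assigned score and the LLM's score;
    # smaller values mean the response was closer to the ground truth.
    return abs(test_instance.ground_truth - model_prediction)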

This isn’t precisely the DSPy syntax (I’m skipping some small details for simplicity here, but this gives the general idea). In this case, we assume each test instance contains a document that can be sent to the LLM and a ground truth value (a number between 0 and 10 — indicating how plausible it truly is, probably based on human evaluation). And we assume the model prediction is also a number between 0 and 10. To score the response, we simply take the difference between these two scores, so the smaller the difference, the better the response (the closer it was to the ground truth).
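One wrinkle the simplified function glosses over: an absolute difference is a lower-is-better number, whereas optimization tools usually expect a score where larger means better. Below is a sketch of one way to flip it onto a 0-to-1 scale, assuming the same 0 to 10 range. The `(example, prediction, trace=None)` argument order follows the convention used for DSPy metrics, but the field names `ground_truth` and `plausibility` are assumptions, and `SimpleNamespace` objects stand in for real DSPy objects so the snippet runs on its own.

```python
from types import SimpleNamespace

MAX_SCORE = 10  # plausibility is rated on a 0-10 scale


def plausibility_metric(example, prediction, trace=None):
    """Return 1.0 for a perfect match, falling linearly to 0.0 at the largest error."""
    try:
        predicted = float(prediction.plausibility)
    except (TypeError, ValueError):
        return 0.0  # an answer that is not a number gets the worst score
    # Clamp out-of-range answers so the result stays between 0 and 1.
    predicted = max(0.0, min(float(MAX_SCORE), predicted))
    error = abs(example.ground_truth - predicted)
    return 1.0 - error / MAX_SCORE


# Hypothetical test instance and three possible LLM responses.
example = SimpleNamespace(document="The moon is made of cheese.", ground_truth=1)

perfect = plausibility_metric(example, SimpleNamespace(plausibility="1"))
off_by_five = plausibility_metric(example, SimpleNamespace(plausibility="6"))
unparseable = plausibility_metric(example, SimpleNamespace(plausibility="very plausible"))

print(perfect, off_by_five, unparseable)
```

LLM outputs often arrive as text, which is why the sketch converts the prediction and treats anything it cannot parse as a failed response rather than raising an error.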

To test a given prompt, DSPy would automatically execute the prompt on a specified LLM, once for each of the test documents. In this example, for each document, it would ask for a score from 0 to 10 indicating its plausibility, and would compare the response to the ground truth.

It would then give an overall score on the test set (averaged over all test instances in the test set), which is our estimate of how strong that prompt is.

Then, if we wish to try a different prompt, or a different LLM, we can simply re-execute the testing process. That will generate another score, indicating how strong that combination of LLM and prompt is. If we try several prompts (or several LLMs), we can see which works best just by taking the one with the best overall score.
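The comparison itself is only a few lines once every prompt has a score. In this sketch, canned responses stand in for real LLM calls, and the prompt names, document IDs, and numbers are all invented. It uses the simplified scoring from above, the average absolute difference from the ground truth, so the lowest score wins.

```python
from statistics import mean

# Ground-truth plausibility (0-10) for three hypothetical test documents.
TEST_SET = {"doc_a": 9, "doc_b": 2, "doc_c": 5}

# Canned LLM responses for two candidate prompts (stand-ins for real calls).
CANNED_RESPONSES = {
    "short_prompt": {"doc_a": 9, "doc_b": 8, "doc_c": 9},
    "detailed_prompt": {"doc_a": 8, "doc_b": 3, "doc_c": 6},
}


def mean_absolute_error(prompt_name):
    """Average distance between the LLM's scores and the ground truth."""
    responses = CANNED_RESPONSES[prompt_name]
    return mean(abs(truth - responses[doc]) for doc, truth in TEST_SET.items())


scores = {name: mean_absolute_error(name) for name in CANNED_RESPONSES}
best_prompt = min(scores, key=scores.get)  # lower error is better here
print(scores, best_prompt)
```

Here the short prompt rates almost everything as highly plausible, so the detailed prompt comes out ahead.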

It’s a process that makes a lot of sense. It does require us to collect a decent amount of test data, but this is necessary if we want to provide any kind of evaluation of a prompt in any case. And it requires us to write a function that can, given an input to the LLM and the LLM’s response, score how strong the response is. This can be a bit of work to do in some cases (we do explain how to do this in the book!), but, once written, we can evaluate any number of responses to any number of prompts. And it lets us do so in a way that’s consistent and unbiased.

As indicated, if the LLM returns a short answer, such as with a classification problem, writing the function is going to be very easy. And, as we just saw, where the LLM returns a numeric score, the function can also be quite easy.

If the LLM returns a longer answer, often (though not always) we’ll use an LLM-as-a-judge approach, where we get one LLM to evaluate the response of another LLM. This isn’t perfect, but it does remove human biases, and it can be automated. Which makes it feasible to test many candidate prompts and to test each thoroughly.
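A bare-bones sketch of the LLM-as-a-judge idea, assuming a summarization task and a 1 to 5 rating scale. The template wording, the scale, and the `call_llm` parameter are illustrative choices rather than anything prescribed by DSPy, and `fake_judge_llm` returns a fixed reply in place of a real model.

```python
import re

JUDGE_TEMPLATE = (
    "You are grading a summary.\n"
    "Source text:\n{source}\n\n"
    "Summary:\n{summary}\n\n"
    "Rate the summary's faithfulness from 1 to 5. Reply with the number only."
)


def parse_rating(judge_reply, low=1, high=5):
    """Return the first integer in the reply, or None if missing or out of range."""
    match = re.search(r"\d+", judge_reply)
    if match is None:
        return None
    rating = int(match.group())
    return rating if low <= rating <= high else None


def judge(source, summary, call_llm):
    """Ask a judge LLM to rate a summary; call_llm is any function from prompt to reply."""
    reply = call_llm(JUDGE_TEMPLATE.format(source=source, summary=summary))
    return parse_rating(reply)


def fake_judge_llm(prompt):
    """Stand-in for a real judge model."""
    return "Rating: 4"


rating = judge("A long article about tides.", "Tides are caused by the moon.", fake_judge_llm)
print(rating)
```

Parsing defensively matters because the judge is itself an LLM and will not always follow the “number only” instruction.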

So, DSPy essentially does for you what you’d likely end up coding yourself if you took a step back and thought about how you could automate this process — how you could automate searching for a strong prompt. At least, you’d likely end up coding this yourself if you had an enormous amount of free time, and were the only person in the world solving this problem — the problem of having to craft and evaluate many candidate prompts for each LLM-based task. However, given so many of us are facing the same challenges, having tools take care of the repetitive work for us is, at least in retrospect, very natural.

What DSPy does for you

DSPy does for you much of the work that you’d need to do manually if taking a prompt engineering approach. It does at least three major things (actually, it does a bit more, but for this article, we’ll just look at what are likely the most important).

It automatically generates a prompt for you. You simply need to provide a short, high-level overview of the task, which can be provided in a string (or in other formats, but strings are the simplest). In this example, we may specify: “document -> assessment_of_plausibility”. Another example may be: “journal_article -> summary, critique”, which indicates that the LLM should take a journal article and return a summary of it and a critique. DSPy does allow us to provide more information about the task as well, but generally we can keep it quite high-level.
It automatically evaluates the prompt for you. You do need to provide the test data and a Python function to evaluate each response, but given that, DSPy allows you to fully, and consistently, evaluate each prompt (and each LLM) you try.
It automatically optimizes the prompt for you. This is possibly the most powerful element of DSPy. I’ll describe this next.
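To show how much information such a short string carries, here is a toy parser that splits it into input and output field names and drafts a naive prompt from them. This is not DSPy’s parser or its prompt format; both function names and the wording of the drafted prompt are invented for illustration.

```python
def parse_signature(signature):
    """Split 'a, b -> c, d' into (['a', 'b'], ['c', 'd'])."""
    if signature.count("->") != 1:
        raise ValueError("Expected exactly one '->' in the signature")
    left, right = signature.split("->")
    inputs = [name.strip() for name in left.split(",") if name.strip()]
    outputs = [name.strip() for name in right.split(",") if name.strip()]
    if not inputs or not outputs:
        raise ValueError("Need at least one input and one output field")
    return inputs, outputs


def draft_prompt(signature):
    """Turn the field names into a very plain instruction."""
    inputs, outputs = parse_signature(signature)
    return (
        f"Given the fields {', '.join(inputs)}, "
        f"produce the fields {', '.join(outputs)}."
    )


fields = parse_signature("journal_article -> summary, critique")
prompt = draft_prompt("document -> assessment_of_plausibility")
print(fields)
print(prompt)
```

The field names do a lot of the work, which is why descriptive names such as `assessment_of_plausibility` are worth the extra typing.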

Optimizing your prompts

To optimize your prompts, DSPy essentially goes into a loop that looks like the following (this is a bit oversimplified; we do describe it fully in the book, but this gives the general idea):

best_prompt = ""
loop
  generate a new candidate prompt
  evaluate this candidate prompt
  if this is the best prompt so far:
    best_prompt = current prompt

This loops for as long as you indicate (the longer it searches for better prompts, the stronger prompts it will tend to find, though there are, of course, diminishing returns). As it loops, it generates new candidate prompts. To do this, DSPy uses a technique called meta-prompting, where one LLM is used to generate the prompt used for another LLM. For each candidate prompt generated, DSPy then evaluates it.
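The same loop as runnable Python, as a sketch only. The generator and the evaluator are passed in as functions. Here they are toy stand-ins (the generator cycles through a fixed list of hints, and the evaluator returns a made-up score), whereas in a real run the generator would be an LLM doing the meta-prompting and the evaluator would score the prompt on test data.

```python
import itertools


def search_for_prompt(generate_candidate, evaluate, num_candidates):
    """Keep the highest-scoring prompt out of num_candidates attempts."""
    best_prompt, best_score = None, float("-inf")
    for _ in range(num_candidates):
        candidate = generate_candidate(best_prompt)
        score = evaluate(candidate)
        if score > best_score:
            best_prompt, best_score = candidate, score
    return best_prompt, best_score


# Toy stand-ins so the loop can run without calling any LLM.
BASE = "Assess how plausible the following text is."
HINTS = itertools.cycle([
    "Be brief.",
    "Judge the general intent, not the literal wording.",
    "Answer with a number from 0 to 10.",
    "Ignore formatting issues.",
])


def toy_generate(best_so_far):
    """Pretend meta-prompting: append the next hint to the base prompt."""
    return f"{BASE} {next(HINTS)}"


def toy_evaluate(prompt):
    """Pretend evaluation: prompts that specify the scale score higher."""
    return 0.9 if "0 to 10" in prompt else 0.5


best_prompt, best_score = search_for_prompt(toy_generate, toy_evaluate, num_candidates=4)
print(best_score, best_prompt)
```

The `num_candidates` argument is the budget: raising it lets the search look longer, at the cost of more evaluations.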

With weaker prompts, DSPy may actually use early stopping for efficiency, and so may quit evaluation early for any prompts that appear to perform poorly relative to the previously-tested candidate prompts. That is, if it generates any prompts that do poorly on a portion of the test data, there’s no need to test these prompts on the full test set. It will, though, completely evaluate the more promising prompts, and so can identify with confidence the strongest of the prompts that were tested.
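One simple way to picture this kind of early stopping: score a candidate on a small probe subset first, and only continue to the full test set if it is not clearly behind the best score so far. The probe size and margin below are arbitrary, and this is an illustrative heuristic, not DSPy’s actual rule.

```python
from statistics import mean


def evaluate_with_early_stopping(score_one, test_cases, best_so_far,
                                 probe_size=5, margin=0.2):
    """Return (mean score, finished_full_evaluation, number_of_cases_scored)."""
    probe_scores = [score_one(case) for case in test_cases[:probe_size]]
    if mean(probe_scores) < best_so_far - margin:
        # Clearly worse than the current best: skip the remaining cases.
        return mean(probe_scores), False, len(probe_scores)
    remaining = [score_one(case) for case in test_cases[probe_size:]]
    all_scores = probe_scores + remaining
    return mean(all_scores), True, len(all_scores)


def weak_prompt_score(case):
    """Stand-in for scoring a weak prompt on one test case."""
    return 0.3


def strong_prompt_score(case):
    """Stand-in for scoring a strong prompt on one test case."""
    return 0.8


test_cases = list(range(20))  # placeholders for 20 test documents
weak_result = evaluate_with_early_stopping(weak_prompt_score, test_cases, best_so_far=0.7)
strong_result = evaluate_with_early_stopping(strong_prompt_score, test_cases, best_so_far=0.7)
print(weak_result, strong_result)
```

The weak candidate is dropped after 5 of the 20 cases, while the strong one is scored on all 20.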

DSPy includes a number of different processes to generate the prompts. The more effective actually learn as they go. As each candidate prompt is evaluated, DSPy can learn where each prompt performs well and where it performs poorly (it can see which test cases do well and poorly, but DSPy can actually also see why each prompt does well in some cases and poorly in others). It can then take advantage of this to suggest more and more promising candidate prompts, and so the prompts tend to work better and better as the process continues.

After running DSPy

Once you’ve run DSPy, you’ll have a prompt for your task and you’ll also have an estimate of how well it will work in production, based on how well it behaves on your test data. (Much like with machine learning, we generally divide the data we have into training, validation, and test data, so will ideally have a held-out set used only for a final evaluation.)
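Splitting the data takes only a few lines. This sketch uses a 60/20/20 split and a fixed seed, both arbitrary choices, and the document IDs are placeholders for real examples.

```python
import random


def split_dataset(examples, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle a copy of the examples and cut it into train, validation, and test."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # held out for the final evaluation only
    return train, val, test


examples = [f"doc_{i}" for i in range(30)]
train, val, test = split_dataset(examples)
print(len(train), len(val), len(test))
```

Fixing the seed keeps the split reproducible, so scores from different runs are compared on the same held-out documents.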

That can provide a good basis for deciding if it’s strong enough to put in production or not. If not, you can allocate more time to optimizing the prompt. Or you can look at another LLM: once your code is set up, evaluating another LLM just requires specifying the LLM and re-executing the code. You will have to pay for the LLM calls (unless using a locally hosted LLM), but you’ll likely have no additional work to do.

Sample code

Most of the time the code you’ll need to write to use DSPy will be pretty short and simple. I’ll include an example here, though won’t fully explain it (I will, hopefully, in future articles). This should, though, give you the gist of what’s involved with working with DSPy. It does require a pip install and some imports. Once you have that, it’s all fairly straightforward.

import dspy

OPENAI_API_KEY = "YOUR_API_KEY"  # indicate your API key
lm = dspy.LM("openai/gpt-4o-mini", api_key=OPENAI_API_KEY)
dspy.settings.configure(lm=lm)

# Signature: two inputs (question, context) and two outputs (answer, confidence)
predictor = dspy.Predict("question, context -> answer, confidence")
prediction = predictor(question="What is the capital of France?", context="")
print(prediction.answer, prediction.confidence)

This code doesn’t include any optimization or evaluation (it will simply produce a prompt and handle interacting with the LLM), but does show a fully working DSPy program. It first imports dspy, then specifies the LLM to use and the API key for that. In this example, an OpenAI model is used, but DSPy supports dozens of different providers. It then specifies at a high level the task: given a question and some context, the LLM should return the answer and the confidence for that answer. It then asks a specific question (in this example, “What is the capital of France?”, without any additional context), and displays the answer. In testing this, we consistently received:

Paris, High

This indicates the answer is Paris and that the LLM has high confidence in the answer.

Given some evaluation and optimization, the code will be a bit longer, but not dramatically so. This example shows a very simple task, but with more difficult tasks, evaluation and optimization will normally be important. Doing this is all quite manageable, as DSPy keeps most of the complexity under the hood.

Conclusions

DSPy can’t guarantee an extremely effective prompt for every task with every LLM. But it does save you a lot of labour, and will tend to do as well as, or better than, a professional prompt engineer. In future articles, I’ll hopefully cover some experiments pitting DSPy against manual prompt engineering, but in a nutshell, DSPy has come out ahead consistently so far. For any LLM-based applications we create, it’s usually worth using DSPy to create and evaluate the prompts. The framework doesn’t take too long to learn, and once you do, you’re set on any projects you work on.

Realistically, I won’t always use DSPy in contexts where I don’t need a strong prompt, or where the task is so simple for an LLM that any basic prompt will do. But any time I’m in a situation where it looks like I may need to do some prompt engineering, I’d use DSPy to automate all that work for me. Instead of manually creating and testing every candidate prompt, I can just set up some DSPy code and let it do the work. It’s like having my own prompt engineering assistant.

It can take some time to execute. I’ll often let it run for 20 or 30 minutes or more to get a good prompt. But it’s doing the work, not me. One thing to watch for is LLM costs, though DSPy does let you monitor that. In most cases, having higher quality prompts is cheaper in the long run, though in some cases that won’t be true, and we should constrain the time DSPy spends trying to come up with stronger prompts.

This is easy enough to do: we just need to tell DSPy to spend a reasonable amount of time searching for the best prompt it can find. We can, for example, have it try just a small number of candidate prompts and take the strongest. In other cases it can be well worth letting it test many candidate prompts.

I’ll hopefully get some more articles up explaining DSPy in the future.


