How to make your AI application feel faster and more interactive with response streaming

In my last post, I talked about prompt caching and caching in general, and how it can improve your AI application in terms of cost and latency. However, even with a fully optimized AI application, some answers will simply take time to generate, and there is nothing we can do about it. If we ask the model for a long output, or if it needs to reason or think deeply, it will naturally take longer to respond. While this makes sense, waiting too long for a response can frustrate users and degrade their overall experience with the AI application. Fortunately, there is a simple and straightforward way to mitigate this issue: response streaming.
Streaming means receiving the model's response incrementally, as it is generated, rather than waiting for the entire response to be generated before showing it to the user. Normally (without streaming), we send a request to the model's API, wait for the model to generate the full response, and only when it is complete does the API return it to us in one piece. With streaming, however, the API returns partial results while the response is still being generated. The concept should feel familiar, because many user-facing AI applications, like ChatGPT, have streamed their responses to users since the day they first appeared. And beyond ChatGPT and LLMs, streaming is used everywhere on the web and in modern applications, such as live notifications, multiplayer games, or live news feeds. In this post, we'll go a step further and explore how we can integrate streaming with model APIs ourselves and achieve the same effect in custom AI applications.
There are a few different ways to implement streaming in an application. In AI applications, however, two types of streaming are widely used. Specifically, those are:
- HTTP streaming with Server-Sent Events (SSE): a simple, one-way type of streaming, which allows live communication only from server to client.
- Streaming via WebSockets: a more advanced and complex form of streaming, which allows live, two-way communication between server and client.
In the context of AI applications, HTTP streaming with SSE can support simple AI applications where we just need to stream the model's response for latency and UX reasons. However, as we move beyond simple request–response patterns to more advanced setups, WebSockets become especially useful, as they allow live, two-way communication between our application and the model API. For example, in code assistants, multi-agent systems, or tool-calling workflows, the client may need to send intermediate updates, user interactions, or feedback back to the server while the model is still generating a response. For the many simple AI applications where we just need the model to stream back an answer, though, WebSockets are often overkill, and SSE is sufficient.
In the rest of this post, we'll take a closer look at streaming in simple AI applications using HTTP streaming with SSE.
. . .
What is HTTP streaming with SSE?
HTTP streaming with Server-Sent Events (SSE) is, as the name suggests, built on top of plain HTTP streaming.
. . .
HTTP streaming means that a server can send whatever it has to send in chunks, rather than sending it all at once. This is achieved by the server not closing the connection with the client after sending the first part of the response, but instead keeping it open and immediately pushing any additional data to the client as it becomes available.
For example, instead of getting the answer in one piece:
Hello world!
we can get it in parts using the raw HTTP stream:
Hello
World
!
If we were to use raw HTTP streaming from scratch, we would have to handle everything ourselves, including parsing the streamed text, handling any errors, and reconnecting to the server. In our example, using a raw HTTP stream, we would have to somehow signal to the client that 'Hello' is one event, 'World' another, and '!' a third. Fortunately, there are several standards and wrappers that simplify HTTP streaming, one of them being HTTP streaming with Server-Sent Events (SSE).
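To get a feel for what this involves, here is a minimal sketch of consuming a raw HTTP stream in Python with the requests library (the endpoint URL is hypothetical):

import requests

# Hypothetical endpoint that streams its response body in chunks
url = "https://example.com/stream"

# stream=True tells requests not to wait for the whole body at once
with requests.get(url, stream=True) as response:
    for chunk in response.iter_content(chunk_size=None, decode_unicode=True):
        # Each chunk is just raw text: nothing tells us where one
        # event ends and the next begins; we'd have to define that ourselves
        print(chunk, end="", flush=True)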
. . .
In short, Server-Sent Events (SSE) provide a standardized way to use HTTP streams by organizing the server's output into well-defined events. This structure makes it very easy to parse and process the streamed responses on the client side.
Each event usually includes:
- id: a unique identifier for the event
- event: the type of the event
- data: the payload of the event
Or, in raw form:
id:
event:
data:
Our example using SSE would look like this:
id: 1
event: message
data: Hello world!
But what counts as an event? Anything can qualify: a single word, a sentence, or thousands of words. What exactly constitutes an event in our particular application is defined by the API setup or the server we are connected to.
On top of this, SSE comes with various other useful features, such as automatically reconnecting to the server if the connection is lost. Another is that incoming streamed messages are clearly marked with the text/event-stream content type, which allows the client to handle them correctly and avoid errors.
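To make this structure concrete, here is a minimal (and simplified) sketch of how a client could parse such a stream by hand in Python. In practice you would typically use a ready-made SSE client library, and the endpoint URL here is hypothetical:

import requests

# Hypothetical SSE endpoint; the response is marked as text/event-stream
url = "https://example.com/sse"

with requests.get(url, stream=True, headers={"Accept": "text/event-stream"}) as response:
    event = {}
    for line in response.iter_lines(decode_unicode=True):
        if line == "":
            # A blank line marks the end of one event
            if "data" in event:
                print(f"[{event.get('event', 'message')}] {event['data']}")
            event = {}
        elif ":" in line:
            # Each line is a "field: value" pair (id, event, data, ...)
            field, _, value = line.partition(":")
            event[field] = value.strip()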
. . .
Rolling up our sleeves
Frontier LLM APIs like OpenAI's API or Anthropic's Claude API natively support HTTP streaming with SSE. This makes integrating streaming into your applications easy, as it can often be achieved by changing a single parameter in the API call (e.g., setting stream=True).
Once streaming is enabled, the API no longer waits for the full response to be ready before replying. Instead, it returns small parts of the model's output as they are produced. On the client side, we can iterate over these chunks and display them continuously to the user, creating the familiar ChatGPT-style typing effect.
Let's see a small example of this using, as always, OpenAI's API:
from openai import OpenAI

client = OpenAI(api_key="your_api_key")

# Ask the API to stream the response as it is generated
stream = client.responses.create(
    model="gpt-4.1-mini",
    input="Explain response streaming in 3 short paragraphs.",
    stream=True,
)

full_text = ""
for event in stream:
    # Only print the text deltas as the text parts arrive
    if event.type == "response.output_text.delta":
        print(event.delta, end="", flush=True)
        full_text += event.delta

print("\n\nFinal collected response:")
print(full_text)
In this example, instead of receiving a single completed response, we loop through a stream of events and print each piece of text as it arrives. At the same time, we accumulate the pieces in full_text, so the complete response is available later if we need it.
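From here, if our application has its own frontend, a natural next step is to re-expose these chunks to the browser as an SSE endpoint of our own. Below is a minimal sketch of how this might look using FastAPI; the /chat route and its setup are illustrative, not a production-ready implementation:

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from openai import OpenAI

app = FastAPI()
client = OpenAI(api_key="your_api_key")

def sse_events(prompt: str):
    # Ask the model API for a streamed response
    stream = client.responses.create(
        model="gpt-4.1-mini",
        input=prompt,
        stream=True,
    )
    # Re-emit each text delta as an SSE "data:" line
    for event in stream:
        if event.type == "response.output_text.delta":
            yield f"data: {event.delta}\n\n"

@app.get("/chat")
def chat(prompt: str):
    # text/event-stream tells the browser to treat this response as SSE
    return StreamingResponse(sse_events(prompt), media_type="text/event-stream")

This way, the model's output flows from the model API, through our server, to the user's browser, chunk by chunk, end to end.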
. . .
So, should I just set stream=True on all requests?
The short answer is no. Although useful, with great potential to improve the user experience, streaming is not a one-size-fits-all solution for AI applications, and we must use our judgment to assess where and when it should be used.
Specifically, adding streaming to an AI app works best in setups where we expect long responses, and value the user experience and app responsiveness above all else. One such case would be consumer-facing chatbots.
On the other hand, in simple applications where we expect responses to be short, adding streaming will not noticeably improve the user experience and does not make much sense. On top of this, streaming mostly makes sense when the model's output is free text; it is far less useful for structured output (e.g., JSON), since a partial JSON object cannot be used until it is complete.
Most importantly, the biggest challenge of streaming is that we cannot review the full response before showing it to the user. Remember, LLMs generate tokens one by one, and the meaning of the response takes shape as it is generated, not in advance. If we send 100 requests to an LLM with the exact same input, we can get 100 different answers. In other words, no one knows what a response will say before it is finished. As a result, once streaming is enabled it becomes very difficult to moderate the model's output before showing it to the user, or to apply any guarantees to the generated content. We can always try to check the partial completions, but partial completions are very hard to validate, as we have to guess where the model is going. Add to this that these checks must run in real time, not just once but repeatedly for each new chunk of the response, and the process becomes even more challenging. In practice, in such cases, validation is usually performed on the full output after the response is complete. The problem with this is that by then it may be too late, as we may have already shown the user content that does not pass our checks.
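One possible compromise, sketched below, is to buffer the stream and release text to the user in small validated pieces, for example at sentence boundaries. The is_safe() check here is a hypothetical placeholder, and note that this approach still trades away some of streaming's responsiveness:

def is_safe(text: str) -> bool:
    # Hypothetical placeholder check; in a real app this could be a
    # moderation API call or a rule-based filter
    return "forbidden" not in text.lower()

buffer = ""
for event in stream:  # the same stream as in the earlier example
    if event.type == "response.output_text.delta":
        buffer += event.delta
        # Only release the buffer at sentence boundaries, after checking it
        if buffer.endswith((".", "!", "?", "\n")):
            if is_safe(buffer):
                print(buffer, end="", flush=True)
            buffer = ""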
. . .
Final thoughts
Streaming is a feature that has no real impact on the actual performance of an AI application, or on its cost and latency. However, it can have a significant impact on how the user perceives and experiences the application. Streaming makes AI systems feel faster, more responsive, and more interactive, even when the time to generate a complete response remains exactly the same. That said, streaming is not a silver bullet. Different applications and environments benefit more or less from introducing it. As with most decisions in AI engineering, it's less about what's possible and more about what makes sense for your particular use case.
. . .
If you've made it this far, you may find pialgorithms useful — a platform we've been building that helps teams securely manage organizational information in one place.
. . .
Did you like this post? Follow me on 💌 Substack and 💼 LinkedIn
. . .
All images by the author, unless otherwise noted.



