How We Reduced Our LLM Costs by 90% with 5 Lines of Code

Ever had that feeling when everything seems to be working just fine, until you look under the hood and notice your bill is burning 10× more than it should?
We ran into this with a client script that fires requests to evaluate our prompts, built with async Python and running in a Jupyter notebook. Clean, simple, and fast. We used it to test our models and collect evaluation data. No red flags. No warnings.
But beneath the surface, something wasn't right.
We didn't see failures. We didn't hit timeouts. We didn't even notice any extra latency. But our program was doing far more work than it needed to, and we couldn't see it.
In this post, we'll walk through how we found the problem, what caused it, and how a simple structural change in our async code cut our LLM traffic and costs by 90%, with virtually no loss in speed or functionality.
Now, the obligatory caveat: reading this post will not magically cut your costs by 90%. The takeaway here is broader: small oversights, sometimes just a few lines of code, can lead to massive inefficiencies. Understanding how your code really behaves can save you time, money, and frustration down the road.
The topic itself can feel niche at first. It touches on the internals of Python's async behavior: how tasks are created and scheduled. If you're familiar with Python and async/await, you'll get the most out of the code examples, but even if you're not, there's still a takeaway. Because the real story here isn't about LLMs or Python; it's about practical, pragmatic engineering.
Let's dive in.
The Setup
For evaluation, we run a predefined dataset through our prompts using the client script. The evaluation only needs a small slice of the data, so the client stops as soon as it has received a certain number of answers.
Here is a simplified version of our Python client:
import asyncio

from aiohttp import ClientSession
from tqdm.asyncio import tqdm_asyncio

URL = "http://localhost:8000/example"  # the dummy server defined below
NUMBER_OF_REQUESTS = 100
STOP_AFTER = 10


async def fetch(session: ClientSession, url: str) -> bool:
    async with session.get(url) as response:
        body = await response.json()
        return body["value"]


async def main():
    results = []
    async with ClientSession() as session:
        tasks = [fetch(session, URL) for _ in range(NUMBER_OF_REQUESTS)]
        for future in tqdm_asyncio.as_completed(tasks, total=NUMBER_OF_REQUESTS, desc="Fetching"):
            response = await future
            if response is True:
                results.append(response)
                if len(results) >= STOP_AFTER:
                    print(f"\n✅ Stopped after receiving {STOP_AFTER} true responses.")
                    break


asyncio.run(main())
This script reads prompts from the dataset, fires them all off concurrently, and stops as soon as we have collected enough true answers for our evaluation. In production the logic is more involved and depends on the diversity of the answers we need, but the structure is the same.
Let's use a dummy FastAPI server to simulate the real endpoint:
import asyncio
import random

import fastapi
import uvicorn

app = fastapi.FastAPI()


@app.get("/example")
async def example():
    sleeping_time = random.uniform(1, 2)
    await asyncio.sleep(sleeping_time)
    return {"value": random.choice([True, False])}


if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
Now let's spin up the dummy server and run the client. You will see something like this in the client's terminal:
Can you see the problem?

Looks good! It's fast, clean, and… wait. Is everything really working as expected?
On the surface, it seems like the client is doing the right thing: send requests, collect 10 true answers, then stop.
But is that so?
Let's add a few print statements to the server to see what it is actually doing under the hood:
import asyncio
import random

import fastapi
import uvicorn

app = fastapi.FastAPI()


@app.get("/example")
async def example():
    print("Got a request")
    sleeping_time = random.uniform(1, 2)
    print(f"Sleeping for {sleeping_time:.2f} seconds")
    await asyncio.sleep(sleeping_time)
    value = random.choice([True, False])
    print(f"Returning value: {value}")
    return {"value": value}


if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
Now run everything again.
You will start to see logs like these:
Got a request
Sleeping for 1.11 seconds
Got a request
Sleeping for 1.29 seconds
Got a request
Sleeping for 1.98 seconds
...
Returning value: True
Returning value: False
Returning value: False
...
Look at the server logs. You will see something unexpected: instead of processing just the 14 requests shown in the client's progress bar, the server handled all 100. Even though the client stops after receiving 10 true answers, it has already sent every request up front. As a result, the server has to process all of them.
It is an easy mistake to miss, especially because everything looks fine from the client's point of view: responses come in quickly, the progress bar advances, and the script exits early. But behind the scenes, all 100 requests are sent immediately, regardless of when we decide to stop listening. That means 10× more traffic than needed, driving up costs, increasing server load, and eating into rate limits.
The key question, then, is: why does this happen, and how can we make sure we only send the requests we actually need? The answer turned out to be a small but powerful change.
The root of the problem lies in how the tasks are scheduled. In our original code, we create a list of 100 coroutines at once:
tasks = [fetch(session, URL) for _ in range(NUMBER_OF_REQUESTS)]
for future in tqdm_asyncio.as_completed(tasks, total=NUMBER_OF_REQUESTS, desc="Fetching"):
    response = await future
When you pass the list of coroutines to as_completed, Python wraps each coroutine in a Task and schedules it on the event loop. This happens before you ever enter the loop body. And once a coroutine becomes a Task, the event loop starts driving it in the background.
as_completed itself does not control concurrency; it simply waits for tasks to complete and yields them one by one, in completion order. Think of it as an iterator over futures, not a traffic controller. This means that by the time you start iterating, all 100 requests are already in flight. Breaking out after 10 true results stops you from consuming further responses, but it does not stop the requests from being sent.
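You can see this eager scheduling in isolation with a small, self-contained sketch (no HTTP involved; work and the started counter are illustrative names, not part of our client):

```python
import asyncio

started = 0  # how many coroutines actually began executing


async def work(i: int) -> int:
    global started
    started += 1           # runs as soon as the event loop gives the task its first tick
    await asyncio.sleep(0.01)
    return i


async def main():
    coros = [work(i) for i in range(100)]
    for future in asyncio.as_completed(coros):
        await future       # by now, as_completed has wrapped ALL coroutines in Tasks
        break              # breaking early does not un-schedule the rest

    print(f"Started {started} of 100 tasks despite breaking after the first result")


asyncio.run(main())
```

All 100 tasks report that they started, even though we only awaited one of them.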
To fix this, we introduced an asyncio.Semaphore to cap concurrency. The semaphore adds a lightweight gate inside fetch, so only a fixed number of requests can be in flight at any moment; the rest stay suspended, waiting for a slot. As soon as we hit our stopping condition, the still-waiting tasks never acquire the semaphore, so they never send their requests.
Here is a fixed version:
import asyncio

from aiohttp import ClientSession
from tqdm.asyncio import tqdm_asyncio

URL = "http://localhost:8000/example"  # the dummy server defined above
NUMBER_OF_REQUESTS = 100
STOP_AFTER = 10


async def fetch(session: ClientSession, url: str, semaphore: asyncio.Semaphore) -> bool:
    async with semaphore:  # only a limited number of tasks get past this line at once
        async with session.get(url) as response:
            body = await response.json()
            return body["value"]


async def main():
    results = []
    semaphore = asyncio.Semaphore(int(STOP_AFTER * 1.5))
    async with ClientSession() as session:
        tasks = [fetch(session, URL, semaphore) for _ in range(NUMBER_OF_REQUESTS)]
        for future in tqdm_asyncio.as_completed(tasks, total=NUMBER_OF_REQUESTS, desc="Fetching"):
            response = await future
            if response:
                results.append(response)
                if len(results) >= STOP_AFTER:
                    print(f"\n✅ Stopped after receiving {STOP_AFTER} true responses.")
                    break


asyncio.run(main())
With this change, we still define 100 requests up front, but only a small group, 15 in this example, is allowed to run at the same time. When we hit our stopping condition early, the event loop stops before dispatching the remaining requests. This keeps the client just as responsive while eliminating the unnecessary calls.
Now the server logs show only about 20 "Got a request" / "Returning value" entries. On the client side, the progress bar looks exactly the way it did before.
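You can check the cap without running a server at all. The following sketch uses illustrative counters and a stand-in 1-in-2 chance of a "true" answer, none of which come from our real client:

```python
import asyncio

in_flight = 0   # requests currently "on the wire"
peak = 0        # highest concurrency observed
sent = 0        # total requests actually dispatched


async def fetch(i: int, semaphore: asyncio.Semaphore) -> bool:
    global in_flight, peak, sent
    async with semaphore:          # tasks queue here until a slot frees up
        sent += 1
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)  # stand-in for the HTTP round trip
        in_flight -= 1
        return i % 2 == 0          # stand-in for a "true" answer


async def main():
    semaphore = asyncio.Semaphore(15)
    results = []
    coros = [fetch(i, semaphore) for i in range(100)]
    for future in asyncio.as_completed(coros):
        if await future:
            results.append(True)
            if len(results) >= 10:
                break              # asyncio.run cancels the tasks still waiting

    print(f"Peak concurrency: {peak}, requests sent: {sent} of 100")


asyncio.run(main())
```

The tasks still waiting on the semaphore when we break out are cancelled during shutdown before they ever "send" anything, so the sent counter stays well below 100 while peak never exceeds 15.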

With this one change, we saw an immediate impact: a 90% reduction in request volume and LLM cost, with no visible degradation in the client experience. It also reduced pressure on the shared endpoint, shortened queues, and kept us comfortably within our LLM providers' rate limits.
This small structural fix made our evaluation pipeline dramatically more efficient without adding any real complexity to the code. It is a good reminder that in async programs, the control flow does not always behave the way you expect unless you understand how tasks are created and executed.
Bonus Insight: Closing the Event Loop
If we had run the original client code without asyncio.run, we might have noticed the problem sooner.
For example, if we had used manual event loop management:
loop = asyncio.get_event_loop()
loop.run_until_complete(main())
loop.close()
Python would have printed warnings like:

Task was destroyed but it is pending!

These warnings appear when the program exits while unfinished async tasks are still scheduled on the loop. A screen full of them would have been a hard-to-miss red flag.
So why don't we see those warnings when we use asyncio.run()?
Because asyncio.run() takes care of cleanup behind the scenes. It does not just run your coroutine and exit; it also cancels any leftover tasks, waits for them to finish cancelling, and then closes the event loop. This built-in safety net prevents the "task was destroyed but it is pending" warnings from showing up, even when your code quietly schedules more tasks than it needs.
By contrast, those pending-task warnings surface when you close the loop manually with loop.close() after run_until_complete(), because the remaining tasks are never cancelled or awaited. Python sees that you are closing the loop while work is still scheduled, and warns you about it.
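For illustration, here is roughly the cleanup that asyncio.run performs, spelled out by hand. The forever coroutine is a hypothetical stand-in for our leaked requests:

```python
import asyncio


async def forever():
    await asyncio.sleep(3600)  # never finishes on its own


async def main():
    asyncio.ensure_future(forever())  # a leaked task, like the unsent requests
    await asyncio.sleep(0.01)


# manual loop management, plus the cleanup asyncio.run would otherwise do for us
loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)
try:
    loop.run_until_complete(main())
finally:
    leftover = asyncio.all_tasks(loop)  # tasks still scheduled on the loop
    for task in leftover:
        task.cancel()
    # wait for the cancellations to take effect before closing
    loop.run_until_complete(asyncio.gather(*leftover, return_exceptions=True))
    loop.close()  # safe now: nothing pending, so no warnings
```

Skip the finally block and Python will complain about the pending forever task at shutdown; with it, the loop closes cleanly.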
This does not mean every async Python program should avoid asyncio.run(), or should routinely use loop.run_until_complete() followed by loop.close(). But it highlights something important: you should know what tasks are still running when your program exits. At the very least, it is a good idea to monitor or log any outstanding tasks before shutdown.
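One way to do that last check is asyncio.all_tasks(), sketched here with an illustrative fire-and-forget background coroutine:

```python
import asyncio

pending_count = 0  # how many tasks were still alive at shutdown


async def background():
    await asyncio.sleep(10)  # hypothetical long-running work we never await


async def main():
    global pending_count
    for _ in range(3):
        asyncio.ensure_future(background())  # fire-and-forget, like our leaked requests
    await asyncio.sleep(0.01)                # let the tasks start

    # just before exiting, inspect what is still scheduled on the loop
    pending = [t for t in asyncio.all_tasks() if t is not asyncio.current_task()]
    pending_count = len(pending)
    print(f"{pending_count} tasks still pending at shutdown")
    for t in pending:
        t.cancel()  # cancel them explicitly instead of leaning on asyncio.run


asyncio.run(main())
```

A log line like this at the end of a run is exactly the kind of signal that would have exposed our 90 unsent-but-scheduled requests.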
Final Thoughts
By stepping back and rethinking our control flow, we made our evaluation process far more efficient, not by adding infrastructure, but by scheduling the work we already had more carefully. A few changed lines of code produced a 90% cost reduction with virtually no downside. It eliminated rate-limit errors, reduced load on the system, and let the team run evaluations frequently without creating bottlenecks.
It is an important reminder that "clean" async code is not always efficient async code, and that using resources deliberately matters. Practical, pragmatic engineering is about more than just writing code that works; it is about designing programs that respect time, money, and shared resources, especially in production. When you treat compute as a shared, finite resource rather than an infinite pool, everyone benefits: lower bills, faster-moving teams, and costs that stay predictable.
So the next time you are firing off LLM calls, scheduling background jobs, or processing data in batches, pause and ask: am I only using what I really need?
Often, the answer, and the improvement, is just one line of code away.



