Serving an ML Model with FastAPI and Redis for Fast Predictions

Have you ever waited too long for a model to return its predictions? We've all been there. Machine learning models, especially large, complex ones, can be slow to serve in real time, while users expect a quick response. That's when latency becomes a real problem. One of the biggest bottlenecks in serving models is that identical inputs trigger the same slow computation over and over. In this blog, I'll show you how to fix that. We will build a FastAPI-based ML service and combine it with Redis caching so that repeated predictions are returned in milliseconds.
What is FastAPI?
FastAPI is a modern, high-performance Python web framework for building APIs. It uses Python type hints for data validation and automatically generates API documentation with Swagger UI and ReDoc. Built on top of Starlette and Pydantic, FastAPI supports asynchronous request handling, giving it performance comparable to Node.js and Go. Its design enables rapid development of robust, production-ready APIs, which makes it an excellent choice for serving machine learning models as REST services.
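For instance, a tiny endpoint like the sketch below (separate from the prediction service we build later; Item is just an illustrative model) already gets request validation and interactive docs at /docs for free:
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Item(BaseModel):
    name: str
    price: float

@app.post("/items")
def create_item(item: Item):
    # FastAPI validates the JSON body against Item and documents this
    # endpoint automatically at /docs (Swagger UI) and /redoc.
    return {"name": item.name, "price": item.price}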
What is Redis?
Redis (Remote Dictionary Server) is an open-source, in-memory data store that serves as a database, cache, and message broker. By keeping data in memory, Redis provides ultra-low latency for read and write operations, which makes it ideal for caching frequently requested machine learning predictions. It supports various data structures, including strings, lists, sets, and hashes, and offers features such as TTL (time-to-live) for cache expiration.
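As a quick illustration (a minimal sketch, assuming a Redis server is reachable on localhost:6379 and the redis Python package is installed):
import redis

r = redis.Redis(host="localhost", port=6379, db=0)
r.set("greeting", "hello")               # plain string key/value
r.setex("session:42", 60, "some-data")   # key that expires after 60 seconds (TTL)
print(r.get("greeting"))                 # b'hello' (values come back as bytes)
print(r.ttl("session:42"))               # remaining time-to-live in seconds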
Why Combine FastAPI and Redis?
Combining FastAPI with Redis creates a responsive and efficient system. FastAPI handles incoming API requests quickly and reliably, while Redis acts as a caching layer that stores the results of previous computations. When the same input arrives again, the result can be retrieved from Redis almost instantly, removing the need to rerun the model. This approach lowers latency, reduces computational load, and improves the scalability of your application. In larger deployments, Redis can also serve as a central cache shared by multiple FastAPI instances, which makes it a great fit for production.
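At its core this is the cache-aside pattern; a minimal sketch of it looks like the function below (cached_predict is an illustrative name, and the full FastAPI version follows in Step 3):
# Cache-aside in miniature: check Redis first, fall back to the model only on a miss.
def cached_predict(cache, model, key, features):
    hit = cache.get(key)
    if hit is not None:
        return hit.decode("utf-8")               # cache hit: skip the model entirely
    result = str(model.predict([features])[0])   # cache miss: run the model
    cache.set(key, result)                       # store for next time
    return result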
Now, let's walk through the implementation of a FastAPI application that serves a machine learning model and uses Redis to cache predictions. This setup ensures that repeated requests with the same input are served straight from the cache, cutting computation and improving response times. The steps are listed below:
- Loading a pre-trained model
- Building a FastAPI prediction endpoint
- Adding Redis caching
- Measuring the performance gain
Now, let's look at these steps in more detail.
Step 1: Loading the Pre-Trained Model
First, imagine you already have a trained machine learning model ready for deployment. In practice, most models (scikit-learn, TensorFlow/PyTorch, etc.) are trained offline and then loaded into the serving app. In our example, we will train a simple scikit-learn classifier on the popular Iris dataset and save it to disk. If you already have a saved model file, you can skip the training part and just load it. Here's how you can train the model and save it for serving:
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
import joblib
# Load example dataset and train a simple model (Iris classification)
X, y = load_iris(return_X_y=True)
# Train the model
model = RandomForestClassifier().fit(X, y)
# Save the trained model to disk
joblib.dump(model, "model.joblib")
# Load the pre-trained model from disk (using the saved file)
model = joblib.load("model.joblib")
print("Model loaded and ready to serve predictions.")
In the code above, we load the built-in Iris dataset, train a random forest classifier on it, and save the fitted model to a file called model.joblib. We then load it back with joblib.load. The joblib library is the usual choice for persisting scikit-learn models, largely because it handles the NumPy arrays inside them efficiently. After this step, we have a model object ready to make predictions on new data. Just a heads-up, though: you can plug in any pre-trained model here, and the way you serve it with FastAPI and cache the results stays more or less the same. The only requirement is that the model exposes a predict method that takes input features and returns a result. Also, make sure the model returns the same prediction every time it sees the same input (i.e., it is deterministic); otherwise caching becomes a problem, because the cache would keep returning a result that no longer matches what the model would produce.
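As a quick sanity check, you can call predict twice on the same input and confirm the outputs match (a minimal sketch; the sample values are just illustrative Iris measurements):
# Caching assumes the same input always maps to the same prediction.
sample = [[5.1, 3.5, 1.4, 0.2]]
assert model.predict(sample)[0] == model.predict(sample)[0]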
Step 2: Creating the FastAPI Prediction Endpoint
Now that we have a model, let's expose it via an API. We will use FastAPI to create a web server that listens for prediction requests. FastAPI makes it easy to define an endpoint and map request parameters to Python function arguments. In our example, we assume the model accepts four features, and we build a /predict endpoint that accepts these features as query parameters and returns the model's prediction.
from fastapi import FastAPI
import joblib
app = FastAPI()
# Load the trained model at startup (to avoid re-loading on every request)
model = joblib.load("model.joblib") # Ensure this file exists from the training step
@app.get("/predict")
def predict(sepal_length: float, sepal_width: float, petal_length: float, petal_width: float):
""" Predict the Iris flower species from input measurements. """
# Prepare the features for the model as a 2D list (model expects shape [n_samples, n_features])
features = [[sepal_length, sepal_width, petal_length, petal_width]]
# Get the prediction (in the iris dataset, prediction is an integer class label 0,1,2 representing the species)
prediction = model.predict(features)[0] # Get the first (only) prediction
return {"prediction": str(prediction)}
In the code above, we create the FastAPI app; running this file with a server such as Uvicorn starts the API. FastAPI is one of the fastest Python frameworks, so it can handle many requests with ease. We load the model once at startup because reloading it on every request would be slow; keeping it in memory means it is always ready to use. We then define a /predict endpoint with @app.get. Using GET makes testing easier since we can pass the inputs directly in the URL, but in real projects you will usually prefer POST, especially when sending larger or more complex payloads as JSON. The function takes four inputs: sepal_length, sepal_width, petal_length, and petal_width, which FastAPI automatically reads and converts from the query string. Inside the function, we wrap the inputs in a 2D list (scikit-learn expects a 2D array of shape [n_samples, n_features]), call model.predict(), which returns an array of predictions, take the first element, and return it as JSON: {"prediction": "..."}.
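For reference, a POST version with a JSON body might look like the sketch below (IrisFeatures is an illustrative name, not part of the code above):
from fastapi import FastAPI
from pydantic import BaseModel
import joblib

app = FastAPI()
model = joblib.load("model.joblib")

class IrisFeatures(BaseModel):
    sepal_length: float
    sepal_width: float
    petal_length: float
    petal_width: float

@app.post("/predict")
def predict_post(features: IrisFeatures):
    # The JSON body is validated against IrisFeatures before this function runs.
    data = [[features.sepal_length, features.sepal_width,
             features.petal_length, features.petal_width]]
    return {"prediction": str(model.predict(data)[0])}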
You can now run the app with uvicorn main:app --reload, hit the /predict endpoint, and get results. However, if you send the same input again, the model is re-run every time, which is wasteful. The next step adds Redis so that previous results are cached and recomputation is skipped.
Step 3: Adding Redis Caching for Predictions
To cache the model's output, we will use Redis. First, make sure a Redis server is running. You can install it locally or simply run it in a Docker container; by default it listens on port 6379. We will use the Python redis library to talk to the server.
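Before wiring it into the app, you can verify the connection with a quick ping (a minimal check, assuming Redis runs locally on the default port):
import redis

cache = redis.Redis(host="localhost", port=6379, db=0)
print(cache.ping())  # prints True if the server is reachable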
The idea is simple: when a request comes in, build a unique key that represents the input, then look that key up in Redis. If the key is already there, it means we have computed this result before, so we return the cached value without calling the model again. If it is not there, we call model.predict, get the output, store it in the cache, and return the prediction.
Now let's update the FastAPI app to add the caching logic.
pip install redis
import redis # New import to use Redis
# Connect to a local Redis server (adjust host/port if needed)
cache = redis.Redis(host="localhost", port=6379, db=0)
@app.get("/predict")
def predict(sepal_length: float, sepal_width: float, petal_length: float, petal_width: float):
"""
Predict the species, with caching to speed up repeated predictions.
"""
# 1. Create a unique cache key from input parameters
cache_key = f"{sepal_length}:{sepal_width}:{petal_length}:{petal_width}"
# 2. Check if the result is already cached in Redis
cached_val = cache.get(cache_key)
if cached_val:
# If cache hit, decode the bytes to a string and return the cached prediction
return {"prediction": cached_val.decode("utf-8")}
# 3. If not cached, compute the prediction using the model
features = [[sepal_length, sepal_width, petal_length, petal_width]]
prediction = model.predict(features)[0]
# 4. Store the result in Redis for next time (as a string)
cache.set(cache_key, str(prediction))
# 5. Return the freshly computed prediction
return {"prediction": str(prediction)}
In the code above, I have added Redis caching. First, we create the client with redis.Redis(), which connects to the Redis server; db=0 is the default database. Then we build the cache key by joining the input values with a separator. That works here because the inputs are a few simple numbers, but for more complex inputs it is better to hash a serialized form of the payload (for example a JSON string); the key just has to be unique for each distinct input. We call cache.get(cache_key): if the key exists, Redis returns the stored value and we respond immediately, with no need to rerun the model. If it is not in the cache, we run the model to get the prediction, and finally we store that result in Redis with cache.set(). Next time the same input arrives, the result is already there and the response comes back almost instantly.
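For more complex inputs, one option is to hash a deterministically serialized payload, for example (make_cache_key is an illustrative helper, not part of the app above):
import hashlib, json

def make_cache_key(payload: dict) -> str:
    # sort_keys=True makes the serialization deterministic, so identical
    # payloads always hash to the same key.
    serialized = json.dumps(payload, sort_keys=True)
    return "predict:" + hashlib.sha256(serialized.encode()).hexdigest()

key = make_cache_key({"sepal_length": 5.1, "sepal_width": 3.5,
                      "petal_length": 1.4, "petal_width": 0.2})
You may also want to give cached entries an expiry so stale results eventually disappear, e.g. cache.set(cache_key, str(prediction), ex=3600) to keep them for an hour.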
Step 4: Testing and Measuring the Performance Gain
Now that our FastAPI app is running and connected to Redis, it's time to check how caching improves response time. Here I will use the Python requests library to call the API twice with the same input and measure the time each call takes. Make sure your FastAPI server is running before executing the test code:
import requests, time
# Sample input to predict (same input will be used twice to test caching)
params = {
"sepal_length": 5.1,
"sepal_width": 3.5,
"petal_length": 1.4,
"petal_width": 0.2
}
# First request (expected to be a cache miss, will run the model)
start = time.time()
response1 = requests.get(" params=params)
elapsed1 = time.time() - start
print("First response:", response1.json(), f"(Time: {elapsed1:.4f} seconds)")

# Second request (same params, expected cache hit, no model computation)
start = time.time()
response2 = requests.get(" params=params)
elapsed2 = time.time() - start
print("Second response:", response2.json(), f"(Time: {elapsed2:.6f}seconds)")

When you run this, you should see the first request compute the result by running the model, while the second request returns the same result almost instantly. For example, the first call might take tens of milliseconds (depending on the model's complexity and your hardware), while the second call may take just a few milliseconds or less. In our example with a lightweight model, the difference will be small (since the model itself is fast), but the gain becomes significant with heavier models.
Why Caching Helps
To put this in perspective, let's consider what we accomplished:
- Without caching: every request, even an identical one, hits the model. If the model takes 100 ms per prediction, 10 identical requests add up to ~1000 ms.
- With caching: the first request takes the full time (100 ms), but the following 9 requests might take, say, 1-2 ms each (just a Redis lookup and a response). Those 10 requests then complete in roughly ~120 ms instead of ~1000 ms, about an 8x speed-up in this scenario.
In real systems, caching can deliver order-of-magnitude improvements. In e-commerce, for example, serving recommendations for repeated requests from Redis can take microseconds, compared to recomputing them with the full model pipeline. How much you gain depends on how expensive your model's inference is: the heavier the model, the more you benefit from caching repeated calls. It also depends on the request pattern: if every request is unique, the cache won't help (there are no repeats to serve from memory), but many applications do see the same inputs again and again (e.g., popular items or common query parameters).
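To make that dependence on repetition concrete, here is a back-of-the-envelope sketch (the timings are assumptions for illustration, not measurements):
# expected latency ~ hit_rate * cache_time + (1 - hit_rate) * model_time
model_time_ms = 100.0   # assumed cost of one model prediction
cache_time_ms = 2.0     # assumed cost of a Redis lookup
for hit_rate in (0.0, 0.5, 0.9):
    expected = hit_rate * cache_time_ms + (1 - hit_rate) * model_time_ms
    print(f"hit rate {hit_rate:.0%}: ~{expected:.1f} ms per request on average")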
You can also inspect your Redis instance directly to verify which keys have been cached.
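For example, a quick way to peek at what has been cached (a minimal sketch, assuming the same local Redis instance as before):
import redis

cache = redis.Redis(host="localhost", port=6379, db=0)
# scan_iter walks the keyspace incrementally instead of blocking the server like KEYS *.
for key in cache.scan_iter("*"):
    print(key.decode(), "->", cache.get(key).decode())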
Conclusion
In this blog, I showed how FastAPI and Redis can work together to speed up ML model serving. FastAPI provides a fast, easy-to-build prediction API layer, while Redis adds a caching layer that cuts latency and CPU load for repeated requests. By avoiding redundant model calls, we improved response times and enabled the service to handle more traffic with the same resources.