Machine Learning

Deploying a Multistage Multimodal Recommender System on Amazon Elastic Kubernetes Service

, multimodal recommender system is not trivial especially when it needs to scale, adapt in near real time, and run reliably on cloud.

In this post, I walk through my experience designing and deploying such a system end‑to‑end covering data preparation, model training to serving the models in production.

We’ll explore the full pipeline including retrieval, filtering, scoring, and ranking along with the infrastructure and important decisions that makes it all work. This includes feature stores, Bloom‑filters, Kubeflow, near real‑time preference adaptation, and a major latency win from in‑memory feature caching.

It’s a long read, but if you’re building or scaling recommender systems, you’ll find practical patterns here that you can apply directly to your own projects.

The main sections of this post

  1. Some information about the system
  2. Why the current design was chosen
  3. System components
  4. Data source
  5. Full Training and Deployment pipeline
  6. Continual fine-tuning pipeline
  7. Processing requests through the 14 models in NVIDIA Triton Inference server
  8. Improving item feature lookup latency with in-memory caching
  9. Autoscaling the Triton Inference Server on EKS
  10. Validating contextual recommendations, Bloom filter filtering, and near real-time recommendation updates (with Demo)
  11. Limitations and Future Work
  12. Conclusion
  13. Resources

Some information about the system

The recommender system consists of four main stages: a Two-Tower model generates candidates, a Bloom filter temporarily hides items the user recently interacted with, a DLRM ranker scores the remaining items using user, item, and context features, and a final reranking stage orders and samples from these scores to produce the final recommendations. The models use both tabular collaborative features and precomputed CLIP image embeddings and Sentence-BERT text embeddings.

In the retrieval model, these pretrained embeddings are fed into the candidate tower together with learned item features, providing the candidate tower with both content-based semantic signals and collaborative signals. The dot product between the query-tower output and candidate-tower output is then used as a learned relevance score in this shared embedding space.

In the DLRM ranker, the pretrained image and text embeddings participate in the dot-product interaction layer. These pairwise interactions are then passed to the top MLP, allowing content-based signals from the pretrained embeddings to complement the collaborative and contextual signals used for click prediction.


Why the current design was chosen

The target use case is an ecommerce platform that needs to recommend relevant products as soon as users land on the homepage. The platform serves both registered users and anonymous visitors, and user behavior can vary substantially with the request context, such as device type, time of day, or day of week. That means the recommendation service must provide reasonable cold-start recommendations for new users and must adapt recommendations to the context of the current request.

The solution also needs to scale. As more retailers are onboarded, the product catalog could grow to millions of items. At that point, scoring the full catalog on every request is impractical. A multistage design solves this problem by using a light weight retrieval stage to fetch candidates quickly and a heavier ranking stage to score those candidates.

Also, the recommendation models need to stay up to date with new interactions, however rebuilding the full retrieval stack every day is not practical. For this reason, two Kubeflow pipelines are defined. The first pipeline sets up the preprocessing workflows, trains the models from scratch, builds the ANN index, and deploys the Triton server and models. The second pipeline manages daily finetuning which primarily updates the query tower and the ranker; the models are updated with new interaction signals but the item embeddings are not regenerated.


System components

All components of the system work together to ensure the overall goal of serving relevant recommendations fast and at reasonable scale is achieved.

  • Kubeflow Pipelines manages both the full training workflow and the daily fine-tuning workflow on the Kubernetes-based system.
  • The NVIDIA Merlin stack handles GPU-accelerated feature engineering, preprocessing, training retrieval and ranking models. Triton Inference server hosts the multistage serving graph as a single ensemble model.
  • FAISS serves as the approximate nearest neighbor index for candidate retrieval.
  • Feast manages the user and item features across training and serving. ElastiCache for Valkey (Redis) backs the online feature store, manages each user’s Bloom filter to allow filtering of already-seen items from a user’s recommendation list, and stores global and category-based item popularity information based on interaction counts. Amazon Athena (with S3 and Glue) backs the offline feature store.
  • Amazon Elastic Kubernetes Service (EKS) runs the containerized machine learning workflows and scales compute to meet changing workload demands.
Figure 2: Recommender system MLOps with Kubeflow on Amazon Elastic Kubernetes Service (image by author)

Data source

The training data comes from a modified version of the AWS Retail Demo Store interaction generator. The user pool was scaled to 300,000 while the product catalog was kept at 2,465 items, with the associated images and descriptions. The dataset contains 13 million interactions across 14 days, stored as daily partitioned parquets (day_00.parquet — day_13.parquet).


Full Training and Deployment pipeline

The first Kubeflow pipeline handles the initial data copy, data preprocessing, model training, FAISS indexing, and Triton Inference Server deployment.

Figure 3: Kubeflow UI showing the components of the full Training and deployment pipeline (image by author)

Data copy

The pipeline begins by copying all the inputs needed by downstream tasks from S3 bucket to a persistent volume mounted at a local path. These include the interaction data, feature tables, product images, pretrained CLIP and Sentence-BERT models.

Preprocessing

The preprocessing step merges interaction data with user and item feature tables, then defines and fits three NVTabular workflows, one for the user features [jump to CODE], one for the item features [ jump to CODE] , and one for the context features [jump to CODE]. It also compiles the subgraphs into a full workflow. Splitting the workflows made it easier to build separate triton models for feature transformations which can be independently updated.

Another preprocessing step simulates cold-start conditions (see code snippet below) during training. In 5% of training rows, the user ID, gender, and top_category features are replaced with sentinel values, followed by a separate 5% random masking of device type. Transformation with the NVTabular workflows maps the sentinels to out-of-vocabulary (OOV) index.

#MASK some users and context features in train data with 5% probability 
ANONYMOUS_USER = -1
OOV_GENDER = -1
OOV_TOP_CATEGORY = -1
OOV_DEVICE = -1

masked_train_dir = os.path.join(input_path, "masked_train")
os.makedirs(masked_train_dir, exist_ok=True)

for i in range(train_days):
    day = cudf.read_parquet(os.path.join(input_path, f"train_day_{i:02d}.parquet"))
    n=len(day)
    user_mask = cupy.random.random(n) < 0.05
    day.loc[user_mask, "user_id"] = ANONYMOUS_USER
    day.loc[user_mask, "gender"] = OOV_GENDER
    day.loc[user_mask, "top_category"] = OOV_TOP_CATEGORY
        
    device_mask = cupy.random.random(n) < 0.05
    day.loc[device_mask, "device_type"] = OOV_DEVICE
    day.to_parquet(os.path.join(masked_train_dir, f"train_day_{i:02d}.parquet"), index=False)
    del day
    gc.collect()
    
masked_train_paths = [os.path.join(masked_train_dir, f"train_day_{i:02d}.parquet") for i in range(train_days)]
masked_train_ds = Dataset(masked_train_paths)

full_workflow.transform(masked_train_ds).to_parquet(os.path.join(output_path, "train"))
full_workflow.transform(valid_raw).to_parquet(os.path.join(output_path, "valid"))

To obtain the multimodal item features, the product images are encoded using OpenAI CLIP and the product descriptions are encoded using Sentence-BERT. Both embeddings are reduced to 64-dimensional vectors through PCA and stored as lookup tables keyed by the NVTabular transformed item IDs. The mean age computed by the user workflow is saved for later injection into the feast_user_lookup model config. Another step prepares the offline and online feature artifacts. This step adds timestamps to the user and item features, writes the resulting features to the offline store, and materializes them into the online store for serving. At the same time, global and category-specific popularity information are computed from the interaction data and written to the Valkey database (db=3).

Figure 4: the Valkey database for item popularity (image by author)

Training the retrieval model

The Two-Tower model [jump to CODE] is trained on user and item features only, with in-batch negatives and a contrastive loss. The query tower ingests the user-side features while the candidate tower consumes the item features together with the precomputed image and text embeddings. See Figures 5 and 6 for information about the NVTabular preprocessing and the input block processing steps for each tower.

Figure 5: an illustration of the feature transforms with NVTabular and the steps in the input block of the candidate tower. (image by author, and inspired by prior work from Jeremy and Jordan)

Training uses the first 9 days of interaction data; evaluation uses days 10 through 12. After training, the candidate encoder is run over the full item catalog to compute item embeddings. For this, a custom LookupEmbeddings operator (based on Merlin’s BaseOperator) handles the multimodal embedding lookup when loading items features in batches with Merlin’s data loader. These item embeddings are used to build the FAISS index for approximate nearest-neighbor retrieval. The query encoder is saved separately for online inference.

Figure 6: an illustration of the feature transforms with NVTabular and the steps in the input block of the query tower. (image by author, and inspired by prior work from Jeremy and Jordan)

Training the ranking model

The DLRM ranker [jump to CODE] is trained on the same interaction data but with an expanded feature set. The feature set includes item features, user features, request-time context features (such as device type and cyclical time-of-day and day-of-week features). The learning target is a binary click label. These context features represent situational factors that can shape a customer’s choice. For instance, a user might engage more with certain items when browsing on their phone versus a desktop, or show different preferences depending on the time of day or day of the week.

Figure 7: the DLRM architecture including the feature transforms (image by author)

Model preparation and deployment

Once both models are trained, the pipeline assembles the serving artifacts needed by Triton. These include the saved query tower, the DLRM ranker, the NVTabular transform models, the FAISS index and the lookup tables for the multimodal item embeddings. The Triton model repository is structured ahead of time, so each deployment only needs to copy the model artifacts into their versioned directory and inject runtime values like the average user age (for cold-start default), the retrieval topK, the ranking topK and diversity mode into the model config files.

A helm chart deploys Triton Inference Server on EKS, starts the server in explicit mode and then loads all the models (see the starting script).

#Triton starting script
set -e
MODELS_DIR=${1:-"/model/triton_model_repository"}

echo "Starting Triton Inference Server"
echo "Models directory: $MODELS_DIR"

tritonserver 
    --model-repository="$MODELS_DIR" 
    --model-control-mode=explicit 
    --load-model=nvt_user_transform 
    --load-model=nvt_item_transform 
    --load-model=nvt_context_transform 
    --load-model=multimodal_embedding_lookup 
    --load-model=query_tower 
    --load-model=faiss_retrieval 
    --load-model=dlrm_ranking 
    --load-model=item_id_decoder 
    --load-model=feast_user_lookup 
    --load-model=feast_item_lookup 
    --load-model=filter_seen_items 
    --load-model=softmax_sampling 
    --load-model=context_preprocessor 
    --load-model=unroll_features 
    --load-model=ensemble_model

Continual fine-tuning pipeline

This Kubeflow pipeline handles daily model updates. The pipeline relies on some of the artifacts generated by the full training pipeline, therefore its components mount the same persistent volume containing the saved artifacts.

Figure 8: Kubeflow Pipelines UI showing the incremental retraining pipeline DAG (image by author)

Copy incremental data

At the start of this run, the pipeline copies the latest interaction data from Amazon S3 together with a smaller replay set of older interactions. The replay portion gives the fine-tuning job a broader behavioral context and prevents the models from overfitting to only the newest pattern.

Preprocess data

This step merges the historical user and item features with the new interaction data, then transforms the data using the fitted NVTabular workflows from the recent full training job.

Fine-tune models

This step updates the query tower and the ranker. It initializes the Two-Tower model from the previous checkpoint but with the candidate encoder frozen so only the query tower parameters are trainable. This allows the model to adapt to the recent user behavior while preserving the item-side embeddings used by the existing ANN index. A summary of the Two-Tower model showing the frozen layers can be found in here.

The pipeline also initializes the DLRM ranker from the previous checkpoint but trains all the parameters using a smaller learning rate and for fewer epochs.

Once training completes, it saves the fine-tuned query tower and the DLRM ranker to new version folders in the existing Triton model repository.

Promote fine-tuned models

This step calls Triton to load the new models. Triton serves in-flight requests on the existing model versions while loading the new models in the background. Then it hot-swaps to the latest model versions once they are ready.

Figure 9: the query_tower and dlrm_ranker are both promoted to new versions after finetuning (image by author)

Processing requests through the 14 models in NVIDIA Triton Inference server

The model repository contains 14 models across two backends. Python backends for feature lookups, feature transforms, and filtering; TensorFlow backends for the query tower and the DLRM ranker. An ensemble configuration wires all these models into a directed acyclic graph (DAG) that NVIDIA Triton Inference server executes.

Figure 10: an illustration of request processing in the Triton Inference Server (image by author)

How context and user features are prepared

Each request arrives with a user ID and an optional device type and request timestamp. If any context was missing, the context_preprocessor imputes the defaults. For example, the current server time is imputed for a missing timestamp and an OOV sentinel is imputed for missing device type. The context workflow transforms the context data into categorified device index and four temporal features (hour sine/cosine, day-of-week sine/cosine).

In the user path, feast_user_lookup fetches the user features from the online feature store (backed by ElastiCache for Valkey), then nvt_user_transform transforms the features using the user workflow before passing them to the query tower (query_tower). The query tower produces the user embeddings which faiss_retrieval uses to perform similarity search, returning the topK item IDs.

Handling user cold-start

When a user ID is not found in the online feature store, feast_user_lookup uses defaults, i.e., user_id = -1, age = the training mean, gender = -1, and top_category=-1. The nvt_user_transform maps these user_id, gender, and top_category sentinels to their OOV indices and the mean age to the normalized value and categorified age bucket. Then the query_tower generates the user embedding from the transformed features. Although faiss_retrieval returns the same popularity-biased candidates for unknown users, the DLRM ranker can still personalize the candidates ordering using available context.

Seen-items filtering with a Bloom Filter

The candidate item IDs are checked against a Bloom filter in ElastiCache for Valkey. This step can eliminate a significant number of candidates, therefore over‑fetching at the retrieval stage is important as it ensures the ranker receives enough candidates to produce a meaningful recommendation list.

The filtered item IDs enter the item feature pipeline where feast_item_lookup retrieves the item features from the online feature store, nvt_item_transform transforms these features using the user workflow, and multimodal_embedding_lookup returns the pretrained CLIP (image) and Sentence BERT (text) embeddings for the items.

Figure 11: RedisInsight UI showing Bloom filter keys (items) stored in ElastiCache, each with a 6-day TTL. (image by author)

Ranking and ordering

The unroll_features model tiles the user and context features to match the retrieval candidate size. Then DLRM ranker (dlrm_ranking) scores the candidates. In softmax_sampling if DIVERSITY_MODE is disabled, the model returns the topK candidates by descending score; if it is enabled, the model uses score-based weighted sampling without replacement to select a diverse topK while still favoring higher-scoring items. Finally, item_id_decoder maps the ordered candidate IDs (NVTabular indices) back to the original item IDs, and Triton returns the selected item IDs together with their corresponding scores.


Improving item feature lookup latency with in-memory caching

Server Profiling with Triton Performance Analyzer at retrieval size of 300 revealed that feast_item_lookup consumes 195 ms, which was roughly 52% of total request latency at concurrency=1. Under load, the queue time ballooned from 36 ms (at concurrency=1) to 988 ms (at concurrency=4). This capped throughput at 2.9 inferences per second regardless of how many concurrent requests were issued.

Figure 12a: Optimizing feature lookup latency with caching (image by author)

The bottleneck was feast_item_lookup fetching features for 300 candidates from Feast’s online store on every request. To alleviate this, Feast calls for item features were replaced with an in-process NumPy array cache. Essentially, at feast_item_lookup initialization, all item features are fetched once from Feast and stored as NumPy arrays indexed by item ID, so every request reads features from memory instead of making network calls to the online feature store. This optimization resulted in about 99.7% improvement in the feast_item_lookup latency, and a 54% improvement in the end-to-end latency (at concurrency=1). Also, the throughput (at concurrency=4) improved by 310%. The only trade-off is that the cached features only refresh on Triton restart, however, for a catalog with fairly static item attributes, this is not problematic.

Figure 12b: Latency results before and after in-memory feature caching (image by author)

After this change, the three NVTabular transform models nvt_user_transform (72ms), nvt_item_transform (41ms), and nvt_context_transform (39ms) accounted for approximately 88% of remaining latency. Further model optimizations are deferred to a future version of this project.


Autoscaling the Triton Inference Server on EKS

in this project, the Triton Inference Server is autoscaled via Kubernetes Horizontal Pod Autoscaler (HPA) based on a custom metric — the average time (in milliseconds) that each request spent waiting in the queue over the last 30 seconds. When this latency exceeds the target, the HPA scales up the Triton deployment by increasing the desired pod replica count. If the new Triton pod cannot be scheduled because no GPU node has capacity for a new pod, Karpenter provisions a new GPU node and adds it to the cluster. Once the node becomes available, the Kubernetes scheduler places the Triton pod on it. Once the new pod is ready, the load balancer can begin routing traffic to it.

Figure 13: Autoscaling Triton Inference Server with K8s HPA and Karpenter (image by author)

Validating contextual recommendations, Bloom filter filtering, and near real-time recommendation updates.

To validate the system, diversity mode was turned off during deployment to isolate its effect from those of context types, Bloom filter filtering, and preference shift on recommendations.

Validating contextual recommendations

To validate contextual recommendations, I experimented with multiple request types, including requests with only a user ID and requests that combined user ID with contextual features such as device type and timestamp. These tests showed that recommendations for unknown users vary with context. A cold-start user can receive different ranked item lists depending on the device type and request time. For existing users, the effect of context was less pronounced. The overall ranking remained largely stable across contexts, although the output scores varied.

A demo of context effects on recommendations for existing (user ID= 1009) and new user (userID = 12345678). Video by author.

Validating Bloom filter seen-items filtering

To validate seen-item exclusion by the Bloom filter, several items from the Recommended for You carousel were clicked. Those items were excluded from subsequent recommendations by the Bloom filter. To avoid shifting the user’s inferred preference and confounding the Bloom filter test, click items from different categories.

In the video demonstrating the Bloom filter filtering, we observe that clicked items such as Decadent Chocolate Dream Cake and Vintage Explorer’s Canvas Backpack are excluded from User 12345678‘s next recommendations.

Video demonstration of the Bloom filter excluding previously interacted items (video by author).

Validating near real-time recommendation updates

To validate near real-time recommendation updates for existing users, the test begins by first fetching recommendations for a user to establish the user’s current preference. This is followed by clicking several items from the same category, for example, items belonging to only Accessories or Furniture or Groceries, then waiting for about five seconds for the updates to take effect. The repeated interactions with items in the same category can shift the user’s inferred preference if that category differs from the user’s current top_category. The top_category feature represents the dominant category among the items a user has interacted within the past 24 hours and is recomputed after each interaction. On the next request, the model can rank items from that newly expressed interest category higher and surface them among the top recommendations.

In the video demonstrating live changes in recommendations, we notice User 1003‘s top recommendations change from Accessories to Home Decor (and furniture) due to repeated interactions with items in the Furniture category.

Demonstration of real‑time ranking changes triggered by shifts in user preference signals (video by author)

Note, however, that the top_category feature is a crude approximation of short-term interest used to demonstrate the system’s ability to adapt to user behavior in real-time. For richer short-term interest modeling, the next iteration of this project would replace the static query tower with a session-based transformer encoder.


Limitations and Future Work

In the current architecture, request-side context, such as device type and timestamp-derived features, is used only by the ranker. This was an implementation choice to keep the retrieval simple, since adding context at retrieval time would require computing additional features during candidate generation. However, if request context influences which items should be retrieved, relevant candidates may be filtered out before the ranker sees them.

A future direction is to add request-side context features to the query tower, so both retrieval and ranking become context-aware. Another direction is to replace the current query tower with a session encoder, which would more faithfully capture short‑term user behaviour than the current behavioural feature approximation (i.e., top_category).


Conclusion

This post walked through a multistage multimodal recommender system for an ecommerce use case, deployed on Amazon EKS. The system combines Two-Tower candidate retrieval, context-aware DLRM ranking, and a score-based diversity ranking. The system utilizes tabular user and item features, multimodal embeddings based on product images and text descriptions, and context information.

Cold-start is addressed through feature masking during training, which forces the models to rely on a learned OOV embedding and context signals when user is new or unknown. This means anonymous and new users receive recommendations that adapt to their device type and the time of their request, rather than a static fallback list. Bloom filters prevent already-seen items from resurfacing across repeated sessions, and in-memory caching of item features helped resolve the latency bottleneck at the item feature lookup stage. Also, real-time adaptation of the system to changing behavioral signal is demonstrated via the top_category feature.

On the MLOps side, two Kubeflow pipelines manage the system lifecycle. One pipeline for full training and deployment, and the other for daily fine-tuning of the query tower and ranker without rebuilding the item embedding index. Karpenter and Kubernetes HPA handle compute scaling in response to request load.

The system shows a production-style recommender systems in which a retrieval stage optimized for speed and recall is combined with a ranking stage optimized for precision, and an infrastructure layer designed to keep models updated without full retraining on every cycle. Please find the full code in this repository: MustaphaU/multistage-recommender-system-on-kubernetes

I hope you enjoyed reading this! I look forward to your questions.


Resources

  1. Mustapha Unubi Momoh, Multistage Multimodal Recommender System on Kubernetes, GitHub repository. Available:
  2. Even Oldridge and Karl Byleen‑Higley, “Recommender Systems, Not Just Recommender Models,” NVIDIA Merlin (Medium), Apr. 2022. Available:
  3. Radek Osmulski, “Exploring Production‑Ready Recommender Systems with Merlin,” NVIDIA Merlin (Medium), Jul. 2022. Available:
  4. Jacopo Tagliabue, Hugo Bowne‑Anderson, Ronay Ak, Gabriel de Souza Moreira, and Sara Rabhi, “NVIDIA Merlin Meets the MLOps Ecosystem: Building a Production‑Ready RecSys Pipeline on Cloud,” NVIDIA Merlin (Medium), Feb. 2023. Available:
  5. Benedikt Schifferer, “Solving the Cold‑Start Problem Using Two‑Tower Neural Networks for NVIDIA’s E‑Mail Recommender Systems,” NVIDIA Merlin (Medium), Jan. 2023. Available:
  6. Ziyou “Eugene” Yan, “System Design for Recommendations and Search,” eugeneyan.com, Jun. 2021. Available:
  7. Haoran Yuan and Alejandro A. Hernandez, “User Cold Start Problem in Recommendation Systems: A Systematic Review,” IEEE Access, vol. 11, pp. 136958–136977, 2023. Available:
  8. Justin Wortz and Justin Totten, “Scaling Deep Retrieval with TensorFlow Recommenders and Vertex AI Matching Engine,” Google Cloud Blog, Apr. 19, 2023. Available:
  9. Sam Partee, Tyler Hutcherson, and Nathan Stephens, “Offline to Online: Feature Storage for Real‑time Recommendation Systems with NVIDIA Merlin,” NVIDIA Technical Blog, Mar. 1, 2023. Available:

Source link

Related Articles

Leave a Reply

Your email address will not be published. Required fields are marked *

Back to top button