
Personalized Restaurant Collection Ranking with a Two-Tower Embeddings Variation

I would like to share a practical adaptation of Uber's Two-Tower Embeddings (TTE) approach for situations where both user-level data and computing resources are limited. The problem arose in a high-traffic collections widget on the home screen of a food delivery app. This widget displays curated collections such as Italian, Burgers, Sushi, or Healthy. Collections are built from tags: each restaurant can carry multiple tags, and each tile is a tag-defined slice of the catalog (plus a few manually curated collections). In other words, the candidate set is already known, so the real problem is not retrieval but ranking quality.

At the time, this widget was noticeably underperforming compared to other widgets on the (main) search screen. Collections were ranked by overall popularity, without any personalization. What we found is that users are reluctant to scroll: if they don't see something interesting within the first 10 to 12 positions, they usually don't convert. On the other hand, collections can be huge, in some cases up to 1,500 restaurants. In addition, one restaurant can belong to several collections: McDonald's, for example, can appear in both Burgers and Ice Cream, but its popularity is really justified only in the first one, while a plain popularity ranking puts it at the top of both.

The product setup makes the problem unfriendly to static solutions such as precomputed per-collection rankings. Collections are dynamic and constantly changing due to seasonal campaigns, operational needs, or new business initiatives, so training a dedicated model per collection is unrealistic. A useful ranker should support new tag-based collections from day one.

Before moving to a two-tower-style solution, we tried simpler methods such as local popularity rankings at the city level and multi-armed bandits. In our case, none of them brought more than a limited uplift over plain popularity ranking. As part of our research we then tried to adapt Uber's TTE approach to our case.

Two-Tower Embeddings Recap

The two-tower model trains two encoders in parallel: one on the user side and one on the restaurant side. Each tower produces a vector in a shared latent space, and compatibility is estimated by a similarity score, usually the dot product. The performance advantage is decoupling: restaurant embeddings can be precomputed offline, while the user embedding is generated online at request time. This makes the approach attractive for systems that require fast scoring and reusable representations.
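To make the decoupling concrete, here is a minimal sketch of the two-tower scoring idea in PyTorch. All layer sizes and feature dimensions are illustrative assumptions, not the production architecture.

```python
# Minimal two-tower sketch: two independent encoders, dot-product scoring.
import torch
import torch.nn as nn

class Tower(nn.Module):
    def __init__(self, in_dim: int, emb_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 128), nn.ReLU(),
            nn.Linear(128, emb_dim),
        )

    def forward(self, x):
        return self.net(x)

user_tower = Tower(in_dim=40)        # computed online from user/session features
restaurant_tower = Tower(in_dim=96)  # can be precomputed offline per restaurant

users = torch.randn(8, 40)           # a batch of user feature vectors
restaurants = torch.randn(8, 96)     # a batch of restaurant feature vectors

# Compatibility is the dot product of the two embeddings.
scores = (user_tower(users) * restaurant_tower(restaurants)).sum(dim=-1)
```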

Uber's post focuses mostly on retrieval, but it also notes that the same architecture can serve as a late-stage ranker when candidate generation is handled elsewhere and latency must stay low. That second configuration was closest to our use case.

Our Approach

Image by author

We kept the two-tower structure but simplified the heavier parts. On the restaurant side, we did not fine-tune a language model inside the recommender. Instead, we reused a TinyBERT model that had already been fine-tuned for in-app search and treated it as a frozen semantic encoder. Its text embeddings are combined with explicit restaurant features such as price, ratings, and recent service signals, along with a small trainable restaurant ID embedding, to produce the final restaurant vector. This gave us semantic coverage without paying the full cost of end-to-end language model training. For a POC or MVP, a small frozen sentence transformer can be a reasonable starting point as well.
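As a rough sketch of how such a restaurant tower can be wired, assuming the frozen encoder's outputs are precomputed per restaurant (the 312-dim size matches the common 4-layer TinyBERT, but every dimension and name here is an assumption):

```python
# Restaurant tower: frozen text embedding + explicit features + trainable ID.
import torch
import torch.nn as nn

class RestaurantTower(nn.Module):
    def __init__(self, n_restaurants, text_dim=312, feat_dim=16,
                 id_dim=16, emb_dim=64):
        super().__init__()
        self.id_emb = nn.Embedding(n_restaurants, id_dim)  # small trainable ID embedding
        self.proj = nn.Sequential(
            nn.Linear(text_dim + feat_dim + id_dim, 128), nn.ReLU(),
            nn.Linear(128, emb_dim),
        )

    def forward(self, text_emb, features, rest_ids):
        # text_emb comes from the frozen semantic encoder and carries no
        # gradient; only the projection and the ID embedding are trained.
        x = torch.cat([text_emb.detach(), features, self.id_emb(rest_ids)], dim=-1)
        return self.proj(x)

tower = RestaurantTower(n_restaurants=2000)
text_emb = torch.randn(4, 312)           # precomputed frozen text embeddings
features = torch.randn(4, 16)            # price, ratings, recent service signals
rest_ids = torch.tensor([0, 17, 42, 5])
vectors = tower(text_emb, features, rest_ids)  # shape (4, 64)
```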

We avoided learning a user ID embedding and instead represented each user through their past interactions. The user vector is built from the average embedding of the restaurants the customer has ordered from (Uber's post also mentions this signal, but the authors do not specify how it was used), together with user and session features. We also used viewed-but-not-ordered observations as a weak negative signal. That mattered when the order history had little or no relevance to the current collection: even if the model cannot clearly infer the user's preferences, it still helps to know which restaurants have already been tried and rejected.
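The post doesn't specify exactly how the weak negative signal enters the model; one plausible reading is to feed the user tower two averaged vectors, one for ordered and one for viewed-but-not-ordered restaurants. A minimal sketch under that assumption:

```python
# Build user-side history features from restaurant embeddings (illustrative).
import numpy as np

def build_user_history_features(ordered_embs, viewed_not_ordered_embs, dim=64):
    """Concatenate an 'ordered' average and a 'viewed but not ordered' average.

    Feeding the negative signal as a separate averaged input is an assumption,
    not the confirmed production design.
    """
    pos = np.mean(ordered_embs, axis=0) if len(ordered_embs) else np.zeros(dim)
    neg = np.mean(viewed_not_ordered_embs, axis=0) if len(viewed_not_ordered_embs) else np.zeros(dim)
    return np.concatenate([pos, neg])  # the user tower consumes both signals
```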

The most important modeling choice was to filter that history by the current collection's tag. Averaging over the entire order history created too much noise. If a customer mostly orders burgers and opens the Ice Cream collection, for example, the global average may pull the model toward burger places that happen to sell desserts rather than toward genuinely strong ice cream venues. By filtering past interactions to matching tags before averaging, we made the user representation contextual rather than global. In practice, this was the difference between modeling long-term taste and modeling current intent.
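A minimal sketch of tag-filtered averaging; the fallback to the global average when no past order matches the current tag is my assumption, not something the post states:

```python
# Tag-filtered user vector: only past orders sharing a tag with the current
# collection contribute to the average.
import numpy as np

def tag_filtered_user_vector(history, current_tag, dim=64):
    """history: list of (embedding, set_of_tags) for past orders."""
    matching = [emb for emb, tags in history if current_tag in tags]
    if matching:                       # intent-aware average for this collection
        return np.mean(matching, axis=0)
    if history:                        # assumed fallback to long-term taste
        return np.mean([emb for emb, _ in history], axis=0)
    return np.zeros(dim)               # cold start: no history at all
```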

Finally, we trained the model at the session level and used multi-task learning. The same restaurant can be a good fit for one session and a poor fit for another, depending on the user's current intent. The ranking head predicts click, add-to-basket, and order jointly, with a simple funnel constraint: P(order) ≤ P(add-to-basket) ≤ P(click). This kept the model compact (one model instead of three) and improved ranking quality compared with training each target separately.
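The post doesn't say how the funnel constraint was enforced. One standard construction that satisfies it by design is to predict conditional probabilities and chain them (as in ESMM-style models); a sketch:

```python
# Funnel-constrained multi-task head: P(order) <= P(add) <= P(click) holds
# by construction because each stage multiplies the previous probability
# by a sigmoid in (0, 1). This is one possible implementation, not
# necessarily the author's.
import torch
import torch.nn as nn

class FunnelHead(nn.Module):
    def __init__(self, in_dim: int):
        super().__init__()
        self.click = nn.Linear(in_dim, 1)
        self.add_given_click = nn.Linear(in_dim, 1)
        self.order_given_add = nn.Linear(in_dim, 1)

    def forward(self, x):
        p_click = torch.sigmoid(self.click(x))
        p_add = p_click * torch.sigmoid(self.add_given_click(x))
        p_order = p_add * torch.sigmoid(self.order_given_add(x))
        return p_click, p_add, p_order  # monotone along the funnel

head = FunnelHead(in_dim=64)
p_click, p_add, p_order = head(torch.randn(8, 64))
assert (p_order <= p_add).all() and (p_add <= p_click).all()
```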

Offline validation was also made more robust than random splits: evaluation used an out-of-time split with users not seen during training, which brought the setup closer to production behavior.
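For illustration, an out-of-time split that also holds out users could look like the following; the column names are assumptions:

```python
# Out-of-time, out-of-user evaluation split: train on earlier sessions,
# test on later sessions from users absent from the training window.
import pandas as pd

def time_and_user_split(df: pd.DataFrame, cutoff: str):
    train = df[df["session_ts"] < cutoff]
    test = df[df["session_ts"] >= cutoff]
    test = test[~test["user_id"].isin(train["user_id"].unique())]
    return train, test
```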

Results

In A/B testing, the final system showed a significant increase in conversion rate. Most importantly, it wasn't tied to a single widget. Because the model scores a user-restaurant pair rather than a fixed list, it generalizes to new collections without structural changes: tags are part of the restaurant metadata, and a score can be produced for any candidate without a specific collection in mind.

That transferability made the model useful beyond the original ranking surface. We later reused it in advertising, where its CTR-based output was applied to rank individual promoted restaurants, with good results. The same training setup thus worked both for collection ranking and for other recommendation problems such as placement within the app.

Further Research

The most obvious next step is multimodality. Restaurant images, icons, and possibly menu previews can be added as additional branches to the restaurant tower. That matters because click behavior is heavily influenced by presentation: a pizza place inside the Pizza collection may underperform if its main image doesn't show pizza, while a budget restaurant might overperform thanks to a strong hero image. Text and tabular features don't cover that gap.
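As a hypothetical illustration of that direction, an image branch could simply be concatenated into the restaurant tower; the encoder choice and all dimensions below are placeholders, not a shipped implementation:

```python
# Speculative multimodal restaurant tower: fuse frozen text and image
# embeddings with tabular features before projecting to the shared space.
import torch
import torch.nn as nn

class MultimodalRestaurantTower(nn.Module):
    def __init__(self, text_dim=312, img_dim=512, feat_dim=16, emb_dim=64):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(text_dim + img_dim + feat_dim, 128), nn.ReLU(),
            nn.Linear(128, emb_dim),
        )

    def forward(self, text_emb, img_emb, features):
        # img_emb would come from a frozen vision encoder over the hero image,
        # letting presentation quality influence the learned representation.
        return self.fuse(torch.cat([text_emb, img_emb, features], dim=-1))
```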

Key Takeaways:

  • Two-tower models can work even with limited data. You don't need Uber-scale infrastructure if candidate retrieval is already solved and the model only handles the ranking phase.
  • Reuse pre-trained embeddings instead of training from scratch. A lightweight frozen language model (e.g., TinyBERT or a small sentence transformer) can provide solid semantic signals without expensive fine-tuning.
  • The average embedding of previously ordered restaurants works surprisingly well when the user history is small.
  • Context filtering reduces noise. Filtering history by the current collection's tag helps the model capture the user's current intent, not just long-term taste.
  • Negative signals help in overlapping catalogs. Restaurants that users have viewed but not ordered from provide useful information when positive signals are limited.
  • Multi-task learning stabilizes ranking. Predicting clicks, add-to-baskets, and orders jointly with funnel constraints produces consistent scores.
  • Design for reuse. A model that scores user-restaurant pairs rather than fixed lists can be reused across product areas such as collections, search ranking, or advertising.
