Generative AI

This AI Paper from Walmart Shows the Power of Multimodal Learning for Improved Product Recommendations

In the rapid development of personalized recommendation systems, leveraging a variety of data modalities is critical to providing accurate and relevant recommendations to users. Traditional recommendation models often rely on a single data source, which limits their ability to fully capture the complex and multifaceted nature of user behavior and product characteristics, and this in turn limits the quality of the recommendations they can deliver. The challenge lies in combining diverse data modalities to improve system performance while ensuring a deep and comprehensive understanding of user preferences and item characteristics. Addressing this issue remains a major focus for researchers.

Efforts to improve recommendation systems have led to the development of multi-behavior recommendation systems (MBRS) and approaches based on large language models (LLMs). MBRS leverages auxiliary behavior data to generate targeted recommendations, using sequence-based methods such as temporal graph transformers and graph-based techniques such as MBGCN, KMCLR, and MBHT. In addition, LLM-based systems enrich user and item representations with contextual data or apply in-context learning to produce recommendations directly. However, while methods like ChatGPT open new opportunities, their recommendation accuracy often falls short of traditional systems, highlighting the ongoing challenge of achieving optimal performance.

Researchers from Walmart have proposed a novel framework called Triple Modality Fusion (TMF) for multi-modality recommendation. The approach aligns visual, textual, and graph data modalities with LLMs. Visual data captures contextual and aesthetic item characteristics, textual data provides detailed descriptions of user behavior and item features, and graph data encodes the relational structure among items across different behaviors. In addition, the researchers developed a modality fusion module based on cross-attention and self-attention mechanisms to align the modalities produced by separate encoders in a shared embedding space and integrate them with the LLM.
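To make the fusion step concrete, below is a minimal PyTorch sketch of a cross-attention plus self-attention modality fusion module that combines precomputed image, text, and graph item embeddings into a single token in the LLM's embedding space. The class name, dimensions, pooling choice, and projection layers are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class ModalityFusion(nn.Module):
    """Hypothetical cross-/self-attention fusion of image, text, and graph embeddings."""

    def __init__(self, img_dim=512, txt_dim=512, graph_dim=64,
                 hidden_dim=512, llm_dim=4096, num_heads=8):
        super().__init__()
        # Project each modality into a shared hidden space
        self.img_proj = nn.Linear(img_dim, hidden_dim)
        self.txt_proj = nn.Linear(txt_dim, hidden_dim)
        self.graph_proj = nn.Linear(graph_dim, hidden_dim)
        # Cross-attention: the textual token queries visual and graph context
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        # Self-attention over all modality tokens
        self.self_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        # Map the fused representation into the LLM's token embedding space
        self.to_llm = nn.Linear(hidden_dim, llm_dim)

    def forward(self, img_emb, txt_emb, graph_emb):
        # Each input: (batch, dim) -> one modality token of shape (batch, 1, hidden_dim)
        img = self.img_proj(img_emb).unsqueeze(1)
        txt = self.txt_proj(txt_emb).unsqueeze(1)
        graph = self.graph_proj(graph_emb).unsqueeze(1)
        # Cross-attention: text attends over image and graph features
        ctx = torch.cat([img, graph], dim=1)
        fused, _ = self.cross_attn(query=txt, key=ctx, value=ctx)
        # Self-attention over the three modality tokens plus the fused token
        tokens = torch.cat([img, txt, graph, fused], dim=1)
        tokens, _ = self.self_attn(tokens, tokens, tokens)
        # Pool into a single soft item token to be injected into the LLM prompt
        return self.to_llm(tokens.mean(dim=1))  # (batch, llm_dim)
```

In this sketch, the output acts like a learned "item token" that can be spliced into the LLM's input sequence alongside ordinary text tokens, which is one common way such fused embeddings are handed to a language model.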

The proposed TMF framework is trained on real-world customer behavior data from Walmart's e-commerce platform, covering categories such as Electronics, Pets, and Sports. Customer actions such as viewing, adding to cart, and purchasing define a sequence of behaviors; sequences without a purchase are excluded, and each category forms its own dataset, reflecting a different level of user-behavior complexity. TMF uses Llama2-7B as its backbone model, CLIP for the image and text encoders, and MBHT for item behavior embeddings. Evaluation requires each model to identify the ground-truth item from a candidate set, providing a robust test of recommendation accuracy for TMF and the baseline models alike.
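As a rough illustration of the encoding step, the sketch below shows how the three item modalities might be obtained before fusion: CLIP (via HuggingFace transformers) for image and text features, and a precomputed behavior-graph embedding looked up from a table. The checkpoint name, the helper function, and the embedding table are assumptions for illustration, not the paper's exact pipeline.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Publicly available CLIP checkpoint used here purely as an example encoder
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def encode_item(image_path: str, title: str, item_id: int,
                graph_table: torch.Tensor):
    """Return (image, text, graph) embeddings for one catalog item."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=[title], images=[image],
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        img_emb = clip.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = clip.get_text_features(input_ids=inputs["input_ids"],
                                         attention_mask=inputs["attention_mask"])
    # Behavior-graph embedding: assumed to come from a separately trained
    # graph model (MBHT in the paper); here it is simply a row lookup.
    graph_emb = graph_table[item_id].unsqueeze(0)
    return img_emb, txt_emb, graph_emb
```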

Experimental results show that the TMF framework outperforms all baseline models across all datasets. It achieves over 38% HitRate@1 on the Electronics and Sports datasets, demonstrating its effectiveness in handling complex user-item interactions. Even on the simpler Pets dataset, which has a lower #Item/#User ratio, TMF with modality fusion still outperforms the Llama2 baseline, improving recommendation accuracy. The proposed AMSA module contributes a substantial share of this gain, suggesting that injecting multiple modalities of item information allows the LLM-based recommender to understand items better by integrating image, text, and graph data.
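For clarity on the reported metric, here is a small sketch of the candidate-set evaluation: for each test case the model must pick the ground-truth item out of a candidate list, and HitRate@K measures how often the true item lands in the top K choices. The scoring function is a stand-in for whatever ranking signal the recommender produces.

```python
from typing import Callable, Sequence

def hit_rate_at_k(test_cases: Sequence[dict],
                  score_fn: Callable[[dict, str], float],
                  k: int = 1) -> float:
    """test_cases: dicts with 'candidates' (list of item ids) and 'target'."""
    hits = 0
    for case in test_cases:
        # Rank candidates by the model's score for this user/context
        ranked = sorted(case["candidates"],
                        key=lambda item: score_fn(case, item),
                        reverse=True)
        if case["target"] in ranked[:k]:
            hits += 1
    return hits / len(test_cases)
```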

In conclusion, the researchers present the Triple Modality Fusion (TMF) framework, which improves multi-modality recommendation by combining visual, textual, and graph data with LLMs. This integration enables a deeper understanding of user behavior and product characteristics, leading to more accurate and contextual recommendations. TMF uses a modality fusion module based on cross-attention and self-attention mechanisms to successfully align the different data sources. Extensive experiments confirm TMF's strong performance on recommendation tasks, while ablation studies highlight the contribution of each modality and confirm the effectiveness of the cross-attention mechanism in improving model accuracy.


Check out the Paper. All credit for this study goes to the researchers of this project. Also, don't forget to follow us on Twitter and join our Telegram Channel and LinkedIn Group. Don't forget to join our 60k+ ML SubReddit.



Sajjad Ansari is a final year graduate of IIT Kharagpur. As a Tech Enthusiast, he examines the applications of AI with a focus on understanding the impact of AI technologies and their real-world implications. He aims to convey complex AI concepts in a clear and accessible manner.

