DeepMMSearch-R1: Enabling Multimodal LLMs in Multimodal Web Search

Multimodal Large Language Models (MLLMs) deployed in real-world applications require access to external knowledge sources and must stay responsive to dynamic, ever-changing real-world information in order to address information-seeking and knowledge-intensive user queries. Existing approaches, such as retrieval augmented generation (RAG) methods, search agents, and search-equipped MLLMs, often suffer from rigid pipelines, excessive search calls, and poorly constructed search queries, resulting in inefficiency and inferior results. To address these limitations, we present DeepMMSearch-R1, the first multimodal LLM capable of performing on-demand, multi-turn web searches and dynamically crafting queries for both image and text search tools. Specifically, DeepMMSearch-R1 can initiate web searches based on relevant crops of the input image, making image search more effective, and can iteratively refine text search queries based on the retrieved information, thereby enabling self-reflection and self-correction. Our approach relies on a two-stage training pipeline: a cold-start supervised finetuning phase followed by online reinforcement learning optimization. For training, we introduce DeepMMSearchVQA, a novel multimodal VQA dataset created through an automated pipeline intermixed with real-world information from web search tools. This dataset contains diverse, multi-hop queries that integrate textual and visual information, teaching the model when to search, what to search for, which search tool to use, and how to reason over the retrieved information. We conduct extensive experiments across a range of knowledge-intensive benchmarks to demonstrate the superiority of our approach. Finally, we analyze the results and provide insights to advance multimodal web search.
- † Johns Hopkins University
- ** Work done while at Apple
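
As a rough illustration of the multi-turn search behavior described in the abstract, the sketch below shows one way such an agentic loop could be structured: the model chooses at each turn whether to answer, issue or refine a text query, or search with a relevant crop of the input image, and all retrieved results are fed back for self-reflection. All names here (`call_mllm`, `text_search`, `image_search_with_crop`, `max_turns`) are illustrative assumptions, not the paper's actual interface.

```python
# Minimal sketch of a multi-turn multimodal search loop, assuming
# hypothetical helpers call_mllm, text_search, and image_search_with_crop.
from dataclasses import dataclass


@dataclass
class Turn:
    action: str        # "answer", "text_search", or "image_search"
    argument: str      # final answer, text query, or crop description
    observation: str   # retrieved web results fed back to the model


def run_search_loop(question, image, call_mllm, text_search,
                    image_search_with_crop, max_turns: int = 4) -> str:
    """Let the MLLM decide, turn by turn, whether to answer directly,
    refine a text query, or search the web with an image crop."""
    history: list[Turn] = []
    for _ in range(max_turns):
        # The model sees the question, the image, and all prior search
        # results, so it can rewrite queries that returned poor evidence.
        action, argument = call_mllm(question, image, history)
        if action == "answer":
            return argument
        if action == "text_search":
            observation = text_search(argument)                     # refined text query
        else:
            observation = image_search_with_crop(image, argument)   # crop of the input image
        history.append(Turn(action, argument, observation))
    # Fall back: force a final answer from whatever evidence was gathered.
    _, argument = call_mllm(question, image, history, force_answer=True)
    return argument
```

In this sketch the termination decision, tool choice, and query rewriting are all delegated to the model itself, which is the behavior the two-stage training pipeline (supervised cold start, then online RL) is meant to instill.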


