Generative AI

ToolHop: A Novel Dataset Designed to Evaluate LLMs in Multi-Hop Tool Usage Scenarios

Multi-hop queries have always been difficult for LLM agents, since answering them requires multiple steps of reasoning and information drawn from different sources. They are important for probing a model's understanding, reasoning, and function-calling ability. At a time when new large models appear daily with claims of unparalleled capabilities, multi-hop tool use genuinely tests them: the model is given a complex question that must be decomposed into atomic parts and solved iteratively by calling the appropriate tools. Evaluating multi-hop tool use has therefore emerged as an important step in developing models with general intelligence.
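To make the setting concrete, here is a minimal, purely hypothetical sketch (the novel, tools, and data are invented, not taken from ToolHop): a query like "What is the population of the birthplace of the author of Novel X?" decomposes into a chain of atomic tool calls, each consuming the previous hop's answer.

```python
# Hypothetical sketch of multi-hop decomposition: each atomic
# sub-query is resolved by one tool call, and its answer feeds
# the next hop. Tool names and data are invented for illustration.

def find_author(novel: str) -> str:
    # Stand-in for a real lookup tool.
    return {"Novel X": "Jane Doe"}[novel]

def find_birthplace(person: str) -> str:
    return {"Jane Doe": "Springfield"}[person]

def get_population(city: str) -> int:
    return {"Springfield": 169_000}[city]

# Query: "What is the population of the birthplace of the author of Novel X?"
author = find_author("Novel X")      # hop 1
city = find_birthplace(author)       # hop 2
population = get_population(city)    # hop 3
print(population)                    # 169000
```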

Existing work in this field falls short of providing a reliable assessment method. The approaches proposed so far rely on tool-driven data construction, where queries are modeled around a preselected set of tools. This leaves a gap: the interdependencies among the clustered tools are never verified, so genuine multi-hop reasoning is not actually tested. Additionally, the lack of verifiable answers introduces model bias and evaluation errors. This article discusses recent research that presents a reliable method for evaluating the multi-hop tool-use capabilities of large language models.

Researchers from Fudan University and ByteDance have introduced ToolHop, a dataset expressly designed for multi-hop tool-use evaluation, with 995 rigorously designed user queries and 3,912 associated tools. ToolHop aims to solve the problems above through diverse queries, locally executable tools, meaningful interdependencies, detailed feedback, and verifiable answers. The authors propose a novel query-driven data construction scheme that can expand a single multi-hop query into a complete multi-hop tool-use test case.

The proposed scheme comprises three main phases: tool creation, document refinement, and code generation.

Tool Creation: An initial set of tool documents is generated from the user's multi-hop query. The query is broken down into atomic sub-queries, which are handled one by one, so that the resulting documents are interdependent yet individually usable. In this way, each document captures both the essence of its sub-query and a structure general enough to support similar queries, ensuring flexibility and consistency.
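For illustration, a tool document derived from one atomic sub-query might look like the following sketch, written in the JSON-schema style commonly used for LLM function calling; the tool name and fields are hypothetical, not ToolHop's exact format.

```python
# Illustrative tool document for the atomic sub-query
# "Who wrote Novel X?"; field names follow the common
# function-calling schema, not ToolHop's exact format.
find_author_doc = {
    "name": "find_author",
    "description": "Return the author of a given novel.",
    "parameters": {
        "type": "object",
        "properties": {
            "novel": {
                "type": "string",
                "description": "Title of the novel to look up.",
            }
        },
        "required": ["novel"],
    },
}
```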

Document Refinement: The initial tool documents are then thoroughly filtered and reworked so that they can support testing models in complex multi-hop scenarios. New features such as result filtering and customizable output formats are introduced to extend functionality while preserving each tool's original purpose. Likewise, the number of parameters is expanded and their types are refined.
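Continuing the hypothetical example, refinement might extend the same document with result-filtering and output-format options and richer parameter types, while the tool's core behavior stays intact:

```python
# The same hypothetical tool document after refinement: new
# optional parameters broaden functionality while the core
# behavior (looking up an author) is preserved.
find_author_doc_refined = {
    "name": "find_author",
    "description": "Return the author(s) of a given novel.",
    "parameters": {
        "type": "object",
        "properties": {
            "novel": {"type": "string",
                      "description": "Title of the novel to look up."},
            "include_coauthors": {"type": "boolean", "default": False,
                                  "description": "Also return co-authors."},
            "output_format": {"type": "string",
                              "enum": ["name_only", "full_record"],
                              "default": "name_only"},
        },
        "required": ["novel"],
    },
}
```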

Code Generation: In this phase, each refined tool document is implemented as a locally executable function. Because the tools run locally rather than relying on external services, the model can engage in seamless multi-turn interactions with them.
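A minimal sketch of that phase, assuming the hypothetical refined document above: the document is implemented as a locally executable Python function, and a model-issued tool call can be dispatched to it directly.

```python
import json

# Hypothetical locally executable implementation of the refined
# find_author document; the lookup table stands in for real data.
AUTHORS = {"Novel X": {"author": "Jane Doe", "coauthors": ["John Roe"]}}

def find_author(novel: str, include_coauthors: bool = False,
                output_format: str = "name_only") -> str:
    record = AUTHORS[novel]
    names = [record["author"]]
    if include_coauthors:
        names += record["coauthors"]
    if output_format == "full_record":
        return json.dumps({"novel": novel, "authors": names})
    return ", ".join(names)

# Executing a model-issued tool call locally, with no external service:
call = {"name": "find_author",
        "arguments": {"novel": "Novel X", "include_coauthors": True}}
result = globals()[call["name"]](**call["arguments"])
print(result)  # Jane Doe, John Roe
```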

The research team applied this scheme to queries drawn from the MoreHopQA dataset. To validate ToolHop, a rigorous five-dimensional analysis was then performed. ToolHop was used to evaluate fourteen LLMs from five families, covering both open- and closed-source models. The evaluation was designed to ensure answer accuracy and minimize invocation errors. The authors observed that using tools improved model performance by up to 12% on average, and by up to 23% for the GPT family. Even so, the best-performing model reached only 49.04% answer accuracy. Moreover, without using tools, models answered multi-hop queries correctly only about 10% of the time.
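As a rough sketch of what such answer-accuracy scoring could look like (the paper's actual harness is more involved, and all names here are placeholders):

```python
# Minimal sketch of answer-accuracy scoring; `ask_model` stands in
# for a real multi-turn tool-calling loop and is stubbed here so
# the sketch runs end to end.
def ask_model(query: str, tools: dict) -> str:
    return "169000"  # a real harness would drive the model + tools

def answer_accuracy(cases: list[dict]) -> float:
    # Compare each final answer with the case's verifiable gold answer.
    correct = sum(ask_model(c["query"], c["tools"]) == c["answer"]
                  for c in cases)
    return correct / len(cases)

cases = [{"query": "Population of the birthplace of Novel X's author?",
          "tools": {}, "answer": "169000"}]
print(answer_accuracy(cases))  # 1.0
```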

Conclusion:

This paper presents a comprehensive dataset for evaluating multi-hop queries using specially designed queries and tools. The main finding of the study is that while tool use significantly improves LLMs' ability to solve complex multi-hop questions, their multi-hop tool-use capabilities still leave considerable room for improvement.


Check out the Paper. All credit for this study goes to the researchers of this project.



Adeeba Alam Ansari is currently pursuing a Dual Degree at the Indian Institute of Technology (IIT) Kharagpur, with a B.Tech in Industrial Engineering and an M.Tech in Financial Engineering. With a deep interest in machine learning and artificial intelligence, she is an avid reader and a curious person. Adeeba strongly believes in the power of technology to empower society and promote well-being through innovative solutions driven by empathy and a deep understanding of real-world challenges.


