Generative AI

ToolHop: A Novel Dataset Designed to Evaluate LLMs in Multi-Hop Tool Usage Scenarios

Multi-hop queries have always been difficult for LLM agents, since answering them requires multiple steps of reasoning and information drawn from different sources. They are important for probing a model's understanding, reasoning, and function-calling ability. At a time when new large models appear daily with claims of unparalleled capabilities, multi-hop tool use genuinely tests them: the model is given a complex question that must be decomposed into atomic parts and solved iteratively by calling the appropriate tools. Evaluating multi-hop tool use has therefore emerged as an important step in developing models with general intelligence.
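To make the setting concrete, here is a minimal, purely hypothetical sketch (the novel, tools, and data are invented, not taken from ToolHop): a query like "What is the population of the birthplace of the author of Novel X?" decomposes into a chain of atomic tool calls, each consuming the previous hop's answer.

```python
# Hypothetical sketch of multi-hop decomposition: each atomic
# sub-query is resolved by one tool call, and its answer feeds
# the next hop. Tool names and data are invented for illustration.

def find_author(novel: str) -> str:
    # Stand-in for a real lookup tool.
    return {"Novel X": "Jane Doe"}[novel]

def find_birthplace(person: str) -> str:
    return {"Jane Doe": "Springfield"}[person]

def get_population(city: str) -> int:
    return {"Springfield": 169_000}[city]

# Query: "What is the population of the birthplace of the author of Novel X?"
author = find_author("Novel X")      # hop 1
city = find_birthplace(author)       # hop 2
population = get_population(city)    # hop 3
print(population)                    # 169000
```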

Existing work in this field falls short of providing a reliable assessment method. The approaches proposed so far rely on tool-driven data construction, where queries are modeled around a preselected set of tools. This leaves a gap: the interdependencies among the clustered tools are never verified, so genuine multi-hop reasoning is not actually tested. Additionally, the lack of verifiable answers introduces model bias and evaluation errors. This article discusses recent research that presents a reliable method for evaluating the multi-hop tool-use capabilities of large language models.

Researchers from Fudan University and ByteDance have introduced ToolHop, a dataset expressly designed for multi-hop tool-use evaluation, with 995 rigorously designed user queries and 3,912 associated tools. ToolHop aims to solve the problems above through diverse queries, locally executable tools, meaningful interdependencies, detailed feedback, and verifiable answers. The authors propose a novel query-driven data construction scheme that can expand a single multi-hop query into a complete multi-hop tool-use test case.

The proposed scheme comprises three main phases: tool creation, document refinement, and code generation.

Tool Creation: An initial set of tool documents is generated from the user's multi-hop query. The query is broken down into atomic sub-queries, which are handled one by one, so that the resulting documents are interdependent yet individually usable. In this way, each document captures both the essence of its sub-query and a structure general enough to support similar queries, ensuring flexibility and consistency.
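For illustration, a tool document derived from one atomic sub-query might look like the following sketch, written in the JSON-schema style commonly used for LLM function calling; the tool name and fields are hypothetical, not ToolHop's exact format.

```python
# Illustrative tool document for the atomic sub-query
# "Who wrote Novel X?"; field names follow the common
# function-calling schema, not ToolHop's exact format.
find_author_doc = {
    "name": "find_author",
    "description": "Return the author of a given novel.",
    "parameters": {
        "type": "object",
        "properties": {
            "novel": {
                "type": "string",
                "description": "Title of the novel to look up.",
            }
        },
        "required": ["novel"],
    },
}
```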

Document Refinement: The initial tool documents are then thoroughly filtered and reworked so that they can support testing models in complex multi-hop scenarios. New features such as result filtering and customizable output formats are introduced to extend functionality while preserving each tool's original purpose. Likewise, the number of parameters is expanded and their types are refined.
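Continuing the hypothetical example, refinement might extend the same document with result-filtering and output-format options and richer parameter types, while the tool's core behavior stays intact:

```python
# The same hypothetical tool document after refinement: new
# optional parameters broaden functionality while the core
# behavior (looking up an author) is preserved.
find_author_doc_refined = {
    "name": "find_author",
    "description": "Return the author(s) of a given novel.",
    "parameters": {
        "type": "object",
        "properties": {
            "novel": {"type": "string",
                      "description": "Title of the novel to look up."},
            "include_coauthors": {"type": "boolean", "default": False,
                                  "description": "Also return co-authors."},
            "output_format": {"type": "string",
                              "enum": ["name_only", "full_record"],
                              "default": "name_only"},
        },
        "required": ["novel"],
    },
}
```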

Code Generation: In this phase, each refined tool document is implemented as a locally executable function. Because the tools run locally rather than relying on external services, the model can engage in seamless multi-turn interactions with them.
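A minimal sketch of that phase, assuming the hypothetical refined document above: the document is implemented as a locally executable Python function, and a model-issued tool call can be dispatched to it directly.

```python
import json

# Hypothetical locally executable implementation of the refined
# find_author document; the lookup table stands in for real data.
AUTHORS = {"Novel X": {"author": "Jane Doe", "coauthors": ["John Roe"]}}

def find_author(novel: str, include_coauthors: bool = False,
                output_format: str = "name_only") -> str:
    record = AUTHORS[novel]
    names = [record["author"]]
    if include_coauthors:
        names += record["coauthors"]
    if output_format == "full_record":
        return json.dumps({"novel": novel, "authors": names})
    return ", ".join(names)

# Executing a model-issued tool call locally, with no external service:
call = {"name": "find_author",
        "arguments": {"novel": "Novel X", "include_coauthors": True}}
result = globals()[call["name"]](**call["arguments"])
print(result)  # Jane Doe, John Roe
```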

The research team applied this scheme to queries drawn from the MoreHopQA dataset. To validate ToolHop, a rigorous five-dimensional analysis was then performed. ToolHop was used to evaluate fourteen LLMs from five families, covering both open- and closed-source models. The evaluation was designed to ensure answer accuracy and minimize invocation errors. The authors observed that using tools improved model performance by up to 12% on average, and by up to 23% for the GPT family. Even so, the best-performing model reached only 49.04% answer accuracy. Moreover, without using tools, models answered multi-hop queries correctly only about 10% of the time.
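As a rough sketch of what such answer-accuracy scoring could look like (the paper's actual harness is more involved, and all names here are placeholders):

```python
# Minimal sketch of answer-accuracy scoring; `ask_model` stands in
# for a real multi-turn tool-calling loop and is stubbed here so
# the sketch runs end to end.
def ask_model(query: str, tools: dict) -> str:
    return "169000"  # a real harness would drive the model + tools

def answer_accuracy(cases: list[dict]) -> float:
    # Compare each final answer with the case's verifiable gold answer.
    correct = sum(ask_model(c["query"], c["tools"]) == c["answer"]
                  for c in cases)
    return correct / len(cases)

cases = [{"query": "Population of the birthplace of Novel X's author?",
          "tools": {}, "answer": "169000"}]
print(answer_accuracy(cases))  # 1.0
```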

Conclusion:

This paper presents a comprehensive dataset for evaluating multi-hop queries using specially designed queries and tools. The main finding of the study is that while tool use significantly improves LLMs' ability to solve complex multi-hop questions, their multi-hop tool-use capabilities still leave considerable room for improvement.


Check out the Paper. All credit for this study goes to the researchers of this project.



Adeeba Alam Ansari is currently pursuing a Dual Degree at the Indian Institute of Technology (IIT) Kharagpur, with a B.Tech in Industrial Engineering and an M.Tech in Financial Engineering. With a deep interest in machine learning and artificial intelligence, she is an avid reader and a curious person. Adeeba strongly believes in the power of technology to empower society and promote well-being through innovative solutions driven by empathy and a deep understanding of real-world challenges.


