Google Proposes TUMIX: Multi-Agent Test-Time Scaling With Tool-Use Mixture

What if, instead of re-sampling a single agent, you could push Gemini-2.5 Pro to 34.1% on HLE by mixing 12-16 tool-using agents that share their answers? Google Cloud AI Research, with collaborators from MIT, Harvard, and Google DeepMind, introduced TUMIX (Tool-Use Mixture), a test-time framework that ensembles heterogeneous agent styles (text-only, code, search, guided variants), lets them share intermediate answers over a few refinement rounds, and then stops early via an LLM-based judge. The result: higher accuracy at lower cost on hard reasoning benchmarks such as HLE, GPQA-Diamond, and AIME (2024/2025).

So, what's really different?

  • Mixture over modality, not just more samples: TUMIX runs ~15 agent styles spanning Chain-of-Thought (CoT), code execution, web search, dual-tool agents, and guided variants. Each round, every agent sees (a) the original question and (b) the other agents' previous answers, then proposes a refined answer. This message passing raises average accuracy in the early rounds while diversity gradually collapses, so knowing when to stop matters.
  • Adaptive early termination: An LLM-as-judge halts refinement once the answers show strong consensus (subject to a minimum-round threshold). This preserves accuracy at ~49% of the inference cost vs. fixed-round refinement; token cost drops to ~46% because late rounds are token-heavier.
  • Auto-designed agents: Beyond the human-crafted agents, TUMIX prompts the base LLM to generate new agent types; mixing these with the manual set yields an additional ~+1.2% average lift at no extra cost. The empirical "sweet spot" is ~12-16 agent styles.
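
The message-passing step in the first bullet can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the `Agent` signature, the prompt wording, the helper names, and the toy stub agents are all assumptions made for the example.

```python
from typing import Callable, Dict

# Hypothetical agent signature: prompt string in, answer string out.
Agent = Callable[[str], str]


def build_refinement_prompt(question: str, own_name: str,
                            prev_answers: Dict[str, str]) -> str:
    """Combine the original question with the other agents' previous answers."""
    notes = [f"- {name}: {ans}" for name, ans in sorted(prev_answers.items())
             if name != own_name]
    return "\n".join([f"Question: {question}",
                      "Previous answers from other agents:",
                      *notes,
                      "Give your refined answer."])


def refine_round(question: str, agents: Dict[str, Agent],
                 prev_answers: Dict[str, str]) -> Dict[str, str]:
    """One refinement round: every agent answers again using the shared notes."""
    return {name: agent(build_refinement_prompt(question, name, prev_answers))
            for name, agent in agents.items()}


def distinct_answers(answers: Dict[str, str]) -> int:
    """Crude diversity measure: number of distinct answers in a round."""
    return len(set(answers.values()))


def make_stub(default: str) -> Agent:
    """Toy agent (assumption): adopts "4" if any peer reported it, else keeps its default."""
    def agent(prompt: str) -> str:
        return "4" if ": 4" in prompt else default
    return agent


agents = {"cot": make_stub("4"), "code": make_stub("4"), "search": make_stub("5")}
round0 = {"cot": "4", "code": "4", "search": "5"}
round1 = refine_round("What is 2 + 2?", agents, round0)
print(distinct_answers(round0), "->", distinct_answers(round1))  # prints "2 -> 1"
```

In this toy run the dissenting agent adopts its peers' answer after one round, so the number of distinct answers falls from 2 to 1. That is the diversity collapse that makes the stopping rule relevant.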

How does this work?

TUMIX runs a group of heterogeneous agents in parallel (text-only Chain-of-Thought, code-executing, web-searching, and guided variants), then iterates a small number of refinement rounds in which each agent conditions on the original question plus the other agents' prior rationales and answers. After each round, an LLM-based judge evaluates consensus and consistency to decide whether to terminate early; if confidence is insufficient, another round is triggered, otherwise the system finalizes with a simple aggregation (e.g., a majority vote or a selector). This mixture-of-tool-use design trades brute-force re-sampling for diverse reasoning paths, improving coverage of correct candidates while controlling token and tool budgets. Empirically, the benefits saturate at around 12-16 agent styles, and stopping early preserves diversity and lowers cost without sacrificing accuracy.
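
The control flow described above (rounds, a judge, then aggregation) can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: a simple agreement-share check stands in for the LLM judge, and the `threshold`, `min_rounds`, and `max_rounds` values, the `Agent` signature, and the stub agents are all hypothetical.

```python
from collections import Counter
from typing import Callable, List, Tuple

# Hypothetical agent signature: (question, previous round's answers) -> answer.
Agent = Callable[[str, List[str]], str]


def majority_vote(answers: List[str]) -> str:
    """Simple aggregation: the most common answer (ties go to the first seen)."""
    return Counter(answers).most_common(1)[0][0]


def consensus_judge(answers: List[str], threshold: float = 0.8) -> bool:
    """Stand-in for the LLM judge: stop when the top answer's share reaches threshold."""
    top_count = Counter(answers).most_common(1)[0][1]
    return top_count / len(answers) >= threshold


def run_mixture(question: str, agents: List[Agent],
                min_rounds: int = 2, max_rounds: int = 5) -> Tuple[str, int]:
    """Iterate rounds until the judge is satisfied, then aggregate by majority vote."""
    answers: List[str] = []
    rounds_used = 0
    for round_idx in range(1, max_rounds + 1):
        # Each agent sees the question and the previous round's answers.
        answers = [agent(question, answers) for agent in agents]
        rounds_used = round_idx
        if round_idx >= min_rounds and consensus_judge(answers):
            break  # early termination
    return majority_vote(answers), rounds_used


def stubborn(answer: str) -> Agent:
    """Toy agent that never changes its answer."""
    return lambda question, prev: answer


def follower(initial: str) -> Agent:
    """Toy agent that adopts the previous round's majority answer."""
    return lambda question, prev: majority_vote(prev) if prev else initial


agents = [stubborn("A"), stubborn("A"), follower("B"), follower("C"), follower("A")]
print(run_mixture("toy question", agents))  # prints ('A', 2)
```

With these stubs, the first round yields a 3-of-5 majority, the followers converge in the second round, and the loop stops there rather than using all five rounds. A real system would replace `consensus_judge` with a call to an LLM and the stubs with tool-using agents.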

Let's discuss the results

Under inference budgets comparable to strong tool-augmented baselines (Self-MoA, Symbolic-MoE, DEI, SciMaster, GSA), TUMIX yields the best average accuracy; a scaled variant (TUMIX+) pushes further with more compute:

  • HLE (Humanity's Last Exam): Pro: 21.6% → 34.1% (TUMIX+); Flash: 9.7% → 23.1%.
    (HLE is a difficult, multi-domain benchmark of 2,500 questions, finalized in 2025.)
  • GPQA-Diamond: Pro: up to 88.3%; Flash: up to 82.1%. (GPQA-Diamond is the hardest 198-question subset, authored by domain experts.)
  • AIME 2024/25: Pro: 96.7%; Flash: 86.7% with TUMIX(+) at test time.
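
To put the HLE numbers above in perspective, the short calculation below converts them into absolute (percentage-point) and relative gains. It uses only the figures listed in the HLE bullet; the helper name `gains` is an assumption for illustration.

```python
# HLE accuracy (%) as listed above: (starting figure, figure after TUMIX scaling).
hle = {"Pro": (21.6, 34.1), "Flash": (9.7, 23.1)}


def gains(before: float, after: float) -> tuple:
    """Return (percentage-point gain, relative gain in %), rounded to one decimal."""
    points = round(after - before, 1)
    relative = round(100 * (after - before) / before, 1)
    return points, relative


for model, (before, after) in hle.items():
    points, relative = gains(before, after)
    print(f"{model}: +{points} points ({relative}% relative)")
# Pro: +12.5 points (57.9% relative)
# Flash: +13.4 points (138.1% relative)
```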

Across tasks, TUMIX averages +3.55% over the best prior tool-augmented test-time scaling baseline at similar cost, and +7.8% / +17.4% over no scaling for Pro / Flash, respectively.

TUMIX is a strong approach from Google because it frames test-time scaling as a search problem over heterogeneous tool policies rather than brute-force sampling. The parallel committee (text, code, search) improves candidate coverage, while the LLM judge enables early stopping that preserves diversity and reduces token and tool spend, which is useful under latency budgets. The HLE gains (34.1% with Gemini-2.5 Pro) are measured on the benchmark's finalized 2,500-question set, and the "sweet spot" of ~12-16 agent styles suggests that selection, not generation, is the limiting factor.


Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing artificial intelligence for social good. His most recent endeavor is the launch of an artificial intelligence media platform, Marktechpost, which offers in-depth coverage of machine learning and deep learning news that is clear and easy to understand. The platform draws more than two million monthly views, reflecting its popularity among its audience.
