FACTS Grounding: A new benchmark for evaluating the factuality of large language models
Responsibility and safety
Our comprehensive benchmark and online leaderboard provide a much-needed measure of how accurately LLMs ground their responses in provided source material and avoid hallucinations.
Large language models (LLMs) are transforming how we access information, yet their grip on factual accuracy remains imperfect. They can "hallucinate" false information, particularly when given complex inputs. In turn, this can erode trust in LLMs and limit their applications in the real world.
Today, we are introducing FACTS Grounding, a comprehensive benchmark for evaluating the ability of LLMs to generate responses that are not only factually accurate with respect to given inputs, but also detailed enough to provide satisfactory answers to user queries.
We hope our benchmark will spur industry-wide progress on factuality and grounding. To track progress, we're also launching a FACTS leaderboard on Kaggle. We've already tested leading LLMs using FACTS Grounding and populated the initial leaderboard with their grounding scores, and we will maintain and update the leaderboard as the field advances.
The FACTS Grounding dataset
To accurately evaluate the factuality and grounding of any given LLM, the FACTS Grounding dataset comprises 1,719 examples, each carefully crafted to require long-form responses grounded in a provided context document. Each example includes a document, a system instruction requiring the LLM to reference only the provided document, and an accompanying user request.
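The released files follow their own schema, but as a rough illustration of how such an example could be represented and turned into a single model prompt, consider the minimal sketch below. The field names (system_instruction, user_request, context_document) and the prompt layout are illustrative assumptions, not the dataset's actual format.

```python
# Illustrative sketch only: field names and prompt layout are assumptions,
# not the actual FACTS Grounding schema.
from dataclasses import dataclass


@dataclass
class GroundingExample:
    system_instruction: str   # e.g. "Answer using only the document below."
    user_request: str         # the task: a question, summary request, rewrite, etc.
    context_document: str     # the source text the response must be grounded in


def build_prompt(example: GroundingExample) -> str:
    """Combine the instruction, document, and request into one prompt,
    so the model's response can be checked against the document alone."""
    return (
        f"{example.system_instruction}\n\n"
        f"Document:\n{example.context_document}\n\n"
        f"Request: {example.user_request}"
    )


example = GroundingExample(
    system_instruction="Answer the request using only the provided document.",
    user_request="Summarize the key risk factors described in the filing.",
    context_document="<long source document goes here>",
)
print(build_prompt(example))
```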
All examples are divided into a "public" set (860) and a held-out "private" set (859). We're releasing the public set today so anyone can use it to evaluate an LLM. Of course, we know that benchmark contamination and leaderboard hacking are important to protect against, so, following standard industry practice, we are keeping the private evaluation set held out. FACTS leaderboard scores are the average performance across both the public and private sets.
To ensure a diversity of inputs, FACTS Grounding examples include documents of varying lengths, up to 32,000 tokens (roughly 20,000 words), covering domains such as finance, technology, retail, medicine, and law. User requests are similarly broad, including requests for summarization, Q&A generation, and rewriting tasks. We did not include any examples that could require creativity, mathematics, or complex reasoning: capabilities that might require the model to apply more advanced reasoning in addition to grounding.
Collective judgement by leading LLMs
To succeed on a given example, an LLM must synthesize the complex information in the document and generate a long-form response that is both a comprehensive answer to the user request and fully attributable to that document.
FACTS Grounding evaluates model responses automatically using three frontier LLM judges: Gemini 1.5 Pro, GPT-4o, and Claude 3.5 Sonnet. We selected a mix of different judges to mitigate any potential bias of a judge giving higher scores to responses produced by a member of its own model family. The automatic judge models were comprehensively evaluated against a held-out test set to find the best-performing judging prompt templates and to verify agreement with human raters.
Each FACTS Grounding example is judged in two phases. First, responses are evaluated for eligibility, and disqualified if they don't sufficiently address the user's request. Second, responses are judged as factually accurate if they are fully grounded in the information contained in the provided document, with no hallucinations.
With the eligibility and grounding accuracy of a given LLM response evaluated separately by multiple AI judge models, the results are then aggregated to determine whether the LLM handled the example successfully. The final score for the overall grounding task is the average of all judge models' scores across all examples. You can find more details of our FACTS Grounding evaluation methodology in our paper.
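Our paper gives the precise formulation; purely as a hypothetical sketch of the two-phase judging and aggregation described above (the verdict format, function names, and judge labels are assumptions, not our implementation), the logic is roughly:

```python
# Hypothetical sketch of two-phase judging and score aggregation; the
# verdict format and helper names are assumptions, not the actual pipeline.

def score_response(eligible: bool, grounded: bool) -> float:
    """Phase 1: ineligible responses score 0 regardless of grounding.
    Phase 2: eligible responses score 1 only if fully grounded."""
    return 1.0 if (eligible and grounded) else 0.0


def final_score(verdicts: dict[str, list[tuple[bool, bool]]]) -> float:
    """Average each judge model's per-example scores, then average
    across judges to obtain the overall grounding score."""
    per_judge = [
        sum(score_response(e, g) for e, g in examples) / len(examples)
        for examples in verdicts.values()
    ]
    return sum(per_judge) / len(per_judge)


# Example: three judges, each returning (eligible, grounded) per example.
verdicts = {
    "judge_a": [(True, True), (True, False), (False, True)],
    "judge_b": [(True, True), (True, True), (False, False)],
    "judge_c": [(True, True), (True, False), (True, True)],
}
print(f"Overall grounding score: {final_score(verdicts):.2f}")
```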
FACTS Grounding will continue to evolve
We're mindful that benchmarks can quickly be overtaken by progress, so this launch of our FACTS Grounding benchmark and leaderboard is just the beginning. Factuality and grounding are among the key factors that will shape the future success and usefulness of LLMs and broader AI systems, and we aim to grow and iterate on FACTS Grounding as the field progresses, continually raising the bar.
We encourage the AI community to engage with FACTS Grounding, evaluate their models on the open set of examples, or submit their models for evaluation. We believe that comprehensive benchmarking methods, coupled with continued research and development, will continue to improve AI systems.
Acknowledgements
FACTS Grounding is a collaboration between Google DeepMind and Google Research.
FACTS Grounding is led by: Alon Jacovi, Andrew Wang, Chris Alberti, Connie Tao, Dipanjan Das, Jon Lipovetz, Kate Olszewska, Lukas Haas, Michelle Liu, and Nate Keating.
We also greatly appreciate the contributions from: Adam Bloniarz, Carl Saroufim, Corey Fry, Dror Marcus, Doron Kukliansky, Gaurav Singh Tomar, James Swirhun, Jinwei Xing, Lily Wang, Madhu Gurumurthy, Michael Aaron, Moran Ambar, Rachana Fellinger, Rui Wang, Zizhao Zhang, and Sasha Goldstein.
We would also like to thank Avinatan Hassidim, D. Sculley, Fernando Pereira, Koray Kavukcuoglu, Slav Petrov, Ya Xu, and Yossi Matias for their continued support.