Testing modern AI on Kaggle

Today, Kaggle is launching Community Benchmarks, which let the global AI community design, run, and share their own custom benchmarks for testing AI models. This is the next step after last year's launch of Kaggle Benchmarks, which provides reliable and transparent access to evaluations from leading research groups, such as Meta's MultiLoKo and Google's FACTS suite.
Why community-driven evaluation is important
AI capabilities are evolving so rapidly that evaluating a model's performance has become difficult. Until recently, accuracy scores on static datasets were enough to gauge model quality. But as LLMs become collaborative reasoning agents that write code and use tools, those static metrics and simple assessments are no longer enough.
Kaggle Community Benchmarks give developers a transparent way to validate models against their specific use cases and to bridge the gap between evaluation scores and production-ready applications.
These real-world use cases demand a flexible, transparent testing framework. Kaggle's Community Benchmarks offer a robust, continuous way to evaluate AI models, shaped by the users who build and run these systems every day.
How to create your own benchmarks on Kaggle
Evaluation starts with building tasks, which can range from multi-step reasoning tests and code generation to tool use and image understanding. Once you have tasks, you can add them to a benchmark to measure how selected models perform across all of the tasks it contains.
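To make the idea of a task concrete, here is a minimal sketch in plain Python of what a task conceptually consists of: a prompt, a reference answer, and a scoring function. The `Task` class, `exact_match` scorer, and example prompt are illustrative assumptions, not Kaggle's actual task format or API.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical, minimal representation of a benchmark task: a prompt, a
# reference answer, and a scorer. Illustrative only; not Kaggle's task format.
@dataclass
class Task:
    name: str
    prompt: str
    reference: str
    scorer: Callable[[str, str], float]  # (model_output, reference) -> score in [0, 1]

def exact_match(output: str, reference: str) -> float:
    """Score 1.0 if the model's answer matches the reference exactly, else 0.0."""
    return 1.0 if output.strip().lower() == reference.strip().lower() else 0.0

# An example multi-step reasoning task of the kind described above.
arithmetic_task = Task(
    name="two-step-arithmetic",
    prompt="A train travels 60 km/h for 2 hours, then 80 km/h for 1 hour. Total distance in km?",
    reference="200",
    scorer=exact_match,
)
```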
Here's how to get started:
- Create a task: Tasks test an AI model's performance on a specific problem. They let you run repeatable evaluations across different models to compare their accuracy and capabilities.
- Create a benchmark: Once you've created one or more tasks, you can combine them into a benchmark. A benchmark lets you run its tasks against a series of advanced AI models and generates a leaderboard to track and compare their performance (a sketch of this aggregation follows the list).
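The sketch below shows how a benchmark might aggregate per-task scores into a leaderboard. The task set, model names, and their outputs are stubbed assumptions for illustration; in practice the outputs would come from running each model on every task through Kaggle's platform, and this is not Kaggle's actual API.

```python
from statistics import mean

# Hypothetical tasks mapped to their reference answers.
tasks = {
    "two-step-arithmetic": "200",
    "capital-of-france": "paris",
}

# Hypothetical outputs from two models on each task (stubbed for illustration).
model_outputs = {
    "model-a": {"two-step-arithmetic": "200", "capital-of-france": "Paris"},
    "model-b": {"two-step-arithmetic": "190", "capital-of-france": "Paris"},
}

def exact_match(output: str, reference: str) -> float:
    """Score 1.0 if the output matches the reference exactly, else 0.0."""
    return 1.0 if output.strip().lower() == reference.strip().lower() else 0.0

# Average each model's per-task scores and sort to form a leaderboard.
leaderboard = sorted(
    (
        (model, mean(exact_match(outputs[name], ref) for name, ref in tasks.items()))
        for model, outputs in model_outputs.items()
    ),
    key=lambda row: row[1],
    reverse=True,
)

for rank, (model, score) in enumerate(leaderboard, start=1):
    print(f"{rank}. {model}: {score:.2f}")
```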



