
Meta AI Releases NeuralBench: An Open Source Integrated Framework to Benchmark NeuroAI Models on 36 EEG Tasks and 94 Datasets

Benchmarking AI models trained on brain signals has long been a messy, contentious business. Different research groups use different preprocessing pipelines, train models on different datasets, and report results on a narrow set of tasks, making it almost impossible to know which model actually works best, or why. A new framework from the Meta AI team is designed to fix that.

Meta researchers have released NeuralBench, a compact, open-source framework for benchmarking AI models of brain activity. Its first release, NeuralBench-EEG v1.0, is the largest open benchmark of its kind: 36 downstream tasks, 94 datasets, 9,478 subjects, 13,603 hours of electroencephalography (EEG) data, and 14 deep learning architectures evaluated under one common interface.

The Problem NeuralBench Solves

The broad field of NeuroAI, where deep learning meets neuroscience, has exploded in recent years. The self-supervised recipes that produced foundation models for language, speech, and vision are now being applied to build foundation models of the brain: large models pretrained on non-invasive brain recordings and fine-tuned for tasks ranging from seizure detection to decoding what a person sees or hears.

But the evaluation landscape is badly fragmented. Existing benchmarks such as MOABB cover up to 148 brain-computer interfacing (BCI) datasets but limit testing to only 5 downstream tasks. Other efforts, such as EEG-Bench, EEG-FM-Bench, and AdaBrain-Bench, are each constrained in their own ways. For modalities such as magnetoencephalography (MEG) and functional magnetic resonance imaging (fMRI), no systematic benchmark exists at all.

The result: claims that foundation models are "general" or "foundational" often rest on cherry-picked tasks with no common point of reference.

What is NeuralBench?

NeuralBench is built as a modular pipeline of three core Python packages.

NeuralFetch handles dataset acquisition, pulling curated data from public repositories including OpenNeuro, DANDI, and NEMAR. NeuralSet prepares the data as PyTorch-ready data loaders, wrapping established neuroscience tools such as MNE-Python and nilearn for preprocessing, and HuggingFace models for extracting stimulus embeddings (for tasks involving images, speech, or text). NeuralTrain provides modular training code built on PyTorch-Lightning, Pydantic, and the exca execution-and-caching library.

Once installed with pip install neuralbench, the framework is driven through a command-line interface (CLI). Running a job takes three commands: download the data, prepare the cache, and execute. Every task is configured with a lightweight YAML file that specifies the data source, train/validation/test split, preprocessing steps, target processing, training parameters, and evaluation metrics.
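
As a rough illustration of this config-driven design, a task file might look like the sketch below. Every field name here is hypothetical, chosen only to mirror the description above, and is not NeuralBench's actual schema; the snippet simply parses the YAML with PyYAML to show the shape of such a config.

import yaml  # PyYAML

# Hypothetical task configuration; all field names are illustrative,
# not NeuralBench's real schema.
config = yaml.safe_load("""
task: audiovisual_stimulus
data:
  source: openneuro            # public repository to fetch from
  split: cross_subject         # train/validation/test strategy
preprocessing:
  - bandpass: [0.5, 40.0]      # Hz
  - resample: 128              # Hz
target: stimulus_class
training:
  max_epochs: 50
  learning_rate: 1.0e-4
metric: balanced_accuracy
""")

print(config["training"]["learning_rate"])  # 0.0001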

What NeuralBench-EEG v1.0 Includes

The first release focuses on EEG and spans eight task categories: cognitive decoding (of images, sentences, speech, video, and words), brain-computer interfacing (BCI), evoked responses, clinical tasks, internal state, sleep, phenotyping, and miscellaneous tasks.

Three categories of models are compared:

  • Task-specific architectures (~1.5K–4.2M parameters, trained from scratch): ShallowFBCSPNet, Deep4Net, EEGNet, BDTCN, ATCNet, EEGConformer, SimpleConvTimeAgg, and CTNet.
  • EEG foundation models (~3.2M–157.1M parameters, pretrained and fine-tuned): BENDR, LaBraM, BIOT, CBraMod, LUNA, and REVE.
  • Handcrafted-feature baselines: sklearn-style pipelines that feed symmetric positive definite (SPD) matrix representations into logistic or ridge regression (a sketch of such a pipeline follows this list).
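
For context, the standard SPD-based EEG baseline looks something like the following pyriemann + scikit-learn pipeline. This is a generic sketch of the approach, not NeuralBench's exact baseline; the synthetic data and estimator choices are illustrative.

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from pyriemann.estimation import Covariances
from pyriemann.tangentspace import TangentSpace

# Synthetic EEG epochs: (n_trials, n_channels, n_samples).
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 32, 256))
y = rng.integers(0, 2, size=100)

# Per-trial covariance matrices are SPD; projecting them onto the
# tangent space yields Euclidean features a linear model can fit.
clf = make_pipeline(
    Covariances(estimator="oas"),
    TangentSpace(metric="riemann"),
    LogisticRegression(max_iter=1000),
)
clf.fit(X[:80], y[:80])
print("held-out accuracy:", clf.score(X[80:], y[80:]))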

All foundation models were fine-tuned end to end with a shared training recipe: the AdamW optimizer, learning rate 10⁻⁴, weight decay 0.05, cosine annealing with 10% warmup, and up to 50 epochs with early stopping (patience=10). The only exception is BENDR, where the learning rate is lowered to 10⁻⁵ and gradient clipping at 0.5 is applied to obtain stable learning curves. This deliberately uniform configuration strips out model-specific optimization tricks, such as layer-wise learning-rate decay, two-stage fine-tuning, or LoRA, so that the architecture and pretraining method, not the tuning recipe, are what gets evaluated.
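
In plain PyTorch, that shared recipe corresponds roughly to the sketch below. The model and validation loss are placeholders (the real pipeline runs through PyTorch-Lightning), but the optimizer, schedule, and early-stopping logic follow the numbers quoted above.

import math
import torch

model = torch.nn.Linear(64, 2)  # placeholder for an EEG architecture
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.05)

max_epochs, patience = 50, 10
warmup_epochs = int(0.1 * max_epochs)  # 10% linear warmup

def lr_lambda(epoch):
    if epoch < warmup_epochs:
        return (epoch + 1) / warmup_epochs          # linear warmup
    progress = (epoch - warmup_epochs) / max(1, max_epochs - warmup_epochs)
    return 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine annealing

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

best_val, bad_epochs = float("inf"), 0
for epoch in range(max_epochs):
    # ... train for one epoch, then compute the validation loss ...
    val_loss = 0.0  # placeholder for the real validation loss
    # For BENDR one would additionally clip gradients, e.g.:
    # torch.nn.utils.clip_grad_norm_(model.parameters(), 0.5)
    scheduler.step()
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:  # early stopping, patience = 10
            break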

Data splitting is handled differently per task type to reflect real-world generalization demands: predefined splits where the dataset's research team provides them, held-out-stimulus splits for brain-decoding tasks (all subjects are seen in training, but a fixed set of stimuli is reserved for testing), cross-subject splits for most clinical and BCI tasks, and within-subject splits for datasets with very few participants. Each model is trained three times per task with three different random seeds.
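
A cross-subject split of the kind described above can be sketched with scikit-learn's group-aware splitters; the shapes and subject IDs here are made up for illustration.

import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Fake trials: (n_trials, n_channels, n_samples), one subject ID each.
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 32, 256))
subjects = rng.integers(0, 50, size=1000)

# Hold out ~20% of *subjects*, so no subject spans train and test.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, groups=subjects))
assert not set(subjects[train_idx]) & set(subjects[test_idx])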

Evaluation metrics are matched to task type: balanced accuracy for binary and multi-class classification, macro F1 score for multi-label classification, Pearson correlation for regression, and top-5 accuracy for retrieval tasks. All results are reported as standardized scores (s̃), where 0 corresponds to dummy-level performance and 1 to perfect performance, allowing tasks to be compared regardless of the metric's scale.
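
That normalization can be written as a one-liner. The formula below is a reasonable reading of the description (a linear rescaling between dummy-level and perfect scores); the article does not spell out NeuralBench's exact definition.

def standardized_score(score, dummy_score, perfect_score=1.0):
    # 0 = dummy-level performance, 1 = perfect performance.
    return (score - dummy_score) / (perfect_score - dummy_score)

# e.g. 80% balanced accuracy on a balanced 4-class task (dummy = 0.25):
print(standardized_score(0.80, dummy_score=0.25))  # ~0.733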

One important methodological note: some EEG foundation models are pretrained on datasets that overlap with NeuralBench's downstream test sets. Rather than discarding these results, the benchmark flags them with hatched bars in the results figures so that readers can spot potential training-data leakage at a glance. No strong trend suggests that the leakage inflated the observed scores, but the flag keeps the comparison transparent.

The benchmark ships in two variants: NeuralBench-EEG-Core v1.0, which uses a single representative dataset per task for broad, affordable coverage, and NeuralBench-EEG-Full v1.0, which expands to as many as 24 datasets per task to study how performance varies across recording hardware, labs, and cohort statistics. A Kendall's τ of 0.926 (p < 0.001) between the Core and Full rankings confirms that the Core variant is a reliable proxy, although a few model rankings shift, including CTNet overtaking LUNA when the additional datasets are included.
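
The rank-agreement statistic is the standard Kendall's τ, computable with SciPy; the rank vectors below are invented for illustration, not the paper's actual rankings.

from scipy.stats import kendalltau

core_ranks = [1, 2, 3, 4, 5, 6, 7, 8]   # model ranks under Core
full_ranks = [1, 2, 4, 3, 5, 6, 8, 7]   # the same models under Full
tau, p_value = kendalltau(core_ranks, full_ranks)
print(f"Kendall's tau = {tau:.3f} (p = {p_value:.4f})")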

Two Important Findings

Finding 1: Foundation models lead, but only narrowly. The top-ranked models overall are REVE (69.2M parameters, mean rank 0.20), LaBraM (5.8M, rank 0.21), and LUNA (40.4M, rank 0.30). But several task-specific models trained from scratch, including CTNet (150K parameters, rank 0.32), SimpleConvTimeAgg (4.2M, rank 0.35), and Deep4Net (146K, rank 0.43), are close behind. On the Full variant, CTNet actually overtakes the LUNA foundation model to rank third overall, despite having nearly 270× fewer parameters. The gap between task-specific and foundation models is small enough that expanding dataset coverage alone can reshuffle the global rankings.

Finding 2: Many tasks remain genuinely hard. Brain-decoding tasks, which recover dense representations of images, speech, sentences, video, or words from brain activity, are especially challenging; even the best models score well below ceiling. Tasks such as mental imagery, sleep arousal, psychopathology classification, and mixed motor-imagery and P300 paradigms often show performance close to dummy level. These tasks are the best candidates for stress-testing the next generation of EEG foundation models.

Tasks approaching saturation, by contrast, include SSVEP classification, disease detection, seizure detection, sleep stage classification, and phenotyping tasks such as age regression and gender classification.

Beyond EEG: MEG and fMRI

Even in this first EEG-focused release, NeuralBench already supports MEG and fMRI tasks as a proof of concept. Notably, the REVE model, pretrained exclusively on EEG data, achieves the best performance among all tested models on the MEG decoding task. This is an encouraging early signal that pretrained EEG representations can transfer across neuroimaging modalities, an idea set to be tested rigorously in future releases.

The infrastructure is clearly designed for expansion to intracranial EEG (iEEG), functional near-infrared spectroscopy (fNIRS), and electromyography (EMG).

How To Get Started

Installation takes one command: pip install neuralbench. From there, running the EEG audiovisual stimulus classification task looks like this:

neuralbench eeg audiovisual_stimulus --download   # Download data
neuralbench eeg audiovisual_stimulus --prepare    # Prepare cache
neuralbench eeg audiovisual_stimulus              # Run the task

To run all 36 tasks against all 14 EEG models, a -m all_classic all_fm flag handles the orchestration. The benchmark's full storage requirements are substantial: about 11 TB in total (~3.2 TB raw data, ~7.8 TB preprocessed cache, ~333 GB of results), and a single GPU with at least 32 GB of VRAM is recommended per task, although measured GPU memory usage across all runs is typically only ~1.3 GB (peaking at ~30.3 GB).

A full run of NeuralBench-EEG-Full v1.0 required approximately 1,751 GPU hours across 4,947 tests.

Key Takeaways

  • Meta AI's NeuralBench-EEG v1.0 is the largest open EEG benchmark of its kind: 36 tasks, 94 datasets, 9,478 subjects, and 14 deep learning architectures under one standardized interface.
  • Despite having up to 270× more parameters, EEG foundation models like REVE only marginally outperform lightweight task-specific models like CTNet (150K parameters) across the benchmark.
  • Cognitive decoding tasks (decoding speech, video, sentences, and words from brain activity) and clinical prediction remain a major challenge, with many models scoring close to dummy level.
  • REVE, pretrained only on EEG data, outperformed all other models on the MEG decoding task, an early signal of cross-modal transfer.
  • NeuralBench is released under the MIT license.

Check out the Paper and the GitHub Repo for more details.
