The Stroop Test Reveals the Inherent LLM Error


Summary: A new study of artificial intelligence cognition has revealed a fundamental, systemic flaw in the attention mechanisms of large language models (LLMs). By giving the classic psychological “Stroop task” to leading frontier models, including GPT-5, Claude Opus 4.1, and Gemini 2.5, the researchers uncovered significant impairments in the models' ability to control their responses.

While the human brain normally suppresses automatic responses and maintains stable accuracy across a long sequence, the attention of a transformer-based model degrades quickly as the list grows longer, dropping to near-zero accuracy when it is forced to inhibit its trained tendency to read words.

Key Facts

The Machine Attention Audit: Led by researcher Suketu Patel and colleagues, the study set out to explore the structural differences between transformer-based machine attention and human attention. The investigators used the “Stroop task”, a standard clinical test in which color words are printed in different colored ink, to test executive control, specifically the ability to inhibit an automatic response.
Length-Dependent Performance Crashes: The research team found that while LLMs handle short sequences well, their control breaks down as the sequence length grows. On short lists of five words, the models performed well. However, as the lists grew longer, the models' accuracy dropped dramatically.
Frontier Model Degradation Metrics: The study documented a collapse in accuracy across leading models:
- GPT-4o: Reached 91% accuracy at 5 words, fell to 57% accuracy at 10 words, and collapsed to 15% accuracy at 40 words.
- Claude 3.5 Sonnet: Remained relatively stable up to the 20-word list but dropped to 24% accuracy when the list was expanded to 40 words.
Mixed-List Near-Zero Failure: In trials with lists that mixed matching and non-matching word-color pairs, LLMs performed even worse. Under these mixed conditions, accuracy fell to approximately 0% on the mismatched items.
Scope of the Failure: This weakness is not limited to older models. Similar patterns of collapse were observed in next-generation systems, including GPT-5, Claude Opus 4.1, and Gemini 2.5.
Biological vs. Synthetic Attention: Both humans and LLMs are far better trained at reading words than at naming colors. However, the human brain can use top-down control to suppress the automatic impulse to read words, maintaining its focus across long lists. The collapse in LLM performance points to a fundamental structural limitation of synthetic attention compared to biological attention.

Source: PNAS Nexus

Giving AI a classic psychological test reveals an inherent weakness in LLMs' decision-making.

Suketu Patel and colleagues examined how transformer-based machine attention differs from human attention by testing AI models on a “Stroop task,” in which color words are printed in colored ink and participants are asked to name the ink color of each word while ignoring its meaning.

The task is used clinically to assess executive control, specifically a person's ability to inhibit an automatic response. Although people generally take longer to respond correctly when words and colors conflict than when they match, they can still perform stably and with high accuracy even on long lists of words.
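The task described above can be illustrated with a small Python sketch. This is not the authors' code: the four-color palette, the function names, and the `word_reader` baseline are all assumptions made for illustration. It builds a list of word and ink pairs and scores responses separately for matching (congruent) and mismatching (incongruent) items.

```python
import random

COLORS = ["red", "blue", "green", "yellow"]  # assumed palette, not taken from the study


def make_stroop_list(n_items, congruent_ratio=0.0, seed=0):
    """Build a list of (word, ink) pairs.

    congruent_ratio is the chance that an item's word matches its ink:
    0.0 yields a fully incongruent list, 0.5 a mixed list.
    """
    rng = random.Random(seed)
    items = []
    for _ in range(n_items):
        ink = rng.choice(COLORS)
        if rng.random() < congruent_ratio:
            word = ink
        else:
            word = rng.choice([c for c in COLORS if c != ink])
        items.append((word.upper(), ink))
    return items


def score(items, responses):
    """Return accuracy per condition; a response is correct if it names the ink."""
    totals = {"congruent": [0, 0], "incongruent": [0, 0]}  # [correct, seen]
    for (word, ink), answer in zip(items, responses):
        condition = "congruent" if word.lower() == ink else "incongruent"
        totals[condition][1] += 1
        if answer.strip().lower() == ink:
            totals[condition][0] += 1
    return {c: (ok / seen if seen else None) for c, (ok, seen) in totals.items()}


def word_reader(items):
    """Hypothetical responder that always reads the word (the automatic response)."""
    return [word.lower() for word, _ in items]


if __name__ == "__main__":
    mixed = make_stroop_list(40, congruent_ratio=0.5, seed=1)
    print(score(mixed, word_reader(mixed)))
```

A responder that simply reads each word is always right on congruent items and always wrong on incongruent ones, which is why accuracy on mismatched items is the number that exposes a failure of inhibition.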

The authors found that when the word and ink color did not match, LLMs performed well with a list of five words. But as the list grew longer, the AI's performance dropped dramatically. GPT-4o fell from 91% accuracy on 5 words to 57% accuracy on 10 words and 15% accuracy on 40 words. Claude 3.5 Sonnet was stable up to 20 words, but dropped to 24% accuracy at 40 words. On trials with lists of words in both matching and mismatching colors, LLM performance was even worse, dropping close to 0% accuracy for mismatched items.

Similar results were obtained with GPT-5, Claude Opus 4.1, and Gemini 2.5. LLMs struggle to stay focused on naming colors rather than defaulting to reading the words.

As with humans, LLMs are better trained at reading words than at naming colors, yet humans can suppress word reading across long lists and maintain focus on the task at hand. According to the authors, the collapse in LLM performance suggests fundamental limitations compared to biological attention.
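To put the GPT-4o figures in perspective, the short Python sketch below computes the drop relative to the 5-word list. Only the three accuracies come from the article; the `degradation` helper is a hypothetical name introduced for this illustration.

```python
# Accuracies quoted in the article for GPT-4o on mismatched lists.
gpt4o = {5: 0.91, 10: 0.57, 40: 0.15}


def degradation(acc_by_length):
    """Return (length, accuracy, point drop, relative drop) vs. the shortest list."""
    base = acc_by_length[min(acc_by_length)]
    rows = []
    for length in sorted(acc_by_length):
        acc = acc_by_length[length]
        points = round((base - acc) * 100, 1)      # drop in percentage points
        relative = round((base - acc) / base, 3)   # drop as a fraction of the baseline
        rows.append((length, acc, points, relative))
    return rows


if __name__ == "__main__":
    for row in degradation(gpt4o):
        print(row)
```

By this calculation, accuracy at 40 words sits 76 percentage points below the 5-word score, a relative drop of about 84%.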

Key Questions Answered:

Q: Why does a simple word-and-color game completely break the decision-making of advanced AI models?

A: Because the Stroop task tests a specific ability called executive control, the ability to intentionally inhibit an automatic response. LLMs are trained above all to read and predict text. When a model is forced to ignore the meaning of a word and report only its ink color, its word-reading training increasingly overrides its instructions as the list grows longer, causing it to revert to automatically reading the words.

Q: How badly did next-generation systems like GPT-5 and Claude Opus 4.1 perform on long lists?

A: Accuracy collapsed. GPT-4o started out strong with 91% accuracy on the shortest list, but extending the list to just 40 words dragged its accuracy down to 15%. On mixed lists of matching and mismatching words, LLM accuracy fell to nearly 0% on the mismatched items, and the researchers report similar results for the newest systems, including GPT-5, Claude Opus 4.1, and Gemini 2.5.

Q: What does this failure tell neuroscientists about the difference between human attention and AI?

A: It shows that transformer-based machine attention has a fundamental, structural limitation compared to biological minds. Although both humans and AIs are better at reading text than at naming colors, the human brain can maintain top-down control to suppress automatic word reading across long lists. LLMs lack this stable top-down focus, indicating that artificial attention struggles to resist its training bias over extended contexts.

Editor's Notes:

This article was edited by a Neuroscience News editor.
The journal paper was reviewed in full.
Additional context was added by our staff.

About this AI and cognition research news

Author: Jin Fan
Source: PNAS Nexus
Contact person: Jin Fan – PNAS Nexus
Image: The image is credited to Neuroscience News

Original Research: Closed access.
“Negative control of transformer attention” by Suketu Chandrakant Patel, Hongbin Wang, and Jin Fan. PNAS Nexus
DOI:10.1093/pnasnexus/pgag149

Abstract

The missing executive control in transformer attention

Although transformers in large language models (LLMs) successfully use the attention mechanism that has transformed natural language processing, they lack the explicit attentional control found in humans, which is essential for resolving conflict, for selecting relevant information in the face of competing computations, and for adaptive behavior.

To investigate the impact of this limitation on LLMs, we used the classic color Stroop task, widely considered the gold standard for testing attentional control, to examine these models.

Our results revealed a typical conflict effect on short word lists: accuracy was poorer in the incongruent condition (e.g., naming the color of the word RED printed in blue) than in the congruent condition (e.g., naming the color of the word RED printed in red), similar to human performance.

However, as the length of the word list increased, performance in the incongruent condition declined to an almost complete collapse, whereas accuracy in the congruent condition remained excellent and word reading (e.g., reading the word RED [in red] or RED [in blue] regardless of color) was almost perfect.

These findings indicate that adaptive attention mechanisms are fundamentally limited in their ability to resolve conflicts across extended contexts, and fail to maintain adaptive control under increasing distraction.

We suggest that incorporating higher-order control mechanisms similar to those in biological attention is essential to achieving artificial general intelligence.
