Artificial intelligence systems may be getting faster, larger, and more multimodal by the month, but a new empirical study suggests that many of today’s most advanced models still trip up on the kind of basic visual reasoning that humans take for granted.
Intriguingly, some of the most heavily hyped frontier models performed worse than expected, while a quieter OpenAI release emerged as the most accurate and consistent system in the field.
The findings come from a peer-reviewed study set to be published in the April 2026 edition of Pattern Recognition, where researchers evaluated nine state-of-the-art multimodal large language models (LLMs)—including OpenAI’s ChatGPT-4o and ChatGPT-o1, Google DeepMind’s Gemini 2.0, xAI’s Grok 3, and DeepSeek’s Janus models—on a battery of tests designed to probe how well they make sense of multiple images at once.
The benchmark didn’t simply measure whether a model could point to the right object in a picture. Instead, it challenged these systems to show whether they reason in a stable, reliable way or are simply guessing.
At the heart of the study is a question with serious implications for any field hoping to rely on AI for decision-making: Can these systems tell the difference between what they know and what they don’t? And more importantly, can they do it consistently?
To answer that, the authors built an evaluation framework that introduces visual reasoning tasks across multiple images, shuffling answer options to expose positional bias. Notably, researchers used a new metric to measure the “entropy” of a model’s reasoning. Low entropy reflects consistent answers even when the test format changes. High entropy indicates instability or guesswork rather than true comprehension.
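The paper’s exact formulation isn’t reproduced here, but the intuition is easy to sketch: pose the same question several times with the answer options shuffled, match each response to the underlying choice rather than its position, and compute the Shannon entropy of the resulting answer distribution. The snippet below is a hypothetical illustration of that idea, with an invented function name and toy data; it is not the researchers’ implementation.

```python
from collections import Counter
from math import log2

def answer_entropy(answers_across_shuffles):
    """Shannon entropy (in bits) of the answers a model gives to the same
    question when the multiple-choice options are shuffled between runs.
    Answers are compared by content, not position, so a model that always
    picks the same underlying choice scores 0 bits."""
    counts = Counter(answers_across_shuffles)
    total = len(answers_across_shuffles)
    return sum(-(c / total) * log2(c / total) for c in counts.values())

# A consistent model vs. one swayed by where the options appear.
print(answer_entropy(["giraffe", "giraffe", "giraffe", "giraffe"]))  # 0.0
print(answer_entropy(["giraffe", "zebra", "giraffe", "lion"]))       # ~1.5
```

Under this reading, a model that always lands on the same choice scores zero bits, while one whose answer shifts with the option order scores higher.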
“By employing multi-image contexts, rejection mechanisms, and entropy-based consistency metrics, this benchmark sets a new standard for evaluating multimodal LLMs, enabling a more robust and reliable assessment of next-generation AI systems,” the researchers write.
The results paint a revealing, at times surprising, picture of the current AI landscape.
OpenAI’s ChatGPT-o1 performed best, with an overall accuracy of 82.5 percent, outperforming larger, better-known competitors. It also demonstrated the lowest entropy score of any model tested, meaning it was the least likely to change its answer when the options were rearranged.
In other words, ChatGPT-o1’s reasoning was not only strong but steady.
Google DeepMind’s Gemini 2.0 Flash Experimental model came next, and ChatGPT-4o also delivered strong, consistent reasoning. Size, however, didn’t equal success across the board.
Grok 3, xAI’s massive 2.7-trillion-parameter flagship model, posted accuracy scores well below the top performers and displayed an unusually high tendency to “over-abstain” by selecting the option “None of the provided choices” even when the correct answer was present.
This pattern, researchers note, suggests an overly conservative reasoning style that may cause the model to freeze or refuse an answer rather than commit to one.
Meanwhile, DeepSeek’s Janus models struggled differently. Both Janus 7B and Janus 1B were highly susceptible to positional bias, scoring the worst entropy values of the entire group.
Their answers changed frequently when the order of the multiple-choice options was shuffled, indicating a lack of stable reasoning and a reliance on superficial patterns.
Researchers note that this behavior points to “reasoning variability and susceptibility to positional biases,” adding that Janus models “rely more on surface-level patterns rather than genuine comprehension.”
The findings represent an important correction to the popular narrative around China’s DeepSeek and its rapid rise in the AI ecosystem.
While DeepSeek’s R1 model drew attention for competing with much larger Western systems in text-based reasoning, the company’s multimodal Janus series performed far worse than expected in visual reasoning, showing instability, positional bias, and difficulty generalizing across tasks.
The study also tested how well these systems handle uncertainty. Forty of the benchmark’s questions were intentionally unanswerable, and the correct response was to reject all the options.
Only two models—ChatGPT-o1 and QVQ-72B-Preview—performed well on this task. Many others avoided choosing “None of the provided options” even when it was correct, a troubling pattern that suggests overconfidence and a reluctance to acknowledge uncertainty.
Researchers warn that this form of miscalibration could pose real risks in safety-critical environments where refusing to answer is the correct and necessary choice.
“Evaluating a model’s ability to handle unanswerable questions is essential for deploying reliable AI systems,” the researchers note, emphasizing that abstention is as important as accuracy in environments like medicine, aviation, and defense.
Another unexpected finding emerged from the Qwen family of models. Despite strong performance in several areas, they frequently refused to process cartoon images due to overly aggressive content restrictions—a limitation that caused them to miss many questions in the Cartoon Understanding task. The study notes that this restrictive filtering significantly limits Qwen’s applicability in real-world multimodal reasoning.
Across all categories, one message remained clear: today’s multimodal AI still struggles to replicate the stable, consistent reasoning humans apply without thinking.
Even top-tier models faltered when asked to combine information across multiple images or when answer options were moved out of their familiar positions.
Many systems also showed signs of overfitting to existing benchmarks that focus heavily on single-image tasks, raising questions about how well these models generalize to new and unfamiliar visual contexts.
Researchers argue that this is precisely why new approaches—such as entropy measurement—are needed to expose weaknesses that conventional accuracy scores mask.
“This benchmark offers a methodological shift in how we evaluate vision-language systems,” the researchers write, arguing that next-generation AI must demonstrate not just correctness but “consistency, uncertainty calibration,” and resistance to “heuristic-driven shortcuts.”
For researchers, this study offers a clearer window into what multimodal AI can and cannot do. For companies racing to deploy AI tools into mission-critical applications—from autonomous surveillance to medical diagnostics—the warning is equally sharp: visual intelligence remains an unsolved problem, and real-world reasoning demands more than raw parameter counts.
However, for OpenAI, the study delivers some rare good news at a time of intense competition. ChatGPT-o1 not only achieved the highest accuracy but also displayed the most consistent reasoning of any model tested. That performance suggests that OpenAI’s recent pivot toward more structured, reasoning-optimized training techniques may be paying off.
As multimodal systems expand into robotics, AR interfaces, and real-time decision-making, these studies offer a necessary stress test of their capabilities.
These new findings come on the heels of a 2024 study, in which researchers evaluated cutting-edge vision-language models on classical visual puzzles known as Bongard problems—tasks designed to test abstract pattern recognition and conceptual reasoning.
That earlier study found that even top-tier models such as GPT-4o managed only about 17 percent accuracy, while human participants scored near 84 percent, underscoring a major gap between AI and human-level visual cognition.
Now, this latest benchmark reinforces and extends that conclusion by revealing similar limitations not only in abstract puzzles but also in more complex, variable, multi-image reasoning tasks.
Taken together, the research shows that many high-profile AI models stumble where humans would not. That strengthens the case that, despite rapid advances, visual reasoning remains one of AI’s most elusive frontiers.
“Our findings reveal not just where models fail, but how they fail, offering a pathway to targeted improvements in consistency and uncertainty calibration,” the researchers conclude. “As multimodal systems continue to scale and enter domains where reasoning robustness matters, such as healthcare, education, and legal AI, benchmarks must evolve accordingly.”
Tim McMillan is a retired law enforcement executive, investigative reporter and co-founder of The Debrief. His writing typically focuses on defense, national security, the Intelligence Community and topics related to psychology. You can follow Tim on Twitter: @LtTimMcMillan. Tim can be reached by email: tim@thedebrief.org or through encrypted email: LtTimMcMillan@protonmail.com
