A new study has revealed a surprising gap in the reasoning capabilities of today’s most advanced AI vision-language models.
A recent preprint published on arXiv finds that, despite impressive performance on established benchmarks, Vision-Language Models (VLMs) like OpenAI's GPT-4o struggle to solve Bongard problems, a set of visual puzzles that demand high-level, human-like abstract reasoning.
The study, which pitted these models against human participants, challenges assumptions about AI's cognitive prowess in interpreting the visual world.
“While VLMs occasionally succeed in identifying discriminative concepts and solving some of the problems, they frequently falter, failing to understand and reason about visual concepts,” researchers wrote. “Surprisingly, even elementary concepts that may seem trivial to humans, such as simple spirals, pose significant challenges.”
“Moreover, even when asked to explicitly focus on and analyze these concepts, they continue to falter, suggesting not only a lack of understanding of these elementary visual concepts but also an inability to generalize to unseen concepts.”
In a world where artificial intelligence (AI) has rapidly evolved to tackle tasks that once seemed exclusive to human cognition, the new study offers a reality check on how well machine intelligence actually understands complex visual cues.
Researchers from various European institutions evaluated advanced Vision-Language Models (VLMs), such as GPT-4o and Claude, against a suite of classic puzzles called Bongard problems.
Developed in the 1960s, these visual challenges test pattern recognition and abstract reasoning by requiring participants to decipher conceptual rules from simple geometric shapes. For AI, these puzzles are far from simple.
Bongard problems (BPs) present a set of 12 diagrams split into two groups of six, each group following a specific, often abstract, rule; the solver's task is to articulate the rule that separates the two sides. For instance, one side might exclusively contain vertically elongated shapes, while the other features horizontally elongated ones.
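For readers who think in code, the structure of a Bongard problem can be pictured as two groups of diagrams plus the hidden rule that separates them. The minimal Python sketch below is purely illustrative; the field names, file paths, and example rule are assumptions for the sake of the illustration, not taken from the study.

```python
from dataclasses import dataclass


@dataclass
class BongardProblem:
    """One Bongard problem: two groups of six diagrams and the rule separating them."""
    left_images: list[str]   # e.g. paths to the six left-hand diagrams
    right_images: list[str]  # paths to the six right-hand diagrams
    rule: str                # the abstract concept a solver must articulate


# Illustrative example only; not an actual problem from the study's benchmark.
example = BongardProblem(
    left_images=[f"bp/left_{i}.png" for i in range(6)],
    right_images=[f"bp/right_{i}.png" for i in range(6)],
    rule="left: vertically elongated shapes; right: horizontally elongated shapes",
)
```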
Humans are naturally adept at such tasks, which demand not just identifying basic patterns but forming abstract concepts from minimal data. This makes BPs particularly challenging for machine learning models, especially compared to typical image-recognition benchmarks.
For this study, researchers evaluated the performance of several vision-language models, including OpenAI's GPT-4o, Claude, and two versions of LLaVA. Each model was tasked with solving 100 Bongard problems, with its answers assessed by a Large Language Model "judge" to ensure objective grading.
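In concrete terms, that protocol amounts to a loop: show a model the twelve diagrams of each problem, collect its proposed rule as free text, and have a second language model grade the answer against the ground truth. The sketch below illustrates the idea rather than the authors' actual harness: `ask_solver` is a hypothetical placeholder for the vision-model call, and the judge prompt is an assumption written against the OpenAI Python client.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def ask_solver(image_paths: list[str]) -> str:
    """Hypothetical placeholder: send the twelve diagrams to a VLM
    (e.g. GPT-4o) and return its free-text description of the rule."""
    raise NotImplementedError("wire up the vision model of your choice here")


def judge(proposed_rule: str, true_rule: str) -> bool:
    """Use a text-only LLM as the 'judge' that grades a proposed rule
    against the ground truth. The prompt wording here is an assumption."""
    reply = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any capable text model can act as judge
        messages=[{
            "role": "user",
            "content": (
                f"Ground-truth rule: {true_rule}\n"
                f"Model's proposed rule: {proposed_rule}\n"
                "Do these describe the same concept? Answer only 'yes' or 'no'."
            ),
        }],
    )
    return reply.choices[0].message.content.strip().lower().startswith("yes")


def evaluate(problems: list[dict]) -> float:
    """Fraction of Bongard problems solved, as graded by the judge.
    Each problem dict holds 'images' (twelve paths) and 'rule' (ground truth)."""
    correct = sum(judge(ask_solver(bp["images"]), bp["rule"]) for bp in problems)
    return correct / len(problems)
```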
The findings were clear and striking: humans significantly outperformed AI across all categories. On average, human participants achieved an 84% success rate, while the best-performing vision-language model, GPT-4o, managed only 17%. This gap highlights humans’ unique cognitive abilities, particularly in visual reasoning and abstract thinking.
The researchers divided the Bongard problems into five categories: existence, size, concept, number, and spatial relationships. Humans performed best in the “existence” (presence or absence of a feature) and “spatial” (spatial orientation) categories, with scores over 90%.
In contrast, the vision-language models struggled immensely with spatial tasks, with none achieving more than 10% accuracy. GPT-4o performed somewhat better on the abstract "concept" problems, possibly because of its extensive training on varied data, yet it still fell well short of human performance.
To investigate the root of AI’s limitations further, researchers examined vision-language model performance on specific Bongard problems, focusing on whether these models could identify fundamental concepts.
They selected four representative BPs, each requiring a different form of visual understanding: BP#16 (direction of a spiral), BP#29 (counting shapes), BP#36 (relative positioning), and BP#55 (left-right orientation). In each case, the models struggled.
For example, when prompted to identify whether a spiral rotated clockwise or counterclockwise, GPT-4o and Claude often produced incorrect results, answering inconsistently and mistaking one rotational direction for the other across repeated attempts.
Similarly, only Claude performed accurately on BP#29, which required counting shapes inside versus outside a larger enclosing form. The other models miscounted or misread the arrangement, pointing to weaknesses in AI's visual counting capabilities.
For BP#55, which involved spatial orientation, vision-language models consistently failed, unable to determine whether a circle appeared on the left or right of a cavity in a larger shape. This specific issue underscores the broader challenge VLMs face with spatial relationships, aligning with other research suggesting that spatial reasoning is a critical AI limitation.
While these findings may highlight the limitations of current AI, they also point toward opportunities for future innovation. The researchers suggest that specialized training, perhaps involving intermediate stages to better differentiate between concepts, could improve performance.
For instance, a multi-stage approach in which models first identify possible patterns and then test these patterns could refine AI’s ability to navigate abstract problems like Bongard puzzles. Other strategies may include revisiting the models’ visual encoding processes and using more advanced techniques to improve pattern recognition and abstract reasoning.
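One way to picture such a multi-stage approach is a propose-then-verify loop: the model first lists candidate rules from the diagrams, then each candidate is checked against every diagram on both sides before an answer is committed. The sketch below is a hypothetical illustration of that idea, not a method from the paper; `propose_rules` and `rule_holds` stand in for model calls that would need to be implemented.

```python
def propose_rules(left_images: list[str], right_images: list[str]) -> list[str]:
    """Stage 1 (hypothetical): ask a VLM for several candidate rules that
    might separate the left diagrams from the right ones."""
    raise NotImplementedError


def rule_holds(rule: str, image: str, side: str) -> bool:
    """Stage 2 (hypothetical): ask the model whether a single diagram is
    consistent with the candidate rule for its side ('left' or 'right')."""
    raise NotImplementedError


def solve(left_images: list[str], right_images: list[str]) -> str | None:
    """Propose-then-verify: return the first candidate rule that every
    diagram on both sides is judged consistent with; otherwise give up."""
    for rule in propose_rules(left_images, right_images):
        left_ok = all(rule_holds(rule, img, "left") for img in left_images)
        right_ok = all(rule_holds(rule, img, "right") for img in right_images)
        if left_ok and right_ok:
            return rule
    return None  # no candidate survived verification
```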
Translating Bongard problems to real-world scenarios might also help AI models develop better perceptual and cognitive abilities. Using real-life analogs to solve these puzzles, researchers could explore whether visual context aids AI in forming abstract concepts and reasoning more effectively. This line of research could lead to more versatile vision-language models capable of comprehending everyday visual cues with greater depth.
The findings challenge assumptions about AI’s ability to mirror human cognition and raise critical questions about the adequacy of standard benchmarks for evaluating AI performance.
Despite success in tasks like image classification and captioning, the vision-language models' shortcomings with BPs reveal that more advanced tests may be necessary to measure true AI comprehension. As the authors suggest, translating complex, abstract challenges like Bongard problems into real-world contexts may provide insight into AI's ability to process and reason about visual information on a human level.
While VLMs like GPT-4o and Claude have achieved impressive feats in bridging text and vision, this study reveals that the journey to genuine human-like understanding remains challenging.
As AI evolves, overcoming these perceptual limitations will be essential for creating systems that can interact with the world as seamlessly as humans do. The study is a reminder of the complexity of human cognition, encouraging researchers to look beyond existing benchmarks and aim for advances that bring machines closer to human-level perception and reasoning.
Ultimately, while current vision-language models represent a significant advance, their failure on Bongard problems underscores how difficult abstract reasoning and visual perception remain to model, and how much ground AI must still cover before it can understand and interpret the world as humans do.
Tim McMillan is a retired law enforcement executive, investigative reporter and co-founder of The Debrief. His writing typically focuses on defense, national security, the Intelligence Community and topics related to psychology. You can follow Tim on Twitter: @LtTimMcMillan. Tim can be reached by email: tim@thedebrief.org or through encrypted email: LtTimMcMillan@protonmail.com