(Image Credit: Google DeepMind/Unsplash)

Large Language Models Rival Humans in Learning Logical Rules, New Study Finds

When OpenAI’s GPT-4 and other large language models (LLMs) first awed the public with fluent text generation, skeptics were quick to point out that producing convincing sentences isn’t the same as thinking. 

Could these systems actually mirror the way humans learn and reason, or are they just mimicking patterns in data? Now, a new study accepted for publication in the February 2026 edition of the Journal of Memory and Language offers one of the clearest tests yet. 

Conducted by researchers from Brown University, the paper asks whether LLMs can do more than regurgitate language. It probes whether they can induce abstract logical rules from examples, a classic benchmark in cognitive science for how humans form concepts. 

The findings not only challenge long-standing assumptions about artificial neural networks but also open new avenues for understanding both AI and human cognition.

In four experiments comparing a range of state-of-the-art LLMs to human participants on a rule-learning task, the researchers show that some models, most notably GPT-4 and the open-weights Gemma-7B, achieved human-level accuracy on tasks requiring propositional logic and even displayed human-like learning trajectories over time. 

While LLMs still fell short on tasks requiring full first-order logic, their performance suggests that these systems may embody new, non-symbolic ways of representing logical concepts that cognitive scientists will now have to take seriously.

“Across four experiments, we find converging empirical evidence that LLMs provide at least as good a fit to human behavior as models that implement a Bayesian probabilistic language of thought (pLoT),” the researchers write. “Moreover, we show that the LLMs make qualitatively different predictions about the nature of the rules that are inferred and deployed in order to complete the task, indicating that the LLM is unlikely to be a mere implementation of the pLoT solution.”

For decades, cognitive scientists have used “rule induction” tasks to study how people learn concepts. Participants see small sets of objects that vary by color, size, and shape and are asked to guess which ones belong to a novel category—say, which objects are “wudsy.” 

The hidden rules can be simple (“blue objects”) or complex (“same shape as the yellow object”), and researchers track the learning curves and mistakes to infer how people form rules.
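
To make the setup concrete, here is a minimal sketch of what such a trial looks like; the objects, rule wording, and code below are our illustration, not the study’s actual stimuli or materials.

```python
# Illustrative sketch of a rule-induction trial (our example, not the
# study's stimuli). Each object is a bundle of features, and a hidden
# rule determines which objects are "wudsy."
scene = [
    {"color": "blue", "size": "small",  "shape": "circle"},
    {"color": "red",  "size": "large",  "shape": "square"},
    {"color": "blue", "size": "medium", "shape": "circle"},
]

# A simple propositional rule: "blue objects are wudsy."
def simple_rule(obj):
    return obj["color"] == "blue"

# A relational, first-order-style rule: "objects sharing a shape with
# some other object in the scene are wudsy."
def relational_rule(obj, scene):
    return any(o is not obj and o["shape"] == obj["shape"] for o in scene)

for obj in scene:
    print(obj, "->", simple_rule(obj))
```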

This task has been a gold standard for testing computational models of human learning. Bayesian “probabilistic Language of Thought,” or “pLoT,” models, which start with libraries of logical primitives and combine them probabilistically, have long been the best fit for human data. Neural networks, by contrast, have been considered poorly suited for such symbolic reasoning because they lack explicit logical operators.
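
In rough terms, a pLoT-style learner can be sketched as follows; this is a toy simplification under our own assumptions, not the paper’s implementation, and a real model would also weight rules by a prior favoring shorter ones.

```python
# Toy pLoT-style learner (our simplification, not the paper's model):
# enumerate rules composed from logical primitives, then score each rule
# by how well it explains the labeled examples.
from itertools import product

FEATURES = {"color": ["blue", "red"], "shape": ["circle", "square"]}

# Primitive predicates of the form "feature == value"
primitives = [(f, v) for f, vals in FEATURES.items() for v in vals]

def make_rule(f1, v1, op, f2, v2):
    if op == "and":
        return lambda o: o[f1] == v1 and o[f2] == v2
    return lambda o: o[f1] == v1 or o[f2] == v2

examples = [({"color": "blue", "shape": "circle"}, True),
            ({"color": "red",  "shape": "circle"}, False)]

best_score, best_desc = -1.0, None
for (f1, v1), op, (f2, v2) in product(primitives, ["and", "or"], primitives):
    rule = make_rule(f1, v1, op, f2, v2)
    # Likelihood: fraction of examples the rule labels correctly.
    # (A full pLoT model would combine this with a simplicity prior.)
    score = sum(rule(o) == y for o, y in examples) / len(examples)
    if score > best_score:
        best_score, best_desc = score, f"{f1}={v1} {op} {f2}={v2}"

print(best_desc, best_score)
```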

Brown University researchers decided to run today’s massive LLMs through the same gauntlet. They converted the visual task into text, describing each object in words (“red square,” “medium blue circle”) and framing the problem as a binary classification: is the object “wudsy,” True or False? Crucially, the models received feedback across multiple rounds, just as human participants had.
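
One plausible way to render those trials as text looks like the following; the exact prompt wording, labels, and function names here are assumptions for illustration, not the paper’s prompts.

```python
# Hypothetical rendering of trials as a text prompt (the paper's exact
# wording is not reproduced here; phrasing and names are our assumptions).
def describe(obj):
    return f'{obj["size"]} {obj["color"]} {obj["shape"]}'

def build_prompt(history, query):
    """Accumulate past labeled trials (the feedback) plus a new query."""
    lines = ["You will learn which objects are 'wudsy'."]
    for obj, label in history:
        lines.append(f"The {describe(obj)} is wudsy: {label}")
    lines.append(f"Is the {describe(query)} wudsy? Answer True or False.")
    return "\n".join(lines)

history = [({"size": "small", "color": "blue", "shape": "circle"}, "True"),
           ({"size": "large", "color": "red",  "shape": "square"}, "False")]
query = {"size": "medium", "color": "blue", "shape": "circle"}
print(build_prompt(history, query))
```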

The results were unexpected and intriguing. On rules expressible in propositional logic (“and,” “or,” “not”), GPT-4, Mixtral 8×7B, and Gemma-7B all exceeded the lower bound of human accuracy on both overall and late-stage performance. 

GPT-4 scored 0.908 on propositional rules in the last quarter of trials, compared with a human mean of 0.932, while Gemma-7B scored 0.969. Even on more complex first-order logic rules, Gemma-7B edged out humans on some metrics. However, every model’s performance on these rules dipped relative to the simpler propositional ones.

The study also looked at whether models merely guessed correctly or actually formed rules they could articulate. In a second experiment, GPT-4 was prompted to state the rule it was using before labeling new objects. Its classifications remained 96.3% consistent with its self-reported rules for propositional tasks—a strong sign that both stemmed from a common underlying representation.
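
A consistency check of this kind can be sketched simply; the code below is our illustration of the idea, not the authors’ evaluation pipeline, and the example rule is invented.

```python
# Illustrative consistency check (our sketch, not the authors' code):
# turn the rule the model says it is using into a predicate, then count
# how often its own trial-by-trial labels agree with that rule.
def consistency(stated_rule, labeled_trials):
    """Fraction of trials where the model's label matches its stated rule."""
    agree = sum(stated_rule(obj) == label for obj, label in labeled_trials)
    return agree / len(labeled_trials)

# Example: the model reports the rule "blue and not large."
stated = lambda o: o["color"] == "blue" and o["size"] != "large"
trials = [({"color": "blue", "size": "small", "shape": "circle"}, True),
          ({"color": "blue", "size": "large", "shape": "square"}, False),
          ({"color": "red",  "size": "small", "shape": "circle"}, False)]

print(f"consistency = {consistency(stated, trials):.3f}")  # 1.000 here
```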

However, the model’s “match rate”—the proportion of times it recovered the exact truth-conditional rule—was just 44.1%, compared with 82.4% for the Bayesian model.

GPT-4 essentially failed to invoke genuine first-order logic, instead stringing together long chains of “and” and “or” operators. This suggests that while LLMs can approximate complex rules, they may be doing so with different primitives than those humans or symbolic models use.

A third experiment pushed deeper by examining whether LLMs not only match human accuracy but also mirror the pattern of human mistakes and learning over time. Using Gemma-7B, whose open weights allowed fine-tuning on human data, the team measured the correlation between the model’s and humans’ probability judgments for each object.

Once tuned on training lists of human responses, the model explained 84.8% of the variance in human participant responses on held-out lists, significantly higher than the Bayesian model achieved. In other words, when Gemma-7B got the correct answer, it often did so in the same way humans did, showing similar peaks and troughs in its learning curves.
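
If “variance explained” is read as the squared correlation between per-object human and model probability judgments (an assumption on our part), the computation is simple; the numbers below are invented for illustration.

```python
# Minimal sketch of a variance-explained computation (values invented):
# correlate per-object human and model probability judgments, then square.
import numpy as np

human = np.array([0.91, 0.12, 0.78, 0.33, 0.95])  # mean human P("wudsy")
model = np.array([0.88, 0.20, 0.70, 0.41, 0.97])  # tuned-LLM P("wudsy")

r = np.corrcoef(human, model)[0, 1]
print(f"r = {r:.3f}, variance explained (r^2) = {r**2:.3f}")
```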

These findings hint that large, generic language models may have stumbled onto inductive procedures resembling those people use. 

“The tuned LLM closely matches the occurrences of peaks and troughs in human learning trajectories, and qualitatively seems to match the magnitude of these changes better than the pLoT model,” researchers write. “The fact that the tuned LLM frequently matched the occurrences of these troughs strongly suggests not only that it often arrived at a similar best-so-far hypothesis, but also that it may be implementing a similar inference procedure as both human participants and the Bayesian pLoT model.” 

The findings complicate a long-running debate. If a neural network with no built-in logical primitives can achieve human-level rule learning and human-like learning curves, then perhaps humans themselves do not rely on the neat symbolic operators posited by classical cognitive theories. Instead, people may be approximating logical rules using more associative or content-sensitive mechanisms, just as LLMs appear to do. This has profound implications for both AI and cognitive science, suggesting that the traditional view of human reasoning as purely symbolic may need to be re-evaluated.

The study also raises practical questions for AI. Could prompting or fine-tuning LLMs with explicit quantifiers improve their handling of first-order logic? And if these models are learning non-classical “logic-like” operators, could studying them inspire new hypotheses about human reasoning? These questions not only point to potential avenues for improving AI systems but also underscore the value of studying LLMs as a means of gaining insights into human cognition.

The authors are cautious. High accuracy does not guarantee that a model is actually using logic; it may be exploiting hidden shortcuts or patterns tied to the specific words used in the task. Nor do the results show that LLMs “think” like people in any rich sense. However, they do establish that, at least on this benchmark, some LLMs now rival or surpass humans and the best symbolic models.

As the paper concludes, “LLMs may instantiate a novel theoretical account of the primitive representations and computations necessary to explain human logical concepts, with which future work in cognitive science should engage.”

Tim McMillan is a retired law enforcement executive, investigative reporter and co-founder of The Debrief. His writing typically focuses on defense, national security, the Intelligence Community and topics related to psychology. You can follow Tim on Twitter: @LtTimMcMillan. Tim can be reached by email: tim@thedebrief.org or through encrypted email: LtTimMcMillan@protonmail.com