Theory of Mind: How GPT-4 and LLaMA-2 Stack Up Against Human Intelligence

A team of psychologists and researchers from the University Medical Center Hamburg-Eppendorf, the Italian Institute of Technology in Genoa, the University of Trento, and other institutions has studied the theory of mind capabilities of large language models (LLMs) such as GPT-4, GPT-3.5, and LLaMA2-70B and compared them with human performance. Theory of mind, the ability to attribute mental states to oneself and others, is fundamental to human social interaction. As LLMs advance, a question arises about whether they can understand and navigate social complexities on par with humans. The study systematically compares the theory of mind abilities of LLMs and human participants across a range of tasks, shedding light on their similarities, differences, and underlying mechanisms.

To evaluate LLMs’ theory of mind abilities, the researchers adopt a systematic experimental approach inspired by psychology. They use a battery of well-established theory of mind tests, including the hinting task, the false belief task, recognition of faux pas, irony comprehension, and strange stories. These tests cover a spectrum of theory of mind abilities, from a basic understanding of false beliefs to more complex interpretations of social situations. Each model (GPT-4, GPT-3.5, and LLaMA2-70B) completes multiple repetitions of each test, which allows a robust comparison against human performance. Each task also uses novel inputs, so that a model cannot succeed merely by reproducing its training data.
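The general shape of such a comparison, with repeated runs per test averaged and set against a human baseline, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the scores below are made-up placeholders, not results from the study, and the function name `compare_to_humans` and the simple mean-difference summary are hypothetical choices rather than the authors' scoring protocol or statistics.

```python
from statistics import mean

# Hypothetical, made-up scores (fraction of items correct per run or participant).
# These are NOT results from the study; they only illustrate the data layout.
toy_scores = {
    "false_belief": {"model": [1.0, 0.9, 1.0], "human": [0.9, 1.0, 0.95]},
    "faux_pas": {"model": [0.4, 0.5, 0.3], "human": [0.8, 0.85, 0.9]},
}


def compare_to_humans(scores):
    """Return, per test, the mean model score, mean human score, and their gap."""
    summary = {}
    for test, groups in scores.items():
        model_mean = mean(groups["model"])
        human_mean = mean(groups["human"])
        summary[test] = {
            "model_mean": model_mean,
            "human_mean": human_mean,
            # Positive gap: the model scored above the human baseline on this test.
            "gap": model_mean - human_mean,
        }
    return summary


if __name__ == "__main__":
    for test, row in compare_to_humans(toy_scores).items():
        print(
            f"{test}: model={row['model_mean']:.2f} "
            f"human={row['human_mean']:.2f} gap={row['gap']:+.2f}"
        )
```

A real analysis would need a scoring rubric for each test and a statistical comparison that accounts for variance across runs and participants; the plain difference of means here is only a placeholder for that step.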

The researchers administer each test to both LLMs and human participants in written form to ensure a fair comparison. They analyze responses using scoring protocols specific to each test and compare performance across models and humans. Notably, GPT-4 is strong on the irony comprehension, hinting, and strange stories tests, often surpassing human performance. However, it struggles with uncertain scenarios such as the faux pas test, where it is reluctant to commit to a conclusion without full evidence. In contrast, GPT-3.5 and LLaMA2-70B show a bias toward affirming inappropriate statements, indicating a lack of differentiation in understanding implied knowledge. The study attributes the GPT models' caution to mitigation measures intended to reduce hallucinations and improve factual accuracy, which make the models overly conservative when a situation is ambiguous. Furthermore, LLMs are disembodied and lack embodied decision-making processes, which contributes to the differences from humans in handling social uncertainty.

In conclusion, the research highlights the complexity of evaluating LLMs’ theory of mind abilities and the importance of systematic testing to ensure a meaningful comparison with human cognition. While LLMs like GPT-4 demonstrate remarkable advancements in certain theory of mind tasks, they fall short in uncertain scenarios, revealing a cautious epistemic policy possibly linked to training methodologies. Understanding these differences is crucial for the development of LLMs that can navigate social interactions with human-like proficiency.

Check out the Paper. All credit for this research goes to the researchers of this project.
