Meet FANToM: A Benchmark for Stress-testing Machine Theory of Mind in Interactions

In conversational AI, question answering has become a standard way to evaluate Theory of Mind (ToM). However, passive narratives fall short as a way to assess ToM capabilities. To address this limitation, diverse question types have been designed that all require the same underlying reasoning skills. These questions have revealed the limited ToM capabilities of LLMs. Even with chain-of-thought reasoning or fine-tuning, state-of-the-art LLMs still struggle with these questions and perform below human level.

Researchers from different universities introduced FANToM, a benchmark for testing ToM in LLMs through conversational question answering. It incorporates psychological and empirical insights into LLM evaluation. FANToM proves challenging for top LLMs, which perform worse than humans even with advanced reasoning or fine-tuning. The benchmark evaluates LLMs by asking for binary (yes/no) responses to questions about characters’ knowledge and for lists of the characters who hold a specific piece of information. Human performance was assessed with 11 student volunteers.
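To make the two answer formats concrete, here is a minimal Python sketch of how a yes/no answer and a character-list answer might be scored. The function names and the normalization rules (first word decides yes/no, exact set match for lists) are assumptions made for illustration, not FANToM's actual evaluation code.

```python
import re


def score_binary(answer: str, gold: bool) -> bool:
    """Score a free-text yes/no answer against a gold label (hypothetical rule)."""
    words = re.findall(r"[a-z]+", answer.lower())
    if not words or words[0] not in ("yes", "no"):
        return False  # answers that do not start with yes/no count as wrong
    return (words[0] == "yes") == gold


def score_list(answer: list[str], gold: set[str]) -> bool:
    """Score a list answer: the named characters must match the gold set exactly."""
    predicted = {name.strip().lower() for name in answer}
    return predicted == {name.strip().lower() for name in gold}
```

Exact set matching is a deliberately strict choice here: naming one character too many or too few counts as a miss, which mirrors the idea that the model must know precisely who holds the information.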

FANToM is a new English benchmark designed to assess machine ToM in conversational contexts, focusing on social interactions. It includes 10,000 questions within multiparty conversations, emphasizing information asymmetry and distinct mental states among characters. The goal is to measure models’ ability to track beliefs in discussions, testing their understanding of others’ mental states and identifying instances of illusory ToM. 

FANToM tests machine ToM in LLMs through question answering in conversational contexts with information asymmetry. Its 10,000 questions are based on multiparty conversations in which characters hold distinct mental states because some information is inaccessible to them. The benchmark assesses LLMs’ ability to track beliefs in discussions and identifies illusory ToM. Despite chain-of-thought reasoning or fine-tuning, existing LLMs perform significantly worse on FANToM than humans, as the evaluation results indicate.
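The idea of information asymmetry can be sketched in a few lines of Python. The example assumes that asymmetry arises because characters are absent for part of a conversation; the `Event` structure and the `who_heard` helper are hypothetical names introduced for this illustration and are not part of the benchmark.

```python
from dataclasses import dataclass


@dataclass
class Event:
    kind: str       # "join", "leave", or "say"
    speaker: str
    fact: str = ""  # label for the information shared; used only for "say"


def who_heard(events: list[Event]) -> dict[str, set[str]]:
    """Map each shared fact to the set of characters who had access to it."""
    present: set[str] = set()
    heard: dict[str, set[str]] = {}
    for event in events:
        if event.kind == "join":
            present.add(event.speaker)
        elif event.kind == "leave":
            present.discard(event.speaker)
        elif event.kind == "say":
            # Everyone currently present, plus the speaker, gains access.
            heard.setdefault(event.fact, set()).update(present | {event.speaker})
    return heard


conversation = [
    Event("join", "Alice"), Event("join", "Bob"), Event("join", "Carol"),
    Event("leave", "Bob"),
    Event("say", "Alice", "trip"),  # Bob is away and misses this
    Event("join", "Bob"),
    Event("say", "Carol", "job"),   # everyone is present
]
access = who_heard(conversation)
```

In this toy conversation, a question such as "Does Bob know about the trip?" has the gold answer "no", even though the information appears in the transcript the model reads.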

The evaluation results on FANToM reveal that even with chain-of-thought reasoning or fine-tuning, existing LLMs perform significantly worse than humans. Some of the ToM reasoning that LLMs display on FANToM is deemed illusory, indicating that the models fail to comprehend distinct character perspectives. While applying zero-shot chain-of-thought reasoning or fine-tuning improves LLM scores, substantial gaps compared to human performance persist. The findings underscore the challenges in developing models with coherent Theory of Mind reasoning and the difficulty of achieving human-level understanding in LLMs.
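One way to expose illusory ToM is to credit a model only when it answers every question about the same piece of information correctly. The sketch below contrasts that strict score with plain per-question accuracy. The grouping scheme and function names are assumptions for illustration and may differ from the metric used in the paper.

```python
from collections import defaultdict


def per_question_accuracy(results: list[tuple[str, bool]]) -> float:
    """Fraction of individual questions answered correctly."""
    if not results:
        return 0.0
    return sum(correct for _, correct in results) / len(results)


def all_correct_rate(results: list[tuple[str, bool]]) -> float:
    """Fraction of groups in which every question was answered correctly.

    Each result is (group_id, correct), where group_id ties together the
    questions that probe the same piece of information.
    """
    groups: dict[str, list[bool]] = defaultdict(list)
    for group_id, correct in results:
        groups[group_id].append(correct)
    if not groups:
        return 0.0
    return sum(all(answers) for answers in groups.values()) / len(groups)


results = [
    ("trip", True), ("trip", True),  # consistent across question types
    ("job", True), ("job", False),   # right on one question, wrong on another
]
```

For the toy `results` above, per-question accuracy is 0.75 while the strict rate is 0.5: the partially correct "job" group earns no credit, which is the kind of inconsistency the article describes as illusory.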

In conclusion, FANToM is a valuable benchmark for assessing ToM in LLMs during conversational interactions, highlighting the need for more interaction-oriented standards that align better with real-world use cases. The benchmark has shown that current LLMs underperform compared to humans, even with advanced techniques. It has identified the issue of internal consistency in neural models and provided various approaches to address it. FANToM emphasizes distinguishing between accessible and inaccessible information in ToM reasoning.

Future research directions include grounding ToM reasoning in pragmatics, visual information, and belief graphs. Evaluations can encompass diverse conversation scenarios beyond small talk on specific topics, and multi-modal aspects like visual information can be integrated. Addressing the issue of internal consistency in neural models is crucial. FANToM is now publicly available for further research, promoting the advancement of ToM understanding in LLMs. Future studies may consider incorporating relationship variables for more dynamic social reasoning.


Check out the Paper, Github, and Project. All Credit For This Research Goes To the Researchers on This Project.

Hello, my name is Adnan Hassan. I am a consulting intern at Marktechpost and will soon be a management trainee at American Express. I am currently pursuing a dual degree at the Indian Institute of Technology, Kharagpur. I am passionate about technology and want to create new products that make a difference.
