Researchers at the Allen Institute for AI Propose Līla, a Unified Benchmark for Comprehensive Evaluation of the Mathematical Reasoning Abilities of Artificial Intelligence Systems

Mathematical reasoning is a fundamental requirement for general-purpose artificial intelligence systems. These tasks can have a wide range of complexity, and they might be as simple as grocery shopping or as complicated as climate modeling. Researchers from Arizona State University and the Allen Institute for AI proposed Līla, a unified benchmark for mathematical reasoning, to assess and enhance AI systems in this field. The benchmark consists of 23 different activities across four dimensions: language diversity (no language, simple language), language format (question-answering, fill-in-the-blanks), mathematical abilities (arithmetic calculus), and external knowledge (commonsense, physics).

Līla consists of 140K questions spanning 23 diverse tasks in natural language that are annotated with Python programs and instructions in other languages. Multiple splits of the data set are included, including Līla-IID (train, dev, test), Līla-OOD (train, dev, test), and Līla-Robust. The benchmark was built by extending 20 datasets by gathering task instructions and solutions in Python programs, gaining both the correct answer and explainable solutions. Additionally, two evaluation datasets were added to assess language perturbation’s and out-of-distribution performance’s robustness. The team also introduced Bhaskara, a general-purpose mathematical reasoning model trained on Līla. This model is available on HuggingFace. According to their tests, the multi-task model performs better than the comparably sized T5 and GPT-Neo when perfecting new arithmetic tasks.

The team also highlighted some key findings from their experimentations. Superior Out-of-Distribution (OOD) Performance was one such crucial outcome. The study discovered that the multi-task Bhaskara model outperforms its single-task peers on novel math problem types not encountered during training using Līla-OOD, an out-of-distribution split provided in the benchmark. Līla-Robust, a split that contains questions with linguistic variants without changing the mathematical content, was used to assess the robustness of mathematical reasoning models. The diversity of Bhskara’s training material contributed to the language’s overall robustness against linguistic disruption. 

Prior research has used transformer-based LLMs to directly produce the algebraic expression given the natural language mathematical query. Bhaskara, on the other hand, was taught to output Python scripts that, when run, produce the correct response. As the Lla benchmark has the required annotation, this was made achievable. The researchers also noted how program synthesis significantly beats direct response in both fine-tuning and few-shot conditions. The conceptual and practical value of gathering and producing interpretable program intermediates for mathematical reasoning was thus further highlighted.

They further explore why cutting-edge language modeling systems like GPT-3 perform poorly on Līla and conclude that they do. One significant contribution made by the team is the addition of program annotations to the existing datasets for mathematical reasoning, providing both the correct answer and an explainable solution. The study article by the team will also be presented at the prestigious EMNLP 2022 conference.

Through their research, the team realized that although AI still has a long way to go before it can grasp general-purpose mathematical thinking, with such rapid advancement and increased interest, much more can still be accomplished. The team aspires for their contributions to measure and promote the advancement of mathematical reasoning systems. They anticipate the future work the community will perform to change the way people approach and solve various mathematical problems.

Check out the paper, dataset, model, and reference article. All Credit For This Research Goes To Researchers on This Project. Also, don’t forget to join our Reddit page and discord channel, where we share the latest AI research news, cool AI projects, and more.

Khushboo Gupta is a consulting intern at MarktechPost. She is currently pursuing her B.Tech from the Indian Institute of Technology(IIT), Goa. She is passionate about the fields of Machine Learning, Natural Language Processing and Web Development. She enjoys learning more about the technical field by participating in several challenges.