This AI Paper from China Presents MathScale: A Scalable Machine Learning Method to Create High-Quality Mathematical Reasoning Data Using Frontier LLMs

Large language models (LLMs) excel at many problem-solving tasks but struggle with complex mathematical reasoning, possibly because such problems require multi-step reasoning. Instruction Tuning effectively enhances LLM capabilities, but its effectiveness is limited by the scarcity of datasets for mathematical reasoning. Larger datasets are therefore needed before Instruction Tuning can be fully leveraged to improve LLM performance in mathematical problem-solving.

Instruction Tuning is effective but limited by small datasets like GSM8K and MATH. ChatGPT-based Instruction Tuning, exemplified by WizardMath and MetaMath, enhances math instruction by utilizing ChatGPT for data synthesis. These methods employ reinforced Evol-instruct and bootstrapping strategies to evolve questions and augment datasets. However, their effectiveness is constrained by manually designed operations. 

Researchers from The Chinese University of Hong Kong, Microsoft Research, and Shenzhen Research Institute of Big Data introduce MathScale, an approach that addresses the scalability and quality issues of mathematical reasoning datasets. The method extracts high-level concepts from existing math questions, constructs a concept graph to estimate the connections between them, and generates new questions based on randomly sampled concepts. The researchers also introduce MWPBENCH, a comprehensive benchmark covering various difficulty levels, to evaluate mathematical reasoning capabilities consistently and fairly. The MathScaleQA dataset and the results on MWPBENCH demonstrate that MathScale can scale dataset size and significantly improve LLM capabilities.

MathScale’s dataset generation process follows four steps. First, it uses GPT-3.5 to extract high-level concepts (topics and knowledge points) from existing math questions, so that later steps work from these concepts rather than from the original questions. Second, it constructs a concept graph from these extractions, representing the connections between different concepts. Third, it runs a random walk on the graph to sample topics and knowledge points, which promotes a diverse dataset. Finally, it generates new math questions from the sampled concepts, strictly adhering to the provided topics and knowledge points.
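The graph-building and sampling steps can be illustrated with a small sketch. It is not the authors' implementation: the toy concept lists, the function names `build_concept_graph` and `random_walk`, the use of raw co-occurrence counts as edge weights, and the prompt wording are all assumptions made for illustration.

```python
import random
from collections import defaultdict
from itertools import combinations


def build_concept_graph(extractions):
    """Count how often two concepts are extracted from the same question."""
    graph = defaultdict(lambda: defaultdict(int))
    for concepts in extractions:
        for a, b in combinations(sorted(set(concepts)), 2):
            graph[a][b] += 1
            graph[b][a] += 1
    return graph


def random_walk(graph, start, steps, rng):
    """Walk the graph, picking each next concept in proportion to edge weight."""
    path = [start]
    current = start
    for _ in range(steps):
        neighbors = graph.get(current, {})
        if not neighbors:
            break  # dead end: no co-occurring concepts recorded
        names = list(neighbors)
        weights = [neighbors[name] for name in names]
        current = rng.choices(names, weights=weights, k=1)[0]
        path.append(current)
    return path


# Hypothetical extractions, standing in for what an LLM might return per question.
extractions = [
    ["ratios", "unit conversion"],
    ["ratios", "percentages"],
    ["percentages", "simple interest"],
    ["unit conversion", "area of a rectangle"],
]

graph = build_concept_graph(extractions)
rng = random.Random(0)  # fixed seed so the sketch is repeatable
sampled = random_walk(graph, start="ratios", steps=3, rng=rng)

# The sampled concepts would then be placed in a generation prompt (wording assumed).
prompt = "Write a new math word problem that uses these concepts: " + ", ".join(sampled)
print(prompt)
```

Because each step only follows edges that were observed in the seed questions, the sampled concept sets stay plausible while still differing from any single original question.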

On MWPBENCH, MathScale is compared with other models built on LLaMA-2 7B, LLaMA-2 13B, and Mistral 7B. MathScale-7B achieves a micro average accuracy of 35.0% and a macro average accuracy of 37.5%, surpassing counterparts of equivalent size by 42.9% and 43.7%, respectively. On out-of-domain test sets such as GaokaoBench-Math and AGIEval-SAT-MATH, MathScale-7B also significantly outperforms other open-source models. MathScale-Mistral reaches performance parity with GPT-3.5-Turbo on both micro and macro averages.
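For readers unfamiliar with the two averages, the sketch below shows how they differ, assuming the usual definitions: the micro average pools every test question across datasets, while the macro average first computes accuracy per dataset and then averages those values. The dataset names and counts are hypothetical and are not MWPBENCH results.

```python
def micro_average(results):
    """Accuracy over all questions pooled together."""
    total_correct = sum(correct for correct, _ in results.values())
    total_questions = sum(count for _, count in results.values())
    return total_correct / total_questions


def macro_average(results):
    """Unweighted mean of the per-dataset accuracies."""
    accuracies = [correct / count for correct, count in results.values()]
    return sum(accuracies) / len(accuracies)


# Hypothetical (correct, total) counts per dataset.
results = {
    "dataset_a": (80, 100),   # small dataset, 80% accuracy
    "dataset_b": (100, 400),  # large dataset, 25% accuracy
}

micro = micro_average(results)  # 180 / 500
macro = macro_average(results)  # (0.80 + 0.25) / 2
print(f"micro={micro:.3f} macro={macro:.3f}")
```

The micro average is dominated by the larger dataset, whereas the macro average gives each dataset equal weight, which is why benchmarks that mix datasets of very different sizes often report both.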

In conclusion, researchers from The Chinese University of Hong Kong, Microsoft Research, and Shenzhen Research Institute of Big Data present MathScale, a straightforward and scalable approach for producing high-quality mathematical reasoning data using frontier LLMs. In addition, MWPBENCH provides a comprehensive benchmark for math word problems across various difficulty levels. MathScale-7B exhibits state-of-the-art performance on MWPBENCH, outperforming equivalent-sized peers by significant margins. The benchmark also supports fair and consistent evaluation of mathematical reasoning models in academic settings.


Check out the Paper. All credit for this research goes to the researchers of this project.


Muhammad Athar Ganaie, a consulting intern at MarktechPost, is a proponent of Efficient Deep Learning, with a focus on Sparse Training. Pursuing an M.Sc. in Electrical Engineering, specializing in Software Engineering, he blends advanced technical knowledge with practical applications. His current endeavor is his thesis on "Improving Efficiency in Deep Reinforcement Learning," showcasing his commitment to enhancing AI's capabilities. Athar's work stands at the intersection of "Sparse Training in DNNs" and "Deep Reinforcement Learning".
