The AI research community widely believed that modern deep learning architectures were not "intelligent" enough to solve advanced mathematical problems. Previous attempts at such tasks used transformers pretrained on text alone, and they largely failed. The inability to automatically solve, grade, and generate university-level mathematics problems in real time poses a considerable pedagogical challenge for institutions of higher education. Most past attempts achieved only modest success, were tested on simpler or specialized mathematics problems, could not solve high-school, Olympiad, or university-level problems, and were confined to a single course.
A team of researchers from MIT, Columbia University, Harvard University, and the University of Waterloo has proposed the first method to solve mathematics problems by program synthesis using neural networks pretrained on text and then fine-tuned on code. The team demonstrated the method's ability to automatically generate programs from mathematics questions and execute them to solve university-level problems from MIT's extensive mathematics courses. The proposed method rests on two major innovations:
1) The use of recent neural networks pretrained on text and then fine-tuned on code, rather than trained on text alone, a limitation that contributed to the failures of many past studies
2) New techniques for rephrasing problems so that the downstream neural networks can reliably synthesize correct executable programs
The overall scheme showed that a single neural network, without any course-specific fine-tuning, can automatically solve mathematics problems within seconds per problem across a wide range of courses covering different concepts. The network outputs an executable program that solves the mathematics problem at hand and visualizes the solution when required.
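The solve loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's actual pipeline: the `synthesize_program` function stands in for the call to a code model (Codex in the paper) and simply hard-codes a program such a model might plausibly emit for one toy question.

```python
def synthesize_program(question: str) -> str:
    """Stand-in for the code-model call; returns an executable Python program.

    In the actual system the program would come from a neural network
    prompted with the (possibly rephrased) question. Here we hard-code a
    plausible output for a single toy calculus question, for illustration.
    """
    # question: "Compute the derivative of x**3 at x = 2."
    return "answer = 3 * 2 ** 2  # d/dx x**3 = 3*x**2, evaluated at x = 2\n"


def solve(question: str):
    """Synthesize a program for the question, execute it, return the result."""
    program = synthesize_program(question)
    namespace = {}
    exec(program, namespace)       # run the synthesized program
    return namespace["answer"]     # convention: result is bound to `answer`


print(solve("Compute the derivative of x**3 at x = 2."))  # 12
```

The key design point is that the model's output is treated as a program to run, not as a final textual answer, so arithmetic is delegated to the interpreter rather than the language model.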
The researchers tested the method on six courses from MIT and one course from Columbia University, covering single- and multi-variable calculus, differential equations, probability, statistics, and linear algebra. The neural network solved the problems from these courses with perfect accuracy. On the MATH benchmark, the scheme achieved 100% accuracy, a huge leap over the previous state-of-the-art results of 6.9%–8.8%.
The group also demonstrated a method for generating new questions for each course using the OpenAI Codex transformer. A randomly cropped, numbered list of questions from each course's dataset is used to prompt Codex, which then generates the next question in the list; this process is repeated to produce a new set of questions for each course. To evaluate the machine-generated problems, a survey was conducted among MIT and Columbia University students who had taken the courses or equivalent courses. Machine-generated and human-written problems were compared on difficulty and appropriateness for the course, and students were also asked to label a set of problems as machine-generated or human-written. The complete survey results are summarised in the figure below, taken from the original paper.
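The prompt construction for question generation can be sketched as follows. This is an assumption-laden illustration of the numbered-list idea, not the paper's code: the function name, the sample size `k`, and the formatting are all hypothetical choices.

```python
import random


def build_generation_prompt(questions: list[str], k: int = 3) -> str:
    """Build a numbered-list prompt from k randomly sampled course questions.

    The list ends with an open item ("k+1.") so that a code model prompted
    with this text is nudged to complete it with a new, similar question.
    """
    sample = random.sample(questions, k)
    lines = [f"{i + 1}. {q}" for i, q in enumerate(sample)]
    lines.append(f"{k + 1}.")  # open item for the model to complete
    return "\n".join(lines)


course_questions = [
    "Compute the derivative of x**3.",
    "Evaluate the integral of sin(x) from 0 to pi.",
    "Find the eigenvalues of [[2, 0], [0, 3]].",
    "What is the probability of two heads in three coin flips?",
]
print(build_generation_prompt(course_questions, k=3))
```

Repeating this with fresh random crops, as the article describes, yields a diverse set of generated questions per course.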
The researchers also proposed a scheme for grading, implemented by comparing correct answers with student answers given a point allocation. The paper does not provide many details about the neural network architecture used, although it does include many examples of solving, generating, and grading mathematics problems; details can be found here. The research group believes that their approach can scale to other STEM courses and bring substantial pedagogical benefits to higher education through automatic solution evaluation and question generation. The student survey showed that the machine-generated problems were quite close to the human-written ones. The researchers seek to scale up this survey to many more students and hundreds of other courses.
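The grading idea, comparing student answers against a key under a per-question point allocation, can be sketched as below. This is a minimal interpretation of the one-sentence description in the article; the exact matching logic and data layout in the paper may differ.

```python
def grade(answer_key: dict, student_answers: dict, points: dict) -> int:
    """Award each question's points when the student answer matches the key.

    answer_key      maps question id -> correct answer
    student_answers maps question id -> the student's answer (may be missing)
    points          maps question id -> points allocated to that question
    """
    score = 0
    for question, correct in answer_key.items():
        if student_answers.get(question) == correct:
            score += points[question]
    return score


key = {"q1": "12", "q2": "x**2 + C"}
answers = {"q1": "12", "q2": "2*x"}   # q2 is wrong
allocation = {"q1": 2, "q2": 3}
print(grade(key, answers, allocation))  # 2
```

A real grader would need tolerant matching (symbolic equivalence, numeric tolerance) rather than string equality, but the point-allocation structure stays the same.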