Google AI Proposes MathWriting: Transforming Handwritten Mathematical Expression Recognition with Extensive Human-Written and Synthetic Dataset Integration and Enhanced Model Training

Online text recognition models have advanced significantly in recent years due to enhanced model structures and larger datasets. However, mathematical expression (ME) recognition, a more intricate task, has yet to receive comparable attention. Unlike text, MEs have a rigid two-dimensional structure in which the spatial arrangement of symbols is crucial. Handwritten MEs (HMEs) pose even greater challenges because of the inherent ambiguity of handwriting. Obtaining handwritten samples is also costly: they require human input, and collecting them depends on dedicated devices such as touchscreens or digital pens. Improving ME recognition therefore demands tailored approaches distinct from text recognition.

Google Research has unveiled MathWriting, a dataset for online HME recognition. Comprising 230k human-written and 400k synthetic samples, it surpasses offline HME datasets like IM2LATEX-100K in size. MathWriting facilitates both online and offline HME recognition, providing ample data for research. It is compatible with other online datasets such as CROHME and Detexify and is shared in InkML format; rasterizing the inks effectively expands offline HME datasets as well. This initiative introduces a new benchmark for ME recognition, featuring normalized ground-truth expressions for simplified training and robust evaluation, alongside code examples on GitHub for seamless usage.
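To make the InkML-plus-rasterization pipeline concrete, here is a minimal sketch using only Python's standard library. The InkML snippet is a hand-made, hypothetical example (real MathWriting files carry more channels and metadata, such as writer information), and the rasterizer simply marks sampled points on a binary grid rather than drawing connected strokes:

```python
import xml.etree.ElementTree as ET

# Hypothetical minimal InkML sample; real MathWriting files are richer.
INKML = """<ink xmlns="http://www.w3.org/2003/InkML">
  <annotation type="label">x^2</annotation>
  <trace>10 10 0, 20 30 10, 30 10 20</trace>
  <trace>40 5 30, 45 5 35</trace>
</ink>"""

NS = {"inkml": "http://www.w3.org/2003/InkML"}

def parse_inkml(text):
    """Return (label, traces), each trace a list of (x, y) floats."""
    root = ET.fromstring(text)
    label = root.find("inkml:annotation[@type='label']", NS).text
    traces = []
    for trace in root.findall("inkml:trace", NS):
        points = []
        for point in trace.text.strip().split(","):
            x, y, *_ = point.split()   # ignore the time channel
            points.append((float(x), float(y)))
        traces.append(points)
    return label, traces

def rasterize(traces, size=32):
    """Naively mark each ink point on a size x size binary grid."""
    xs = [x for t in traces for x, _ in t]
    ys = [y for t in traces for _, y in t]
    x0, y0 = min(xs), min(ys)
    scale = (size - 1) / max(max(xs) - x0, max(ys) - y0, 1e-6)
    grid = [[0] * size for _ in range(size)]
    for t in traces:
        for x, y in t:
            grid[int((y - y0) * scale)][int((x - x0) * scale)] = 1
    return grid

label, traces = parse_inkml(INKML)
grid = rasterize(traces)
```

A production rasterizer would interpolate between consecutive points to draw continuous strokes, but the point-marking version above is enough to show how online inks become offline training images.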


When comparing MathWriting to CROHME23, MathWriting stands out with nearly 3.9 times more samples and 4.5 times more distinct labels after normalization. Although there is considerable overlap between the two datasets' labels (47k), most labels are dataset-specific. Notably, MathWriting contains far more human-written inks than CROHME23. Moreover, MathWriting offers a broader array of tokens, encompassing 254 distinct ones, including Latin capitals, most of the Greek alphabet, and matrices, enabling representation of diverse scientific fields such as quantum mechanics, differential calculus, and linear algebra.

The MathWriting dataset comprises 253k human-written expressions and 6k isolated symbols for training, validation, and testing, alongside 396k synthetic expressions. It covers 244 mathematical symbols and ten syntactic tokens. The dataset, released under a Creative Commons license, employs normalized LaTeX notation as ground truth. The handwritten mathematical expression recognition benchmark is based on MathWriting's test split, using the character error rate (CER) metric. Various recognition models, including a CTC Transformer and an OCR model, demonstrate the dataset's utility. Data collection involved human contributors copying rendered expressions via an Android app, followed by minimal postprocessing and label normalization to improve model performance.
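The CER metric used by the benchmark is simple to reproduce. The following is a minimal sketch based on a standard dynamic-programming edit distance, not the benchmark's official implementation:

```python
def levenshtein(a, b):
    """Edit distance between two sequences, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def cer(reference, hypothesis):
    """Character error rate: edit distance over reference length."""
    return levenshtein(reference, hypothesis) / max(len(reference), 1)

# Example: a model output that drops a pair of braces.
error = cer(r"\frac{a}{b}", r"\frac{a}b")
```

Because the ground truth is normalized LaTeX, CER here counts errors over LaTeX characters, which is why consistent label normalization matters for fair evaluation.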

The MathWriting dataset offers detailed insight into handwritten mathematical expressions when compared with the CROHME23 dataset. With extensive label and ink statistics, it documents the diversity of expressions and writing styles, emphasizes the role of synthetic data in broadening model coverage, and highlights challenges such as device variation and noise sources like stray strokes and incorrect ground truth. Despite these inherent recognition challenges, MathWriting is a comprehensive resource for training and evaluating handwriting recognition models, offering insight into real-world recognition scenarios.

In conclusion, MathWriting's broad applications support recognition training across scientific domains and enable synthetic expression generation. Integration with datasets like CROHME23 promises enhanced model performance and diversity. Bounding box data facilitates synthetic ink generation, potentially moving beyond LaTeX's rigid layout toward more natural synthesis. It also opens avenues for character segmentation in UI features. Further improvements include exploring varied label normalizations and leveraging contextual information for better recognition. Future research could focus on optimizing train/validation/test splits and developing language models tailored to mathematical expressions. In sum, MathWriting empowers recognition research with extensive data and clear avenues for advancement.
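The bounding-box-driven synthesis mentioned above could look roughly like the following toy sketch, in which the single-stroke symbol inks and the simple left-to-right layout rule are illustrative stand-ins rather than the dataset's actual generation pipeline:

```python
def bounding_box(traces):
    """Axis-aligned bounding box (x0, y0, x1, y1) of an ink."""
    xs = [x for t in traces for x, _ in t]
    ys = [y for t in traces for _, y in t]
    return min(xs), min(ys), max(xs), max(ys)

def translate(traces, dx, dy):
    """Shift every point of an ink by (dx, dy)."""
    return [[(x + dx, y + dy) for x, y in t] for t in traces]

def stitch(symbol_inks, gap=5.0):
    """Lay symbol inks out left to right, separated by `gap` units."""
    cursor, expression = 0.0, []
    for traces in symbol_inks:
        x0, _, x1, _ = bounding_box(traces)
        expression.extend(translate(traces, cursor - x0, 0.0))
        cursor += (x1 - x0) + gap
    return expression

# Two fake single-stroke "symbols", each drawn near the origin.
a = [[(0, 0), (10, 10)]]
b = [[(2, 0), (8, 10)]]
ink = stitch([a, b])
```

Real synthesis from MathWriting's isolated symbols would additionally need vertical alignment, scaling for sub/superscripts, and layout driven by the rendered expression's bounding boxes, but the translate-and-concatenate idea is the core of it.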


Check out the Paper. All credit for this research goes to the researchers of this project.


Sana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.
