Researchers at the Allen Institute for Artificial Intelligence (AI2) have developed a new AI model that summarizes scientific papers. It provides a single-sentence TL;DR (Too Long; Didn’t Read) summary when a user runs a search or visits an author’s page.
The team has rolled out this model to the Allen Institute’s Semantic Scholar search engine for papers. At present, the TL;DR summaries can be accessed only for computer science-related documents.
Summarizing text with AI is a well-known problem in natural language processing (NLP). There are two general approaches. Extractive methods select a sentence or set of sentences verbatim from the text that captures its essence. Abstractive methods generate new sentences instead. Extractive methods have been popular because of the limitations of NLP systems, but advances in natural language generation have made abstractive techniques considerably better in recent years.
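To make the extractive idea concrete, here is a minimal sketch in Python. It is not the AI2 model, which is abstractive. It assumes a naive sentence splitter and a simple word-frequency score, and the function names `split_sentences` and `extractive_summary` are hypothetical, chosen only for illustration.

```python
import re
from collections import Counter


def split_sentences(text):
    # Naive splitter (an assumption): break after '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extractive_summary(text, num_sentences=1):
    """Return the highest-scoring sentences, copied verbatim from the text."""
    sentences = split_sentences(text)
    # Count how often each word appears in the whole document.
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        # Average document frequency of the words in this sentence.
        tokens = re.findall(r"[a-z']+", sentence.lower())
        if not tokens:
            return 0.0
        return sum(freq[t] for t in tokens) / len(tokens)

    # Rank sentence indices by score, then restore document order.
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:num_sentences])
    return " ".join(sentences[i] for i in chosen)
```

Whatever this function returns is always copied from the input. An abstractive system, by contrast, is free to write a sentence that appears nowhere in the paper.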
Training and Fine-tuning
This model uses a transformer, a type of neural network architecture that has powered many of the major recent leaps in NLP, including OpenAI’s GPT-3. The researchers first pre-trained the transformer on a general corpus of text to build familiarity with the English language.
The team then created SciTLDR, a dataset to train the model. It contains over 5,400 pairs of computer science research papers and their corresponding summaries. Because a title is itself a kind of very short summary, the team expected titles to help the model improve its results. The model was therefore also trained on a second dataset of over 20,000 research papers and their corresponding titles, to reduce its dependence on domain knowledge when writing a summary.
The model draws on the essential parts of a paper’s abstract, introduction, and conclusion sections to produce its summary. On average, the trained model summarized documents of over 5,000 words in just 21 words, a compression ratio of about 238. The researchers now aim to extend the model to papers in other disciplines as well.
While earlier research has addressed summarization, this model stands out for its level of compression. During testing, human reviewers judged its summaries to be more informative and accurate than those of previous methods.
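The compression figure follows directly from the two word counts quoted above. A small sketch of the arithmetic, where `compression_ratio` is a hypothetical helper name and the inputs are the rounded averages from this article rather than exact measurements:

```python
def compression_ratio(source_words, summary_words):
    """Ratio of source length to summary length, both measured in words."""
    if summary_words <= 0:
        raise ValueError("summary_words must be positive")
    return source_words / summary_words


# Rounded averages reported in the article: 5,000-word papers, 21-word summaries.
ratio = compression_ratio(5000, 21)
print(round(ratio))  # 238
```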
The researchers have found that the TL;DR summaries are sometimes too similar to the paper title, which makes them less useful. They therefore intend to update the model’s training process to reduce this overlap. The team also aims to work on summarizing multiple documents at a time, which could help researchers entering a new field.
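One simple way to flag a summary that merely restates the title is to measure word overlap between the two. The sketch below uses Jaccard similarity on lowercase word sets. This is an illustrative assumption, not the researchers' method, and the function names and the 0.6 threshold are hypothetical.

```python
import re


def tokens(text):
    # Lowercase alphanumeric words, as a set (duplicates ignored).
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard_similarity(a, b):
    """Size of the shared vocabulary divided by the size of the combined vocabulary."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def too_close_to_title(summary, title, threshold=0.6):
    # The threshold is an arbitrary illustrative value, not a tuned one.
    return jaccard_similarity(summary, title) >= threshold
```

A score near 1.0 means the summary adds little beyond the title, while a score near 0.0 means the two share almost no vocabulary.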
Tanushree Shenwai is a consulting intern at MarktechPost. She is currently pursuing her B.Tech at the Indian Institute of Technology (IIT), Bhubaneswar. She is a data science enthusiast with a keen interest in the applications of artificial intelligence in various fields. She is passionate about exploring new advancements in technology and their real-life applications.