Stanford AI Researchers Propose ‘LinkBERT’: A New Pretraining Method That Improves Language Model Training with Document Links

This Article is written as a summay by Marktechpost Staff based on the research paper 'LinkBERT: Pretraining Language Models with Document Links'. All Credit For This Research Goes To The Researchers of This Project. Check out the paper, github and blog post.

Please Don't Forget To Join Our ML Subreddit

Language Models (LMs) are the cornerstone of modern Natural Language Processing (NLP) systems, primarily because of their extraordinary ability to gain knowledge from textual material. These models have been ingrained in our daily lives because of their use in tools like search engines and voice assistants. Models like the BERT and GPT series stand out because they can be pre-trained on vast amounts of unannotated text input using self-supervised learning. These pre-trained models can then be readily modified for a wide range of new question-answering jobs without much task-specific finetuning veiled language modeling and causal language modeling. The main disadvantage of these prevalent LM pretraining solutions is that they only model one document at a time and do not capture dependencies or knowledge that span multiple documents. Because of their interdependence, independently evaluating each document has some limits. This is best illustrated by using text from the web or scientific literature, which is frequently utilized in LM training. Most of the time, these textual data contain document links, such as hyperlinks and reference links. These document links are essential since knowledge can be found in several documents rather than just one.

Document linkages are essential for LMs to learn new information and make discoveries, just as they are for humans. We must remember that a text corpus is more than a list of documents; it is a graph of documents with links linking them. Models trained without these dependencies may be unable to capture facts scattered throughout several documents, which is required for a variety of applications such as question answering and knowledge discovery. To take a step toward tackling this challenge, a group of Stanford AI lab researchers created LinkBERT, a new pretraining approach that includes information about document link information during training. The LinkBERT LM is divided into three stages. The first stage is building a document graph from the text corpus using hyperlinks and citation links. Each document is viewed as a node, and if there is a hyperlink between two documents, a directed edge is added between them. The second step is to use the graph to create link-aware training instances by grouping connected documents together. The document is chunked into segments, which are subsequently concatenated together based on the document graph’s links. Concatenation can be done in several ways like contiguous, random, and linked segments. The model can learn to recognize relationships between text parts using these three alternative approaches.


Pretraining the LM using link-aware self-supervised tasks such as masked language modeling (MLM) and document relation prediction (DRP) is the final step. MLM conceals some tokens in the input text before predicting the tokens based on the tokens around them. The objective is to get the LM to gain multi-hop knowledge of topics that are linked together via document linkages. It is because of DRP that the model can categorize the relationship between two segments as contiguous or random. This job helps the LM to learn about document relevance and dependencies. The LinkBERT models were tested on various downstream tasks from diverse domains. The model regularly outperforms baseline language models like BERT and PubmedBERT that were not pre-trained with document linkages across tasks and domains. Because of the critical connections between scientific publications via citation links, these successful results were especially relevant for the biomedical domain. LinkBERT also shows exemplary results for multi-hop reasoning.

It is quite convenient to utilize LinkBERT as a drop-in replacement for BERT. LinkBERT not only improves performance for general language comprehension tasks, but it also captures concept relations and is effective for cross-document understanding, according to a careful experimental study. The model also internalizes more world knowledge and is helpful for jobs that need much knowledge, like question answering. The LinkBERT models were released with the hope that they would pave the way for future research projects. Some of these projects include generalizing sequence-to-sequence style LMs to conduct document link-aware text generation and so on. The research work has also been published in Association for Computational Linguistics 2022.