Open-domain long-form question answering (LFQA) is a fundamental challenge in natural language processing (NLP) that involves retrieving documents relevant to a given query and using them to generate a detailed, paragraph-length answer.
Recently, there has been significant progress in factoid open-domain question answering (QA), where a short phrase or entity is enough to answer a question, but significantly less work has been done on long-form question answering (LFQA). LFQA is an important task, primarily because it provides a testbed for measuring the factuality of generative text models. However, current benchmarks and evaluation metrics are not well suited to making progress on LFQA.
In a recent paper, “Hurdles to Progress in Long-form Question Answering”, set to appear at NAACL 2021, Google researchers present a new system for open-domain long-form question answering that leverages two recent advances in NLP: state-of-the-art sparse attention models, such as the Routing Transformer (RT), which allow attention-based models to scale to long sequences, and retrieval-based models, such as REALM, which facilitate retrieval of Wikipedia articles related to a given query.
The system combines information from multiple retrieved Wikipedia articles related to the given question before generating an answer. It achieves a new state of the art on ELI5, the only large-scale publicly available dataset for long-form question answering.
However, while the system tops the public leaderboard, the researchers have discovered some alarming trends with the ELI5 dataset and the associated evaluation metrics. In particular, they found little evidence that models actually use the retrievals on which they condition, and that trivial baselines (e.g., input copying) beat modern systems. The researchers also observed significant train/validation overlap in the dataset. The paper suggests mitigation strategies for each of these challenges.
The main component of modern NLP models is the Transformer architecture. Every token in a sequence attends to every other token, so the cost of attention scales quadratically with the sequence length. The RT model introduces a dynamic, content-based mechanism that reduces the complexity of attention in the Transformer model.
The key insight of the RT work is that having each token attend to every other token is often redundant and can be approximated by a combination of local and global attention. The RT model is pre-trained on the Project Gutenberg (PG-19) dataset with a language modeling objective.
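To see why this helps, consider a simplified local-plus-global attention mask. The sketch below is purely illustrative and is not the actual Routing Transformer mechanism: RT selects its "global" positions dynamically via content-based clustering, whereas here a fixed prefix of tokens is attended globally. Even this crude approximation attends to only a small fraction of the quadratic number of (query, key) pairs.

```python
import numpy as np

def sparse_attention_mask(seq_len, window, n_global):
    """Build a simplified local + global attention mask.

    Illustrative only: the real Routing Transformer picks global
    positions dynamically by clustering token content; here the
    first `n_global` positions are attended globally instead.
    """
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    for i in range(seq_len):
        lo, hi = max(0, i - window), min(seq_len, i + window + 1)
        mask[i, lo:hi] = True      # local sliding-window attention
    mask[:, :n_global] = True      # a few globally attended tokens
    return mask

mask = sparse_attention_mask(seq_len=512, window=8, n_global=4)
sparse_pairs = mask.sum()          # attended (query, key) pairs
full_pairs = 512 * 512             # full quadratic attention
print(f"fraction of full attention: {sparse_pairs / full_pairs:.3f}")
```

With a window of 8 and 4 global tokens, the model attends to only a few percent of the pairs that full attention would require, which is what makes long sequences tractable.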
The researchers demonstrated the effectiveness of the RT model by combining it with retrievals from REALM. REALM is a retrieval-based model that uses maximum inner product search (MIPS) to fetch Wikipedia articles related to a particular query or question. The researchers improved the quality of REALM retrievals by training with a contrastive loss, yielding the variant referred to as c-REALM.
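The retrieval step can be sketched as follows. This is a minimal illustration of maximum inner product search, not REALM itself: real systems embed queries and documents with a learned encoder and use approximate-search indexes, whereas here random vectors and a brute-force dot product stand in for both.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dense embeddings: in REALM these come from a learned
# encoder; random vectors are used here purely for illustration.
doc_embeddings = rng.normal(size=(1000, 128))   # 1,000 "Wikipedia" docs
query_embedding = rng.normal(size=(128,))       # one encoded question

# Maximum inner product search (MIPS): score every document by its
# inner product with the query and keep the top-k highest scoring.
scores = doc_embeddings @ query_embedding
top_k = np.argsort(scores)[::-1][:5]
print("retrieved doc ids:", top_k)
```

At Wikipedia scale, the brute-force scan above is replaced by an approximate MIPS index, but the scoring principle is the same.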
The model was evaluated on long-form question answering using the ELI5 dataset, part of the KILT benchmark and the only publicly available large-scale LFQA dataset. The researchers fine-tuned the pre-trained RT model, together with retrievals from c-REALM, on the ELI5 portion of KILT.
The submission is in first place on the KILT leaderboard for long-form question answering on ELI5, with a combined KILT R-L score of 2.36. Although the model tops the leaderboard, several challenges are associated with it.
The researchers observed little to no evidence that the model is grounding its generated text in the retrieved documents. They also found significant overlap among the training, validation, and test sets of ELI5. Moreover, there were issues with the ROUGE-L metric used to evaluate text generation quality, since trivial, nonsensical baselines can achieve relatively high scores. The researchers hope that the community works together to solve these issues so that meaningful progress can be made in this field.
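The metric issue is easy to reproduce with a toy example. The sketch below is a simplified ROUGE-L F1 over whitespace tokens, not the official implementation (which adds stemming and other normalization); the example question and reference answer are invented. Because ROUGE-L rewards any longest-common-subsequence overlap, even trivially copying the question back as the "answer" earns a nonzero score.

```python
def lcs_length(a, b):
    """Longest common subsequence length between two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l_f1(candidate, reference):
    """Simplified ROUGE-L F1 over whitespace tokens (not the official impl)."""
    c, r = candidate.split(), reference.split()
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)

reference = "the sky is blue because air molecules scatter short blue wavelengths more"
question = "why is the sky blue during the day"

# A trivial baseline that just copies the question still overlaps
# substantially with the reference answer:
print(round(rouge_l_f1(question, reference), 3))  # → 0.3
```

A metric on which input copying scores this well cannot distinguish a grounded answer from a degenerate one, which is why the paper argues for better LFQA evaluation.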