How Does Retrieval Augmentation Impact Long-Form Question Answering? This AI Study Provides New Insights into How Retrieval Augmentation Impacts Long, Knowledge-Rich Text Generation of Language Models

Long-form question answering (LFQA) aims to provide a complete and thorough response to any query. Drawing on parametric knowledge in large language models (LLMs) and on retrieved documents presented at inference time, LFQA systems compose paragraph-length answers rather than extracting spans from an evidence document. Recent years have revealed both the impressive capabilities and the fragility of large-scale LLMs on LFQA. Retrieval has recently been proposed as a powerful way to supply LMs with up-to-date, relevant information. However, it is still unclear how retrieval augmentation influences LMs during generation, and it does not always have the expected effects.

Researchers from the University of Texas at Austin investigate how retrieval influences answer generation for LFQA, a challenging long-text generation problem. Their study uses two controlled settings: one in which the LM is held constant while the evidence documents are varied, and another in which the evidence documents are held constant while the LM is varied. Because LFQA quality is difficult to assess, they begin by measuring surface-level indicators (e.g., length, perplexity) that are associated with answer attributes such as coherence. The ability to attribute the generated answer to the available evidence documents is an attractive feature of retrieval-augmented LFQA systems. The researchers collect new human annotations of sentence-level attribution and use them to test off-the-shelf attribution detection systems.
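
As a rough illustration of the surface statistics mentioned above, the following Python sketch computes answer length and perplexity from per-token log-probabilities. The function names and the sample values are hypothetical and are not taken from the study; in practice, the log-probabilities would come from whichever LM is being analyzed.

```python
import math
from typing import Sequence


def answer_length(answer: str) -> int:
    """Length of an answer in whitespace-separated words."""
    return len(answer.split())


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity: exp of the negative mean natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Hypothetical values for illustration only (not results from the study).
example_answer = "Retrieval can change how long an answer is"
example_logprobs = [math.log(0.5), math.log(0.25), math.log(0.5), math.log(0.25)]

print(answer_length(example_answer))
print(perplexity(example_logprobs))
```

Lower perplexity means the LM found the text more predictable, so comparing these two numbers with and without retrieved documents gives a cheap first look at how retrieval shifts generation.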

Based on their examination of surface patterns, the team concludes that retrieval augmentation significantly changes what LMs generate. Some effects persist even when the supplied documents are irrelevant; for example, the length of the generated answers may change. In contrast to irrelevant documents, documents that provide relevant in-context evidence cause LMs to produce more unexpected phrases. Even with an identical set of evidence documents, different base LMs can respond to retrieval augmentation in contrasting ways. The newly annotated dataset provides a gold standard against which to evaluate attribution detection methods. The findings show that NLI models that perform well at identifying attribution in factoid QA also do well in the LFQA setting, surpassing chance by a wide margin but falling short of human agreement by 15% in accuracy.
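
To make the evaluation setup concrete, here is a minimal Python sketch of scoring an attribution detector against human sentence-level labels. Everything in it is an assumption for illustration: the example data is invented, and `overlap_predictor` is a toy word-overlap heuristic standing in for a real detector, not the NLI models evaluated in the study.

```python
from typing import Callable, Sequence, Tuple

# (evidence document, answer sentence, human label: True if the sentence is supported)
Example = Tuple[str, str, bool]


def overlap_predictor(evidence: str, sentence: str, threshold: float = 0.5) -> bool:
    """Toy stand-in for an attribution detector (NOT an NLI model): predicts
    'supported' when enough of the sentence's words appear in the evidence."""
    evidence_words = set(evidence.lower().split())
    sentence_words = sentence.lower().split()
    if not sentence_words:
        return False
    hits = sum(word in evidence_words for word in sentence_words)
    return hits / len(sentence_words) >= threshold


def attribution_accuracy(
    examples: Sequence[Example],
    predictor: Callable[[str, str], bool],
) -> float:
    """Fraction of sentences where the predictor agrees with the human label."""
    if not examples:
        raise ValueError("need at least one annotated example")
    correct = sum(predictor(ev, sent) == label for ev, sent, label in examples)
    return correct / len(examples)


# Hypothetical annotations for illustration only (not data from the study).
evidence = "the sky appears blue because air scatters blue light"
examples = [
    (evidence, "the sky appears blue", True),
    (evidence, "oceans are salty because of rivers", False),
    (evidence, "blue light is scattered by molecules in air", True),
    (evidence, "air scatters blue light", True),
]

print(attribution_accuracy(examples, overlap_predictor))
```

The third example is a paraphrase that the overlap heuristic misses, which hints at why entailment-based detectors are preferred over lexical matching. The same accuracy function can score any detector that maps an evidence document and a sentence to a supported or unsupported decision.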

The research shows that even when given an identical set of documents, attribution quality can differ widely between base LMs. The study also sheds light on attribution patterns in long-text generation. The generated text tends to follow the order of the in-context evidence documents, even when the in-context document is a concatenation of multiple documents, and the last sentence is much less traceable than earlier sentences. Overall, the study sheds light on how LMs leverage in-context evidence documents to answer in-depth questions and points toward actionable research agenda items.


All credit for this research goes to the researchers on this project.
