Hypothesis generation from the scientific literature is the central aim of literature-based discovery (LBD). With drug discovery as its core application area, LBD focuses on hypothesizing links between concepts that have not been examined together before (such as new drug–disease links).
Even though these systems have evolved into machine-learning methodologies, this setup has serious limitations. Hypotheses cannot be very expressive if the setup reduces the "language of scientific ideas" to its most basic form. Moreover, LBD does not model the factors that human scientists consider throughout the ideation process, such as the intended application's setting, requirements and constraints, incentives, and challenges. Finally, the inductive and generative nature of science, where new concepts and their recombinations continuously emerge, is not captured in the transductive LBD setting, where all concepts are known a priori and only need to be connected.
Researchers at the University of Illinois at Urbana-Champaign, the Hebrew University of Jerusalem, and the Allen Institute for Artificial Intelligence (AI2) try to address these complexities with Contextual Literature-Based Discovery (C-LBD), a new task setting and modeling paradigm. They are the first to use a natural-language context to constrain the generation space for LBD, and they also break away from classic LBD in the output by having the system generate sentences.
Inspiration for C-LBD comes from the idea of an AI-powered assistant that can provide suggestions in plain English, including novel ideas and connections. The assistant accepts as input (1) relevant background information, such as present challenges, motivations, and constraints, and (2) a seed term that should be the primary focus of the generated scientific idea. Given this information, the team investigates two forms of C-LBD: one that generates a full sentence describing an idea and another that generates only a salient component of the idea.
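To make the two task variants concrete, here is a toy illustration of the input/output format. The field names, example texts, and the `seed_is_focus` helper are my own assumptions for illustration, not the paper's exact schema.

```python
# Hypothetical C-LBD input: background context plus a seed term
# (field names are assumptions, not the paper's actual schema).
example_input = {
    "background": ("Abstractive summarization models often produce summaries "
                   "that contradict the source document."),
    "seed_term": "contrastive learning",
}

# Variant 1: generate a full sentence describing the idea.
idea_sentence = ("Apply contrastive learning to penalize summary candidates "
                 "that are inconsistent with the source document.")

# Variant 2: generate only the salient component of the idea.
idea_node = "contrastive factual-consistency objective"

def seed_is_focus(idea: str, seed_term: str) -> bool:
    """Sanity check: the seed term should appear in the generated idea."""
    return seed_term.lower() in idea.lower()
```

In the full-sentence variant, the output must both center on the seed term and respond to the stated background, which is what distinguishes C-LBD from link prediction over a fixed concept vocabulary.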
To this end, they introduce a novel modeling framework for C-LBD that can retrieve inspirations from disparate sources (such as a scientific knowledge graph) and use them to form novel hypotheses. They also introduce an in-context contrastive model that uses the background sentences as negatives to prevent unwarranted copying of the input and to promote creative generation. Unlike most LBD research, which is directed toward biomedical applications, these experiments target articles in the field of computer science. From the 67,408 papers in the ACL Anthology, the team automatically curated a new dataset using IE systems, complete with task, method, and background sentence annotations.
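The in-context contrastive idea can be sketched with a simple hinge loss over sentence embeddings: the generated idea is pulled toward the gold target and pushed away from the input background sentences, which act as negatives. This is a minimal illustration of the principle under my own simplifying assumptions (cosine similarity, a fixed margin), not the paper's exact objective.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def in_context_contrastive_loss(idea_vec, target_vec, background_vecs, margin=0.5):
    """Hinge-style contrastive loss: reward similarity to the gold target
    while penalizing similarity to the input background sentences, which
    serve as in-context negatives (a simplified sketch)."""
    pos = cosine(idea_vec, target_vec)
    loss = 0.0
    for neg_vec in background_vecs:
        neg = cosine(idea_vec, neg_vec)
        loss += max(0.0, margin - pos + neg)
    return loss / len(background_vecs)
```

An idea embedding that matches the target and is orthogonal to the background incurs zero loss, while one that merely echoes the background is penalized, which is the intended anti-copying pressure.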
By focusing on the NLP field specifically, researchers in that area will have an easier time analyzing the results. Experimental results from automated and human evaluations reveal that retrieval-augmented hypothesis generation significantly outperforms previous methods, but that current state-of-the-art generative models are still inadequate for this task.
The team believes that expanding C-LBD to include a multimodal analysis of formulas, tables, and figures, providing a more comprehensive and enriched background context, is an intriguing direction for future work. Applying advanced LLMs such as GPT-4 is another avenue they plan to investigate.
Check out the Paper and Github.
Tanushree Shenwai is a consulting intern at MarktechPost. She is currently pursuing her B.Tech at the Indian Institute of Technology (IIT), Bhubaneswar. She is a data science enthusiast with a keen interest in the applications of artificial intelligence across various fields. She is passionate about exploring new advancements in technology and their real-life applications.