NYU Researchers Propose GPQA: A Challenging Dataset of 448 Multiple-Choice Questions Written by Domain Experts in Biology, Physics, and Chemistry

Large Language Models (LLMs) are at the forefront of Artificial Intelligence (AI) and show great promise to surpass human skills in this quickly changing field. But when these models get closer to superhuman capabilities, assessing them fairly and bringing them into line with human understanding becomes more difficult. Solving this problem is essential to guaranteeing that new AI systems would be dependable in delivering correct information, particularly on issues where the truth that humans can verify may be elusive, a problem known as scalable oversight.

Robust assessment testbeds are necessary to gauge how well LLMs align for these jobs. Testbeds need to consistently get accurate data from these models, especially in scenarios where access to human-generated or independently verified truth is limited. Such testbeds should be difficult enough to allow for generalization to problems outside human knowledge, even to test highly trained non-experts. Evaluating the accuracy of LLMs’ answers is more difficult when they take on more complicated topics, especially in fields where specialized knowledge is needed. A major component of oversight techniques, like reinforcement learning from human feedback, is the accuracy with which human annotators can assess the accuracy of LLM outputs. However, problems like hallucination and sycophancy in model answers are made worse in areas where annotators find it difficult to distinguish correctness owing to a lack of experience.

In response to these issues, researchers from NYU, Cohere, and Anthropic present GPQA: A Graduate-Level Google-Proof Q&A Benchmark. GPQA is an assessment dataset with graduate-level multiple-choice questions covering biology, chemistry, and physics. Interestingly, GPQA spends a lot of time trying each question and validates it with domain experts and highly trained and driven non-experts, ensuring that the questions are challenging. GPQA is the result of a thorough four-step procedure. Questions are first developed by domain experts and then validated and revised by others. Two more expert validators evaluate the amended questions for objectivity. Ultimately, highly qualified non-expert validators who take their time answering each question confirm the dataset’s complexity. Employee incentives are thoughtfully crafted to recognize and reward superior work at every level.

With 448 demanding instances, GPQA proves the challenge that even the most advanced AI systems face. Even the best GPT-4-based model only gets 39% accuracy, while professionals reach 65% and non-experts reach 34%. This highlights the value of the dataset for researching scalable supervision techniques for next-generation models that outperform existing ones. Notwithstanding its importance, GPQA has drawbacks, including very limited model training sizes and possible biases in expert selection. In the future, oversight datasets might strive to find unsolved problems as a standard for superhuman AI supervision, closing the knowledge gap between models and human expertise.

GPQA functions as a trailblazing assessment dataset, expanding the frontiers of artificial intelligence assessment in demanding fields. Its development approach and validation techniques facilitate the development of protocols to efficiently oversee superhuman AI systems by providing insights into scalable supervision trials. To sum up, the development of GPQA represents a significant milestone in assessing AI systems and can potentially improve the alignment of superhuman models with human knowledge.


Check out the Paper. All credit for this research goes to the researchers of this project. Also, don’t forget to join our 33k+ ML SubReddit, 41k+ Facebook Community, Discord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more.

If you like our work, you will love our newsletter..

Aneesh Tickoo is a consulting intern at MarktechPost. He is currently pursuing his undergraduate degree in Data Science and Artificial Intelligence from the Indian Institute of Technology(IIT), Bhilai. He spends most of his time working on projects aimed at harnessing the power of machine learning. His research interest is image processing and is passionate about building solutions around it. He loves to connect with people and collaborate on interesting projects.

🚀 LLMWare Launches SLIMs: Small Specialized Function-Calling Models for Multi-Step Automation [Check out all the models]