Researchers from Microsoft Research and Georgia Tech Unveil Statistical Boundaries of Hallucinations in Language Models

A key issue that has recently surfaced is the high rate at which language models (LMs) provide erroneous information, including references to nonexistent article titles. The Merriam-Webster dictionary defines a hallucination as “a plausible but false or misleading response generated by an artificial intelligence algorithm.” In one instance, attorneys who submitted legal research containing fabricated court cases that they believed to be accurate faced a $5,000 penalty. In the medical field, hallucinations could be life-threatening to patients, and doctors worry about being sued for negligence. Additionally, the media has covered hallucinations extensively, and the President of the United States recently issued an Executive Order requesting, among other things, protections against deceptive results from generative artificial intelligence systems.

In this work, researchers from Microsoft Research and Georgia Tech present statistical lower bounds on the hallucination rate for language models (LMs) that are calibrated fact predictors. This sheds light on the characteristics of hallucinations, but it does not imply that hallucinations are unavoidable. As the research team discusses, the result is in line with the growing trend of practitioners supplementing “pretraining” procedures with “post-training” procedures that lower both hallucination rates and calibration. An LM is simply a probability distribution D over sequences of tokens, i.e., words or other character sequences. Any LM that assigns positive probability to every string (a typical characteristic of LMs) will necessarily hallucinate with positive probability. However, hallucinations will be uncommon if this probability is low. Therefore, measuring the frequency of hallucinations is essential.

Any distribution D can be expressed equivalently through log-probabilities of complete sequences or through conditional log-probabilities of the next token given the preceding ones: $\log D(t_1 \ldots t_m) = \sum_{i=1}^{m} \log D(t_i \mid t_1 \ldots t_{i-1})$. This seemingly trivial mathematical equivalence has a significant implication. Although prediction and generation have different requirements, any LM may be used either to generate text or to predict the next token in naturally occurring text conditioned on the preceding tokens. Take the following sentence, for example: “Alexa Wilkins went to Salumeria last Tuesday for lunch because the reviews said the tuna sandwich was amazing.” A predictive language model might suggest such sentences to reduce typing on a phone. It may be useful to predict “sandwich” as the word following “tuna,” along with other plausible words such as “salad.”
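The equivalence above can be illustrated with a minimal Python sketch. The table `COND` and its probabilities are hypothetical and chosen only for illustration. As a simplifying assumption, each token is conditioned only on the single preceding token rather than on the full prefix. The same table serves both uses described in the text: scoring a whole sequence and ranking candidates for the next word.

```python
import math

# Hypothetical toy model: P(next token | previous token).
# "<s>" marks the start of a sequence. The numbers are illustrative only.
COND = {
    "<s>": {"tuna": 0.5, "salad": 0.5},
    "tuna": {"sandwich": 0.75, "salad": 0.25},
}

def sequence_log_prob(tokens):
    """Chain rule: log D(t1..tm) is the sum of conditional log-probabilities."""
    total = 0.0
    prev = "<s>"
    for tok in tokens:
        total += math.log(COND[prev][tok])
        prev = tok
    return total

def predict_next(prev):
    """Prediction use: rank candidate next tokens, most probable first."""
    return sorted(COND[prev].items(), key=lambda kv: -kv[1])

log_p = sequence_log_prob(["tuna", "sandwich"])
joint_p = math.exp(log_p)           # probability of the whole sequence
suggestions = predict_next("tuna")  # candidates to follow "tuna"
```

Here `joint_p` is the product of the two conditional probabilities (0.5 × 0.75), and `suggestions` lists “sandwich” ahead of “salad,” mirroring the predictive-typing example.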

However, if a generative LM were to fabricate sentences of this kind at random, the vast majority of them would be false. According to the paper, even in ideal circumstances, LMs with strong predictive text ability should hallucinate. Notably, in the initial pretraining step that is typical nowadays, the generative LM is optimized for predictive text performance. Moreover, the paper offers a lower bound on the rate of hallucination, which might shed light on the different rates at which different kinds of facts should be hallucinated. The example above (which the research team refers to as a 5W = Who-Ate-What-When-Where-Why factoid) and possible references have in common that they are arbitrary, in the sense that neither can be determined systematically by rules; that is, the veracity of most such facts cannot be determined when they are absent from the training data.

This contrasts with facts whose validity can be determined systematically. Even in a simplified setting with many ideal properties, the research team quantifies how much LMs should hallucinate. The team prefers simplicity over generality because the lower bounds are statistical and the goal is to pinpoint the underlying source of LM hallucinations. The team seeks a hallucination lower bound that holds in the simplest setting, where training data is i.i.d. and free of factual errors. This is similar to classification, where one seeks a lower bound on the difficulty of classification in noiseless settings (even though noise-tolerant classification techniques exist).

The research team offers a natural extension of calibration to generative models. Their notion differs from previous applications of calibration to LMs, which were token-level. Since each fact may be described in natural language in various ways, calibrating token probabilities is only useful when evaluating raw token probabilities. Instead, their semantic-level calibration considers the probability distribution over the pieces of information (facts or hallucinations) contained in the text. An LM is considered calibrated if, for any given probability z ∈ [0, 1], the information it generates with probability ≈ z appears, on average, in a fraction ≈ z of naturally occurring language (preferably the distribution from which the training data was drawn).
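One simple way to read this definition is sketched below in Python. It is an illustration under stated assumptions, not the paper's formal procedure. Pieces of information are grouped into bins by the probability the model assigns them, and the model's total mass in each bin is compared with the mass the reference distribution places on the same items. The function name `calibration_gaps`, the binning scheme, and the toy distributions are all hypothetical.

```python
from collections import defaultdict

def calibration_gaps(model_probs, true_probs, num_bins=10):
    """Per probability bin: model mass minus reference mass on the same items.

    Gaps near zero in every bin indicate calibration in this simplified sense.
    """
    model_mass = defaultdict(float)
    true_mass = defaultdict(float)
    for item, g in model_probs.items():
        b = min(int(g * num_bins), num_bins - 1)  # bin index for probability g
        model_mass[b] += g
        true_mass[b] += true_probs.get(item, 0.0)
    return {b: model_mass[b] - true_mass[b] for b in sorted(model_mass)}

# Hypothetical reference distribution over true facts.
true_probs = {"fact_A": 0.5, "fact_B": 0.25, "fact_C": 0.25}

# Hypothetical model that cannot tell fact_B and fact_C apart from two
# fabricated items, so it spreads that mass evenly over all four.
model_probs = {
    "fact_A": 0.5,
    "fact_B": 0.125,
    "fact_C": 0.125,
    "fake_X": 0.125,
    "fake_Y": 0.125,
}

gaps = calibration_gaps(model_probs, true_probs)
hallucination_rate = sum(g for item, g in model_probs.items()
                         if item not in true_probs)
```

In this toy case every gap is zero, so the model is calibrated in the binned sense, yet it still places a quarter of its probability mass on fabricated items. This is the kind of tension between calibration and hallucination that the lower bounds describe.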

This work aims to explain this phenomenon by demonstrating that pretraining LMs for predictive accuracy results in hallucinations even in an ideal world where the training data is perfectly factual, there is no blurring of facts and hallucinations, each document contains at most one fact, and there is not even a prompt that would encourage hallucination. Furthermore, their analysis clarifies why contemporary LMs hallucinate more than earlier LMs, such as trigram models, despite training on comparable data sets with comparable objectives. The monofact rate may indicate the rates at which calibrated LMs must hallucinate for various kinds of facts.

One expects hallucinations for facts with a high monofact rate, that is, facts that frequently appear just once in the training data. Interestingly, a high monofact rate is uncommon for references to books or articles, a problematic kind of hallucination being studied now. Therefore, hallucinated references may result from other problems, such as model capacity, given the sheer quantity of facts, including references and others, that an LM encounters during training. Additionally, it could be possible to correct hallucinated references by altering the pretraining pipeline without using post-training, but this won’t help with other kinds of arbitrary facts, like the ones in their 5W example, where monofacts are frequent.
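The monofact rate can be estimated from data with a short Python sketch. It assumes that each training document has already been reduced to a single fact, represented as a string. The function name `monofact_rate` and both sample lists are hypothetical. The estimate is the share of training observations whose fact occurs exactly once, similar in spirit to a Good-Turing estimate.

```python
from collections import Counter

def monofact_rate(observed_facts):
    """Fraction of observations whose fact appears exactly once in the data."""
    if not observed_facts:
        return 0.0
    counts = Counter(observed_facts)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(observed_facts)

# Hypothetical 5W-style factoids: each one is seen only once.
five_w_facts = [
    "Alexa Wilkins ate a tuna sandwich at Salumeria last Tuesday",
    "person B ate soup at cafe C on Monday",
    "person D ate pasta at diner E on Friday",
]

# Hypothetical references: the same publications are cited repeatedly.
reference_facts = ["paper P", "paper P", "paper P", "book Q", "book Q"]

rate_5w = monofact_rate(five_w_facts)
rate_refs = monofact_rate(reference_facts)
```

In this toy data the 5W-style facts have the highest possible monofact rate, while the repeated references have a rate of zero. This matches the article's point that arbitrary one-off facts are where a calibrated model is expected to hallucinate.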


Check out the Paper. All credit for this research goes to the researchers of this project.

Aneesh Tickoo is a consulting intern at MarktechPost. He is currently pursuing his undergraduate degree in Data Science and Artificial Intelligence from the Indian Institute of Technology (IIT), Bhilai. He spends most of his time working on projects aimed at harnessing the power of machine learning. His research interest is image processing, and he is passionate about building solutions around it. He loves to connect with people and collaborate on interesting projects.
