In a recent paper, researchers at Berkeley Artificial Intelligence Research (BAIR) evaluated how large language models memorize and regurgitate rare snippets of their training data. The study focused on GPT-2 and found that at least 0.1% of its text generations contain lengthy verbatim strings "copy-pasted" from a document in its training set.
Such memorization would be a serious issue for language models trained on private data, such as users' emails, because the model might inadvertently output a user's sensitive conversations. Yet even for models trained on public data from the Web, memorization of training data raises multiple challenging regulatory questions, ranging from misuse of personally identifiable information to copyright infringement.
The aim of the research was to extract naturally occurring data that a language model has memorized. The problem is made harder by the fact that we do not know in advance what kind of text to look for.
If memorization does occur, it must be a rare phenomenon, since large language models exhibit minimal overfitting: their train and test losses are nearly identical.
The paper describes a two-step "extraction attack" for finding such examples:
- Generate a vast number of samples by feeding GPT-2 short prompts and interacting with it as a black box.
- Keep the generated samples that have an abnormally high likelihood. For example, the team retains any sample to which GPT-2 assigns a much higher probability than a different language model does.
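The filtering step above can be sketched as a perplexity-ratio test between the target model and a reference model. The sketch below uses hypothetical per-token log-probabilities and a made-up threshold; in the real attack these numbers would come from GPT-2 and a second, smaller language model, and the paper's exact scoring metrics may differ.

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp(-mean per-token log-probability)."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

def is_suspicious(target_log_probs, reference_log_probs, threshold=2.0):
    """Flag a sample when the reference model finds it far more surprising
    than the target model does -- a hint that the target model may be
    regurgitating memorized training text rather than generating freely.
    The threshold of 2.0 is an illustrative choice, not from the paper."""
    ratio = perplexity(reference_log_probs) / perplexity(target_log_probs)
    return ratio > threshold

# A sample the target model is unusually confident about (toy numbers):
memorized = is_suspicious(
    target_log_probs=[-0.1, -0.2, -0.1, -0.15],    # very low perplexity
    reference_log_probs=[-2.5, -3.0, -2.8, -2.7],  # high perplexity
)
# A sample both models find about equally likely:
ordinary = is_suspicious(
    target_log_probs=[-1.5, -1.6, -1.4],
    reference_log_probs=[-1.6, -1.5, -1.5],
)
# memorized -> True, ordinary -> False
```

The key design idea is that a low target-model perplexity alone is not enough (common boilerplate is also low-perplexity everywhere); it is the *gap* between the two models that points at memorization.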
A total of 600,000 samples were generated by querying GPT-2 with three different sampling strategies. Each sample contains roughly 200 words (256 tokens) on average.
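As an illustration of how such samples can be drawn from a language model, here is a minimal top-k sampler over a toy logit vector. Top-k sampling is one common decoding strategy; the paper's three sampling setups are not specified here, so this sketch should be read as generic decoding machinery, not the authors' exact configuration.

```python
import math
import random

def top_k_sample(logits, k, rng):
    """Sample a token id from the k highest-scoring candidates,
    renormalizing their probabilities with a softmax."""
    # Keep the k largest logits (id, score) pairs.
    top = sorted(enumerate(logits), key=lambda p: p[1], reverse=True)[:k]
    # Numerically stable softmax over the surviving candidates.
    m = max(score for _, score in top)
    weights = [math.exp(score - m) for _, score in top]
    ids = [token_id for token_id, _ in top]
    return rng.choices(ids, weights=weights, k=1)[0]

# Toy vocabulary of 5 tokens with fake logits.
logits = [2.0, 0.5, -1.0, 1.5, -2.0]
rng = random.Random(0)
token = top_k_sample(logits, k=2, rng=rng)
# With k=2, the sampled token is always one of the two best ids, 0 or 3.
```

Restricting sampling to the top k candidates keeps generations fluent while still allowing enough randomness to produce 600,000 distinct samples.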
From these, 1,800 samples with abnormally high likelihood were selected for manual inspection. Of these 1,800 samples, 604 contained text reproduced verbatim from the training set.
Problematic Data Memorization
The model re-generated lists of news headlines, pieces of software logs, entire software licenses, Donald Trump speeches, passages from the Bible and Quran, snippets of source code, the first 800 digits of pi, and much more! While some forms of memorization, such as memorizing the digits of pi, are relatively benign, others are much more problematic.
Memorization of Personally Identifiable Information
Making matters worse, the team found numerous cases of GPT-2 generating memorized personal information in contexts that can be deemed offensive or otherwise inappropriate. For example, GPT-2 generated fictitious IRC conversations about transgender rights between two real users.
In this conversation, the specific usernames appear only twice on the entire Web, both times in private IRC logs leaked online as part of the Gamergate harassment campaign.
One can argue that memorizing personal data does not constitute "appropriate security," and that the data's implicit inclusion in the outputs of downstream systems is incompatible with the original purpose of data collection.
Beyond data-misuse violations, misrepresenting someone's personal information in improper contexts also touches on existing privacy regulations guarding against defamation or false-light torts. Similarly, misrepresenting a product's or company's name could violate trademark laws.
Memorization of Copyrighted Data
Copyrighted text is another type of content that the model memorizes.
Memorization of Books
GPT-3 is a model 100 times larger than GPT-2. The paper shows that larger language models memorize more, so we expect GPT-3 to memorize an even larger amount of data.
The team prompted GPT-3 with the beginning of Chapter 3 of Harry Potter and the Philosopher's Stone. The model correctly reproduced about one full page of the book (about 240 words) before making its first mistake.
Memorization of Code
Language models can also memorize other types of copyrighted data, such as source code. For instance, GPT-2 can output 264 lines from the Bitcoin client (with six minor mistakes).
The team also found one example where GPT-2 may output a complete file.
The above are just a few examples of the copyrighted content that the model memorized from its training set. It must also be noted that while source code and books carry explicit copyright licenses, most Internet content is automatically copyrighted under US law.
Does Training Language Models Infringe on Copyright?
Given that language models memorize and regurgitate copyrighted content, does that mean they constitute copyright infringement? Legal scholars have long debated the legality of training models on copyrighted data.
The issue of data memorization certainly has an important role to play in this debate. In response to a request-for-comments from the US Patent Office, many parties argued in favor of characterizing ML as fair use, in part because ML models are assumed not to emit memorized data.
Yet large language models certainly can produce large portions of memorized copyrighted data, including specific documents in their entirety.
While this defense of fair use does not hinge solely on the assumption that models do not memorize their training data, the above findings certainly seem to weaken this line of argument. Finally, the answer to this question might depend on how the model's outputs are used.