Common Corpus: A Large Public Domain Dataset for Training LLMs

In the fast-moving field of artificial intelligence, a longstanding debate asks whether copyrighted material is needed to train top AI models. OpenAI’s assertion to the UK Parliament in 2023 that training such models without copyrighted content was ‘impossible’ drew wide attention across the industry, amid legal battles and ethical questions. However, recent developments have challenged this conventional wisdom, offering evidence that large language models can be trained without the contentious use of copyrighted material.

The Common Corpus initiative has emerged as the largest public domain dataset for training LLMs. This international collaboration, led by Pleias and involving researchers in LLM pretraining, AI ethics, and cultural heritage, has challenged the status quo and ignited a new era of AI practices. This multilingual and diverse dataset shows the potential of training LLMs without copyright concerns, marking a significant shift in the AI landscape.


Fairly Trained, a non-profit in the AI industry, has awarded its first certification for an LLM built without copyright infringement, a model known as KL3M. KL3M was developed by 273 Ventures, a Chicago-based legal tech consultancy startup. The certification process is overseen by Fairly Trained’s CEO, Ed Newton-Rex, who stated that “there is no fundamental reason why someone couldn’t train an LLM fairly.”

KL3M was trained on the Kelvin Legal DataPack, a dataset curated by 273 Ventures that includes thousands of legal documents reviewed to comply with copyright law. At around 350 billion tokens, it is smaller than the datasets compiled by OpenAI and others that have scraped the internet, yet the company reports that the model performs well. Jillian Bommarito, a cofounder of 273 Ventures, attributes the success of the KL3M model to the rigorous vetting process applied to the data. Curated datasets like this may help tailor AI models more precisely to their designated tasks. 273 Ventures now offers spots on a waitlist for clients who want access to the data.

Researchers developing the Common Corpus assembled a text collection roughly equivalent in size to the data used to train OpenAI’s GPT-3 model and made it available on the open-source AI platform Hugging Face. Although the only LLMs Fairly Trained has certified are those of 273 Ventures, the emergence of Common Corpus and KL3M signals a shift in the AI landscape. Advocates for fairer AI, particularly for artists affected by data scraping, see these initiatives as pivotal in challenging the norm. Fairly Trained’s recent certifications, which include the Spanish voice-modulation startup VoiceMod and the heavy-metal AI band Frostbite Orckings, extend beyond LLMs and hint at a broader scope for AI certification.

While the Kelvin Legal DataPack, the training dataset built by 273 Ventures, is a valuable resource, datasets of this kind also have limitations. Much of the public domain data available is outdated, especially in regions like the US, where copyright protection often extends beyond 70 years from the author’s death. A dataset built from such material may therefore not be suitable for grounding an AI model in current affairs.

