Large Language Model (LLM) Training Data Is Running Out. How Close Are We To The Limit?

In the fast-moving fields of Artificial Intelligence and Data Science, the volume and accessibility of training data largely determine the capabilities and potential of Large Language Models (LLMs). These models learn and refine their language understanding from very large volumes of text.

A recent tweet from Mark Cummins discusses how close we are to exhausting the global supply of text data for training these models, given the exponential growth in data consumption and the demanding requirements of next-generation LLMs. To explore this question, we survey the text sources currently available across different media and compare them with the growing needs of sophisticated AI models.

  1. Web Data: The English text portion of the FineWeb dataset alone, which is a subset of the Common Crawl web data, contains about 15 trillion tokens. The corpus can double in size when high-quality non-English web content is added.
  1. Code Repositories: Publicly available code, such as the code compiled in the Stack v2 dataset, contributes approximately 0.78 trillion tokens. This may look small next to other sources, but the total amount of code worldwide is estimated at tens of trillions of tokens.
  1. Academic Publications and Patents: Academic publications and patents together total approximately 1 trillion tokens, a sizable and distinct subset of textual data.
  1. Books: Digital book collections from sources such as Google Books and Anna’s Archive contain over 21 trillion tokens, a massive body of text. Counting every distinct book in the world raises the total to about 400 trillion tokens.
  1. Social Media Archives: User-generated content on platforms such as Weibo and Twitter together accounts for roughly 49 trillion tokens. Facebook alone stands out with about 140 trillion tokens. This is a significant resource, but it is mostly out of reach because of privacy and ethical concerns.
  1. Audio Transcripts: Transcribing publicly accessible audio from sources such as YouTube and TikTok would add around 12 trillion tokens to the training corpus.
  1. Private Communications: Emails and stored instant messages together amount to a massive volume of text, roughly 1,800 trillion tokens. Access to this data is restricted, and using it raises privacy and ethical questions.
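
To put the estimates in the list above side by side, here is a minimal Python sketch that tallies them. The token counts are the ones quoted in the list; the dictionary keys and the split into "accessible" and "restricted" sources are assumptions made for illustration, and the raw sum does not filter for text quality, so it is not directly comparable to any figure for high-quality text.

```python
# Token estimates in trillions, as quoted in the list above.
# The True/False flag marks whether this sketch treats the source as
# accessible for training; that grouping is an illustrative assumption.
SOURCES = {
    "web_english_fineweb": (15.0, True),
    "code_stack_v2": (0.78, True),
    "academic_and_patents": (1.0, True),
    "digital_books": (21.0, True),
    "audio_transcripts": (12.0, True),
    "twitter_and_weibo": (49.0, False),
    "facebook": (140.0, False),
    "private_email_and_messages": (1800.0, False),
}


def total_tokens(sources, accessible):
    """Sum the token estimates (in trillions) for one group of sources."""
    return sum(
        count
        for count, is_accessible in sources.values()
        if is_accessible == accessible
    )


accessible_total = total_tokens(SOURCES, accessible=True)
restricted_total = total_tokens(SOURCES, accessible=False)

print(f"Accessible sources: {accessible_total:.2f} trillion tokens")
print(f"Restricted sources: {restricted_total:.2f} trillion tokens")
print(f"Restricted / accessible: {restricted_total / accessible_total:.1f}x")
```

Under this grouping, the restricted sources hold about 40 times as many tokens as the accessible ones, which illustrates why the bulk of the world's text is not practically available for training.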

Current LLM training datasets are approaching the 15 trillion token level, which is roughly the amount of high-quality English text available, so further growth faces ethical and logistical obstacles. Drawing on other resources such as books, audio transcriptions, and corpora in other languages could bring modest gains, possibly raising the ceiling on usable, high-quality text to 60 trillion tokens.

However, token counts in the private data stores held by companies such as Google and Facebook run into the quadrillions, and that data lies outside the bounds of ethical business practice. Because ethically usable text is limited and private data is off limits, the future course of LLM development depends on synthetic data, which makes data synthesis a key direction for AI research.

In conclusion, growing data needs combined with limited text resources create an urgent need for new approaches to LLM training. As existing datasets approach saturation, synthetic data becomes increasingly important for overcoming the limits of available training data. This shift shows how AI research is changing and calls for a deliberate turn towards synthetic data generation in order to sustain progress and remain ethically compliant.

Tanya Malhotra is a final year undergrad from the University of Petroleum & Energy Studies, Dehradun, pursuing BTech in Computer Science Engineering with a specialization in Artificial Intelligence and Machine Learning.
She is a Data Science enthusiast with good analytical and critical thinking, along with an ardent interest in acquiring new skills, leading groups, and managing work in an organized manner.
