Hugging Face Introduces Cosmopedia To Create Large-Scale Synthetic Data For Pre-Training

Datasets for supervised fine-tuning and instruction-tuning were traditionally built by hiring human annotators, a slow and expensive process. Because of the high cost, only a few well-resourced groups in the field could create such comprehensive datasets. That has changed in recent months: numerous high-quality synthetic fine-tuning datasets have been developed, most commonly with GPT-3.5 and GPT-4.

The Phi models developed by Microsoft were pioneers in this area; they relied heavily on synthetic data for training. These models outperformed larger models trained for longer on web datasets. With over 617k downloads in the last 30 days, Phi-2 is among the 20 most popular models on the Hugging Face Hub.


Beyond the fact that very little is known about how the Phi datasets were created, another drawback is their reliance on proprietary models to produce the data. Researchers from Hugging Face introduce Cosmopedia, a dataset of synthetic textbooks, blog posts, stories, and WikiHow articles generated by Mixtral-8x7B-Instruct-v0.1. It is the largest open synthetic dataset to date, with over 25 billion tokens and 30 million files.

While creating synthetic data may appear simple, it becomes very difficult to scale up while preserving diversity, which is critical for maximum performance. In this work, the team generated over 30 million Cosmopedia prompts covering hundreds of subjects with a duplicate content rate of less than 1%.

Cosmopedia’s ultimate goal is to provide an enormous amount of comprehensive, high-quality synthetic data. To construct Cosmopedia’s prompts, the researchers combined two methods: conditioning on web data and conditioning on curated sources. They called the original set of information used to build these conditions “seed data.”

Curated Sources: Subjects come from trusted educational resources, including OpenStax, WikiHow, Stanford courses, and Khan Academy. Although this strategy produces high-quality content, its key shortcoming is that it cannot scale.

By taking advantage of the variability in audience and generation style, it is possible to generate samples from a single topic in different formats (e.g., academic textbook vs. blog post) and for different audiences (e.g., young children vs. college students).
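Varying the format and audience for a single topic can be sketched as simple prompt templating. The template wording and option lists below are illustrative assumptions, not Cosmopedia’s actual prompts:

```python
# Sketch: cross one seed topic with several formats and audiences to
# multiply the number of distinct prompts. Wording is invented for
# illustration; Cosmopedia's real prompt templates differ.
from itertools import product

FORMATS = ["an academic textbook chapter", "a blog post", "a WikiHow-style article"]
AUDIENCES = ["young children", "high school students", "college students"]

def build_prompts(topic):
    """Generate one prompt per (format, audience) pair for a topic."""
    return [
        f"Write {fmt} about {topic}, aimed at {aud}."
        for fmt, aud in product(FORMATS, AUDIENCES)
    ]

prompts = build_prompts("photosynthesis")
print(len(prompts))  # 3 formats x 3 audiences = 9 distinct prompts
```

Even this toy cross-product turns one curated topic into nine distinct generations, which is how a limited pool of curated subjects can still yield diverse samples.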

Web Data: With web data accounting for more than 80% of Cosmopedia’s prompts, it was clear that this approach was the most scalable. Using a dataset similar to RefinedWeb, the researchers organized millions of web samples into 145 clusters. For each cluster, they determined its topic by giving Mixtral extracts from 10 randomly selected samples and asking it to identify their common topic.
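The article names the text-clustering repository but not its internals. As a toy illustration of the idea only, the sketch below groups documents with a tiny hand-rolled k-means over word counts; the real pipeline uses learned embeddings, far more documents, and 145 clusters, and every name here is invented:

```python
# Toy topic clustering: bag-of-words vectors plus a minimal k-means.
# Stand-in for the real embedding-based text-clustering pipeline.
from collections import Counter

def embed(doc):
    # Word counts as a crude substitute for learned embeddings.
    return Counter(doc.lower().split())

def dist(a, b):
    return sum((a[w] - b[w]) ** 2 for w in set(a) | set(b))

def mean(vectors):
    total = Counter()
    for v in vectors:
        total.update(v)
    return Counter({w: n / len(vectors) for w, n in total.items()})

def kmeans(docs, init, rounds=5):
    vecs = [embed(d) for d in docs]
    centers = [vecs[i] for i in init]  # deterministic init for reproducibility
    labels = []
    for _ in range(rounds):
        labels = [min(range(len(centers)), key=lambda c: dist(v, centers[c]))
                  for v in vecs]
        for c in range(len(centers)):
            members = [v for v, lbl in zip(vecs, labels) if lbl == c]
            if members:
                centers[c] = mean(members)
    return labels

docs = [
    "the cell membrane and mitochondria in biology",
    "mitochondria produce energy for the cell",
    "stock market prices and trading volume",
    "trading stocks and market prices today",
]
labels = kmeans(docs, init=[0, 2])  # seed one center per apparent topic
```

With these tiny documents, word overlap is enough to put the two biology extracts in one cluster and the two finance extracts in the other; the per-cluster topic labeling with Mixtral would then run on extracts from each group.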

After reviewing the clusters, they eliminated those that did not meet the standards for instructional value. Obituaries, explicit adult content, and celebrity gossip are examples of content that was removed. They then constructed prompts by instructing the model to generate a textbook related to a given web sample, using the topic of that sample’s cluster.

The team conditioned the prompts on the topic only half the time and modified the audience and generation styles to promote diversity and account for any incomplete topic labeling. They used this method to create 23 million prompts in the end.
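Conditioning on the topic “only half the time” amounts to a coin flip per prompt, alongside random choices of style and audience. A hedged sketch, where the prompt wording and option lists are invented for illustration:

```python
# Sketch: randomized prompt construction for web-seeded generations.
# Styles, audiences, and wording are illustrative assumptions.
import random

STYLES = ["textbook", "blog post", "story"]
AUDIENCES = ["young children", "college students", "professionals"]

def make_prompt(web_extract, topic, rng):
    style = rng.choice(STYLES)
    audience = rng.choice(AUDIENCES)
    # Condition on the cluster topic only half the time, to hedge
    # against incomplete or noisy topic labeling.
    if rng.random() < 0.5:
        return (f"Write a {style} for {audience} about {topic}, "
                f"related to this extract: {web_extract}")
    return f"Write a {style} for {audience} related to this extract: {web_extract}"

rng = random.Random(42)
print(make_prompt("Tides are caused by the moon...", "astronomy", rng))
```

Because every random draw multiplies the space of possible prompts, the same web extract can seed many distinct generations, which is what keeps the duplicate-content rate low at scale.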

Preliminary evaluations of models trained on the generated textbooks revealed a lack of the basic knowledge and common sense typical of a primary-school curriculum. To tackle this, the researchers used texts from the UltraChat and OpenHermes2.5 instruction-tuning datasets, which cover a wide variety of topics, as seed data for prompts that generate stories incorporating common sense and everyday knowledge.

The team used the text-clustering repository to apply topic clustering to the web data used in Cosmopedia prompts. To generate the 25 billion tokens of synthetic content with Mixtral-8x7B-Instruct-v0.1, they used the llm-swarm library, a scalable synthetic-data-generation tool that works with local LLMs or inference endpoints on the Hugging Face Hub and is compatible with the vLLM and TGI inference libraries. Mixtral-8x7B was deployed locally with TGI on H100 GPUs in the Hugging Face Science cluster. Generating Cosmopedia required more than 10,000 GPU hours.

The team highlights that, because this is synthetic data, the seed samples or the model’s training data could be contaminated with benchmark samples. To address this, they run a decontamination pipeline that removes test-benchmark samples from their dataset.

As in Phi-1, they use a 10-gram overlap to detect potentially contaminated samples. After retrieving candidates, the researchers compare each dataset sample to the benchmark using difflib.SequenceMatcher and remove the sample if the ratio of the matched substrings’ length to the length of the benchmark sample exceeds 0.5. This decontamination procedure was applied for all the benchmarks evaluated with the Cosmo-1B model: MMLU, HellaSwag, PIQA, SIQA, Winogrande, OpenBookQA, ARC-Easy, and ARC-Challenge.
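Both stages of this filter can be sketched with Python’s standard library. The 10-gram candidate retrieval is reduced to a plain set intersection here, and the exact details of Cosmopedia’s pipeline may differ:

```python
# Sketch of the two-stage decontamination check:
# 1) 10-gram overlap flags candidate samples;
# 2) difflib.SequenceMatcher measures matched-substring length
#    relative to the benchmark sample, with a 0.5 threshold.
from difflib import SequenceMatcher

def ngrams(text, n=10):
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_candidate(sample, benchmark, n=10):
    # Any shared word-level 10-gram makes the sample a candidate.
    return bool(ngrams(sample, n) & ngrams(benchmark, n))

def is_contaminated(sample, benchmark, threshold=0.5):
    matcher = SequenceMatcher(None, sample, benchmark, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(benchmark) > threshold

benchmark = ("the quick brown fox jumps over the lazy dog "
             "near the river bank today")
copied = "intro text " + benchmark + " outro text"
print(is_candidate(copied, benchmark), is_contaminated(copied, benchmark))
```

The cheap n-gram pass keeps the expensive character-level SequenceMatcher comparison restricted to a small set of candidates, which is what makes the check tractable over 30 million files.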

For data deduplication and tokenization, they used the datatrove package. Model training was carried out using nanotron, and assessment was done using lighteval.
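The article does not show datatrove’s API, but at its simplest, the exact deduplication such a package performs reduces to hashing normalized documents. A minimal stdlib sketch, not datatrove’s actual implementation:

```python
# Sketch: exact deduplication by hashing whitespace/case-normalized text.
# Real pipelines (e.g., datatrove) also support fuzzy dedup; this is
# only the simplest exact variant.
import hashlib

def normalize(text):
    return " ".join(text.lower().split())

def dedupe(docs):
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

docs = ["Hello  world", "hello world", "Goodbye world"]
print(dedupe(docs))  # ['Hello  world', 'Goodbye world']
```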

The model outperforms TinyLlama 1.1B on MMLU, ARC-Easy, OpenBookQA, and ARC-Challenge, and it is on par with Qwen-1.5-1B on OpenBookQA and ARC-Challenge. Nevertheless, there are noticeable performance gaps compared to Phi-1.5, suggesting Phi’s synthetic data is of higher quality; these gaps could be attributed to the LLM used for generation, the topic coverage, or the prompts.

Dhanshree Shenwai is a Computer Science Engineer with solid experience in FinTech companies covering the Financial, Cards & Payments, and Banking domains, and a keen interest in applications of AI. She is enthusiastic about exploring new technologies and advancements that make everyone’s life easier.
