Web-Scale Data Has Driven Incredible Progress in AI, But Do We Really Need All That Data? Meet SemDeDup: A New Method to Remove Semantic Duplicates in Web Data With Minimal Performance Loss

The growth of self-supervised learning (SSL) applied to ever-larger models and unlabeled datasets has been a major driver of recent progress in machine learning. In particular, many contemporary large datasets are scraped at web scale and are typically unfiltered, save for NSFW filtering. LAION, for example, is a public multi-modal dataset containing 5 billion image/text pairs.

Growing interest in scaling laws, which forecast how a model's performance changes with more data and/or parameters, has shown that test error often scales as a power law with the amount of data. However, power-law scaling cannot be maintained indefinitely, since it rapidly reaches a point of diminishing marginal returns, where ever more data is needed for ever smaller performance improvements. Improving data efficiency would therefore have a significant impact: with the same computational budget, models could reach the same performance much faster, or reach better performance.

These findings have motivated recent studies proposing that, with an ideal data-ranking metric, pruning training data according to an intelligent criterion could break the power-law scaling with respect to data and enable exponential scaling instead. Yet little is known about the best ways to select data. Such methods may prioritize one of three groups of redundant data, roughly ranked by the difficulty of identifying them:

  1. Perceptual duplicates are pairs of data points that are virtually indistinguishable to the naked eye.
  2. Semantic duplicates have nearly identical information content but are easily distinguishable to the human eye.
  3. Semantic redundancy differs from semantic duplication in that the data points do not depict the same thing; nonetheless, the information they contain may still be largely repetitive.

In contrast to the preceding types, which merely supply little or no information, misleading data generate a negative or detrimental training signal, so deleting them improves performance rather than having no effect at all.

SemDeDup, proposed by researchers from Meta AI and Stanford University, is a computationally tractable and straightforward method for detecting semantic duplicates. 

Semantically identical data that would be difficult to find using simple deduplication algorithms are the primary focus of this effort. Because input-space distance measures are unlikely to reveal semantic duplicates, finding such data points is difficult. The researchers overcame this restriction by embedding each example with a publicly available pre-trained model and applying k-means clustering to the embeddings. Within each cluster, they then identified groups of nearest neighbors whose pairwise similarity exceeds a given cutoff and kept only one example from each group.
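The pipeline described above can be sketched in a few lines, assuming embeddings from a pre-trained encoder are already computed. This is a minimal illustration, not the authors' implementation; the cluster count and similarity threshold are illustrative parameters:

```python
import numpy as np
from sklearn.cluster import KMeans

def semdedup(embeddings, n_clusters, threshold=0.95, seed=0):
    """Greedy SemDeDup-style pruning on pre-computed embeddings.

    embeddings: (n, d) array from a pre-trained encoder.
    Returns the sorted indices of the examples to keep.
    """
    # Normalize rows so dot products equal cosine similarities.
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

    # Cluster first so the pairwise duplicate search only happens
    # within clusters, keeping the cost tractable at web scale.
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(emb)

    keep = []
    for c in range(n_clusters):
        idx = np.where(labels == c)[0]
        sims = emb[idx] @ emb[idx].T  # pairwise cosine similarity
        kept_pos = []
        for j in range(len(idx)):
            # Keep j only if it is not a near-duplicate of an
            # already-kept example in the same cluster.
            if all(sims[j, k] < threshold for k in kept_pos):
                kept_pos.append(j)
        keep.extend(idx[kept_pos])
    return np.sort(np.array(keep))
```

With the threshold near 1, only near-identical embeddings are removed; lowering it also prunes semantically redundant examples, trading dataset size against information retained.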

By omitting redundant data, training can proceed much more quickly. Alternatively, by removing fewer duplicates, one can exceed baseline performance, especially on out-of-distribution (OOD) tasks, while still obtaining a speedup, albeit a smaller one than for matched performance. The LAION training set was shrunk by half with almost no performance loss, leading to faster learning and the same or better out-of-distribution results. The study also applies SemDeDup to C4, a large text corpus, achieving efficiency gains of 15% while often outperforming prior SoTA deduplication methods.

Getting rid of semantic duplication is a good starting point for minimizing data size, but it’s not the only option. The team’s goal is to eventually have much smaller datasets, reducing training time and making massive models more accessible.


Check out the Paper. All Credit For This Research Goes To the Researchers on This Project.