As researchers propose new goals, larger models, and unique benchmarks, the size, variety, and number of publicly available Natural Language Processing (NLP) datasets have expanded rapidly. Curated datasets are used for assessment and benchmarking; supervised datasets are used for training and fine-tuning models; and massive unsupervised datasets are required for pretraining and language modeling. Beyond annotation approach, each dataset type also differs in scale, granularity, and structure.
New dataset paradigms have historically been critical in propelling the progress of NLP. Today's NLP systems are built with a pipeline that draws on a variety of datasets with widely varying dimensions and levels of annotation: separate datasets are used for pretraining, fine-tuning, and benchmarking. As a result, the number of datasets used in the NLP community has increased dramatically, raising significant issues of interface standardization, versioning, and documentation. A practitioner should be able to work with many datasets without juggling many interfaces, and practitioners working with the same dataset should know they are working with the same version. Interfaces should also remain unchanged across scales, whether for a small dataset such as Climate Fever (1k data points), a medium-scale one such as Yahoo Answers (1M), or the entire PubMed corpus (79B).
Datasets is a modern community library created to support the NLP ecosystem. It aims to standardize end-user interfaces, versioning, and documentation while providing a lightweight front-end that works as well for tiny datasets as for internet-scale corpora. The library's design follows a distributed, community-driven approach to adding datasets and documenting usage. After a year of development, the library features over 650 unique datasets, has over 250 contributors, and has supported many original cross-dataset research initiatives and shared tasks.
Datasets is a community library that aims to solve the problems of data management and access while promoting community culture and norms. The project is community-built, with hundreds of contributors across languages, and each dataset is tagged and documented. Every dataset uses a standard tabular format that is versioned and cited; datasets are computation- and memory-efficient by default and work seamlessly with tokenization and featurization.
Datasets differs from other recent dataset-versioning efforts on several levels. The project is independent of any modeling framework and provides a general-purpose tabular API. It focuses on NLP and offers specialized types and structures for language constructs. Through the dataset hub and data cards, it aims to surface a long tail of datasets for many tasks and languages, promoting community management and documentation.
Datasets does not host the underlying raw data; instead, it provides distributed access to data hosted by the original authors. Each dataset has a community-created builder module that converts raw data, such as text or CSV, into a standardized dataset interface representation. Internally, each dataset is represented as a table with typed columns, and the type system includes a range of standard and NLP-targeted types. Apache Arrow, a cross-language columnar data format, forms the foundation for Datasets: Arrow's local caching mechanism lets datasets be backed by a memory-mapped on-disk cache for fast lookup.
Once downloaded, the library gives access to typed data with minimal preparation, and it includes sorting, shuffling, splitting, and filtering operations for manipulating datasets. When a dataset is requested, it is downloaded from its original host; this invokes dataset-specific builder code that transforms the text into a typed tabular format matching the feature schema and caches the table. The user is then handed a memory-mapped typed table and can run arbitrary vectorized code over it and store the results for further processing, such as tokenization.
Some datasets are so massive that they cannot even fit on disk. Datasets therefore includes a streaming model that buffers these datasets on the fly; the core map primitive is supported in this mode and operates on each batch of data as it is streamed. Data streaming recently made possible research into distributed training of a large open NLP model. Datasets also provides tools for quickly creating and using a search index over any dataset, using either FAISS or ElasticSearch to build the index. This interface makes it simple to find nearest neighbors with textual or vector queries.
Hugging Face Datasets is a community-driven open-source library that standardizes NLP dataset processing, distribution, and documentation. The core library is designed to be simple and fast to load, exposing the same interface for datasets of all sizes. With 650 datasets from over 250 contributors, it makes standard datasets easy to use, has encouraged new cross-dataset use cases in NLP, and offers sophisticated functionality for tasks such as indexing and streaming large datasets.