A new study from Google researchers suggests modifying the conventional transformer architecture to process byte sequences in natural language processing (NLP). The resulting byte-level models are competitive while effectively balancing the computational cost trade-offs of contemporary large language models.
Most NLP pipelines begin with tokenization, which splits text into a sequence of tokens as a preprocessing step. However, tokenization struggles with typos, irregularities in spelling and capitalization, morphological changes, and out-of-vocabulary words.
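The out-of-vocabulary problem can be illustrated with a toy word-level tokenizer (the vocabulary below is invented for illustration, not taken from any real model):

```python
# Toy word-level tokenizer with a fixed, invented vocabulary.
# A typo produces a word the vocabulary has never seen, so all
# information about it collapses into a single <unk> ID.
vocab = {"hello": 0, "world": 1, "<unk>": 2}

def tokenize(text):
    return [vocab.get(word, vocab["<unk>"]) for word in text.split()]

assert tokenize("hello world") == [0, 1]
assert tokenize("helo world") == [2, 1]  # typo maps to <unk>
```

A byte-level model avoids this failure mode entirely, since every possible string is representable as bytes.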
Studies suggest token-free models to address these problems. Token-free models operate directly on raw text, storing it as a sequence of bytes, which lets a model process arbitrary text sequences. However, byte sequences are much longer than the corresponding word-level token sequences, making this approach computationally heavy.
Researchers at Google introduce ByT5, a token-free variant of multilingual T5. ByT5 simplifies the NLP pipeline by doing away with vocabulary building, text preprocessing, and tokenization. In their recent paper, the team demonstrates that ByT5 operates directly on UTF-8 bytes instead of the subword vocabulary used by most pretrained language models. The proposed architecture requires no text preprocessing and can process byte sequences without a dramatic increase in computational cost.
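Mapping text to model inputs then needs no tokenizer at all, only UTF-8 encoding. A minimal sketch (not the official ByT5 code; the special-token offset is an assumption for illustration):

```python
def text_to_byte_ids(text, num_special=3):
    # Reserve a few low IDs for special tokens such as padding and EOS,
    # as byte-level models typically do (the offset of 3 is assumed here).
    return [b + num_special for b in text.encode("utf-8")]

ids = text_to_byte_ids("naïve")
# "ï" encodes as two UTF-8 bytes, so 5 characters yield 6 byte IDs.
assert len(ids) == 6
```

Note that the sequence is longer than the character count, which is exactly the length blow-up the paper's architecture is designed to keep affordable.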
ByT5 builds on mT5 (Multilingual Text-to-Text Transfer Transformer), a token-based model trained on a large corpus of unlabeled text that has achieved SOTA performance across various multilingual NLP tasks. The team made a small set of changes to make mT5 token-free, chosen so that they do not dramatically increase computational cost. Instead of using the SentencePiece vocabulary, UTF-8 bytes are fed directly into the model without any text preprocessing and embedded to the model’s hidden size.
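Because there are only 256 possible byte values (plus a handful of special IDs), the embedding table becomes tiny compared to a subword vocabulary. A rough NumPy sketch of this lookup, with toy sizes chosen for illustration rather than ByT5's real hidden dimensions:

```python
import numpy as np

VOCAB = 256 + 3   # 256 byte values plus a few assumed special IDs
HIDDEN = 8        # toy hidden size; real ByT5 models use much larger values

rng = np.random.default_rng(0)
embedding = rng.normal(size=(VOCAB, HIDDEN))

# Shift raw bytes past the special IDs, then look up their embeddings.
byte_ids = [b + 3 for b in "hi".encode("utf-8")]
hidden_states = embedding[byte_ids]   # shape: (sequence_length, HIDDEN)
assert hidden_states.shape == (2, 8)
```

The design choice here is that the savings from the small embedding table can be spent elsewhere in the network, e.g. on a heavier encoder.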
The pre-training task is modified to reuse the final 100 byte IDs as sentinels instead of adding 100 new tokens. The researchers also mask longer byte spans by increasing the mean mask span length.
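The sentinel reuse can be sketched as a simplified span-corruption step (the ID-space size and single-span masking below are assumptions for illustration, not the paper's exact procedure):

```python
SENTINEL = 259 - 1  # hypothetical: sentinels count down from the top of a 259-ID space

def corrupt(ids, span_start, span_len, sentinel=SENTINEL):
    # Replace a span of byte IDs in the input with a single sentinel ID;
    # the target pairs that sentinel with the bytes that were removed.
    inputs = ids[:span_start] + [sentinel] + ids[span_start + span_len:]
    targets = [sentinel] + ids[span_start:span_start + span_len]
    return inputs, targets

inputs, targets = corrupt([10, 11, 12, 13, 14], span_start=1, span_len=2)
assert inputs == [10, 258, 13, 14]
assert targets == [258, 11, 12]
```

Since a byte span covers fewer characters than a subword span of the same length, masking longer spans keeps the prediction task comparably difficult.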
Using ablations, the researchers demonstrate that encoder-decoder models with heavier encoders perform much better on both classification and generation tasks. They also find that the pre-training task benefits from masking longer byte spans. Although ByT5 models are pretrained on four times less text than mT5, they still achieve remarkable gains. From this, the team concludes that byte-level models are more data-efficient learners.
The team evaluates the performance of the modified transformer architecture on byte-level processing with respect to compute cost trade-offs. For this, they compare ByT5 against mT5 on a wide range of tasks from standard English and multilingual NLP benchmarks.
The results show that ByT5 is competitive with parameter-matched mT5 models in downstream task quality. ByT5 outperforms mT5 across model sizes and tasks, including generative tasks and multilingual tasks with in-language labels, even in noisy environments. The team also evaluated cross-lingual understanding on the XTREME benchmark by comparing F1/EM scores on the question answering task. ByT5 achieves excellent performance on all tasks, including English classification and generation.