GPT-J, a six-billion-parameter natural language processing (NLP) AI model based on GPT-3, has been open-sourced by a team of EleutherAI researchers. The model was trained on an 800GB open-source text dataset and performs comparably to a GPT-3 model of similar size.
The model was trained on EleutherAI’s Pile dataset using a Google Cloud TPU v3-256 pod, which took about five weeks. On standard NLP benchmark tasks, GPT-J achieves accuracy similar to what OpenAI reported for the 6.7B-parameter version of GPT-3. EleutherAI’s release includes the model code, pre-trained weight files, a Colab notebook, and a demo web page.
In 2018, OpenAI published the first paper on generative pre-trained transformers (GPT), an unsupervised learning model that achieved state-of-the-art results on many NLP tasks. OpenAI announced GPT-2, a 1.5B-parameter model, in early 2019. In 2020, OpenAI announced GPT-3, which has 175B parameters, but did not release the trained model files. Instead, OpenAI offered an API that lets developers integrate the model into their programs through web service calls.
EleutherAI released the 2.7B-parameter GPT-Neo model, its first implementation of a GPT-like system, in March 2021. GPT-Neo was built in TensorFlow and trained on TPUs with the Mesh TensorFlow parallelism library. The team also started work on GPT-NeoX, a GPU-based implementation built on Microsoft’s DeepSpeed; while the code is open-sourced, no model files are currently available.
GPT-J, the most recent model, was trained with Mesh-Transformer-JAX, a new library. Rather than a dedicated deep-learning framework such as TensorFlow, the library is built on Google’s JAX linear algebra framework. Compared with the 2.7B-parameter GPT-Neo model, GPT-J is reported to improve training efficiency by 125 percent. In terms of zero-shot performance on various downstream tasks, GPT-J is the best-performing publicly available Transformer language model. It also allows more flexible and faster inference than its TensorFlow + TPU counterparts, and its development took far less time than earlier projects.
This project required far fewer person-hours than previous large-scale model development projects, which suggests that JAX + xmap + TPUs is a well-suited set of tools for rapid large-scale model development.
GPT-J achieves high absolute efficiency in the 6B configuration on a TPU v3-256 pod. The hardware has a theoretical maximum of 13.4 PFLOPs, and GPT-J achieves 5.4 PFLOPs when measured the way the GPT-3 paper does (ignoring attention computation and compute-memory tradeoffs such as gradient checkpointing). When these additional factors are taken into account, about 8.1 PFLOPs, or roughly 60% of the theoretical maximum, is utilized.
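The utilization figures above follow from simple division. The short Python sketch below reproduces them from the three throughput numbers quoted in this article; the constant and function names are illustrative and do not come from the GPT-J codebase.

```python
# Throughput figures quoted in the article, in PFLOPs
# (10^15 floating-point operations per second).
THEORETICAL_PEAK = 13.4  # theoretical maximum of a TPU v3-256 pod
GPT3_STYLE_RATE = 5.4    # ignores attention and gradient checkpointing
FULL_RATE = 8.1          # counts those additional computations


def utilization(achieved: float, peak: float) -> float:
    """Return achieved throughput as a fraction of peak throughput."""
    if peak <= 0:
        raise ValueError("peak must be positive")
    return achieved / peak


gpt3_style_util = utilization(GPT3_STYLE_RATE, THEORETICAL_PEAK)
full_util = utilization(FULL_RATE, THEORETICAL_PEAK)

print(f"GPT-3-style accounting: {gpt3_style_util:.1%}")  # 40.3%
print(f"Full accounting:        {full_util:.1%}")        # 60.4%
```

The second figure matches the "around 60%" stated above; the first shows that the more conservative accounting method corresponds to about 40% of the theoretical peak.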
On a TPU v3-256 pod, training GPT-J took about five weeks.
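As a rough back-of-the-envelope check, the five-week duration can be combined with the 5.4 PFLOPs rate quoted earlier to estimate the total training compute. The sketch below assumes, purely for illustration, that the rate was sustained for the entire run with no downtime, so the result is an upper-end approximation and not a figure reported by EleutherAI.

```python
SECONDS_PER_WEEK = 7 * 24 * 60 * 60

TRAINING_WEEKS = 5       # "about five weeks", from the article
SUSTAINED_RATE = 5.4e15  # 5.4 PFLOPs; assumed constant for the whole run


def total_training_flops(weeks: float, flops_per_second: float) -> float:
    """Estimate total floating-point operations for a run of the given length."""
    return weeks * SECONDS_PER_WEEK * flops_per_second


estimate = total_training_flops(TRAINING_WEEKS, SUSTAINED_RATE)
print(f"Estimated training compute: {estimate:.2e} FLOPs")  # 1.63e+22
```

Under these assumptions the run comes to roughly 1.6 × 10^22 floating-point operations; any interruptions or slower stretches would lower the real total.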
EleutherAI’s mission is to make safety research more accessible to everyone, especially “low-resource” researchers. In response to concerns about misuse of its models, the organization published a justification for the release on its blog. It argues that GPT-like models are “simple and conceptually straightforward,” which makes it impossible to keep them out of the hands of bad actors, and that many well-funded institutions, including Microsoft, NVIDIA, and Google, have already trained models much larger than GPT-3.
The GPT-J code and models are available on GitHub. An interactive demo on EleutherAI’s website showcases the model’s text generation capabilities.
GitHub repository for GPT-J: https://github.com/kingoflolz/mesh-transformer-jax
Colab Notebook: https://colab.research.google.com/github/kingoflolz/mesh-transformer-jax/blob/master/colab_demo.ipynb
Web Demo: https://6b.eleuther.ai/