This research summary article is based on the paper 'PaLM: Scaling Language Modeling with Pathways'.
In recent years, large neural networks trained for language understanding and generation have shown remarkable results across a wide range of tasks. GPT-3 demonstrated that large language models (LLMs) can be used for few-shot learning and achieve strong results without large-scale task-specific data collection or model parameter updates. More recent LLMs, including GLaM, LaMDA, Gopher, and Megatron-Turing NLG, have attained state-of-the-art few-shot performance on numerous tasks by scaling model size, using sparsely activated modules, and training on larger datasets from more diverse sources.
In a recent research paper, Google researchers introduced the Pathways Language Model (PaLM). PaLM is a 540-billion-parameter, dense decoder-only Transformer model trained with the Pathways system, which enabled efficient training of a single model across multiple TPU v4 Pods. PaLM was evaluated on hundreds of language understanding and generation tasks, and it achieved state-of-the-art few-shot performance across the board, in many cases by a large margin.
PaLM was trained on a mix of English and multilingual datasets, including high-quality web documents, books, Wikipedia articles, conversations, and GitHub code. The researchers constructed a “lossless” vocabulary that preserves all whitespace (which is critical for code), splits out-of-vocabulary Unicode characters into bytes, and divides numbers into distinct tokens, one for each digit.
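Two of these “lossless” vocabulary behaviors can be sketched in a few lines of Python. This is an illustrative toy, not PaLM's actual SentencePiece tokenizer; the function names and token format are invented for clarity.

```python
# Toy sketch of two "lossless" vocabulary behaviors described above.
# NOT PaLM's real tokenizer; names and token formats are illustrative.

def split_digits(text):
    """Split runs of digits into one token per digit; keep all other
    characters, including whitespace, grouped as-is (nothing is dropped)."""
    tokens, buf = [], []
    for ch in text:
        if ch.isdigit():
            if buf:                      # flush any pending non-digit run
                tokens.append("".join(buf))
                buf = []
            tokens.append(ch)            # each digit becomes its own token
        else:
            buf.append(ch)
    if buf:
        tokens.append("".join(buf))
    return tokens

def byte_fallback(ch):
    """Decompose an out-of-vocabulary Unicode character into UTF-8 byte
    tokens, so no input is ever unrepresentable."""
    return [f"<0x{b:02X}>" for b in ch.encode("utf-8")]
```

For example, `split_digits("x = 2048")` keeps the surrounding whitespace intact while emitting `"2"`, `"0"`, `"4"`, `"8"` as separate tokens, and `byte_fallback("€")` yields the three UTF-8 byte tokens for the euro sign.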
Language, Reasoning, and Code Tasks: Breakthrough Capabilities
PaLM demonstrates breakthrough capabilities on a variety of highly challenging tasks. Below are some examples of language understanding and generation, reasoning, and coding challenges.
Language Generation and Understanding
PaLM was evaluated on 29 widely used English natural language processing (NLP) tasks. On 28 of the 29 tasks, PaLM 540B outperformed previous large models such as GLaM, GPT-3, Megatron-Turing NLG, Gopher, Chinchilla, and LaMDA in the few-shot setting, including question-answering tasks (open-domain closed-book variant), cloze and sentence-completion tasks, Winograd-style tasks, in-context reading comprehension tasks, commonsense reasoning tasks, SuperGLUE tasks, and natural language inference tasks.
PaLM also exhibits remarkable natural language understanding and generation capabilities on several BIG-bench tasks. For example, the model can distinguish cause from effect, understand conceptual combinations in appropriate contexts, and even guess a movie from an emoji. In addition to English NLP tasks, PaLM performs well on multilingual NLP benchmarks, including translation, even though only 22% of the training corpus is non-English.
By combining model scale with chain-of-thought prompting, PaLM demonstrates breakthrough capabilities on reasoning problems that require multi-step arithmetic or commonsense reasoning. Previous LLMs, such as Gopher, saw smaller performance gains from increased model scale.
Paired with chain-of-thought prompting, PaLM 540B showed strong performance on three arithmetic and two commonsense reasoning datasets. Using 8-shot prompting, PaLM solves 58 percent of the problems in GSM8K, a benchmark of thousands of challenging grade-school math word problems, surpassing the previous top score of 55 percent, which was achieved by fine-tuning the GPT-3 175B model on a training set of 7,500 problems and combining it with an external calculator and verifier. This new score is particularly striking because it approaches the 60 percent of problems solved on average by 9-to-12-year-olds, the question set’s target demographic.
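The chain-of-thought prompting described above can be sketched as plain prompt construction: each few-shot exemplar includes a worked reasoning chain before its answer, nudging the model to reason step by step on the new question. This is a minimal, hedged sketch; the exemplar below is invented for illustration and is not taken from GSM8K or the paper.

```python
# Minimal sketch of chain-of-thought prompt construction.
# The exemplar is invented; real usage would include ~8 such exemplars.

EXEMPLARS = [
    (
        "Roger has 5 tennis balls. He buys 2 cans with 3 balls each. "
        "How many balls does he have now?",
        "Roger started with 5 balls. 2 cans of 3 balls is 6 balls. 5 + 6 = 11.",
        "11",
    ),
]

def build_cot_prompt(exemplars, question):
    """Concatenate (question, reasoning chain, answer) exemplars, then
    append the new question, leaving the answer for the model to complete."""
    parts = []
    for q, chain, answer in exemplars:
        parts.append(f"Q: {q}\nA: {chain} The answer is {answer}.")
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)
```

The prompt ends with a bare `A:`, so the model's continuation is expected to contain both the reasoning chain and the final answer, which can then be parsed out of the generated text.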
PaLM can even provide explicit explanations for scenarios that require a complex combination of multi-step logical reasoning, world knowledge, and deep language understanding. For example, it can offer high-quality explanations of novel jokes not found on the web.
Generation of Code
LLMs have also been shown to generalize well to coding tasks, including generating code from a natural language description (text-to-code), translating code between languages, and fixing compilation errors (code-to-code) [1, 2, 3, 4].
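The text-to-code setting can be illustrated with a prompt/completion pair: the model receives a natural-language description (here phrased as a function signature plus docstring) and is asked to complete the body. The example below is a hand-written, plausible completion for illustration only, not actual model output, and `count_vowels` is a hypothetical function invented for this sketch.

```python
# Hedged illustration of the text-to-code task. The prompt is a
# natural-language spec; the body below it is a hand-written stand-in
# for what a code model would be asked to generate.

PROMPT = '''def count_vowels(s):
    """Return the number of vowels in the string s."""
'''

# A completion of the kind a text-to-code model would produce:
def count_vowels(s):
    """Return the number of vowels in the string s."""
    return sum(1 for ch in s.lower() if ch in "aeiou")
```

Code-to-code tasks follow the same pattern, except the prompt itself is source code, e.g. a program in one language to translate, or a broken program to repair.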
Despite having only 5% code in its pre-training dataset, PaLM 540B displays strong performance across coding and natural language tasks in a single model. Its few-shot performance is especially impressive: it is on par with the fine-tuned Codex 12B despite training on 50 times less Python code. This finding supports earlier observations that larger models can be more sample-efficient than smaller ones because they transfer learning more effectively from other programming languages and from natural language data.
PaLM’s performance can be improved further by fine-tuning it on a Python-only code dataset, yielding a model called PaLM-Coder. On DeepFix, a code repair task where the goal is to modify initially broken C programs until they compile successfully, PaLM-Coder 540B sets a new state of the art with a compile rate of 82.1 percent. This opens opportunities for fixing more complex errors that arise during software development.
Ethical Considerations
Recent research has highlighted various potential risks associated with LLMs trained on web text. Such potential hazards should be analyzed and documented through transparent artifacts such as model cards and datasheets, which include information on intended use and testing. To that end, the paper provides a datasheet, a model card, responsible-AI benchmark results, and explicit analyses of biases and risks in the dataset and model outputs. While this analysis helps surface some of the model’s potential risks, domain- and task-specific analysis is required to fully calibrate, contextualize, and mitigate possible harms.
Conclusion and Next Steps
PaLM uses a well-studied, well-established recipe of a dense decoder-only Transformer model to train a 540-billion-parameter model efficiently across two TPU v4 Pods, demonstrating the Pathways system’s ability to scale training to thousands of accelerator chips. By pushing the limits of model scale, PaLM achieves breakthrough few-shot performance across a wide range of natural language understanding, reasoning, and code tasks.