AI-powered coding tools, which use machine learning models to generate code from input data, have attracted increasing attention. In theory, these systems can reduce the time spent writing code, as well as computational and operational costs, while producing output with minimal errors.
However, current code pre-training systems face several challenges. Most rely on either an encoder-only model similar to BERT or a decoder-only model like GPT, and either choice is suboptimal for handling both understanding and generation tasks. For example, CodeBERT needs an additional decoder for generation tasks like code summarization. Moreover, most current methods apply conventional NLP pre-training techniques to source code, treating it as a sequence of tokens like natural language (NL). This largely ignores the rich structural information in programming languages, which is vital to fully comprehend code semantics.
The Salesforce team has created and open-sourced CodeT5, an identifier-aware, unified pre-trained encoder-decoder model. So far, it has demonstrated state-of-the-art results on multiple code-related understanding and generation downstream tasks across various directions, including PL to NL, NL to PL, and one programming language to another.
CodeT5 is built on the same architecture as Google’s T5 (Text-to-Text Transfer Transformer) framework but with better code understanding. T5 is a unified model for natural language processing tasks: it reframes every task as text-to-text, where the input and output are always strings of text. This allows a single model to be applied to any task.
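To make the text-to-text framing concrete, the sketch below shows how very different tasks collapse into one string-in, string-out interface. The task prefixes and the `to_text_to_text` helper are hypothetical, for illustration only; the actual prefixes depend on each model's training setup.

```python
# Minimal sketch of the text-to-text framing: every task becomes a
# string-in / string-out problem. The prefixes below are illustrative,
# not the exact ones used by T5 or CodeT5.

def to_text_to_text(task: str, payload: str) -> str:
    """Prefix the input with a task name so one model can serve many tasks."""
    return f"{task}: {payload}"

# Very different tasks share the same interface:
summarize = to_text_to_text("summarize", "def add(a, b): return a + b")
translate = to_text_to_text("translate Java to Python",
                            "int add(int a, int b) { return a + b; }")

print(summarize)
print(translate)
```

Because every task is expressed this way, no task-specific output heads are needed; the decoder simply generates the answer as text.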
The CodeT5 research team trained the model on over 8.35 million examples, including user-written comments from open-source GitHub repositories. Training the largest and most capable version of CodeT5, with 220 million parameters, took 12 days on a cluster of 16 NVIDIA A100 GPUs with 40 GB of memory each.
CodeT5 achieves state-of-the-art (SOTA) performance on fourteen subtasks of the code intelligence benchmark CodeXGLUE, as shown in the following tables.
In terms of applications, the Salesforce team plans to use CodeT5 to build an AI-powered coding assistant for Apex developers. Below is an example of a coding assistant powered by CodeT5 with three code intelligence capabilities:
- Text-to-code generation: It can generate code based on a natural language description
- Code autocompletion: It can complete an entire function given the target function name
- Code summarization: It can generate a natural language summary of a function
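A minimal sketch of the code summarization capability, using the publicly released CodeT5 checkpoint through the Hugging Face `transformers` library, might look like the following. This assumes the `transformers` and `torch` packages and the `Salesforce/codet5-base` checkpoint; the `prepare_source` and `summarize` helpers are illustrative, and the generated text is model-dependent.

```python
# Sketch: code summarization with CodeT5 via Hugging Face transformers.
# Assumes the `transformers` and `torch` packages are installed and the
# public "Salesforce/codet5-base" checkpoint is available for download.

def prepare_source(code: str) -> str:
    """CodeT5 tasks are text-to-text, so the input is just the code as a string."""
    return code.strip()

def summarize(code: str, max_length: int = 20) -> str:
    """Load the pre-trained model and generate a summary (network required)."""
    from transformers import RobertaTokenizer, T5ForConditionalGeneration

    tokenizer = RobertaTokenizer.from_pretrained("Salesforce/codet5-base")
    model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-base")

    input_ids = tokenizer(prepare_source(code), return_tensors="pt").input_ids
    output_ids = model.generate(input_ids, max_length=max_length)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# Example (uncomment to run; this downloads the checkpoint on first use):
# print(summarize("def add(a, b):\n    return a + b"))
```

The same encoder-decoder model serves all three capabilities above; only the input text changes, which is the practical payoff of the unified text-to-text design.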
Despite all the benefits and features of Salesforce’s CodeT5, the researchers did acknowledge one major drawback: the model could encode stereotypes, such as those related to race or gender, from the text comments in the datasets used to train it.