In recent years, natural language processing (NLP) techniques have been widely adopted to solve programming-language tasks and assist the software engineering process. A growing number of sophisticated NLP applications make researchers' lives more convenient. The transformer model, combined with transfer learning, has proved to be a powerful technique for NLP tasks. However, not many studies focus on applications that understand source code to ease the software engineering process.
Researchers from Google AI, NVIDIA, Ludwig-Maximilians-University, and the Technical University of Munich (TUM) have recently published a paper describing CodeTrans, an encoder-decoder transformer model for the software engineering domain. The paper explores the effectiveness of encoder-decoder transformer models on six software engineering tasks comprising thirteen sub-tasks.
CodeTrans adapts the encoder-decoder transformer proposed by Vaswani et al. in 2017 and the T5 framework proposed by Raffel et al. in 2020. The T5 models concatenate different training examples up to the maximum training sequence length; CodeTrans instead disables the reduce_concat_tokens feature so that every sample contains only a single training example. The model also borrows the concepts of TaskRegistry and MixtureRegistry from T5: every task is built as a single TaskRegistry, and one or more TaskRegistries can be combined into one MixtureRegistry. Using these, the team developed 13 TaskRegistries, one MixtureRegistry for self-supervised learning, and one MixtureRegistry for multi-task learning.
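To make the registry idea concrete, here is a minimal sketch in plain Python of how tasks and mixtures relate. This is an illustration of the concept only, not the actual t5 library API, and the task names below are hypothetical stand-ins for the 13 sub-tasks CodeTrans registers.

```python
# Illustrative sketch of the TaskRegistry / MixtureRegistry idea from the
# T5 framework, in plain Python. Names and structure are simplified
# stand-ins, not the real t5 library API.

TASK_REGISTRY = {}
MIXTURE_REGISTRY = {}

def register_task(name, examples):
    """Each sub-task (e.g. one dataset for one language) is one task."""
    TASK_REGISTRY[name] = examples

def register_mixture(name, task_names):
    """A mixture groups one or more registered tasks for joint training."""
    MIXTURE_REGISTRY[name] = {t: TASK_REGISTRY[t] for t in task_names}

# Hypothetical sub-tasks (CodeTrans registers 13 such tasks in total).
register_task("code_documentation_python",
              [("def add(a, b): return a + b", "Add two numbers.")])
register_task("commit_message_generation",
              [("diff: fix off-by-one in loop", "Fix loop bound")])

# A multi-task mixture draws training examples from every task it contains.
register_mixture("multi_task_all",
                 ["code_documentation_python", "commit_message_generation"])

print(sorted(MIXTURE_REGISTRY["multi_task_all"]))
```

The point of the two-level design is that the same task definition can be reused unchanged in a single-task run, a multi-task mixture, or a self-supervised mixture.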
CodeTrans was trained using single-task learning, transfer learning, and multi-task learning on one NVIDIA GPU and Google Cloud TPUs. They used supervised and self-supervised tasks to build a language model in the software engineering domain.
They applied the model to six supervised tasks in the software engineering domain:
- Code Documentation Generation: generates documentation for a given code function.
- Code Comment Generation: generates the JavaDoc comment for a Java function.
- Source Code Summarization: generates a summary for a short code snippet.
- Git Commit Message Generation: generates a commit message describing the changes in a git commit.
- API Sequence Recommendation: generates an API usage sequence (such as class and function names) from a natural language description.
- Program Synthesis: generates program code from a natural language description.
The team evaluated all the tasks with a smoothed BLEU-4 score. The proposed model outperforms all baseline models and attains state-of-the-art performance across all tasks. The experiments across tasks show that larger models bring better performance. They also show that pre-trained models, whether adapted with transfer learning or multi-task learning fine-tuning, can be fine-tuned on new downstream tasks efficiently while saving a significant amount of training time. In addition, multi-task learning helps on small datasets on which a model would otherwise overfit easily. These findings can generalize to training NLP models in other domains.
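For readers unfamiliar with the metric, the following is a minimal, self-contained sketch of a smoothed BLEU-4 score: the geometric mean of clipped 1- to 4-gram precisions with a brevity penalty. The smoothing here is simple add-one on zero n-gram counts; the exact smoothing variant used in the paper may differ.

```python
import math
from collections import Counter

def smoothed_bleu4(candidate, reference):
    """Smoothed BLEU-4 between two token lists (a sketch of the metric
    family used to evaluate CodeTrans; the paper's exact smoothing may
    differ -- this uses add-one smoothing on zero n-gram overlaps)."""
    precisions = []
    for n in range(1, 5):
        cand = Counter(tuple(candidate[i:i + n])
                       for i in range(len(candidate) - n + 1))
        ref = Counter(tuple(reference[i:i + n])
                      for i in range(len(reference) - n + 1))
        # Clipped overlap: each reference n-gram can be matched at most
        # as many times as it appears in the reference.
        overlap = sum(min(c, ref[g]) for g, c in cand.items())
        total = max(sum(cand.values()), 1)
        if overlap == 0:
            overlap, total = 1, total + 1  # add-one smoothing
        precisions.append(overlap / total)
    # Brevity penalty discourages overly short candidates.
    bp = 1.0 if len(candidate) >= len(reference) else \
        math.exp(1 - len(reference) / max(len(candidate), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / 4)

cand = "returns the sum of two numbers".split()
ref = "return the sum of two numbers".split()
print(round(smoothed_bleu4(cand, ref), 3))  # close to, but below, 1.0
```

A perfect match scores 1.0; the single differing token above lowers every n-gram precision, which is why BLEU rewards longer exact subsequences.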
The team identified two aspects of a programming-language function that influence the model's performance: function/parameter names and code structure. A well-named function makes it easier for the model to generate the documentation. They hope that future research will focus on functions with disguised parameter or function names and on finding the best way to represent code-structure features.
They also mention preprocessing the datasets by parsing and tokenizing the code with language-specific Python libraries. However, not every user knows the programming language in question, and preprocessing adds complexity for users who want the best model performance. The team therefore sees scope for examining the effect of preprocessing on software engineering tasks and for training models that perform well without preprocessing steps such as parsing and tokenization.
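As an illustration of what such tokenization looks like for one language, the sketch below splits Python source into lexical tokens using the standard-library tokenize module. This is only an example of the kind of preprocessing described; the paper's actual pipeline and library choices per language may differ.

```python
import io
import tokenize

def tokenize_python(source):
    """Split Python source into its lexical tokens -- the kind of
    preprocessing step described for CodeTrans, sketched here with the
    standard-library tokenizer (the paper's exact pipeline may differ)."""
    tokens = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        # Drop purely structural tokens so only visible lexemes remain.
        if tok.type in (tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
                        tokenize.DEDENT, tokenize.ENDMARKER):
            continue
        tokens.append(tok.string)
    return tokens

print(tokenize_python("def add(a, b): return a + b"))
```

The model then sees `def`, `add`, `(`, `a`, and so on as separate tokens rather than raw characters, which is exactly the per-language dependency the authors would like future models to do without.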