Autocompletion has become a handy and widely used tool in contemporary messaging and other writing tasks. It is also an essential feature of an integrated development environment (IDE) for computer programming. Recent research has shown that autocompletion can be powered by deep learning, with software language models achieving significant accuracy gains when trained on real-world datasets collected from programmers' IDE activity. However, a common issue with less popular programming languages is that the available IDE datasets may be insufficient for training.
In a paper, a research team from Facebook demonstrates how transfer learning enables pre-training on non-IDE, non-autocompletion, and different-language code sequences before fine-tuning on the autocompletion prediction task. The proposed method improves model accuracy by more than 50 percent on small fine-tuning datasets and by over 10 percent given 50k labeled examples.
The researchers used datasets drawn from real-world developer activity at Facebook, focusing on the popular programming language Python and the less popular language Hack. They first trained a variety of monolingual models on either Hack or Python and several multilingual models on both languages. To recognize and predict rare and novel tokens from an open vocabulary, they applied two tokenization approaches: byte-pair encoding (BPE) and bigram encoding with a copy mechanism. They used two models with state-of-the-art code prediction performance, GPT-2 and PLBART, to test the effects of transfer learning, and evaluated model performance both online and offline.
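To make the first tokenization approach concrete, here is a minimal sketch of how BPE learns merge rules by repeatedly fusing the most frequent adjacent symbol pair, letting common sub-identifiers (such as a shared `get` prefix) become single tokens. This is a toy illustration with a hypothetical corpus, not the paper's implementation:

```python
from collections import Counter

def byte_pair_encode(words, num_merges):
    """Learn BPE merge rules from a corpus.

    `words` maps a word (as a tuple of symbols) to its frequency.
    Returns the learned merge rules and the re-segmented corpus.
    """
    vocab = dict(words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the merge everywhere it occurs in the corpus.
        merged = {}
        for word, freq in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged[tuple(out)] = freq
        vocab = merged
    return merges, vocab

# Hypothetical identifier corpus: frequent code identifiers broken
# into characters, with counts standing in for corpus frequency.
corpus = {tuple("getValue"): 5, tuple("getName"): 4, tuple("setName"): 3}
merges, segmented = byte_pair_encode(corpus, 2)
```

After two merges the rules are `('e', 't')` then `('g', 'et')`, so the frequent prefix `get` is already a single token; rare or novel identifiers remain representable as sequences of smaller units, which is why BPE suits open-vocabulary code prediction.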
The researchers designed their extensive experiments to answer three questions: what benefit autocompletion models gain from combining unsupervised pre-training with task-specific fine-tuning; what effect pre-training on a large source code dataset obtained outside of code authoring has; and whether pre-training a multilingual model on the language with more training data can benefit the language with less data.
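The pre-train-then-fine-tune recipe behind these questions can be illustrated with a toy count-based model. This is a deliberately simplified sketch with hypothetical data, not the paper's transformer setup: the point is only that training continues on the small task corpus on top of what the large corpus already taught the model.

```python
from collections import Counter, defaultdict

class BigramLM:
    """Toy bigram next-token model used to illustrate transfer learning."""

    def __init__(self):
        self.counts = defaultdict(Counter)

    def train(self, sequences):
        # Calling train() again on new data accumulates counts on top of
        # the old ones: a crude analogue of fine-tuning a pre-trained model.
        for seq in sequences:
            for prev, nxt in zip(seq, seq[1:]):
                self.counts[prev][nxt] += 1

    def predict(self, prev):
        """Most likely next token after `prev` (the autocompletion step)."""
        c = self.counts[prev]
        return max(c, key=c.get) if c else None

# Large "pre-training" corpus (standing in for commit data, possibly in
# another language) and a small task-specific "fine-tuning" corpus.
pretrain = [["if", "(", "x", ")", "{", "return", "}"]] * 50
finetune = [["def", "f", "(", "x", ")", ":", "return"]] * 5

model = BigramLM()
model.train(pretrain)    # pre-training phase
model.train(finetune)    # fine-tuning phase

scratch = BigramLM()
scratch.train(finetune)  # baseline: small task data only
```

The fine-tuned model retains what pre-training taught it (`model.predict("if")` yields `"("`) while also learning the new task pattern (`model.predict("def")` yields `"f"`), whereas the from-scratch baseline has never seen `"if"` at all and returns `None` for it: the same asymmetry, writ small, that the experiments measure on real models.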
They pre-trained two transformer software language models, GPT-2 and BART, on source code files obtained from version control commits, and showed that fine-tuning on real-world IDE code sequences improves their autocompletion prediction accuracy by 2.18 percent.
They trained the GPT-2 model on two real-world datasets: code sequences logged during IDE authoring, and autocompletion selections. They demonstrated that the combination of pre-training and task-specific fine-tuning leads to a superior model, outperforming the base model by nearly 3.5 percent.
They also showed that pre-training on a different programming language boosts accuracy by 13.1 percent: a model pre-trained on Hack examples and fine-tuned on 10k Python examples outperforms a model trained on Python examples alone. Through online A/B tests, they further showed that improvements across the three transfer-learning dimensions of task, domain, and language translate into increased autocompletion tool usage of 3.86 percent, 6.63 percent, and 4.64 percent, respectively.
The study demonstrates that pre-training autocompletion models on non-IDE, non-autocompletion, and different-language code sequences can greatly boost model accuracy, improving the coding experience even for developers who use less popular programming languages.