IBM Research Introduces ‘CodeNet’: A Large-Scale Dataset Aimed at Teaching AI to Code For Machine Learning Models

If you have checked your bank account, used a credit card, gone to the doctor, booked a ticket, paid your taxes, or bought anything in a store, you’ve probably dealt with a design that depends on COBOL (Common Business Oriented Language) code. Even though it was initially introduced over six decades ago, it is still used in many mission-critical business systems worldwide. COBOL is believed to be used in over 80% of financial operations, while the US Social Security Administration uses approximately 60 million lines of COBOL code.

As COBOL programmers and developers began to retire, corporations battled to maintain their systems operationally, let alone upgrade them for the reality of the always-on internet. And this is just one of many languages that are still in use but do not reflect what current coders want to write in or what is best suited for modern commercial applications.

Code language translation is one of the issues attempted to be addressed with CodeNet. It has the potential to revive methodologies for updating outdated systems, assisting developers in writing better code, and even allowing AI systems to help code the computers of the future. CodeNet is essentially an extensive dataset that intends to teach AI systems how to interpret and improve code and assist developers to code more quickly. Eventually, allows an AI system to code a computer. It comprises around 14 million code samples, totaling 500 million lines of code from over 55 different languages. It includes examples of current languages such as C++, Java, Python, and Go and older languages like Pascal, FORTRAN, and COBOL.

Over the last decade, there has been a revolution in Artificial Intelligence. Language and picture data, meticulously curated and labeled in databases such as ImageNet, have given rise to AI systems that can finish sentences for writers, identify tumors for physicians, and automate a plethora of business and IT activities. However, for code, the computer language, creating a dataset from which AI systems can learn has proven problematic.

The ultimate objective of CodeNet is to enable developers to build systems that can update old codebases while also fixing faults and security vulnerabilities in code. The big question is: Can computers program themselves?

CodeNet Experiments

On CodeNet, baseline studies for code categorization, code similarity, and code completion were run. When CodeNet users conduct their own tests, they can refer to these results as a guide. Because of CodeNet’s excellent quality, several studies show that models created from it generalize better across datasets than models developed from other datasets.

Classification of codes

CodeNet can assist in developing systems that can detect what form of code a snippet is. They employed a variety of machine-learning approaches, including a bag of tokens, a sequence of tokens, BERT model, and graph neural networks (GNNs). Some techniques for matching code types to source code achieved up to 97% accuracy.

Code resemblance

Code similarity assesses whether or not two or more pieces of code address the same problem. It is the underlying approach for code recommendation, clone detection, and cross-language transformations. They used a variety of code similarity algorithms (including MLP with a bag of tokens, Siamese network with token sequence, Simplified Parse Tree [SPT] with custom feature extraction, and GNN with SPT) against the benchmark datasets. A complex GNN with intra-graph and inter-graph attention mechanisms yields incredible similarity scores.

Across-dataset generalization

Models trained on the CodeNet benchmark datasets will benefit considerably from their excellent quality. They compared C++1000 to one of the most giant publicly accessible datasets of its sort, GCJ-297, produced from challenges and solutions in Google’s Code Jam. The same MISIM neural code similarity system model was trained on C++1000 and GCJ-297, and the two trained models were evaluated on another independent dataset, POJ-104. The model which was trained on GCJ-297 has a 12% poorer accuracy than the one trained on C++1000. C++1000 can generalize better than GCJ-297 because there is less data bias. And the quality of data cleaning and de-duplication in CodeNet is superior.

Finishing the code

This is a beneficial use case for developers, in which an AI system can forecast what code should be executed next at a specific place in a code sequence. To put this to the test, a masked language model (MLM) was created, which randomly hides tokens in an input sequence and attempts to accurately anticipate what follows next in a series of tests it hasn’t seen before. They trained a popular BERT-like attention model and achieved a top-1 prediction accuracy of 91.04% and a top-5 accuracy of 99.35%.

Additional CodeNet applications

Because of the extensive information and linguistic diversity, CodeNet may be used for a wide range of intriguing and valuable purposes. Code samples in CodeNet are identified with their anonymous submitter and acceptance status, allowing us to easily select realistic pairs of broken and corrected code from the same submitter for automatic code repair. A high proportion of the code samples include inputs, allowing us to run the code and extract the memory footprint and the CPU runtime, which can then be utilized for regression research and prediction. Given its plethora of programs written in various languages, CodeNet may also be used for program translation. The enormous quantity of code samples published in popular languages (such as C++, Python, Java, and C) provide suitable training datasets for the unique and successful monolingual techniques developed in recent years.

What sets CodeNet apart?

While CodeNet isn’t the only dataset attempting to tackle the realm of AI for code, it has some distinct advantages.

Big scale: To be helpful, CodeNet must include many data samples, with a wide range of selections to reflect what users may face while attempting to code. CodeNet is probably the largest dataset in its class, with 500 million lines of code: It has around ten times the number of code samples as GCJ, and its C++ benchmark is roughly ten times larger than POJ-104.

Rich annotation: CodeNet offers a range of metadata on its code samples, such as whether or not a sample solves a specific problem (and the error categories it falls into if it doesn’t). It also contains the issue statement for a particular job, sample input for execution, and sample output for validation. The code is designed to solve a coding challenge. This additional information is not found in other datasets of a similar nature.

To improve CodeNet’s accuracy and performance, the code samples were evaluated for near-duplicate (and duplication) and utilized clustering to detect identical issues.

How is CodeNet built?

CodeNet has 13,916,868 code submissions, which are grouped into 4,053 challenges. Some 53.6 percent (7,460,588) of the responses are approved, indicating that they can pass the test, while the remaining 29.5 percent are flagged with incorrect answers. The other entries were rejected because they did not match the runtime or memory criteria.

The issues on CodeNet are primarily educational, ranging from easy exercises to complex problems requiring complicated algorithms, and the users contributing code span from novices to seasoned coders. The information is mainly code and metadata taken from two online code judging sites, AIZU and AtCoder. These websites provide courses and contests where coding issues are offered, and contributions are scored by an automated review process to determine their correctness. Only public contributions and manually combined data from the two sources were evaluated, resulting in a consistent format from which a single dataset was created.

Duplicate difficulties and near-duplicate code samples helped extract benchmark datasets where data independence was desired. Benchmark datasets for the leading languages (C++, Python, and Java) were offered for the customers’ convenience. Each code sample and related challenge is unique. Many pre-processing techniques were supplied to ensure that code samples could be successfully transformed into machine learning model input. Users can also develop benchmark datasets tailored to their individual needs by utilizing GitHub’s data filtering and aggregation tools.

What is the future of CodeNet?

This is only the beginning of what CodeNet can bring to the field of AI for programming. CodeNet data will be offered a series of challenges shortly. The initial task is for data scientists to use CodeNet to create AI models that can detect code comparable to another piece of code. This challenge was made in collaboration with Stanford University’s Global Women in Data Science program.

Imagine a future in which a developer may build on old code in a familiar language. They could write in Python, and an AI system could translate that into fully executable COBOL, prolonging the life and stability of the system they’re working on indefinitely. They’ve also begun to investigate how computers may program their successors. But for CodeNet to be a success, developers need to start using what it is.

Github: https://github.com/IBM/Project_CodeNet

Paper: https://openreview.net/pdf?id=6vZVBkCDrHT

Reference: https://research.ibm.com/blog/codenet-ai-neurips-2021

https://www.youtube.com/watch?v=V2xyu3WJ1kA&t=304s