In their latest paper, researchers at OpenAI reveal details about Codex, a deep learning model that generates source code. Codex is the backbone of Copilot, an AI pair-programmer tool jointly developed by GitHub and OpenAI that is currently available in beta to select users. The paper explores how the team repurposed their flagship language model, GPT-3, to create Codex, and how far deep learning can be trusted in programming. The OpenAI scientists managed this by building on a series of approaches that had proven successful with previous models.
As mentioned earlier, Codex is a descendant of GPT-3, repurposed for code. GPT-3 is a benchmark in itself, with around 175 billion parameters (a model's complexity is usually measured by its parameter count), two orders of magnitude more than its predecessor, GPT-2. Despite being trained as a general-purpose model, GPT-3 managed to outdo many models that were specialized for particular tasks.
However, according to the paper, GPT-3 was unable to solve any of the coding problems used to evaluate Codex. GPT-3's training dataset contained no code samples, and that is seen as the cause of its inability to perform in this arena. The researchers therefore took a different approach for Codex, fine-tuning the model on code with supervised learning instead of the unsupervised learning used for GPT-3. The paper reports that a supervised fine-tuned version of Codex solves 37.7 percent of the evaluation problems, and that Codex can handle a wide range of coding tasks.
As a machine learning model, Codex is subject to the no-free-lunch theorem, which in effect states that generalization comes at the cost of performance: machine learning models designed for a specific task work better on it than general-purpose ones. Codex solves the specialized task of translating function descriptions and signatures into source code with remarkable accuracy, but in exchange its natural language processing falls short of GPT-3's level.
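To make the task concrete, here is a hypothetical problem in the style of the paper's evaluation set (the function and its completion are illustrative, not taken from the paper): the signature and docstring serve as the prompt, and the model must synthesize the body.

```python
# Hypothetical prompt in the style of the paper's evaluation problems:
# the model sees the signature and docstring and must write the body.
def incr_list(lst: list) -> list:
    """Return a copy of lst with every element incremented by 1.

    >>> incr_list([1, 2, 3])
    [2, 3, 4]
    """
    # A completion like the following is what the model must produce:
    return [x + 1 for x in lst]
```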
The researchers also found that Codex's performance improved substantially as the model grew: at 300 million parameters, Codex solved 13.2 percent of the evaluation problems, rising to 28.8 percent at 12 billion parameters. The scientists did not go beyond the 12-billion-parameter threshold, however, because a larger model would require a correspondingly huge training dataset and could overfit, meaning it would be good at memorizing and rehearsing its training examples but bad at dealing with novel problems. Moreover, gathering and maintaining a more extensive dataset would prove to be extremely expensive.
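These percentages are pass@k scores over the paper's hand-written evaluation problems. The paper, and the human-eval repository linked at the end of this article, describe an unbiased estimator for pass@k; the sketch below follows that formulation, with numpy as the only dependency (the function name is mine).

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k: the probability that at least one of k samples
    drawn from n generated samples passes the unit tests, given that
    c of the n samples are correct."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw of k samples
        # must contain at least one correct solution.
        return 1.0
    # 1 - C(n-c, k) / C(n, k), computed as a numerically stable product.
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))
```

For k = 1 the estimator reduces to c / n, so a model that solves 28.8 percent of problems on the first try produces, on average, 28.8 correct samples per 100 generated.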
Cost is another arena that could pose a significant problem: training the model is expensive enough that building a profitable business around it would be difficult. A smaller but fine-tuned version of GPT-3 could prove to be a better option. A fact that cannot be ignored, though, is that code generation is a lucrative market in which programmers' hourly rates keep climbing, so saving developers even a few hours could make the model pay for itself.
As per the paper, the output produced by Codex is quite fascinating, but the fact of the matter is that it is a machine learning model and does not understand programming in any meaningful sense; it merely captures the statistical correlations between code fragments. The researchers highlight this themselves: they write that Codex is not sample-efficient to train, and acknowledge that even the most seasoned developers encounter nowhere near the amount of code Codex was trained on over the course of their careers. They go on to add that a strong student who has completed an introductory computer science course would be expected to solve a larger fraction of the coding problems than Codex-12B.
Another point is that even after completing the block described in the prompt, Codex will mindlessly continue to generate code. This can be useful for simple, repetitive tasks, but a larger task with several components to be completed exposes the model's limitations. The experiments bear this out: as the number of chained components in a task increased, the model's performance was found to degrade.
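In practice, this runaway generation is reined in by truncating completions at stop sequences; the paper reports cutting samples at tokens such as '\nclass', '\ndef', '\n#', '\nif', and '\nprint' so that output ends after the first complete function. The helper below is a minimal sketch of that post-processing step, assuming the completion arrives as a plain string (the function itself is hypothetical).

```python
# Minimal sketch of stop-sequence truncation; the stop list follows the
# paper's reported settings, while the helper itself is hypothetical.
STOP_SEQUENCES = ["\nclass", "\ndef", "\n#", "\nif", "\nprint"]

def truncate_completion(completion: str) -> str:
    """Cut the generated text at the earliest stop sequence, keeping
    only the first complete block of code."""
    cut = len(completion)
    for stop in STOP_SEQUENCES:
        idx = completion.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return completion[:cut]
```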
The researchers have also discussed the issue of misalignment in their paper. What this means is that when a prompt contains a subtle bug, Codex may deliberately suggest code that appears good but is, in reality, incorrect. Subtle bugs are widespread in real-world code and are easy for human programmers to overlook. More study might help the researchers understand this phenomenon better, but the issue is likely to persist, and even grow, as models and datasets scale up.
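As a hypothetical illustration of the kind of subtle bug at issue, consider a completion that reads plausibly and even runs, yet is wrong in a way a quick review could miss:

```python
# Hypothetical example of a plausible-looking but incorrect completion:
# the code runs and the docstring inspires confidence, yet the
# denominator is off by one, so every result is wrong.
def average(values: list) -> float:
    """Return the arithmetic mean of values."""
    return sum(values) / (len(values) - 1)  # should be len(values)
```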
Codex, if it is as successful as is being claimed, could be a game-changer for the software industry; it will, however, come with its own limitations.
Paper: https://arxiv.org/pdf/2107.03374.pdf
GitHub: https://github.com/openai/human-eval
Source: https://thenextweb.com/news/dont-mistake-openai-codex-for-a-programmer-syndication
Amreen Bawa is a consulting intern at MarktechPost. Along with pursuing a BA (Hons) in Social Sciences at Panjab University, Chandigarh, she is a keen learner and writer with a special interest in the applications and scope of artificial intelligence in various facets of life.