Meet StarCoder: The Biggest Open-Source Large Language Models for Code

BigCode is an open scientific collaboration led by Hugging Face and ServiceNow that focuses on the responsible development of large language models for code. StarCoder and StarCoderBase are Large Language Models for Code (Code LLMs) trained on permissively licensed data from GitHub, including 80+ programming languages, Git commits, GitHub issues, and Jupyter notebooks. Similar to LLaMA, the team trained a model with roughly 15B parameters on 1 trillion tokens. StarCoder is a version of StarCoderBase fine-tuned on a further 35 billion Python tokens. StarCoderBase was found to outperform other open Code LLMs on several popular programming benchmarks and to match or surpass closed models such as OpenAI’s code-cushman-001 (the original Codex model that powered early versions of GitHub Copilot). With a context length of over 8,000 tokens, the StarCoder models can process more input than any other open LLM available at the time of release, which opens the door to a wide variety of new uses.

StarCoder and comparable models were tested extensively across a wide range of benchmarks. HumanEval is a widely used Python benchmark that checks whether a model can correctly complete a function given only its signature and docstring. StarCoder and StarCoderBase outperformed larger models such as PaLM, LaMDA, and LLaMA.
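To make the HumanEval setup concrete, here is a minimal sketch of how such a task can be checked. The prompt, the completion, and the tests are hand-written for illustration (nothing here was generated by StarCoder or taken from the benchmark), and `passes_tests` is a hypothetical helper, not part of any evaluation harness.

```python
# A HumanEval-style task: the model sees only the signature and docstring.
PROMPT = '''def add(a: int, b: int) -> int:
    """Return the sum of a and b."""
'''

# Stand-in for a model completion (written by hand for this example).
COMPLETION = "    return a + b\n"

# Unit tests that decide whether the completed function is correct.
TESTS = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"


def passes_tests(prompt: str, completion: str, test_code: str) -> bool:
    """Run prompt + completion, then the tests; True if nothing raises.

    Note: exec() on model output is unsafe. Real harnesses run the
    code in a sandbox with time and resource limits.
    """
    namespace = {}
    try:
        exec(prompt + completion, namespace)  # define the function
        exec(test_code, namespace)            # run the assertions
    except Exception:
        return False
    return True


print(passes_tests(PROMPT, COMPLETION, TESTS))
print(passes_tests(PROMPT, "    return a - b\n", TESTS))
```

The first call reports a passing completion and the second a failing one. A benchmark score is then the fraction of tasks for which a sampled completion passes.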

Model

The StarCoder models have 15.5B parameters and were trained on 80+ programming languages from The Stack (v1.2), with opt-out requests excluded. The model uses Multi Query Attention with a context window of 8192 tokens and was trained on 1 trillion tokens with the Fill-in-the-Middle objective.
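One reason Multi Query Attention pairs well with a long context window is memory: all query heads share a single key/value head, so the cache of keys and values kept during generation shrinks by the number of heads. The arithmetic below is a rough sketch. The layer count, head count, and head dimension are illustrative assumptions and are not taken from this article; only the 8192-token context comes from the text.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    # Factor 2: one tensor for keys and one for values in every layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


# Illustrative configuration (assumed values, not from the article).
N_LAYERS, N_HEADS, HEAD_DIM = 40, 48, 128
CONTEXT = 8192  # context window stated in the article

# Standard multi-head attention: one key/value head per query head.
multi_head = kv_cache_bytes(N_LAYERS, N_HEADS, HEAD_DIM, CONTEXT)
# Multi Query Attention: a single key/value head shared by all query heads.
multi_query = kv_cache_bytes(N_LAYERS, 1, HEAD_DIM, CONTEXT)

print(f"multi-head : {multi_head / 2**30:.2f} GiB")
print(f"multi-query: {multi_query / 2**20:.0f} MiB")
print(f"reduction  : {multi_head // multi_query}x")
```

Under these assumptions and 16-bit values, a full-length sequence needs about 7.5 GiB of cache with ordinary multi-head attention and 160 MiB with Multi Query Attention.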

Researchers are also sharing the following demos and materials alongside the model:

  • The model weights, including intermediate checkpoints, released under an OpenRAIL license.
  • All training and preprocessing code, licensed under Apache 2.0.
  • A comprehensive evaluation harness for code models.
  • A new dataset for training and evaluating PII-removal algorithms.
  • The fully preprocessed dataset used for training.
  • A code attribution tool for finding generated code in the dataset.

Uses

  • The model was trained on code from GitHub. As a result, it is not an instruction-following model, and directives like “Write a function that computes the square root” do not work well. However, with a suitable prompt, such as the Tech Assistant prompt released by BigCode, it can be turned into a helpful technical assistant.
  • Fill-in-the-middle uses special tokens to mark which parts of the input and output are the prefix, middle, and suffix.
  • The model’s pretraining dataset was filtered to include only permissively licensed content. Even so, the model can reproduce source code from the dataset verbatim. It is important to respect any attribution and other requirements stipulated by the code’s license.
  • The new VSCode plugin is a useful complement to conversing with StarCoder while developing software. To see if the current code was included in the pretraining dataset, press CTRL+ESC.
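The fill-in-the-middle format mentioned above can be sketched as plain string assembly. The token spellings below (`<fim_prefix>`, `<fim_suffix>`, `<fim_middle>`) follow the public StarCoder model card rather than this article, so treat them as an assumption and check the tokenizer you actually use. The helper names are hypothetical, and the "generated" middle is written by hand.

```python
# Special tokens as spelled on the StarCoder model card (assumed here).
FIM_PREFIX = "<fim_prefix>"
FIM_SUFFIX = "<fim_suffix>"
FIM_MIDDLE = "<fim_middle>"


def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Arrange code before and after a gap so the model generates the gap."""
    # The model is expected to continue after FIM_MIDDLE with the missing code.
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"


def splice(prefix: str, middle: str, suffix: str) -> str:
    """Put a generated middle back between the original prefix and suffix."""
    return prefix + middle + suffix


prefix = "def area(radius):\n    "
suffix = "\n    return result\n"
prompt = build_fim_prompt(prefix, suffix)
print(prompt)

# Hand-written stand-in for what a model might return for the gap.
middle = "result = 3.14159 * radius ** 2"
print(splice(prefix, middle, suffix))
```

The prompt string would be passed to the model as ordinary input text, and only the tokens produced after `<fim_middle>` are spliced back into the file.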

Key Features

  • It is a major open-source Code LLM.
  • It is a 15B-parameter LLM trained on permissively licensed GitHub data.
  • It outperforms other open Code LLMs on popular programming benchmarks.
  • It can act as a technical assistant, generates realistic code, and supports 80+ programming languages.
  • It was trained on 1 trillion tokens and has a context window of 8192 tokens.
  • It was trained only on permissively licensed data.

Limitations

  • If a copyright owner opts out, it is difficult to eliminate every copy of their code when that code was licensed permissively or under a copyleft license and then duplicated in other repositories. More work is needed to develop effective data control and consent processes for the massive amounts of data used in LLM training.
  • Like other LLMs, StarCoder has limitations, including the possibility of producing content that is erroneous, rude, deceptive, ageist, sexist, or stereotype-reinforcing.
  • The model is made available under the OpenRAIL-M license, which imposes legally binding constraints on how the model can be used and how it can be modified.
  • Researchers evaluated StarCoder’s coding abilities and natural language understanding only on English-language benchmarks. More research into the efficacy and limitations of Code LLMs on other natural languages is needed to broaden the applicability of these models.

Researchers hope to improve the accessibility, reproducibility, and transparency of Code LLMs in the research and developer community by releasing the StarCoder models under an Open Responsible AI Model license and by open-sourcing on GitHub all code repositories used to build the model. The model license includes usage restrictions to ensure that derivative works and applications built on the model adhere to the BigCode principles of responsible AI. Researchers also released a new set of attribution tools that end-users of Code LLMs can use to check whether model generations may have been copied from the training data. Researchers hope these precautions will support a safe model release and help ensure that StarCoder’s high-performing models are used for good.




Dhanshree Shenwai is a Computer Science Engineer and has a good experience in FinTech companies covering Financial, Cards & Payments and Banking domain with keen interest in applications of AI. She is enthusiastic about exploring new technologies and advancements in today’s evolving world making everyone's life easy.
