In recent years, we have seen significant advancements in large language models (LLMs). From OpenAI’s GPT-3, which generates remarkably fluent text, to its open-source counterpart BLOOM, impressive LLMs have been released one after the other. Language-related tasks that were previously unsolvable have become merely challenging for these models.
All this advancement is made possible by the massive amount of data available on the Internet and by powerful GPUs. As impressive as these models sound, training an LLM is an extremely costly process, both in terms of data and hardware requirements. We are talking about AI models with hundreds of billions of parameters, so feeding them enough data is no easy feat. But once you do, you get mesmerizing performance out of them.
Have you ever wondered what the starting point of developing “computing” devices was? Why did people spend time and effort designing and building the first computers? We can safely assume it was not to entertain people with video games or YouTube videos.
It all started with the goal of solving information overload in science. Computers were proposed as a solution for managing the growing body of information. They would take care of routine tasks such as storage and retrieval, clearing the way for insight and decision-making in scientific thinking. Can we really say we have achieved this, when finding the answer to a scientific question on Google is becoming more and more difficult nowadays?
Moreover, the sheer number of scientific papers published daily is far beyond what a human being can process. For example, an average of 516 papers per day were submitted to arXiv in May 2022. On top of that, the volume of scientific data is growing beyond our processing capabilities as well.
We have tools to access and filter this information. When you want to research a topic, the first place you go is Google. Although it will not give you the answer you are looking for most of the time, Google will point you to the right destination, like Wikipedia or Stack Overflow. Yes, we can find answers there, but the problem is that these resources require costly human contributions, and as a result, updates can be slow.
What if we had a better tool to access and filter the sheer amount of scientific information we have? Search engines can only store information; they cannot reason about it. What if we had a Google Search that could understand the information it stores and answer our questions directly? Well, it is time to meet Galactica.
Unlike search engines, language models can potentially store, combine, and reason about scientific knowledge. They can find connections between research articles, surface hidden knowledge, and bring those insights to you. They can also generate genuinely useful content by connecting what they know: a literature review on a given topic, lecture notes for a course, answers to your questions, or wiki articles. All of these are possible with language models.
Galactica is a first step toward an ideal scientific neural network assistant. The ultimate scientific assistant will be the interface through which we access knowledge. It will handle the cumbersome information overload while you focus on making decisions with that information.
So, how does Galactica work? As the name suggests, it is a large language model, containing billions of parameters trained on billions of data points. Since Galactica is designed to be a scientific assistant, the obvious source of training data is research papers. Accordingly, over 48 million research papers, 2 million code samples, and 8 million lecture notes and textbooks were used to construct Galactica’s training data. The result is a dataset of 106 billion tokens.
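To get a feel for what a figure like “106 billion tokens” means, here is a minimal sketch of how a corpus’s token count might be estimated. Both the whitespace tokenizer and the sample documents are purely illustrative stand-ins: Galactica, like other LLMs, uses a learned subword (BPE-style) tokenizer, and its actual corpus spans tens of millions of documents.

```python
# Illustrative sketch: estimating the token count of a text corpus.
# The whitespace "tokenizer" below is a hypothetical stand-in; real
# LLM tokenizers split text into learned subword units, so the
# resulting counts would differ.

def tokenize(text: str) -> list[str]:
    # Naive whitespace tokenizer: one token per space-separated word.
    return text.split()

def corpus_token_count(documents: list[str]) -> int:
    # Sum the token counts over every document in the corpus.
    return sum(len(tokenize(doc)) for doc in documents)

if __name__ == "__main__":
    sample_corpus = [
        "Attention is all you need.",
        "Galactica is trained on a large scientific corpus.",
    ]
    print(corpus_token_count(sample_corpus))  # prints 13
```

Scaled up, the same bookkeeping over 48 million papers plus code and lecture notes is what yields a corpus in the hundred-billion-token range.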
Galactica was used in writing its own paper, which makes it one of the first AI models to introduce itself. We believe it will be used to write many more papers in the near future.
This was a brief summary of Galactica, the new AI model from Meta designed to help with scientific knowledge retrieval. You can try Galactica for your own use cases using the links below.
Check out the paper and project. All credit for this research goes to the researchers on this project. Also, don’t forget to join our Reddit page and Discord channel, where we share the latest AI research news, cool AI projects, and more.
Ekrem Çetinkaya received his B.Sc. in 2018, and M.Sc. in 2019 from Ozyegin University, Istanbul, Türkiye. He wrote his M.Sc. thesis about image denoising using deep convolutional networks. He received his Ph.D. degree in 2023 from the University of Klagenfurt, Austria, with his dissertation titled "Video Coding Enhancements for HTTP Adaptive Streaming Using Machine Learning." His research interests include deep learning, computer vision, video encoding, and multimedia networking.