Researchers From Berkeley Lab Introduce Self-Supervised Representation Learning For Astronomical Images


Sky surveys make it possible to catalog and analyze celestial objects without the need for lengthy individual observations, making them invaluable for exploring the universe. Beyond providing a general map or image of a region of the sky, they are among the largest data generators in science, imaging tens of millions to billions of galaxies over the lifetime of a single survey. As a result, screening the collected datasets for the most relevant information or discoveries has become increasingly laborious.

Machine learning algorithms are widely used to train the computer models that mine the data. However, each of these approaches has its own set of challenges. For instance, supervised learning requires image data to be assigned labels manually, which is time-consuming and limited in scope, as only about 1% of all known galaxies currently have such labels.

Researchers from Lawrence Berkeley National Laboratory are experimenting with a novel approach called self-supervised representation learning to overcome these shortcomings. Like unsupervised learning, self-supervised learning eliminates the need for training labels. The study shows that self-supervised algorithms can build “representations” (low-dimensional versions of images that retain their inherent information) by applying certain data augmentations during training. The technique has also been observed to outperform supervised learning on industry-standard image datasets.
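The core idea can be sketched with a toy contrastive setup: two augmented views of the same image are encoded, and the training objective pulls matching views together in representation space. The augmentations, the random-projection "encoder", and the NT-Xent-style loss below are simplified stand-ins for illustration, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(images):
    # Toy augmentations standing in for survey-specific ones
    # (random rotation and Gaussian noise are assumptions here).
    out = np.rot90(images, k=int(rng.integers(4)), axes=(1, 2))
    return out + rng.normal(scale=0.05, size=out.shape)

def encode(images, w):
    # Stand-in encoder: flatten and project to a low-dimensional,
    # unit-normalized representation (a real model would be a CNN).
    flat = images.reshape(images.shape[0], -1)
    z = flat @ w
    return z / np.linalg.norm(z, axis=1, keepdims=True)

def nt_xent_loss(z1, z2, temperature=0.5):
    # NT-Xent contrastive loss: the two augmented views of each
    # image should have similar representations.
    n = z1.shape[0]
    z = np.concatenate([z1, z2])             # (2n, d)
    sim = z @ z.T / temperature              # pairwise cosine similarities
    np.fill_diagonal(sim, -np.inf)           # exclude self-pairs
    targets = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    logsumexp = np.log(np.exp(sim).sum(axis=1))
    pos = sim[np.arange(2 * n), targets]     # similarity to the matching view
    return float((logsumexp - pos).mean())

images = rng.normal(size=(8, 16, 16))        # 8 toy "galaxy" images
w = rng.normal(size=(16 * 16, 32))           # random projection weights
loss = nt_xent_loss(encode(augment(images), w), encode(augment(images), w))
```

Minimizing such a loss over many images is what shapes the learned representations; no labels enter the objective at any point.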


According to the team, with the increasing volume of data in recent years, it has become challenging to obtain labeled datasets even through crowdsourcing. Given the increasing size of image datasets produced by the world’s ever-more sophisticated telescopes, this has motivated them to find innovative ways to further automate and speed up the process. 

Their approach is to extract useful features from these images, so that a model trained on only a small labeled subset of the data can generalize, via the learned representations, to the entire dataset.

To test their concepts, the team initially trained the computer model to learn image representations for galaxy morphology classification and photometric redshift estimation, two common downstream tasks in sky surveys. For this, they used 1.2 million existing galaxy images from the Sloan Digital Sky Survey. They found that the self-supervised approach outperformed supervised state-of-the-art results in both cases.


Their method allows learning from an entire sky survey without the use of labels. Furthermore, the learned representations support a large number of downstream tasks simultaneously, each at a higher level of performance than was previously possible. Instead of teaching the model to perform a specific task, the researchers instruct it to search all of the data and learn how the images differ, thereby learning what is in the images themselves.


The team intends to expand the scope of applications and tasks by applying their approach to a much larger, more complex dataset, such as the Dark Energy Camera Legacy Survey (DECaLS). They note that other scientific fields, such as microscopy, high-energy physics (anomaly detection), medical imaging, and satellite imagery, could benefit from this method as well. Because the learned representations can be reused directly, even researchers with little machine learning expertise or limited computing power can apply them, lowering the barrier to working with large datasets.