Meet MinD-Vis: An AI Model That Can Reconstruct What You See Using Brain Scans

Diffusion models have become the apple of the machine learning community's eye in recent months. From generating videos from text prompts to image editing, we have seen a variety of successful applications of diffusion models.

The idea behind diffusion models is relatively simple: start with pure noise and gradually denoise it until a realistic-looking image emerges. You might ask, what about the text prompts we use? How do they affect the output image? They are used to condition the network so that the gradual denoising process moves in a particular direction.
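The denoising idea can be sketched in a few lines. This is a toy illustration only: in a real diffusion model, a trained neural network predicts the noise at each step, and the conditioning (e.g., a text embedding) steers that prediction. Here a hand-written rule and a fixed `target` array stand in for both, just to show the loop structure.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 50                                       # number of denoising steps (toy value)
target = np.full((8, 8), 0.5)                # stand-in for the image the condition describes

def denoise_step(x, t):
    """One toy reverse step: a real model uses a neural network to predict
    the noise; here a hand-written rule nudges x toward the target."""
    predicted_noise = x - target             # stand-in for the learned noise estimate
    return x - (t / T) * predicted_noise     # large steps early, small steps late

x = rng.standard_normal((8, 8))              # start from pure noise
for t in reversed(range(T)):
    x = denoise_step(x, t)

print(float(np.abs(x - target).mean()))      # near zero: the noise has been removed
```

Changing `target` changes what the loop converges to, which is the essence of conditioning: the same denoising machinery, steered toward different outputs.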

So, we know that if we start from pure noise, we can generate a realistic-looking image. That is how a diffusion model works. Have you ever wondered how we remember or imagine objects we have seen before? What happens in our brain when we see a duck in the park, go home, and try to remember what it looked like? And why am I even talking about this in a diffusion model article? Because MinD-Vis tries to achieve something really interesting: decoding fMRI scans of human brains to reconstruct the objects people saw.

Overview of MinD-Vis.

Yes, you read that right. A diffusion model reconstructs the objects you saw, using the fMRI scan of your brain recorded while you were looking at them.

We shape our lives around what we experience and see. Beyond the qualities of external stimuli, our experiences also shape the complex brain activity that underlies our perception of the world. A central objective of cognitive neuroscience is to understand these brain functions and decode the information they store. Decoding the visual information in brain scans is therefore an important task.

How do we capture the information in the brain, though? Most of us have probably seen a Magnetic Resonance Imaging (MRI) device in a hospital. Functional magnetic resonance imaging (fMRI) is a variant of this technology that uses a magnetic field and radio waves to produce detailed images of the brain. Unlike traditional MRI, which produces static anatomical images, fMRI creates dynamic images that show changes in the brain's activity over time.

Some studies have tried to recover the visual correspondence from raw fMRI scans using deep learning models guided by biological principles. However, deep learning models require enormous amounts of data, and since no large-scale fMRI-image pair dataset is available, these approaches usually produce blurry and semantically meaningless images.

Acquiring efficient and biologically sound representations of fMRI data is therefore essential to building a clear and generalizable link between brain activity and visual stimuli from only a few paired annotations.

When it comes to providing context to a deep learning model, self-supervised learning with pretext tasks on large datasets is a powerful approach: the model is first pretrained on a generic task and then finetuned on the domain-specific one. This is especially useful when the labeled dataset is relatively small. However, it is important to select a proper pretext task. Masked Signal Modeling (MSM) is one of the best examples here, as it achieves strong results in computer vision tasks by hiding part of the input and training the model to reconstruct it.
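The masked-modeling pretext task can be illustrated with a small sketch. The masking ratio, shapes, and the zero-filled "prediction" below are illustrative stand-ins, not the paper's actual configuration; the point is that the input is partially hidden and the self-supervised loss is computed only on the hidden positions.

```python
import numpy as np

rng = np.random.default_rng(0)

signal = rng.standard_normal(100)            # stand-in for an fMRI voxel vector
mask_ratio = 0.75                            # MSM-style high masking ratio (toy value)
mask = rng.random(signal.shape) < mask_ratio

visible = signal.copy()
visible[mask] = 0.0                          # masked-out positions are hidden from the model

# A real model would predict signal[mask] from the visible context;
# here a zero vector stands in for the model output.
prediction = np.zeros_like(signal)
loss = float(np.mean((prediction[mask] - signal[mask]) ** 2))
print(round(loss, 3))
```

Because no labels are needed, this pretext task can be run on large unlabeled fMRI datasets before finetuning on the small paired dataset.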

Moreover, we have all seen how good diffusion models are at generation, offering superior sample quality and training stability. Both properties are handy to have in visual stimuli decoding.

MinD-Vis architecture.

Therefore, MinD-Vis combines these two tools into a reliable stimulus-decoding model: Sparse Masked Brain Modeling with a Double-Conditioned Latent Diffusion Model for Human Vision Decoding. It exploits large-scale self-supervised learning and mimics the brain's sparse coding of information. As a result, MinD-Vis can produce meaningful images with matching details from brain recordings using very few training pairs.
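Conceptually, the pipeline chains the two pieces above: a pretrained fMRI encoder produces an embedding, which then conditions the diffusion sampling. The sketch below is purely illustrative; every function, shape, and name is a hypothetical stand-in, and the paper's actual double conditioning is considerably more elaborate than the single toy conditioning signal shown here.

```python
import numpy as np

rng = np.random.default_rng(0)

def fmri_encoder(fmri):
    """Stand-in for the masked-modeling-pretrained sparse encoder:
    collapses the voxel vector into a 16-dim embedding (toy choice)."""
    return fmri.reshape(16, -1).mean(axis=1)

def conditioned_sample(noise, embedding, steps=25):
    """Stand-in for conditioned latent-diffusion sampling: the embedding
    steers each toy denoising step (a real model uses a trained network)."""
    x = noise
    cond = np.tile(embedding, noise.shape[0] // embedding.shape[0])
    for t in reversed(range(steps)):
        x = x - (t / steps) * (x - cond)     # toy reverse-diffusion step
    return x

fmri = rng.standard_normal(4096)             # fake fMRI voxel vector
z = fmri_encoder(fmri)                       # brain embedding
image_latent = conditioned_sample(rng.standard_normal(64), z)
print(image_latent.shape)                    # (64,)
```

The key design point this illustrates is the division of labor: the encoder is learned from plentiful unlabeled fMRI data, so the diffusion stage needs only a few paired examples to link embeddings to images.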

Check out the Paper, Code, and Project. All credit for this research goes to the researchers on this project.

Ekrem Çetinkaya received his B.Sc. in 2018, and M.Sc. in 2019 from Ozyegin University, Istanbul, Türkiye. He wrote his M.Sc. thesis about image denoising using deep convolutional networks. He received his Ph.D. degree in 2023 from the University of Klagenfurt, Austria, with his dissertation titled "Video Coding Enhancements for HTTP Adaptive Streaming Using Machine Learning." His research interests include deep learning, computer vision, video encoding, and multimedia networking.