Researchers from the National University of Singapore Propose Mind-Video: A New AI Tool That Uses fMRI Data from the Brain to Recreate Video Image

Understanding human cognition has made reconstructing human vision from brain processes intriguing, especially when employing non-invasive technologies like functional Magnetic Resonance Imaging (fMRI). There has been a lot of progress in recovering still images from non-invasive brain recordings, but not much in the way of continuous visual experiences like films. 

Although non-invasive technologies only collect so much data since they are less robust and more vulnerable to outside influences like noise. In addition, gathering neuroimaging data is a time-consuming and expensive process.

Progress has been made despite these challenges, most notably in learning useful fMRI features with sparse fMRI-annotation pairs. Unlike static images, the human visual experience is a nonstop, ever-changing stream of sceneries, motions, and objects. Because fMRI measures blood oxygenation level-dependent (BOLD) signals and takes pictures of brain activity every few seconds, it can be difficult to restore dynamic visual experience. Each fMRI readout can be considered an “average” of the brain’s activity during the scan. Contrarily, the frame rate of a standard video is 30 frames per second (FPS). In the time it takes to acquire one fMRI frame, 60 video frames can be displayed as visual stimuli, potentially exposing the subject to a wide range of objects, actions, and settings. Therefore, retrieving films at an FPS significantly greater than the fMRI’s temporal resolution via fMRI decoding is challenging.

Researchers from the National University of Singapore and the Chinese University of Hong Kong introduced MinD-Video, a modular brain decoding pipeline comprising an fMRI encoder and an augmented stable diffusion model trained independently and then fine-tuned together. The proposed model takes data from the brain in stages, expanding its knowledge of the semantic field. 

Initially, the team trains generic visual fMRI features using large-scale unsupervised learning and masked brain modeling. Next, they use the annotated dataset’s multimodality to distill semantic-related features and employ contrastive learning to train the fMRI encoder in the Contrastive Language-Image Pre-Training (CLIP) space. Next, an augmented stable diffusion model, designed for video production using fMRI input, is co-trained with the learned features to hone them.

The researchers added near-frame focus to the stable diffusion model for generating scene-dynamic videos. They also developed an adversarial guidance system to condition fMRI scans for specific purposes. High-quality videos were retrieved, and their semantics, such as motions and scene dynamics, were spot-on. 

The team assessed the outcomes using video and frame-level semantic and pixel metrics. With an accuracy of 85% in semantic metrics and 0.19 in SSIM, this method is 49% more effective than the prior state-of-the-art methods. The findings also suggest that the model appears to have biological plausibility and interpretability based on the results of the attention study, which showed that it maps to the visual cortex and higher cognitive networks.

Due to individual differences, the capacity of the proposed technique to generalize across subjects is still being studied. Less than 10% of the cortical voxels are used in this method for reconstructions, while the full potential of the total brain data remains untapped. The researchers believe that as more complex models are built, this area will likely find use in places like neuroscience and BCI.

Check out the Paper, Github, and Project. Don’t forget to join our 27k+ ML SubRedditDiscord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more. If you have any questions regarding the above article or if we missed anything, feel free to email us at

🚀 Check Out 100’s AI Tools in AI Tools Club

🐝 Join the Fastest Growing AI Research Newsletter Read by Researchers from Google + NVIDIA + Meta + Stanford + MIT + Microsoft and many others...