Meet ‘VALHALLA’, a Machine Learning Method That can Hallucinate an Image of Written Words, and Then Use It to Help Translate The Text into Another Language

This Article is written as a summay by Marktechpost Staff based on the paper 'VALHALLA: Visual Hallucination for Machine Translation'. All Credit For This Research Goes To The Researchers of This Project. Check out the paper, github, project and post.

Please Don't Forget To Join Our ML Subreddit

Machine translation is a branch of computational linguistics that uses software to convert text or speech between languages.

Typically, MT replaces words in one language with words in another. However, this method rarely results in a decent translation because recognition of entire phrases and their closest counterparts in the target language is required. Many words have several meanings, and not all terms in one language have comparable words in another.

Many researchers have been working to solve this challenge using corpus statistical and neural techniques, which has led to better translations, linguistic typology handling, idiom translation, and anomaly isolation.

But, these systems heavily rely on text-only data and have no explicit connection to the real world. With that being said, researchers are now looking into ways to multimodal MT systems that may incorporate a wealth of external data into the modeling process.

Typically, these methods require source phrases to be linked with corresponding images during training and testing. This specifically limits their utility in situations when images are not available during inference. This inspired researchers at the MIT-IBM Watson AI Lab, MIT CSAIL, and UC San Diego to work on multimodal machine translation, which uses the visual modality to improve machine translation systems.

In their recent work, the researchers first explore whether a system that only has access to images during training time can generalize to these situations in their latest work. “Visual hallucination, or the ability to conceive visual scenes, can be used to improve machine translation systems,” they claim. Further, they state that if a  translation system had access to images during training, it could be taught to abstract an image or visual representation of the text sentence to ground the translation process. This abstracted visual representation could be utilized instead of an actual image to perform multimodal translation during the testing period.

The researchers present a basic but effective VisuAL HALLucinAtion (VALHALLA) framework, which is based on machine learning for machine translation that integrates visuals during training to build a more successful text-only model. In machine translation, the models are trained to augment the text representation recovered from the source phrase with a latent visual representation that is similar to the one extracted by an MMT system from a real image.


In this study, the augmented text representation is recovered from the source phrase with a latent visual representation that is similar to the one extracted by an MMT system from a real image. They use a discrete codebook (trained using VQGAN-VAE) to train an autoregressive hallucination transformer to predict visual tokens from input source words for multimodal translation.

A visual hallucination transformer maps the source sentence into a discrete picture representation. Then, an MMT transformer maps the source sentence combined with its discrete image representation into the target phrase. Hallucination, translation, and consistency losses are used to train the transformer models from start to finish. 

According to the researchers, this is the first time an autoregressive image transformer has been used in conjunction with a translation transformer to successfully hallucinate discrete visual representations.

Their findings show that discrete visual representations work better than continuous visual embeddings currently used in MMT approaches. They demonstrated the superiority of VALHALLA over strong translation baselines on three typical MT datasets with a wide variety of language pairs and different training data sizes. 

The results reveal that VALHALLA outperforms the most relevant state-of-the-art MMT techniques that use continuous image representations by an average of 23% BLEU compared to the text-only translation baseline. In under-resourced translation contexts, the benefits over the text-only baseline are as great as +3.1 BLEU, confirming the idea that visual hallucinations can have significant practical relevance in these settings. Additional research backs this up, indicating that, in limited textual contexts, VALHALLA models indeed use visual hallucination to improve translations.