In A Latest Computer Vision Research, A Research Team Proposes A Novel Data Augmentation Technique Called ‘TokenMix’ To Improve The Performance of Vision Transformers

Data augmentation is creating additional data points from current data to increase the dataset size artificially. It is used to improve the performance and outcomes of machine learning models. The data augmentation is mainly done by making minor adjustments to the data or utilizing machine learning models to produce new data points in the latent space of the original data. 

Several approaches dealing with image data augmentation have been introduced in the literature, which can be divided into three groups. The first technique, cutting-based data augmentation, involves masking one or more parts of the input image. This technique pushes the network to learn informative representations from the whole picture. As examples, we can cite Cutout and Random-Erasing. The second technique, mixing-based data augmentation,  is based on mixing two images to create a new picture according to a mixing strategy such as combining the RGB values. The most famous algorithms following this approach are Mixup, Co-Mixup, and AugMix. Finally, the third technique, the Joint of cutting and mixing, aims to leverage both cutting and mixing to achieve better performance. For example, CutMix proposes to replace a patch in the image with that from another picture. In addition, the target for the mixed image is calculated as the proportion of the replaced area. 

Recently, researchers from the University of Hong Kong and Sensetime have proposed a method called TokenMix, which is based on CutMix that is more suitable for transformer-based architectures. Instead of replacing a single patch, TokenMix offers to replace multiple non-overlapping patches. Moreover, the new image is labeled from the activation maps obtained from a teacher network.

For transformer-based systems, region-level mixing is less advantageous, and designating the target of mixed images via linear combination may be erroneous and even counter-intuitive. TokenMix was introduced to meet these two challenges. In the new approach, the mask M used to select patches is generated at the token level to push the network to learn long-range dependency. Concretely, the input image is divided into several non-overlapped parts, which are then linearly projected to visual tokens. The masked tokens’ number and the aspect ratio are chosen randomly for each part. The authors state that the distributed masking areas can be recognized more easily than masking a whole rectangular region. On the other hand, a new technique based on activation maps has been introduced to make labeling the new image more consistent with its content.


Contrary to the strategy followed by CutMix, TokenMix is not interested in the ratio of the parts coming from each image since this ratio does not consider each part’s informative value. A pre-trained teacher network is used to generate the activation maps of the two starting images. The label of the new image is therefore constructed from the content-based neural activation maps of the two mixing pictures. Following this strategy, the parts that come from the foreground are more critical than the ones that contain the background, and the new label reflects a more accurate informative ratio. 

The evaluation study was carried out on several recent vision transformer architectures, such as DeiT, PVT, Swin Transformer, and CaiT, in addition to ResNet, a convolution model. The authors used the ImageNet-1K dataset. Results show that the new data-augmentation method improves CutMix on various transformer-based architectures. In addition, a qualitative study reveals that TokenMix improves the occlusion robustness of vision transformers and pushes them to focus on the foreground area. 

This paper proposed a new data augmentation method following a token-level augmentation strategy to be more suitable for transformer-based architectures. In addition, instead of assigning a target of mixed images with a linear combination, TokenMix gets the label of the mixed images with content-based neural activation maps. The evaluation study reveals that TokenMix can improve the occlusion robustness and help vision transformers focus on the foreground region of input images. Results show that TokenMix is constantly improving various transformer-based architectures.

This Article is written as a research summary article by Marktechpost Research Staff based on the research paper 'TokenMix: Rethinking Image Mixing for Data Augmentation in Vision Transformers'. All Credit For This Research Goes To Researchers on This Project. Checkout the paper and github link.

Please Don't Forget To Join Our ML Subreddit

Mahmoud is a PhD researcher in machine learning. He also holds a
bachelor's degree in physical science and a master's degree in
telecommunications and networking systems. His current areas of
research concern computer vision, stock market prediction and deep
learning. He produced several scientific articles about person re-
identification and the study of the robustness and stability of deep