DeepMind Researchers Propose Normalizer-Free ResNets (NFNets) To Achieve Large-Scale Image-Recognition Without Batch Normalization

A team of researchers at DeepMind introduces Normalizer-Free ResNets (NFNets), a family of image recognition models that can be trained without batch normalization layers. The researchers present a new gradient clipping algorithm that lets these models match and even outperform the best batch-normalized classification models on large-scale datasets while also significantly reducing training time.

Batch normalization and its shortcomings

Batch normalization is a key component of most image classification models. It accelerates training, enables higher learning rates, improves generalization accuracy, and has a regularizing effect. However, batch normalization suffers from three practical disadvantages:

  • It is costly in memory and time.
  • It introduces discrepancies between model behaviors during training and inference time, thereby requiring additional fine-tuning.
  • It destroys the independence between training examples in the minibatch.
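
The third point can be seen in a few lines of Python. The sketch below is a simplified, hypothetical batch normalization over scalar activations (no learned scale or shift, no running statistics); the function name `batch_norm` and the sample values are assumptions made for illustration. It shows that the normalized value of one example changes when a different example in the same minibatch changes.

```python
import math

def batch_norm(batch, eps=1e-5):
    """Normalize scalar activations with the minibatch's own mean and variance.

    Simplified for illustration: no learned scale/shift, no running statistics.
    """
    mean = sum(batch) / len(batch)
    var = sum((x - mean) ** 2 for x in batch) / len(batch)
    return [(x - mean) / math.sqrt(var + eps) for x in batch]

# The first example (1.0) is identical in both minibatches;
# only the last example differs.
out_a = batch_norm([1.0, 2.0, 3.0])
out_b = batch_norm([1.0, 2.0, 30.0])

# The normalized value of the first example differs between the two batches.
print(out_a[0], out_b[0])
```

Because the mean and variance are computed over the whole minibatch, the output for any one example depends on its neighbors, which is the loss of independence described above.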

Many recent studies have successfully trained deep ResNets without normalization. However, the resulting models do not match the test accuracy of SOTA batch-normalized networks and are frequently unstable under strong data augmentations or large learning rates.

Novel Normalizer-Free networks

A team of researchers at DeepMind has designed a family of Normalizer-Free ResNets (NFNets) to address these weaknesses. NFNets can be trained with larger batch sizes and stronger data augmentations, and they have set new SOTA validation accuracies on ImageNet.

To train NFNets with larger batch sizes and stronger data augmentations, the team employed Adaptive Gradient Clipping (AGC). AGC clips gradients based on the unit-wise ratio of gradient norms to parameter norms, so a unit's gradient is scaled down only when it is large relative to that unit's weights.
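
A minimal sketch of this idea in plain Python follows. It is not the authors' implementation: the function names, the choice to treat each row of a weight matrix as one unit, the threshold `clip`, and the small floor `eps` on the weight norm are all assumptions made for illustration. A row's gradient is rescaled only when its norm exceeds `clip` times the norm of the corresponding weights.

```python
import math

def unitwise_norm(row):
    """L2 norm of one unit's values (here, one row of a matrix)."""
    return math.sqrt(sum(v * v for v in row))

def adaptive_grad_clip(weights, grads, clip=0.01, eps=1e-3):
    """Return a clipped copy of grads, treating each row as one unit.

    Assumed rule: if ||g|| / max(||w||, eps) > clip, rescale g so that
    the ratio equals clip; otherwise leave g unchanged.
    """
    clipped = []
    for w_row, g_row in zip(weights, grads):
        # eps (assumed value) keeps units with near-zero weights from
        # having every gradient clipped to zero.
        w_norm = max(unitwise_norm(w_row), eps)
        g_norm = unitwise_norm(g_row)
        max_norm = clip * w_norm
        if g_norm > max_norm:
            scale = max_norm / g_norm
            clipped.append([g * scale for g in g_row])
        else:
            clipped.append(list(g_row))
    return clipped

# Illustrative values: two units with weight norms 5.0 and 1.0.
weights = [[3.0, 4.0], [0.6, 0.8]]
grads = [[0.03, 0.04], [0.3, 0.4]]   # gradient norms 0.05 and 0.5
clipped = adaptive_grad_clip(weights, grads, clip=0.1)

# The first row is small relative to its weights and is left alone;
# the second row is rescaled to a norm of about clip * 1.0.
print(clipped)
```

Unlike clipping to a fixed global norm, the threshold here scales with each unit's weights, so large-weight units may take proportionally larger gradient steps.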

To evaluate the effect of AGC, the researchers ran a range of ablations comparing batch-normalized ResNets to NF-ResNets with and without AGC. The results show that AGC efficiently scales NF-ResNets to larger batch sizes.

Building on AGC, the team designed and trained a family of Normalizer-Free architectures (NFNets). The architectures start from an SE-ResNeXt-D model, a strong baseline for Normalizer-Free Networks, with revised width and depth patterns and a second spatial convolution. Lastly, the team applied AGC to every parameter except the linear weight of the classifier layer.

The researchers compared the accuracy of the NFNet models on the ImageNet dataset with a set of standard models such as SENet (Hu et al., 2018), LambdaNet (Bello, 2021), BoTNet (Srinivas et al., 2021), and DeiT (Touvron et al., 2020). They also tested NFNets in the transfer learning regime, pre-training on a dataset of 300 million labeled images to validate the suitability of Normalizer-Free networks for transfer learning after large-scale pre-training.

The NFNet-F5 model achieved a top-1 validation accuracy of 86.0 percent, outperforming the former state-of-the-art model, EfficientNet-B8. Additionally, the researchers showed that Normalizer-Free models outperform their batch-normalized counterparts when fine-tuned on ImageNet after large-scale pre-training, obtaining an accuracy of 89.2 percent.