Researchers at Monash University Propose ‘EcoFormer,’ an Energy-Saving Attention Mechanism with Linear Complexity That Reduces the Energy Footprint by 73%

The transformer is a deep learning architecture that models sequential data and has achieved remarkable performance across a wide range of tasks over the past few years. However, its enormous computational and energy costs frequently prevent its use in practical applications, particularly on edge devices with limited resources. To improve efficiency, researchers often resort to binarization as a way to compress models. Binarization converts floating-point values into binary ones, which reduces resource consumption because bitwise operations are cheap. However, current binarization techniques focus on minimizing the information loss for the input distribution and ignore the pairwise similarity modeling at the core of the attention mechanism. In their recent study, ‘EcoFormer: Energy-Saving Attention with Linear Complexity,’ a research team from Monash University addresses this problem and proposes EcoFormer, a novel binarization paradigm for attention with linear complexity. By replacing expensive multiply-accumulate operations with simple accumulations, the method reduces the on-chip energy footprint by 73 percent on ImageNet.

The work is motivated by a fundamental goal: reducing the high energy cost of attention by applying binary quantization to kernel embeddings, so that energy-expensive multiplications can be replaced with energy-efficient bitwise operations. However, the researchers point out that conventional binary quantization techniques only aim to reduce the quantization error between full-precision and binary values. They fail to maintain the pairwise semantic similarity between the tokens in attention, which hurts performance. The team therefore developed a binarization technique that maps the original high-dimensional queries and keys to low-dimensional binary codes using kernelized hashing with a Gaussian Radial Basis Function (RBF) kernel. The mapping is designed to preserve the pairwise similarity found in softmax attention.
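To make the idea of kernelized hashing more concrete, here is a minimal sketch in plain Python. It follows the generic form used in kernel-based hashing, where each bit is the sign of a weighted sum of mean-centered RBF kernel values against a small set of anchor points. Everything named here (`rbf_kernel`, `kernelized_hash`, `anchors`, `weights`, `means`, `gamma`) is an assumption made for illustration, and the weights are random, whereas EcoFormer learns its hash functions. It should not be read as the authors' implementation.

```python
import math
import random


def rbf_kernel(x, y, gamma=0.5):
    """Gaussian RBF kernel: exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def kernelized_hash(x, anchors, weights, means, gamma=0.5):
    """Map a real-valued vector x to a code in {-1, +1}^num_bits.

    Bit r is sign(sum_j weights[r][j] * (kernel(anchors[j], x) - means[j])).
    """
    centered = [rbf_kernel(a, x, gamma) - m for a, m in zip(anchors, means)]
    code = []
    for w in weights:
        score = sum(wj * cj for wj, cj in zip(w, centered))
        code.append(1 if score >= 0 else -1)
    return code


def hamming(c1, c2):
    """Number of positions where two codes differ."""
    return sum(1 for a, b in zip(c1, c2) if a != b)


rng = random.Random(0)
dim, num_anchors, num_bits = 8, 6, 4  # illustrative sizes only

# Hypothetical anchor points; a real system would sample them from queries/keys.
anchors = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_anchors)]
# Hypothetical random projection weights; in EcoFormer these are learned.
weights = [[rng.gauss(0, 1) for _ in range(num_anchors)] for _ in range(num_bits)]
# Mean kernel value per anchor (estimated on the anchors), used for centering.
means = [sum(rbf_kernel(a, s) for s in anchors) / num_anchors for a in anchors]

query = [rng.gauss(0, 1) for _ in range(dim)]
key = [rng.gauss(0, 1) for _ in range(dim)]

code_q = kernelized_hash(query, anchors, weights, means)
code_k = kernelized_hash(key, anchors, weights, means)
print("query code:", code_q)
print("key code:  ", code_k)
print("Hamming distance:", hamming(code_q, code_k))
```

With learned rather than random weights, the aim is for queries and keys that score highly under softmax attention to receive codes with a small Hamming distance.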

EcoFormer uses this binarization technique to preserve semantic similarity in attention while approximating self-attention in linear time and at a lower energy cost. Attention can be approximated with linear complexity by expressing it as a dot product of binary codes. This works because the inner product of two binary codes is equivalent to their Hamming distance, and because matrix multiplication is associative, so keys and values can be combined first and reused for every query. Thanks to the compact binary representations of queries and keys, most of the costly multiply-accumulate operations in attention can also be replaced with simple accumulations, which significantly reduces the on-chip energy footprint on edge devices.
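The two ingredients named above can be sketched in a few lines of plain Python. For codes in {-1, +1}^b, the inner product equals b minus twice the Hamming distance, so similarity is a function of the Hamming distance alone. The sketch then uses associativity to summarize all keys and values once instead of forming an N x N similarity matrix. The feature expansion `agreement_features`, which turns each code into 0/1 features so that the similarity is the non-negative count of agreeing bits, is an assumption chosen for this illustration. It is not claimed to be the construction used in the paper.

```python
def agreement_features(code):
    """Expand a {-1,+1} code into 0/1 features: [bit == +1 ...] + [bit == -1 ...].

    The dot product of two such feature vectors counts the agreeing bits,
    i.e. len(code) - HammingDistance.
    """
    return [1 if b == 1 else 0 for b in code] + [1 if b == -1 else 0 for b in code]


def hamming(c1, c2):
    return sum(1 for a, b in zip(c1, c2) if a != b)


def quadratic_attention(q_feats, k_feats, values):
    """Reference version: computes every query-key similarity (N x N work)."""
    out = []
    for q in q_feats:
        sims = [sum(a * b for a, b in zip(q, k)) for k in k_feats]
        total = sum(sims)  # assumes at least one agreeing bit with some key
        out.append([sum(s * v[c] for s, v in zip(sims, values)) / total
                    for c in range(len(values[0]))])
    return out


def linear_attention(q_feats, k_feats, values):
    """Same result via associativity: (Q K^T) V == Q (K^T V)."""
    m, d = len(k_feats[0]), len(values[0])
    kv = [[0.0] * d for _ in range(m)]  # sum over keys of phi(k) v^T
    k_sum = [0] * m                     # sum over keys of phi(k)
    for k, v in zip(k_feats, values):
        for r in range(m):
            if k[r]:                    # 0/1 feature: accumulate, no multiply
                k_sum[r] += 1
                for c in range(d):
                    kv[r][c] += v[c]
    out = []
    for q in q_feats:
        active = [r for r in range(m) if q[r]]
        denom = sum(k_sum[r] for r in active)
        out.append([sum(kv[r][c] for r in active) / denom for c in range(d)])
    return out


q_codes = [[1, -1, 1], [-1, -1, 1]]
k_codes = [[1, 1, 1], [-1, -1, -1], [1, -1, -1]]
values = [[1.0, 0.0], [0.0, 2.0], [4.0, 4.0]]

q_feats = [agreement_features(c) for c in q_codes]
k_feats = [agreement_features(c) for c in k_codes]

print(quadratic_attention(q_feats, k_feats, values))  # [[2.0, 2.0], [1.25, 2.0]]
print(linear_attention(q_feats, k_feats, values))     # [[2.0, 2.0], [1.25, 2.0]]
```

In the linear version, the per-query cost depends on the code length and the value dimension rather than on the number of tokens, and the 0/1 features mean the key/value summary is built purely by accumulation.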

In their empirical evaluation, the team used ImageNet-1K to compare the proposed EcoFormer with conventional multi-head self-attention (MSA). According to the results, EcoFormer reduces the on-chip energy footprint by 73 percent with only a 0.33 percent drop in performance. The team intends to apply EcoFormer to NLP applications like machine translation and audio analysis, and to investigate binarizing the value vectors in attention, the multi-layer perceptrons, and the non-linearities of transformers to reduce energy costs further. Overall, the proposed energy-saving attention mechanism with linear complexity is a promising way to lower the energy and compute costs that hold back the widespread deployment of transformer models.

This article is a research summary written by Marktechpost Staff based on the research paper 'EcoFormer: Energy-Saving Attention with Linear Complexity'. All credit for this research goes to the researchers on this project. Check out the paper and GitHub.
