Convolutional neural networks (CNNs) are robust tools that use deep learning to perform generative and descriptive tasks such as image recognition. However, they are resource-intensive. State-of-the-art CNNs comprise hundreds of layers and thousands of channels, which increases computation time and memory use. That is why implementing them on low-power edge devices in Internet-of-Things (IoT) networks is challenging.
Researchers from the Tokyo Institute of Technology introduce an efficient sparse CNN processor architecture and training algorithms to address this challenging task. Their proposed method enables the seamless integration of CNN models on edge devices.
The proposed “sparse” CNNs are obtained by “pruning,” which removes weights that do not contribute significantly to a model’s performance. This significantly reduces computation costs while maintaining model accuracy. The resulting networks are more compact and compatible with edge devices. However, sparse techniques are inefficient in real-world settings because they limit weight reusability and result in irregular data structures.
The team, therefore, introduces a novel 40 nm sparse CNN chip that achieves both high accuracy and efficiency. This chip uses a Cartesian-product MAC (multiply and accumulate) array and “pipelined activation aligners” that spatially shift “activations” onto a regular Cartesian MAC array.
On a parallel computational array, regular and dense computations are more efficient than irregular or sparse ones. Using the new architecture, which combines the MAC array with the activation aligners, the team was able to achieve dense computation of sparse convolution. Furthermore, zero weights could be skipped in both storage and computation, resulting in better resource use.
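The idea of computing a sparse convolution densely can be illustrated in software. The following is a minimal 1D sketch, not the chip's actual design: the function names and the one-dimensional setting are assumptions made for illustration. Only nonzero weights are stored, each with its offset, and the activations are shifted ("aligned") so that every stored weight drives a regular multiply-accumulate loop.

```python
def dense_conv1d(activations, weights):
    """Reference: valid 1D convolution that touches every weight, zeros included."""
    k = len(weights)
    n_out = len(activations) - k + 1
    return [sum(weights[j] * activations[i + j] for j in range(k))
            for i in range(n_out)]


def compress_weights(weights):
    """Store only the nonzero weights, each paired with its original offset."""
    return [(offset, w) for offset, w in enumerate(weights) if w != 0]


def sparse_conv1d(activations, compressed, kernel_size):
    """Same result as dense_conv1d, but zero weights are never stored or multiplied."""
    n_out = len(activations) - kernel_size + 1
    out = [0] * n_out
    for offset, w in compressed:
        # "Align": shift the activation window by the weight's offset so each
        # output position lines up with the activation it needs.
        aligned = activations[offset:offset + n_out]
        # Regular multiply-accumulate over the aligned window.
        for i in range(n_out):
            out[i] += w * aligned[i]
    return out


# Hypothetical example data: a kernel in which 3 of 5 weights were pruned.
activations = [1, 2, 3, 4, 5, 6]
weights = [0, 2, 0, 0, -1]
compressed = compress_weights(weights)

sparse_result = sparse_conv1d(activations, compressed, len(weights))
dense_result = dense_conv1d(activations, weights)
print(compressed, sparse_result, dense_result)
```

In this toy case the sparse path performs two multiply-accumulate passes instead of five, yet produces the same output as the dense reference.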
“Tunable sparsity” is a crucial component of the suggested approach. Although sparsity can boost efficiency by reducing computing complexity, the sparsity level also influences prediction accuracy. The researchers therefore suggest adjusting the sparsity to reach the required accuracy and efficiency and to understand the accuracy-sparsity relationship.
They employed “gradual pruning” and “dynamic quantization” (DQ) methodologies on typical image data sets (such as CIFAR100 and ImageNet) to generate highly efficient “sparse and quantized” models.
Gradual pruning prunes the network in incremental steps by removing the smallest weight in each channel. DQ, on the other hand, quantizes the weights of the neural network to low bit-length values, with the activations quantized during inference.
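The two techniques can be sketched in a few lines of Python. This is an illustrative simplification, not the team's training algorithm: the function names, the toy weights, and the use of plain symmetric uniform quantization are all assumptions, and a real flow would normally retrain the network between pruning steps.

```python
def prune_smallest(channel):
    """Zero out the smallest-magnitude weight that is still nonzero in one channel."""
    nonzero = [i for i, w in enumerate(channel) if w != 0]
    pruned = list(channel)
    if nonzero:
        smallest = min(nonzero, key=lambda i: abs(channel[i]))
        pruned[smallest] = 0.0
    return pruned


def gradual_prune(channels, steps):
    """Prune incrementally: one weight per channel per step."""
    for _ in range(steps):
        channels = [prune_smallest(ch) for ch in channels]
        # Assumption: a real training flow would fine-tune the surviving
        # weights here before the next pruning step.
    return channels


def sparsity(channels):
    """Fraction of weights that are zero."""
    total = sum(len(ch) for ch in channels)
    zeros = sum(1 for ch in channels for w in ch if w == 0)
    return zeros / total


def quantize(values, bits):
    """Symmetric uniform quantization to signed integers of the given bit length.

    The scale is derived from the values passed in, so the same function can be
    applied to activations at inference time.
    """
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return [round(v / scale) for v in values], scale


# Hypothetical weights: two channels with four weights each.
channels = [[0.5, -0.1, 0.3, 0.05], [-0.2, 0.8, 0.02, -0.6]]
pruned = gradual_prune(channels, steps=2)
q_weights, scale = quantize(pruned[0], bits=4)
print(pruned, sparsity(pruned), q_weights)
```

Varying `steps` changes the resulting sparsity, which is one simple way to explore the accuracy-sparsity trade-off the researchers describe.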
The team evaluated the pruned and quantized model on a prototype CNN chip. They measured 5.30 dense TOPS/W (tera operations per second per watt, a metric for assessing performance efficiency), which is equivalent to 26.5 sparse TOPS/W of the base model.
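The relationship between the two figures can be checked with simple arithmetic. The sketch below assumes that the "sparse" figure counts the skipped zero-weight operations of the base model as if they had been performed; the article does not state this explicitly, so the implied sparsity is an inference rather than a reported result.

```python
dense_tops_per_watt = 5.30   # measured on the prototype chip (from the article)
sparse_tops_per_watt = 26.5  # equivalent figure for the base model (from the article)

# Ratio between the two efficiency figures.
ratio = sparse_tops_per_watt / dense_tops_per_watt

# Assumption: the ratio reflects the share of operations skipped by pruning.
implied_density = 1 / ratio          # fraction of weights that would be kept
implied_sparsity = 1 - implied_density

print(f"ratio: {ratio:.2f}x, implied sparsity: {implied_sparsity:.0%}")
```

Under that assumption, the fivefold ratio would correspond to a model in which roughly one weight in five is retained.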
The team states that their proposed architecture and sparse CNN training algorithm have broad potential in applications ranging from smartphones to industrial IoT.