Researchers from MIT and NVIDIA Developed Two Complementary Techniques That Could Dramatically Boost the Speed and Performance of Demanding Machine Learning Tasks

Researchers from MIT and NVIDIA have developed two techniques that accelerate the processing of sparse tensors. Tensors are fundamental data structures in machine learning models: multi-dimensional arrays that organize and store data.

The goal of both techniques is to take effective advantage of the tensors' zero values. A sparse tensor can be handled without processing its zeros, which saves both memory and computation. For example, multiplying anything by zero gives zero, so that operation can be skipped. The tensor can also be compressed, because zeros don't need to be kept, which allows more of it to be stored in on-chip memory.
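The idea of skipping zeros and storing only what remains can be illustrated with a minimal Python sketch. The names `compress` and `sparse_dot` and the sample values are hypothetical, chosen only for illustration; real accelerators use hardware-specific compressed formats rather than Python lists.

```python
def compress(dense):
    """Keep only (index, value) pairs for the nonzero entries."""
    return [(i, v) for i, v in enumerate(dense) if v != 0]


def sparse_dot(compressed, other):
    """Dot product that only touches the stored nonzeros."""
    return sum(v * other[i] for i, v in compressed)


# Illustrative data: 5 of the 8 weights are zero.
weights = [0, 3, 0, 0, 2, 0, 0, 1]
inputs = [5, 1, 7, 2, 4, 9, 6, 3]

packed = compress(weights)
dense_result = sum(w * x for w, x in zip(weights, inputs))
sparse_result = sparse_dot(packed, inputs)

print(packed)          # three stored pairs instead of eight values
print(sparse_result)   # same answer as the dense computation
```

Here the sparse version performs three multiplications instead of eight and stores three pairs instead of eight values, while producing the same result as the dense computation.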

Sparsity in tensors arises when unnecessary elements are removed by replacing some values with zeros, a process known as pruning. The degree of sparsity and the positions of the zeros can differ across models. Researchers often constrain the locations of non-zero values to make them easier to find in large models. However, hardware accelerators are usually designed for particular sparsity patterns, which limits their adaptability.

The research team has developed a hardware accelerator called HighLight, which can efficiently handle diverse sparsity patterns. The researchers use hierarchically structured sparsity to represent many types of sparsity patterns as compositions of simpler ones. In this method, they divide the values in a tensor into small groups, and each group follows a simple pattern. These small groups are then combined into larger groups, forming a hierarchy. Each collection of groups also follows a simple pattern (for example, in a level with four groups, one group contains only zeros and the other three do not). This process continues at higher levels, but the pattern stays simple at each step.
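As a rough illustration of a two-level hierarchy, the Python sketch below checks whether a list of values follows one. The upper-level rule (at most three non-empty groups out of four) mirrors the example in the text; the lower-level rule (at most two non-zero values per group of four), the function names, and the sample data are assumptions made for illustration and are not taken from the HighLight design.

```python
def group_counts(values, group_size):
    """Number of nonzeros in each consecutive group of group_size values."""
    return [sum(1 for v in values[i:i + group_size] if v != 0)
            for i in range(0, len(values), group_size)]


def follows_hierarchy(values, inner=(2, 4), outer=(3, 4)):
    """Check a two-level pattern given as (max kept, group size) pairs."""
    inner_keep, inner_size = inner
    outer_keep, outer_size = outer
    counts = group_counts(values, inner_size)
    # Level 1: every small group holds at most inner_keep nonzeros.
    if any(c > inner_keep for c in counts):
        return False
    # Level 2: among every outer_size groups, at most outer_keep are non-empty.
    occupied = [1 if c > 0 else 0 for c in counts]
    return all(sum(occupied[i:i + outer_size]) <= outer_keep
               for i in range(0, len(occupied), outer_size))


good = [1, 0, 2, 0,
        0, 0, 0, 0,
        0, 3, 0, 4,
        5, 0, 0, 6]
bad = [1, 2, 3, 0,   # three nonzeros in one group breaks the level-1 rule
       0, 0, 0, 0,
       0, 3, 0, 4,
       5, 0, 0, 6]

density = sum(1 for v in good if v != 0) / len(good)
print(follows_hierarchy(good), follows_hierarchy(bad), density)
```

In this sketch the overall density of `good` is the product of the per-level densities, (2/4) × (3/4) = 0.375, which shows how simple per-level rules can compose into a range of overall sparsity degrees.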

This simplicity enables HighLight to find and skip zeros more efficiently, so it can take full advantage of the opportunity to cut excess computation. According to the researchers, the accelerator design achieved an energy-delay product (a metric related to energy efficiency) about six times better than that of other approaches.
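For readers unfamiliar with the metric, the energy-delay product is the energy a task consumes multiplied by the time it takes, so lower is better. The short Python sketch below shows the arithmetic. The energy and delay values are made-up normalized numbers for illustration only; they are not measurements of HighLight or of any other accelerator.

```python
def energy_delay_product(energy, delay):
    """Energy multiplied by delay; a lower value is better."""
    return energy * delay


# Hypothetical, normalized numbers (NOT measured results).
baseline = energy_delay_product(energy=1.0, delay=1.0)
candidate = energy_delay_product(energy=0.5, delay=0.25)

improvement = baseline / candidate
print(improvement)  # how many times better the candidate's EDP is
```

Because the two factors multiply, a design that halves energy and also cuts delay to a quarter improves the product by a factor of eight in this toy example.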

Researchers can also leverage sparsity to move and process data on a computer chip more efficiently. Since tensors are often larger than what can be stored in the memory buffer on the chip, the chip only grabs and processes a chunk of the tensor at a time. These chunks are called tiles.

To maximize the buffer’s capacity and reduce the number of times the chip needs to access external memory (which can be energy-intensive and slow down processing), researchers aim to use the largest possible tile that fits into the buffer.

Since many data values are zero, a larger tile can fit into the buffer than its raw capacity might suggest, as zero values don’t need to be stored. However, the number of zero values can vary across different parts of the data, and therefore, it can also differ for each tile.

To deal with this, the research group suggested using an overbooking technique to allow for an increase in tile size. In a sparse data set, a tile size can be chosen so that most tiles have enough zeros to fit into the buffer. Occasionally, a tile may have more non-zero values than the buffer can accommodate. In such cases, these excess data are pushed out of the buffer.
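The trade-off behind overbooking can be sketched in a few lines of Python. The example picks the largest tile size for which no more than an allowed fraction of tiles exceed the buffer capacity. The data, the capacity of 8 non-zero values, the candidate sizes, and the function names are all hypothetical and chosen only to make the effect visible.

```python
def tile_nonzeros(data, tile_size):
    """Count the nonzeros in each consecutive tile."""
    return [sum(1 for v in data[i:i + tile_size] if v != 0)
            for i in range(0, len(data), tile_size)]


def overbooked_fraction(data, tile_size, capacity):
    """Fraction of tiles whose nonzeros do not fit in the buffer."""
    counts = tile_nonzeros(data, tile_size)
    return sum(1 for c in counts if c > capacity) / len(counts)


def largest_tile(data, capacity, allowed_fraction, candidates):
    """Largest candidate size whose overbooked fraction is acceptable."""
    best = None
    for size in sorted(candidates):
        if overbooked_fraction(data, size, capacity) <= allowed_fraction:
            best = size
    return best


# Mostly 25% dense, with one fully dense patch at positions 32..47.
data = [1 if (i % 4 == 0 or 32 <= i < 48) else 0 for i in range(128)]
sizes = [8, 16, 32, 64]

print(tile_nonzeros(data, 16))
print(largest_tile(data, 8, 0.0, sizes))    # no tile may overflow
print(largest_tile(data, 8, 0.25, sizes))   # up to a quarter may overflow
```

With this data, forbidding any overflow forces a tile size of 8 because of the single dense patch, while tolerating overflow in a quarter of the tiles allows a tile size of 32.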

The research group has enabled the hardware to re-fetch only the displaced data, without fetching and processing the entire tile again. They achieve this by modifying the "tail end" of the buffer, which gives the technique its name, Tailors.
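Why re-fetching only the displaced data matters can be seen with a toy traffic model in Python. It assumes a tile is reused over several passes, that values which fit stay resident in the buffer, and that only the overflow has to be fetched again on each later pass. The model, the function name `fetch_traffic`, and the numbers are simplifying assumptions for illustration; they do not describe the actual Tailors buffer mechanism, which also dedicates part of the buffer to streaming.

```python
def fetch_traffic(nonzeros, capacity, passes, refetch_overflow_only):
    """Values fetched from external memory over all passes (toy model)."""
    overflow = max(0, nonzeros - capacity)
    if overflow == 0:
        # The tile fits: fetch it once and reuse it from the buffer.
        return nonzeros
    if refetch_overflow_only:
        # First pass loads everything; later passes reload only the overflow.
        return nonzeros + overflow * (passes - 1)
    # Otherwise the whole tile is fetched again on every pass.
    return nonzeros * passes


# Hypothetical tile with 10 nonzeros, a buffer for 8, reused 4 times.
with_support = fetch_traffic(10, 8, 4, refetch_overflow_only=True)
without_support = fetch_traffic(10, 8, 4, refetch_overflow_only=False)
fits = fetch_traffic(6, 8, 4, refetch_overflow_only=False)

print(with_support, without_support, fits)
```

Under these assumptions the overbooked tile costs 16 fetched values instead of 40, which is why an occasional overflow can be cheaper than shrinking every tile to avoid it.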

Additionally, they developed an approach named Swiftiles that determines the tile size efficiently while capitalizing on the benefits of overbooking. Swiftiles reduces the number of times the hardware must inspect the tensor to identify a good tile size, thereby saving computation.

The combination of Tailors and Swiftiles results in a performance boost, doubling the speed while requiring only half the energy consumption compared to existing hardware accelerators that cannot handle overbooking.

According to the researchers, Swiftiles can estimate the optimal tile size without requiring multiple iterations to refine the estimate. This is possible because overbooking is supported: a tile that turns out to be too large no longer breaks the computation. Even when the estimate is off by a significant amount, a notable speedup can still be achieved because of the way the non-zero values are distributed.
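A one-shot estimate of this kind could look like the Python sketch below. It samples a few small tiles, finds the non-zero count that all but a target fraction of them stay under, and scales the tile size so that this count matches the buffer capacity. This is one plausible heuristic written for illustration, not the published Swiftiles algorithm; the function name, the evenly spaced sampling, the linear-scaling assumption, and the data are all hypothetical.

```python
import math


def estimate_tile_size(data, capacity, initial_size, target_overbooked,
                       num_samples):
    """Single-pass tile size estimate from a handful of sampled tiles."""
    num_tiles = len(data) // initial_size
    step = max(1, num_tiles // num_samples)
    picks = list(range(0, num_tiles, step))[:num_samples]
    counts = sorted(
        sum(1 for v in data[t * initial_size:(t + 1) * initial_size] if v != 0)
        for t in picks)
    # Count that all but target_overbooked of the sampled tiles stay within.
    idx = max(0, math.ceil((1.0 - target_overbooked) * len(counts)) - 1)
    typical = max(counts[idx], 1)
    # Assumption: nonzeros per tile grow linearly with tile size.
    return max(1, (initial_size * capacity) // typical)


# Mostly 25% dense, with one fully dense patch at positions 32..47.
data = [1 if (i % 4 == 0 or 32 <= i < 48) else 0 for i in range(128)]

no_overbooking = estimate_tile_size(data, 8, 8, 0.0, 4)
with_overbooking = estimate_tile_size(data, 8, 8, 0.25, 4)
print(no_overbooking, with_overbooking)
```

The sketch inspects only four small tiles once. If the resulting estimate is somewhat too large, the consequence under overbooking is that a few more tiles overflow, rather than the tile size being unusable.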

Check out the Paper 1, Paper 2, and MIT Research Blog. All credit for this research goes to the researchers of this project.