Researchers from Northwestern University Come Up With More Efficient AI Training With a Systolic Neural CPU

Since the inception of modern machine learning, the goal has been to make workloads that use the technology as efficient as possible. Given the nature of the applications, the speed and efficiency of deep neural networks (DNNs) have been at the top of the agenda. These models power a wide range of tasks, from serving as a cornerstone of the coming autonomous-vehicle era to data analytics. They also help businesses become more user-friendly, more profitable, and better able to thwart cyberattacks.

Improving the end-to-end performance of deep learning tasks remains largely unexplored territory. Demanding machine-learning workloads still require a CPU for pre- and post-processing and data preparation, and the most prevalent architecture is heterogeneous, combining a separate CPU core with an accelerator. So far, however, these designs have not fully addressed the end-to-end bottlenecks.

Other efforts have attacked parts of this problem through data compression, data-movement reduction, and memory-bandwidth improvements. In one case, an accelerator coherency port (ACP) was devised to speed up data transfer by requesting data directly from the CPU’s last-level cache rather than going through a DMA engine.

Northwestern University researchers present a new design that combines a traditional CPU with a systolic convolutional neural network (CNN) accelerator on a single core, resulting in a highly programmable and versatile unified design. The research team reports a core utilization rate above 95%. With this approach, redundant data transfer is eliminated and the latency of end-to-end ML operations is minimized.


While running image-classification tasks, the new architecture, the systolic neural CPU (SNCPU), delivered latency improvements of 39 to 64 percent and an energy efficiency of 0.65–1.8 TOPS/W. The main goal of the SNCPU is to keep everything in a single core, reducing the time spent moving data and thus increasing the core’s utilization, while also improving the chip’s ability to be reconfigured as needed.

For weight-stationary tasks, the accelerator mode provides conventional systolic dataflows. In accelerator mode, each row or column has an accumulator (ACT module) that adds SIMD support for pooling, ReLU, and accumulation. Although data is usually kept local within the reconfigurable SRAM banks, L2 SRAM banks are included to exchange data between CPU cores during processing in CPU mode.
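The weight-stationary dataflow mentioned above can be sketched in a few lines: weights stay pinned in each PE while activations pulse horizontally through the rows and partial sums flow down the columns. The simulation below is our own illustrative model of that scheme, not the authors’ hardware.

```python
# Minimal weight-stationary systolic sketch (our illustrative model, not
# the authors' RTL). PE(i, j) holds weight W[i][j] permanently; activations
# move rightward one PE per cycle, partial sums flow down the columns.

def systolic_matvec(W, x):
    """Return y with y[j] = sum_i x[i] * W[i][j] (i.e., x @ W)."""
    R, C = len(W), len(W[0])
    a = [[0] * C for _ in range(R)]   # activation register in each PE
    p = [[0] * C for _ in range(R)]   # partial-sum register in each PE
    out = [0] * C
    for t in range(R + C):            # enough cycles to drain the array
        new_a = [[0] * C for _ in range(R)]
        new_p = [[0] * C for _ in range(R)]
        for i in range(R):
            for j in range(C):
                # row i's activation is injected at cycle i (input skew)
                if j == 0:
                    a_in = x[i] if t == i else 0
                else:
                    a_in = a[i][j - 1]
                p_in = p[i - 1][j] if i > 0 else 0
                new_p[i][j] = p_in + a_in * W[i][j]   # the PE's MAC step
                new_a[i][j] = a_in
        for j in range(C):
            if t == R - 1 + j:        # column j's sum is complete now
                out[j] = new_p[R - 1][j]
        a, p = new_a, new_p
    return out
```

For example, `systolic_matvec([[1, 2], [3, 4], [5, 6]], [1, 1, 1])` returns `[9, 12]`, matching a direct vector–matrix product.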

To build a 32-bit RISC-V CPU pipeline out of the systolic PE array, each PE comprises a simple pipelined MAC unit with 8-bit-wide inputs and a 32-bit output:

  • The first PE in each row or column generates the instruction-cache address.
  • Two more PEs fetch instructions, reusing the internal 32-bit register and the 8-bit input registers.
  • Two PEs are employed for the decode stage.
  • Three PEs are combined for the execution stage: one as an ALU with additional circuitry for Boolean operations and a shifter, one to generate the next instruction-cache address, and one to transmit the execution results to the registers.
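As a rough mental model, a single PE’s MAC datapath and the eight-PE pipeline allocation above can be written out as follows. The wrap-around masking and the role names are our assumptions for illustration, not the paper’s RTL:

```python
# Toy model of one PE's MAC datapath: 8-bit operands, 32-bit accumulator
# (the wrap-around masking behavior is our assumption).
MASK8, MASK32 = 0xFF, 0xFFFFFFFF

def pe_mac(acc, a, w):
    """One MAC step: acc += a * w with 8-bit inputs and a 32-bit result."""
    return (acc + (a & MASK8) * (w & MASK8)) & MASK32

# In CPU mode, eight such PEs per row/column are reassigned to the classic
# pipeline stages listed above (role names are ours, not the paper's).
PIPELINE_ROLES = [
    "icache_addr",            # PE 0: instruction-cache address
    "fetch_0", "fetch_1",     # PEs 1-2: instruction fetch
    "decode_0", "decode_1",   # PEs 3-4: decode stage
    "alu",                    # PE 5: ALU, Boolean ops, shifter
    "next_pc",                # PE 6: next instruction-cache address
    "writeback",              # PE 7: returns results to the registers
]

acc = 0
for a, w in [(3, 4), (250, 2)]:
    acc = pe_mac(acc, a, w)   # acc ends at 3*4 + 250*2 == 512
```

In accelerator mode every PE runs `pe_mac`; in CPU mode the same datapaths are time-shared across the eight roles.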

The total overhead for reconfiguring the CNN accelerator for CPU functions is less than 9.8%, covering the added CPU logic in the PE array (3.4%), the instruction path (6.4%), and the register file (6.4%). The reconfigurable design consumes about 15 percent more power than the original baseline accelerator, so extensive clock gating is used in both CNN and CPU modes to curb the power consumption added by the extra logic.

For end-to-end image-classification tasks, a dedicated four-phase dataflow uses the four different topologies. The SNCPU architecture keeps most data inside the processor core, removing the need for costly data migration and a DMA module; in a typical design, a DMA engine moves input data from the CPU cache to the accelerator’s scratchpad, but this step is skipped in the four-phase SNCPU dataflow. The chip first operates in row-CPU mode, preprocessing the input data, and then switches to column-accelerator mode, with the CPU mode’s data caches serving as feed memory for the CNN accelerator.

After the accelerator completes an entire layer of the CNN model, the SNCPU switches to column-CPU mode to perform data alignment, padding, duplication, and post-processing using the data in the previous accelerator mode’s output memory. In the fourth phase, the SNCPU moves to row-accelerator mode to compute the second layer of the CNN, directly reusing the data cache from the preceding CPU mode.

This four-phase cycle is repeated until all CNN layers have been completed, obviating the need to transfer intermediate data between cores. Furthermore, the SNCPU can be reconfigured into 10 CPU cores, each executing its own instruction stream simultaneously, significantly improving pre- and post-processing throughput over the traditional CPU-and-CNN architecture.
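The four-phase loop described above can be summarized in Python. Every function here is a toy stand-in of our own (the real chip reconfigures one PE array between the four modes rather than calling separate routines), and we assume an even number of CNN layers for simplicity:

```python
# Toy stand-ins for the four SNCPU modes (our own placeholders).

def row_cpu_preprocess(x):           # phase 1: row-CPU mode
    return [v / 255.0 for v in x]    # e.g., normalize raw pixel values

def col_accel_conv(x, w):            # phase 2: column-accelerator mode
    return [v * w for v in x]        # stand-in for one CNN layer

def col_cpu_postprocess(x):          # phase 3: column-CPU mode
    return [max(v, 0.0) for v in x]  # stand-in for align/pad/post-process

def row_accel_conv(x, w):            # phase 4: row-accelerator mode
    return [v * w for v in x]        # stand-in for the next CNN layer

def run_network(pixels, layers):
    x = pixels
    # repeat the four-phase cycle until all layers are done; data stays
    # in the core's caches between phases, so no DMA transfer is needed
    for even, odd in zip(layers[0::2], layers[1::2]):
        x = row_cpu_preprocess(x)    # phase 1
        x = col_accel_conv(x, even)  # phase 2
        x = col_cpu_postprocess(x)   # phase 3
        x = row_accel_conv(x, odd)   # phase 4
    return x
```

The structural point is that alternating row/column orientations let each phase read the previous phase’s memory in place, which is what removes the DMA step.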

The researchers hope that this is just the start of a series of breakthroughs in this field that will benefit the entire scientific community.


