The rise of ML-based approaches to diverse problems in vision and language has driven the creation of application-specific hardware accelerators. Standard approaches to designing an accelerator tailored to a given application, while promising, require manual effort to build a sufficiently accurate hardware simulator, followed by many time-intensive simulations to optimize the desired objective. This means finding the right balance between total compute, memory resources, and communication bandwidth under a variety of design constraints. Moreover, the search for accelerators that satisfy these constraints frequently produces infeasible designs.
A team from Google AI and UC Berkeley introduces PRIME to address these issues. This data-driven optimization approach generates AI chip architectures from logged data, without further hardware simulation. It eliminates the need for time-consuming simulations and allows data from previous experiments to be reused, even when the set of target applications changes, and even in a zero-shot fashion for applications that are unseen but related to those in the training set. PRIME can be trained on data from previous simulations, a database of accelerators that were actually built, and a database of infeasible or failed accelerator designs.
The simplest way to use a database of previously designed accelerators for hardware design is supervised machine learning: train a prediction model that takes an accelerator design as input and predicts its performance objective. New accelerators can then be designed by optimizing the output of this learned model with respect to the input accelerator design. However, this method assumes that the prediction model can accurately predict the cost of every accelerator the optimizer might encounter. It is well known that most prediction models trained with supervised learning are fooled by adversarial examples, inputs on which the learned model predicts incorrect values.
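The naive approach described above can be sketched in a few lines. This is a toy illustration, not PRIME's model: the logged data points are invented, the one-dimensional "design" (a number of processing elements) and the linear model are stand-ins chosen for brevity, and all names are hypothetical.

```python
# Toy logged data: (num_pes, latency_ms). Values are invented for illustration.
logged = [(8, 40.0), (16, 22.0), (32, 13.0), (64, 9.0)]


def fit_linear(data):
    """Ordinary least squares fit of latency ~ a * x + b."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in data)
    var = sum((x - mean_x) ** 2 for x, _ in data)
    a = cov / var
    b = mean_y - a * mean_x
    return a, b


def predict(model, x):
    """Predicted latency of design x under the fitted surrogate."""
    a, b = model
    return a * x + b


model = fit_linear(logged)

# Optimize the surrogate over candidate designs, including designs
# far outside the range covered by the logged data.
candidates = [8, 16, 32, 64, 128, 256, 512]
best = min(candidates, key=lambda x: predict(model, x))
predicted_latency = predict(model, best)
```

In this toy setup the optimizer selects the candidate furthest from the logged data, and the surrogate assigns it a negative latency, which is physically impossible. The optimizer has exploited the model's errors rather than found a good design, which is the failure mode the paragraph describes.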
To overcome this limitation, PRIME learns a robust prediction model that is not easily fooled by adversarial examples. To architect accelerators, this model is then simply optimized with any standard optimizer. More importantly, unlike previous methods, PRIME can learn what not to build by using existing datasets of infeasible accelerators. It does so by augmenting the learned model’s supervised training with additional loss terms that specifically penalize the learned model’s value on infeasible accelerator designs and on adversarial examples during training. This method resembles a form of adversarial training.
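A supervised loss augmented with such penalty terms might look like the following minimal sketch. It assumes the model predicts a cost such as latency (lower is better), so "penalizing" a design means pushing its predicted cost up. The function name, the weights `alpha` and `beta`, and the toy data are hypothetical, and the article does not give the exact form of PRIME's objective.

```python
def conservative_loss(predict, feasible, infeasible, adversarial,
                      alpha=0.1, beta=0.1):
    """Supervised error plus terms that reward high predicted cost
    on adversarial and infeasible designs (assumed form, see lead-in).

    predict     -- callable mapping a design to a predicted cost
    feasible    -- list of (design, measured_cost) pairs
    infeasible  -- list of designs known to be infeasible
    adversarial -- list of designs where the model looks over-optimistic
    """
    mse = sum((predict(x) - y) ** 2 for x, y in feasible) / len(feasible)
    adv_term = sum(predict(x) for x in adversarial) / len(adversarial)
    infeasible_term = sum(predict(x) for x in infeasible) / len(infeasible)
    # Subtracting the two terms means that minimizing the loss raises
    # the predicted cost on adversarial and infeasible designs.
    return mse - alpha * adv_term - beta * infeasible_term


# Toy data: designs are single numbers, costs are invented.
feasible = [(1.0, 10.0), (2.0, 8.0)]
infeasible = [5.0]
adversarial = [4.0]


def optimistic(x):
    # Fits the feasible data exactly and extrapolates to low costs.
    return 12.0 - 2.0 * x


def conservative(x):
    # Fits the feasible data exactly but predicts a high cost elsewhere.
    return 12.0 - 2.0 * x if x <= 2.0 else 50.0


loss_optimistic = conservative_loss(optimistic, feasible, infeasible, adversarial)
loss_conservative = conservative_loss(conservative, feasible, infeasible, adversarial)
```

Both toy models fit the feasible data equally well, but the one that stays pessimistic away from the data receives the lower loss, so training under such an objective would prefer it.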
One of the main advantages of a data-driven approach is that it enables learning highly expressive and generalist optimization objective models that generalize across target applications. Furthermore, these models have the potential to be effective for new applications for which a designer has never attempted to optimize accelerators.
To train PRIME to generalize to unseen applications, the learned model is modified to be conditioned on a context vector that identifies the neural network application the designer wishes to accelerate. A single, large model is then trained on accelerator data for all applications designers have seen so far. Thanks to this contextual modification, PRIME can also optimize accelerators for multiple applications at once and, in a zero-shot fashion, for new, unseen applications.
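The idea of conditioning one model on a context vector can be illustrated with a toy example. Everything here is an assumption made for illustration: the one-hot context encoding, the bilinear form of the model, the hand-picked weights, and the two-feature designs. The article does not describe how PRIME encodes contexts or structures its network.

```python
def one_hot(index, size):
    """Context vector identifying application `index` out of `size`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def predict_latency(weights, design, context):
    """Toy bilinear model: latency = sum_ij W[i][j] * design[i] * context[j].

    Because design and context interact, the same design can receive
    a different predicted latency for each application.
    """
    return sum(
        weights[i][j] * design[i] * context[j]
        for i in range(len(design))
        for j in range(len(context))
    )


def best_design(weights, candidates, context):
    """Pick the candidate with the lowest predicted latency for a context."""
    return min(candidates, key=lambda d: predict_latency(weights, d, context))


# Hand-picked weights: rows are design features, columns are applications.
W = [
    [1.0, 3.0],  # feature 0 (say, compute-heavy) is cheap for app 0
    [2.0, 1.0],  # feature 1 (say, memory-heavy) is cheap for app 1
]
candidates = [(1.0, 0.0), (0.0, 1.0)]

choice_app0 = best_design(W, candidates, one_hot(0, 2))
choice_app1 = best_design(W, candidates, one_hot(1, 2))
```

A single set of weights serves both applications, and changing only the context vector changes which design the optimizer selects.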
The team evaluates PRIME by comparing the optimized accelerator designs it architects for nine applications with the manually optimized EdgeTPU design. PRIME improves latency over EdgeTPU by 2.69x while also reducing chip area usage by 1.50x, even though it was never trained to reduce chip area. On the MobileNet image-classification models, for which the custom-engineered EdgeTPU accelerator was designed, PRIME improves latency by 1.85x.
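To make these factors concrete, the sketch below assumes that an "Nx reduction" means the baseline value divided by N, which is the usual convention but is not spelled out in the article. The baseline is normalized to 1.0 rather than taken from any measurement, and the function names are hypothetical.

```python
def apply_reduction(baseline, factor):
    """Value after an 'Nx reduction', read as baseline / N."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    return baseline / factor


def percent_saved(factor):
    """Percentage of the baseline removed by an 'Nx reduction'."""
    return 100.0 * (1.0 - 1.0 / factor)


# Baselines normalized to 1.0; only the ratios come from the article.
relative_area = apply_reduction(1.0, 1.50)
relative_mobilenet_latency = apply_reduction(1.0, 1.85)
```

Under this reading, a 1.50x reduction in chip area corresponds to about one third less area, and a 1.85x reduction in latency to about 46% less latency.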
PRIME can also design accelerators for multiple applications, and in a zero-shot setting, using logged accelerator data. In both settings, the contextual version of PRIME is trained, with context vectors identifying the target applications, and the learned model is then optimized to obtain the final accelerator. In both settings, PRIME outperforms the best simulator-driven approach, even when little training data is available for a given application but many applications are available. In the zero-shot setting, it does so with a 1.26x reduction in latency. Furthermore, the performance gap widens as the number of training applications grows.
The researchers also compared the best accelerator designed by PRIME with the best accelerator found by the simulator-driven approach. PRIME reduces latency by 1.35x relative to the simulator-driven approach. The comparison suggests that PRIME favors PE (processing element) memory size to accommodate higher memory requirements, which is where significant latency reductions were attainable.
Overall, this method allows the model to optimize for specific applications. By training a single large model on design data from every application for which data exists, PRIME can also optimize for applications with no training data.
The researchers believe that their work holds promise for a variety of applications. These include developing chips for applications that require solving complex optimization problems, and using low-performing chip blueprints as training data to aid in the development of new hardware.