Machine learning (ML) and artificial intelligence (AI) have transformed how industries make business decisions. Many firms now use ML and AI to compile consumer data and to analyze and predict consumer behaviour. This lets them process high volumes of data rapidly and accurately and turn the resulting insights into business actions.
On its blog, Snap recently shared its experience of applying GPU technology to accelerate ML model inference. Inference is the computation-intensive process of calculating model predictions (like the probability of a Snapchatter watching a complete video) from input features (like the number of videos the Snapchatter viewed in the past hour).
This is a challenging task for a firm like Snap, which serves a community of over 293 million daily Snapchatters and generates over 10 trillion ML predictions every day.
ML models on Snapchat are based on deep neural networks (DNNs), which makes them accurate but also computationally heavy. To cope with this, Snap has relied on a highly scalable and efficient inference stack within the x86-64 CPU ecosystem. Soon after the launch of inference-oriented accelerators like the NVidia T4 GPU, the team started to investigate whether such devices could offer a better tradeoff between performance/scalability and cost.
Using GPUs To Accelerate ML Models
GPUs have a far higher raw throughput of floating-point operations (FLOPS) and a moderately larger memory bandwidth than CPUs. Low-precision arithmetic, such as FP16, does not appear to significantly impact DNN training and inference accuracy, and GPUs offer an order of magnitude higher throughput when low-precision arithmetic is used. GPUs are also more efficient for ML inference in terms of performance per dollar. According to estimates, cloud GPU machines can provide 2-3 times the peak throughput per dollar of the top cloud CPU machines.
The team analyzed their ranking model (which consumes the most ML inference resources) on GPUs to measure the improvement. Their evaluation shows that GPU acceleration offers a considerable advantage for current ML inference workloads.
For the compilation and runtime suite, the team investigated the performance and flexibility tradeoff between TensorRT and TensorFlow XLA. They chose the NVidia T4 GPU because it is designed specifically for inference workloads and is widely available. TensorRT is NVidia’s toolkit, which offers a wide range of low-level interfaces for efficiency tuning. XLA is a compiler that converts generic linear algebra operations into high-performance binary code.
They began by benchmarking models dominated by matrix multiplication because it is the most efficient workload for GPUs. For the benchmarks they used a CPU virtual machine (VM), a 60-core system from their cloud provider, and a GPU VM, a 64-core machine with four T4 GPU devices. They used the same hardware configuration in all benchmarks and calculated the number of tera-FLOPS (TFLOPS) that can be sustained over time, both in absolute terms and relative to the theoretical peak throughput. This simple model was also tested in their production environment with varied dimensions to confirm that these utilization numbers are realistic in practice.
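The two metrics described above, sustained TFLOPS and utilization relative to peak, can be computed as in the following minimal sketch. It is not Snap's benchmark code: the function names are made up for illustration, and the timing and peak figures in the usage lines are placeholders, not measurements from the post.

```python
def matmul_flops(m: int, k: int, n: int) -> int:
    """FLOPs for an (m x k) @ (k x n) product: one multiply and one add per term."""
    return 2 * m * k * n


def sustained_tflops(m: int, k: int, n: int, runs: int, elapsed_seconds: float) -> float:
    """Average TFLOPS achieved over `runs` identical matrix multiplications."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return matmul_flops(m, k, n) * runs / elapsed_seconds / 1e12


def utilization(sustained: float, theoretical_peak: float) -> float:
    """Fraction of the device's theoretical peak throughput actually sustained."""
    return sustained / theoretical_peak


# Placeholder inputs (assumptions, not figures from the post):
# 500 multiplications of 4096 x 4096 matrices finishing in 20 seconds,
# on a device with an assumed peak of 8 TFLOPS.
achieved = sustained_tflops(4096, 4096, 4096, runs=500, elapsed_seconds=20.0)
print(f"sustained: {achieved:.2f} TFLOPS, "
      f"utilization: {utilization(achieved, 8.0):.0%}")
```

Reporting both numbers matters because the absolute figure compares devices, while the ratio shows how much of a given device the workload actually exploits.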
The results from the above experiments demonstrate that:
- On GPUs, FP32 provides a 4x improvement in throughput. The benefit with FP16 is more than 15x.
- The bigger the dimension, the better the GPU utilization ratio, which supports the prior analysis of the mix of arithmetic and memory operations.
- TensorRT outperforms XLA in lower dimensions.
- Throughput with FP16 is much higher than with FP32; however, FP16 suffers from some precision loss.
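The precision loss mentioned in the last point can be seen directly by rounding values to half precision. The sketch below uses only Python's standard `struct` module, whose `e` and `f` format codes encode IEEE 754 half and single precision; the helper names are illustrative and the example is independent of Snap's models.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def relative_error(exact: float, approx: float) -> float:
    """Relative rounding error of `approx` with respect to `exact`."""
    return abs(approx - exact) / abs(exact)


# FP16 has an 11-bit significand, so above 2048 it can no longer
# represent every integer: 2049 is rounded to a neighbouring value.
print(to_fp16(2049.0), to_fp32(2049.0))

# The same effect appears as a larger relative error on ordinary fractions.
print(relative_error(0.1, to_fp16(0.1)), relative_error(0.1, to_fp32(0.1)))
```

Whether errors of this size matter depends on the model, which is why accuracy is usually re-validated after converting to FP16.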
Below are some unique engineering solutions developed by the team for cloud-based GPU inference.
Automated Model Optimization for GPU
The above results show that GPUs can remarkably accelerate models dominated by large matrix multiplication. However, this is not true for all models. Memory-intensive operations are expensive on GPUs and are better performed on CPUs. Furthermore, small operations tend to under-utilize GPU cores, so overall utilization degrades when a model consists only of many small operations.
To make the GPU inference details entirely transparent for engineers who work on ML models, the team implemented an automatic model translation workflow that applies multiple customizable GPU optimization steps to their DNN models. This workflow is encapsulated into a single job that is executed by Kubernetes workers and invoked after training completes, without manual intervention.
Most ranking DNN models use a mixture of experts (MoE), where each expert is a light operation, which keeps the overall computation within budget. The team reports that, for these models, XLA’s built-in fusion of multiple operations offers higher throughput than TensorRT. As a result, they created a model compilation step that incorporates XLA’s fusion facility and other low-level optimization techniques. This step is triggered automatically when the workflow detects that the model under optimization uses MoE layers.
They measured model inference throughput with varied batch sizes for CPU, GPU TensorRT, and GPU XLA systems. With big batch sizes, XLA outperforms TensorRT by a factor of 20 to 50. XLA underperforms, however, when batch sizes are below 256, because fixed scheduling costs are not amortized over small batches. Another intriguing finding is that CPU throughput peaks at a batch size of 64, which suggests that CPU cache overflow occurs beyond that point.
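One common way to reason about why large matrix multiplications suit GPUs while memory-bound operations do not is arithmetic intensity, the number of floating-point operations performed per byte moved. The sketch below is a simplified model, not taken from Snap's post: it assumes square matrices, assumes each matrix is read or written exactly once, and uses illustrative function names.

```python
def matmul_intensity(n: int, bytes_per_element: int = 4) -> float:
    """FLOPs per byte for an n x n matrix product (read A and B, write C)."""
    flops = 2 * n ** 3
    bytes_moved = 3 * n * n * bytes_per_element
    return flops / bytes_moved


def elementwise_add_intensity(bytes_per_element: int = 4) -> float:
    """FLOPs per byte for an element-wise add: one FLOP per two reads and one write."""
    return 1 / (3 * bytes_per_element)


# Intensity of a matrix product grows linearly with the dimension,
# while an element-wise operation stays constant and memory-bound.
for n in (64, 512, 4096):
    print(n, round(matmul_intensity(n), 1))
print("element-wise add:", round(elementwise_add_intensity(), 3))
```

Under this model, doubling the matrix dimension doubles the work done per byte, which is consistent with larger dimensions giving better GPU utilization.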
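The conditional triggering described above can be pictured as a small pipeline in which each optimization step declares when it applies. This is only a sketch of the idea: the post does not publish the workflow's code, so every class, step, and layer name below is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ModelSpec:
    """Hypothetical description of a trained model handed to the workflow."""
    name: str
    layer_types: List[str]
    applied_steps: List[str] = field(default_factory=list)


@dataclass
class OptimizationStep:
    """A named step plus a predicate deciding whether it applies to a model."""
    name: str
    applies_to: Callable[[ModelSpec], bool]


def uses_moe(model: ModelSpec) -> bool:
    """Detect mixture-of-experts layers (the "moe" label is an assumed convention)."""
    return "moe" in model.layer_types


def optimize(model: ModelSpec, steps: List[OptimizationStep]) -> ModelSpec:
    """Record, in order, every step whose predicate matches the model."""
    for step in steps:
        if step.applies_to(model):
            model.applied_steps.append(step.name)
    return model


# Hypothetical step list: one step that always runs, one gated on MoE layers.
STEPS = [
    OptimizationStep("generic_gpu_optimization", lambda m: True),
    OptimizationStep("xla_fusion_compile", uses_moe),
]

ranking = optimize(ModelSpec("ranking", ["dense", "moe"]), STEPS)
plain = optimize(ModelSpec("plain", ["dense"]), STEPS)
print(ranking.applied_steps, plain.applied_steps)
```

Keeping the trigger condition next to each step is one way to let model engineers stay unaware of GPU-specific details, as the workflow intends.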
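The amortization argument can be illustrated with a toy cost model in which every batch pays a fixed scheduling cost plus a per-sample compute cost. The constants below are placeholders chosen for illustration, not values measured by Snap, and the model deliberately ignores effects such as the CPU cache behaviour mentioned above.

```python
def throughput(batch_size: int, fixed_cost_s: float, per_sample_cost_s: float) -> float:
    """Samples per second when each batch costs fixed + per_sample * batch_size seconds."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return batch_size / (fixed_cost_s + per_sample_cost_s * batch_size)


# Assumed costs: 1 ms of fixed scheduling overhead per batch, 10 us per sample.
FIXED_COST_S = 1e-3
PER_SAMPLE_COST_S = 1e-5

for batch in (1, 16, 256, 4096):
    print(batch, round(throughput(batch, FIXED_COST_S, PER_SAMPLE_COST_S)))
```

In this model throughput rises with batch size and approaches `1 / per_sample_cost_s`, so a system with a larger fixed cost needs larger batches before its per-sample advantage shows.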
Scheduling ML Model Inference Workloads
The team explains that throughput is substantially better when operations from the same model request are scheduled on the same device. They therefore introduced a custom-developed GPU operation scheduler that enables this speed boost.
They chose two complementary models, one computationally lighter and the other computationally heavier. Two of the essential performance metrics for their production serving systems are throughput and tail latency. The custom GPU scheduler improved throughput by 29% and tail latency by 50% on the heavy model, as shown in the table above.
Lastly, they measured the impact of GPU acceleration on production workloads. Based on the on-demand prices for CPU and GPU VMs, GPU VMs can sustain more than 1.3x the throughput of CPU VMs at the same dollar cost. When batch sizes are doubled, GPU VMs offer more than 1.7x the throughput at the same cost. Furthermore, they state that tail latency is over 6x better with GPU inference than with CPU inference.
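The post does not detail how the scheduler is built, so the following is only a sketch of the underlying idea, keeping every operation of one request on one device. The class name, the least-loaded placement rule, and the request identifiers are all assumptions made for illustration.

```python
from typing import Dict, List


class StickyDeviceScheduler:
    """Assigns each request to one device and keeps its operations there."""

    def __init__(self, num_devices: int) -> None:
        if num_devices < 1:
            raise ValueError("num_devices must be at least 1")
        self.num_devices = num_devices
        self._assignment: Dict[str, int] = {}
        # Number of operations scheduled on each device so far.
        self._load: List[int] = [0] * num_devices

    def device_for(self, request_id: str) -> int:
        """Return the device for the next operation of `request_id`."""
        if request_id not in self._assignment:
            # First operation of a request: pick the device with the fewest
            # operations scheduled so far (lowest index wins ties).
            least_loaded = min(range(self.num_devices), key=self._load.__getitem__)
            self._assignment[request_id] = least_loaded
        device = self._assignment[request_id]
        self._load[device] += 1
        return device

    def finish(self, request_id: str) -> None:
        """Forget a completed request so its identifier can be reused."""
        self._assignment.pop(request_id, None)


scheduler = StickyDeviceScheduler(num_devices=4)
print([scheduler.device_for("req-a") for _ in range(3)])
print(scheduler.device_for("req-b"))
```

A real scheduler would also have to handle concurrency and device failures; the sketch only shows the affinity rule that the throughput gain is attributed to.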
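A "throughput at the same dollar cost" comparison of this kind reduces to dividing each machine's throughput by its price and taking the ratio. The sketch below shows that arithmetic; the request rates and hourly prices are invented placeholders chosen only so the ratio lands near the 1.3x mentioned above, and they are not Snap's or any cloud provider's actual figures.

```python
def requests_per_dollar(requests_per_second: float, hourly_price_usd: float) -> float:
    """Requests served for each dollar of on-demand VM time."""
    if hourly_price_usd <= 0:
        raise ValueError("hourly_price_usd must be positive")
    return requests_per_second * 3600 / hourly_price_usd


def relative_cost_efficiency(gpu_rps: float, gpu_price: float,
                             cpu_rps: float, cpu_price: float) -> float:
    """How many times more requests per dollar the GPU VM serves than the CPU VM."""
    return requests_per_dollar(gpu_rps, gpu_price) / requests_per_dollar(cpu_rps, cpu_price)


# Placeholder inputs (assumptions): a CPU VM serving 1000 requests/s at $2/hour
# and a GPU VM serving 3900 requests/s at $6/hour.
ratio = relative_cost_efficiency(gpu_rps=3900, gpu_price=6.0,
                                 cpu_rps=1000, cpu_price=2.0)
print(f"GPU VM serves {ratio:.2f}x the requests per dollar")
```

The example also shows why raw speedup alone is not the deciding number: the placeholder GPU VM is 3.9x faster but only about 1.3x more cost-efficient because it costs three times as much.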