NVIDIA Launches TensorRT 8, Improving AI Inference Performance and Making Conversational AI Smarter and More Interactive From Cloud to Edge

Artificial intelligence (AI) models are used in countless real-time applications, and demand for them is growing rapidly worldwide. This pushes firms to adopt state-of-the-art AI models and offer more efficient solutions.

Today, NVIDIA released the eighth generation of the company’s AI software, TensorRT™ 8, which cuts inference time for language queries in half. This latest version allows firms to deliver conversational AI applications with a level of quality and responsiveness that was not possible before.

Over the last five years, TensorRT has been downloaded nearly 2.5 million times by more than 350,000 developers from 27,500 companies in industries including healthcare, automotive, finance, and retail. TensorRT applications can be deployed in hyperscale data centers, embedded product platforms, and automotive product platforms. TensorRT 8 now allows developers to build the world’s best-performing ad recommendations, search engines, and chatbots and deliver them from the cloud to the edge.

TensorRT 8’s breakthroughs in AI inference are made possible through the following key features:

  • Transformer optimizations allow TensorRT 8 to deliver record-setting speed for language applications.
  • Sparsity, a new performance technique in NVIDIA Ampere architecture GPUs, increases efficiency and speeds up neural networks by reducing computational operations.
  • Quantization-aware training allows developers to use trained models to run inference in INT8 precision without losing accuracy. This significantly reduces compute and storage overhead for efficient inference on Tensor Cores.
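
To make the sparsity and INT8 ideas above more concrete, here is a minimal, illustrative sketch in plain Python. It is not TensorRT code and does not use any TensorRT API. The article does not describe the sparsity pattern, so the sketch assumes the 2:4 structured pattern associated with Ampere GPUs (two of every four weights are zero) and uses magnitude-based pruning as an assumed selection rule. The quantization part assumes simple symmetric per-tensor INT8 scaling, the kind of rounding that quantization-aware training typically simulates during training. All function names are hypothetical.

```python
def prune_2_of_4(weights):
    """Zero the two smallest-magnitude values in each group of four weights.

    Assumption: 2:4 structured sparsity with magnitude-based selection.
    """
    if len(weights) % 4 != 0:
        raise ValueError("number of weights must be a multiple of 4")
    pruned = []
    for start in range(0, len(weights), 4):
        group = weights[start:start + 4]
        # Positions of the two largest-magnitude entries in this group.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return pruned


def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127].

    Returns the integer codes and the scale needed to map them back.
    """
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map INT8 codes back to approximate float values."""
    return [c * scale for c in codes]


if __name__ == "__main__":
    weights = [0.1, -0.9, 0.4, 0.05, 1.0, 0.2, -0.3, 0.0]
    sparse = prune_2_of_4(weights)
    codes, scale = quantize_int8(sparse)
    print("pruned weights:", sparse)
    print("int8 codes:", codes)
    print("round trip:", dequantize(codes, scale))
```

Half of the pruned weights are zero, so hardware that understands the pattern can skip those multiplications, and each remaining weight is stored as an 8-bit code instead of a 32-bit float. In quantization-aware training, a quantize and dequantize round trip like this one is applied during training so the model learns to tolerate the rounding error.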

Previously, firms had to reduce the size of their models, which made the results less accurate. With TensorRT 8, companies can now double or triple their model size to achieve remarkable increases in accuracy.

Companies across diverse industries are already using TensorRT.

Hugging Face, a pioneer in open-source AI, is collaborating with NVIDIA to launch AI services that enable neural search, text analysis, and conversational applications at scale. Its Accelerated Inference API delivers up to 100x speedup for transformer models running on NVIDIA GPUs. The company has also achieved 1 ms inference latency on BERT using TensorRT 8 and plans to launch the models later this year.

Ultrasound is a critical tool for detecting various diseases, and intelligent healthcare solutions help clinicians provide the highest quality of care. GE Healthcare, a global leader in medical technology, diagnostics, and digital solutions, is using TensorRT to speed up computer vision (CV) applications for ultrasound.

Erik Steen, chief engineer of Cardiovascular Ultrasound at GE Healthcare, explains that clinicians spend much of their valuable time selecting and measuring images during an ultrasound exam. To make the process more efficient, his team implemented automated cardiac view detection on the Vivid E95 scanner as part of an R&D project. The cardiac view recognition algorithm chooses suitable images for cardiac wall motion analysis. TensorRT's real-time inference capabilities improved the performance of the view detection algorithm and also shortened the time to market during the R&D project.

TensorRT 8 is now available for free to NVIDIA Developer program members. The most recent versions of the plug-ins, parsers, and samples are also open source and available from the TensorRT GitHub repository.

GitHub: https://github.com/NVIDIA/TensorRT

Source: https://developer.nvidia.com/blog/nvidia-announces-tensorrt-8-slashing-bert-large-inference-down-to-1-millisecond/
