Introducing PyTorch Profiler – The New And Improved Performance Debugging Profiler For PyTorch

Analyzing and improving the performance of large-scale deep learning models is a constant challenge, and one that grows in importance with model size. For a long time, PyTorch users had a hard time with this problem because the available tooling was limited. Standard GPU hardware-level debugging tools existed, but they lacked PyTorch-specific context about the operations being run. To recover the missing information, users had to combine several tools or manually add minimal correlation information to make sense of the data.

The PyTorch Profiler addresses this gap. It is an open-source tool for accurate and efficient performance analysis and troubleshooting of large-scale deep learning models.

What is new?

Previously, PyTorch users relied on the autograd profiler to capture information about PyTorch operations, but it did not collect comprehensive GPU hardware information and did not support visualization.


The new PyTorch Profiler is a tool that brings both kinds of information (PyTorch operations and GPU hardware) together so that users can get the full value out of them. It collects GPU-related and PyTorch-related information, correlates the two, automatically detects bottlenecks in the model, and suggests how to resolve them. It also remains compatible with the autograd profiler APIs.
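To make the operator-level information more concrete, here is a minimal sketch of collecting it. It assumes a CPU-only machine and a toy `torch.nn.Linear` model; the function name `profile_forward_pass`, the layer sizes, and the batch size are illustrative choices that do not come from the article.

```python
import torch
from torch.profiler import profile, ProfilerActivity


def profile_forward_pass():
    """Profile one forward pass of a toy model and return per-operator averages."""
    # Hypothetical stand-ins for a real model and a real batch of data.
    model = torch.nn.Linear(128, 64)
    inputs = torch.randn(32, 128)

    # Record CPU activity only, so the sketch also runs on machines without a GPU.
    with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
        model(inputs)

    # Aggregate the recorded events by operator name.
    return prof.key_averages()


events = profile_forward_pass()
# Print the five operators that took the most total CPU time.
print(events.table(sort_by="cpu_time_total", row_limit=5))
```

On a machine with a GPU, adding `ProfilerActivity.CUDA` to the `activities` list should also record GPU kernels, which is the hardware-side information described above.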


The new Profiler API is natively supported in PyTorch and offers the simplest experience available to date: with the PyTorch Profiler module, users can profile their models without installing additional packages. Below is an example screenshot of automated bottleneck detection in the PyTorch Profiler.

How to use it? A basic tutorial

First install the PyTorch Profiler TensorBoard plugin with the command below; it is needed to view the results of a profiling session. Then wrap the model training loop in the profiler's context manager, as in the code that follows the command.

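pip install torch_tb_profiler

import torch  # model, criterion, trainloader, and device are assumed to be defined already

with torch.profiler.profile(
    # Example settings (assumed): skip 2 steps, warm up for 2, record 6; "./log" is a placeholder path.
    schedule=torch.profiler.schedule(wait=2, warmup=2, active=6, repeat=1),
    on_trace_ready=torch.profiler.tensorboard_trace_handler("./log"),
) as profiler:
    for step, data in enumerate(trainloader, 0):
        inputs, labels = data[0].to(device=device), data[1].to(device=device)

        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        profiler.step()  # mark the end of one step so the schedule advances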


Through the schedule parameter, one can limit the number of training steps that are profiled, which reduces the amount of data collected. The tensorboard_trace_handler automatically saves the profiling output to disk so that it can be reviewed in TensorBoard.
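The effect of a schedule on which steps get recorded can be sketched in plain Python. This is a simplified model of a wait/warmup/active cycle, not the library's implementation; the function name `schedule_phase` and the values 2, 2, 6, and 1 are assumptions chosen for illustration, with parameter names following those of `torch.profiler.schedule`.

```python
def schedule_phase(step, wait, warmup, active, repeat=0):
    """Return the phase a training step falls into under a wait/warmup/active cycle."""
    cycle = wait + warmup + active
    # With repeat > 0, nothing is recorded once that many cycles have finished.
    if repeat > 0 and step >= repeat * cycle:
        return "idle"
    position = step % cycle
    if position < wait:
        return "wait"      # profiler is off
    if position < wait + warmup:
        return "warmup"    # profiler is running, but results are discarded
    return "active"        # profiler is recording


# Classify the first 12 training steps under the example settings.
phases = [schedule_phase(s, wait=2, warmup=2, active=6, repeat=1) for s in range(12)]
for step, phase in enumerate(phases):
    print(step, phase)
```

Under these example settings only steps 4 through 9 are recorded, so a long training run produces a trace covering six steps rather than every step.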

