
Speed up PyTorch with custom kernels | Alex Dremov

We'll start with torch.compile, move on to writing a custom Triton kernel, and finally get into designing a CUDA kernel.


Read for free at alexdremov.me

PyTorch offers incredible flexibility, allowing you to code GPU-accelerated operations in a matter of seconds. However, this convenience comes at a cost. PyTorch executes your code eagerly, operation by operation, which leaves a lot of performance on the table. This translates into slower model training, which affects your experiment iteration cycle, your team's velocity, your compute bill, and more.

In this post, I'll explore three techniques for speeding up PyTorch. Each method uses softmax as our "Hello World" demo, but you can swap in any operation you like, and the techniques discussed will still apply.

We will begin with torch.compile, move on to writing a custom Triton kernel, and finally dive into designing a CUDA kernel.

So, this post might be a bit difficult, but bear with me.

💥 "Wait, you just wrap one function call and it speeds up your code? Is that all? It sounds too good to be true."

— Yes.

torch.compile is a relatively new API in PyTorch that uses runtime graph capture and kernel fusion under the hood. With a single decorator, you can often get noticeable speed improvements without significant changes to your code.

Simply speaking, we can speed up computation by fusing several operations into a single GPU kernel, which eliminates the overhead of separate kernel launches. Or, even better, reorder a sequence of operations so that it executes more efficiently!

Such optimization is not possible in PyTorch's normal (eager) execution mode, since it runs operations one by one, exactly as they are called in the code.

Implementation of Softmax with torch.compile

Below is a simple example that shows how to compile the softmax function with torch.compile. Substitute your model's forward pass for it, and your code will (hopefully) run faster.
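A minimal sketch of the idea looks like this (the eager softmax body and tensor sizes here are just for illustration):

```python
import torch

# Naive softmax written as separate eager ops; torch.compile can fuse them.
def softmax(x: torch.Tensor) -> torch.Tensor:
    x_max = x.max(dim=-1, keepdim=True).values   # subtract max for numerical stability
    x_exp = (x - x_max).exp()
    return x_exp / x_exp.sum(dim=-1, keepdim=True)

# One line: wrap (or decorate) the function with torch.compile.
compiled_softmax = torch.compile(softmax)

x = torch.randn(2048, 4096, device="cuda")
out = compiled_softmax(x)  # first call triggers compilation; later calls are fast
```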

❗ Note that you will see much bigger speedups if you compile the whole model rather than a single function.
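In practice that just means wrapping the module itself (model below stands for any nn.Module of yours):

```python
model = torch.compile(model)  # compiles the full forward pass on first use
```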

Benefits:

  • One line to enable the compiler.
  • No dark magic rituals are required (except shapeshifting maybe).

Drawbacks:

  • The first pass is slow while compilation happens; after that, it runs at full speed.
  • It doesn't always deliver dramatic speedups for every model, and it can occasionally break if your code is too exotic.
  • It still has trouble handling dynamic shapes.

😡 Dynamic-shape compilation mode is needed when input shapes change from call to call and we don't want to recompile the code for every specific size.
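Opting into that mode is a one-line change; a minimal sketch, reusing the softmax function compiled above:

```python
# dynamic=True asks the compiler for shape-polymorphic kernels,
# avoiding a recompile for every new input size.
compiled_softmax = torch.compile(softmax, dynamic=True)
```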

Debugging compilation issues is a whole separate topic.

Why Use Triton?

Triton is a language that compiles to efficient GPU kernels while letting you write Pythonic code. It is what powers PyTorch's dynamo/inductor stack under the hood, but you can also use it to write your own custom ops! For many matrix/tensor operations – like softmax – you can get a significant speedup. Because why wait for official PyTorch kernels when you can write your own?

Softmax in Triton

Here's a small snippet showing a row-wise softmax forward pass in Triton. I'll keep it short and sweet for demonstration purposes. In a real project, you would probably do more advanced tiling and block handling.
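The sketch below follows the spirit of the official Triton softmax tutorial; the wrapper name triton_softmax and the block-size choice are illustrative:

```python
import torch
import triton
import triton.language as tl


@triton.jit
def softmax_kernel(
    out_ptr, in_ptr,
    in_row_stride, out_row_stride,
    n_cols,
    BLOCK_SIZE: tl.constexpr,
):
    # Each program instance handles one row of the 2D input.
    row_idx = tl.program_id(axis=0)
    col_offsets = tl.arange(0, BLOCK_SIZE)
    mask = col_offsets < n_cols

    # Load one row; out-of-bounds lanes get -inf so they don't affect the max.
    row = tl.load(in_ptr + row_idx * in_row_stride + col_offsets,
                  mask=mask, other=-float("inf"))

    # Numerically stable softmax: subtract the row max before exponentiating.
    row = row - tl.max(row, axis=0)
    numerator = tl.exp(row)
    denominator = tl.sum(numerator, axis=0)

    tl.store(out_ptr + row_idx * out_row_stride + col_offsets,
             numerator / denominator, mask=mask)


def triton_softmax(x: torch.Tensor) -> torch.Tensor:
    n_rows, n_cols = x.shape
    out = torch.empty_like(x)
    # BLOCK_SIZE must be a power of two that covers the whole row.
    BLOCK_SIZE = triton.next_power_of_2(n_cols)
    softmax_kernel[(n_rows,)](
        out, x,
        x.stride(0), out.stride(0),
        n_cols,
        BLOCK_SIZE=BLOCK_SIZE,
    )
    return out
```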

💥 This may seem complicated, but you just need to get familiar with Triton, and it will start to make sense.

Check out their guides

Indeed, it seems complicated. But the core of the algorithm is condensed into a few lines.

Everything else is data management and bookkeeping.
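Before benchmarking, it's worth sanity-checking the kernel against the reference implementation; a minimal check (shapes are arbitrary) might look like this:

```python
x = torch.randn(4096, 2048, device="cuda")
torch.testing.assert_close(
    triton_softmax(x),
    torch.nn.functional.softmax(x, dim=-1),
)
```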

If we benchmark across different input lengths, we will see that we consistently match torch.nn.functional.softmax in performance (which is a very well-optimized kernel!) and dramatically outperform a naive eager PyTorch implementation.

Benchmark results | Image by author

You can find the full kernel and benchmark code in the following GitHub file.

Benefits:

  • Big speedups are possible by fusing ops and optimizing memory access patterns.
  • More control than torch.compile.
  • It's possible to write highly efficient code (on par with the torch implementation!)
  • It's also easy to write code that doesn't work (if you don't know what you're doing).

Drawbacks:

  • Now you are the kernel developer, which means debugging it yourself when something goes wrong. And that's hard. Really hard.
  • If you go for custom backward passes, you may need a second coffee… or more. That's because torch's autograd can't differentiate through a Triton kernel automatically, so you will need to define the backward pass yourself (see the sketch after this list).
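As a rough illustration, here is one way to wrap the triton_softmax kernel from above in torch.autograd.Function; the backward is written in plain eager ops for brevity, though it could itself be another Triton kernel:

```python
class TritonSoftmax(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        y = triton_softmax(x)        # Triton forward kernel defined earlier
        ctx.save_for_backward(y)
        return y

    @staticmethod
    def backward(ctx, grad_out):
        # Softmax backward: dx = y * (dy - sum(dy * y, dim=-1, keepdim=True))
        (y,) = ctx.saved_tensors
        return y * (grad_out - (grad_out * y).sum(dim=-1, keepdim=True))


# usage: out = TritonSoftmax.apply(x)
```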

Sometimes even Triton won't cut it, or you just enjoy living on the edge. In that case, you can write a custom CUDA kernel in C++, compile it, and bind it to PyTorch with a custom extension. Projects like [this fused CUDA softmax reference] show how people build specialized, high-performance kernels.

Softmax in Custom CUDA

You will usually have a setup.py that compiles a .cu or .cpp file and exposes a Python function as an extension.
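For orientation, a minimal sketch of such a build script, with placeholder extension and source-file names (the kernel code itself is not shown here):

```python
# setup.py -- hypothetical build script for a custom CUDA softmax extension
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension

setup(
    name="custom_softmax",
    ext_modules=[
        CUDAExtension(
            name="custom_softmax",
            # placeholder sources: the C++ binding and the CUDA kernel
            sources=["softmax_binding.cpp", "softmax_kernel.cu"],
        ),
    ],
    cmdclass={"build_ext": BuildExtension},
)
```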

I won't walk through the kernel code for this method in this post, and that fact speaks for itself. This approach is very involved, demands a great deal of care, and is usually the last thing you should try.

It's very easy to write inefficient, buggy, unsafe code.

Benefits:

  • High control. “If you want something done right, do it yourself.”
  • The resulting kernel can be extremely fast if it is tuned properly.

Drawbacks:

  • Requires a deep understanding of CUDA.
  • Memory management, block sizes, shared memory — those are tough!
  • Maintenance costs can be extremely high.

When it comes to speeding up PyTorch, you can choose from methods of increasing sophistication:

  1. torch.compile: Minor code changes required.
  2. Triton kernel: More control over kernel behavior, while still much easier to write than raw CUDA.
  3. Pure CUDA: High potential for improvement, but very high complexity.

If you want an easy win, start with torch.compile. If that's not enough, look into Triton. For advanced users, writing a custom CUDA kernel can bring additional gains, though it requires deep GPU programming skills.

Subscribe so you don't miss posts about other optimizations and useful deep learning techniques!

  1. Compiling the optimizer with torch.compile (PyTorch documentation)
  2. How should I use torch.compile correctly? (PyTorch discussion)
  3. Using User-Defined Triton Kernels with torch.compile (PyTorch documentation)
  4. torch.compile with a custom Triton kernel (PyTorch discussion)
  5. GitHub: fattorib/CudaSoftmax

Choose the method that best suits your project's needs and your comfort level. Good luck, and happy optimizing!

The story was first published on alexdremov.me
