
Speed up PyTorch with custom kernels | Alex Dremov

We'll start with torch.compile, move on to writing a custom Triton kernel, and finally get into designing a CUDA kernel.


Read for free at alexdremov.me

PyTorch offers incredible flexibility, allowing you to code GPU-accelerated operations in a matter of seconds. However, this convenience comes at a cost. PyTorch executes your code eagerly, operation by operation, which leaves a lot of performance on the table. This translates into slower model training, which affects your experiment iteration cycle, your team's velocity, your compute bill, and more.

In this post, I'll explore three techniques for speeding up PyTorch. Each method uses softmax as our "Hello World" demo, but you can swap in any operation you like, and the techniques discussed will still apply.

We will begin with torch.compile, move on to writing a custom Triton kernel, and finally dive into designing a CUDA kernel.

So, this post might be a bit difficult, but bear with me.

💥 "Wait, you just wrap one function call and it speeds up your code? Is that all? It sounds too good to be true."

— Yes.

torch.compile is a relatively new API in PyTorch that uses runtime graph capture and kernel fusion under the hood. With a single decorator, you can often get noticeable speed improvements without significant changes to your code.

Simply speaking, we can speed up computation by fusing several operations into a single GPU kernel, which eliminates the overhead of separate kernel launches. Or, even better, reorder a sequence of operations so that it executes more efficiently!

Such optimization is not possible in PyTorch's normal (eager) execution mode, since it runs operations one by one, exactly as they are called in the code.

Implementation of Softmax with torch.compile

Below is a simple example that shows how to compile the softmax function with torch.compile. Substitute your model's forward pass for it, and your code will (hopefully) run faster.
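A minimal sketch of the idea looks like this (the eager softmax body and tensor sizes here are just for illustration):

```python
import torch

# Naive softmax written as separate eager ops; torch.compile can fuse them.
def softmax(x: torch.Tensor) -> torch.Tensor:
    x_max = x.max(dim=-1, keepdim=True).values   # subtract max for numerical stability
    x_exp = (x - x_max).exp()
    return x_exp / x_exp.sum(dim=-1, keepdim=True)

# One line: wrap (or decorate) the function with torch.compile.
compiled_softmax = torch.compile(softmax)

x = torch.randn(2048, 4096, device="cuda")
out = compiled_softmax(x)  # first call triggers compilation; later calls are fast
```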

❗ Note that you will see much bigger speedups if you compile the whole model rather than a single function.
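In practice that just means wrapping the module itself (model below stands for any nn.Module of yours):

```python
model = torch.compile(model)  # compiles the full forward pass on first use
```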

Benefits:

  • One line to enable the compiler.
  • No dark magic rituals are required (except shapeshifting maybe).

Drawbacks:

  • The first pass is slow while compilation happens; after that, it runs at full speed.
  • It doesn't always deliver dramatic speedups for every model, and it can occasionally break if your code is too exotic.
  • It still has trouble handling dynamic shapes.

😡 Dynamic-shape compilation mode is needed when input shapes change from call to call and we don't want to recompile the code for every specific size.
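Opting into that mode is a one-line change; a minimal sketch, reusing the softmax function compiled above:

```python
# dynamic=True asks the compiler for shape-polymorphic kernels,
# avoiding a recompile for every new input size.
compiled_softmax = torch.compile(softmax, dynamic=True)
```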

Debugging compilation issues is a whole separate topic.

Why Use Triton?

Triton is a language that compiles to efficient GPU kernels while letting you write Pythonic code. It is what powers PyTorch's dynamo/inductor stack under the hood, but you can also use it to write your own custom ops! For many matrix/tensor operations – like softmax – you can get a significant speedup. Because why wait for official PyTorch kernels when you can write your own?

Softmax in Triton

Here's a small snippet showing a row-wise softmax forward pass in Triton. I'll keep it short and sweet for demonstration purposes. In a real project, you would probably do more advanced tiling and block handling.
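The sketch below follows the spirit of the official Triton softmax tutorial; the wrapper name triton_softmax and the block-size choice are illustrative:

```python
import torch
import triton
import triton.language as tl


@triton.jit
def softmax_kernel(
    out_ptr, in_ptr,
    in_row_stride, out_row_stride,
    n_cols,
    BLOCK_SIZE: tl.constexpr,
):
    # Each program instance handles one row of the 2D input.
    row_idx = tl.program_id(axis=0)
    col_offsets = tl.arange(0, BLOCK_SIZE)
    mask = col_offsets < n_cols

    # Load one row; out-of-bounds lanes get -inf so they don't affect the max.
    row = tl.load(in_ptr + row_idx * in_row_stride + col_offsets,
                  mask=mask, other=-float("inf"))

    # Numerically stable softmax: subtract the row max before exponentiating.
    row = row - tl.max(row, axis=0)
    numerator = tl.exp(row)
    denominator = tl.sum(numerator, axis=0)

    tl.store(out_ptr + row_idx * out_row_stride + col_offsets,
             numerator / denominator, mask=mask)


def triton_softmax(x: torch.Tensor) -> torch.Tensor:
    n_rows, n_cols = x.shape
    out = torch.empty_like(x)
    # BLOCK_SIZE must be a power of two that covers the whole row.
    BLOCK_SIZE = triton.next_power_of_2(n_cols)
    softmax_kernel[(n_rows,)](
        out, x,
        x.stride(0), out.stride(0),
        n_cols,
        BLOCK_SIZE=BLOCK_SIZE,
    )
    return out
```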

💥 This may seem complicated, but you just need to get familiar with Triton, and it will start to make sense.

Check out their guides

Indeed, it seems complicated. But the core of the algorithm is condensed into a few lines.

Everything else is data management and bookkeeping.
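Before benchmarking, it's worth sanity-checking the kernel against the reference implementation; a minimal check (shapes are arbitrary) might look like this:

```python
x = torch.randn(4096, 2048, device="cuda")
torch.testing.assert_close(
    triton_softmax(x),
    torch.nn.functional.softmax(x, dim=-1),
)
```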

If we benchmark across different input lengths, we will see that we consistently match torch.nn.functional.softmax in performance (which is a very well-optimized kernel!) and dramatically outperform a naive eager PyTorch implementation.

Benchmark results | Image by author

You can find the full kernel and benchmark code in the following GitHub file.

Benefits:

  • Big speedups are possible by fusing ops and optimizing memory access patterns.
  • More control than torch.compile.
  • It's possible to write highly efficient code (on par with the torch implementation!)
  • It's also easy to write code that doesn't work (if you don't know what you're doing).

Drawbacks:

  • Now you are the kernel developer, which means debugging it yourself when something goes wrong. And that's hard. Really hard.
  • If you go for custom backward passes, you may need a second coffee… or more. That's because torch's autograd can't differentiate through a Triton kernel automatically, so you will need to define the backward pass yourself (see the sketch after this list).
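As a rough illustration, here is one way to wrap the triton_softmax kernel from above in torch.autograd.Function; the backward is written in plain eager ops for brevity, though it could itself be another Triton kernel:

```python
class TritonSoftmax(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        y = triton_softmax(x)        # Triton forward kernel defined earlier
        ctx.save_for_backward(y)
        return y

    @staticmethod
    def backward(ctx, grad_out):
        # Softmax backward: dx = y * (dy - sum(dy * y, dim=-1, keepdim=True))
        (y,) = ctx.saved_tensors
        return y * (grad_out - (grad_out * y).sum(dim=-1, keepdim=True))


# usage: out = TritonSoftmax.apply(x)
```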

Sometimes even Triton won't cut it, or you just enjoy living on the edge. In that case, you can write a custom CUDA kernel in C++, compile it, and bind it to PyTorch with a custom extension. Projects like [this fused CUDA softmax reference] show how people build specialized, high-performance kernels.

Softmax in Custom CUDA

You will usually have a setup.py that compiles a .cu or .cpp file and exposes a Python function as an extension.
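For orientation, a minimal sketch of such a build script, with placeholder extension and source-file names (the kernel code itself is not shown here):

```python
# setup.py -- hypothetical build script for a custom CUDA softmax extension
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension

setup(
    name="custom_softmax",
    ext_modules=[
        CUDAExtension(
            name="custom_softmax",
            # placeholder sources: the C++ binding and the CUDA kernel
            sources=["softmax_binding.cpp", "softmax_kernel.cu"],
        ),
    ],
    cmdclass={"build_ext": BuildExtension},
)
```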

I won't walk through the kernel code for this method in this post, and that fact speaks for itself. This approach is very involved, demands a great deal of care, and is usually the last thing you should try.

It's very easy to write inefficient, buggy, unsafe code.

Benefits:

  • High control. “If you want something done right, do it yourself.”
  • The resulting kernel can be extremely fast if it is tuned properly.

Drawbacks:

  • Requires a deep understanding of CUDA.
  • Memory management, block sizes, shared memory — those are tough!
  • Maintenance costs can be extremely high.

When it comes to speeding up PyTorch, you can choose from methods of increasing sophistication:

  1. torch.compile: Minor code changes required.
  2. Triton kernel: More control over kernel behavior, while still much easier to write than raw CUDA.
  3. Pure CUDA: High potential for improvement, but very high complexity.

If you want an easy win, start with torch.compile. If that's not enough, look into Triton. For advanced users, writing a custom CUDA kernel can bring additional gains, though it requires deep GPU programming skills.

Subscribe so you don't miss posts about other optimizations and useful deep learning techniques!

  1. Compiling the optimizer with torch.compile (PyTorch documentation)
  2. How should I use torch.compile correctly? (PyTorch discussion)
  3. Using User-Defined Triton Kernels with torch.compile (PyTorch documentation)
  4. torch.compile with a custom Triton kernel (PyTorch discussion)
  5. GitHub: fattorib/CudaSoftmax

Choose the method that best suits your project's needs and your comfort level. Good luck, and happy optimizing!

The story was first published on alexdremov.me
