Learning Triton One Kernel at a Time: Softmax

In the previous article of this series, we covered a workhorse of every field of computer science: matrix multiplication. It is ubiquitous in neural networks, where it implements the linear layers. However, raw matrix-multiplication outputs are hard to interpret, because their values and statistics (mean, variance, min-max range) can vary from layer to layer. This is one of the reasons we use activation functions, for example the logistic function (aka sigmoid), which maps any real number into the [0, 1] interval.
The softmax function, also known as the normalized exponential function, is a generalization of the sigmoid. It converts a vector of raw scores (logits) into a probability distribution over a set of classes. It can be interpreted as a smooth, differentiable approximation of the argmax operation. It is a key component of scaled dot-product attention, language modeling, and various classification tasks.
In this article, we will cover:
- Writing an efficient softmax kernel in Triton.
- Implementing the backward pass (for autograd compatibility).
- Performance tricks: cache modifiers and auto-tuning.
If you are not familiar with Triton yet, check out the previous articles!
Disclaimer: All illustrations and photos are by the author unless otherwise stated.
A refresher
The softmax is defined as:

$$\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}}$$

The normalization term ensures that the values of the output vector sum to 1, so that it can be interpreted as a valid probability distribution.
Note that this softmax formulation is very sensitive to numerical overflow. Remember that the maximum value standard float16 can represent is 65,504, which is roughly $e^{11}$. This means that any input value greater than ~11 will cause $\exp(z_i)$ to overflow, which leads to NaNs in the output.
A common strategy to mitigate this issue is to subtract the maximum value of the input vector from all elements, which caps the new maximum at 0 and the exponentials at 1:

$$\mathrm{softmax}(z)_i = \frac{e^{z_i - m}}{\sum_{j=1}^{n} e^{z_j - m}}, \quad m = \max_j z_j$$

Naive implementation
As you can see, computing the softmax involves two reduction operations: a max and a sum. A naive algorithm requires several distinct passes over the input vector: first find the maximum value, then compute the sum of exponentials, and finally normalize the results.
Here is what a naive NumPy implementation looks like:
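For reference, the three-pass approach can be sketched in NumPy as follows (a minimal sketch with illustrative names, not necessarily the article's exact listing):

```python
import numpy as np

def softmax_naive(x: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over a 1D vector, in three passes."""
    # Pass 1: find the maximum for numerical stability
    m = x.max()
    # Pass 2: sum of shifted exponentials
    d = np.exp(x - m).sum()
    # Pass 3: normalize
    return np.exp(x - m) / d
```

Note the three separate traversals of `x`: this is exactly the memory-access pattern we will try to reduce.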
A recurring theme of this Triton series is reducing high-latency global memory accesses. Our current NumPy implementation requires 3 separate passes over the complete input vector, which is inefficient.
Online softmax
Fortunately, we can use a clever trick, known as online softmax, to fuse the max and sum steps and bring the number of passes over memory down to 2.
First, we define a running sum of exponentials. For the following equations, $m_i$ denotes the maximum of $x$ up to index $i$:

$$d_i = d_{i-1}\, e^{m_{i-1} - m_i} + e^{x_i - m_i}$$

This recurrence lets us accumulate the sum of exponentials iteratively, always normalizing by the highest value seen so far. It allows us to fuse the first and second loops of the naive implementation, computing the running maximum and the sum of exponentials in a single pass.
Our algorithm becomes:

This is easily translated to NumPy:
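A sketch of the online version in NumPy (illustrative names; the max and sum reductions are fused into one loop):

```python
import numpy as np

def softmax_online(x: np.ndarray) -> np.ndarray:
    """Online softmax: fuse the max and sum reductions into a single pass."""
    m = -np.inf  # running maximum
    d = 0.0      # running sum of exponentials, normalized by m
    for xi in x:
        m_new = max(m, xi)
        # rescale the running sum to the new maximum, then add the new term
        d = d * np.exp(m - m_new) + np.exp(xi - m_new)
        m = m_new
    # second (and last) pass: normalize
    return np.exp(x - m) / d
```

The rescaling factor `exp(m - m_new)` is what makes it safe to update the sum before the true maximum is known.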
Now that we understand the main principles behind softmax, we will implement it in Triton, starting with a simple single-block version and building up to an online, multi-block implementation. Finally, we want our kernel to behave like a PyTorch module and be compatible with autograd.
Unfortunately, from PyTorch's perspective, Triton kernels behave like black boxes: the operations they perform are not tracked by autograd. This requires us to implement the backward pass ourselves and explicitly specify how the gradients should be computed. Let's summon our beloved chain rule and derive the softmax gradient.
The gradient
Since the outputs of softmax are strictly positive, we can use the logarithmic derivative to simplify the gradient derivation. Here, we take the derivative of log softmax and apply the chain rule:

From there, we rearrange the terms and obtain:

Now imagine that we have some upstream gradient, for example produced by a loss function L (e.g. cross-entropy loss). We get the following expression for the gradient:

The left term of (9) simplifies thanks to the fact that δ_ij only equals 1 when i = j, which collapses the sum over i into a single term.
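Written out in symbols (a reconstruction consistent with the surrounding text, with $\sigma = \mathrm{softmax}(z)$), the logarithmic derivative gives:

$$\frac{\partial \log \sigma_i}{\partial z_j} = \frac{1}{\sigma_i}\frac{\partial \sigma_i}{\partial z_j} \implies \frac{\partial \sigma_i}{\partial z_j} = \sigma_i\,(\delta_{ij} - \sigma_j)$$

and chaining through an upstream loss gradient $\partial L / \partial \sigma$:

$$\frac{\partial L}{\partial z_j} = \sum_i \frac{\partial L}{\partial \sigma_i}\,\sigma_i\,(\delta_{ij} - \sigma_j) = \sigma_j\left(\frac{\partial L}{\partial \sigma_j} - \sum_i \frac{\partial L}{\partial \sigma_i}\,\sigma_i\right)$$

This last expression is what the backward kernel needs to compute.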
Triton implementation
Single-block softmax
Now that we've worked out the gradients, we can write the forward and backward softmax kernels. First, let's focus on the PyTorch wrapper to understand how the single-block implementation works at a high level. Given a 2D input, the forward and backward kernels will process all rows in parallel.
For simplicity, we will define BLOCK_SIZE to be large enough to handle all the columns at once. Specifically, we'll set it to the next power of 2 above the number of columns, as required by Triton.
After that, we will define our `grid` to be the number of rows (it can also account for the batch size).
Our PyTorch wrapper is a SoftmaxSingleBlock class deriving from torch.autograd.Function, which implements both forward and backward methods. Both methods take a ctx argument, which we will use to cache the softmax outputs during the forward pass and reuse them during the backward pass.
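The wrapper structure can be sketched as follows. To keep the sketch runnable without a GPU, plain torch ops stand in for the Triton kernel launches; the ctx caching and the equation (10) backward rule are the parts that carry over to the real wrapper:

```python
import torch

class SoftmaxSingleBlock(torch.autograd.Function):
    """Wrapper structure; torch ops stand in for the Triton kernel launches."""

    @staticmethod
    def forward(ctx, x):
        # In the article, a Triton kernel computes this row-wise in parallel.
        y = torch.softmax(x, dim=-1)
        ctx.save_for_backward(y)  # cache outputs for the backward pass
        return y

    @staticmethod
    def backward(ctx, dy):
        (y,) = ctx.saved_tensors
        # Equation (10): dx_j = y_j * (dy_j - sum_i dy_i * y_i)
        return y * (dy - (dy * y).sum(dim=-1, keepdim=True))

softmax_sb = SoftmaxSingleBlock.apply
```

Calling `softmax_sb(x)` then behaves like any differentiable PyTorch op: autograd invokes our custom backward instead of trying to trace through the kernel.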
Both of these kernels are pretty straightforward: we start by loading the input using the same syntax as in the previous vector addition article. Notice that BLOCK_SIZE and num_warps are computed using a calculate_settings function. This function comes from the Unsloth library and was reused in other kernel libraries such as Liger-Kernel (which the kernels in this article are based on); it provides heuristics to set both variables:
def calculate_settings(n: int) -> tuple[int, int]:
    MAX_FUSED_SIZE = 65536  # maximum grid dimension on Nvidia GPUs
    BLOCK_SIZE = next_power_of_2(n)
    if BLOCK_SIZE > MAX_FUSED_SIZE:
        # we remove this assertion later in this article
        raise RuntimeError(
            f"Cannot launch Triton kernel since n = {n} exceeds "
            f"the maximum CUDA blocksize = {MAX_FUSED_SIZE}."
        )
    num_warps = 4
    if BLOCK_SIZE >= 32768:
        num_warps = 32
    elif BLOCK_SIZE >= 8192:
        num_warps = 16
    elif BLOCK_SIZE >= 2048:
        num_warps = 8
    return BLOCK_SIZE, num_warps
Then, we implement the standard softmax in the forward pass and equation (10) in the backward pass. The only thing new here compared to previous articles is the use of cache modifiers, which hint to the compiler how to load and store data. For now, we will only focus on three cache modifiers:
- `.ca` (cache at all levels): tells the compiler to load the data into both the L1 and L2 caches, hinting that it may be reused soon. This modifier should be used when the data is small enough to fit into L1 (~128-192KB per SM on A100) and will be accessed repeatedly.
- `.cs` (cache streaming): treats the data as streaming: it will be used once and then discarded, freeing up space in L1.
- `.wb` (write-back): the default store policy; the data stays in the cache hierarchy, which is useful if the result may be reused.
In the following kernels, we will use `.ca` for loads, since we perform many operations on the loaded data. We will use `.cs` to store outputs in the forward pass, because the results will not be reused immediately, and `.wb` to store gradients in the backward pass, since in the context of autograd (i.e. the chain rule) the gradient outputs will be consumed by downstream kernels.
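Assuming Triton is installed, a single-block forward kernel using these modifiers might look like this (a sketch, not necessarily the article's exact listing; pointer and argument names are illustrative):

```python
import triton
import triton.language as tl

@triton.jit
def softmax_fwd_kernel(y_ptr, x_ptr, stride, n_cols, BLOCK_SIZE: tl.constexpr):
    row = tl.program_id(0)            # one program per row
    cols = tl.arange(0, BLOCK_SIZE)
    mask = cols < n_cols

    # .ca: keep the row cached; we reuse it for the max, the sum and the normalize
    x = tl.load(x_ptr + row * stride + cols, mask=mask,
                other=-float("inf"), cache_modifier=".ca")

    x = x - tl.max(x, axis=0)         # subtract the max for numerical stability
    num = tl.exp(x)
    y = num / tl.sum(num, axis=0)

    # .cs: streaming store, this kernel never reads the output back
    tl.store(y_ptr + row * stride + cols, y, mask=mask,
             cache_modifier=".cs")
```

The backward kernel follows the same pattern, storing with `.wb` instead of `.cs`.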
Multi-block softmax
Now, let's put the online softmax formulation to work. In this section, we implement a multi-block variant of the previous kernel. This version will use BLOCK_SIZE < n_cols; in other words, we will only load tiles of BLOCK_SIZE elements at a time, similar to how we handled GEMM in the previous article. Now you may ask: "How do we choose the block size?"
This is a good time to introduce Triton's autotune utility. Provided with a list of configurations, autotune will perform a grid search to find and cache the best configuration for a given input shape. This process is repeated each time a new input shape is passed to the kernel.
Here, we perform a grid search over block sizes and number of warps using the following function:
from itertools import product

# --- Multi Block Tuning ---
BLOCK_SIZES = [256, 512, 1024, 2048, 4096, 8192]
NUM_WARPS = [2, 4, 8, 16]

def get_autotune_config(
    block_sizes: list[int], num_warps: list[int]
) -> list[triton.Config]:
    return [
        triton.Config(kwargs={"BLOCK_SIZE": bs}, num_warps=nw)
        for (bs, nw) in product(block_sizes, num_warps)
    ]
Now we can decorate our multi-block kernels with autotune and pass our list of configurations; key="n_cols" indicates that the optimal configuration depends on the number of input columns.
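The decoration might look like this (the kernel signature is illustrative, and the body is elided; only the decorator stacking matters here):

```python
from itertools import product
import triton
import triton.language as tl

BLOCK_SIZES = [256, 512, 1024, 2048, 4096, 8192]
NUM_WARPS = [2, 4, 8, 16]

def get_autotune_config(block_sizes, num_warps):
    return [triton.Config(kwargs={"BLOCK_SIZE": bs}, num_warps=nw)
            for (bs, nw) in product(block_sizes, num_warps)]

@triton.autotune(
    configs=get_autotune_config(BLOCK_SIZES, NUM_WARPS),
    key=["n_cols"],  # re-tune whenever the number of columns changes
)
@triton.jit
def softmax_fwd_mb_kernel(y_ptr, x_ptr, stride, n_cols, BLOCK_SIZE: tl.constexpr):
    # kernel body elided; BLOCK_SIZE is injected by the chosen config
    pass
```

Note that BLOCK_SIZE no longer appears in the launch call: autotune supplies it from the winning configuration.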
The implementation of these kernels is very close to the online softmax we covered above, the main difference being that we operate on tiles (rather than single elements as in NumPy), which requires some adjustments. For example, we add a sum over the tile in the d update, and the forward and backward kernels now each require two loops.
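The tile-level update is easy to prototype in NumPy before writing the Triton version (a sketch with illustrative names; each iteration of the loop mirrors what one tile of the kernel computes):

```python
import numpy as np

def softmax_multiblock(x: np.ndarray, block_size: int = 4) -> np.ndarray:
    """Tile-wise online softmax: the update the multi-block kernel performs."""
    m, d = -np.inf, 0.0
    for start in range(0, len(x), block_size):
        tile = x[start:start + block_size]
        m_new = max(m, tile.max())
        # rescale the running sum, then add the whole tile's contribution
        d = d * np.exp(m - m_new) + np.exp(tile - m_new).sum()
        m = m_new
    # second loop over tiles in the kernel; vectorized here for brevity
    return np.exp(x - m) / d
```

The only change versus the element-wise online version is the `.sum()` over the tile inside the d update.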
Note: the PyTorch wrapper is exactly the same, except that we removed the lines where BLOCK_SIZE and num_warps were defined (since they are now selected by autotune).
Validation and benchmarks
Now we can run forward and backward passes through both kernels and make sure they match PyTorch's reference implementation:
def validate_kernel(kernel_fn: callable) -> None:
    device = "cuda:0" if torch.cuda.is_available() else "cpu"
    torch.random.manual_seed(0)

    # Generate inputs
    x = torch.randn((256, 512), device=device)  # triton input
    x.requires_grad = True
    xt = deepcopy(x)  # torch input

    triton_output = kernel_fn(x)
    torch_output = torch.softmax(xt, dim=1)
    torch.testing.assert_close(triton_output, torch_output)  # test fwd kernel

    # Setup fake labels
    y = torch.zeros_like(x)
    inds = (torch.arange(0, y.shape[0]), torch.randint(0, 3, (y.shape[0],)))
    y[inds] = 1

    # Define loss and run backward pass
    loss_fn = torch.nn.CrossEntropyLoss()
    loss = loss_fn(torch_output, y)
    loss.backward()

    # Save gradient tensor for later
    torch_xgrad = xt.grad.detach().clone()
    triton_loss = loss_fn(triton_output, y)
    triton_loss.backward()
    torch.testing.assert_close(x.grad, torch_xgrad)  # test grad outputs

validate_kernel(softmax_sb)
validate_kernel(softmax_mb)
Finally, we benchmark our implementations against the PyTorch baseline using the following snippet:
# --- Source: Triton softmax tutorial ---
@triton.testing.perf_report(
    triton.testing.Benchmark(
        x_names=["N"],  # argument names to use as an x-axis for the plot
        x_vals=[
            128 * i for i in range(2, 100)
        ],  # different possible values for `x_name`
        line_arg="provider",  # argument name whose value corresponds to a different line in the plot
        line_vals=[
            "triton_single_block",
            "triton_multi_block",
            "torch",
        ],  # possible values for `line_arg`
        line_names=[
            "Triton_single_block",
            "Triton_multi_block",
            "Torch",
        ],  # label name for the lines
        styles=[("blue", "-"), ("green", "-"), ("red", "-")],
        ylabel="GB/s",  # label name for the y-axis
        plot_name="softmax-performance",  # name for the plot, also used as a file name
        args={"M": 4096},  # values for function arguments not in `x_names` and `y_name`
    )
)
def benchmark(M, N, provider):
    x = torch.randn(M, N, device=DEVICE, dtype=torch.float32)
    stream = getattr(torch, DEVICE.type).Stream()
    getattr(torch, DEVICE.type).set_stream(stream)
    if provider == "torch":
        ms = triton.testing.do_bench(lambda: torch.softmax(x, axis=-1))
    if provider == "triton_single_block":
        torch.cuda.synchronize()
        ms = triton.testing.do_bench(lambda: softmax_sb(x))
        torch.cuda.synchronize()
    if provider == "triton_multi_block":
        torch.cuda.synchronize()
        ms = triton.testing.do_bench(lambda: softmax_mb(x))
        torch.cuda.synchronize()
    gbps = lambda ms: 2 * x.numel() * x.element_size() * 1e-9 / (ms * 1e-3)
    return gbps(ms)

benchmark.run(show_plots=True, print_data=True)
Good news! Our single-block kernel is consistently ahead of the PyTorch baseline, while the multi-block variant closes the gap for inputs with more than 6k columns:

If we consider larger inputs, we can make several observations:
- The multi-block kernel eventually plateaus around 900 GB/s of throughput, surpassing the PyTorch baseline for inputs with more than 30k columns.
- Interestingly, it looks like the multi-block variant dominates for inputs with more than 60k columns.
- Even when the single-block kernel exceeds the maximum block size, it somehow still runs fine. Indeed, Triton automatically manages the block size under the hood. When n_cols is greater than the hardware limit, Triton will split the input and loop over it. However, this appears to be slower than the explicit multi-block approach.
To go further, we could combine both methods into a single wrapper that explicitly dispatches to the appropriate kernel based on the input size. In this way, we would benefit from the high performance of the single-block kernel for small inputs and the high throughput of the multi-block variant for inputs with more than 60k columns.

This concludes the third episode of this Triton series, thanks again for your support!
In the next article, we will build on the online softmax formulation in the context of Flash Attention.
Until next time! 👋



