Overcoming the Hidden Performance Pitfalls of Dynamic-Shaped Tensors: A Data-Sampling Case Study in PyTorch

This post is part of a series on analyzing and optimizing PyTorch models. Throughout the series, we have encouraged the use of the PyTorch Profiler in AI model development and demonstrated its potential impact on the speed and cost efficiency of AI/ML workloads. One common scenario we have seen is code that appears innocent but quietly degrades runtime performance. In this post, we examine some of the pitfalls associated with the naive use of dynamic-shaped tensors, i.e., tensors whose shapes depend on the results of previous computations and/or on the input contents. Although it is not possible in every case, the use of dynamically shaped tensors can sometimes be avoided, though this may come at the expense of additional compute and/or memory. We will demonstrate the trade-offs involved on a toy implementation of data sampling in PyTorch.
Three Pitfalls of Dynamic-Shaped Tensors
We motivate the discussion by presenting three difficulties associated with the use of dynamic-shaped tensors:
Host-device synchronization events
In the ideal case, the CPU and GPU run in parallel: the CPU continuously prepares input samples, allocates the required GPU memory, and launches the necessary GPU kernels, while the GPU executes the launched kernels on the allocated memory. The presence of tensors with dynamic shapes throws a wrench into this pipeline. In order to allocate the right amount of memory, the CPU must wait for the GPU to report the tensor's shape, and the GPU, in turn, must wait for the CPU to allocate the memory and launch the next kernel. Each such synchronization event can cause a drop in GPU utilization and slower runtime performance.
We saw an example of this in an earlier part of this series, where we examined the hidden implications of common loss functions whose implementations include torch.nonzero and torch.unique calls. Both APIs return tensors whose sizes depend on the input contents. When these operations run on the GPU, a host-device synchronization event occurs. In the case of the cross-entropy loss, we identified the inefficiency using the PyTorch Profiler.
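To see why these APIs force a synchronization, note that the size of their output cannot be known without inspecting the tensor's contents. A minimal CPU-side sketch (with illustrative values of our choosing):

```python
import torch

labels = torch.tensor([1, 0, 1, 1, 0])
pos = torch.nonzero(labels == 1, as_tuple=True)[0]
print(pos)        # tensor([0, 2, 3])
print(pos.shape)  # torch.Size([3]): depends on the *values* in labels
# On a GPU, the CPU must wait for the kernel to finish before it can
# learn this shape and allocate the output tensor.
```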
Graph compilation
In a recent post we explored the potential benefits of applying just-in-time (JIT) compilation using the torch.compile operator. One of our observations was that graph compilation gives the best results when the graph is static. The presence of dynamic shapes in the graph limits the scope of the optimization: in some cases compilation fails completely; in others it yields a lower performance gain. Similar considerations apply to other forms of graph compilation, such as XLA, ONNX, OpenVINO, and TensorRT.
Input batching
Another optimization that we have encountered in many of our posts (e.g., here) is input batching. Batching improves performance in two main ways:
- Reducing kernel launch overhead: Instead of launching the GPU kernels of the compute pipeline once per input sample, the CPU can launch them once per batch.
- Increasing the parallelism of the compute: GPUs are massively parallel engines. The more work we can feed them at once, the better we can saturate the GPU and increase its utilization. With batched input we can increase the degree of parallelism by a factor of the batch size.
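As a simple illustration of both points, a single batched operation replaces a Python loop of per-sample calls. This sketch (with arbitrary shapes of our own choosing) produces the same numerical result either way:

```python
import torch

torch.manual_seed(0)
batch = torch.randn(32, 16)   # 32 samples, 16 features each
weight = torch.randn(16, 8)

# per-sample loop: one matmul kernel launch per sample
looped = torch.stack([sample @ weight for sample in batch])

# batched: a single kernel launch covering the whole batch
batched = batch @ weight

print(torch.allclose(looped, batched, atol=1e-5))
```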
Despite these drawbacks, the use of dynamic-shaped tensors is often unavoidable. But sometimes we can modify our model implementation to eliminate them. Sometimes the required change is straightforward (as in the case of the cross-entropy loss). Other times it requires some creativity, i.e., coming up with a different sequence of PyTorch operations that produces the same numerical result. Often, this effort can yield considerable rewards in runtime and cost.
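As a small illustration of such a substitution (our own example, separate from the sampling code below): boolean-mask indexing produces a tensor whose size depends on the data, while torch.where keeps the shape fixed at the cost of carrying placeholder values:

```python
import torch

x = torch.tensor([0.5, -1.0, 2.0, -0.3])

dynamic = x[x > 0]                                    # shape depends on values in x
static = torch.where(x > 0, x, torch.zeros_like(x))   # shape always matches x

print(dynamic.shape)  # torch.Size([2])
print(static.shape)   # torch.Size([4])
```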
In the following sections, we will study the use of dynamic-shaped tensors in the context of data sampling. We will start with a naive implementation and analyze its performance. We will then propose a GPU-friendly alternative that avoids the use of dynamic-shaped tensors.
To compare our implementations, we will use an Amazon EC2 g6e.xlarge instance with an NVIDIA L40S GPU, running an AWS Deep Learning AMI (DLAMI) with PyTorch (2.8). The code we share is for demonstration purposes only; please do not rely on its correctness or robustness. Please do not interpret our mention of any framework, library, or platform as an endorsement of its use.
Sampling in AI Model Workloads
In the context of this post, sampling refers to the selection of a subset of items from a larger set of candidates for the purposes of training, evaluation, or inference. Sampling is common in many AI/ML models, such as detection, ranking, and contrastive learning systems.
We consider a simple variant of the sampling problem: given a list of N items, each with a binary label, we are asked to return a subset of K items containing both positive and negative examples, in random order. If the input list contains enough samples of each label (at least K/2), the returned subset should be split equally between the two. If there is a shortage of samples of one type, it should be supplemented with additional random samples of the other type.
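The specification above can be captured in a few lines of plain Python. This is a reference sketch of our own (the function name `sample_spec` is illustrative), not the PyTorch implementation used in our tests:

```python
import random

def sample_spec(labels, k):
    """Reference implementation of the sampling spec (pure Python)."""
    pos = [i for i, lbl in enumerate(labels) if lbl == 1]
    neg = [i for i, lbl in enumerate(labels) if lbl == 0]
    num_pos = min(len(pos), k // 2)
    num_neg = min(len(neg), k // 2)
    if num_neg < k // 2:      # not enough negatives: take extra positives
        num_pos = k - num_neg
    elif num_pos < k // 2:    # not enough positives: take extra negatives
        num_neg = k - num_pos
    idxs = random.sample(pos, num_pos) + random.sample(neg, num_neg)
    random.shuffle(idxs)      # return the k indices in random order
    return idxs

random.seed(0)
labels = [1, 0] * 20 + [1] * 5    # 25 positives, 20 negatives
chosen = sample_spec(labels, 10)
print(len(chosen))                     # 10
print(sum(labels[i] for i in chosen))  # 5: an even split
```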
The code block below contains a PyTorch implementation of our sampling function. The implementation is inspired by the popular Detectron2 library (e.g., see here and here). For the tests in this post, we fix the sampling ratio at 1:10.
import torch
INPUT_SAMPLES = 10000
SUB_SAMPLE = INPUT_SAMPLES // 10
FEATURE_DIM = 16
def sample_data(input_array, labels):
    device = labels.device
    positive = torch.nonzero(labels == 1, as_tuple=True)[0]
    negative = torch.nonzero(labels == 0, as_tuple=True)[0]
    num_pos = min(positive.numel(), SUB_SAMPLE // 2)
    num_neg = min(negative.numel(), SUB_SAMPLE // 2)
    if num_neg < SUB_SAMPLE // 2:
        num_pos = SUB_SAMPLE - num_neg
    elif num_pos < SUB_SAMPLE // 2:
        num_neg = SUB_SAMPLE - num_pos
    # randomly select positive and negative examples
    perm1 = torch.randperm(positive.numel(), device=device)[:num_pos]
    perm2 = torch.randperm(negative.numel(), device=device)[:num_neg]
    pos_idxs = positive[perm1]
    neg_idxs = negative[perm2]
    sampled_idxs = torch.cat([pos_idxs, neg_idxs], dim=0)
    rand_perm = torch.randperm(SUB_SAMPLE, device=labels.device)
    sampled_idxs = sampled_idxs[rand_perm]
    return input_array[sampled_idxs], labels[sampled_idxs]
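As a quick sanity check of the function above (repeated here with smaller constants so the snippet is self-contained and runs on CPU), the output always contains SUB_SAMPLE items, split evenly whenever both labels are plentiful:

```python
import torch

INPUT_SAMPLES = 100
SUB_SAMPLE = INPUT_SAMPLES // 10  # K = 10

def sample_data(input_array, labels):
    # repeated from the post, unchanged apart from the smaller constants
    device = labels.device
    positive = torch.nonzero(labels == 1, as_tuple=True)[0]
    negative = torch.nonzero(labels == 0, as_tuple=True)[0]
    num_pos = min(positive.numel(), SUB_SAMPLE // 2)
    num_neg = min(negative.numel(), SUB_SAMPLE // 2)
    if num_neg < SUB_SAMPLE // 2:
        num_pos = SUB_SAMPLE - num_neg
    elif num_pos < SUB_SAMPLE // 2:
        num_neg = SUB_SAMPLE - num_pos
    perm1 = torch.randperm(positive.numel(), device=device)[:num_pos]
    perm2 = torch.randperm(negative.numel(), device=device)[:num_neg]
    sampled_idxs = torch.cat([positive[perm1], negative[perm2]], dim=0)
    rand_perm = torch.randperm(SUB_SAMPLE, device=device)
    sampled_idxs = sampled_idxs[rand_perm]
    return input_array[sampled_idxs], labels[sampled_idxs]

torch.manual_seed(0)
x = torch.randn(INPUT_SAMPLES, 4)
y = torch.randint(0, 2, (INPUT_SAMPLES,))
sx, sy = sample_data(x, y)
print(sx.shape)  # torch.Size([10, 4])
print(sy.shape)  # torch.Size([10])
```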
Performance Analysis with the PyTorch Profiler
Although not immediately apparent from the code, the use of dynamic shapes is easily seen in the PyTorch Profiler trace view. We use the following function to enable the PyTorch Profiler:
def profile(fn, input, labels):
    def export_trace(p):
        p.export_chrome_trace(f"{fn.__name__}.json")

    with torch.profiler.profile(
        activities=[torch.profiler.ProfilerActivity.CPU,
                    torch.profiler.ProfilerActivity.CUDA],
        with_stack=True,
        schedule=torch.profiler.schedule(wait=0, warmup=10, active=5),
        on_trace_ready=export_trace
    ) as prof:
        for _ in range(20):
            fn(input, labels)
            torch.cuda.synchronize()  # explicit sync for trace readability
            prof.step()

# create random input
input_samples = torch.randn((INPUT_SAMPLES, FEATURE_DIM), device='cuda')
labels = torch.randint(0, 2, (INPUT_SAMPLES,),
                       device='cuda', dtype=torch.int64)

# run with profiler
profile(sample_data, input_samples, labels)
The image below was captured with an input size of 10 million samples. It clearly shows the host-device synchronization events caused by the torch.nonzero calls, and the corresponding drops in GPU utilization:
The use of torch.nonzero in our implementation is clearly costly. But can it be avoided?
A GPU-Friendly Data Sampler
We propose an alternative implementation of our sampling function that replaces the torch.nonzero calls with a combination of the shape-static torch.count_nonzero, torch.topk, and other APIs:
def opt_sample_data(input, labels):
    pos_mask = labels == 1
    neg_mask = labels == 0
    num_pos_idxs = torch.count_nonzero(pos_mask, dim=-1)
    num_neg_idxs = torch.count_nonzero(neg_mask, dim=-1)
    half_samples = labels.new_full((), SUB_SAMPLE // 2)
    num_pos = torch.minimum(num_pos_idxs, half_samples)
    num_neg = torch.minimum(num_neg_idxs, half_samples)
    num_pos = torch.where(
        num_neg < SUB_SAMPLE // 2,
        SUB_SAMPLE - num_neg,
        num_pos
    )
    num_neg = SUB_SAMPLE - num_pos
    # create random ordering on pos and neg entries
    rand = torch.rand_like(labels, dtype=torch.float32)
    pos_rand = torch.where(pos_mask, rand, -1)
    neg_rand = torch.where(neg_mask, rand, -1)
    # select top pos entries and invalidate others
    # since CPU doesn't know num_pos, we assume maximum to avoid sync
    top_pos_rand, top_pos_idx = torch.topk(pos_rand, k=SUB_SAMPLE)
    arange = torch.arange(SUB_SAMPLE, device=labels.device)
    if num_pos.numel() > 1:
        # unsqueeze to support batched input
        arange = arange.unsqueeze(0)
        num_pos = num_pos.unsqueeze(-1)
        num_neg = num_neg.unsqueeze(-1)
    top_pos_rand = torch.where(arange >= num_pos, -1, top_pos_rand)
    # repeat for neg entries
    top_neg_rand, top_neg_idx = torch.topk(neg_rand, k=SUB_SAMPLE)
    top_neg_rand = torch.where(arange >= num_neg, -1, top_neg_rand)
    # combine and mix together positive and negative idxs
    cat_rand = torch.cat([top_pos_rand, top_neg_rand], dim=-1)
    cat_idx = torch.cat([top_pos_idx, top_neg_idx], dim=-1)
    topk_rand_idx = torch.topk(cat_rand, k=SUB_SAMPLE)[1]
    sampled_idxs = torch.gather(cat_idx, dim=-1, index=topk_rand_idx)
    sampled_input = torch.gather(
        input, dim=-2,
        index=sampled_idxs.unsqueeze(-1).expand(*sampled_idxs.shape,
                                                input.shape[-1]))
    sampled_labels = torch.gather(labels, dim=-1, index=sampled_idxs)
    return sampled_input, sampled_labels
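The core trick above is worth isolating: assign each candidate a random score, mask out the wrong class with a sentinel score of -1, and take topk. This selects a uniformly random subset of the desired class while keeping the output shape fixed. A minimal self-contained sketch (with illustrative values of our choosing):

```python
import torch

torch.manual_seed(0)
labels = torch.tensor([1, 0, 1, 1, 0, 0, 0, 1])
k = 2  # number of positives to draw

pos_mask = labels == 1
rand = torch.rand(labels.shape)
# wrong-class entries get score -1, so topk never picks them
# (as long as at least k positives exist)
pos_rand = torch.where(pos_mask, rand, torch.full_like(rand, -1.0))
_, idx = torch.topk(pos_rand, k=k)  # output shape is always (k,)

print(labels[idx])  # tensor([1, 1])
```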
Obviously, this implementation requires more memory and more compute than our first one. The question is: do the performance benefits of its static, synchronization-free nature outweigh the additional costs in memory and compute?
To evaluate the trade-off, we use the following benchmarking utility:
def benchmark(fn, input, labels):
    # warm-up
    for _ in range(20):
        _ = fn(input, labels)
    iters = 100
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        _ = fn(input, labels)
    end.record()
    torch.cuda.synchronize()
    avg_time = start.elapsed_time(end) / iters
    print(f"{fn.__name__} average step time: {avg_time:.4f} ms")

benchmark(sample_data, input_samples, labels)
benchmark(opt_sample_data, input_samples, labels)
The following table compares the runtimes of the two implementations for various input sample sizes:

For most input sample sizes, the overhead of the host-device sync events is comparable to, or lower than, the additional compute of the optimized implementation. Disappointingly, we only see a significant benefit from the sync-free alternative once the input sample size reaches ten million, well beyond standard sample sizes in AI/ML settings. But it is not in our nature to give up easily. As noted above, the sync-free implementation enables further optimizations such as graph compilation and input batching.
Graph compilation
Contrary to the original implementation, which fails to compile, our optimized implementation is fully compatible with torch.compile:
benchmark(torch.compile(opt_sample_data), input_samples, labels)
The following table includes the runtimes of the compiled sampling function:

The results are much improved, delivering a 70-75 percent speedup over the original sample_data implementation across the input sizes tested. But there is still more performance to be gained.
Maximizing Performance with Batched Input
Because the original implementation contains dynamic-shaped tensors, it cannot handle batched input directly. To process a batch, we have no choice but to apply it to each sample individually, in a Python loop:
BATCH_SIZE = 32

def batched_sample_data(inputs, labels):
    sampled_inputs = []
    sampled_labels = []
    for i in range(inputs.size(0)):
        inp, lab = sample_data(inputs[i], labels[i])
        sampled_inputs.append(inp)
        sampled_labels.append(lab)
    return torch.stack(sampled_inputs), torch.stack(sampled_labels)
In contrast, our optimized function supports batched input as is; no changes are required:
input_batch = torch.randn((BATCH_SIZE, INPUT_SAMPLES, FEATURE_DIM),
                          device='cuda')
labels = torch.randint(0, 2, (BATCH_SIZE, INPUT_SAMPLES),
                       device='cuda', dtype=torch.int64)

benchmark(batched_sample_data, input_batch, labels)
benchmark(opt_sample_data, input_batch, labels)
benchmark(torch.compile(opt_sample_data), input_batch, labels)
The table below compares the step times of our sampling operations with a batch size of 32:

Now the results are conclusive: using our optimized implementation of data sampling, we are able to boost performance by 2x-52x (!!) over the dynamically shaped baseline, depending on the input sample size.
Note that although our tests were run on a GPU device, graph compilation and input batching apply in a CPU environment as well. Consequently, avoiding dynamic shapes can also benefit the performance of AI/ML models on CPU.
Summary
The optimization process we demonstrated in this post, applied to a specific data-sampling use case, consisted of three steps:
- Discovery through performance profiling: Using the PyTorch Profiler, we identified the host-device synchronization events caused by dynamic-shaped tensors.
- An alternative implementation: This discovery led us to develop an alternative sequence of operations that achieves the same goal while avoiding dynamic-shaped tensors. However, this step came at the cost of additional compute and memory overhead; as seen in our initial benchmarks, the sync-free alternative performed worse for standard input sizes.
- Unlocking further optimizations: The real win came from combining the sync-free implementation with graph compilation and batched input. These optimizations far outweighed the initial overhead, resulting in a 2x-52x speedup over the original implementation.
Naturally, not every story will end as happily as ours. In many cases, we may find PyTorch code that performs poorly on the GPU and for which no alternative implementation exists, or only one that requires prohibitive resources. However, armed with the process of identifying runtime inefficiencies and evaluating alternative implementations, you are empowered to pursue meaningful gains in the performance and cost of your AI/ML components.



