I Built a C++ Backend So My GPU Would Stop Eating Air


This is a humorous-but-real tour of WarpGroup-Backend, covering VRAM-aware bin packing, pinned-memory transfers, and how to make your LLM up to 5.89× faster by being mildly rude to PyTorch.

Repo: github.com/AnubhabBanerjee/WarpGroup-backend

TL;DR: Standard LLM batching pads short sequences with zeros so they match the longest one. Your GPU then dutifully performs billions of multiplications on those zeros, which is the computational equivalent of paying a chef to cook an empty plate. WarpGroup-Backend replaces this with a small C++ engine that crams variable-length sequences together like a very anxious Tetris champion. Result: 2.08× throughput on an H100, 5.89× on a GTX 1080, and zero OOM crashes. The article tells the whole story with code, jokes, and only a moderate amount of yelling at NVIDIA.

(Quick confession before we start: I came at this from a 5G/6G RAN engineering background. As it turns out, GPU bin packing is shockingly close to what the MAC scheduler in your phone has been doing for two decades. There’s a whole section on that below — section 7 — but it’s also why I’m writing this in the first place.)

1. A confession: most of your GPU’s “work” is fake

If you’ve ever batched variable-length text through a transformer, here is what really happens, dramatized:

You: “Please summarize these 8 documents, GPU.”

PyTorch: “Absolutely. Let me just make them all the same shape first.”

You: “Wait, they’re 80, 90, 110, 130, 95, 1850, 2000, and 60 tokens. Don’t...”

PyTorch: “Padded everything to 2000. Have fun.” 🫡

GPU: cheerfully burns ~half its compute and memory bandwidth on padded zeros

Your AWS bill: *develops a sense of humor*

That’s the joke. That’s the whole industry’s dirty secret. Variable-length data + rectangular tensors = your GPU getting paid by the hour to do pretend work.

WarpGroup-Backend is what happens when you decide enough is enough and you’d rather write 30% C++ than keep paying for that nonsense.

2. Why does padding exist at all? (a one-minute crash course)

Skip this if you already know. For everyone else:

A GPU loves rectangles. Specifically, it loves matrices where every row is the same length, because that lets it run thousands of identical math operations in parallel. This is what makes a GPU a GPU instead of an expensive paperweight.

But text is not a rectangle. Text is a ragged mess. One tweet is 12 tokens. One legal contract is 4,000. If you want to feed 8 of them at once through a transformer, you have two options:

Pad them all to the longest one with a pad token. The math runs on rectangles, your code is simple, and your GPU now spends most of its time multiplying zero by another zero. (HuggingFace’s default.)
Concatenate them into one long 1-D ribbon and tell the attention kernel “hey, please don’t let token 12 talk to token 4001, they’re from different documents.” This is variable-length attention, and it’s what FlashAttention-2’s flash_attn_varlen_func exists to do.

Option 2 is obviously better, and a growing number of production inference stacks (vLLM, TensorRT-LLM, SGLang, FlashInfer, TGI internals) already use variants of it.
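
To put a number on the padding tax, here is a minimal sketch in plain Python (no GPU needed) that measures the wasted slots for the eight lengths from the dialogue in section 1. The helper name padding_overhead is made up for illustration and is not part of the repo.

```python
def padding_overhead(lengths):
    """Return (real_tokens, padded_slots, wasted_fraction) for one padded batch."""
    longest = max(lengths)
    real_tokens = sum(lengths)
    # A padded batch is a rectangle: every row is stretched to the longest one.
    padded_slots = longest * len(lengths)
    wasted_fraction = (padded_slots - real_tokens) / padded_slots
    return real_tokens, padded_slots, wasted_fraction


lengths = [80, 90, 110, 130, 95, 1850, 2000, 60]
real, padded, wasted = padding_overhead(lengths)
print(f"{real} real tokens in {padded} slots: {wasted:.1%} padding")
# prints: 4415 real tokens in 16000 slots: 72.4% padding
```

For this particular toy batch the waste is well over half, because two long documents drag six short ones up to 2000 slots each.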

But calling flash_attn_varlen_func is the easy part. The hard part — the part everyone keeps re-implementing in slightly different shapes — is organizing the ribbon: deciding which documents go in which batch, in what order, up to what total length, while keeping the GPU saturated and the host-side overhead invisible.

WarpGroup-Backend is the annoying-organizer half of that equation, written in C++ so it can be fast about it.

3. The “just pack them” lightbulb (and why it’s harder than it sounds)

The pitch is simple: take your queue of variable-length sequences and pack them into bins like a really intense moving-day grandma. Each bin holds up to N tokens total. You stuff as many small sequences as you can fit alongside each big one, then ship the whole bin to the GPU as one flat ribbon.

This is a classic bin-packing problem, and the textbook solution is First-Fit Decreasing (FFD):

Sort sequences from longest to shortest.
For each sequence, walk through the open bins. Drop it in the first one that has room.
If none have room, open a new bin.

This is the same algorithm your brain uses when packing a suitcase: big rocks first, then pebbles fill the cracks. It’s not optimal, but it gets within ~22% of optimal in the worst case while being O(n log n) instead of taking until the heat death of the universe.
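
The three steps above can be sketched in a few lines of Python. This is an illustration of textbook First-Fit Decreasing on bare lengths, not the repo's C++ packer; the function name and the toy capacity of 2048 tokens are assumptions chosen for the example.

```python
def first_fit_decreasing(lengths, capacity):
    """Pack sequence lengths into bins holding at most `capacity` tokens each."""
    bins = []  # bins[i] lists the lengths placed in bin i
    free = []  # free[i] is the room still left in bin i
    for length in sorted(lengths, reverse=True):  # longest first
        if length > capacity:
            raise ValueError(f"sequence of {length} tokens exceeds capacity {capacity}")
        for i, room in enumerate(free):
            if length <= room:  # the first open bin with room wins
                bins[i].append(length)
                free[i] -= length
                break
        else:
            # No open bin had room, so open a new one.
            bins.append([length])
            free.append(capacity - length)
    return bins


bins = first_fit_decreasing([80, 90, 110, 130, 95, 1850, 2000, 60], capacity=2048)
for b in bins:
    print(sum(b), b)
# prints:
# 2000 [2000]
# 2040 [1850, 130, 60]
# 375 [110, 95, 90, 80]
```

The same eight documents that needed a 16,000-slot rectangle now fit in three bins totalling 4,415 tokens. Note that this simple version scans the open bins linearly; the sort is the O(n log n) part.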

Now here are the three things that make this not a 10-line Python script:

Problem A: How big should the bin be?

Easy answer: “as big as VRAM lets me.” Wrong answer: that’s not a number you can look up. The actual usable VRAM depends on:

Model size (a 7B in bf16 eats ~14 GB before you’ve done anything).
The activation/KV-cache memory for this specific sequence length (which scales weirdly).
PyTorch’s allocator fragmentation (an art, not a science).
Whether you sneezed on the driver.

You can’t compute this. You have to measure it. Phase 0 of WarpGroup is literally “the GPU’s job interview” — keep asking it to swallow larger and larger sequences until it OOMs, then back off 10% for safety.

Problem B: GPUs are picky eaters (the 16-token thing)

NVIDIA Tensor Cores execute their matrix multiply-accumulate (MMA) instructions on fixed tile shapes — for example m16n8k16 on Hopper, with the exact dimensions varying by datatype and architecture. The practical heuristic that falls out of all that complexity is delightfully simple: GPU kernels (including FlashAttention-2) tend to hit best efficiency when sequence-related dimensions are multiples of 16 (sometimes 8, depending on dtype and arch). So we round every sequence length up: 137 → 144, 200 → 208. It’s not that the GPU literally can’t process the last 9 tokens — it’s that letting them dangle ragged costs you on memory coalescing, warp tiling, and the GEMM shapes the downstream kernels actually want to see.

This is the silliest part of the whole exercise and yet ignoring it leaves measurable throughput on the floor.
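
As a quick sanity check on the rounding rule, here is a tiny Python sketch. The name align_up is hypothetical; the multiple of 16 is the heuristic described above, not a hard hardware requirement.

```python
def align_up(length, multiple=16):
    """Round length up to the next multiple using ceiling division."""
    return -(-length // multiple) * multiple


for n in (137, 200, 144):
    print(n, "->", align_up(n), f"(+{align_up(n) - n} pad)")
# prints:
# 137 -> 144 (+7 pad)
# 200 -> 208 (+8 pad)
# 144 -> 144 (+0 pad)
```

The cost is bounded: each sequence gains at most 15 pad tokens, regardless of how long it is.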

Problem C: Python is too slow to do this in the hot loop

I love Python. I have tattooed import this on my soul. But Python has the GIL, the global interpreter lock, which is basically a bouncer that only lets one thread into the Python nightclub at a time. While Python is busy tokenizing PDFs, it cannot also be busy packing bins. While it’s packing bins, it cannot be busy feeding the GPU. The producer-consumer pipeline collapses into a slow, sad single-file line.

Solution: do the packing in C++, in a background thread, and release the GIL at the PyBind11 boundary so the Python side can keep tokenizing while C++ keeps packing. Now you have actual parallelism. The bouncer is letting two friends in at once. Enjoy!

4. The five-phase pipeline (the actually-cool part)

Here’s the high-level architecture, with jokes:

Phase 0: Empirically measure how much VRAM the model has left. (Python)
Phase 1: Tokenize text and yeet integers across the C++ border. (Python → C++)
Phase 2: Catch yeets in a thread-safe queue with no GIL drama. (C++ async dispatcher)
Phase 3: Sort, align, and Tetris them into 1-D bins. (C++ bin packer)
Phase 4: Hand bins to the GPU via pinned-memory async DMA. (C++ memory pool)
Phase 5: FlashAttention-2 enjoys its zero-padding lunch. (PyTorch)

Let’s walk through each one with the actual code. I’ll keep snippets short; the full files are linked.

Phase 0 — The GPU job interview

We need to know exactly how many tokens fit in VRAM. So we ask the GPU, politely but firmly, until it cries.

This lives in streaming_dataloader.py:

import gc
import sys

import torch


def determine_vram_capacity(model, device, start_tokens=0, step_size=5000, vocab_size=32000):
    """
    Phase 0: this method detects the device (GPU) on which it is currently hosted, loads the model into the VRAM, see how much space is left
    and how many tokens can be fit into the remaining space of the VRAM. It then increments the number of tokens by the step size and 
    repeats the process until the max capacity is reached. It returns the maximum number of tokens that can be fit into the VRAM while 
    the model is loaded.
    """
    # this part loads the model into the VRAM and checks if it can be loaded. if it can, it's okay. if it cannot be done,
    # it raises an error.
    try:
        model.to(device)
    except Exception:
        sys.exit("error loading model into VRAM")
    print("\nStarting Phase 0: model loaded successfully in VRAM\n\n")

    # this part checks how much space is left, after the model is loaded, in terms of number of tokens
    # Probe remaining capacity by running synthetic forwards until CUDA OOM; then scale down by 0.9 for fragmentation.
    model.eval()  # inference-only (dropout off, batchnorm stats fixed), matching real packing inference

    max_tokens = start_tokens  # current sequence length to try; grows by step_size after each success
    safe_limit = 0  # last max_tokens that completed a forward without OOM (hardware redline before the failed step)

    with torch.no_grad():  # no autograd state = roughly half the activation memory vs training mode
        while True:
            try:
                # Synthetic token IDs: distribution does not matter for memory; size drives activations/attention workspace
                dummy_tokens = torch.randint(0, vocab_size, (max_tokens,), device=device)

                # Single packed sequence [0, max_tokens): worst-case one long sequence for max seqlen in the pass
                dummy_cu_seqlens = torch.tensor([0, max_tokens], dtype=torch.int32, device=device)

                # Real forward on target device: stresses the same kernels/allocations as production (unlike padding-only tests)
                # Signature must match your model (e.g. FlashAttention varlen). Adjust kwargs if your forward differs.
                _ = model(dummy_tokens, cu_seqlens=dummy_cu_seqlens, max_seqlen=max_tokens)

                safe_limit = max_tokens  # this length fit; treat as the best known redline so far
                max_tokens += step_size  # try a longer stretch next (coarse search; reduces how many forwards you run)

                del _, dummy_tokens, dummy_cu_seqlens  # drop references so the next trial’s peak memory is not inflated by leftovers

            except RuntimeError as e:
                if "out of memory" in str(e).lower():
                    # Forward at max_tokens exceeded free VRAM: stop; safe_limit is the last successful length
                    torch.cuda.empty_cache()  # return cached blocks to the pool (no-op on CPU-only builds in many versions)
                    gc.collect()  # free any Python objects still holding device tensor views
                    break
                # Shape/keyword errors mean the dummy call does not match the model API, not “out of VRAM”
                raise

    if safe_limit == 0:
        # If the very first try at start_tokens OOMs, there is no successful redline; lower start_tokens (no automatic halving here)
        sys.exit("Error: Model cannot process the starting token count without OOM. Decrease start_tokens.")

    # 0.9x: headroom for allocator fragmentation and small spikes on real batches vs this idealized probe
    optimal_capacity = int(safe_limit * 0.9)
    print(f"\nOptimal bin capacity locked at: {optimal_capacity} tokens.\n")
    
    return optimal_capacity

That’s it. That’s the whole “autotune.” Throw bigger and bigger fake sequences at the model until CUDA flips a table, write down the last one that survived, multiply by 0.9 for safety. No vendor whitepaper. No theoretical formula. Just bullying the GPU until it tells you its real limit.

On an H100 with Qwen2.5-7B-Instruct, this comes out to around 76,500 tokens per bin. Try fitting that into a batch-of-8 mindset.
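
To see how the probe arithmetic plays out without a GPU, here is a sketch of the same loop with the real forward pass swapped for a fake fits() callback. The 87,000-token "true limit" is invented for illustration; it is simply one value that lands on 76,500 with the default 5,000-token step, and is not a measurement from the repo.

```python
def probe_capacity(fits, start_tokens=0, step_size=5000):
    """Grow the probe length until fits() fails, then keep 90% of the last success."""
    max_tokens = start_tokens
    safe_limit = 0
    while fits(max_tokens):
        safe_limit = max_tokens   # this length survived
        max_tokens += step_size   # try a longer one next
    if safe_limit == 0:
        raise RuntimeError("first real probe failed; lower start_tokens")
    return safe_limit * 9 // 10   # integer form of the 0.9 safety factor


# Hypothetical device that runs out of memory above 87,000 tokens.
capacity = probe_capacity(lambda n: n <= 87_000)
print(capacity)  # 76500
```

The coarse step means the probe can undershoot the real limit by almost one full step: any true limit from 85,000 to 89,999 yields the same answer here. A smaller step_size trades more probe forwards for a tighter estimate.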

Phase 1 — Tokenize and yeet

Python reads documents (PDFs, JSONL, whatever), tokenizes them with the HuggingFace tokenizer, and submits each as a list of integers across the PyBind11 boundary. This is in reader_and_tokennizer.py:

for text in _iter_text_units(data_source, json_key):
        tokens = tokenizer.encode(text, add_special_tokens=True)

        if len(tokens) > effective_cap:
            if max_length is not None and effective_cap == max_length:
                tokens = tokens[:effective_cap]
            else:
                print(
                    f"Warning: Sequence length ({len(tokens)}) exceeds VRAM capacity "
                    f"({effective_cap}). Truncating sequence...",
                    file=sys.stderr,
                )
                tokens = tokens[:effective_cap]

        if cxx_backend is not None:
            cxx_backend.submit_sequence(tokens)

Notice how boring this is. Good. Boring Python = fast C++. The Python side does I/O and tokenization and that’s it. No batching logic, no packing logic, no GPU choreography. Just one single responsibility, baby.

Phase 2 — The C++ catcher’s mitt

On the C++ side, async_dispatcher.cpp catches every submit_sequence call and drops it into a thread-safe std::deque behind a mutex:

void AsyncDispatcher::submit_sequence(std::vector<int> tokens) {
    {
        std::lock_guard<std::mutex> lk(queue_mutex);
        if (!engine_started.load()) {
            // Quiet no-op before initialize_engine() — avoids crashing notebooks
            // that accidentally submit early; uncomment throw if you prefer fail-fast.
            return;
        }
        pending_queue.push_back(std::move(tokens));
    }
    cv_data.notify_one();
}

Meanwhile, a background worker thread sits in a wait() loop. When tokens show up, it wakes, grabs a batch, and runs the packer. The clever bit: it doesn’t wake up instantly, because that would mean packing one sequence at a time, which defeats the entire point of bin-packing.

Instead, it waits for either 16 sequences to accumulate or 5 ms to pass:

            // Accumulation window: if there are few pending sequences and the
            // producer is still feeding (not input_done, not shutting down),
            // wait a brief moment for more before swapping. This lets the
            // FFD packer see a real batch instead of one sequence at a time.
            // Wakes early on: reaching kPackMinBatch, shutdown, or input_done.
            if (pending_queue.size() < kPackMinBatch &&
                !stop_flag.load() && !input_done_flag.load()) {
                cv_data.wait_for(lk, kPackWaitWindow, [&] {
                    return pending_queue.size() >= kPackMinBatch ||
                           stop_flag.load() || input_done_flag.load();
                });
            }

This is the difference between “bin packing” and “bin… eh, a single-sequence bin is technically a bin.” 5 ms is shorter than a single forward pass on this hardware, so the latency cost is invisible. The density gain is up to 8×.
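
The "wait for N items or T milliseconds" pattern is not specific to C++. Below is a minimal Python sketch of the same idea using threading.Condition.wait_for, which takes a predicate and a timeout much like std::condition_variable::wait_for. The class and method names are hypothetical, and the batch size and window are parameters rather than the repo's constants.

```python
import threading
from collections import deque


class AccumulatingQueue:
    def __init__(self, min_batch=16, window_s=0.005):
        self._items = deque()
        self._cond = threading.Condition()
        self._min_batch = min_batch
        self._window_s = window_s
        self._input_done = False

    def submit(self, seq):
        with self._cond:
            self._items.append(seq)
            self._cond.notify()

    def mark_input_done(self):
        with self._cond:
            self._input_done = True
            self._cond.notify_all()

    def take_batch(self):
        with self._cond:
            # Sleep until there is something to do at all.
            self._cond.wait_for(lambda: len(self._items) > 0 or self._input_done)
            # Accumulation window: give the producer a brief chance to add more,
            # waking early once min_batch items are queued or input is finished.
            if len(self._items) < self._min_batch and not self._input_done:
                self._cond.wait_for(
                    lambda: len(self._items) >= self._min_batch or self._input_done,
                    timeout=self._window_s,
                )
            batch = list(self._items)
            self._items.clear()
            return batch
```

A consumer thread would call take_batch() in a loop and hand each batch to the packer; an empty batch after mark_input_done() signals that the stream is finished.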

The PyBind11 layer in bindings.cpp wraps everything with py::gil_scoped_release so that any C++ work that blocks or sleeps does not hold the Python GIL:

m.def(
        "get_next_bin",
        []() -> std::tuple<torch::Tensor, torch::Tensor> {
            PackedBin bin;
            {
                py::gil_scoped_release release;
                bin = engine().get_next_ready_bin();
            }
            // Re-acquired the GIL here -- creating torch::Tensor objects and
            // returning them to Python touches CPython refcounts.
            if (bin.flat_tokens.empty()) {
                auto opts = torch::TensorOptions().dtype(torch::kInt32).device(torch::kCUDA);
                return std::make_tuple(torch::empty({0}, opts), torch::empty({0}, opts));
            }
            return engine().get_memory_pool().create_zero_copy_tensors(bin);
        },
        "Block until a packed bin exists; returns (token_ids, cu_seqlens) on CUDA.");

If you have ever debugged a Python-C++ deadlock, the comments in this file will give you flashbacks. It’s definitely worth a read, particularly when you’ve had one too many coffees and are having trouble falling asleep!

Phase 3 — Tetris with rules (FFD + 16-token alignment)

This is the heart of the whole project, and it’s beautifully short. From bin_packer.cpp:

int BinPacker::align_to_tensor_core(int raw_length) {
    /*
     * NVIDIA Tensor Cores execute matrix math on fixed tile shapes; the exact
     * dimensions vary by datatype and architecture.
     * A sequence length that is not a multiple of 16 still runs, but the ragged
     * edge costs efficiency in tiling and memory coalescing.
     * We calculate the remainder and round up to the nearest multiple of 16.
     */
    int remainder = raw_length % 16;
    if (remainder == 0) {
        return raw_length;
    }
    return raw_length + (16 - remainder);
}

Three lines. That’s the “16-token Tensor Core alignment” that NVIDIA blog posts make sound like a PhD thesis. It’s roundup(n, 16). That’s it. That’s the thing.

The packing itself:

std::vector<PackedBin> BinPacker::pack_queue(std::deque<std::vector<int>>& pending_queue) {
    std::vector<PackedBin> ready_bins;
    
    if (pending_queue.empty()) {
        return ready_bins;
    }

    // Step 1: Drain the queue into a standard vector so we can sort it.
    // We use std::move to transfer memory ownership instantly without copying data.
    std::vector<std::vector<int>> sequences;
    while (!pending_queue.empty()) {
        sequences.push_back(std::move(pending_queue.front()));
        pending_queue.pop_front();
    }

    // Step 2: Sort Decreasing (Longest sequences first)
    // Packing large rocks first, then filling gaps with pebbles yields the highest density.
    std::sort(sequences.begin(), sequences.end(), 
              [](const std::vector<int>& a, const std::vector<int>& b) {
                  return a.size() > b.size(); 
              });

    // Step 3: First-Fit Packing
    for (auto& seq : sequences) {
        int raw_len = seq.size();
        int aligned_len = align_to_tensor_core(raw_len);
        
        // Edge case safety: If alignment pushes it barely over the VRAM limit, clamp it.
        if (aligned_len > max_vram_capacity) {
            aligned_len = max_vram_capacity;
            seq.resize(aligned_len, 0); 
        } else if (aligned_len > raw_len) {
            // Physically inject invisible '0' pad tokens to reach the 16-boundary
            seq.insert(seq.end(), aligned_len - raw_len, 0);
        }

        bool placed = false;

        // Try to fit the sequence into an existing open bin (First-Fit)
        for (auto& bin : ready_bins) {
            if (bin.current_token_count + aligned_len <= max_vram_capacity) {
                // It fits! Record the starting boundary in cu_seqlens
                bin.cu_seqlens.push_back(bin.current_token_count);
                
                // Append the tokens to the flat 1D array
                bin.flat_tokens.insert(bin.flat_tokens.end(), seq.begin(), seq.end());
                bin.current_token_count += aligned_len;
                
                placed = true;
                break;
            }
        }

        // If it didn't fit in ANY existing bin, we must allocate a new bin
        if (!placed) {
            PackedBin new_bin;
            new_bin.current_token_count = aligned_len;
            
            // The first sequence in a new bin always starts at index 0
            new_bin.cu_seqlens = {0}; 
            
            // Move the tokens into the flat array
            new_bin.flat_tokens = std::move(seq);
            
            ready_bins.push_back(std::move(new_bin));
        }
    }

Read that twice. Internalize it. This is the algorithm. Everything else in this repo — the C++ threads, the pinned memory, the GIL gymnastics — exists to feed this 30-line loop and to deliver its output to the GPU in a single async DMA with no extra host-side copies.

cu_seqlens is the magic decoder ring. It’s a small integer array of cumulative offsets. If your bin contains three documents of length 208, 144, and 64 (already aligned to multiples of 16), then cu_seqlens = [0, 208, 352, 416]. FlashAttention-2 reads this array and goes, “ah, I see, three sub-sequences,” and runs an exact, padding-free attention over them. No giant dense mask tensor materialized in memory. No cross-document contamination. No wasted FLOPs on padded regions.
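
Here is a small Python sketch of how such an offsets array is built and how it lets you slice the flat ribbon back into documents. It uses aligned lengths of 208, 144, and 64 tokens; both helper names are hypothetical, and plain lists stand in for tensors.

```python
from itertools import accumulate


def build_cu_seqlens(aligned_lengths):
    """Cumulative offsets: a leading 0, then the running end of each sequence."""
    return [0, *accumulate(aligned_lengths)]


def split_ribbon(flat_tokens, cu_seqlens):
    """Slice the flat 1-D ribbon back into its sub-sequences."""
    return [flat_tokens[start:end] for start, end in zip(cu_seqlens, cu_seqlens[1:])]


aligned = [208, 144, 64]
cu = build_cu_seqlens(aligned)
ribbon = list(range(sum(aligned)))  # stand-in token IDs
docs = split_ribbon(ribbon, cu)
print(cu)                      # [0, 208, 352, 416]
print([len(d) for d in docs])  # [208, 144, 64]
```

The array always has one more entry than there are sequences, and its last entry equals the total token count of the bin.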

Phase 4 — The “zero-copy” magic trick (one DMA, no extra copies)

OK, so we have a packed bin in a std::vector on the CPU. We need it on the GPU. The naive approach: memcpy into a PyTorch tensor, then .to('cuda'). This works but has two costs paid on every single bin: (1) an extra host-side copy (your vector → some intermediate tensor → GPU), and (2) on x86_64 the OS can yank your memory pages around at any time, which forces CUDA to stage transfers through a bounce buffer instead of letting the DMA engine touch your bytes directly.

The trick is pinned (page-locked) host memory, allocated via cudaHostAlloc. Pinned memory is RAM that the OS has been forbidden from swapping or moving. Because the address is stable, the GPU’s DMA engine can pull the bytes across the PCIe bus in a single asynchronous transfer, without needing the CPU to stage an intermediate copy first. The transfer itself still happens — this is not literally “zero-copy” in the strict UVA/UM sense, the host→device DMA is real — but it’s one copy instead of two, it runs async to the CPU, and it lands directly in device memory. Everyone in the inference world calls this “zero-copy” anyway, and the function in the repo is named accordingly. Pedants, please direct complaints to the comment section; we’ll address them after lunch.

From memory_pool.cpp:

// ---------------------------------------------------------
// 1. ALLOCATE PINNED MEMORY (Happens once during Phase 0)
// ---------------------------------------------------------
MemoryPool::MemoryPool(size_t max_vram_tokens) : max_capacity(max_vram_tokens) {
    
    // Allocate the token buffer. cudaHostAlloc locks this memory into physical RAM.
    cudaError_t err1 = cudaHostAlloc((void**)&pinned_token_buffer, 
                                     max_capacity * sizeof(int), 
                                     cudaHostAllocDefault);
                                     
    // Allocate the sequence length boundaries buffer.
    // +1 because cu_seqlens always has one more element than the number of sequences.
    cudaError_t err2 = cudaHostAlloc((void**)&pinned_seqlens_buffer, 
                                     (max_capacity + 1) * sizeof(int), 
                                     cudaHostAllocDefault);

    if (err1 != cudaSuccess || err2 != cudaSuccess) {
        throw std::runtime_error("[MemoryPool] Fatal: Failed to allocate pinned memory. "
                                 "Host system may be out of RAM.");
    }
    
    std::cout << "[MemoryPool] Successfully locked " 
              << (max_capacity * sizeof(int)) / 1024 
              << " KB of DMA-ready pinned memory." << std::endl;
}
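
For a feel of the sizes involved, here is a back-of-the-envelope Python sketch of the two buffers the constructor above requests. It assumes 4-byte ints (typical on Linux x86_64 with CUDA) and plugs in the ~76,500-token capacity quoted earlier for the H100; the helper name is hypothetical.

```python
INT_BYTES = 4  # assumption: sizeof(int) == 4 on the target platform


def pinned_buffer_bytes(max_capacity):
    """Bytes requested for the token buffer and the cu_seqlens buffer."""
    token_bytes = max_capacity * INT_BYTES
    seqlens_bytes = (max_capacity + 1) * INT_BYTES
    return token_bytes, seqlens_bytes


token_bytes, seqlens_bytes = pinned_buffer_bytes(76_500)
print(token_bytes // 1024, "KB token buffer")   # 298 KB token buffer
print(seqlens_bytes, "bytes for cu_seqlens")    # 306004 bytes for cu_seqlens
```

Both buffers together stay well under 1 MB, so page-locking them once up front is cheap. The cu_seqlens buffer is sized for the extreme case of one sequence per token, which is a generous upper bound.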

Then, when we want to ship a bin to the GPU, we copy our packed bin into the pinned buffer (one fast std::copy) and wrap that pinned buffer as a PyTorch tensor without allocating new tensor storage:

std::tuple<torch::Tensor, torch::Tensor> MemoryPool::create_zero_copy_tensors(const PackedBin& bin) {
    
    // Step 1: Fast C++ copy from our packing algorithm into the pinned memory block.
    // std::copy is highly optimized by the compiler at the assembly level.
    std::copy(bin.flat_tokens.begin(), bin.flat_tokens.end(), pinned_token_buffer);
    std::copy(bin.cu_seqlens.begin(), bin.cu_seqlens.end(), pinned_seqlens_buffer);

    // Step 2: The PyTorch Metadata Shell
    // torch::from_blob does NOT allocate new memory. It simply wraps our existing 
    // pinned_token_buffer pointer in a PyTorch Tensor object so Python can interact with it.
    
    auto token_opts = torch::TensorOptions().dtype(torch::kInt32).device(torch::kCPU);
    
    torch::Tensor token_tensor = torch::from_blob(
        pinned_token_buffer,                   // The raw pinned pointer
        {static_cast<int64_t>(bin.current_token_count)}, // The exact size of this specific batch
        token_opts
    );

    torch::Tensor seqlens_tensor = torch::from_blob(
        pinned_seqlens_buffer, 
        {static_cast<int64_t>(bin.cu_seqlens.size())}, 
        token_opts
    );

    // Step 3: Trigger the PCIe DMA transfer.
    // Because the underlying memory is pinned, the .to(cuda) call triggers an 
    // asynchronous DMA transfer. The CPU immediately moves on to packing the next bin 
    // while the GPU hardware silently pulls the data over the bus.
    
    torch::Tensor gpu_tokens = token_tensor.to(torch::kCUDA, /*non_blocking=*/true);
    torch::Tensor gpu_seqlens = seqlens_tensor.to(torch::kCUDA, /*non_blocking=*/true);

    return std::make_tuple(gpu_tokens, gpu_seqlens);
}

torch::from_blob is one of the most beautifully evil functions in PyTorch. It says, “I’m not going to allocate any new tensor storage. I’m going to wrap this raw pointer as a tensor view, and you are responsible for keeping the underlying memory alive.” It’s the C++ equivalent of taking a sticky note that says “TENSOR” and slapping it on an existing pile of memory. PyTorch believes it. Everyone goes home happy. (Yes, a small tensor metadata struct still gets allocated. The storage doesn’t, which is the part that costs you per-bin. HPC pedants, please re-holster the pitchforks; we are nearly through the section.)

The non_blocking=True on .to(torch::kCUDA) then kicks off an asynchronous PCIe DMA transfer and immediately returns. The CPU goes off to pack the next bin while the GPU silently slurps the previous bin across the bus. The packer and the GPU are now running in parallel. This is the part where you start hearing the GPU fan spin up like it finally has something real to do.

Phase 5 — FlashAttention-2 enjoys its zero-padding lunch

The Python side, in main_working_file.py, is now hilariously short:

# Phase 0, Step C: Initialize the C++ Background Engine
    # Locks in hardware limits and spawns background worker thread
    warpgroup_backend.initialize_engine(vram_capacity)

    # Phase 1: Ingest and Tokenize
    # Streams tokens into the C++ background queue
    ingest_and_tokenize(
        file_path, 
        tokenizer, 
        vram_capacity, 
        cxx_backend=warpgroup_backend
    )

    # Phase 4 & 5: Inference Execution
    print("\nStarting Inference Phase...")
    
    try:
        # Phase 4: Orchestration & Queue Management
        while not warpgroup_backend.is_queue_empty():
            # Retrieve the hardware-optimized bin tensors
            # This triggers the zero-copy DMA handoff from pinned memory
            bin_tensors = warpgroup_backend.get_next_bin()
            
            # Phase 5: GPU Execution (FlashAttention-2)
            with torch.no_grad():
                # The wrapper translates the 1D bin to 2D for the HF model
                output = model(bin_tensors[0], cu_seqlens=bin_tensors[1])
                
            print(f"Executed FlashAttention-2 for bin size: {bin_tensors[0].shape[0]} tokens.")

Small but important disclaimer for the careful reader: a stock HuggingFace forward() does not accept cu_seqlens directly. The model here is a thin VarlenModelWrapper from streaming_dataloader.py that unsqueezes the 1-D packed stream into the (1, N) shape HF expects and — crucially — rebuilds position_ids so every sub-sequence starts at position 0 (otherwise document B’s first token sees position len(A) and rotary embeddings go sideways). For bitwise-correct cross-document masking on a multi-sequence bin you also want the underlying HF model loaded with attn_implementation='flash_attention_2' and the attention layers wired to call flash_attn_varlen_func with the same cu_seqlens — the wrapper’s own docstring spells out exactly why and how. FA-2 is what actually performs the variable-length attention; WarpGroup’s contribution is feeding it densely and on-time.
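
The position_ids rebuild is easy to picture with plain lists. The sketch below is a hypothetical helper, not the repo's VarlenModelWrapper (which may do this differently and works on tensors); it only shows the idea that positions restart at every boundary in cu_seqlens.

```python
def rebuild_position_ids(cu_seqlens):
    """Positions restart at 0 at every sequence boundary in the packed ribbon."""
    position_ids = []
    for start, end in zip(cu_seqlens, cu_seqlens[1:]):
        position_ids.extend(range(end - start))
    return position_ids


# Two packed documents of 3 and 2 tokens.
print(rebuild_position_ids([0, 3, 5]))  # [0, 1, 2, 0, 1]
```

Without this step the second document's tokens would be numbered 3 and 4, which is exactly the rotary-embedding drift described above.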

A few lines of business logic. That’s all that’s left. Everything else is happening in the C++ engine while Python sips coffee.

5. The receipts (i.e., the numbers)

Time to humiliate the baseline. All numbers from the repo’s README.

Quick note on benchmarking methodology, before anyone reaches for the rocks: every comparison below runs the same model checkpoint, the same tokenizer, the same input corpus, and the same dtype (bf16) on the same GPU at default clocks. The “baseline (HuggingFace)” path is HF’s stock padded-batch pipeline with attn_implementation="flash_attention_2" — i.e., it’s already using FA-2, not a deliberately handicapped naïve loop. The optimized path uses the same FA-2 kernel. The only axis of difference is how sequences are batched (FFD packing into a VRAM-aware 1-D bin vs. padding to a rectangular batch_size × longest_sequence tensor). Workload type is prefill-style document evaluation, not autoregressive streaming decode — that distinction matters for the next subsection. Repro scripts are in example_runs/ if you’d like to argue with the numbers.

Stress test: H100, Qwen2.5-7B, 400 mixed-length PDFs

The dataset deliberately interleaves tiny documents (45–130 words) with massive ones (1820–2000 words) — basically the worst case for padded batching.

Metric	Baseline (HF)	WarpGroup	Improvement
Padding overhead	48.41%	0.55%	47.9 pp absolute reduction
Throughput	14,713 tok/s	30,672 tok/s	2.08× higher
Peak VRAM	19.88 GB	16.50 GB	17% lower (3.38 GB saved)
Dynamic VRAM (est.)*	~5.38 GB	~2.00 GB	~62% lower dynamic memory
Wall clock	28.69 s	13.76 s	2.08× faster

Translation: the baseline spent half its tokens padding zeros. Half. Imagine ordering a pizza and 4 of the 8 slices are just cardboard.

Production scaling: same hardware, uniform 50–1900 word docs

Metric	Baseline (HF)	WarpGroup	Improvement
Padding overhead	36.20%	0.67%	35.5 pp absolute reduction
Throughput	18,047 tok/s	30,700 tok/s	1.70× higher
Wall clock	17.86 s	10.50 s	1.70× faster

Even on a “nice,” uniformly-distributed dataset (i.e., the only kind that benchmark blog posts ever use), WarpGroup still wins by 70%, because padding overhead exists even when your distribution is well-behaved.

Entry-level hardware: GTX 1080 (8 GB), SmolLM2-360M

Metric	Baseline (HF)	WarpGroup	Improvement
Padding overhead	41.13%	0.00%	Baseline padding eliminated
Throughput	405 tok/s	2,387 tok/s	5.89× higher
Peak VRAM	2.85 GB	1.86 GB	35% lower

This one is my favorite. Why? Because the smaller your hardware, the worse padding hurts. A GTX 1080 going 5.89× faster means small startups, hobbyists, university labs, and anyone running on a single consumer card just got a free hardware upgrade. The same 1080 that was “barely enough” is now “actually pretty good.”

Bonus round: not crashing

Remove the MAX_LEN cap entirely, and the baseline does this:

torch.OutOfMemoryError: Tried to allocate 30.00 GiB

Because it tried to make a (batch_size × longest_sequence) rectangle and the longest sequence was, uh, big.

WarpGroup: completes successfully, peak VRAM 3.60 GB. Because the Phase-0 autotune locks a strict hardware-aligned input budget, and the packer refuses to admit any bin that exceeds it. Allocator fragmentation and kernel scratch workspaces can still surprise you in theory; in practice, the pathological padding-driven OOMs that fixed-shape batching trips on simply stop happening. No 3 AM Slack messages from your on-call cousin. Just sequences, packed, executed, done.

“OK, but how is this different from vLLM / paged attention / continuous batching?”

Reasonable question, and worth answering directly because the inference-infra world has a lot of overlapping primitives and an HPC reader will ask this in the first comment.

vLLM / continuous batching is optimized for decode-time serving: many concurrent requests at different generation steps, schedule the next token across them, keep the GPU saturated under streaming load. Its headline primitive is paged attention — a KV-cache memory manager that pages physically non-contiguous blocks like an OS pagetable.
TensorRT-LLM, SGLang, FlashInfer all support variants of varlen attention. Their packing logic typically lives inside a serving runtime tuned for live, latency-sensitive request streams.
WarpGroup-Backend targets the other half of the workload spectrum: offline / high-throughput, prefill-style jobs. Document evaluation, RAG indexing, batched embedding extraction, batched OAM-log summarization, bulk classification, eval harnesses. The unit of work is a finite corpus of variable-length sequences, not a streaming firehose of decode requests. The focus is host-side packing density, GIL-free async dispatch, and a tight pinned-memory handoff — not KV-cache paging.

Think of it this way: vLLM is a restaurant manager seating arriving diners across tables in real time. WarpGroup is a catering operation packing the day’s box lunches into delivery vans before they leave the depot. Different problems, complementary primitives, frequently co-deployable in the same building.
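For readers who have not met "varlen" packing before, the host-side half fits in a few lines. A minimal sketch, assuming plain Python lists of token ids: the function name `pack_varlen` is hypothetical, while `cu_seqlens` and `max_seqlen` follow the naming FlashAttention's varlen interface uses for cumulative sequence boundaries.

```python
from itertools import accumulate


def pack_varlen(seqs: list[list[int]]) -> tuple[list[int], list[int], int]:
    """Concatenate token-id lists and record where each sequence starts and ends."""
    flat = [tok for seq in seqs for tok in seq]               # one long row, no padding
    cu_seqlens = [0, *accumulate(len(seq) for seq in seqs)]   # boundaries, length = n + 1
    max_seqlen = max((len(seq) for seq in seqs), default=0)
    return flat, cu_seqlens, max_seqlen


if __name__ == "__main__":
    flat, cu_seqlens, max_seqlen = pack_varlen([[11, 12, 13], [21], [31, 32]])
    print(flat, cu_seqlens, max_seqlen)
```

Sequence `i` occupies `flat[cu_seqlens[i]:cu_seqlens[i + 1]]`, which is how the attention kernel knows not to let one document attend to its neighbor in the same bin. In a real pipeline these would become integer tensors before they reach the GPU.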

6. So… how do I actually try it?

The repo ships with a one-shot reproducer:

# 1. Clone
git clone https://github.com/AnubhabBanerjee/WarpGroup-backend
cd WarpGroup-backend
# 2. Python env + deps
python3.12 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
pip install -e .
# 3. Compile the C++ backend
mkdir build && cd build
cmake ..
make -j4
cp warpgroup_backend*.so ..
cd ..
# 4. Smoke test
python3 main_working_file.py

If you want the proper benchmark side-by-side (baseline HF vs. WarpGroup), the example_runs/ folder has a scripted end-to-end runner that generates a synthetic PDF corpus, runs both stacks, and writes JSON results + bar charts.

You will need:

Linux, CUDA toolkit, an NVIDIA GPU (consumer or datacenter, both work).
A PyTorch build with CUDA support (don’t ship the CPU-only one and then act surprised when nothing accelerates).
An LLM that supports flash_attention_2 (Qwen2.5, Llama-3, Mistral, SmolLM2, etc.).
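Before burning ten minutes on a build, it can help to check that list mechanically. The sketch below is not part of the repo, and `check_prerequisites` is a name I made up. It uses only the standard library, so it runs even when PyTorch is missing, and it assumes the usual import names `torch` and `flash_attn`.

```python
import importlib.util
import platform
import shutil


def check_prerequisites() -> dict[str, bool]:
    """Report which of the listed requirements look satisfied on this machine."""
    return {
        "linux": platform.system() == "Linux",
        "nvcc_on_path": shutil.which("nvcc") is not None,          # CUDA toolkit
        "nvidia_smi_on_path": shutil.which("nvidia-smi") is not None,  # driver tools
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "flash_attn_installed": importlib.util.find_spec("flash_attn") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{'ok     ' if ok else 'MISSING'}  {name}")
```

This only checks that the pieces are present. Whether the installed PyTorch is a CUDA build is a separate question, which `torch.cuda.is_available()` answers once the import works.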

7. Plot twist — this is just MAC scheduling in a CUDA costume

I should probably confess at this point: I’m not a “GPU person” by training. I came up through telecom — 5G NR, with a foot creeping firmly into 6G research — and I started looking at LLM inference infrastructure because every problem in this codebase felt unsettlingly familiar.

Look at this side-by-side and tell me with a straight face these are different problems:

5G NR MAC scheduler (at the gNB)	WarpGroup-Backend (at the GPU)
Variable-size MAC SDUs from each UE / logical channel	Variable-size token sequences from each document
Pack into a Transport Block every TTI	Pack into a VRAM bin every dispatch cycle
TB size bounded by available PRBs × MCS bits	Bin size bounded by empirical VRAM budget
Must align to LDPC code block segmentation (TS 38.212 §5.2.2)	Must align to 16-token Tensor Core tiles
Logical Channel Prioritization (LCP) picks what goes in	First-Fit Decreasing picks what goes in
Hard deadline: one slot (0.5 ms at numerology μ=1)	Soft deadline: keep the GPU fed
Skip a TB → PDSCH throughput craters	Skip a bin → GPU sits idle, throughput craters

If you have ever read 3GPP TS 38.321, you are staring at the same algorithm. The MAC scheduler at the base station has been packing variable-size SDUs into Transport Blocks — sized to fit a fixed PRB grid, aligned to LDPC code-block thresholds, prioritized across logical channels — since LTE-Advanced. The only things that change in WarpGroup are the units (tokens, not bits), the budget (VRAM, not PRBs), and the alignment quantum (16-token tiles, not LDPC code-block sizes).

Even Phase 0 has a telecom doppelgänger. The repo probes the GPU with synthetic sequences until it OOMs, then backs off 10%. The RAN does the equivalent every TTI: it watches CQI / SINR reports, picks an MCS, watches BLER, then backs off. Both are saying the same thing — the spec gives you a theoretical maximum, but the only honest number is the one you measure under live conditions.
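The probe-then-back-off loop is simple enough to sketch. This is a guess at the shape of Phase 0, not the repo's code: `autotune_budget` and the `fits` callback are hypothetical, the 1.5× growth step is arbitrary, and backing off from the last successful probe (rather than from the failing one) is my assumption. Only the 10% figure comes from the article.

```python
from typing import Callable

TILE = 16  # keep the final budget on a 16-token boundary


def autotune_budget(
    fits: Callable[[int], bool],
    start: int = 1024,
    growth_pct: int = 150,
    backoff_pct: int = 10,
    ceiling: int = 1 << 24,
) -> int:
    """Grow the probe until it fails, then back off from the last success."""
    last_ok = 0
    tokens = start
    while tokens <= ceiling and fits(tokens):
        last_ok = tokens
        tokens = tokens * growth_pct // 100       # next, larger probe
    if last_ok == 0:
        raise RuntimeError("even the smallest probe did not fit")
    safe = last_ok * (100 - backoff_pct) // 100   # leave headroom below the cliff
    return (safe // TILE) * TILE                  # align down to a whole tile


if __name__ == "__main__":
    # Stand-in for a real GPU probe: pretend anything above 10,000 tokens OOMs.
    print(autotune_budget(lambda n: n <= 10_000))
```

On real hardware, `fits` would run a forward pass on a synthetic sequence of that many tokens and return `False` when it catches the out-of-memory error. The telecom reading is the same: probe, observe the failure, then operate one notch below it.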

A quick aside to two very different audiences

To my HPC and CUDA-first friends reading this: I know. You’ve been doing exactly this since the first GPGPU papers landed in 2003. Bin packing is a freshman algorithms class, cudaHostAlloc is in every CUDA tutorial, and pinned memory is — for you — basically a personality trait. None of this is news. Please put the pitchforks down.

But it is news for telecom engineers, and that’s half the reason this article exists. For twenty years our world was FPGAs, ASICs, and PRBs. We optimized spectrum, not silicon. Ask the average RAN engineer to explain Tensor Core tile alignment and you’ll get a polite stare. Then AI-RAN, NWDAF, NVIDIA Aerial, SoftBank AITRAS, the AI-RAN Alliance, and the 3GPP Rel-20 study items all happened inside roughly the same eighteen months, and the next decade of telecom careers now demands being bilingual between spectrum-world and GPU-world — with a lot of us starting from approximately zero on the GPU half. If the term “pinned memory” sounded like a foreign language until ten minutes ago: welcome, you’re not behind, you’re early. The intuition translates cleanly anyway. You already know how to pack variable-size payloads into a fixed-budget window under hardware-alignment constraints. You just used to call it MAC scheduling. Same animal, new zoo.

Consider this article a half-step on that road.

Why a working telecom engineer should care right now

This isn’t an abstract analogy. Four concrete reasons it lands in 2026:

6G is officially AI-native. ITU-R IMT-2030, 3GPP Rel-20 study items, the AI-RAN Alliance, O-RAN’s AI/ML working groups — they all assume LLMs and large ML models live inside the network, not bolted onto the OSS/BSS later. Beam management, RIC xApps/rApps, NWDAF analytics, intent-based configuration, agentic OAM — each of these is a candidate workload for an inference stack that does exactly what WarpGroup does.
Voice is already a token stream. Neural audio codecs (EnCodec, SoundStream, Mimi) tokenize speech at roughly 12.5–75 Hz. Voice-LLMs like Moshi, AudioPaLM, and Spirit-LM consume those tokens directly. A call center handling 10,000 concurrent calls is, computationally, a thundering herd of variable-length token streams arriving asynchronously from heterogeneous sources, which is exactly the workload this repo’s stress test simulates. Replace “400 PDFs” with “400 RTP streams” and the math is identical: same skew, same padding tax, same OOM cliff.
MEC has tiny GPUs. Multi-access Edge Compute nodes at the gNB / UPF tier don’t get to play with 8× H100 racks. They get one L4, maybe an A10, maybe a single H100 if procurement was in a good mood. Squeezing 2–6× more throughput out of a single edge GPU is the difference between “AI features available at the edge” and “AI features only at the core DC, plus 30 ms of extra round-trip.” That gap is exactly where URLLC-class applications live or die.
OAM telemetry is the boring killer app. PM counters, syslog events, CDRs, NetFlow / IPFIX records, NF traces — these are variable-length streams of structured-ish text that benefit massively from LLM-based summarization, anomaly detection, and intent translation. They also have brutal length variance: a normal call trace is a few hundred tokens; a single 5G handover-failure trace can run past 50,000. Padded batching on that distribution will make your inference cluster cry, then your operations director cry, then your CFO cry.
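The padding tax on that kind of distribution is easy to put a number on. A minimal sketch, assuming "padding overhead" means the share of a padded (batch × longest) rectangle that holds padding rather than real tokens. The batch below is invented to match the orders of magnitude quoted above, not measured data.

```python
def padding_overhead(lengths: list[int]) -> float:
    """Fraction of a padded (batch x longest) rectangle that holds padding."""
    if not lengths:
        raise ValueError("need at least one sequence")
    total_slots = len(lengths) * max(lengths)   # what fixed-shape batching allocates
    return 1.0 - sum(lengths) / total_slots     # share of slots that are padding


if __name__ == "__main__":
    # Illustrative batch: seven ordinary traces plus one long handover-failure trace.
    batch = [300] * 7 + [50_000]
    print(f"{padding_overhead(batch):.1%} of the padded batch is zeros")
```

One long trace in a batch of eight is enough to push the overhead to roughly 87%, which is why the length skew matters more than the average length.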

So when I see a codebase that does VRAM-aware FFD packing with hardware-tile alignment and a single-DMA pinned-memory handoff, I don’t see “GPU optimization.” I see the inference-side analog of a MAC scheduler. The reason I’m spending evenings on this isn’t a career pivot — it’s the same job, on different silicon, for the next generation of telecom workloads that will live half in the spectrum and half in the GPU.

Also, frankly, after a decade of reading 3GPP specs, a codebase whose entire scheduler you can git clone in 30 seconds is a vacation.

8. The moral, if you came here for one

There are three takeaways that I think generalize beyond this specific repo:

1. Default batching is a polite lie. “batch_size = 8” tells you nothing about how full your GPU is. The correct unit is tokens in VRAM, and you have to measure it because no library will tell you the truth. The day you start thinking in tokens-per-bin instead of items-per-batch is the day your throughput graph stops embarrassing you.

2. The interesting performance work is at the boundaries. The expensive parts of an LLM pipeline are not the matrix multiplies — those have been hand-optimized by NVIDIA engineers with mortgages riding on it. The expensive parts are the transitions: tokens-to-tensors, host-to-device, Python-to-C++, scheduler-to-GPU. Almost every “Wait, I made it 2× faster” story in modern ML is a boundary story.

3. Sometimes the right answer is “write the C++.” Not all of it. Not even most of it. Python can absolutely coordinate high-performance inference systems — vLLM proves it every day. But moving the hot-path packing loop into a background C++ thread sheds two specific Python-side costs that bite under load: GIL contention with the ingest thread, and interpreter + allocator overhead in a tight scheduling loop where every microsecond is part of the budget. The right Python / C++ ratio for high-throughput inference infra isn’t 100/0 or 0/100 — it’s a thin, well-defined PyBind11 boundary with the latency-critical scheduler on the C++ side and all the interesting stuff (model code, business logic, glue) on the Python side. WarpGroup is ~64% Python, ~30% C++, ~5% build glue. That ratio is not an accident.

9. Where this goes next

The repo’s roadmap hints at the obvious next move: multi-GPU sharding. Extend the dispatcher to manage multiple C++ queues and distribute dynamically sized bins across local interconnects (NVLink / PCIe). The hard part isn’t the C++ — it’s deciding how to balance bin sizes across devices when sequence-length distributions are skewed. (Pull requests welcome, etc.)
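One plausible starting point for the balancing question is the classic greedy heuristic: hand the largest remaining bin to whichever device currently has the least work. The sketch below is my own illustration, not anything from the roadmap. `assign_bins` is a hypothetical name, and using token count as the cost of a bin is a simplifying assumption.

```python
import heapq


def assign_bins(bin_tokens: list[int], num_devices: int) -> list[list[int]]:
    """Greedy balancing: give the largest remaining bin to the least-loaded device."""
    if num_devices < 1:
        raise ValueError("need at least one device")
    heap = [(0, d) for d in range(num_devices)]       # (tokens assigned, device id)
    plan: list[list[int]] = [[] for _ in range(num_devices)]
    largest_first = sorted(range(len(bin_tokens)), key=lambda i: bin_tokens[i], reverse=True)
    for idx in largest_first:
        load, dev = heapq.heappop(heap)               # least-loaded device so far
        plan[dev].append(idx)
        heapq.heappush(heap, (load + bin_tokens[idx], dev))
    return plan


if __name__ == "__main__":
    print(assign_bins([2000, 2000, 448, 1200, 900], num_devices=2))
```

Token count is a crude proxy, since a bin holding one long sequence costs more attention compute than a bin of many short ones at the same total. That mismatch is a fair illustration of why the balancing, not the C++, is the hard part.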

If you want to nerd out further, the parts I’d love to see explored:

Adaptive kPackWaitWindow — that 5 ms accumulation window is a hand-tuned constant. It probably wants to be a function of observed producer rate.
Speculative bin reservation — pre-pin a second bin’s worth of host memory so we can pack the next batch while the current one is still on the wire.
Continuous batching for generation — right now this is a prefill/eval pipeline. Hooking it up to streaming decode for a chat server would be the natural extension.
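As a conversation starter for the first item, here is one way the accumulation window could follow the producer rate. It is a Python sketch of a policy, not a patch for the C++ constant. `adaptive_wait_ms`, its bounds, and the "wait about as long as it takes to fill one bin" rule are all my assumptions.

```python
def adaptive_wait_ms(
    tokens_per_ms: float,
    bin_budget: int,
    min_ms: float = 1.0,
    max_ms: float = 20.0,
) -> float:
    """Wait roughly as long as the producer needs to fill one bin, within bounds."""
    if tokens_per_ms <= 0:
        return max_ms                               # nothing arriving: longest allowed wait
    time_to_fill = bin_budget / tokens_per_ms       # ms until one bin's worth has arrived
    return max(min_ms, min(max_ms, time_to_fill))   # clamp to [min_ms, max_ms]


if __name__ == "__main__":
    for rate in (10, 1024, 100_000):
        print(rate, adaptive_wait_ms(rate, bin_budget=8192))
```

A fast producer shrinks the window toward the floor so full bins ship immediately, and a slow one stretches it toward the ceiling so the packer is not dispatching half-empty bins. The observed rate would need smoothing in practice, for example with a moving average, which this sketch leaves out.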

10. Wrap

Padding is the silent tax on every LLM workload that touches variable-length text. WarpGroup-Backend’s contribution isn’t a new algorithm — FFD bin packing has existed since the 1970s — it’s the engineering integration: empirical VRAM autotuning, GIL-free async dispatch, 16-token alignment for the downstream kernels, and a single-DMA pinned-memory handoff into FlashAttention-2’s varlen kernel, all glued together so that a single python3 main_working_file.py produces 2× to 6× throughput on real hardware.

If you build LLM inference infrastructure for a living, clone the repo, read the C++ files (they have generous comments and the occasional dry joke), and consider how many of your own pipelines are currently paying the padding tax.

If you build telecom systems for a living and you suspect the next decade of your job is going to involve a lot more inference servers than you originally signed up for — same advice. The MAC scheduler in your gNB and the bin packer in this repo are reading from the same playbook.

If you’re a beginner who just wanted to understand why GPUs hate variable-length text — congratulations, you now know more than 80% of people building this stuff for a living. Go forth and stop padding things.

About the repo

If you enjoyed this, the kindest things you can do are: ⭐ the repo, share this post, and tell one PyTorch user in your life that their batch_size is lying to them.

Now go yell at your GPU. Lovingly.
