Meta and Stanford Researchers Propose Fast Byte Latent Transformer That Reduces Inference Memory Bandwidth by Over 50% Without Tokenization

A team of researchers from Meta, Stanford University, and the University of Washington has introduced three new methods that greatly accelerate inference in the Byte Latent Transformer (BLT), a language model architecture that works directly on raw bytes instead of tokens.

Why Byte-Level Models Are Slow at Inference

To understand what this new research solves, you need to understand the core tradeoff in byte-level language modeling.

Most language models in use today operate on tokens: text fragments produced by subword tokenizers such as byte-pair encoding (BPE). A token usually represents a few characters or a whole word. While this works well, tokenization comes with some well-known disadvantages: sensitivity to input noise, poor handling of multilingual text, weak character-level awareness, and brittleness on structured input such as code and numbers.

Byte-level models sidestep all of this by working directly on raw bytes, the lowest-level representation of text. The Byte Latent Transformer (BLT) was a major step forward: it matched the performance of token-based models at scale by dynamically grouping bytes into variable-length patches using an entropy-based patching strategy. High-entropy (hard to predict) regions get shorter patches; predictable spans get longer ones. The architecture has three components: a local encoder, a large global Transformer that operates on latent patch representations rather than raw bytes, and a local decoder, with an average patch size of 4 bytes and a maximum of 8.
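To make the patching idea concrete, here is a minimal sketch of entropy-based patching. It assumes a helper `next_byte_entropy` (a small byte-level model scoring how hard the next byte is to predict) and an illustrative threshold; both are placeholders, not the paper's exact procedure.

```python
from typing import Callable, List

def entropy_patches(
    byte_seq: bytes,
    next_byte_entropy: Callable[[bytes, int], float],
    threshold: float = 2.0,   # illustrative entropy threshold (bits)
    max_patch: int = 8,       # BLT caps patches at 8 bytes
) -> List[bytes]:
    """Group raw bytes into variable-length patches: start a new patch
    when the predicted next-byte entropy spikes above the threshold
    (hard-to-predict region) or the current patch hits max_patch."""
    patches: List[bytes] = []
    start = 0
    for i in range(1, len(byte_seq)):
        if next_byte_entropy(byte_seq, i) > threshold or (i - start) >= max_patch:
            patches.append(byte_seq[start:i])
            start = i
    if start < len(byte_seq):
        patches.append(byte_seq[start:])
    return patches
```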

The remaining problem is inference speed. Even with BLT's hierarchical design, the local decoder still produces one byte at a time autoregressively. Since a typical subword token corresponds to several bytes, BLT needs many more decoder forward passes to produce the same amount of text a token-level model produces in one step. In modern LLM inference the bottleneck is often not compute but memory bandwidth: model weights and key-value caches must be loaded from memory again and again. More decoder passes mean more memory traffic, which translates directly into slower generation.

Three Methods, One Goal: Fewer Forward Passes

The research team presents three strategies that attack this problem, each trading off speed against generation quality differently.

BLT Diffusion (BLT-D)

BLT-D is the core contribution. The main idea is to replace autoregressive byte-by-byte decoding with block-wise discrete diffusion in the local decoder.

During training, the decoder receives two inputs: a clean byte sequence (the original text) and a corrupted sequence divided into fixed-length byte blocks. For each block, a diffusion timestep t is sampled from U(0,1), and each byte in the block is independently replaced with a [MASK] symbol with probability t. The masking level therefore varies across training examples: a low t leaves most bytes visible, while a high t masks most of them. The block size B (set to 4, 8, or 16 bytes in the experiments) typically extends beyond BLT's average patch size of 4 bytes, teaching the decoder to predict bytes further into the future than usual. The total training loss combines the standard next-byte prediction loss on the clean sequence with a masked-byte prediction loss on the corrupted blocks, conceptually similar to BERT's masked language modeling but implemented at the byte level inside the hierarchical BLT architecture.
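The corruption step described above can be sketched as follows, assuming PyTorch, a 1-D tensor of byte IDs, and a hypothetical `MASK_ID` symbol outside the 0-255 byte range; this is an illustration of the masking scheme, not the authors' code.

```python
import torch

MASK_ID = 256  # hypothetical mask symbol outside the 0-255 byte range

def corrupt_blocks(byte_ids: torch.Tensor, block_size: int = 8) -> torch.Tensor:
    """Block-wise corruption: for each fixed-length block, sample one
    timestep t ~ U(0, 1) and mask each byte in the block independently
    with probability t. byte_ids is a 1-D LongTensor of values 0-255."""
    corrupted = byte_ids.clone()
    for start in range(0, len(byte_ids), block_size):
        end = min(start + block_size, len(byte_ids))
        t = torch.rand(())                   # per-block diffusion timestep
        mask = torch.rand(end - start) < t   # per-byte masking decision
        corrupted[start:end][mask] = MASK_ID
    return corrupted

# Training then combines the usual next-byte loss on the clean sequence
# with a masked-byte prediction loss on the corrupted blocks.
```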

At inference time, BLT-D initializes a block of [MASK] positions and iteratively unmasks multiple byte positions per decoder step using one of two strategies: confidence-based unmasking (reveal positions whose predicted probability exceeds a threshold α) or entropy-bounded (EB) sampling (select the largest subset of positions whose cumulative entropy stays below a threshold γ). Both strategies yield more than one byte per forward pass. The local encoder and global model, the expensive parts of BLT, are invoked once per block rather than once per patch, further cutting the number of model calls. BLT-D also supports KV caching and benefits from any technique that reduces the KV-cache memory footprint.
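The two unmasking rules can be sketched like this, assuming `probs` holds the decoder's predicted byte distributions for the currently masked positions of a block; the thresholds α and γ here are arbitrary illustrative values.

```python
import torch

def confidence_unmask(probs: torch.Tensor, alpha: float = 0.9) -> torch.Tensor:
    """Reveal every still-masked position whose most likely byte has
    probability above alpha. probs: (num_masked, vocab) distributions."""
    return probs.max(dim=-1).values > alpha

def entropy_bounded_unmask(probs: torch.Tensor, gamma: float = 1.5) -> torch.Tensor:
    """Reveal the largest set of positions (most certain first) whose
    cumulative entropy stays below gamma. A real implementation would
    force at least one reveal per step to guarantee progress."""
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(dim=-1)
    order = entropy.argsort()                     # most certain positions first
    keep = entropy[order].cumsum(dim=0) <= gamma  # cumulative entropy budget
    reveal = torch.zeros_like(keep)
    reveal[order] = keep
    return reveal
```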

At the 3B-parameter scale, BLT-D-4 (block size 4) nearly matches baseline BLT benchmark scores while requiring less than half the memory bandwidth. BLT-D-16 (block size 16) achieves an 87-92% reduction in estimated memory-bandwidth cost compared to BLT, making it the fastest configuration tested, though its pass@1 scores on the coding benchmarks (HumanEval, MBPP) drop noticeably.

BLT Self-Speculation (BLT-S)

BLT-S takes a different route, drawing on speculative decoding, a technique in which a cheap draft model proposes tokens and a larger model verifies them in parallel. What makes BLT-S unusual is that it needs no separate draft model, no architectural changes, and no additional training: it simply reuses the existing lightweight BLT local decoder as the drafter.

In standard BLT inference, the decoder stops generating whenever an entropy-based patcher decides a new patch boundary has been reached, typically every four bytes. BLT-S instead lets the local decoder keep drafting autoregressively up to a fixed window of k bytes (8 or 16 in the experiments) regardless of entropy spikes, conditioning on the last available latent patch representation. After drafting k bytes, the full model re-encodes the candidate sequence through the encoder, global model, and decoder and produces verified next-byte predictions. Drafted bytes are accepted up to the first mismatch; the first mismatching byte is replaced by the verified prediction.
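A minimal sketch of one BLT-S step under greedy decoding follows; `draft_next_byte` (cheap local-decoder drafting) and `verify_greedy` (the full encoder, global model, and decoder pass returning a greedy prediction for each drafted position) are hypothetical helpers used only for illustration.

```python
def speculative_step(prefix, draft_next_byte, verify_greedy, k=16):
    """One BLT-S step: draft up to k bytes with the cheap local decoder,
    verify them with a single full encoder -> global model -> decoder
    pass, and accept drafted bytes up to the first mismatch."""
    draft = []
    for _ in range(k):
        draft.append(draft_next_byte(prefix + draft))   # cheap draft passes

    verified = verify_greedy(prefix, draft)  # one greedy prediction per draft position

    accepted = []
    for d, v in zip(draft, verified):
        if d != v:
            accepted.append(v)   # replace the first mismatching byte, stop
            break
        accepted.append(d)
    return prefix + accepted
```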

Under greedy decoding, this procedure guarantees output identical to standard autoregressive BLT decoding, so there is no quality loss. BLT-S slightly increases the number of decoder forward passes but sharply reduces encoder and global-model calls. At 3B parameters with k=16, BLT-S achieves a memory-bandwidth reduction of up to 77% with no loss in quality.

BLT Diffusion+Verification (BLT-DV)

BLT-DV sits in the middle. Because BLT-D is trained on both the diffusion objective and the standard next-byte prediction objective, the same model weights can be run autoregressively by applying a causal mask to the decoder; no separate model and no extra training are required. BLT-DV exploits this: diffusion drafts a block of bytes first, then a single autoregressive forward pass verifies the draft, accepting bytes up to the first mismatch. In the experiments, single-step diffusion combined with verification yields the fastest BLT-DV configuration. Single-step diffusion on its own usually degrades generation quality sharply, but the verification step effectively prevents this. At 3B parameters, BLT-DV achieves a memory-bandwidth reduction of up to 81% compared to BLT.
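Reusing the acceptance logic from the BLT-S sketch above, one BLT-DV step could look roughly like this; `diffusion_draft_block` (a single diffusion pass that fills a whole block) and `verify_greedy` are hypothetical helpers, not the paper's API.

```python
def blt_dv_step(prefix, diffusion_draft_block, verify_greedy, block_size=8):
    """One BLT-DV step: draft a whole block with a single diffusion pass,
    then verify it with one causal forward pass of the same weights,
    accepting bytes up to the first mismatch (as in the BLT-S sketch)."""
    draft = diffusion_draft_block(prefix, block_size)  # one diffusion call, whole block
    verified = verify_greedy(prefix, draft)            # one autoregressive pass, causal mask

    accepted = []
    for d, v in zip(draft, verified):
        if d != v:
            accepted.append(v)   # first mismatch: keep the verified byte, stop
            break
        accepted.append(d)
    return prefix + accepted
```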

Understanding the Numbers

All models were trained on the BLT-1T dataset (1 trillion tokens from public sources including a Datacomp-LM subset), with 1B-parameter models trained for 240,000 steps and 3B-parameter models for 480,000 steps. The evaluation covers four generation tasks: French-to-English and German-to-English translation on the FLORES-101 benchmark (4-shot, SentencePiece BLEU) and two coding benchmarks, HumanEval (0-shot, pass@1) and MBPP (3-shot, pass@1).

Beyond the generation tasks, the research team also evaluates BLT-D on five likelihood-based benchmarks: ARC-Easy, ARC-Challenge, PIQA, HellaSwag, and MMLU. Because BLT-D is trained with the next-byte prediction objective alongside the diffusion objective, it can compute likelihoods autoregressively by applying a causal mask to the decoder, the same trick BLT-DV's verification step relies on. The results show that the BLT-D variants score close to baseline BLT on all five benchmarks, confirming that adding block-wise diffusion does not impair the model's autoregressive reasoning ability.

Efficiency is reported with three proxy metrics: decoder forward passes (NFEs), encoder/global-model forward passes, and an estimated memory-bandwidth cost in gigabytes derived from parameter counts and forward-pass counts under 16-bit precision. The research team is explicit that these are proxies: turning the NFE reductions into real wall-clock speedups requires a carefully optimized implementation, which they flag as an important direction for future work.
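As a rough illustration of how such a proxy can be computed, the sketch below multiplies each component's parameter count by its number of forward passes at 2 bytes per parameter (16-bit). The parameter split is invented for illustration and KV-cache traffic is ignored, so this is not the paper's exact accounting.

```python
def estimated_bandwidth_gb(
    decoder_nfes: int,
    global_nfes: int,
    encoder_nfes: int,
    decoder_params: float = 0.2e9,   # invented split of a ~3B-parameter model
    global_params: float = 2.5e9,
    encoder_params: float = 0.3e9,
    bytes_per_param: int = 2,        # 16-bit weights
) -> float:
    """Rough proxy: each forward pass re-reads that component's weights,
    so estimated traffic scales with params x passes x bytes-per-param.
    KV-cache reads are ignored in this simplified sketch."""
    total_bytes = bytes_per_param * (
        decoder_params * decoder_nfes
        + global_params * global_nfes
        + encoder_params * encoder_nfes
    )
    return total_bytes / 1e9
```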

Translation tasks benefit strongly from BLT-D at all block sizes. The coding tasks are more sensitive to block size: BLT-D-16 offers the largest efficiency gains but shows a noticeable score drop on HumanEval and MBPP. Another notable finding comes from the generation-diversity analysis: with entropy-bounded sampling and top-p sampling at inference, more decoder NFEs correlate with a higher type-token ratio (a measure of word diversity). In other words, the efficiency-diversity tradeoff can be adjusted at inference time without retraining.

Key Takeaways

  • BLT-D introduces discrete block-wise diffusion in the BLT local decoder, training with a combined next-byte prediction and masked-byte prediction loss so the model can generate multiple bytes per forward pass instead of one at a time.
  • BLT-S reuses the lightweight BLT local decoder as a self-speculative drafter (no separate draft model, no architecture changes, no additional training) and produces the same output as standard BLT under greedy decoding.
  • BLT-DV combines diffusion drafting with an autoregressive verification step using the same BLT-D model weights, recovering the quality lost by single-step diffusion alone, without any additional training.
  • All three methods achieve estimated memory-bandwidth costs more than 50% lower than baseline BLT on the generation tasks; BLT-D-16 reaches an 87–92% reduction.
  • BLT-D's likelihood-based reasoning remains robust on ARC-Easy, ARC-Challenge, PIQA, HellaSwag, and MMLU, and its generation diversity can be tuned at inference time via the entropy-bounded sampling threshold.

Check out the Paper.
