Pretraining Hierarchical Memories: Distinguishing the Long Tail and General Knowledge

The dramatic performance gains of modern language models currently come from scaling parameters: larger models store more world knowledge and reason better. Yet compressing all world knowledge into parameters is unnecessary, since only a small fraction is relevant to any given prompt, and it is infeasible for edge devices with limited memory and inference-time compute. We address this shortcoming with a memory architecture and pre-training strategy aligned with existing hardware paradigms. We introduce small language models that access large hierarchical parametric memory banks encoding world knowledge. During pre-training and inference, a small, context-dependent block of memory is fetched and added to the model. Our pre-training teaches the model to store long-tail world knowledge in the memory parameters, while the small language model acts as an anchor capturing common knowledge and general reasoning abilities. In trillion-token-scale experiments, we show significant gains: a 160M-parameter model augmented with an 18M-parameter memory block fetched from a 4.6B-parameter memory bank matches the performance of a conventional model with more than twice as many parameters. Through extensive experiments, we study the optimal type and size of parametric memories in transformers, scaling memory banks to over 21B parameters. We find that our proposed hierarchical feed-forward memories work robustly across transformer architectures, whether added during pre-training or post-hoc.
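The fetch-and-attach mechanism the abstract describes can be sketched as a key-value feed-forward memory: a context embedding selects a small block of rows from a large bank, and that block acts as an extra feed-forward layer added to the model's hidden state. Everything below is an illustrative assumption, not the paper's implementation: the sizes, the `fetch_block` name, and the top-k selection rule are stand-ins for whatever hierarchy and routing the authors actually use.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 64            # model hidden size (toy value)
BANK_ROWS = 4096  # rows in the full memory bank (stands in for the 4.6B bank)
BLOCK_ROWS = 128  # context-dependent block fetched per prompt (stands in for the 18M slice)

# Full parametric memory bank: one (key, value) vector pair per row.
bank_keys = rng.standard_normal((BANK_ROWS, D)) / np.sqrt(D)
bank_vals = rng.standard_normal((BANK_ROWS, D)) / np.sqrt(D)

def fetch_block(context_emb):
    """Select the BLOCK_ROWS rows whose keys best match the context embedding."""
    scores = bank_keys @ context_emb          # (BANK_ROWS,)
    idx = np.argsort(scores)[-BLOCK_ROWS:]    # indices of the top-scoring rows
    return bank_keys[idx], bank_vals[idx]

def memory_ffn(h, keys, vals):
    """Feed-forward memory: relu(h @ K^T) @ V, added residually to the hidden state."""
    gate = np.maximum(h @ keys.T, 0.0)        # (BLOCK_ROWS,) activation per memory row
    return h + gate @ vals                    # (D,)

# One fetch per prompt: the small model only ever loads the fetched block.
context = rng.standard_normal(D)
keys, vals = fetch_block(context)
h = rng.standard_normal(D)
out = memory_ffn(h, keys, vals)
print(out.shape)  # (64,)
```

Only the fetched block ever has to reside on-device, which is the point of the hardware-aligned design: the 4.6B-row bank can live in slower storage while the model computes with a 18M-parameter slice.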
