
Cram Less to Fit More: Training Data Pruning Improves Memorization

This paper was accepted at the ICLR 2026 Workshop on Navigating and Addressing Data Problems for Foundation Models.

Large language models (LLMs) can struggle to memorize the factual knowledge in their training corpora, which often leads to hallucinations and poor performance on knowledge-intensive tasks. In this paper, we formalize fact memorization from an information-theoretic perspective and study how the distribution of the training data affects fact accuracy. We show that fact accuracy falls sharply, well below the capacity limit, whenever the information content of the facts in the training data exceeds the model's capacity, and that this degradation worsens when the fact frequency distribution is skewed (e.g., power law). We propose data-selection schemes, based only on the training loss, that limit the number of facts in the training data and flatten their frequency distribution. For semi-structured datasets containing high-entropy facts, our selection method provably increases fact accuracy in the capacity limit. When pre-training language models from scratch on an annotated Wikipedia corpus, our selection method enables GPT-2-Small (110M parameters) to memorize 1.3x more facts than standard training, matching the performance of a 10x larger model (1.3B parameters) pre-trained on the full dataset.
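The abstract describes selecting training data using only the training loss, so as to cap the number of facts and flatten a skewed frequency distribution. The sketch below is a minimal illustration of that idea, not the paper's actual algorithm: it assumes a hypothetical interface of `(fact_id, loss)` pairs, keeps at most one copy per fact (flattening the distribution), and caps the number of distinct facts, preferring high-loss examples as a proxy for rare, not-yet-memorized facts.

```python
from collections import defaultdict

def loss_based_prune(examples, max_facts, copies_per_fact=1):
    """Hypothetical loss-only pruning sketch.

    examples: list of (fact_id, loss) pairs, where `loss` is the
    model's training loss on that example. Frequent facts tend to
    have low loss, so low-loss facts are pruned first.
    """
    per_fact = defaultdict(list)
    for fact_id, loss in examples:
        per_fact[fact_id].append(loss)

    # Rank facts by their minimum loss, descending: facts the model
    # already fits well (typically high-frequency) are dropped first
    # when the fact budget runs out.
    ranked = sorted(per_fact, key=lambda f: min(per_fact[f]), reverse=True)

    kept = []
    for fact_id in ranked[:max_facts]:
        # Keep only a few copies per fact to flatten the distribution.
        for loss in sorted(per_fact[fact_id])[:copies_per_fact]:
            kept.append((fact_id, loss))
    return kept
```

For example, with a skewed corpus where fact "a" appears three times with low loss, pruning to a two-fact budget keeps only the rarer, higher-loss facts "b" and "c", each as a single copy.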
