
Cram Less to Fit More: Training Data Pruning Improves Memorization

This paper was accepted at the ICLR 2026 Workshop on Navigating and Addressing Data Problems for Foundation Models.

Large language models (LLMs) can struggle to memorize the factual knowledge in their training corpora, which often leads to hallucinations and poor performance on knowledge-intensive tasks. In this paper, we formalize fact memorization from an information-theoretic perspective and study how the distribution of the training data affects fact accuracy. We show that fact accuracy falls sharply, well below the capacity limit, whenever the information content of the facts in the training data exceeds the model's capacity, and that this degradation worsens when the fact frequency distribution is skewed (e.g., power law). We propose data-selection schemes, based only on the training loss, that limit the number of facts in the training data and flatten their frequency distribution. For semi-structured datasets containing high-entropy facts, our selection method provably increases fact accuracy in the capacity limit. When pre-training language models from scratch on an annotated Wikipedia corpus, our selection method enables GPT-2-Small (110M parameters) to memorize 1.3x more facts than standard training, matching the performance of a 10x larger model (1.3B parameters) pre-trained on the full dataset.
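The abstract describes selecting training data using only the training loss, so as to cap the number of facts and flatten a skewed frequency distribution. The sketch below is a minimal illustration of that idea, not the paper's actual algorithm: it assumes a hypothetical interface of `(fact_id, loss)` pairs, keeps at most one copy per fact (flattening the distribution), and caps the number of distinct facts, preferring high-loss examples as a proxy for rare, not-yet-memorized facts.

```python
from collections import defaultdict

def loss_based_prune(examples, max_facts, copies_per_fact=1):
    """Hypothetical loss-only pruning sketch.

    examples: list of (fact_id, loss) pairs, where `loss` is the
    model's training loss on that example. Frequent facts tend to
    have low loss, so low-loss facts are pruned first.
    """
    per_fact = defaultdict(list)
    for fact_id, loss in examples:
        per_fact[fact_id].append(loss)

    # Rank facts by their minimum loss, descending: facts the model
    # already fits well (typically high-frequency) are dropped first
    # when the fact budget runs out.
    ranked = sorted(per_fact, key=lambda f: min(per_fact[f]), reverse=True)

    kept = []
    for fact_id in ranked[:max_facts]:
        # Keep only a few copies per fact to flatten the distribution.
        for loss in sorted(per_fact[fact_id])[:copies_per_fact]:
            kept.append((fact_id, loss))
    return kept
```

For example, with a skewed corpus where fact "a" appears three times with low loss, pruning to a two-fact budget keeps only the rarer, higher-loss facts "b" and "c", each as a single copy.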
