The Data-Quality Illusion: Rethinking Quality-Based Filtering for LLM Pretraining

Large language models are pre-trained on massive web-crawled datasets containing documents of mixed quality, making data filtering essential. A popular method is Classifier-based Quality Filtering (CQF), which trains a binary classifier to distinguish between pre-training data and a small, curated high-quality dataset. Each pre-training document is assigned a quality score, defined as the classifier's score, and only the highest-scoring documents are retained. We provide an in-depth analysis of CQF. We show that while CQF improves downstream task performance, it does not necessarily improve language modeling on the high-quality dataset. We explain this apparent paradox by showing that CQF implicitly filters the high-quality dataset as well. We further compare the behavior of models trained with CQF to that of models trained on synthetic data of increasing quality, obtained by randomly permuting tokens, and find markedly different trends. Our results challenge the assumption that CQF captures a meaningful notion of data quality.
- ‡ Work done while at Apple
- § University of Oxford
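The two mechanisms discussed in the abstract can be illustrated with a minimal sketch. The first function stands in for the CQF pipeline: a binary classifier separates pre-training documents from a high-quality set, and only the top-scoring fraction of pre-training documents is kept. Here a simple Naive Bayes-style word log-odds model plays the role of the classifier; the actual classifier architecture, features, and retention threshold used in practice are assumptions, not taken from this work. The second function sketches the token-permutation control: synthetic data whose quality degrades as a larger fraction of token positions is shuffled.

```python
# Illustrative sketch only: a word log-odds model stands in for the
# binary quality classifier; real CQF pipelines typically use a learned
# text classifier, which is not specified here.
import math
import random
from collections import Counter

def train_quality_classifier(high_quality_docs, pretraining_docs, smoothing=1.0):
    """Fit per-word log-odds of a word appearing in the high-quality set
    rather than the pre-training set (a Naive Bayes-style binary classifier)."""
    hq = Counter(w for d in high_quality_docs for w in d.split())
    pt = Counter(w for d in pretraining_docs for w in d.split())
    vocab = set(hq) | set(pt)
    hq_total = sum(hq.values()) + smoothing * len(vocab)
    pt_total = sum(pt.values()) + smoothing * len(vocab)
    return {w: math.log((hq[w] + smoothing) / hq_total)
              - math.log((pt[w] + smoothing) / pt_total)
            for w in vocab}

def quality_score(doc, log_odds):
    """Mean per-token log-odds; higher means 'more like the high-quality set'."""
    toks = doc.split()
    return sum(log_odds.get(w, 0.0) for w in toks) / max(len(toks), 1)

def cqf_filter(pretraining_docs, log_odds, keep_fraction=0.5):
    """Retain only the top-scoring fraction of pre-training documents."""
    ranked = sorted(pretraining_docs,
                    key=lambda d: quality_score(d, log_odds),
                    reverse=True)
    k = max(1, int(len(ranked) * keep_fraction))
    return ranked[:k]

def permute_tokens(doc, fraction, seed=0):
    """Synthetic quality degradation: randomly shuffle a given fraction
    of token positions (fraction=0 leaves the document unchanged)."""
    toks = doc.split()
    rng = random.Random(seed)
    idx = rng.sample(range(len(toks)), k=int(len(toks) * fraction))
    vals = [toks[i] for i in idx]
    rng.shuffle(vals)
    for i, v in zip(idx, vals):
        toks[i] = v
    return " ".join(toks)
```

As a toy usage, fitting the model on a few formal sentences versus a noisy pre-training sample and filtering with a small `keep_fraction` retains the pre-training document whose vocabulary most resembles the high-quality set.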


