The Data-Quality Illusion: Rethinking Quality-Based Filtering for LLM Pretraining

Large language models are pre-trained on massive web-crawled datasets containing documents of mixed quality, making data filtering essential. A popular method is Classifier-based Quality Filtering (CQF), which trains a binary classifier to distinguish between pre-training data and a small, curated high-quality dataset. Each pre-training document is assigned a quality score, defined as the classifier's score, and only the highest-scoring documents are retained. We provide an in-depth analysis of CQF. We show that while CQF improves downstream task performance, it does not necessarily improve language modeling on the high-quality dataset. We explain this apparent paradox by showing that CQF implicitly filters the high-quality dataset as well. We further compare the behavior of models trained with CQF to that of models trained on synthetic data of increasing quality, obtained by randomly permuting tokens, and find markedly different trends. Our results challenge the assumption that CQF captures a meaningful notion of data quality.
- ‡ Work done while at Apple
- § University of Oxford
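The two mechanisms discussed in the abstract can be illustrated with a minimal sketch. The first function stands in for the CQF pipeline: a binary classifier separates pre-training documents from a high-quality set, and only the top-scoring fraction of pre-training documents is kept. Here a simple Naive Bayes-style word log-odds model plays the role of the classifier; the actual classifier architecture, features, and retention threshold used in practice are assumptions, not taken from this work. The second function sketches the token-permutation control: synthetic data whose quality degrades as a larger fraction of token positions is shuffled.

```python
# Illustrative sketch only: a word log-odds model stands in for the
# binary quality classifier; real CQF pipelines typically use a learned
# text classifier, which is not specified here.
import math
import random
from collections import Counter

def train_quality_classifier(high_quality_docs, pretraining_docs, smoothing=1.0):
    """Fit per-word log-odds of a word appearing in the high-quality set
    rather than the pre-training set (a Naive Bayes-style binary classifier)."""
    hq = Counter(w for d in high_quality_docs for w in d.split())
    pt = Counter(w for d in pretraining_docs for w in d.split())
    vocab = set(hq) | set(pt)
    hq_total = sum(hq.values()) + smoothing * len(vocab)
    pt_total = sum(pt.values()) + smoothing * len(vocab)
    return {w: math.log((hq[w] + smoothing) / hq_total)
              - math.log((pt[w] + smoothing) / pt_total)
            for w in vocab}

def quality_score(doc, log_odds):
    """Mean per-token log-odds; higher means 'more like the high-quality set'."""
    toks = doc.split()
    return sum(log_odds.get(w, 0.0) for w in toks) / max(len(toks), 1)

def cqf_filter(pretraining_docs, log_odds, keep_fraction=0.5):
    """Retain only the top-scoring fraction of pre-training documents."""
    ranked = sorted(pretraining_docs,
                    key=lambda d: quality_score(d, log_odds),
                    reverse=True)
    k = max(1, int(len(ranked) * keep_fraction))
    return ranked[:k]

def permute_tokens(doc, fraction, seed=0):
    """Synthetic quality degradation: randomly shuffle a given fraction
    of token positions (fraction=0 leaves the document unchanged)."""
    toks = doc.split()
    rng = random.Random(seed)
    idx = rng.sample(range(len(toks)), k=int(len(toks) * fraction))
    vals = [toks[i] for i in idx]
    rng.shuffle(vals)
    for i, v in zip(idx, vals):
        toks[i] = v
    return " ".join(toks)
```

As a toy usage, fitting the model on a few formal sentences versus a noisy pre-training sample and filtering with a small `keep_fraction` retains the pre-training document whose vocabulary most resembles the high-quality set.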


