Beyond One Extractor: Rethinking HTML-to-Text Extraction for LLM Pretraining

One of the first preprocessing steps in building web-scale LLM pretraining datasets is extracting text from HTML. Despite the great diversity of web content, existing open-source datasets mostly rely on a single fixed extractor for all web pages. In this work, we investigate whether this practice leads to the loss and misrepresentation of web data. We first show that although different extractors may yield the same model performance on common language-understanding tasks, the set of pages that survives a fixed filtering pipeline can differ substantially from one extractor to another. This suggests a simple intervention: by taking the union of the pages surviving under different extractors, we can increase the DCLM-Baseline token yield by up to 71% while maintaining benchmark performance. We also show that for structured content such as tables and code blocks, the choice of extractor can have a significant impact on downstream task performance, with differences of up to 10 percentage points (pp) on WikiTQ and 3 pp on HumanEval.
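To make the union idea concrete, here is a minimal sketch of how running multiple extractors and keeping any page that survives filtering under at least one of them could look. The specific extractors (trafilatura and a plain BeautifulSoup fallback) and the `passes_quality_filter` heuristic are illustrative assumptions, not the paper's actual pipeline.

```python
# Sketch of the "union over extractors" idea: a page dropped under one
# extractor may survive under another, so the union of surviving pages
# is at least as large as any single extractor's set.
from typing import Callable, Optional

import trafilatura
from bs4 import BeautifulSoup


def extract_trafilatura(html: str) -> Optional[str]:
    # Main-content extraction via trafilatura; returns None on failure.
    return trafilatura.extract(html)


def extract_soup(html: str) -> Optional[str]:
    # Naive fallback: strip all tags and keep the raw visible text.
    text = BeautifulSoup(html, "html.parser").get_text(separator="\n")
    return text.strip() or None


def passes_quality_filter(text: str) -> bool:
    # Placeholder for a real static filtering pipeline (e.g., DCLM-style
    # heuristics); here, just a crude minimum-length check.
    return len(text.split()) >= 50


EXTRACTORS: list[Callable[[str], Optional[str]]] = [
    extract_trafilatura,
    extract_soup,
]


def union_extract(html: str) -> Optional[str]:
    """Keep a page if ANY extractor's output survives filtering."""
    for extractor in EXTRACTORS:
        text = extractor(html)
        if text and passes_quality_filter(text):
            return text
    return None
```

In a real pipeline the extractor outputs would be deduplicated against each other before training, since the same page kept under two extractors yields near-duplicate text.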
