Synthetic Bootstrapped Pretraining – Apple Machine Learning Research

We present Synthetic Bootstrapped Pretraining (SBP), a language model (LM) pretraining procedure that first learns a model of the relations between documents in the pretraining dataset and then uses it to synthesize a large new corpus for joint training. While standard pretraining teaches LMs to learn causal correlations among tokens within a single document, it is not designed to efficiently model the rich, learnable correlations between documents that could lead to better performance. We validate SBP by designing a compute-matched pretraining setup and pretraining a 3B-parameter and a 6B-parameter model on up to 1T tokens from scratch. We find that SBP consistently improves on a strong repetition baseline and delivers up to 60% of the performance improvement attainable by an oracle upper bound with access to 20x more unique data. Qualitative analysis reveals that the synthesized documents go beyond mere paraphrases: SBP first abstracts a core concept from the seed material and then crafts a new narrative on top of it. Beyond strong empirical performance, SBP admits a natural Bayesian interpretation: the synthesizer implicitly learns to abstract the latent concepts shared between related documents.
- † Stanford University
- ‡ Equal contribution
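
The three stages named in the abstract (pair related documents, learn a synthesizer from those pairs, train jointly on real plus synthetic text) can be sketched as a toy pipeline. This is only an illustration under stated assumptions, not the authors' implementation: the bag-of-words cosine similarity, the `threshold` parameter, and the function names `related_pairs` and `build_joint_corpus` are all hypothetical, and the synthesizer is left as a caller-supplied function standing in for a conditional LM trained on the (seed, target) pairs.

```python
import math
from collections import Counter
from typing import Callable, List, Tuple


def bag_of_words(text: str) -> Counter:
    """Lowercased whitespace tokens with counts (a stand-in for a real embedding)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two token-count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def related_pairs(docs: List[str], threshold: float = 0.5) -> List[Tuple[str, str]]:
    """Stage 1 (assumed): collect ordered (seed, target) pairs of similar documents.

    These pairs would be the training data for a conditional model of
    target given seed; that training step is outside this sketch.
    """
    vectors = [bag_of_words(d) for d in docs]
    pairs = []
    for i, seed in enumerate(docs):
        for j, target in enumerate(docs):
            if i != j and cosine(vectors[i], vectors[j]) >= threshold:
                pairs.append((seed, target))
    return pairs


def build_joint_corpus(
    docs: List[str],
    pairs: List[Tuple[str, str]],
    synthesize: Callable[[str], str],
) -> List[str]:
    """Stages 2-3 (assumed): generate one synthetic document per seed, then
    concatenate real and synthetic documents for joint training."""
    seeds = list(dict.fromkeys(seed for seed, _ in pairs))  # unique, order kept
    synthetic = [synthesize(seed) for seed in seeds]
    return docs + synthetic
```

In use, `synthesize` would sample from the trained synthesizer given a seed document; any string-to-string function can be passed in to exercise the plumbing. The quadratic pairwise loop is only workable for tiny examples, and a corpus at the scale described above would need an approximate nearest-neighbor index instead.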



