MixAtlas: Principled Data Mixture Optimization for Multimodal LLM Pretraining

This paper was accepted at the Workshop on Navigating and Addressing Data Problems for Foundation Models (NADPFM) at ICLR 2026.
Systematic data mixture optimization can greatly improve sampling efficiency and downstream performance; however, optimizing the data mix for multimodal pretraining remains underexplored. Current recipes for multimodal training tune mixtures along only a single axis, such as data format or task type. We present MixAtlas, a principled framework for constructing effective multimodal mixtures through systematic domain decomposition and small proxy models. MixAtlas organizes training data along two interpretable axes, *image concepts* and *task supervision*, enabling fine-grained mixture control and clear attribution of downstream performance to specific domains on each axis. Using small proxy models and a Gaussian-process surrogate, we explore the mixture space at roughly 1/100th the cost of full-scale training. The resulting mixtures yield substantial improvements: up to 3× faster convergence and consistent gains of 2–5% across diverse benchmarks over existing recipes, with especially strong gains on text-rich benchmarks such as ChartQA (+10%) and TextVQA (+13%). Importantly, we show that mixtures found with small proxy models transfer to large-scale model training, preserving both the efficiency and accuracy gains. Overall, MixAtlas makes multimodal mixture design efficient and interpretable, providing portable, compute-efficient recipes for training next-generation MLLMs.
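The proxy-plus-surrogate loop described above can be sketched in a few lines: train cheap proxy models on a handful of sampled mixture weight vectors, fit a Gaussian-process surrogate on the resulting (mixture, score) pairs, then rank a large pool of candidate mixtures without further training. This is a minimal illustration using scikit-learn, not the paper's implementation; the domain count, the Dirichlet sampling, and the toy `proxy_score` stand-in for a proxy-model run are all assumptions.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(0)

def sample_mixtures(n, k):
    """Draw n random mixture weight vectors over k data domains (Dirichlet)."""
    return rng.dirichlet(np.ones(k), size=n)

def proxy_score(w):
    """Hypothetical stand-in for an expensive proxy-model training run:
    returns a benchmark-style score for mixture weights w."""
    target = np.array([0.5, 0.2, 0.2, 0.1])  # illustrative optimum, not from the paper
    return -float(np.sum((w - target) ** 2))

# 1. Run proxy models on a small set of sampled mixtures.
observed = sample_mixtures(16, 4)
scores = np.array([proxy_score(w) for w in observed])

# 2. Fit the GP surrogate on (mixture weights -> score) pairs.
gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
gp.fit(observed, scores)

# 3. Rank a large candidate pool with the surrogate alone (no training).
candidates = sample_mixtures(2000, 4)
pred, std = gp.predict(candidates, return_std=True)
best = candidates[np.argmax(pred)]
print("best predicted mixture:", np.round(best, 3))
```

In a real pipeline, step 3 would feed the top-ranked mixtures back into further proxy runs (standard Bayesian-optimization style), and the winning mixture would then be used to train the full-scale model.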
- † Virginia Tech
- ‡ University of Washington
- ** Work done while at Apple



