
Beyond Real Data: Synthetic Data Through the Lens of So

Synthetic data can improve generalization when real data is scarce, but over-reliance on it can introduce distributional inconsistencies that degrade performance. In this paper, we present a learning-theoretic framework to quantify the trade-off between real and synthetic data. Our approach uses algorithmic stability to derive generalization error bounds, revealing an optimal ratio of synthetic to real data that minimizes the expected generalization error as a function of the Wasserstein distance between the real and synthetic distributions. We instantiate our framework in the kernel ridge regression setting with mixed data and provide a detailed analysis that may be of independent interest. Our theory predicts the existence of an optimal mixing ratio, which leads to a U-shaped behavior of the generalization error with respect to the proportion of synthetic data. Empirically, we confirm this prediction on the CIFAR-10 and clinical brain MRI datasets. Our theory also extends to the important case of domain adaptation, showing that carefully combining synthetic target-domain data with limited source data can reduce domain shift and improve generalization. We conclude with practical guidance for applying our results to both in-domain and out-of-domain settings.
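
To make the mixing-ratio idea concrete, the following is a minimal, hypothetical sketch (not the authors' code): a toy kernel ridge regression task where scarce "real" samples are combined with abundant "synthetic" samples drawn from a shifted distribution, and the synthetic fraction is swept to probe the U-shaped test error the abstract describes. The data-generating process, shift magnitude, and hyperparameters are illustrative assumptions.

```python
# Hypothetical sketch: mixing real and synthetic data in kernel ridge regression
# and sweeping the synthetic fraction. The shift between the two sampling
# distributions stands in for the Wasserstein distance discussed in the paper.
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

def target(x):
    return np.sin(3 * x).ravel()

# "Real" data: scarce, drawn from the true input distribution.
n_real, n_syn_total = 30, 300
X_real = rng.uniform(-1, 1, size=(n_real, 1))
y_real = target(X_real) + 0.1 * rng.standard_normal(n_real)

# "Synthetic" data: abundant but drawn from a shifted input distribution.
shift = 0.4
X_syn = rng.uniform(-1 + shift, 1 + shift, size=(n_syn_total, 1))
y_syn = target(X_syn) + 0.1 * rng.standard_normal(n_syn_total)

# Held-out test set from the real distribution.
X_test = rng.uniform(-1, 1, size=(500, 1))
y_test = target(X_test)

for frac in [0.0, 0.2, 0.4, 0.6, 0.8, 0.95]:
    n_syn = int(frac * n_syn_total)
    X = np.vstack([X_real, X_syn[:n_syn]])
    y = np.concatenate([y_real, y_syn[:n_syn]])
    model = KernelRidge(kernel="rbf", alpha=1e-2, gamma=5.0).fit(X, y)
    mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"synthetic fraction {frac:.2f}: test MSE {mse:.4f}")
```

Under these assumptions, a moderate amount of synthetic data can lower the test error, while a very high synthetic fraction lets the distribution shift dominate, mirroring the U-shaped trade-off the paper analyzes.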
