SafetyPairs: Classifying Safety-Important Image Features Through Counterfactual Image Generation

This paper was accepted at the Principled Design for Trustworthy AI – Interpretability, Robustness, and Security All Proceedings Workshop at ICLR 2026.

What exactly makes a particular image unsafe? Systematically distinguishing between benign and problematic images is a challenging problem, as subtle changes in an image, such as an offensive gesture or symbol, can drastically alter its safety implications. However, existing image safety datasets are coarse and vague, providing only broad safety labels without isolating the specific features that drive these differences. We present SafetyPairs, a scalable framework for generating counterfactual pairs of images that differ only in the features relevant to a given safety policy, thereby isolating what determines their safety label. Using image editing models, we apply targeted edits that flip an image's safety label while leaving safety-irrelevant content unchanged. With SafetyPairs, we construct a new safety benchmark, which serves as a powerful source of evaluation data that highlights weaknesses in the ability of vision language models to distinguish between subtly different images. Beyond evaluation, we find that our pipeline also serves as an effective data augmentation technique that improves the sample efficiency of training lightweight guard models. We release a benchmark of more than 3,020 SafetyPair images covering a taxonomy of 9 safety categories, providing the first systematic resource for studying fine-grained image safety classification.
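The counterfactual-pair idea above can be sketched in code. This is a minimal illustrative mock, not the paper's actual pipeline: the real system uses an image editing model to apply the targeted edit, whereas here an image is a stand-in feature dictionary, `POLICY_FEATURE` is a hypothetical safety policy, and the "edit" simply toggles the policy-relevant feature while holding everything else fixed.

```python
from copy import deepcopy

# Hypothetical safety policy: an image is unsafe iff this feature is present.
POLICY_FEATURE = "offensive_symbol"

def is_unsafe(image: dict) -> bool:
    """Stand-in safety classifier (the real pipeline would use a guard model)."""
    return bool(image.get(POLICY_FEATURE, False))

def make_safety_pair(image: dict) -> tuple:
    """Return a (safe, unsafe) pair differing only in the policy feature.

    In SafetyPairs proper, the targeted change is performed by an image
    editing model; here we just flip the flag to show the pairing logic.
    """
    safe, unsafe = deepcopy(image), deepcopy(image)
    safe[POLICY_FEATURE] = False   # edit removes the safety-relevant feature
    unsafe[POLICY_FEATURE] = True  # ...or introduces it
    return safe, unsafe

if __name__ == "__main__":
    img = {"scene": "street", "people": 2, POLICY_FEATURE: True}
    safe, unsafe = make_safety_pair(img)
    # The pair differs only in the policy feature, so any classifier
    # disagreement between the two images is attributable to that feature.
    differing = {k for k in img if safe[k] != unsafe[k]}
    print(differing, is_unsafe(safe), is_unsafe(unsafe))
```

Because the two images in a pair are identical except for the policy-relevant feature, any difference in a model's prediction on them can be attributed to that feature, which is what makes the benchmark diagnostic.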
