Interpreting CLIP: Insights on its Robustness to ImageNet Distribution Shifts

What distinguishes robust models from non-robust ones? While it has been shown that robustness differences under ImageNet distribution shifts can largely be traced back to differences in training data, it is so far unknown what this translates to in terms of what the model has learned. In this work, we close this gap by investigating the representation spaces of 16 robust zero-shot CLIP vision encoders with various backbones (ResNets and ViTs) and pre-training sets (OpenAI, LAION-400M, LAION-2B, YFCC15M, CC12M and DataComp), and comparing them to the representation spaces of less robust models with identical backbones but different (pre-)training sets or objectives (CLIP pre-training on ImageNet-Captions, and supervised training or fine-tuning on ImageNet). Through this analysis, we generate three novel insights. First, we detect the presence of outlier features in robust zero-shot CLIP vision encoders, which to our knowledge is the first time these have been observed in non-language and non-transformer models. Second, we find the existence of outlier features to be an indication of ImageNet shift robustness, since in our analysis we only find them in the robust models. Finally, we also investigate the number of unique encoded concepts in the representation space and find zero-shot CLIP models to encode the highest number of unique concepts in their representation space. However, we do not find this to be an indicator of ImageNet shift robustness and hypothesize that it is rather related to the language supervision. Since the presence of outlier features can be detected without access to any data from shifted datasets, we believe that they can be a useful tool for practitioners to get a sense of the distribution shift robustness of a pre-trained model during deployment.
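To illustrate the last point, the sketch below shows one way a practitioner might probe a pre-trained CLIP image encoder for outlier feature dimensions using only in-distribution images, with no shifted data. It assumes the open_clip package and uses a simple magnitude-based criterion (a dimension is flagged if its mean absolute activation exceeds a hypothetical multiple k of the median across dimensions); this is an illustrative heuristic under those assumptions, not necessarily the exact definition or procedure used in the paper.

```python
# Minimal sketch: flag "outlier" feature dimensions of a pretrained CLIP image
# encoder from in-distribution images only (no shifted data required).
# Assumes the open_clip package; downloading the pretrained weights requires
# network access. The threshold k is a hypothetical, illustrative choice.
import torch
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
model.eval()

# In practice, `images` would be a batch of real images passed through
# `preprocess`; random tensors are used here only to keep the sketch runnable.
images = torch.randn(64, 3, 224, 224)

with torch.no_grad():
    feats = model.encode_image(images)  # shape: (batch_size, embed_dim)

# Per-dimension mean absolute activation across the batch.
dim_magnitude = feats.abs().mean(dim=0)

# Flag dimensions whose magnitude exceeds k times the median across dimensions.
k = 6.0
outlier_dims = torch.nonzero(dim_magnitude > k * dim_magnitude.median()).squeeze(-1)
print(f"{outlier_dims.numel()} outlier dimensions out of {feats.shape[1]}")
```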