Text-Conditional JEPA for Learning Semantically Rich Visual Representations

An image-based Joint-Embedding Predictive Architecture (JEPA) offers a promising approach to learning visual representations by predicting the features of masked image regions from the visible context. However, due to the inherent visual uncertainty of masked regions, feature prediction remains challenging and may fail to yield semantic representations. In this work, we propose Text-Conditional JEPA (TC-JEPA), which uses image captions to reduce prediction uncertainty. Specifically, we condition the predicted patch features on the caption through a lightweight conditioning module that adds a few cross-attention layers over the input text tokens. Patch features thus become predictable as a function of the text and, in turn, semantically meaningful. We show that TC-JEPA improves downstream performance and training stability, and exhibits promising scaling properties. TC-JEPA also offers a new pre-training paradigm for vision-language learning based on feature prediction alone, an approach that performs well across a variety of tasks, especially those requiring fine-grained understanding and reasoning.
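
The abstract does not specify the conditioning mechanism in detail; the following is a minimal sketch of how a JEPA-style predictor block might be extended with cross-attention over caption token embeddings, so that predicted patch features become a function of the text. All module names, dimensions, and the overall wiring are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (assumed design, not the paper's code): a ViT-style
# predictor block with an added cross-attention over caption tokens.
import torch
import torch.nn as nn

class TextConditionalPredictorBlock(nn.Module):
    def __init__(self, dim: int = 384, num_heads: int = 6, mlp_ratio: float = 4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Extra cross-attention: queries are the patch/mask tokens,
        # keys and values are the caption token embeddings.
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        hidden = int(dim * mlp_ratio)
        self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, x: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # x:    (B, N, D) visible-context tokens plus mask queries for hidden patches
        # text: (B, T, D) caption token embeddings (e.g., from a text encoder,
        #       projected to the predictor width D -- an assumption here)
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]
        q = self.norm_q(x)
        kv = self.norm_kv(text)
        x = x + self.cross_attn(q, kv, kv, need_weights=False)[0]  # condition on text
        x = x + self.mlp(self.norm2(x))
        return x

# Usage with dummy tensors:
block = TextConditionalPredictorBlock()
patches = torch.randn(2, 196, 384)   # context tokens + mask queries
caption = torch.randn(2, 32, 384)    # projected caption embeddings
out = block(patches, caption)        # (2, 196, 384) text-conditioned predictions
```

The point of the extra cross-attention path is the uncertainty reduction described above: without the caption, the predictor must average over many plausible completions of a masked region, whereas conditioning on the text narrows the set of targets the predicted features must match.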



