NarrativeTrack: Exploring Video Language Models Beyond the Frame

Multimodal large language models (MLLMs) have achieved remarkable progress in vision-language reasoning, yet their ability to understand temporally unfolding narratives in videos remains underexplored. True narrative understanding requires tracking who is doing what, when, and where, maintaining a consistent representation of each entity across visual and temporal contexts. We introduce NarrativeTrack, the first benchmark for assessing narrative comprehension in MLLMs through fine-grained entity tracking. Unlike existing benchmarks that are limited to short clips or scene-level semantics, we decompose videos into their constituent entities and evaluate models with Compositional Reasoning Progression (CRP), a systematic evaluation framework that gradually increases narrative complexity along three dimensions: entity existence, entity change, and entity interaction. CRP challenges models to progress from frame-level perception to temporally grounded, compositional reasoning. An entity-centric automated pipeline enables the large-scale production of fine-grained entity annotations, providing the basis for CRP. An evaluation of state-of-the-art MLLMs reveals that models fail to robustly track entities across appearance changes and temporal dynamics, often confusing identities under changing contexts. Open-source general-purpose MLLMs exhibit strong static understanding but weak temporal coherence, while video-specific MLLMs capture temporal context but lose entity grounding. These results reveal a fundamental trade-off between conceptual and temporal reasoning, indicating that narrative understanding emerges only from their integration. NarrativeTrack provides the first systematic framework for diagnosing and developing temporally grounded narrative understanding in MLLMs.
- † University of Illinois Urbana-Champaign
- ** Work done while at Apple
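
The entity-tracking evaluation sketched in the abstract can be illustrated with a minimal identity-consistency metric. This is a hypothetical sketch, not the benchmark's actual pipeline or data format: `gold_tracks` and `pred_tracks` are assumed mappings from an entity ID to the region label assigned to that entity in each frame.

```python
from typing import Dict, List

def identity_consistency(
    gold_tracks: Dict[str, List[str]],
    pred_tracks: Dict[str, List[str]],
) -> float:
    """Fraction of (entity, frame) observations where the model's
    predicted identity assignment matches the gold track.

    Both arguments map an entity ID to one region label per frame
    (hypothetical format for illustration only).
    """
    total = 0
    correct = 0
    for entity, gold_labels in gold_tracks.items():
        pred_labels = pred_tracks.get(entity, [])
        for frame_idx, gold_label in enumerate(gold_labels):
            total += 1
            # A frame counts as correct only if the model assigned
            # the same region label to this entity in this frame.
            if frame_idx < len(pred_labels) and pred_labels[frame_idx] == gold_label:
                correct += 1
    return correct / total if total else 0.0

# Toy example: two entities tracked over three frames.
gold = {"person_1": ["a", "a", "b"], "dog_1": ["c", "c", "c"]}
pred = {"person_1": ["a", "b", "b"], "dog_1": ["c", "c", "c"]}
print(identity_consistency(gold, pred))  # → 0.8333... (5 of 6 observations)
```

A per-entity variant of this score would expose exactly the failure mode the abstract describes: models that keep scene-level semantics right while swapping entity identities mid-video.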



