From Where Things Are to What They Do for You: Benchmarking Spatial-Functional Intelligence for Multimodal LLMs

True spatial intelligence in multimodal agents goes beyond low-level geometric perception: it extends from knowing where things are to understanding what they do. Existing benchmarks such as VSI-Bench test basic geometric perception well, but they fall short of testing the higher-order cognitive skills that are essential for grounded intelligence. To close this gap, we present the Spatial-Functional Intelligence Benchmark (SFI-Bench), a video-based benchmark of more than 1,700 questions spanning a variety of tasks on indoor video. SFI-Bench is designed to systematically assess two complementary dimensions of advanced reasoning: (1) Structural Spatial Reasoning, which covers understanding complex structures and building coherent spatial representations, and (2) Functional Reasoning, which covers object accessibility and context-dependent use. Its tasks, which include conditional computation, multi-hop relational reasoning, functional matching, and knowledge-based problem solving, directly challenge a model's ability to integrate perception, memory, and decision making. Our experiments reveal that current MLLMs often struggle to integrate spatial memory with functional and external knowledge, highlighting a critical barrier to grounded intelligence. SFI-Bench therefore provides an important tool for measuring and driving progress toward cognitively capable and truly grounded agents.
- † Mila, University of Montréal
- ‡ New York University
- ** Work done while at Apple



