D4RT: Integrated, Fast 4D Scene Reconstruction and Tracking
Introducing D4RT, an integrated AI model for 4D scene reconstruction and tracking in space and time.
Whenever we look at the world, we do a remarkable job of remembering and predicting. We see and understand things as they are at a particular moment, as they were a moment ago, and as they will be in the future. Our mental model of the world maintains a continuous representation of reality, and we use that model to reason about causal relationships between past, present, and future.
To help machines see the world the way we do, we can equip them with cameras, but that only solves the capture problem. To make sense of what is captured, computers must solve a complex inverse problem: taking a video – a series of flat 2D projections – and recovering the rich, moving 3D world behind it.
Today, we present D4RT (Dynamic 4D Reconstruction and Tracking), a new AI model that combines the reconstruction and tracking of dynamic scenes into a single, efficient framework, bringing us closer to the next frontier of artificial intelligence: a complete view of our dynamic reality.
The Four-Dimensional Challenge
To understand a dynamic scene captured in 2D video, an AI model must track every pixel of every object as it moves through the three dimensions of space and the fourth dimension of time. It must also separate this motion from the motion of the camera, maintaining a coherent representation even when objects move behind each other or leave the frame entirely. Traditionally, recovering this level of geometry and motion from 2D video has required computationally expensive pipelines or a patchwork of specialized AI models – some for depth, others for motion or camera pose – leading to slow, fragmented reconstructions.
D4RT's simplified architecture and novel query method put it at the forefront of 4D reconstruction while being 300x more efficient than previous methods – fast enough for real-time applications in robotics, virtual reality, and more.
How D4RT works: A Question-Based Approach
D4RT is built as a unified encoder-decoder transformer. The encoder first processes the input video into a compressed representation of the scene's geometry and motion. Unlike older systems that use different modules for different tasks, D4RT computes only what it needs, using a flexible query mechanism that centers on one key question:
“Where is a given pixel from the video in 3D space, at a chosen point in time, as seen from a selected camera?”
Building on our previous work, the lightweight decoder then queries this representation to answer specific instances of that question. Because the queries are independent of one another, they can be processed in parallel on modern AI hardware. This makes D4RT extremely fast and scalable, whether it is tracking a few points or reconstructing an entire scene.
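To make the query idea concrete, here is a minimal sketch (not D4RT's actual implementation – every name, shape, and weight below is invented) of a decoder viewed as a pure function over a fixed encoded video: each query is a (pixel, source frame, target time, camera) tuple, and because queries never interact, a whole batch reduces to a few matrix multiplies that parallelize trivially:

```python
import numpy as np

rng = np.random.default_rng(0)

D = 64    # latent dimension (invented)
N = 128   # number of encoded scene tokens (invented)

# Stand-in for the encoder output: one compressed representation
# of the whole video, shared by every query. A real model would
# compute this with a transformer; here it is random.
scene_tokens = rng.standard_normal((N, D))

# Fixed random projections standing in for learned weights.
W_query = rng.standard_normal((5, D))   # embeds (u, v, t_src, t_tgt, cam)
W_out = rng.standard_normal((D, 3))     # maps attended features to (x, y, z)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def decode(queries):
    """Answer a batch of independent queries in one shot.

    Each query asks: where is pixel (u, v) of frame t_src in 3D space
    at time t_tgt, as seen from camera `cam`? Because the queries are
    independent, the batch is just batched matrix arithmetic.
    """
    q = np.asarray(queries, dtype=float) @ W_query   # (B, D) query embeddings
    attn = softmax(q @ scene_tokens.T)               # (B, N) cross-attention weights
    feats = attn @ scene_tokens                      # (B, D) attended scene features
    return feats @ W_out                             # (B, 3) one 3D point per query

# Three example queries: (u, v, t_src, t_tgt, cam)
points = decode([(4, 7, 0, 3, 0), (10, 2, 1, 3, 0), (30, 30, 2, 5, 1)])
print(points.shape)  # (3, 3): one 3D point per query
```

The design point the sketch illustrates is that adding more queries only grows the batch dimension, so tracking a few points and densely reconstructing a scene use the same code path.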