LatentVLA: Latent Logical Models for Automated Driving

In a previous article, we discussed AlpamayoR1 (AR1), an autonomous driving model that integrates a VLM as a reasoning backbone. It relies on a carefully collected chain-of-causation dataset; training on this data enables AR1 to “think” in natural language to solve challenging driving situations.
But what if natural language is not the best medium for reasoning about driving? After all, when faced with a situation that requires a quick reaction, human drivers often act reflexively rather than talking themselves through it step by step. Is there an alternative for driving models?
In this article, we break down the architecture of LatentVLA, a refreshing alternative to language-based approaches: it requires no natural-language dataset, reasons in a latent space, and uses knowledge distillation to meet real-time constraints.
Latent Action Learning
A large part of AR1's success lies in its chain-of-causation dataset, a collection that required industry-scale effort, a carefully designed labeling pipeline, and extensive validation.
In contrast, LatentVLA takes a completely different stance: the authors argue that raw driving data already contains the structure needed to train a large model, and that natural language is inherently ambiguous and hard to map to actions. Moreover, generating natural-language reasoning traces is inefficient, as many tokens (e.g., filler words) do not contribute meaningfully to the reasoning process.
They therefore present a self-supervised framework that predicts latent ego-centric actions in a compact latent space. In other words, the model uses unlabeled driving data to infer which action the driver must have taken to generate the observed data. These latent actions serve as the building blocks of the latent reasoning space.
Learning the Action Representation
To infer latent actions from unlabeled data, the authors use a method similar to LAPO (Learning to Act without Actions) [2]. This approach relies on an encoder-decoder setup in which the encoder (also referred to as the inverse dynamics model, IDM) uses two consecutive frames to predict a latent action vector, while the decoder (the forward dynamics model, FDM) uses the current frame and the predicted action vector to reconstruct the next frame.
This clever setup forces the learned action representation to capture whatever must have happened to explain the state change between the two frames. However, this continuous action representation is incompatible with the token-based VLMs we intend to use. To discretize it, the authors use a VQ-VAE (Vector-Quantised Variational Auto-Encoder) [3], which maps each continuous vector to the nearest entry of a learned codebook (i.e., a dictionary of discrete actions) in a differentiable way. This discrete action is what the FDM uses to reconstruct the next frame.
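To make the quantization step concrete, here is a toy numpy sketch of VQ-VAE-style nearest-codebook lookup (shapes and names are illustrative, not the authors' code):

```python
import numpy as np

def quantize(latents: np.ndarray, codebook: np.ndarray):
    """latents: (N, D) continuous action vectors; codebook: (K, D) discrete actions.
    Returns (indices, quantized) where quantized[i] == codebook[indices[i]]."""
    # Squared Euclidean distance from every latent to every codebook entry.
    dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (N, K)
    indices = dists.argmin(axis=1)                                       # (N,)
    return indices, codebook[indices]

# During training, gradients bypass the non-differentiable argmin via the
# straight-through trick: quantized = latents + stop_gradient(q - latents)
```

With LatentVLA's small codebook, `K` would be 16, so each latent action maps to one of only 16 discrete tokens.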
By minimizing the next-frame reconstruction error, the IDM and FDM are jointly trained to encode a predictive action representation.
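In notation (a sketch; symbols are my own, and the VQ commitment terms of the full VQ-VAE loss are omitted), the joint objective is the reconstruction error of the next frame:

$$
\mathcal{L}_{\text{rec}} = \left\lVert \text{FDM}\big(o_t,\ \text{VQ}(\text{IDM}(o_t, o_{t+1}))\big) - o_{t+1} \right\rVert^2
$$

where $o_t$ and $o_{t+1}$ are consecutive observations and $\text{VQ}(\cdot)$ is the codebook quantization step.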
Distinguishing Ego Actions from Environmental Noise
Now you might think: “The driver's actions are not the only thing that influences the next frame; what if a bird flies in front of the camera? Does this spoil the action representation?”. To this, the authors answer that yes, there needs to be a way to disentangle the impact of the driver's actions from that of environmental forces on the next frame.
A good solution to this problem is a two-stage encoder-decoder setup:
- Conditioned on the ground-truth trajectory, the ego-state, and the previous frame, the encoder predicts a latent action. Since the vehicle's dynamics are already given by the trajectory and ego-state, this latent only needs to model environmental forces for the decoder to reconstruct the next frame. This “environmental action” is then quantized, and its codebook is frozen for the next stage.
- Conditioned on the previous frame and the environmental action, the encoder predicts another latent action. Since the environmental forces are now known and part of the conditioning, this second latent is forced to encode ego-centric dynamics. Using a new codebook, it is quantized into a discrete ego-action.
Finally, both actions are fed to the decoder to reconstruct the next frame. This setup ensures a clean separation between ego actions and environmental forces.
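The two-stage flow above can be sketched structurally. The toy linear maps, shapes, and function names below are my own placeholders; only the conditioning structure mirrors the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # latent/action dimensionality (assumed)

def nearest(codebook, z):
    # Nearest-neighbor quantization against a codebook of shape (K, D).
    return codebook[((codebook - z) ** 2).sum(-1).argmin()]

env_codebook = rng.normal(size=(16, D))   # frozen after stage 1
ego_codebook = rng.normal(size=(16, D))   # learned in stage 2

def encode_env(frame, trajectory, ego_state):
    # Stage 1: vehicle dynamics are given, so the latent can only
    # absorb environmental effects.
    z = np.tanh(frame[:D] + trajectory[:D] + ego_state[:D])
    return nearest(env_codebook, z)

def encode_ego(frame, env_action):
    # Stage 2: the environment is given, so the latent must encode
    # ego-centric dynamics.
    z = np.tanh(frame[:D] + env_action)
    return nearest(ego_codebook, z)

frame, traj, ego = rng.normal(size=(3, 32))
env_a = encode_env(frame, traj, ego)
ego_a = encode_ego(frame, env_a)
# decoder(frame, env_a, ego_a) would then reconstruct the next frame
```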
VLM Training
Building on the learned action representation, the authors train a Qwen2.5-VL model to predict the same latent actions as the encoder. Concretely, the IDM produces a sequence of 12 latent actions for a given input frame, and the VLM is trained to predict these tokens by minimizing a cross-entropy loss.
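Assuming a standard autoregressive next-token objective over the 12 latent action tokens (my notation, not the paper's), the training loss would look like:

$$
\mathcal{L}_{\text{VLM}} = -\sum_{t=1}^{12} \log p_\theta\!\left(a_t \mid I,\ a_{<t}\right)
$$

where $I$ is the input frame, $a_t$ is the $t$-th discrete latent action produced by the IDM, and $p_\theta$ is the VLM's distribution over the 16 action tokens.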
A notable difference from other methods that use action codebooks is the number of action tokens LatentVLA uses. Where models such as AutoVLA use a 2048-token action codebook, LatentVLA uses only 16.
This brings two benefits:
- A simpler learning task: in a 2048-entry codebook, actions likely represent fine-grained driving decisions such as “steer left at a 16-degree angle”. With only 16 tokens, they more likely capture high-level maneuvers such as “slow down” or “take a narrow right turn”, which require far less data to learn.
- Preserving the VLM's pre-trained knowledge: the model does not have to learn more than 2000 “new words”.
Knowledge Distillation
Where AlpamayoR1 relies on efficient tokenization and streamlined trajectory decoding to maintain real-time performance, LatentVLA goes in a completely different direction: knowledge distillation. First, the authors introduce a fusion module that plugs into existing E2E architectures (iPad [4] and Transfuser [5]). This fusion module receives the visual and action embeddings from the VLM and outputs features in Bird's-Eye-View (BEV) space. These embeddings serve as keys and values in a cross-attention with the BEV queries generated by the E2E model, allowing the E2E model to incorporate information from the VLM.
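A minimal single-head cross-attention sketch (numpy) of this fusion step, where BEV queries attend to VLM embeddings; shapes and names are illustrative, not from the paper:

```python
import numpy as np

def cross_attention(bev_queries: np.ndarray, vlm_embeddings: np.ndarray) -> np.ndarray:
    """bev_queries: (Nq, D); vlm_embeddings: (Nk, D) -> fused features (Nq, D)."""
    d = bev_queries.shape[-1]
    # VLM embeddings serve as both keys and values.
    scores = bev_queries @ vlm_embeddings.T / np.sqrt(d)       # (Nq, Nk)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)                  # row-wise softmax
    return weights @ vlm_embeddings                            # (Nq, D)
```

Each fused BEV feature is thus a convex combination of the VLM's embeddings, weighted by how relevant each one is to that BEV location.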

However, the VLM is still too large to run efficiently at test time. So, a small 50M-parameter decision transformer is trained to stand in for the 3.8B-parameter Qwen2.5-VL. This is achieved by minimizing the KL divergence between the teacher's and the student's action distributions.
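A sketch of this distillation objective, treating teacher (VLM) and student (decision transformer) outputs as categorical distributions over the 16 action tokens:

```python
import numpy as np

def kl_divergence(p_teacher: np.ndarray, p_student: np.ndarray, eps: float = 1e-12) -> float:
    """KL(teacher || student) for one categorical distribution over actions.
    Minimizing this pushes the student to match the teacher's predictions."""
    p = np.clip(p_teacher, eps, 1.0)
    q = np.clip(p_student, eps, 1.0)
    return float((p * np.log(p / q)).sum())
```

In practice this would be summed over the 12 action positions and averaged over the batch; temperature scaling and auxiliary losses are common additions, though the paper's exact recipe may differ.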
This framework lets LatentVLA run with a heavily compressed reasoning backbone, and it provides a generic way to inject VLM knowledge into traditional E2E architectures at low cost.

Evaluation
LatentVLA is trained and evaluated on NavSim [6], a dataset of more than 100,000 frames collected from real-world driving logs. NavSim includes a non-reactive simulator for open-loop planning evaluation.
In other words, the model predicts a trajectory over the next few seconds given the input images. This trajectory is then unrolled in a BEV simulation that replays the recorded actions of the other agents without reacting to the ego-vehicle (hence “non-reactive”). This allows for easy measurement of planning-related metrics such as the Predictive Driver Model Score (PDMS): a composite metric that combines simulation outputs to measure driving safety, progress, and comfort.
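Schematically, a PDMS-style composite treats hard safety terms as multiplicative gates and soft terms as a weighted average. The weights and sub-score names below are illustrative placeholders; the official definition is in the NavSim paper:

```python
def pdms(no_collision: float, drivable_area: float,
         progress: float, ttc: float, comfort: float,
         weights: tuple = (5, 5, 2)) -> float:
    """All sub-scores are in [0, 1]. Hard terms (collision, drivable area)
    gate the score multiplicatively; soft terms are weight-averaged."""
    w_p, w_t, w_c = weights
    soft = (w_p * progress + w_t * ttc + w_c * comfort) / (w_p + w_t + w_c)
    return no_collision * drivable_area * soft  # result in [0, 1]
```

The multiplicative gating means any collision or off-road excursion zeroes out the score, regardless of how good the progress and comfort terms are.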
However, this type of assessment has some important shortcomings, as we will discuss later.

On this benchmark, LatentVLA achieves state-of-the-art results, improving on standard E2E and VLM-based architectures. However, the performance gain obtained by injecting VLM knowledge into iPad and Transfuser appears limited. Focusing on PDMS, the iPad baseline scores 91.7. The distilled LatentVLA variant raises this to 92.1 (+0.4), and the undistilled version reaches 92.4 (a further +0.3).
This small improvement raises the question of whether high-level reasoning and world knowledge really matter in driving.
In my opinion, they have the potential to unlock a new level of driving capability, but this is not well measured by non-reactive planning simulators.

Limitations of Open-Loop Planning
In recent years, it has become widely accepted that evaluating driving models only in an open-loop configuration gives an incomplete picture of their actual driving capabilities. Indeed, open-loop planning is very different from real driving, and arguably easier. The main reason is that open-loop planning involves no interaction with the environment (the simulator is non-reactive) and reduces to imitating the expert's trajectory. This creates several problems in real-world deployment:
- Small deviations from the learned trajectories lead to cascading errors: without dynamic interaction with the environment and other agents, deployed models struggle to correct trajectories that drift away from the ones they have learned.
- Trajectories are inherently multimodal: in any driving situation, there are multiple routes and acceleration profiles that lead to safe driving outcomes. Learning to imitate a single expert trajectory collapses these multiple modes, limiting the model's potential.
For these reasons, it is important to also evaluate driving models in closed-loop (i.e., reactive) simulators, which motivates the RL-based post-training methods discussed in the AR1 article.
I would bet that the gap between LatentVLA and its non-VLM counterparts is much larger in these settings, as reasoning can help mitigate the limitations of open-loop training.
Conclusion
In this article, we discussed LatentVLA, a method that integrates VLM knowledge into standard E2E models without relying on natural language. The approach is clever in that it learns useful representations from unlabeled data, while competing approaches like AR1 rely on large-scale datasets carefully curated to avoid natural-language ambiguities.
However, LatentVLA would benefit from more thorough evaluation, especially in closed-loop settings.
Thanks for reading this far!
If you found this article useful, please consider sharing it; it really helps support the time and effort put into producing this work. As always, feel free to contact me if you have any follow-up questions, thoughts, or ideas. If you would like to support my independent research and writing, feel free to buy me a coffee 😉
Until next time! 👋
References
- [1] LatentVLA
- [2] LAPO: Learning to Act without Actions
- [3] VQ-VAE: Neural Discrete Representation Learning
- [4] iPad
- [5] Transfuser
- [6] NavSim



