Imagining the Future: Latent Lookahead Training for Transformers

This paper was accepted at the Workshop on Latent & Implicit Thinking – Going Beyond CoT Reasoning at ICLR 2026.
Autoregressive language models trained with next-token prediction generate text by sampling one token at a time. Although this objective is highly scalable, it forces the model to commit at every step, preventing it from exploring or reasoning about multiple possible continuations. Moreover, compute is allocated uniformly: every token is produced in a single forward pass, which can limit performance when difficult tokens inherently require more computation. To address these limitations, we introduce latent lookahead, a training strategy that lets models “think” before generating: at selected positions in a sequence, before committing to the next token, the model performs a multi-step lookahead in latent space. Concretely, instead of sampling future tokens, the model rolls forward in its hidden space by iteratively feeding its hidden state back into the context for τ steps, investing additional computation in predicting that token. This yields τ latent predictions that are supervised against the next τ ground-truth tokens, encouraging the model to “look ahead” and refine its prediction. We show that latent lookahead outperforms autoregressive and non-autoregressive baselines on planning tasks such as maze solving, Sudoku, and ProsQA, where foresight is important.
** Work done while at Apple
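
To make the abstract's training objective concrete, here is a minimal PyTorch-style sketch of a latent lookahead loss, assuming a Transformer that accepts input embeddings and exposes an output head. The names `model.backbone`, `model.lm_head`, and `get_input_embeddings` are placeholders for illustration, not the paper's actual API, and details such as which positions are selected for lookahead are omitted.

```python
import torch
import torch.nn.functional as F

def latent_lookahead_loss(model, input_ids, target_ids, tau=3):
    """Hypothetical sketch of latent lookahead training.

    At the final position of `input_ids`, instead of sampling tokens, the
    model's last hidden state is fed back into the context for `tau` latent
    steps. Each latent step is supervised against the corresponding
    ground-truth future token in `target_ids` (shape: batch x tau).
    """
    embeds = model.get_input_embeddings()(input_ids)        # (B, T, D)
    loss = 0.0
    for step in range(tau):
        hidden = model.backbone(inputs_embeds=embeds).last_hidden_state
        last_hidden = hidden[:, -1:, :]                      # latent "thought"
        logits = model.lm_head(last_hidden)                  # (B, 1, V)
        # Supervise latent step `step` against the (step+1)-th future token.
        loss = loss + F.cross_entropy(logits.squeeze(1), target_ids[:, step])
        # Feed the latent state back into the context instead of a token.
        embeds = torch.cat([embeds, last_hidden], dim=1)
    return loss / tau
```

In this reading, the lookahead loss would be combined with the standard next-token objective, so extra computation is spent only at the selected positions while the rest of the sequence is trained as usual.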



