LiTo: Surface Light Field Tokenization

We propose a 3D latent representation that jointly encodes object geometry and view-dependent appearance. Most prior work focuses on reconstructing 3D geometry or predicting view-independent appearance, and therefore struggles to capture realistic view-dependent effects. Our method treats posed RGB-D images as samples of an object's surface light field. By encoding random samples of this surface light field into a compact set of latent vectors, our model learns to represent both geometry and appearance within a 3D latent space. This representation reproduces view-dependent effects such as specular highlights and Fresnel reflections under complex illumination. We further train a latent flow matching model on this representation to learn its distribution conditioned on a single input image, enabling the generation of 3D objects whose appearance is consistent with the input's lighting and material properties. Experiments show that our method achieves higher rendering quality and better fidelity to the input than existing methods.
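As a point of reference, the latent flow matching model mentioned above is typically trained with the standard conditional flow matching objective over a linear interpolation path; the formulation below is a sketch of that common setup (symbols are illustrative, not taken from the paper), where $z_1$ is a set of latent tokens, $z_0$ is Gaussian noise, $c$ is the conditioning image embedding, and $v_\theta$ is the learned velocity field:

```latex
\mathcal{L}_{\mathrm{FM}}(\theta)
= \mathbb{E}_{\,t \sim \mathcal{U}[0,1],\; z_0 \sim \mathcal{N}(0, I),\; z_1 \sim p_{\mathrm{data}}}
\left\| v_\theta\big((1 - t)\, z_0 + t\, z_1,\; t,\; c\big) - (z_1 - z_0) \right\|^2
```

Sampling then integrates the learned velocity field from $t = 0$ (noise) to $t = 1$ (latent tokens), which a decoder can render into view-dependent appearance.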
