This AI Paper Introduces Toto: Autoregressive Video Models for Unified Image and Video Pre-Training Across Diverse Tasks

Autoregressive pre-training has proven to be a game changer in machine learning, especially for sequential data. Predicting the next element in a sequence has been remarkably successful in natural language processing and is increasingly being explored in computer vision. Video modeling remains a comparatively underexplored area, offering opportunities in action recognition, object tracking, and robotics. This progress is driven by the growth of datasets and by innovations in transformer architectures that treat visual inputs as structured tokens suitable for autoregressive training.
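To make the idea concrete, here is a minimal sketch of the next-token objective that autoregressive pre-training optimizes. The shapes, vocabulary size, and random data below are illustrative stand-ins, not values taken from the paper.

```python
# Minimal sketch of the autoregressive (next-token) objective.
# All sizes and the random data are illustrative assumptions.
import torch
import torch.nn.functional as F

vocab_size, seq_len, batch = 8192, 16, 2
tokens = torch.randint(0, vocab_size, (batch, seq_len))  # a toy token sequence

# Pretend `logits` came from a causal model: one prediction per position.
logits = torch.randn(batch, seq_len, vocab_size)

# Next-token prediction: position t is trained to predict token t+1.
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),  # predictions for positions 0..T-2
    tokens[:, 1:].reshape(-1),               # targets are tokens 1..T-1
)
print(loss.item())
```

The same objective applies whether the tokens come from text or from tokenized video frames; only the tokenizer changes.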
Modeling videos poses unique challenges due to their temporal variability and redundancy. Unlike text, video frames often carry redundant information, which makes it difficult to tokenize them and learn meaningful representations. An effective video model must overcome this redundancy while capturing the spatiotemporal relationships across frames. Most frameworks focus on image-based representations, leaving the design of dedicated video architectures an open problem. The task requires new ways to balance computational efficiency with effectiveness, especially for applications such as video prediction and robotic manipulation.
Learning visual representations with convolutional networks and masked autoencoders has been successful on image tasks, but such methods often fall short on video because they cannot fully capture temporal dependencies. Tokenization methods such as dVAE and VQGAN convert visual information into discrete tokens. This has proven to work well, but scaling such methods becomes challenging on mixed datasets containing both images and videos, and patch-based tokenization alone does not generalize well across the variety of video tasks.
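For intuition, here is a toy illustration of the core step in VQ-style tokenizers such as VQGAN: each continuous patch feature is snapped to its nearest entry in a learned codebook to produce a discrete token id. The codebook and feature sizes here are invented for the example, not the paper's values.

```python
# Toy illustration of VQ-style tokenization: nearest-codebook lookup.
# Sizes are made up for the example.
import torch

codebook = torch.randn(8192, 64)        # 8k-entry codebook, 64-dim codes
patch_features = torch.randn(256, 64)   # e.g. 256 patch embeddings per frame

# Distance from every patch feature to every codebook entry,
# then take the argmin to get a discrete token id per patch.
dists = torch.cdist(patch_features, codebook)  # (256, 8192)
token_ids = dists.argmin(dim=-1)               # (256,) integer token ids
print(token_ids[:8])
```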
A team of researchers from Meta FAIR and UC Berkeley introduced the Toto family of autoregressive video models. Their approach addresses the limitations of traditional methods by treating videos as sequences of discrete visual tokens and using causal transformers to predict the next token. The researchers trained the models jointly on images and videos, using a combined dataset comprising more than one trillion visual tokens. This unified approach enabled the team to leverage the strengths of autoregressive pre-training in both domains.
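Conceptually, joint training works because an image can be treated as a single-frame video, so both modalities reduce to one flat stream of discrete tokens. The sketch below assumes a hypothetical `tokenize_frame` helper standing in for the actual tokenizer, and a fixed 256 tokens per frame as described below.

```python
# Sketch of how images and videos can share one token stream for joint
# autoregressive training. `tokenize_frame` is a hypothetical stand-in.
import torch

TOKENS_PER_FRAME = 256

def tokenize_frame(frame: torch.Tensor) -> torch.Tensor:
    """Hypothetical tokenizer: returns 256 token ids for one frame."""
    return torch.randint(0, 8192, (TOKENS_PER_FRAME,))

def to_sequence(sample: torch.Tensor) -> torch.Tensor:
    # An image is treated as a single-frame video, so both kinds of data
    # reduce to the same flat sequence of discrete tokens.
    frames = sample if sample.dim() == 4 else sample.unsqueeze(0)  # (T, C, H, W)
    return torch.cat([tokenize_frame(f) for f in frames])

image = torch.rand(3, 128, 128)       # one image -> 256 tokens
video = torch.rand(8, 3, 128, 128)    # 8 frames  -> 2048 tokens
print(to_sequence(image).shape, to_sequence(video).shape)
```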
Toto models use a dVAE tokenizer with an 8k-token vocabulary to process images and video frames. Each frame is resized and tokenized independently, yielding a sequence of 256 tokens per frame. These tokens are then processed by a causal transformer that uses RMSNorm and RoPE positional embeddings to improve model performance. Training was carried out on the ImageNet and HowTo100M datasets, with inputs tokenized at a resolution of 128 × 128 pixels. For downstream tasks, the researchers also replaced standard mean pooling with attention pooling to obtain higher-quality representations.
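Two of the ingredients mentioned above, RMSNorm and attention pooling, are easy to sketch in isolation. The implementations below are generic PyTorch versions for illustration, not Toto's actual code; the model width and head count are assumptions.

```python
# Generic sketches of RMSNorm and an attention-pooling head.
# Widths and head counts are illustrative assumptions.
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Normalize by the root-mean-square of the features (no mean-centering)."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

class AttentionPool(nn.Module):
    """Pool a token sequence with a learned query instead of mean pooling."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, tokens):                    # tokens: (B, T, dim)
        q = self.query.expand(tokens.size(0), -1, -1)
        pooled, _ = self.attn(q, tokens, tokens)  # (B, 1, dim)
        return pooled.squeeze(1)

x = torch.randn(2, 256, 512)                      # 256 tokens, width 512
print(RMSNorm(512)(x).shape, AttentionPool(512)(x).shape)
```

Attention pooling lets the probe learn which tokens matter for a downstream task rather than weighting all positions equally, which is why it tends to extract better representations from a frozen autoregressive model.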
The models perform well across benchmarks. On ImageNet classification, the largest Toto model achieved a top-1 accuracy of 75.3%, outperforming other generative models such as MAE and iGPT. On the Kinetics-400 action recognition task, the models reached a top-1 accuracy of 74.4%, demonstrating their ability to capture complex temporal dynamics. On the DAVIS dataset for semi-supervised video object segmentation, the models achieved a J&F score of up to 62.4, improving over strong baselines established by DINO and MAE. In addition, for robotic tasks such as object manipulation, Toto models learn quickly and are sample-efficient; for example, the Toto-base model completed the real-world task of picking up a cube with a Franka robot at a 63% success rate. Overall, these results highlight the versatility of the proposed models across very different applications.
The work makes significant advances in video modeling by addressing the redundancy and tokenization challenges. The researchers demonstrated that, through combined training on both images and videos, autoregressive pre-training is effective across a wide range of tasks. The proposed designs and tokenization techniques provide a basis for further research on dense prediction and recognition. This is an important step toward unlocking the full potential of video modeling in real-world applications.
Check out the Paper and Project Page. All credit for this research goes to the researchers of this project. Also, don't forget to follow us on Twitter and join our Telegram Channel and LinkedIn Group. Don't forget to join our 65k+ ML SubReddit.

Nikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who is always researching applications in fields like biomaterials and biomedical science. With a strong background in Materials Science, he is exploring new advancements and creating opportunities to contribute.