STIV: Scalable Text and Image Conditioned Video Generation

The field of video generation has made remarkable progress, yet there remains a pressing need for a clear, systematic recipe to guide the development of robust and scalable models. In this work, we present a comprehensive study of the interplay between model architectures, training recipes, and data curation strategies, culminating in a simple and scalable text-image-conditioned video generation method named STIV. Our framework integrates the image condition into a Diffusion Transformer (DiT) through frame replacement, while incorporating the text condition via joint image-text conditional classifier-free guidance. This design enables STIV to perform both text-to-video (T2V) and text-image-to-video (TI2V) tasks simultaneously. In addition, STIV can be easily extended to various applications, such as video prediction, frame interpolation, multi-view generation, and long video generation. Across these settings, including TI2V, STIV shows strong performance. The model at 512 resolution achieves 83.1 on VBench T2V, surpassing both leading open and closed-source models such as CogVideoX-5B, Pika, Kling, and Gen-3. The same-sized model also achieves 90.1 on the VBench I2V task at 512 resolution. By providing a transparent and extensible recipe for building cutting-edge video generation models, we aim to empower future research and accelerate progress toward more versatile and reliable video generation solutions.
- 30 University of California, Los Angeles
- ** Work done while at Apple
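
The two conditioning mechanisms named in the abstract, frame replacement and joint image-text classifier-free guidance, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes that the conditioning image occupies the first frame of the latent video, that the unconditional branch drops the image and the text together, and that an all-zero embedding stands in for the null text condition. The names `replace_first_frame`, `joint_cfg`, and `toy_denoiser` are hypothetical, and the denoiser is a toy stand-in for a DiT.

```python
import numpy as np


def replace_first_frame(noisy_latents, image_latent):
    """Return a copy of noisy_latents (T, C, H, W) with frame 0 set to the clean image latent (C, H, W)."""
    out = noisy_latents.copy()
    out[0] = image_latent  # assumption: the image condition occupies the first frame
    return out


def joint_cfg(pred_cond, pred_uncond, scale):
    """Classifier-free guidance where the unconditional branch drops image and text jointly."""
    return pred_uncond + scale * (pred_cond - pred_uncond)


def toy_denoiser(latents, text_emb):
    """Hypothetical stand-in for a DiT; it only mixes the inputs so the sketch runs."""
    return 0.5 * latents + text_emb.mean()


rng = np.random.default_rng(0)
T, C, H, W = 4, 2, 3, 3
noisy = rng.normal(size=(T, C, H, W))      # noisy video latents
image_latent = rng.normal(size=(C, H, W))  # clean latent of the conditioning image
text_emb = rng.normal(size=(8,))           # text embedding
null_text = np.zeros_like(text_emb)        # assumption: zeros represent "no text"

# Conditional branch: image injected by frame replacement, text provided.
cond_input = replace_first_frame(noisy, image_latent)
pred_cond = toy_denoiser(cond_input, text_emb)

# Unconditional branch: no frame replacement and null text, dropped together.
pred_uncond = toy_denoiser(noisy, null_text)

guided = joint_cfg(pred_cond, pred_uncond, scale=7.5)  # scale value is illustrative
```

With `scale=1.0` the guided output equals the conditional prediction, and with `scale=0.0` it equals the unconditional one. Larger values push the sample further toward the joint image-text condition.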



