MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer

Unified multimodal Large Language Models (LLMs) that can both understand and generate visual content hold great promise. However, existing open models often suffer a performance trade-off between these two capabilities. We present Manzano, a simple and scalable unified framework that substantially reduces this tension by coupling a hybrid image tokenizer with a well-curated training recipe. A single shared vision encoder feeds two lightweight adapters that produce continuous embeddings for image-to-text understanding and discrete tokens for text-to-image generation within a common semantic space. A unified autoregressive LLM predicts high-level semantics in the form of text and image tokens, with an auxiliary diffusion decoder subsequently translating the image tokens into pixels. The architecture, together with a unified training recipe over understanding and generation data, enables scalable joint learning of both capabilities. Manzano achieves state-of-the-art results among unified models and is competitive with specialist models, particularly on text-rich evaluations. Our studies show minimal task conflict and consistent gains from scaling model size, validating our hybrid tokenizer design.
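To make the hybrid-tokenizer idea concrete, below is a minimal PyTorch sketch of a shared vision encoder feeding two lightweight adapters: a continuous adapter for understanding and a discrete adapter whose outputs are quantized against a codebook to yield image tokens for generation. All module names, dimensions, and the nearest-neighbor quantizer here are illustrative assumptions, not the paper's actual implementation.

```python
# Sketch of a hybrid vision tokenizer: one shared encoder, two adapters.
# Everything here (sizes, layer counts, quantizer) is a toy stand-in.
import torch
import torch.nn as nn


class HybridVisionTokenizer(nn.Module):
    def __init__(self, patch_dim=768, llm_dim=1024, codebook_size=4096):
        super().__init__()
        # Shared vision encoder (stand-in for a pretrained ViT backbone).
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=patch_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        # Continuous adapter: embeddings consumed by the LLM for
        # image-to-text understanding.
        self.continuous_adapter = nn.Linear(patch_dim, llm_dim)
        # Discrete adapter: projects into a codebook space, then quantizes
        # to token ids the LLM can predict for text-to-image generation.
        self.discrete_adapter = nn.Linear(patch_dim, llm_dim)
        self.codebook = nn.Embedding(codebook_size, llm_dim)

    def forward(self, patches):
        feats = self.encoder(patches)                 # (B, N, patch_dim)
        cont = self.continuous_adapter(feats)         # (B, N, llm_dim)
        disc = self.discrete_adapter(feats)           # (B, N, llm_dim)
        # Nearest-codebook-entry quantization (Euclidean distance).
        dists = torch.cdist(disc, self.codebook.weight.unsqueeze(0))
        token_ids = dists.argmin(dim=-1)              # (B, N) discrete image tokens
        return cont, token_ids


patches = torch.randn(2, 196, 768)  # e.g. 14x14 grid of patch features
tokenizer = HybridVisionTokenizer()
cont_emb, image_tokens = tokenizer(patches)
print(cont_emb.shape, image_tokens.shape)  # (2, 196, 1024), (2, 196)
```

Because both adapters hang off the same encoder, the continuous embeddings and the discrete tokens live in a common semantic space, which is the property the abstract credits for reducing the understanding/generation conflict.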


