MiniMax-Text-01 and MiniMax-VL-01 Released: Powerful Models with Lightning Attention, 456B Parameters, 4M-Token Context Windows, and State-of-the-Art Accuracy

Large Language Models (LLMs) and Vision-Language Models (VLMs) are transforming natural language understanding, multimodal integration, and complex reasoning tasks. However, one important limitation remains: current models struggle to handle very long contexts. This challenge has motivated researchers to explore new methods and architectures to improve the scalability, efficiency, and effectiveness of these models.
Existing models typically support context lengths between 32,000 and 256,000 tokens, which limits their ability to handle situations that require larger context windows, such as extended instructions or multi-step reasoning tasks. Increasing the context size is computationally expensive due to the quadratic complexity of conventional softmax attention. Researchers have explored alternatives to attention, such as sparse attention, linear attention, and state-space models, to address these challenges, but large-scale adoption remains limited.
Linear attention reduces computational overhead by reformulating the attention computation, while sparse attention restricts each token to a subset of positions to shrink the attention matrix. However, adoption has been slow due to compatibility issues with existing architectures and lackluster real-world performance. For example, state-space models process long sequences efficiently but often lack the robustness and accuracy of transformer-based systems on complex tasks.
Researchers from MiniMax have introduced the MiniMax-01 series, including two variations to address these limitations:
- MiniMax-Text-01: MiniMax-Text-01 contains 456 billion parameters, with 45.9 billion activated per token. It uses a hybrid attention approach for efficient processing of long-range content. Its context window extends to 1 million tokens during training and 4 million tokens during inference.
- MiniMax-VL-01: MiniMax-VL-01 incorporates a lightweight Vision Transformer (ViT) module and is trained on 512 billion vision-language tokens through a four-stage training pipeline.
The models use a novel lightning attention mechanism, which reduces the computational complexity of processing long sequences. In addition, a Mixture of Experts (MoE) architecture improves efficiency and effectiveness: of the models' 456 billion total parameters, only 45.9 billion are activated per token. This combination allows the models to process context windows of up to 1 million tokens during training and up to 4 million tokens during inference. With these advanced computing techniques, the MiniMax-01 series offers unprecedented capability in long-context processing while maintaining performance in line with top-tier models such as GPT-4 and Claude-3.5.
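The sparse-activation principle behind MoE can be illustrated with a toy sketch. This is hypothetical illustration code, not MiniMax's implementation: a router scores the experts for each token, and only the top-k expert networks run, so the active parameter count per token stays a small fraction of the total.

```python
import numpy as np

def moe_layer(x, experts, router, top_k=2):
    """Toy Mixture-of-Experts forward pass for a single token.

    x:       (d,) token representation
    experts: list of (d, d) weight matrices, one per expert
    router:  (num_experts, d) gating matrix

    Only top_k experts are evaluated, so most parameters stay
    inactive -- the same principle behind activating 45.9B of
    456B parameters per token.
    """
    logits = router @ x                    # score every expert
    chosen = np.argsort(logits)[-top_k:]   # indices of the top_k experts
    gates = np.exp(logits[chosen])
    gates /= gates.sum()                   # normalize gate weights
    # Weighted sum over the selected experts only
    return sum(g * (experts[i] @ x) for g, i in zip(gates, chosen))

rng = np.random.default_rng(0)
d, num_experts = 8, 16
experts = [rng.normal(size=(d, d)) for _ in range(num_experts)]
router = rng.normal(size=(num_experts, d))
out = moe_layer(rng.normal(size=d), experts, router)
```

With 2 of 16 experts selected per token, only about an eighth of the expert parameters participate in any single forward pass, which is what keeps inference cost far below the total parameter count.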
The lightning attention mechanism achieves linear computational complexity, allowing the model to scale to very long sequences. A hybrid architecture alternates between lightning and softmax attention layers, balancing computational efficiency against retrieval capability. The models also include an improved Linear Attention Sequence Parallelism (LASP+) algorithm, which handles long sequences efficiently. In addition, the vision-language model MiniMax-VL-01 includes a lightweight Vision Transformer module and is trained on 512 billion vision-language tokens through a four-stage process. These innovations are accompanied by optimized CUDA kernels and parallelization techniques, which achieve over 75% Model FLOPs Utilization (MFU) on Nvidia H20 GPUs.
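The linear-complexity idea underlying lightning attention can be sketched with a generic kernelized linear-attention toy. By reassociating the matrix product, the (n × n) score matrix is never materialized; the intermediate state is only (d × d). The feature map `phi` below is a hypothetical stand-in, and this sketch omits MiniMax's actual kernel design and I/O-aware tiling.

```python
import numpy as np

def softmax_attention(Q, K, V):
    # Conventional attention materializes an (n, n) score matrix: O(n^2).
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ V

def linear_attention(Q, K, V):
    # Kernelized attention: phi(Q) @ (phi(K)^T @ V) reassociates the
    # product so the intermediate state is (d, d), giving cost that is
    # linear in sequence length n.
    phi = lambda x: np.maximum(x, 0.0) + 1e-6  # hypothetical feature map
    Qf, Kf = phi(Q), phi(K)
    kv = Kf.T @ V            # (d, d) summary, independent of n
    z = Kf.sum(axis=0)       # per-feature normalizer
    return (Qf @ kv) / (Qf @ z)[:, None]

rng = np.random.default_rng(1)
n, d = 16, 4
Q, K, V = rng.normal(size=(3, n, d))
out = linear_attention(Q, K, V)  # shape (n, d); no (n, n) matrix is built
```

Because the (d, d) summary can be updated incrementally, this formulation also lends itself to the sequence-parallel schemes (such as LASP+) mentioned above.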
Performance tests show that the MiniMax models achieve top-tier results across a wide range of benchmarks:
- For example, MiniMax-Text-01 achieves 88.5% accuracy on MMLU and performs competitively against models like GPT-4.
- The vision-language model MiniMax-VL-01 outperforms many peers, with 96.4% accuracy on DocVQA and 91.7% on the AI2D benchmark.
These models also offer a context window 20–32 times larger than that of their traditional counterparts, which greatly expands their usefulness in long-context applications.
In conclusion, the MiniMax-01 series, comprising MiniMax-Text-01 and MiniMax-VL-01, represents a significant step toward solving the challenges of scalability and long-context processing. It combines innovations such as lightning attention with a hybrid architecture. Using advanced computing frameworks and optimization techniques, the researchers have introduced a solution that extends context capacity to an unprecedented 4 million tokens and matches or surpasses the performance of leading models such as GPT-4.
Check out the Paper and models on Hugging Face. All credit for this study goes to the researchers of this project. Also, don't forget to follow us on Twitter and join our Telegram Channel and LinkedIn Group. Don't forget to join our 65k+ ML SubReddit.
Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the power of Artificial Intelligence for the benefit of society. His latest endeavor is the launch of an Artificial Intelligence media platform, Marktechpost, which stands out for its extensive coverage of machine learning and deep learning news that is technically sound and easily understood by a wide audience. The platform boasts more than 2 million monthly views, illustrating its popularity among readers.