We are delighted to introduce two remarkable models, MiniMax-Text-01 and MiniMax-VL-01. MiniMax-Text-01 is a powerful language model with 456 billion total parameters, of which 45.9 billion are activated per token. To unlock its long-context capabilities, it adopts a hybrid architecture that integrates Lightning Attention, Softmax Attention, and Mixture-of-Experts (MoE). Leveraging advanced parallel strategies such as Linear Attention Sequence Parallelism Plus (LASP+), varlen ring attention, and Expert Tensor Parallel (ETP), its training context length extends to 1 million tokens, and it can handle up to 4 million tokens during inference. As a result, MiniMax-Text-01 delivers top-tier performance across a range of academic benchmarks.

Building on MiniMax-Text-01, we developed MiniMax-VL-01 for enhanced visual capabilities. It adopts the "ViT-MLP-LLM" framework common in multimodal LLMs and is initialized and trained from three key components: a 303-million-parameter Vision Transformer (ViT) for visual encoding, a randomly initialized two-layer MLP projector for image adaptation, and MiniMax-Text-01 as the base LLM. The model features a dynamic resolution mechanism: input images are resized according to a pre-set grid, with resolutions ranging from 336×336 to 2016×2016, while a 336×336 thumbnail of the full image is retained. The resized image is split into non-overlapping patches of the same size; these patches and the thumbnail are encoded separately and then combined to form a full image representation. As a result, MiniMax-VL-01 achieves top-level performance on multimodal leaderboards, demonstrating its strength in complex multimodal tasks.
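To make the dynamic-resolution mechanism more concrete, here is a minimal sketch of how an input image could be mapped to 336×336 tiles plus a thumbnail. It is illustrative only: the grid-selection heuristic, the function names, and the use of PIL are our assumptions, not the exact preprocessing implemented in the MiniMax-VL-01 processor.

```python
from PIL import Image

TILE = 336       # base tile / thumbnail size
MAX_SIDE = 2016  # largest grid resolution mentioned above


def pick_grid_resolution(w: int, h: int) -> tuple[int, int]:
    """Snap (w, h) to multiples of 336 within [336, 2016].
    This rounding rule is a simplifying assumption."""
    snap = lambda x: min(MAX_SIDE, max(TILE, round(x / TILE) * TILE))
    return snap(w), snap(h)


def split_into_tiles(img: Image.Image) -> tuple[list[Image.Image], Image.Image]:
    """Resize to a grid resolution, cut into non-overlapping 336x336 tiles,
    and keep a 336x336 thumbnail of the whole image."""
    gw, gh = pick_grid_resolution(*img.size)
    resized = img.resize((gw, gh))
    tiles = [
        resized.crop((x, y, x + TILE, y + TILE))
        for y in range(0, gh, TILE)
        for x in range(0, gw, TILE)
    ]
    thumbnail = img.resize((TILE, TILE))
    return tiles, thumbnail
```

Each tile and the thumbnail are then encoded separately by the ViT and combined into the full image representation, as described above.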
The architecture of MiniMax-Text-01 is briefly described as follows:
- Total Parameters: 456B
- Activated Parameters per Token: 45.9B
- Number of Layers: 80
- Hybrid Attention: a softmax attention layer is positioned after every 7 lightning attention layers (see the layer-pattern sketch after this list).
- Number of attention heads: 64
- Attention head dimension: 128
- Mixture of Experts:
  - Number of experts: 32
  - Expert hidden dimension: 9216
  - Top-2 routing strategy
- Positional Encoding: Rotary Position Embedding (RoPE) applied to half of the attention head dimension with a base frequency of 10,000,000
- Hidden Size: 6144
- Vocab Size: 200,064
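As an illustration of the hybrid attention pattern above, the following sketch enumerates which of the 80 layers would use softmax attention versus lightning attention, assuming the "1 softmax per 8 layers" pattern simply repeats; the exact layer indexing in the released weights may differ.

```python
NUM_LAYERS = 80


def attention_type(layer_idx: int) -> str:
    """Every 8th layer uses softmax attention; the other 7 use lightning attention."""
    return "softmax" if (layer_idx + 1) % 8 == 0 else "lightning"


layer_types = [attention_type(i) for i in range(NUM_LAYERS)]
print(layer_types.count("softmax"), "softmax layers,",
      layer_types.count("lightning"), "lightning layers")
# -> 10 softmax layers, 70 lightning layers
```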
For MiniMax-VL-01, the additional ViT architecture details are as follows:
- Total Parameters: 303M
- Number of layers: 24
- Patch size: 14
- Hidden size: 1024
- FFN hidden size: 4096
- Number of heads: 16
- Attention head dimension: 64
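Combining the patch size above with the 336×336 tiles produced by the dynamic-resolution mechanism gives a quick back-of-the-envelope patch count per tile. This is our own illustrative calculation; whether these patches are further downsampled before reaching the LLM is not specified here.

```python
TILE = 336   # tile / thumbnail resolution
PATCH = 14   # ViT patch size

patches_per_side = TILE // PATCH          # 24
patches_per_tile = patches_per_side ** 2  # 576 patches per 336x336 tile
print(patches_per_side, patches_per_tile)
```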
Tasks | GPT-4o (11-20) | Claude-3.5-Sonnet (10-22) | Gemini-1.5-Pro (002) | Gemini-2.0-Flash (exp) | Qwen2.5-72B-Inst. | DeepSeek-V3 | Llama-3.1-405B-Inst. | MiniMax-Text-01 |
---|---|---|---|---|---|---|---|---|
General | ||||||||
MMLU* | 85.7 | 88.3 | 86.8 | 86.5 | 86.1 | 88.5 | 88.6 | 88.5 |
MMLU-Pro* | 74.4 | 78.0 | 75.8 | 76.4 | 71.1 | 75.9 | 73.3 | 75.7 |
SimpleQA | 39.0 | 28.1 | 23.4 | 26.6 | 10.3 | 24.9 | 23.2 | 23.7 |
C-SimpleQA | 64.6 | 56.8 | 59.4 | 63.3 | 52.2 | 64.8 | 54.7 | 67.4 |
IFEval (avg) | 84.1 | 90.1 | 89.4 | 88.4 | 87.2 | 87.3 | 86.4 | 89.1 |
Arena-Hard | 92.4 | 87.6 | 85.3 | 72.7 | 81.2 | 91.4 | 63.5 | 89.1 |
Reasoning | ||||||||
GPQA* (diamond) | 46.0 | 65.0 | 59.1 | 62.1 | 49.0 | 59.1 | 50.7 | 54.4 |
DROP* (F1) | 89.2 | 88.8 | 89.2 | 89.3 | 85.0 | 91.0 | 92.5 | 87.8 |
Mathematics | ||||||||
GSM8k* | 95.6 | 96.9 | 95.2 | 95.4 | 95.8 | 96.7 | 96.7 | 94.8 |
MATH* | 76.6 | 74.1 | 84.6 | 83.9 | 81.8 | 84.6 | 73.8 | 77.4 |
Coding | ||||||||
MBPP+ | 76.2 | 75.1 | 75.4 | 75.9 | 77.0 | 78.8 | 73.0 | 71.7 |
HumanEval | 90.2 | 93.7 | 86.6 | 89.6 | 86.6 | 92.1 | 89.0 | 86.9 |
* Evaluated following a 0-shot CoT setting.
4M Needle In A Haystack Test
Ruler
Model | 4k | 8k | 16k | 32k | 64k | 128k | 256k | 512k | 1M |
---|---|---|---|---|---|---|---|---|---|
GPT-4o (11-20) | 0.970 | 0.921 | 0.890 | 0.888 | 0.884 | - | - | - | - |
Claude-3.5-Sonnet (10-22) | 0.965 | 0.960 | 0.957 | 0.950 | 0.952 | 0.938 | - | - | - |
Gemini-1.5-Pro (002) | 0.962 | 0.960 | 0.960 | 0.958 | 0.938 | 0.917 | 0.916 | 0.861 | 0.850 |
Gemini-2.0-Flash (exp) | 0.960 | 0.960 | 0.951 | 0.957 | 0.937 | 0.860 | 0.797 | 0.709 | - |
MiniMax-Text-01 | 0.963 | 0.961 | 0.953 | 0.954 | 0.943 | 0.947 | 0.945 | 0.928 | 0.910 |
LongBench v2
Model | overall | easy | hard | short | medium | long |
---|---|---|---|---|---|---|
Human | 53.7 | 100.0 | 25.1 | 47.2 | 59.1 | 53.7 |
w/ CoT | ||||||
GPT-4o (11-20) | 51.4 | 54.2 | 49.7 | 59.6 | 48.6 | 43.5 |
Claude-3.5-Sonnet (10-22) | 46.7 | 55.2 | 41.5 | 53.9 | 41.9 | 44.4 |
Deepseek-V3 | - | - | - | - | - | - |
Qwen2.5-72B-Inst. | 43.5 | 47.9 | 40.8 | 48.9 | 40.9 | 39.8 |
MiniMax-Text-01 | 56.5 | 66.1 | 50.5 | 61.7 | 56.7 | 47.2 |
w/o CoT | ||||||
GPT-4o (11-20) | 50.1 | 57.4 | 45.6 | 53.3 | 52.4 | 40.2 |
Claude-3.5-Sonnet (10-22) | 41.0 | 46.9 | 37.3 | 46.1 | 38.6 | 37.0 |
Deepseek-V3 | 48.7 | - | - | - | - | - |
Qwen2.5-72B-Inst. | 42.1 | 42.7 | 41.8 | 45.6 | 38.1 | 44.4 |
MiniMax-Text-01 | 52.9 | 60.9 | 47.9 | 58.9 | 52.6 | 43.5 |
MTOB
Context Type | no context | half book | full book | Δ half book | Δ full book |
---|---|---|---|---|---|
eng → kalam (ChrF) | |||||
GPT-4o (11-20) | 9.90 | 54.30 | - | 44.40 | - |
Claude-3.5-Sonnet (10-22) | 20.22 | 53.62 | 55.65 | 33.39 | 35.42 |
Gemini-1.5-Pro (002) | 16.79 | 53.68 | 57.90 | 36.89 | 41.11 |
Gemini-2.0-Flash (exp) | 12.20 | 49.50 | 53.30 | 37.30 | 41.10 |
Qwen-Long | 16.55 | 48.48 | 45.94 | 31.92 | 29.39 |
MiniMax-Text-01 | 6.0 | 51.74 | 51.60 | 45.7 | 45.6 |
kalam → eng (BLEURT) | |||||
GPT-4o (11-20) | 33.20 | 58.30 | - | 25.10 | - |
Claude-3.5-Sonnet (10-22) | 31.42 | 59.70 | 62.30 | 28.28 | 30.88 |
Gemini-1.5-Pro (002) | 32.02 | 61.52 | 63.09 | 29.50 | 31.07 |
Gemini-2.0-Flash (exp) | 33.80 | 57.50 | 57.00 | 23.70 | 23.20 |
Qwen-Long | 30.13 | 53.14 | 32.15 | 23.01 | 2.02 |
MiniMax-Text-01 | 33.65 | 57.10 | 58.00 | 23.45 | 24.35 |
Tasks | GPT-4o (11-20) | Claude-3.5-Sonnet (10-22) | Gemini-1.5-Pro (002) | Gemini-2.0-Flash (exp) | Qwen2-VL-72B-Inst. | InternVL2.5-78B | Llama-3.2-90B | MiniMax-VL-01 |
---|---|---|---|---|---|---|---|---|
Knowledge | ||||||||
MMMU* | 63.5 | 72.0 | 68.4 | 70.6 | 64.5 | 66.5 | 62.1 | 68.5 |
MMMU-Pro* | 54.5 | 54.7 | 50.9 | 57.0 | 43.2 | 47.3 | 36.0 | 52.7 |
Visual Q&A | ||||||||
ChartQA* (relaxed) | 88.1 | 90.8 | 88.7 | 88.3 | 91.2 | 91.5 | 85.5 | 91.7 |
DocVQA* | 91.1 | 94.2 | 91.5 | 92.9 | 97.1 | 96.1 | 90.1 | 96.4 |
OCRBench | 806 | 790 | 800 | 846 | 856 | 847 | 805 | 865 |
Mathematics & Sciences | ||||||||
AI2D* | 83.1 | 82.0 | 80.9 | 85.1 | 84.4 | 86.8 | 78.9 | 83.3 |
MathVista* | 62.1 | 65.4 | 70.6 | 73.1 | 69.6 | 68.4 | 57.3 | 68.6 |
OlympiadBench (full) | 25.2 | 28.4 | 32.1 | 46.1 | 21.9 | 25.1 | 19.3 | 24.2 |
Long Context | ||||||||
M-LongDoc (acc) | 41.4 | 31.4 | 26.2 | 31.4 | 11.6 | 19.7 | 13.9 | 32.5 |
Comprehensive | ||||||||
MEGA-Bench (macro) | 49.4 | 51.4 | 45.9 | 53.9 | 46.8 | 45.3 | 19.9 | 47.4 |
User Experience | ||||||||
In-house Benchmark | 62.3 | 47.0 | 49.2 | 72.1 | 40.6 | 34.8 | 13.6 | 56.6 |
* Evaluated following a 0-shot CoT setting.
Here we provide simple examples demonstrating how to use MiniMax-Text-01 and MiniMax-VL-01, respectively.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoConfig, QuantoConfig, GenerationConfig

# load hf config
hf_config = AutoConfig.from_pretrained("MiniMaxAI/MiniMax-Text-01", trust_remote_code=True)

# quantization config, int8 is recommended
quantization_config = QuantoConfig(
    weights="int8",
    modules_to_not_convert=[
        "lm_head",
        "embed_tokens",
    ] + [f"model.layers.{i}.coefficient" for i in range(hf_config.num_hidden_layers)]
    + [f"model.layers.{i}.block_sparse_moe.gate" for i in range(hf_config.num_hidden_layers)]
)

# assume 8 GPUs
world_size = 8
layers_per_device = hf_config.num_hidden_layers // world_size

# set device map: spread the decoder layers evenly across GPUs
device_map = {
    'model.embed_tokens': 'cuda:0',
    'model.norm': f'cuda:{world_size - 1}',
    'lm_head': f'cuda:{world_size - 1}',
}
for i in range(world_size):
    for j in range(layers_per_device):
        device_map[f'model.layers.{i * layers_per_device + j}'] = f'cuda:{i}'

# load tokenizer
tokenizer = AutoTokenizer.from_pretrained("MiniMaxAI/MiniMax-Text-01")
prompt = "Hello!"
messages = [
    {"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant created by MiniMax based on MiniMax-Text-01 model."}]},
    {"role": "user", "content": [{"type": "text", "text": prompt}]},
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

# tokenize and move to device
model_inputs = tokenizer(text, return_tensors="pt").to("cuda")

# load bfloat16 model, move to device, and apply quantization
quantized_model = AutoModelForCausalLM.from_pretrained(
    "MiniMaxAI/MiniMax-Text-01",
    torch_dtype="bfloat16",
    device_map=device_map,
    quantization_config=quantization_config,
    trust_remote_code=True,
    offload_buffers=True,
)

# generate response
generation_config = GenerationConfig(
    max_new_tokens=20,
    eos_token_id=200020,
    use_cache=True,
)
generated_ids = quantized_model.generate(**model_inputs, generation_config=generation_config)
print(f"generated_ids: {generated_ids}")

# strip the prompt tokens and decode the response
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```
```python
from transformers import AutoModelForCausalLM, AutoProcessor, AutoConfig, QuantoConfig, GenerationConfig
import torch
import json
import os
from PIL import Image

# load hf config
hf_config = AutoConfig.from_pretrained("MiniMaxAI/MiniMax-VL-01", trust_remote_code=True)

# quantization config, int8 is recommended
quantization_config = QuantoConfig(
    weights="int8",
    modules_to_not_convert=[
        "vision_tower",
        "image_newline",
        "multi_modal_projector",
        "lm_head",
        "embed_tokens",
    ] + [f"model.layers.{i}.coefficient" for i in range(hf_config.text_config.num_hidden_layers)]
    + [f"model.layers.{i}.block_sparse_moe.gate" for i in range(hf_config.text_config.num_hidden_layers)]
)

# set device map: collect the vision-related modules from the safetensors index
model_safetensors_index_path = os.path.join("MiniMax-VL-01", "model.safetensors.index.json")
with open(model_safetensors_index_path, "r") as f:
    model_safetensors_index = json.load(f)
weight_map = model_safetensors_index['weight_map']
vision_map = {}
for key, value in weight_map.items():
    if 'vision_tower' in key or 'image_newline' in key or 'multi_modal_projector' in key:
        new_key = key.replace('.weight', '').replace('.bias', '')
        if new_key not in vision_map:
            vision_map[new_key] = value

# assume 8 GPUs; keep the vision modules on the first GPU
world_size = 8
device_map = {
    'language_model.model.embed_tokens': 'cuda:0',
    'language_model.model.norm': f'cuda:{world_size - 1}',
    'language_model.lm_head': f'cuda:{world_size - 1}',
}
for key, value in vision_map.items():
    device_map[key] = 'cuda:0'
device_map['vision_tower.vision_model.post_layernorm'] = 'cuda:0'
layers_per_device = hf_config.text_config.num_hidden_layers // world_size
for i in range(world_size):
    for j in range(layers_per_device):
        device_map[f'language_model.model.layers.{i * layers_per_device + j}'] = f'cuda:{i}'

# load processor
processor = AutoProcessor.from_pretrained("MiniMaxAI/MiniMax-VL-01", trust_remote_code=True)
messages = [
    {"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant created by MiniMax based on MiniMax-VL-01 model."}]},
    {"role": "user", "content": [{"type": "image", "image": "placeholder"}, {"type": "text", "text": "Describe this image."}]},
]
prompt = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
raw_image = Image.open("figures/image.jpg")

# tokenize and move to device
model_inputs = processor(images=[raw_image], text=prompt, return_tensors='pt').to('cuda').to(torch.bfloat16)

# load bfloat16 model, move to device, and apply quantization
quantized_model = AutoModelForCausalLM.from_pretrained(
    "MiniMaxAI/MiniMax-VL-01",
    torch_dtype="bfloat16",
    device_map=device_map,
    quantization_config=quantization_config,
    trust_remote_code=True,
    offload_buffers=True,
)

generation_config = GenerationConfig(
    max_new_tokens=100,
    eos_token_id=200020,
    use_cache=True,
)

# generate response
generated_ids = quantized_model.generate(**model_inputs, generation_config=generation_config)
print(f"generated_ids: {generated_ids}")

# strip the prompt tokens and decode the response
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = processor.tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```
```bibtex
@misc{minimax2025minimax01scalingfoundationmodels,
title={MiniMax-01: Scaling Foundation Models with Lightning Attention},
author={MiniMax and Aonian Li and Bangwei Gong and Bo Yang and Boji Shan and Chang Liu and Cheng Zhu and Chunhao Zhang and Congchao Guo and Da Chen and Dong Li and Enwei Jiao and Gengxin Li and Guojun Zhang and Haohai Sun and Houze Dong and Jiadai Zhu and Jiaqi Zhuang and Jiayuan Song and Jin Zhu and Jingtao Han and Jingyang Li and Junbin Xie and Junhao Xu and Junjie Yan and Kaishun Zhang and Kecheng Xiao and Kexi Kang and Le Han and Leyang Wang and Lianfei Yu and Liheng Feng and Lin Zheng and Linbo Chai and Long Xing and Meizhi Ju and Mingyuan Chi and Mozhi Zhang and Peikai Huang and Pengcheng Niu and Pengfei Li and Pengyu Zhao and Qi Yang and Qidi Xu and Qiexiang Wang and Qin Wang and Qiuhui Li and Ruitao Leng and Shengmin Shi and Shuqi Yu and Sichen Li and Songquan Zhu and Tao Huang and Tianrun Liang and Weigao Sun and Weixuan Sun and Weiyu Cheng and Wenkai Li and Xiangjun Song and Xiao Su and Xiaodong Han and Xinjie Zhang and Xinzhu Hou and Xu Min and Xun Zou and Xuyang Shen and Yan Gong and Yingjie Zhu and Yipeng Zhou and Yiran Zhong and Yongyi Hu and Yuanxiang Fan and Yue Yu and Yufeng Yang and Yuhao Li and Yunan Huang and Yunji Li and Yunpeng Huang and Yunzhi Xu and Yuxin Mao and Zehan Li and Zekang Li and Zewei Tao and Zewen Ying and Zhaoyang Cong and Zhen Qin and Zhenhua Fan and Zhihang Yu and Zhuo Jiang and Zijia Wu},
year={2025},
eprint={2501.08313},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2501.08313},
}
```
For general use and evaluation, we provide a chatbot with online search capabilities, as well as an online API for developers.
Contact us at model@minimaxi.com.