MiniMaxAI/MiniMax-M3
MiniMax M3 vision-language MoE (427B total / 26B active) for frontier coding, agent toolchains, and 1M-token reasoning via MSA sparse attention — native multimodal (image + video + computer use); BF16 checkpoint with an MXFP8 variant from NVIDIA. Runs on NVIDIA (Hopper/Blackwell) and on AMD CDNA4 (MI350X/MI355X) and CDNA3 (MI300X/MI325X).
Frontier coding and agent (SWE-Bench Pro 59.0, Terminal-Bench 2.1 66.0); MSA sparse attention; 1M context
Overview
MiniMax-M3 is a frontier vision-language MoE model from MiniMax.
- MSA (MiniMax Sparse Attention) — scalable sparse-attention architecture
that lifts the context window to 1M tokens. MiniMax reports per-token
compute at 1M context reduced to ~1/20 of the previous generation, with
9× prefill and >15× decode speedup vs dense baselines.
- Frontier coding and agent capabilities — SWE-Bench Pro 59.0%, Terminal-Bench 2.1 66.0%, SWE-fficiency 34.8%, KernelBench Hard 28.8%, MCP Atlas 74.2%.
- Native multimodal — image + video inputs, plus computer-use; trained multimodally from step 0.
- Two reasoning modes: thinking (complex reasoning / agents) and non-thinking (latency-sensitive), switchable per request.
Prerequisites
- OS: Linux
- Python: 3.10 - 3.13
- NVIDIA: compute capability >= 9.0 (Hopper) recommended; 8x H200 / H20 for a tight single-node BF16 fit, or multi-node TP for long-context headroom
- AMD: MI350X/MI355X (gfx950), MI300X/MI325X (gfx942), ROCm 7.2+. BF16 needs TP=8; the MXFP8 variant runs from TP=4.
--block-size 128 is mandatory on every platform (MSA sparse/index cache).
Docker (NVIDIA)
MiniMax-M3 support has not yet shipped in a stable vLLM release — use the dedicated Docker image:
docker pull vllm/vllm-openai:minimax-m3
Docker (AMD ROCm)
MiniMax-M3 support has not yet shipped in a stable vLLM release. Use the dedicated Docker image, or a nightly image built after the MiniMax-M3 release:
docker pull vllm/vllm-openai-rocm:minimax-m3
docker run --rm -it --device /dev/kfd --device /dev/dri --group-add video \
--cap-add SYS_PTRACE --security-opt seccomp=unconfined --ipc=host \
--shm-size=16g -p 8000:8000 \
--entrypoint /bin/bash \
vllm/vllm-openai-rocm:minimax-m3
Launching the Server
NVIDIA — TP8 (8x H200 / H20)
vllm serve MiniMaxAI/MiniMax-M3 \
--tensor-parallel-size 8 \
--block-size 128 \
--tool-call-parser minimax_m3 \
--reasoning-parser minimax_m3 \
--enable-auto-tool-choice
TP8 + Expert Parallel
vllm serve MiniMaxAI/MiniMax-M3 \
--tensor-parallel-size 8 \
--enable-expert-parallel \
--block-size 128 \
--tool-call-parser minimax_m3 \
--reasoning-parser minimax_m3 \
--enable-auto-tool-choice
DP8 + Expert Parallel
vllm serve MiniMaxAI/MiniMax-M3 \
--data-parallel-size 8 \
--enable-expert-parallel \
--block-size 128 \
--tool-call-parser minimax_m3 \
--reasoning-parser minimax_m3 \
--enable-auto-tool-choice
AMD ROCm (MI350X/MI355X (gfx950), MI300X/MI325X (gfx942))
On AMD MI300X / MI325X / MI355X, run with CUDA graphs and set the following before any of the serve commands below. It avoids the MiniMax-M3 decode breakable-cudagraph path that would otherwise force eager execution (per @hongxiayang):
export VLLM_USE_BREAKABLE_CUDAGRAPH=0
For gfx950, prefer the MXFP8 variant MiniMaxAI/MiniMax-M3-MXFP8, which runs at TP=4 with a smaller
footprint. Use TP=8 for lower latency, for long context, or when serving the default BF16 model.
TP8 (Text or Vision)
vllm serve MiniMaxAI/MiniMax-M3 \
--tensor-parallel-size 8 \
--block-size 128 \
--attention-backend TRITON_ATTN \
--mm-encoder-tp-mode data \
--mm-encoder-attn-backend ROCM_AITER_FA \
--tool-call-parser minimax_m3 \
--reasoning-parser minimax_m3 \
--enable-auto-tool-choice
TP8 + Expert Parallel
vllm serve MiniMaxAI/MiniMax-M3 \
--tensor-parallel-size 8 \
--enable-expert-parallel \
--block-size 128 \
--attention-backend TRITON_ATTN \
--mm-encoder-tp-mode data \
--mm-encoder-attn-backend ROCM_AITER_FA \
--tool-call-parser minimax_m3 \
--reasoning-parser minimax_m3 \
--enable-auto-tool-choice
DP8 + Expert Parallel
vllm serve MiniMaxAI/MiniMax-M3 \
--data-parallel-size 8 \
--enable-expert-parallel \
--block-size 128 \
--attention-backend TRITON_ATTN \
--mm-encoder-tp-mode data \
--mm-encoder-attn-backend ROCM_AITER_FA \
--tool-call-parser minimax_m3 \
--reasoning-parser minimax_m3 \
--enable-auto-tool-choice
FP8 KV Cache
Add --kv-cache-dtype fp8 to any command for ~1.5× the KV pool — lossless
in our testing across the full native context. Especially worth it for high
concurrency or long context, where KV is the binding constraint.
Context Length & GPU Memory
The full 1M-token window (context_length: 1048576) needs a large KV
cache. To save GPU memory, you can optionally cap the context with
--max-model-len:
vllm serve MiniMaxAI/MiniMax-M3 \
--tensor-parallel-size 8 \
--block-size 128 \
--max-model-len 131072 # 128K instead of the full 1M
AMD ROCm notes: Native context is 512K. To go past it, supply a YaRN rope_scaling on the text config (a top-level override silently misses the decoder's config) and set VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 so vLLM accepts the longer max length. TP=8 + fp8 KV is the practical combo at 1M:
VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \
vllm serve MiniMaxAI/MiniMax-M3 \
--block-size 128 \
--kv-cache-dtype fp8 \
--tensor-parallel-size 8 \
--max-model-len 1048576 \
--attention-backend TRITON_ATTN \
--mm-encoder-tp-mode data \
--mm-encoder-attn-backend ROCM_AITER_FA \
--tool-call-parser minimax_m3 \
--enable-auto-tool-choice \
--reasoning-parser minimax_m3 \
--hf-overrides '{"text_config":{"rope_scaling":{"rope_type":"yarn","factor":2.0,"original_max_position_embeddings":524288}}}'
- Set --max-model-len to the longest prompt + output you actually serve (e.g. 32768, 131072, 262144). A smaller window frees KV-pool headroom for higher concurrency and lets the model fit on fewer GPUs; if you need the full 1M window, consider scaling out with multi-node TP instead.
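The --hf-overrides value in the 1M-context AMD command is plain JSON, so it can be generated instead of hand-typed. The following is a minimal sketch, assuming the YaRN factor is simply the target length divided by the 512K native window; the helper name yarn_override is hypothetical and not part of vLLM.

```python
import json

NATIVE_CONTEXT = 524288  # 512K, the native context stated for the AMD path


def yarn_override(target_len: int, native_len: int = NATIVE_CONTEXT) -> str:
    """Build the JSON string for --hf-overrides (hypothetical helper)."""
    if target_len <= native_len:
        raise ValueError("no YaRN override is needed at or below the native window")
    # Assumption: factor = target length / native length.
    factor = target_len / native_len
    override = {
        "text_config": {  # nested under text_config, not at the top level
            "rope_scaling": {
                "rope_type": "yarn",
                "factor": factor,
                "original_max_position_embeddings": native_len,
            }
        }
    }
    # Compact separators keep the string free of spaces for shell quoting.
    return json.dumps(override, separators=(",", ":"))


print(yarn_override(1048576))
```

For a 1048576-token target this reproduces the override string used in the command above (factor 2.0 over 524288 original positions). Wrap the result in single quotes when pasting it into a shell.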
Client Usage
Recommended sampling parameters (from the model card):
- temperature = 1.0
- top_p = 0.95
- top_k = 40
Default system prompt:
You are a helpful assistant. Your name is MiniMax-M3 and is built by MiniMax.
Example chat request:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "MiniMaxAI/MiniMax-M3",
"temperature": 1.0,
"top_p": 0.95,
"top_k": 40,
"messages": [
{"role": "system", "content": "You are a helpful assistant. Your name is MiniMax-M3 and is built by MiniMax."},
{"role": "user", "content": "Explain MSA sparse attention in 3 bullets."}
]
}'
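The same request can be built in Python with all three recommended sampling values. This is a sketch using only the standard library; the function names build_chat_payload and send_chat are illustrative, and the base URL assumes the server from the commands above is listening on localhost:8000.

```python
import json
import urllib.request

SYSTEM_PROMPT = "You are a helpful assistant. Your name is MiniMax-M3 and is built by MiniMax."


def build_chat_payload(user_text: str) -> dict:
    """Chat request body carrying the recommended sampling parameters."""
    return {
        "model": "MiniMaxAI/MiniMax-M3",
        "temperature": 1.0,
        "top_p": 0.95,
        "top_k": 40,  # vLLM accepts top_k as an extra sampling field
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_text},
        ],
    }


def send_chat(payload: dict, base_url: str = "http://localhost:8000/v1") -> dict:
    """POST the payload to a running server and return the decoded JSON reply."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as reply:
        return json.loads(reply.read().decode("utf-8"))


payload = build_chat_payload("Explain MSA sparse attention in 3 bullets.")
print(json.dumps(payload, indent=2))
# With a server running:
# print(send_chat(payload)["choices"][0]["message"]["content"])
```

Building the payload separately from sending it makes it easy to inspect or log the exact request before it reaches the server.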
Thinking Modes
M3 reasoning is controlled by thinking_mode, which takes one of three values:
- enabled: the model thinks before every response, including after tool results. Use for complex reasoning and agents.
- disabled: no thinking; the model answers directly. Use for latency-sensitive turns.
- adaptive (default when unset): the model decides whether to think based on the task.
Pass it per request through chat_template_kwargs. The same value also tunes
the minimax_m3 reasoning parser, so reasoning_content and content are
split correctly in every mode.
# Start the MiniMax-M3 model by referring to the command above first.
from openai import OpenAI
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
response = client.chat.completions.create(
model="MiniMaxAI/MiniMax-M3",
messages=[{"role": "user", "content": "Prove there are infinitely many primes."}],
extra_body={"chat_template_kwargs": {"thinking_mode": "enabled"}},
)
msg = response.choices[0].message
# vLLM exposes the <mm:think> block as `reasoning` (the older
# `reasoning_content` field is deprecated but still aliased).
print(getattr(msg, "reasoning", None) or getattr(msg, "reasoning_content", None))
print(msg.content) # the final answer
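When the mode changes from request to request, it can help to validate the value before it reaches the server. The sketch below is one possible wrapper, not part of vLLM or the OpenAI client; thinking_extra_body and split_reply are hypothetical names, and a SimpleNamespace stands in for a real response message so the snippet runs without a server.

```python
from types import SimpleNamespace

VALID_THINKING_MODES = ("enabled", "disabled", "adaptive")


def thinking_extra_body(mode=None) -> dict:
    """Return extra_body for a request; None keeps the default (adaptive)."""
    if mode is None:
        return {}
    if mode not in VALID_THINKING_MODES:
        raise ValueError(
            f"thinking_mode must be one of {VALID_THINKING_MODES}, got {mode!r}"
        )
    return {"chat_template_kwargs": {"thinking_mode": mode}}


def split_reply(message) -> tuple:
    """Return (reasoning, answer), preferring `reasoning` over `reasoning_content`."""
    reasoning = getattr(message, "reasoning", None) or getattr(
        message, "reasoning_content", None
    )
    return reasoning, message.content


# Stand-in for response.choices[0].message, for illustration only.
fake_message = SimpleNamespace(reasoning_content="Assume finitely many...", content="QED")

print(thinking_extra_body("disabled"))
print(split_reply(fake_message))
```

In real use, pass the dictionary as extra_body=thinking_extra_body("enabled") to client.chat.completions.create and call split_reply on the returned message.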
Benchmarking
vllm bench serve \
--backend vllm \
--model MiniMaxAI/MiniMax-M3 \
--endpoint /v1/completions \
--dataset-name random \
--random-input-len 2048 \
--random-output-len 1024 \
--max-concurrency 10 \
--num-prompts 100
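A single concurrency level rarely tells the whole story, so it can be useful to repeat the benchmark at several levels. The sketch below only generates the command lines; the flags are written in their full form (--random-input-len, --random-output-len, --num-prompts), while the concurrency values and the helper name bench_command are illustrative assumptions.

```python
import shlex


def bench_command(concurrency: int, num_prompts: int = 100) -> list:
    """Argument list for one `vllm bench serve` run at the given concurrency."""
    return [
        "vllm", "bench", "serve",
        "--backend", "vllm",
        "--model", "MiniMaxAI/MiniMax-M3",
        "--endpoint", "/v1/completions",
        "--dataset-name", "random",
        "--random-input-len", "2048",
        "--random-output-len", "1024",
        "--max-concurrency", str(concurrency),
        "--num-prompts", str(num_prompts),
    ]


# Hypothetical sweep; pick levels that match your expected traffic.
for concurrency in (1, 10, 50):
    print(shlex.join(bench_command(concurrency)))
```

Each printed line can be pasted into a shell, or the list can be handed to subprocess.run once the server is up.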
Quantized Variant (MXFP8)
MiniMaxAI/MiniMax-M3-MXFP8
is an MXFP8 checkpoint quantized by NVIDIA from the original BF16 weights,
at roughly half the VRAM of the BF16 release. To serve it, pass the repo id
directly to vllm serve:
vllm serve MiniMaxAI/MiniMax-M3-MXFP8 \
--tensor-parallel-size 8 \
--block-size 128 \
--tool-call-parser minimax_m3 \
--reasoning-parser minimax_m3 \
--enable-auto-tool-choice
For best MXFP8 throughput, prefer Blackwell (B200/B300) for native MX tensor cores, or AMD CDNA4 (MI350X/MI355X, gfx950) for native MXFP8 matrix cores.
Quantized Variant (MXFP4, AMD)
amd/MiniMax-M3-MXFP4 is an
AMD-quantized MXFP4 checkpoint (weights + activations OCP MXFP4, via
AMD-Quark) for CDNA4 (MI350X/MI355X, gfx950) — roughly half the VRAM of MXFP8,
so M3 fits a single 8x MI355X node from TP=4. It uses the same Triton MSA
attention path as the other AMD variants. This variant is AMD-only (use the
mxfp8 variant on Blackwell for native MX matrix cores).
MoE backend: aiter (performance) vs emulation (works today)
Image requirement for the AITER path. MXFP4 MoE on the high-performance AITER backend needs aiter >= 0.1.16.post2 (vllm#46692) and the MoE enablement vllm#46419. Until #46419 ships in a published vllm/vllm-openai-rocm image, a plain nightly will not bring up MXFP4 on --moe-backend aiter; build from source (or use the emulation backend below, which runs on current images).
AITER MoE (recommended for serving once available):
export VLLM_USE_BREAKABLE_CUDAGRAPH=0
VLLM_ROCM_USE_AITER=1 VLLM_ROCM_USE_AITER_MOE=1 \
vllm serve amd/MiniMax-M3-MXFP4 \
--tensor-parallel-size 4 \
--block-size 128 \
--moe-backend aiter \
--attention-backend TRITON_ATTN \
--language-model-only \
--no-enable-prefix-caching \
--tool-call-parser minimax_m3 \
--reasoning-parser minimax_m3 \
--enable-auto-tool-choice
Emulation MoE (numerically faithful, slower; runs on current images).
This is the path AMD used to measure accuracy (TP=8). Compared with the AITER
command, replace --moe-backend aiter with --moe-backend emulation and drop the AITER environment variables:
vllm serve amd/MiniMax-M3-MXFP4 \
--tensor-parallel-size 8 \
--block-size 128 \
--moe-backend emulation \
--attention-backend TRITON_ATTN \
--tool-call-parser minimax_m3 \
--reasoning-parser minimax_m3 \
--enable-auto-tool-choice
Accuracy
AMD reports gsm8k (5-shot, flexible-extract) 94.19 for amd/MiniMax-M3-MXFP4
vs 95.30 for the BF16 MiniMaxAI/MiniMax-M3 — 98.84% recovery
(lm-eval, --moe-backend emulation, per the
model card).
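The recovery figure is the quantized score expressed as a percentage of the BF16 score. A short check of that arithmetic, using only the two gsm8k numbers quoted above (the function name recovery_percent is illustrative):

```python
def recovery_percent(quantized: float, baseline: float) -> float:
    """Quantized score as a percentage of the baseline score."""
    return 100.0 * quantized / baseline


mxfp4_score = 94.19  # amd/MiniMax-M3-MXFP4, gsm8k 5-shot
bf16_score = 95.30   # MiniMaxAI/MiniMax-M3 (BF16), gsm8k 5-shot

print(f"{recovery_percent(mxfp4_score, bf16_score):.2f}%")  # → 98.84%
```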
KV cache dtype on the MXFP4 variant. Unlike the BF16 and MXFP8 checkpoints,
amd/MiniMax-M3-MXFP4 ships no calibrated KV scales. Adding --kv-cache-dtype fp8 still starts and serves, but vLLM falls back to an uncalibrated KV scale of 1.0 and logs "Using uncalibrated q_scale 1.0 ... This may cause accuracy issues". Leave the KV cache at its default dtype unless you have validated accuracy for your workload.
Serving validated on 8x MI355X (gfx950), TP=4, --moe-backend aiter, with a
ROCm vllm-dev image carrying the AITER MXFP4 MoE path (vLLM 0.23.1): the
model serves and the minimax_m3 reasoning/tool parsers split reasoning
from content correctly.
Troubleshooting
- --block-size mismatch. MSA's sparse block size is 128; the vLLM KV cache block size must match. Using the default (16) misaligns the sparse attention indexing (on AMD it crashes with "No common block size for 16").
- Parsers. --tool-call-parser and --reasoning-parser both use minimax_m3, which is distinct from the minimax_m2 used by earlier releases.
- Long context KV cache. See Context Length & GPU Memory above; cap --max-model-len or scale to multi-node TP if you OOM.
- Vision encoder. The encoder is small, so at high TP the Encoder Parallel option runs it data-parallel (--mm-encoder-tp-mode data) to avoid TP comm overhead; it also turns on the vision-encoder attention backend (FlashInfer on NVIDIA, --mm-encoder-attn-backend FLASHINFER; AITER FlashAttention on AMD, ROCM_AITER_FA) and the host-shared-memory multimodal processor cache (--mm-processor-cache-type shm). For text-only workloads enable Text only (--language-model-only) to skip loading the encoder and free VRAM; it is mutually exclusive with Encoder Parallel.
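Several of the pitfalls above can be caught before the server starts by inspecting the argument list. The following is a rough preflight sketch, not a vLLM feature: check_serve_args is a hypothetical helper, it only understands the "--flag value" spelling (not "--flag=value"), and it treats --language-model-only together with --mm-encoder-tp-mode data as a conflict based on the note above.

```python
import shlex


def check_serve_args(argv: list) -> list:
    """Return a list of problems found in a vllm serve argument list."""
    problems = []

    def value_of(flag):
        # Value following `flag`, or None when the flag is absent or last.
        if flag in argv:
            index = argv.index(flag)
            if index + 1 < len(argv):
                return argv[index + 1]
        return None

    if value_of("--block-size") != "128":
        problems.append("--block-size must be 128 to match the MSA sparse block size")
    for parser_flag in ("--tool-call-parser", "--reasoning-parser"):
        if value_of(parser_flag) != "minimax_m3":
            problems.append(f"{parser_flag} should be minimax_m3")
    if "--language-model-only" in argv and value_of("--mm-encoder-tp-mode") == "data":
        problems.append("text-only and encoder-parallel options are mutually exclusive")
    return problems


good = shlex.split(
    "vllm serve MiniMaxAI/MiniMax-M3 --tensor-parallel-size 8 --block-size 128 "
    "--tool-call-parser minimax_m3 --reasoning-parser minimax_m3 --enable-auto-tool-choice"
)
bad = ["vllm", "serve", "MiniMaxAI/MiniMax-M3", "--tool-call-parser", "minimax_m2"]

print(check_serve_args(good))  # → []
for problem in check_serve_args(bad):
    print("problem:", problem)
```

The second call reports three problems: the missing block size, the outdated tool-call parser, and the missing reasoning parser.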