
MiniMaxAI/MiniMax-M3

MiniMax M3 vision-language MoE (427B total / 26B active) for frontier coding, agent toolchains, and 1M-token reasoning via MSA sparse attention — native multimodal (image + video + computer use); BF16 checkpoint with an MXFP8 variant from NVIDIA. Runs on NVIDIA (Hopper/Blackwell) and on AMD CDNA4 (MI350X/MI355X) and CDNA3 (MI300X/MI325X).

Frontier coding and agent (SWE-Bench Pro 59.0, Terminal-Bench 2.1 66.0); MSA sparse attention; 1M context

MoE · 427B / 26B · 1,048,576 ctx · vLLM 0.24.0+ · text · multimodal

Overview

MiniMax-M3 is a frontier vision-language MoE model from MiniMax.

  • MSA (MiniMax Sparse Attention): scalable sparse-attention architecture that lifts the context window to 1M tokens. MiniMax reports per-token compute at 1M context reduced to ~1/20 of the previous generation, with 9× prefill and >15× decode speedup vs dense baselines.

  • Frontier coding and agent capabilities — SWE-Bench Pro 59.0%, Terminal-Bench 2.1 66.0%, SWE-fficiency 34.8%, KernelBench Hard 28.8%, MCP Atlas 74.2%.
  • Native multimodal — image + video inputs, plus computer-use; trained multimodally from step 0.
  • Two reasoning modes: thinking (complex reasoning / agents) and non-thinking (latency-sensitive), switchable per request.

Prerequisites

  • OS: Linux
  • Python: 3.10 - 3.13
  • NVIDIA: compute capability >= 9.0 (Hopper) recommended; 8x H200 / H20 for a tight single-node BF16 fit, or multi-node TP for long-context headroom
  • AMD: MI350X/MI355X (gfx950), MI300X/MI325X (gfx942), ROCm 7.2+. BF16 needs TP=8; the MXFP8 variant runs from TP=4.
  • --block-size 128 is mandatory on every platform (MSA sparse/index cache).

Docker (NVIDIA)

MiniMax-M3 support has not yet shipped in a stable vLLM release — use the dedicated Docker image:

docker pull vllm/vllm-openai:minimax-m3

Docker (AMD ROCm)

MiniMax-M3 support has not yet shipped in a stable vLLM release. Use the dedicated Docker image, or a nightly image once the release is out:

docker pull vllm/vllm-openai-rocm:minimax-m3
docker run --rm -it --device /dev/kfd --device /dev/dri --group-add video \
  --cap-add SYS_PTRACE --security-opt seccomp=unconfined --ipc=host \
  --shm-size=16g -p 8000:8000 \
  --entrypoint /bin/bash \
  vllm/vllm-openai-rocm:minimax-m3

Launching the Server

NVIDIA — TP8 (8x H200 / H20)

vllm serve MiniMaxAI/MiniMax-M3 \
  --tensor-parallel-size 8 \
  --block-size 128 \
  --tool-call-parser minimax_m3 \
  --reasoning-parser minimax_m3 \
  --enable-auto-tool-choice

TP8 + Expert Parallel

vllm serve MiniMaxAI/MiniMax-M3 \
  --tensor-parallel-size 8 \
  --enable-expert-parallel \
  --block-size 128 \
  --tool-call-parser minimax_m3 \
  --reasoning-parser minimax_m3 \
  --enable-auto-tool-choice

DP8 + Expert Parallel

vllm serve MiniMaxAI/MiniMax-M3 \
  --data-parallel-size 8 \
  --enable-expert-parallel \
  --block-size 128 \
  --tool-call-parser minimax_m3 \
  --reasoning-parser minimax_m3 \
  --enable-auto-tool-choice

AMD ROCm (MI350X/MI355X (gfx950), MI300X/MI325X (gfx942))

On AMD MI300X / MI325X / MI355X, run with CUDA graphs and set the following before any of the serve commands below. It avoids the MiniMax-M3 decode breakable-cudagraph path that would otherwise force eager execution (per @hongxiayang):

export VLLM_USE_BREAKABLE_CUDAGRAPH=0

For gfx950: prefer the MXFP8 variant MiniMaxAI/MiniMax-M3-MXFP8, which runs at TP=4 with a smaller footprint. Use TP=8 for lower latency, for long context lengths, or when serving the default BF16 model.

TP8 (Text or Vision)

vllm serve MiniMaxAI/MiniMax-M3 \
  --tensor-parallel-size 8 \
  --block-size 128 \
  --attention-backend TRITON_ATTN \
  --mm-encoder-tp-mode data \
  --mm-encoder-attn-backend ROCM_AITER_FA \
  --tool-call-parser minimax_m3 \
  --reasoning-parser minimax_m3 \
  --enable-auto-tool-choice

TP8 + Expert Parallel

vllm serve MiniMaxAI/MiniMax-M3 \
  --tensor-parallel-size 8 \
  --enable-expert-parallel \
  --block-size 128 \
  --attention-backend TRITON_ATTN \
  --mm-encoder-tp-mode data \
  --mm-encoder-attn-backend ROCM_AITER_FA \
  --tool-call-parser minimax_m3 \
  --reasoning-parser minimax_m3 \
  --enable-auto-tool-choice

DP8 + Expert Parallel

vllm serve MiniMaxAI/MiniMax-M3 \
  --data-parallel-size 8 \
  --enable-expert-parallel \
  --block-size 128 \
  --attention-backend TRITON_ATTN \
  --mm-encoder-tp-mode data \
  --mm-encoder-attn-backend ROCM_AITER_FA \
  --tool-call-parser minimax_m3 \
  --reasoning-parser minimax_m3 \
  --enable-auto-tool-choice
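
The launch commands above differ only in the parallel layout and in the extra AMD flags. As a minimal sketch, the helper below assembles the same argument lists programmatically; `serve_command` and its parameters are hypothetical names introduced for illustration, and only flags that appear in the commands above are used.

```python
import shlex

# Flags shared by every launch command in this recipe.
BLOCK = ["--block-size", "128"]
PARSERS = [
    "--tool-call-parser", "minimax_m3",
    "--reasoning-parser", "minimax_m3",
    "--enable-auto-tool-choice",
]

# Extra flags that the AMD ROCm commands add.
ROCM = [
    "--attention-backend", "TRITON_ATTN",
    "--mm-encoder-tp-mode", "data",
    "--mm-encoder-attn-backend", "ROCM_AITER_FA",
]


def serve_command(layout="tp", size=8, expert_parallel=False, rocm=False,
                  model="MiniMaxAI/MiniMax-M3"):
    """Return the `vllm serve` argv for one of the layouts shown above."""
    if layout not in ("tp", "dp"):
        raise ValueError("layout must be 'tp' or 'dp'")
    parallel_flag = ("--tensor-parallel-size" if layout == "tp"
                     else "--data-parallel-size")
    argv = ["vllm", "serve", model, parallel_flag, str(size)]
    if expert_parallel:
        argv.append("--enable-expert-parallel")
    argv += BLOCK
    if rocm:
        argv += ROCM
    return argv + PARSERS


# Print the AMD "DP8 + Expert Parallel" command as a single shell line.
print(shlex.join(serve_command("dp", 8, expert_parallel=True, rocm=True)))
```

On AMD, the `VLLM_USE_BREAKABLE_CUDAGRAPH=0` export still has to be set in the environment before running the printed command; the helper only builds the argument list.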

FP8 KV Cache

Add --kv-cache-dtype fp8 to any command for ~1.5× the KV pool — lossless in our testing across the full native context. Especially worth it for high concurrency or long context, where KV is the binding constraint.

Context Length & GPU Memory

The full 1M-token window (context_length: 1048576) needs a large KV cache. To save GPU memory, you can optionally cap the context with --max-model-len:

vllm serve MiniMaxAI/MiniMax-M3 \
  --tensor-parallel-size 8 \
  --block-size 128 \
  --max-model-len 131072        # 128K instead of the full 1M

AMD ROCm notes: Native context is 512K. To go past it, supply a YaRN rope_scaling on the text config (a top-level override silently misses the decoder's config) and set VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 so that vLLM accepts the longer --max-model-len. TP=8 + fp8 KV is the practical combo at 1M:

VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \
  vllm serve MiniMaxAI/MiniMax-M3 \
  --block-size 128 \
  --kv-cache-dtype fp8 \
  --tensor-parallel-size 8 \
  --max-model-len 1048576 \
  --attention-backend TRITON_ATTN \
  --mm-encoder-tp-mode data \
  --mm-encoder-attn-backend ROCM_AITER_FA \
  --tool-call-parser minimax_m3 \
  --enable-auto-tool-choice \
  --reasoning-parser minimax_m3 \
  --hf-overrides '{"text_config":{"rope_scaling":{"rope_type":"yarn","factor":2.0,"original_max_position_embeddings":524288}}}'
  • Set --max-model-len to the longest prompt + output you actually serve (e.g. 32768, 131072, 262144). A smaller window frees KV-pool headroom for higher concurrency and lets the model fit on fewer GPUs; if you need the full 1M window, consider scaling out with multi-node TP instead.
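
The --hf-overrides JSON in the AMD command can be generated rather than typed by hand. This is a minimal sketch, assuming the YaRN factor is simply the target context divided by the native context (which matches the 2.0 and 524288 values used above); `yarn_hf_overrides` is a hypothetical helper name.

```python
import json

NATIVE_CTX = 524288  # native context quoted in the AMD ROCm note above


def yarn_hf_overrides(target_ctx, native_ctx=NATIVE_CTX):
    """Build the nested text_config override for a longer context window."""
    if target_ctx <= native_ctx:
        raise ValueError("no YaRN scaling needed at or below the native context")
    # Assumption: factor = target / native, as in the 1M example above.
    factor = target_ctx / native_ctx
    return {
        "text_config": {
            "rope_scaling": {
                "rope_type": "yarn",
                "factor": factor,
                "original_max_position_embeddings": native_ctx,
            }
        }
    }


# Compact separators reproduce the single-line form passed on the command line.
print(json.dumps(yarn_hf_overrides(1048576), separators=(",", ":")))
```

Nesting the override under text_config is the important part: as noted above, a top-level rope_scaling would not reach the decoder's config.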

Client Usage

Recommended sampling parameters (from the model card):

  • temperature = 1.0
  • top_p = 0.95
  • top_k = 40

Default system prompt:

You are a helpful assistant. Your name is MiniMax-M3 and is built by MiniMax.

Example chat request:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "MiniMaxAI/MiniMax-M3",
    "temperature": 1.0,
    "top_p": 0.95,
    "messages": [
      {"role": "system", "content": "You are a helpful assistant. Your name is MiniMax-M3 and is built by MiniMax."},
      {"role": "user", "content": "Explain MSA sparse attention in 3 bullets."}
    ]
  }'
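
The same request can be built from Python with only the standard library. The sketch below mirrors the curl call and also carries top_k = 40 from the recommended parameters; it assumes the vLLM server accepts top_k as an extra sampling field in the request body, since it is not part of the OpenAI schema. `build_chat_request` and `send` are hypothetical helper names.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # same server as the curl example
SYSTEM_PROMPT = (
    "You are a helpful assistant. Your name is MiniMax-M3 and is built by MiniMax."
)


def build_chat_request(user_text, base_url=BASE_URL):
    """Build (but do not send) a chat completion request."""
    payload = {
        "model": "MiniMaxAI/MiniMax-M3",
        "temperature": 1.0,
        "top_p": 0.95,
        "top_k": 40,  # assumption: accepted by vLLM as an extra parameter
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_text},
        ],
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """POST the request and return the reply text (needs a running server)."""
    with urllib.request.urlopen(request) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


request = build_chat_request("Explain MSA sparse attention in 3 bullets.")
# print(send(request))  # uncomment once the server is up
```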

Thinking Modes

M3 reasoning is controlled by thinking_mode, which takes one of three values:

  • enabled — the model thinks before every response, including after tool results. Use for complex reasoning and agents.
  • disabled — no thinking; the model answers directly. Use for latency-sensitive turns.
  • adaptive (default when unset) — the model decides whether to think based on the task.

Pass it per request through chat_template_kwargs. The same value also tunes the minimax_m3 reasoning parser, so the reasoning and the final content are split correctly in every mode.


# Start the MiniMax-M3 model by referring to the command above first.

from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

response = client.chat.completions.create(
    model="MiniMaxAI/MiniMax-M3",
    messages=[{"role": "user", "content": "Prove there are infinitely many primes."}],
    extra_body={"chat_template_kwargs": {"thinking_mode": "enabled"}},
)
msg = response.choices[0].message
# vLLM exposes the <mm:think> block as `reasoning` (the older
# `reasoning_content` field is deprecated but still aliased).
print(getattr(msg, "reasoning", None) or getattr(msg, "reasoning_content", None))
print(msg.content)  # the final answer
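
When the mode is chosen per request, it can help to validate it and to read the reasoning field in one place. The following is a small sketch under that assumption; `thinking_extra_body` and `split_message` are hypothetical helpers, and the sample message is illustrative rather than captured server output.

```python
THINKING_MODES = ("enabled", "disabled", "adaptive")


def thinking_extra_body(mode):
    """Build the extra_body dict that selects a thinking mode for one request."""
    if mode not in THINKING_MODES:
        raise ValueError(
            f"thinking_mode must be one of {THINKING_MODES}, got {mode!r}"
        )
    return {"chat_template_kwargs": {"thinking_mode": mode}}


def split_message(message):
    """Return (reasoning, answer) from a chat message given as a plain dict.

    Prefers `reasoning` and falls back to the older `reasoning_content`.
    """
    reasoning = message.get("reasoning") or message.get("reasoning_content")
    return reasoning, message.get("content")


# Illustrative message shape, not real model output.
sample = {
    "role": "assistant",
    "reasoning_content": "Assume finitely many primes and derive a contradiction.",
    "content": "There are infinitely many primes.",
}
print(thinking_extra_body("disabled"))
print(split_message(sample))
```

With the OpenAI client, the dict returned by `thinking_extra_body` would be passed as the `extra_body` argument, as in the example above.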

Benchmarking

vllm bench serve \
  --backend vllm \
  --model MiniMaxAI/MiniMax-M3 \
  --endpoint /v1/completions \
  --dataset-name random \
  --random-input-len 2048 \
  --random-output-len 1024 \
  --max-concurrency 10 \
  --num-prompts 100

Quantized Variant (MXFP8)

MiniMaxAI/MiniMax-M3-MXFP8 is an MXFP8 checkpoint quantized by NVIDIA from the original BF16 weights, at roughly half the VRAM of the BF16 release. To use it, pass the repo id directly to vllm serve:

vllm serve MiniMaxAI/MiniMax-M3-MXFP8 \
  --tensor-parallel-size 8 \
  --block-size 128 \
  --tool-call-parser minimax_m3 \
  --reasoning-parser minimax_m3 \
  --enable-auto-tool-choice

For best MXFP8 throughput, prefer Blackwell (B200/B300) for native MX tensor cores, or AMD CDNA4 (MI350X/MI355X, gfx950) for native MXFP8 matrix cores.

Quantized Variant (MXFP4, AMD)

amd/MiniMax-M3-MXFP4 is an AMD-quantized MXFP4 checkpoint (weights + activations OCP MXFP4, via AMD-Quark) for CDNA4 (MI350X/MI355X, gfx950) — roughly half the VRAM of MXFP8, so M3 fits a single 8x MI355X node from TP=4. It uses the same Triton MSA attention path as the other AMD variants. This variant is AMD-only (use the mxfp8 variant on Blackwell for native MX matrix cores).
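
The "roughly half" figures can be sanity-checked with a back-of-the-envelope weight estimate. The sketch below assumes every one of the 427B parameters is stored in the named format and that MX formats carry one shared 8-bit scale per 32-element block (as in the OCP Microscaling formats); real checkpoints often keep some tensors in higher precision, and the KV cache and activations are not counted.

```python
TOTAL_PARAMS = 427e9  # total parameter count from the model summary

# Bits stored per weight. The shared 8-bit block scale adds 8 / 32 = 0.25
# bits per weight for the MX formats (assumption stated in the lead-in).
BITS_PER_WEIGHT = {
    "bf16": 16.0,
    "mxfp8": 8.0 + 8 / 32,
    "mxfp4": 4.0 + 8 / 32,
}


def weight_gb(fmt, params=TOTAL_PARAMS):
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return params * BITS_PER_WEIGHT[fmt] / 8 / 1e9


for fmt in BITS_PER_WEIGHT:
    ratio = weight_gb(fmt) / weight_gb("bf16")
    print(f"{fmt:>6}: ~{weight_gb(fmt):,.0f} GB ({ratio:.2f}x of BF16)")
```

Under these assumptions each step down (BF16 to MXFP8, MXFP8 to MXFP4) lands slightly above one half, because the block scales do not shrink with the element width.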

MoE backend: aiter (performance) vs emulation (works today)

Image requirement for the AITER path. MXFP4 MoE on the high-performance AITER backend needs aiter >= 0.1.16.post2 (vllm#46692) and the MoE enablement vllm#46419. Until #46419 ships in a published vllm/vllm-openai-rocm image, a plain nightly will not bring up MXFP4 on --moe-backend aiter — build from source (or use the emulation backend below, which runs on current images).

AITER MoE (recommended for serving once available):

export VLLM_USE_BREAKABLE_CUDAGRAPH=0
VLLM_ROCM_USE_AITER=1 VLLM_ROCM_USE_AITER_MOE=1 \
  vllm serve amd/MiniMax-M3-MXFP4 \
  --tensor-parallel-size 4 \
  --block-size 128 \
  --moe-backend aiter \
  --attention-backend TRITON_ATTN \
  --language-model-only \
  --no-enable-prefix-caching \
  --tool-call-parser minimax_m3 \
  --reasoning-parser minimax_m3 \
  --enable-auto-tool-choice

Emulation MoE (numerically faithful, slower; runs on current images). This is the path AMD used to measure accuracy (TP=8). Relative to the AITER command, replace --moe-backend aiter with --moe-backend emulation and drop the AITER environment variables:

vllm serve amd/MiniMax-M3-MXFP4 \
  --tensor-parallel-size 8 \
  --block-size 128 \
  --moe-backend emulation \
  --attention-backend TRITON_ATTN \
  --tool-call-parser minimax_m3 \
  --reasoning-parser minimax_m3 \
  --enable-auto-tool-choice

Accuracy

AMD reports gsm8k (5-shot, flexible-extract) 94.19 for amd/MiniMax-M3-MXFP4 vs 95.30 for the BF16 MiniMaxAI/MiniMax-M3, a 98.84% recovery (lm-eval, --moe-backend emulation, per the model card).
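
The recovery figure is the quantized score expressed as a percentage of the BF16 score. A minimal check of that arithmetic, with `recovery_pct` as a hypothetical helper name:

```python
def recovery_pct(quantized_score, baseline_score):
    """Quantized accuracy as a percentage of the baseline accuracy."""
    return 100.0 * quantized_score / baseline_score


# gsm8k scores quoted above: MXFP4 vs BF16.
print(f"{recovery_pct(94.19, 95.30):.2f}%")  # prints 98.84%
```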

KV cache dtype on the MXFP4 variant. Unlike the BF16 and MXFP8 checkpoints, amd/MiniMax-M3-MXFP4 ships no calibrated KV scales. Adding --kv-cache-dtype fp8 still starts and serves, but vLLM falls back to an uncalibrated KV scale of 1.0 and logs Using uncalibrated q_scale 1.0 ... This may cause accuracy issues. Leave the KV cache at its default dtype unless you have validated accuracy for your workload.

Serving validated on 8x MI355X (gfx950), TP=4, --moe-backend aiter, with a ROCm vllm-dev image carrying the AITER MXFP4 MoE path (vLLM 0.23.1): the model serves and the minimax_m3 reasoning/tool parsers split reasoning from content correctly.

Troubleshooting

  • --block-size mismatch. MSA's sparse block size is 128; the vLLM KV cache block size must match. Using the default (16) misaligns the sparse attention indexing (on AMD it crashes with No common block size for 16).
  • Parsers. --tool-call-parser and --reasoning-parser both use minimax_m3 — distinct from minimax_m2 used by earlier releases.
  • Long context KV cache. See Context Length & GPU Memory above — cap --max-model-len or scale to multi-node TP if you OOM.
  • Vision encoder. The encoder is small, so at high TP it is better run data-parallel (--mm-encoder-tp-mode data) to avoid TP communication overhead. The encoder-parallel setup also sets the vision-encoder attention backend (--mm-encoder-attn-backend FLASHINFER on NVIDIA, ROCM_AITER_FA on AMD) and the host-shared-memory multimodal processor cache (--mm-processor-cache-type shm). For text-only workloads, pass --language-model-only to skip loading the encoder and free VRAM; it is mutually exclusive with the encoder-parallel flags.
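
Several of the pitfalls above can be caught before launch with a small preflight check on the argument list. This is only a sketch: `check_serve_args` is a hypothetical helper, it encodes just the rules listed in this section, and it only understands the space-separated `--flag value` form.

```python
def check_serve_args(argv):
    """Return a list of problems found in a `vllm serve` argument list."""
    problems = []

    def value_of(flag):
        # Value following `flag`, or None when the flag is absent.
        if flag in argv and argv.index(flag) + 1 < len(argv):
            return argv[argv.index(flag) + 1]
        return None

    # MSA's sparse block size is 128, so the KV cache block size must match.
    if value_of("--block-size") != "128":
        problems.append("--block-size must be 128 to match the MSA block size")

    # Both parsers should be the M3 ones, not minimax_m2.
    for flag in ("--tool-call-parser", "--reasoning-parser"):
        parser = value_of(flag)
        if parser is not None and parser != "minimax_m3":
            problems.append(f"{flag} should be minimax_m3, got {parser}")

    # Text-only mode skips the encoder, so encoder-parallel makes no sense.
    if ("--language-model-only" in argv
            and value_of("--mm-encoder-tp-mode") == "data"):
        problems.append(
            "--language-model-only and --mm-encoder-tp-mode data "
            "are mutually exclusive"
        )
    return problems


bad = ["vllm", "serve", "MiniMaxAI/MiniMax-M3",
       "--tensor-parallel-size", "8",
       "--tool-call-parser", "minimax_m2"]
for problem in check_serve_args(bad):
    print(problem)
```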
