MiniMax M3: The Open-Weight Model That Brings Frontier Coding, 1M Context, and Native Multimodality Together

Released June 1, 2026: MiniMax M3 is the first open-weight model to combine frontier-level coding, a 1M-token context window, and native multimodal understanding in a single architecture. At launch-discount pricing it is roughly 15-17× cheaper than Claude Opus 4.7, and it's shaking up the AI landscape.

Introduction: The Three Pillars of a New Frontier Model

On June 1, 2026, MiniMax officially released M3, calling it "the first and only open-weight model" to bring together three capabilities that, until now, were exclusive to closed-source frontier models: frontier-level coding and agentic performance, a 1M-token context window, and native multimodal input (text, image, and video).

According to MiniMax's official launch post, M3 "reaches frontier-level performance on specialized tasks such as coding and agentic work" using a brand-new attention architecture called MSA (MiniMax Sparse Attention) — proposed entirely by their research team. The model also supports image and video input natively and can operate a desktop computer.

Until now, frontier coding, million-token context, and multimodal input have been the domain of closed-source models like Claude Opus 4.7, GPT-5.5, and Gemini 3.1 Pro. M3 is the first open-weight entry into that tier.

MSA (MiniMax Sparse Attention): The Engine Behind 1M Context

At the architectural heart of M3 is MSA (MiniMax Sparse Attention) — a block-sparse attention mechanism built on top of Grouped Query Attention (GQA). Its purpose is straightforward: escape the quadratic computational cost of full attention that makes scaling context beyond 128K tokens impractical.

How MSA Works

MSA operates in two stages:

Index Branch (Routing): A lightweight scorer selects a small Top-k set of KV blocks for each query or GQA group. This acts as a pre-filter.
Main Branch (Attention): The model runs exact attention only over the selected blocks — not the full history.
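
To make the two stages concrete, here is a minimal NumPy sketch for a single query vector. It is not MiniMax's implementation: the block scorer (a dot product between the query and each block's mean key), the block size, and the top-k value are illustrative assumptions, and a production kernel would run per GQA group on the GPU.

```python
import numpy as np


def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - x.max())
    return e / e.sum()


def block_sparse_attention(q, K, V, block_size=64, top_k=4):
    """Illustrative two-stage block-sparse attention for one query.

    q: (d,) query; K: (n, d) keys; V: (n, d_v) values; n is a multiple of block_size.
    Returns the attention output and the indices of the selected blocks.
    """
    n, d = K.shape
    num_blocks = n // block_size
    # Index branch (hypothetical scorer): score each block by q . mean(keys in block).
    block_keys = K.reshape(num_blocks, block_size, d).mean(axis=1)
    block_scores = block_keys @ q
    k = min(top_k, num_blocks)
    selected = np.sort(np.argpartition(-block_scores, k - 1)[:k])
    # Main branch: exact softmax attention over the real, uncompressed
    # K/V rows of the selected blocks only.
    rows = (selected[:, None] * block_size + np.arange(block_size)).ravel()
    weights = softmax(K[rows] @ q / np.sqrt(d))
    return weights @ V[rows], selected


def full_attention(q, K, V):
    # Dense reference: attend to every key.
    return softmax(K @ q / np.sqrt(K.shape[1])) @ V
```

When top_k covers every block, the sketch reduces to full attention, which makes a handy sanity check; with a small top_k, the main branch touches only top_k × block_size of the n keys.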

Unlike compressed latent KV approaches (like MLA), MSA works on real, uncompressed key/value tensors at block granularity, preserving attention expressiveness while dramatically reducing compute. For its kernels, MiniMax adopted a "KV outer gather Q" approach, using KV blocks as the outer loop and gathering the queries that selected each block. MiniMax reports that this achieves more than 4× higher arithmetic intensity than open-source alternatives like Flash-Sparse-Attention and flash-moba.
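
The "KV outer gather Q" idea can be illustrated by inverting the loop order. The NumPy sketch below (again, not MiniMax's kernel) groups queries by the KV block they selected, reads each block once, and scores it against all of its queries in one matrix product, merging partial results per query with an online softmax. Reusing each loaded block across many queries is what raises arithmetic intensity. The function name and the `selections` input format are hypothetical, and the sketch assumes every query selects at least one block with no duplicates.

```python
import numpy as np
from collections import defaultdict


def kv_outer_gather_q(Q, K, V, selections, block_size):
    """Sparse attention with KV blocks as the outer loop (illustrative only).

    Q: (m, d) queries; K: (n, d), V: (n, d_v); selections[i] lists the block
    indices chosen for query i (at least one, no duplicates).
    """
    m, d = Q.shape
    # Invert the query -> blocks routing into block -> queries.
    hits = defaultdict(list)
    for qi, blocks in enumerate(selections):
        for b in blocks:
            hits[int(b)].append(qi)

    # Per-query online-softmax state: running max, running sum, weighted V.
    run_max = np.full(m, -np.inf)
    run_sum = np.zeros(m)
    acc = np.zeros((m, V.shape[1]))
    scale = 1.0 / np.sqrt(d)

    for b, q_list in hits.items():  # outer loop: each KV block is read once
        Kb = K[b * block_size:(b + 1) * block_size]
        Vb = V[b * block_size:(b + 1) * block_size]
        q_ids = np.asarray(q_list)
        # One matrix product scores this block against every query that hit it.
        S = (Q[q_ids] @ Kb.T) * scale
        new_max = np.maximum(run_max[q_ids], S.max(axis=1))
        correction = np.exp(run_max[q_ids] - new_max)  # rescale earlier partials
        P = np.exp(S - new_max[:, None])
        run_sum[q_ids] = run_sum[q_ids] * correction + P.sum(axis=1)
        acc[q_ids] = acc[q_ids] * correction[:, None] + P @ Vb
        run_max[q_ids] = new_max

    return acc / run_sum[:, None]
```

The result matches ordinary softmax attention computed per query over the same selected blocks; only the loop order, and therefore the memory traffic, changes.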

Measured Performance Gains

At a context length of 1 million tokens, MiniMax reports:

Metric	Improvement
Per-token compute (vs. M2)	Reduced to 1/20th
Prefilling stage speedup	More than 9×
Decoding stage speedup	More than 15×
Effective context coverage	Significantly higher than DSA and MoBA

Crucially, MiniMax reports that across multiple ablations MSA matched full attention on the vast majority of capabilities, suggesting that the sparsity costs little or no quality.
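
To see why the gains grow with context length, the back-of-the-envelope model below counts attention FLOPs for one head during one decode step. The block size (128), top-k (16), and index-branch cost model are hypothetical, and the count covers attention only, so it is not comparable to MiniMax's end-to-end figures in the table above.

```python
def decode_attention_flops(context_len, head_dim, block_size=None, top_k=None):
    """Rough attention FLOPs for one head at one decode step.

    Dense: Q.K^T and P.V each cost about 2 * n * d FLOPs.
    Block-sparse (assumed cost model): the index branch scores every block
    (~2 * num_blocks * d), then exact attention covers top_k * block_size tokens.
    """
    if block_size is None:
        return 4 * context_len * head_dim
    num_blocks = -(-context_len // block_size)  # ceiling division
    attended = min(context_len, top_k * block_size)
    return 2 * num_blocks * head_dim + 4 * attended * head_dim


for n in (131_072, 1_048_576):
    dense = decode_attention_flops(n, head_dim=128)
    sparse = decode_attention_flops(n, head_dim=128, block_size=128, top_k=16)
    print(f"{n:>9,} tokens: {dense / sparse:.1f}x fewer attention FLOPs")
```

Under these assumptions the ratio is 51.2× at 128K tokens and about 170.7× at 1M, because the main branch stays fixed at 2,048 attended tokens while dense attention grows linearly; only the cheap index branch scales with context length.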

Benchmark Performance: SWE-Bench Pro and Beyond

MiniMax reports that M3 reaches frontier performance on a suite of internationally recognized coding and agentic benchmarks:

Benchmark	M3 Score	What It Measures
SWE-Bench Pro	59.0%	Real-world software engineering fixes
Terminal-Bench 2.1	66.0%	Terminal command execution and agentic tasks
SWE-fficiency	34.8%	Performance optimization of real-world code repositories
KernelBench Hard	28.8%	Low-level CUDA/kernel optimization
MCP Atlas	74.2%	Tool-use via Model Context Protocol

These scores place M3 in direct competition with models like GPT-5.5, Gemini 3.1 Pro, and Claude Opus 4.7 on specific software engineering tasks. On SWE-Bench Pro, MiniMax claims M3 beats GPT-5.5 and Gemini 3.1 Pro, approaching Opus 4.7's performance.

Additionally, MiniMax published a follow-up article on June 9, 2026 titled "MaxProof: Scaling Mathematical Proof with Generative-Verifier RL and Evolutionary Search" — where they revealed that with the MaxProof framework, M3 exceeded the human gold-medal threshold on both IMO 2025 and USAMO 2026 olympiad benchmarks.

Real-World Demos That Matter

Beyond benchmarks, MiniMax showcased three autonomous, long-horizon tasks that demonstrate M3's combined capabilities:

1. Independent Paper Reproduction (12 hours)

M3 was given an ICLR 2025 Outstanding Paper Award-winning paper — "Learning Dynamics of LLM Finetuning" — and asked to reproduce it autonomously. Over nearly 12 hours, M3:

Independently produced 18 commits and 23 experimental figures
Successfully matched prediction-probability trends during SFT stages
Observed the squeezing effect in DPO experiments
Verified the Extend mitigation method proposed in the original paper

Multimodal capabilities were required to read curves, data, and formulas in the paper. The 1M context window allowed the paper, code, and experiment logs to fit simultaneously. And the agentic coding capability made the multi-hour autonomous execution possible.

2. CUDA Kernel Optimization (24 hours)

FP8 matrix multiplication (GEMM) on NVIDIA Hopper architecture is notoriously difficult to optimize — typically requiring one to two weeks of work from an experienced engineering team. M3 was given only a task description and a broken Triton skeleton with no reference implementation.

Over 24 hours of continuous execution, M3:

Completed 147 benchmark submissions and 1,959 tool calls
Improved Hopper FP8 hardware peak utilization from 7.6% → 71.3%
Achieved a 9.4× speedup over its initial version
Went through 6 landmark optimization rounds (baseline, autotune, bottleneck diagnosis, CUDA Graph integration, persistent kernel rewrite, host-side scheduling)
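
The utilization and speedup figures can be cross-checked with standard GEMM arithmetic: multiplying an M×K matrix by a K×N matrix takes 2·M·N·K floating-point operations, and utilization is achieved throughput divided by the hardware peak. The helpers below are illustrative; the peak is a parameter you supply from your GPU's datasheet (NVIDIA lists roughly 1,979 dense FP8 TFLOPS for the H100 SXM). At a fixed problem size, runtime is inversely proportional to utilization, so the jump from 7.6% to 71.3% corresponds to 71.3 / 7.6 ≈ 9.4×, consistent with the reported speedup.

```python
def gemm_utilization(m, n, k, seconds, peak_tflops):
    """Fraction of peak throughput achieved by an (m x k) @ (k x n) GEMM."""
    flops = 2 * m * n * k  # one multiply and one add per inner-product term
    achieved_tflops = flops / seconds / 1e12
    return achieved_tflops / peak_tflops


def speedup_from_utilization(before, after):
    # Same problem size and same peak: runtime scales as 1 / utilization.
    return after / before


print(round(speedup_from_utilization(0.076, 0.713), 1))  # 9.4
```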

Notably, M3's best solution came on submission #145: it kept iterating through several performance plateaus, whereas the other models tested, apart from Opus 4.7, gave up at around submission #30.

3. Letting M3 Train Other Models (PostTrainBench)

M3 was given four base models that had only completed pretraining (no post-training) and tasked with autonomously handling data synthesis, training, evaluation, and iteration within 12 hours across five skills: mathematical reasoning (AIME2025), tool calling (BFCL), scientific reasoning (GPQA Main), grade-school math word problems (GSM8K), and code generation (HumanEval).

M3 scored 0.37 on PostTrainBench — below Opus 4.7 (0.42) and GPT-5.5 (0.39), but clearly ahead of all other models tested.

Pricing and Availability

M3's pricing strategy is aggressively competitive. Here's the breakdown from MiniMax's official pricing page:

Tier	Input (per 1M tokens)	Output (per 1M tokens)	Prompt Cache Read
≤512K input (Standard)	$0.60	$2.40	$0.12
≤512K input (Launch Discount)	$0.30	$1.20	$0.06
>512K input (Standard)	$1.20	$4.80	$0.24
>512K input (Launch Discount)	$0.60	$2.40	$0.12

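For context: Claude Opus 4.7 is priced at approximately $5 per million input tokens, so M3's launch-discount input rate of $0.30 is about 17× cheaper ($5 / $0.30 ≈ 16.7), and its output is roughly 16× cheaper. At standard rates, both gaps roughly halve, to about 8×.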
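
The tier table translates directly into a per-request cost estimate. The sketch below makes two assumptions the pricing table does not spell out: that the >512K tier is selected by the request's total input tokens (cached ones included), and that "512K" means 512,000 tokens. Treat it as a budgeting aid, not a billing reference.

```python
# USD per 1M tokens, from the pricing table: (input, output, cache read).
PRICES = {
    ("standard", "short"): (0.60, 2.40, 0.12),
    ("launch", "short"): (0.30, 1.20, 0.06),
    ("standard", "long"): (1.20, 4.80, 0.24),
    ("launch", "long"): (0.60, 2.40, 0.12),
}
TIER_THRESHOLD = 512_000  # assumption: "512K" read as 512,000 input tokens


def request_cost(input_tokens, output_tokens, cached_tokens=0, plan="launch"):
    """Estimated USD cost of one request; cached_tokens is a subset of input_tokens."""
    tier = "short" if input_tokens <= TIER_THRESHOLD else "long"
    in_rate, out_rate, cache_rate = PRICES[(plan, tier)]
    fresh = input_tokens - cached_tokens  # input billed at the normal rate
    return (fresh * in_rate + cached_tokens * cache_rate + output_tokens * out_rate) / 1e6


print(f"${request_cost(100_000, 10_000):.3f}")  # $0.042
```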

Subscription Plans

MiniMax also offers a Token Plan for individuals and small teams:

Plan	Monthly Cost	Tokens/Month
Plus	$20	Up to ~1.7B tokens
Max	$50	Up to ~5.1B tokens
Ultra	$120	Up to ~9.8B tokens
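
Dividing each plan's price by its token ceiling gives an effective rate per million tokens at full use. This is simple arithmetic on the table above; it assumes the allowance counts input and output tokens alike, which the plan table does not specify, and any unused allowance raises the effective rate proportionally.

```python
PLANS = {"Plus": (20, 1.7e9), "Max": (50, 5.1e9), "Ultra": (120, 9.8e9)}  # (USD/month, token ceiling)


def effective_rate_per_million(monthly_usd, tokens_used):
    # Cost per 1M tokens actually consumed in a month.
    return monthly_usd / (tokens_used / 1e6)


for name, (usd, ceiling) in PLANS.items():
    print(f"{name}: ${effective_rate_per_million(usd, ceiling):.4f} per 1M tokens at full use")
```

At full use, all three plans come to roughly one cent per million tokens, with Max the lowest, well below the pay-as-you-go rates.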

MiniMax's landing page compares the $20 Plus plan directly with Claude Pro: "$20 = 10× Claude Pro. Same price, 10× the throughput."

Availability

The model is already available via:

MiniMax API (pay-as-you-go)
MiniMax Code (their agent product built specifically for M3)
Token Plan subscriptions
Third-party providers including Fireworks AI, Together AI, and Novita (per Hugging Face data)
Hugging Face as open weights (MiniMaxAI/MiniMax-M3)

On Hugging Face, M3 has already accumulated over 1,000 downloads since its June 2 upload. Among the listed providers, Fireworks AI offers the fastest throughput (~131 tokens/s) and Together AI the cheapest output pricing ($1.20 per million output tokens).

The Open-Weight Advantage

Perhaps the most significant aspect of M3 is its open-weight release. The weights are already on Hugging Face at MiniMaxAI/MiniMax-M3, and MiniMax has promised to publish the weights and a technical report on GitHub within approximately 10 days of launch.

This is a big deal for several reasons:

Self-hosting: Teams can deploy M3 on their own infrastructure without per-token API costs
Fine-tuning: The open weights enable domain-specific customization
Research transparency: The MSA architecture can be studied and potentially improved by the community
No vendor lock-in: Unlike closed models, you can switch providers freely

However, a note of caution: "open-weight" does not necessarily mean "open source." One source noted that the license terms were not published at launch, so the model's open-source status depends on the final license. The Hugging Face model card does reference MiniMax's standard license structure.

How It Stacks Up Against the Competition

Here's how M3 positions against the current frontier:

Dimension	MiniMax M3	Claude Opus 4.7	GPT-5.5	Gemini 3.1 Pro
Context	1M	1M	~1.05M	2M+
SWE-Bench Pro	59.0%	Higher (M3 approaches it, per MiniMax)	Below M3 (per MiniMax)	Below M3 (per MiniMax)
Multimodal	Native (text+image+video)	Text+Image	Text+Image+Audio	Native (all)
Open-Weight	✅ Yes	❌ No	❌ No	❌ No
Price/1M input	$0.30 (launch)	~$5.00	~$3.00	~$1.50
Desktop Control	✅ Yes	❌ No	❌ No	❌ No
Deployment	Self-host or API	API only	API only	API only

M3's differentiation is clear: it's the only model in this tier that can be self-hosted while delivering competitive benchmark scores. It trades absolute top performance (Opus 4.7 still leads on some metrics) for an open deployment model at a fraction of the cost.

Conclusion and What It Means for Developers

MiniMax M3 represents a genuine inflection point in the open-weight AI landscape. For the first time, developers and teams have access to a model that:

Competes with frontier closed models on coding and agentic benchmarks
Offers a 1M context window with MSA architecture that makes it actually usable (not just a spec sheet number)
Handles text, image, and video natively without bolt-on vision modules
Can be self-hosted with open weights
Costs roughly 15-17× less than Claude Opus 4.7 at launch-discount rates

The real-world demos (the paper reproduction, the 24-hour CUDA optimization, the model-training task) are what separate M3 from the hype. They are vendor-selected, but unlike static benchmarks they are genuine stress tests of long-horizon autonomous capability.

Is it the best model in every category? No. Opus 4.7 and GPT-5.5 both outscore it on PostTrainBench, and Opus 4.7 likely still leads on general reasoning quality. GPT-5.5 has deeper ecosystem integration. Gemini has Google's infrastructure muscle.

But M3 is the most accessible frontier-level model available today. For teams that need production-grade coding agents, long-context document analysis, or multimodal workflows without the per-token pricing anxiety of closed APIs — M3 is the model to beat.

And that's the real story: the open-weight frontier just caught up.

Published June 13, 2026