What hardware you need for MiniMax-M2.7 230B (A10B) in 4-bit

Running MiniMax-M2.7 230B locally requires well over 100 GB of VRAM even with 4-bit quantization, so a dual high-end GPU setup is the practical baseline today. This article shows real VRAM usage and performance from a dual RTX Pro 6000 Blackwell system using MXFP4 quantization, with a focus on hardware limits and inference speed.

Test setup and model details

This test focuses on hardware behavior rather than model quality, with the goal of measuring VRAM use and inference speed under realistic conditions. The system uses an AMD EPYC 9534 processor with 64 cores and 256 GB of system memory, installed on a GENOA2D24G-2L motherboard with PCIe 5.0 x16 support. The GPU configuration includes two RTX Pro 6000 Blackwell cards, each with 96 GB of VRAM.

The software stack is stable and current. The system runs Debian 12 (bookworm), with CUDA 12.8 and NVIDIA driver version 590.55.01. All tests were executed using llama.cpp build 8809.

The model used is MiniMax-M2.7 230B A10B, quantized with MXFP4 from Unsloth AI. The quantized model size ranges between roughly 126 GB and 136 GB depending on format. For reference, the full unquantized model is about 457 GB, which makes local deployment impractical without aggressive compression.
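The sizes above can be sanity-checked with simple arithmetic: parameter count times bits per weight, divided by eight. The sketch below is only an estimate. It assumes MXFP4 averages about 4.25 bits per weight (4-bit values plus one shared 8-bit scale per block of 32 weights) and uses 16 bits per weight as the unquantized reference; neither figure comes from the test itself.

```python
# Rough weight-size estimate: parameters x bits per weight / 8.
# Assumption: MXFP4 stores 4-bit values plus one shared 8-bit scale per
# block of 32 weights, i.e. about 4.25 bits per weight on average.

PARAMS = 230e9  # total parameter count, taken from the model name


def weights_gb(params: float, bits_per_weight: float) -> float:
    """Return the approximate weight size in decimal gigabytes."""
    return params * bits_per_weight / 8 / 1e9


bf16_gb = weights_gb(PARAMS, 16)     # 16-bit reference
mxfp4_gb = weights_gb(PARAMS, 4.25)  # pure MXFP4, no higher-precision tensors

print(f"16-bit: {bf16_gb:.0f} GB")
print(f"MXFP4:  {mxfp4_gb:.0f} GB")
```

The 16-bit estimate lands close to the 457 GB quoted above. The MXFP4 estimate comes out somewhat below the 126 GB to 136 GB range, which would be consistent with some tensors being kept at higher precision, although that explanation is an assumption rather than something measured here.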

Why 4-bit is the practical baseline

4-bit quantization is the most practical balance between memory use and output quality for a model of this size. It reduces the footprint enough to fit across two 96 GB GPUs while still maintaining stable inference behavior.

Lower-bit formats can reduce memory even further. For example, a 3-bit variant such as UD-Q3_K_XL sits close to 102 GB and can fit into systems with large unified memory pools, like Apple Silicon, Strix Halo or DGX Spark. However, this article focuses on GPU-based inference where stability and predictable scaling matter more than absolute compression.

Because the model itself is very large, even aggressive quantization can still deliver usable output. In this setup, the limiting factor is not precision but available VRAM.

VRAM usage across context sizes

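VRAM usage grows steadily with context size, at roughly 0.25 GB per additional 1k tokens in these measurements. Because the tested context sizes roughly double at each step, the absolute jumps become much larger beyond 32k tokens. The distribution across the two GPUs remains fairly balanced, but total consumption rises quickly as longer contexts are used.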
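The article does not describe how the per-GPU figures were collected. One common approach, shown here purely as an assumption, is to poll nvidia-smi while the model is loaded and parse its CSV output. The function names below are illustrative.

```python
import subprocess

# nvidia-smi query that prints one "index, used, total" line per GPU, in MiB.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text: str) -> dict[int, tuple[int, int]]:
    """Parse 'index, used, total' lines into {index: (used_mib, total_mib)}."""
    result = {}
    for line in csv_text.strip().splitlines():
        index, used, total = (int(field) for field in line.split(","))
        result[index] = (used, total)
    return result


def read_gpu_memory() -> dict[int, tuple[int, int]]:
    """Run nvidia-smi once and return per-GPU memory use in MiB."""
    output = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_memory(output.stdout)
```

Calling read_gpu_memory() after the model has processed a prompt of the target length gives one data point per context size. Note that nvidia-smi reports MiB, so the values need converting before they are compared with figures in GB.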

VRAM usage table

Context	GPU0 (GB)	GPU1 (GB)	Total (GB)
4k	66	63	129
8k	67	63	130
16k	68	64	132
32k	70	66	136
64k	74	70	144
128k	83	78	161
200k	91	86	177

At lower context sizes, memory usage stays well within limits: at 4k the two cards use 66 GB and 63 GB of their 96 GB each. By the time the system reaches 200k tokens, they use 91 GB and 86 GB, so both GPUs operate close to their maximum capacity.
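To put a number on the growth, a straight line can be fitted through the measured totals. The sketch below uses ordinary least squares on the table values, with context expressed in thousands of tokens exactly as the table labels it. Treating the relationship as linear is an assumption that happens to match these seven points well; it is not a statement about how llama.cpp allocates memory.

```python
# Least-squares line through the measured totals from the VRAM usage table.
# Context is in thousands of tokens, as labelled in the table.
context_k = [4, 8, 16, 32, 64, 128, 200]
total_gb = [129, 130, 132, 136, 144, 161, 177]

n = len(context_k)
mean_x = sum(context_k) / n
mean_y = sum(total_gb) / n

# slope = covariance(x, y) / variance(x)
covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(context_k, total_gb))
variance = sum((x - mean_x) ** 2 for x in context_k)
slope = covariance / variance          # GB per 1k tokens of context
intercept = mean_y - slope * mean_x    # GB at zero context (weights + overhead)


def estimate_total_gb(ctx_k: float) -> float:
    """Estimate total VRAM in GB for a context of ctx_k thousand tokens."""
    return intercept + slope * ctx_k


print(f"growth:   {slope:.3f} GB per 1k tokens")
print(f"baseline: {intercept:.1f} GB")
```

The fit suggests roughly a quarter of a gigabyte per additional 1k tokens on top of a baseline of about 128 GB. estimate_total_gb can then be used to interpolate between the tested context sizes, though estimates outside the measured range should be treated with caution.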

Inference performance (tokens per second)

Inference performance declines as context size increases, and the drop becomes more noticeable at higher ranges. This behavior is expected, but the scale of the slowdown is important for real-world use.

Benchmark table

Context	Prompt Processing (t/s)	Token Generation (t/s)
4k	3294.33	108.27
8k	3042.47	101.76
16k	2663.30	89.68
32k	1947.89	72.71
64k	918.38	52.75
128k	456.61	34.26
200k	315.32	25.35

Prompt processing remains fast at smaller contexts but decreases sharply as context grows, falling by roughly a factor of ten between 4k and 200k. Token generation follows the same pattern with a smaller drop of about a factor of four, and it remains usable even at very large context sizes.
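Throughput figures are easier to judge when converted into waiting time. The sketch below does that with the benchmark values, under two assumptions that the article does not confirm: that "k" means 1024 tokens, and that the prompt processing figure is the average speed over the whole prompt.

```python
# Benchmark table: context (k tokens) -> (prompt t/s, generation t/s)
BENCH = {
    4: (3294.33, 108.27),
    8: (3042.47, 101.76),
    16: (2663.30, 89.68),
    32: (1947.89, 72.71),
    64: (918.38, 52.75),
    128: (456.61, 34.26),
    200: (315.32, 25.35),
}

TOKENS_PER_K = 1024  # assumption: "4k" means 4096 tokens


def prompt_seconds(context_k: int) -> float:
    """Time to process a full prompt of context_k thousand tokens."""
    prompt_tps, _ = BENCH[context_k]
    return context_k * TOKENS_PER_K / prompt_tps


def generation_seconds(context_k: int, new_tokens: int) -> float:
    """Time to generate new_tokens at the speed measured for this context."""
    _, generation_tps = BENCH[context_k]
    return new_tokens / generation_tps


for ctx in BENCH:
    print(f"{ctx:>3}k: prompt {prompt_seconds(ctx):7.1f} s, "
          f"500 new tokens {generation_seconds(ctx, 500):5.1f} s")
```

Under these assumptions a 64k prompt takes a little over a minute to process, and the largest contexts take several minutes before the first output token appears.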

What this means for real hardware

This setup shows that two 96 GB GPUs are sufficient to run the model in 4-bit quantization, but the margin is not large. The system handles moderate context sizes without issues, yet higher contexts push both GPUs close to full utilization.

A single 96 GB GPU cannot support this model at this quantization level because the measured footprint is already 129 GB at 4k context. Systems with around 128 GB of total VRAM have essentially no headroom and struggle once context size increases, which limits their usefulness for extended workloads.

In practice, systems in the 192 GB VRAM class provide the necessary headroom for stable operation. While PCIe 5.0 improves data movement, overall performance in this scenario depends far more on total available VRAM than on bandwidth.
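The headroom argument can be made concrete by subtracting the measured usage from the capacity of each card. The sketch below uses the nominal 96 GB per GPU; the usable amount is typically slightly lower, so the results are optimistic. The helper names are illustrative.

```python
GPU_CAPACITY_GB = 96  # nominal capacity per card; usable VRAM is slightly lower

# VRAM usage table: context (k tokens) -> (GPU0 GB, GPU1 GB)
USAGE = {
    4: (66, 63),
    8: (67, 63),
    16: (68, 64),
    32: (70, 66),
    64: (74, 70),
    128: (83, 78),
    200: (91, 86),
}


def headroom_gb(context_k: int) -> tuple[int, ...]:
    """Free VRAM per GPU at the given context size."""
    return tuple(GPU_CAPACITY_GB - used for used in USAGE[context_k])


def largest_context_with_margin(min_free_gb: int):
    """Largest tested context that leaves min_free_gb free on every GPU."""
    fitting = [ctx for ctx in USAGE if min(headroom_gb(ctx)) >= min_free_gb]
    return max(fitting) if fitting else None


for ctx in USAGE:
    print(f"{ctx:>3}k: free per GPU = {headroom_gb(ctx)}")
```

The fuller card is the one that matters, since an allocation fails as soon as either GPU runs out. For example, asking for at least 10 GB free on every card limits this setup to the 128k row of the table.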

Practical context limits

The most efficient operating range for this setup lies between 4k and 32k context. Within this range, the system maintains strong performance and stable memory usage, which makes it suitable for most tasks, such as agentic tool use. At 64k the cost becomes noticeable: processing a prompt of that size takes a little over a minute.

At 128k context and beyond, memory pressure increases significantly and inference speed drops: prompt processing falls to roughly 460 t/s at 128k and 315 t/s at 200k, while token generation slows to about 34 t/s and 25 t/s. Performance remains usable, but this range is best reserved for specific tasks that do not require very fast responses.

Final thoughts

MiniMax-M2.7 230B remains a demanding model even in 4-bit form, and running it locally requires careful hardware planning. The dual RTX Pro 6000 Blackwell setup demonstrates that it is possible to achieve stable inference, but it also highlights how quickly memory constraints appear as context grows.

The results show that 4-bit quantization enables practical deployment, yet VRAM remains the defining constraint. Context size has a strong impact on both memory use and performance.