
Apple M5 Max for Local LLMs: First Benchmarks vs RTX Pro 6000 and RTX 5090

Chavy Levi • Mar 11, 2026 at 4:08am PDT



Early benchmarks for the Apple M5 Max are starting to appear. Reddit user cryingneko recently posted the first local LLM tests on a 14-inch MacBook Pro with the M5 Max, configured with 128GB unified memory, 18-core CPU, 40-core GPU, and 614 GB/s memory bandwidth.

For local LLM enthusiasts this configuration is interesting for one reason: memory capacity. With 128GB of unified memory, the system can load models that normally require large GPU setups. The question is how fast it actually runs them.

The tests were run with mlx_lm, Apple’s MLX inference stack. Four models were tested:

Qwen3.5-122B-A10B in 4-bit quantization (69.6 GB)
Qwen3-Coder-Next in 8-bit quantization (84.7 GB)
Qwen3.5-27B in 6-bit quantization (26 GB)
gpt-oss-120b-MXFP4 in 8-bit quantization (64 GB)

All models fit comfortably inside the 128GB memory budget, leaving room for large context windows.
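As a quick sanity check on that claim, the snippet below subtracts each listed model size from the 128GB pool. It is only a rough sketch: it uses the on-disk sizes quoted above and ignores the operating system, the KV cache, and any other overhead, so real headroom will be smaller.

```python
# Unified memory of the test machine and model sizes (GB), as listed above.
UNIFIED_MEMORY_GB = 128.0

MODEL_SIZES_GB = {
    "Qwen3.5-122B-A10B (4-bit)": 69.6,
    "Qwen3-Coder-Next (8-bit)": 84.7,
    "Qwen3.5-27B (6-bit)": 26.0,
    "gpt-oss-120b": 64.0,
}


def headroom_gb(model_gb, total_gb=UNIFIED_MEMORY_GB):
    """Memory left after loading the weights (ignores OS and cache overhead)."""
    return total_gb - model_gb


for name, size in MODEL_SIZES_GB.items():
    print(f"{name}: {headroom_gb(size):.1f} GB left after loading weights")
```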

To see where this hardware stands, we compare the results with our own benchmarks of an RTX Pro 6000 Blackwell 96GB and an RTX 5090. The RTX Pro 6000 is currently the only single workstation GPU capable of running the bigger models fully in VRAM.

Test System

Apple M5 Max MacBook Pro (14-inch)

Unified memory: 128GB
Memory bandwidth: 614 GB/s
CPU: 18 cores
GPU: 40 cores

Inference backend: MLX (mlx_lm)

Qwen3.5-122B-A10B Performance

This is the most interesting test because it shows how a laptop handles a 122B mixture-of-experts model.

Context	Device	Prompt Processing (t/s)	Generation (t/s)	GPU Faster
4K	M5 Max	881	65.9
4K	RTX Pro 6000	3055	98.4	3.47× prompt / 49% gen
16K	M5 Max	1239	60.6
16K	RTX Pro 6000	2836	93.7	2.29× prompt / 55% gen
32K	M5 Max	1068	54.9
32K	RTX Pro 6000	2582	91.3	2.42× prompt / 66% gen

The RTX Pro 6000 clearly leads in prompt processing, by roughly 2.3× to 3.5× depending on context length. Generation speed is closer: the GPU produces roughly 50–65 percent more tokens per second during decoding.
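The "GPU Faster" column can be reproduced directly from the throughput numbers. The short sketch below does that for the Qwen3.5-122B-A10B table; the function name is ours, chosen for illustration.

```python
# (prompt t/s, generation t/s) per context length, copied from the table above.
M5_MAX = {"4K": (881, 65.9), "16K": (1239, 60.6), "32K": (1068, 54.9)}
RTX_PRO_6000 = {"4K": (3055, 98.4), "16K": (2836, 93.7), "32K": (2582, 91.3)}


def gpu_advantage(mac, gpu):
    """Return (prompt speedup as a ratio, generation advantage in percent)."""
    prompt_ratio = gpu[0] / mac[0]
    gen_percent = (gpu[1] / mac[1] - 1.0) * 100.0
    return prompt_ratio, gen_percent


for ctx in M5_MAX:
    ratio, pct = gpu_advantage(M5_MAX[ctx], RTX_PRO_6000[ctx])
    print(f"{ctx}: {ratio:.2f}x prompt, {pct:.0f}% faster generation")
```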

Still, the MacBook result is notable. Generating around 55–65 tokens per second on a 122B model on a laptop is already usable for local workflows.
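One way to put the generation numbers in context is a back-of-the-envelope bandwidth ceiling: if decoding is limited by memory bandwidth, tokens per second cannot exceed bandwidth divided by the bytes of weights read per token. The sketch below assumes about 10B active parameters per token (our reading of the "A10B" suffix, not a figure from the benchmark) at 4 bits each, and ignores KV cache reads and quantization metadata, so treat the result as a loose upper bound rather than a prediction.

```python
def decode_ceiling_tps(bandwidth_gb_s, active_params_billion, bits_per_weight):
    """Upper bound on tokens/s if every active weight is read once per token."""
    gb_read_per_token = active_params_billion * bits_per_weight / 8.0
    return bandwidth_gb_s / gb_read_per_token


# 614 GB/s is the M5 Max bandwidth quoted in the article.
# 10B active parameters is an assumption based on the model name.
ceiling = decode_ceiling_tps(bandwidth_gb_s=614, active_params_billion=10, bits_per_weight=4)
measured = 65.9  # M5 Max generation speed at 4K context, from the table above

print(f"ceiling ~{ceiling:.0f} t/s, measured {measured} t/s "
      f"({measured / ceiling:.0%} of the ceiling)")
```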

Memory usage peaked around 72–76GB, which leaves room for large context sizes. The model itself requires around 78GB with full context buffers, so the 128GB configuration can likely handle the full 256K context supported by the model.

Qwen3-Coder-Next 8-bit Performance

The second model tested was Qwen3-Coder-Next in 8-bit quantization. This model is larger on disk than the 122B MoE model due to the higher precision.

Context	Prompt Processing	Generation	Peak Memory
4K	754 t/s	79.3 t/s	87.1 GB
16K	1802 t/s	74.3 t/s	88.1 GB
32K	1887 t/s	68.6 t/s	89.6 GB
64K	1432 t/s	48.2 t/s	92.6 GB

Generation speed stays in the 48–79 tokens/s range, depending on context length. Peak memory rises from 87.1GB at 4K to 92.6GB at 64K, which still fits within the 128GB unified pool.
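The peak memory figures grow almost linearly with context, so a simple least-squares line gives a rough idea of the cost per additional 1K tokens. The sketch below fits that line to the four rows above and extrapolates to a hypothetical 128K context. That extrapolation is our assumption for illustration only; it was not measured, and memory growth may not stay linear at longer contexts.

```python
# (context in thousands of tokens, peak memory in GB) from the table above.
SAMPLES = [(4, 87.1), (16, 88.1), (32, 89.6), (64, 92.6)]


def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


slope, intercept = fit_line(SAMPLES)
estimate_128k = intercept + slope * 128  # hypothetical, not a measured value

print(f"~{slope * 1000:.0f} MB per extra 1K tokens of context")
print(f"extrapolated peak at 128K: ~{estimate_128k:.1f} GB")
```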

Unfortunately, direct RTX Pro 6000 results for this exact quantization are not currently available in our LLM GPU benchmark dataset, but based on the previous comparison we would expect the GPU to lead primarily in prompt ingestion.

gpt-oss-120B (Q8)

This model is a good comparison point because it fits comfortably in both the M5 Max 128GB memory pool and the 96GB VRAM of the RTX Pro 6000.

Context	Device	Prompt Processing	Generation	GPU Faster
4K	M5 Max	1325 t/s	87.9 t/s
4K	RTX Pro 6000	6512 t/s	221.3 t/s	4.9× prompt / 2.5× gen
16K	M5 Max	2710 t/s	76.0 t/s
16K	RTX Pro 6000	5721 t/s	193.2 t/s	2.1× prompt / 2.5× gen
32K	M5 Max	2537 t/s	64.5 t/s
32K	RTX Pro 6000	5000 t/s	178.5 t/s	2.0× prompt / 2.8× gen

The RTX Pro 6000 is clearly ahead in both prompt ingestion and token generation. Generation throughput is roughly 2.5 to 2.8 times higher on the GPU.

However, the M5 Max still reaches 65 to 88 tokens per second depending on context size, which is usable for interactive inference on a 120B model.

Qwen3.5-27B Distilled (6-bit)

This smaller model highlights how GPUs scale with lighter workloads. Here we compare the M5 Max with an RTX 5090 32GB running a similarly sized model.

Context	Device	Prompt Processing	Generation	GPU Faster
4K	M5 Max	811 t/s	23.6 t/s
4K	RTX 5090	2884 t/s	49.1 t/s	3.6× prompt / 2.1× gen
16K	M5 Max	686 t/s	20.3 t/s
16K	RTX 5090	2657 t/s	47.0 t/s	3.9× prompt / 2.3× gen
32K	M5 Max	591 t/s	14.9 t/s
32K	RTX 5090	2297 t/s	44.7 t/s	3.9× prompt / 3.0× gen

The RTX 5090 shows a large advantage with the dense model. Prompt processing is almost 4 times faster, while token generation is roughly 2 to 3 times faster.
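The generation gap widens with context because the two devices lose speed at different rates. As a small illustration, the sketch below computes how much of the 4K generation speed each device retains at 32K, using only the numbers in the table above.

```python
# Generation speed (t/s) by context length, from the table above.
GEN_TPS = {
    "M5 Max": {"4K": 23.6, "16K": 20.3, "32K": 14.9},
    "RTX 5090": {"4K": 49.1, "16K": 47.0, "32K": 44.7},
}


def retained(speeds, base="4K", target="32K"):
    """Fraction of the base-context generation speed still available at target."""
    return speeds[target] / speeds[base]


for device, speeds in GEN_TPS.items():
    print(f"{device}: keeps {retained(speeds):.0%} of its 4K speed at 32K")
```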

The RTX 5090 is also limited by context size. In our benchmark it was not able to load a 64K context in llama.cpp, while the M5 Max handled 64K without issues thanks to its larger unified memory pool.

Performance Per Dollar

This is where things become interesting for enthusiasts.

A 16-inch MacBook Pro with the M5 Max, 128GB of memory, and 2TB of storage costs about $5,099.

The RTX Pro 6000 Blackwell workstation GPU costs around $8800, and that is only the GPU. A full system will cost significantly more.

The dedicated GPU is still faster, especially in prompt processing and large batch workloads. But the cost difference is large.

From a value perspective the MacBook offers something unusual: the ability to run 120B class models locally on a single portable system, now with even better prompt processing speed.
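To make the comparison concrete, the sketch below computes generation tokens per second per $1,000 at 4K context, using the prices and benchmark numbers quoted in this article. The `host_system_usd` parameter is a hypothetical placeholder for the cost of the PC built around the GPU, which we do not have a figure for. With it left at zero the GPU side is flattered, and the outcome differs between the two models, so read this as a rough guide rather than a verdict.

```python
M5_MAX_PRICE_USD = 5099        # complete laptop, price from the article
RTX_PRO_6000_PRICE_USD = 8800  # GPU only, price from the article

# Generation speed (t/s) at 4K context: (M5 Max, RTX Pro 6000), from the tables above.
GEN_4K = {
    "Qwen3.5-122B-A10B": (65.9, 98.4),
    "gpt-oss-120B": (87.9, 221.3),
}


def tps_per_1000_usd(tokens_per_s, price_usd):
    """Generation tokens/s bought per 1000 USD of hardware."""
    return tokens_per_s / price_usd * 1000.0


def compare(host_system_usd=0.0):
    """Return {model: (mac_value, gpu_value)} in t/s per 1000 USD.

    host_system_usd is a hypothetical cost for the PC around the GPU.
    """
    gpu_total = RTX_PRO_6000_PRICE_USD + host_system_usd
    return {
        model: (tps_per_1000_usd(mac, M5_MAX_PRICE_USD),
                tps_per_1000_usd(gpu, gpu_total))
        for model, (mac, gpu) in GEN_4K.items()
    }


for model, (mac_value, gpu_value) in compare().items():
    print(f"{model}: M5 Max {mac_value:.1f}, RTX Pro 6000 {gpu_value:.1f} t/s per $1000")
```

Passing a non-zero `host_system_usd` lowers the GPU figure, which is the more realistic comparison once a full workstation is priced in.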

M5 Max vs Previous Apple Chips

The early impression from these tests is that prompt processing improved noticeably compared to the M4 Max. That is likely due to both GPU improvements and memory subsystem changes.

Apple’s MLX stack also continues to mature, which affects real-world performance.

Final Thoughts

The early benchmarks suggest that the M5 Max is a meaningful upgrade for local LLM inference on Apple Silicon.

Workstation-class GPU hardware still dominates raw throughput. The RTX Pro 6000 remains significantly faster and has stronger CUDA ecosystem support.

For local LLM enthusiasts the main takeaway is simple. The M5 Max is not replacing high-end GPUs, but it continues the trend of making large model inference possible on compact single-machine systems.