Apple M5 Max for Local LLMs: First Benchmarks vs RTX Pro 6000 and RTX 5090
Early benchmarks for the Apple M5 Max are starting to appear. Reddit user cryingneko recently posted the first local LLM tests on a 14-inch MacBook Pro with the M5 Max, configured with 128GB unified memory, 18-core CPU, 40-core GPU, and 614 GB/s memory bandwidth.
For local LLM enthusiasts this configuration is interesting for one reason: memory capacity. With 128GB of unified memory, the system can load models that normally require large GPU setups. The question is how fast it actually runs them.
The tests were performed with mlx_lm, Apple’s MLX inference stack. Four models were tested:
- Qwen3.5-122B-A10B in 4-bit quantization (69.6 GB)
- Qwen3-Coder-Next in 8-bit quantization (84.7 GB)
- Qwen3.5-27B in 6-bit quantization (26 GB)
- gpt-oss-120b-MXFP4 in 8-bit quantization (64 GB)
All models fit comfortably inside the 128GB memory budget, leaving room for large context windows.
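As a rough illustration of that headroom, the sketch below subtracts each model's listed weight size from the 128GB pool. It is a simplification: it ignores memory used by the operating system and other processes, and the helper name `headroom_gb` is ours, not part of any tool used in the tests.

```python
# Weight sizes in GB as listed above; 128 GB is the unified memory of the test machine.
UNIFIED_MEMORY_GB = 128.0

MODEL_SIZES_GB = {
    "Qwen3.5-122B-A10B (4-bit)": 69.6,
    "Qwen3-Coder-Next (8-bit)": 84.7,
    "Qwen3.5-27B (6-bit)": 26.0,
    "gpt-oss-120b-MXFP4": 64.0,
}


def headroom_gb(weights_gb, total_gb=UNIFIED_MEMORY_GB):
    """Memory left for KV cache and everything else after loading the weights.

    Assumption: nothing else is resident, which is optimistic on a real system.
    """
    return total_gb - weights_gb


for name, size_gb in MODEL_SIZES_GB.items():
    print(f"{name}: {headroom_gb(size_gb):.1f} GB left after weights")
```

Even the largest model in the list leaves more than 40GB on paper, which is the space the context window has to share with the rest of the system.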
To understand where this hardware stands, we compare the results with our benchmarks from an RTX Pro 6000 Blackwell 96GB and our RTX 5090 tests. The RTX Pro 6000 is currently the only single workstation-class GPU available to consumers that can run the bigger models fully in VRAM.
Test System
Apple M5 Max MacBook Pro (14-inch)
Unified memory: 128GB
Memory bandwidth: 614 GB/s
CPU: 18 cores
GPU: 40 cores
Inference backend: MLX (mlx_lm)
Qwen3.5-122B-A10B Performance
This is the most interesting test because it shows how a laptop handles a 122B mixture-of-experts model.
| Context | Device | Prompt Processing (t/s) | Generation (t/s) | GPU Faster |
|---|---|---|---|---|
| 4K | M5 Max | 881 | 65.9 | |
| 4K | RTX Pro 6000 | 3055 | 98.4 | 3.47× prompt / 49% gen |
| 16K | M5 Max | 1239 | 60.6 | |
| 16K | RTX Pro 6000 | 2836 | 93.7 | 2.29× prompt / 55% gen |
| 32K | M5 Max | 1068 | 54.9 | |
| 32K | RTX Pro 6000 | 2582 | 91.3 | 2.42× prompt / 66% gen |
The RTX Pro 6000 clearly leads in prompt processing. Generation speed is closer. The GPU produces roughly 50–65 percent more tokens per second during decoding.
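For readers who want to check the ratios, here is a minimal sketch that recomputes them from the throughput figures in the table. The function names are illustrative; the numbers are copied from the rows above.

```python
# (prompt t/s, generation t/s) per context length, copied from the table above.
M5_MAX = {"4K": (881, 65.9), "16K": (1239, 60.6), "32K": (1068, 54.9)}
RTX_PRO_6000 = {"4K": (3055, 98.4), "16K": (2836, 93.7), "32K": (2582, 91.3)}


def speedup(gpu_tps, mac_tps):
    """How many times faster the GPU is (2.0 means twice as fast)."""
    return gpu_tps / mac_tps


def percent_faster(gpu_tps, mac_tps):
    """The same comparison expressed as a percentage advantage."""
    return (gpu_tps / mac_tps - 1.0) * 100.0


for ctx in M5_MAX:
    mac_prompt, mac_gen = M5_MAX[ctx]
    gpu_prompt, gpu_gen = RTX_PRO_6000[ctx]
    print(
        f"{ctx}: {speedup(gpu_prompt, mac_prompt):.2f}x prompt, "
        f"{percent_faster(gpu_gen, mac_gen):.0f}% faster generation"
    )
```

The prompt-processing gap is expressed as a multiplier and the generation gap as a percentage, matching the last column of the table.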
Still, the MacBook result is notable. Generating around 55–65 tokens per second on a 122B model on a laptop is already usable for local workflows.
Memory usage peaked around 72–76GB, which leaves room for large context sizes. The model itself requires around 78GB with full context buffers, so the 128GB configuration can likely handle the full 256K context supported by the model.
Qwen3-Coder-Next 8bit Performance
The second model tested was Qwen3-Coder-Next in 8-bit quantization. This model is larger on disk than the 122B MoE model due to the higher precision.
| Context | Prompt Processing | Generation | Peak Memory |
|---|---|---|---|
| 4K | 754 t/s | 79.3 t/s | 87.1 GB |
| 16K | 1802 t/s | 74.3 t/s | 88.1 GB |
| 32K | 1887 t/s | 68.6 t/s | 89.6 GB |
| 64K | 1432 t/s | 48.2 t/s | 92.6 GB |
Generation speed stays in the 48–79 tokens/s range, depending on context length. Peak memory rises from 87.1 GB at 4K to 92.6 GB at 64K, which still fits within the 128GB unified pool.
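The peak-memory column also gives a rough idea of how much each extra block of context costs. The sketch below divides the memory increase between consecutive rows by the increase in context length. Treat it as an estimate only: peak memory includes more than the KV cache, and `growth_per_1k` is a helper name introduced here for illustration.

```python
# (context length in K tokens, peak memory in GB), copied from the table above.
PEAK_MEMORY_GB = [(4, 87.1), (16, 88.1), (32, 89.6), (64, 92.6)]


def growth_per_1k(points):
    """GB of additional peak memory per extra 1K tokens, for each consecutive pair of rows."""
    rates = []
    for (ctx_a, mem_a), (ctx_b, mem_b) in zip(points, points[1:]):
        rates.append((mem_b - mem_a) / (ctx_b - ctx_a))
    return rates


for (ctx_a, _), (ctx_b, _), rate in zip(
    PEAK_MEMORY_GB, PEAK_MEMORY_GB[1:], growth_per_1k(PEAK_MEMORY_GB)
):
    print(f"{ctx_a}K -> {ctx_b}K: about {rate * 1000:.0f} MB per extra 1K tokens")
```

The three intervals come out close to each other, at a little under 0.1 GB per 1K tokens, which suggests memory grows roughly linearly with context over the measured range.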
Unfortunately, direct RTX Pro 6000 results for this exact quantization are not currently available in our LLM GPU benchmark dataset, but based on the previous comparison we would expect the GPU to lead primarily in prompt ingestion.
gpt-oss-120B (Q8)
This model is a good comparison point because it fits comfortably in both the M5 Max 128GB memory pool and the 96GB VRAM of the RTX Pro 6000.
| Context | Device | Prompt Processing | Generation | GPU Faster |
|---|---|---|---|---|
| 4K | M5 Max | 1325 t/s | 87.9 t/s | |
| 4K | RTX Pro 6000 | 6512 t/s | 221.3 t/s | 4.9× prompt / 2.5× gen |
| 16K | M5 Max | 2710 t/s | 76.0 t/s | |
| 16K | RTX Pro 6000 | 5721 t/s | 193.2 t/s | 2.1× prompt / 2.5× gen |
| 32K | M5 Max | 2537 t/s | 64.5 t/s | |
| 32K | RTX Pro 6000 | 5000 t/s | 178.5 t/s | 2.0× prompt / 2.8× gen |
The RTX Pro 6000 is clearly ahead in both prompt ingestion and token generation. Generation throughput is roughly 2.5 to 3 times higher on the GPU.
However the M5 Max still reaches 65 to 88 tokens per second depending on context size, which is usable for interactive inference on a 120B model.
Qwen3.5-27B Distilled (6bit)
This smaller model highlights how GPUs scale with lighter workloads. Here we compare the M5 Max with an RTX 5090 32GB running a similarly sized model.
| Context | Device | Prompt Processing | Generation | GPU Faster |
|---|---|---|---|---|
| 4K | M5 Max | 811 t/s | 23.6 t/s | |
| 4K | RTX 5090 | 2884 t/s | 49.1 t/s | 3.6× prompt / 2.1× gen |
| 16K | M5 Max | 686 t/s | 20.3 t/s | |
| 16K | RTX 5090 | 2657 t/s | 47.0 t/s | 3.9× prompt / 2.3× gen |
| 32K | M5 Max | 591 t/s | 14.9 t/s | |
| 32K | RTX 5090 | 2297 t/s | 44.7 t/s | 3.9× prompt / 3.0× gen |
The RTX 5090 shows a large advantage with the dense model. Prompt processing is almost 4 times faster, while token generation is roughly 2 to 3 times faster.
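One way to see why a dense model is harder on the M5 Max is a common rule of thumb, which is our addition and not something the benchmark source states: during single-stream generation a dense model has to stream all of its weights from memory for every token, so memory bandwidth divided by weight size gives an approximate ceiling on tokens per second. The sketch below applies it to the M5 Max figures quoted in this article (614 GB/s, 26GB of weights). It ignores KV-cache reads and compute limits, and the 26GB figure is rounded.

```python
def decode_ceiling_tps(bandwidth_gb_s, weights_gb):
    """Rule-of-thumb upper bound on generation speed for a dense model.

    Assumption: every generated token reads all weights from memory exactly once,
    and nothing else (KV cache, activations) competes for bandwidth.
    """
    return bandwidth_gb_s / weights_gb


M5_MAX_BANDWIDTH_GB_S = 614.0   # from the test system specs
QWEN_27B_WEIGHTS_GB = 26.0      # 6-bit quantization, size as listed in the article
MEASURED_TPS_4K = 23.6          # M5 Max generation speed at 4K context

ceiling = decode_ceiling_tps(M5_MAX_BANDWIDTH_GB_S, QWEN_27B_WEIGHTS_GB)
print(f"Estimated ceiling: {ceiling:.1f} t/s, measured at 4K: {MEASURED_TPS_4K} t/s")
```

Under these assumptions the 4K result sits very close to the estimated ceiling, which would be consistent with generation on the dense model being limited by memory bandwidth rather than compute. The mixture-of-experts models earlier in the article activate only part of their weights per token, so the same estimate does not apply to them directly.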
Another limitation is context size. The RTX 5090 benchmark was not able to load 64K context in llama.cpp, while the M5 Max handled 64K without issues due to its larger unified memory pool.
Performance Per Dollar
This is where things become interesting for enthusiasts.
A MacBook Pro 16 with M5 Max, 128GB memory and 2TB storage costs about $5099.
The RTX Pro 6000 Blackwell workstation GPU costs around $8800, and that is only the GPU. A full system will cost significantly more.
The dedicated GPU is still faster, especially in prompt processing and large batch workloads. But the cost difference is large.
From a value perspective the MacBook offers something unusual: the ability to run 120B class models locally on a single portable system, now with even better prompt processing speed.
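To put numbers on this, the sketch below divides generation throughput at 4K context by the prices quoted above. It is deliberately crude: the RTX Pro 6000 price covers the GPU only, so its figure is optimistic, and the metric `tps_per_1000_usd` is our own illustration, not a standard benchmark.

```python
# Prices as quoted in this section (USD). The GPU price excludes the rest of the system.
PRICES_USD = {"M5 Max MacBook Pro": 5099, "RTX Pro 6000 (GPU only)": 8800}

# Generation speed at 4K context (t/s), copied from the tables earlier in the article.
GEN_TPS_4K = {
    "Qwen3.5-122B-A10B": {"M5 Max MacBook Pro": 65.9, "RTX Pro 6000 (GPU only)": 98.4},
    "gpt-oss-120b": {"M5 Max MacBook Pro": 87.9, "RTX Pro 6000 (GPU only)": 221.3},
}


def tps_per_1000_usd(tps, price_usd):
    """Generation tokens per second bought by each 1000 USD of hardware."""
    return tps / price_usd * 1000.0


for model, results in GEN_TPS_4K.items():
    for device, tps in results.items():
        value = tps_per_1000_usd(tps, PRICES_USD[device])
        print(f"{model} on {device}: {value:.1f} t/s per $1000")
```

The outcome depends on the model and on how much the rest of the workstation costs, so the figures are best read as a starting point for your own comparison rather than a verdict.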
M5 Max vs Previous Apple Chips
The early impression from these tests is that prompt processing improved noticeably compared to the M4 Max. That is likely due to both GPU improvements and memory subsystem changes.
Apple’s MLX stack also continues to mature, which affects real-world performance.
Final Thoughts
The early benchmarks suggest that the M5 Max is a meaningful upgrade for local LLM inference on Apple Silicon.
Workstation-class GPU hardware still dominates raw throughput. The RTX Pro 6000 remains significantly faster and has stronger CUDA ecosystem support.
For local LLM enthusiasts the main takeaway is simple. The M5 Max is not replacing high-end GPUs, but it continues the trend of making large model inference possible on compact single-machine systems.