GPT-OSS 120B: Offloading MoE Layers to CPU Boosts RTX 3090 and 5090 Performance

I’ve been testing the --n-cpu-moe flag in llama.cpp to see how much it improves performance with large Mixture of Experts (MoE) models. The standard method of splitting whole layers between the GPU and CPU can be slow for these models. This flag offers a more targeted approach: it moves just the expert weights of the selected layers to system RAM while keeping the more critical attention layers in VRAM.

After my recent tests with a triple RTX 3090 setup on GPT-OSS 120B, I decided to do this single RTX 3090 test to measure the difference on a more common configuration. I ran some benchmarks to see the real-world numbers.

RTX 3090 Performance with GPT-OSS 120B

My test system had a single NVIDIA RTX 3090 with 24 GB of VRAM, an AMD EPYC 7343 CPU, and 64 GB of DDR4 3200 MT/s system memory. The OS was Ubuntu 24.04 LTS, with the latest llama.cpp server as the inference engine and Open WebUI as the front end.

The Standard Approach: Simple Layer Splitting

When I tried to run the GPT-OSS 120B model with a standard layer split, I could only fit 13 of the model's 36 layers into VRAM. The performance was not practical for interactive use, with token generation speed below one token per second.

Context Length	Prompt Processing	Token Generation
71 tokens	42.50 t/s	0.90 t/s

The Smart Approach: Offloading MoE Layers to CPU

By changing my approach and using the --n-cpu-moe flag, I told llama.cpp to keep the main layers on the GPU (--n-gpu-layers 99) but keep the expert weights of 27 layers in system RAM (--n-cpu-moe 27). This kept VRAM usage just under the 24 GB limit.
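
As a rough sketch, the flag combination described above can be assembled like this. The script only prints the command and does not launch anything. The two flags and their values come from the text; the model path and the context size are assumptions borrowed from the RTX 5090 command later in this post, and moe_offload_cmd is a hypothetical helper name.

```shell
#!/bin/sh
# Print (not run) a llama-server command for a MoE offload run.
# --n-gpu-layers 99 and --n-cpu-moe come from the post; the model path
# and --ctx-size 8192 are assumptions taken from the RTX 5090 command.
moe_offload_cmd() {
  model=$1      # path to the GGUF file (assumed)
  n_cpu_moe=$2  # number of layers whose expert weights stay in system RAM
  printf './llama-server --model %s --ctx-size 8192 --n-gpu-layers 99 --n-cpu-moe %s\n' \
    "$model" "$n_cpu_moe"
}

# Single RTX 3090 run: experts of 27 layers kept in system RAM.
moe_offload_cmd models/gpt-oss-120b-F16.gguf 27
```

Printing the command first makes it easy to check the flags before committing to a model load that can take several minutes.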

Context Length	Prompt Processing	Token Generation
85 tokens	48.49 t/s	1.81 t/s
500 tokens	173.60 t/s	1.57 t/s
1400 tokens	229.07 t/s	1.61 t/s
5500 tokens	246.18 t/s	1.41 t/s

The results show token generation speed increased from 0.90 t/s to a more usable average of about 1.6 t/s across the four runs, roughly a 1.8x improvement. This makes the model functional on a single GPU where it wasn’t before.

RTX 5090 Performance with GPT-OSS 120B

Next, I tested the same model on a higher-end RTX 5090 to see if the MoE offloading technique is still beneficial with a more capable GPU.

Standard Layer Splitting

With its larger 32 GB of VRAM, the RTX 5090 held 17 layers of the model using a standard split. Performance was better than on the 3090, but a large part of the model was still running on the slower CPU and system RAM.

Context Length	Prompt Processing	Token Generation
86 tokens	81.95 t/s	3.40 t/s
5305 tokens	445.74 t/s	3.37 t/s

Offloading MoE Layers

Applying the same MoE offloading strategy (--n-gpu-layers 99 --n-cpu-moe 21), I saw another clear performance improvement.

The command I used to run llama-server:

./llama-server \
  --model /home/allan/llama.cpp/models/gpt-oss-120b-F16.gguf \
  --port 10000 \
  --ctx-size 8192 \
  --jinja \
  --flash-attn \
  --n-gpu-layers 99 \
  --n-cpu-moe 21

Context Length	Prompt Processing	Token Generation
87 tokens	84.91 t/s	9.60 t/s
5650 tokens	448.15 t/s	8.14 t/s

On this hardware, token generation speed increased from around 3.4 t/s to over 8 t/s. This shows that managing which parts of the model are in VRAM is a better strategy than simply offloading whole layers, even on more powerful GPUs.
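
To put the gains on both cards side by side, the token generation figures from the tables can be turned into speedup ratios. This is a small sketch using awk for the floating-point math; speedup is a hypothetical helper name, and the long-prompt comparison on the 5090 pairs two runs with slightly different prompt lengths (5305 and 5650 tokens).

```shell
#!/bin/sh
# Ratio of two token-generation rates, rounded to two decimals.
speedup() {
  awk -v before="$1" -v after="$2" 'BEGIN { printf "%.2f\n", after / before }'
}

# RTX 3090: 0.90 t/s with layer splitting vs. the mean of the four offload runs.
mean_3090=$(awk 'BEGIN { printf "%.2f", (1.81 + 1.57 + 1.61 + 1.41) / 4 }')
echo "RTX 3090: $(speedup 0.90 "$mean_3090")x"

# RTX 5090: short-prompt and long-prompt runs compared pairwise.
echo "RTX 5090 short prompt: $(speedup 3.40 9.60)x"
echo "RTX 5090 long prompt: $(speedup 3.37 8.14)x"
```

This works out to about 1.8x on the RTX 3090 and between 2.4x and 2.8x on the RTX 5090, so the relative gain was larger on the faster card in these runs.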

Key Takeaways and Practical Advice

My test results show that the --n-cpu-moe flag in llama.cpp is a useful tool for running large MoE models on systems with limited VRAM. It uses VRAM more efficiently than whole-layer splitting because the attention layers stay on the GPU and only expert weights move to system RAM.

For your own builds, my advice is to set the GPU layers to a maximum value like --n-gpu-layers 99 to prioritize non-expert layers for VRAM. Then, you can adjust the --n-cpu-moe value, starting high and lowering it until your VRAM is almost full. This should maximize GPU utilization for the parts of the model that benefit most from high memory bandwidth.
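
The start-high-and-lower procedure can be sketched as a loop. Everything here is illustrative: vram_used_mib and tune_n_cpu_moe are hypothetical helpers, and the formula inside vram_used_mib uses invented numbers purely so the loop runs without a GPU. They are not measurements of GPT-OSS 120B. On real hardware the probe would be replaced by launching llama-server with the candidate value and reading the usage from nvidia-smi.

```shell
#!/bin/sh
# Hypothetical probe: MiB of VRAM in use when N layers keep their experts on the CPU.
# The linear formula is an invented placeholder, not measured data.
# On real hardware, replace it with a llama-server launch followed by:
#   nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits
vram_used_mib() {
  echo $(( 65000 - $1 * 1700 ))
}

# Start high and lower --n-cpu-moe while the next lower value still fits the budget.
tune_n_cpu_moe() {
  n=$1            # starting --n-cpu-moe value (should already fit in VRAM)
  budget_mib=$2   # VRAM budget in MiB, set a little below the card's total
  while [ "$n" -gt 0 ] && [ "$(vram_used_mib $((n - 1)))" -le "$budget_mib" ]; do
    n=$((n - 1))
  done
  echo "$n"
}

# Example: start at 30 with a 23000 MiB budget on a 24 GB card.
tune_n_cpu_moe 30 23000
```

Setting the budget below the card's total leaves room for the KV cache, which grows with context length. The loop assumes the starting value already fits; if it does not, it simply returns that starting value.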

This technique also highlights the importance of fast system RAM. Since the expert weights are read from your DDR4 or DDR5 during generation, higher-bandwidth memory directly affects your token generation speed. This makes the CPU and motherboard choice, and especially the number of memory channels they provide, just as important as the GPU for these hybrid setups. From what I’ve seen, intelligently offloading MoE layers can let you run larger models than you might have thought possible on your hardware.
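
To make the bandwidth point concrete, theoretical peak memory bandwidth is the transfer rate times 8 bytes per 64-bit channel times the number of populated channels. The sketch below applies that formula to DDR4 3200 MT/s. The channel counts are assumptions for illustration; this post does not record how many channels were populated on the test system, and real-world throughput is lower than the theoretical peak. peak_bandwidth_gbs is a hypothetical helper name.

```shell
#!/bin/sh
# Theoretical peak bandwidth in GB/s:
# mega-transfers per second x 8 bytes per 64-bit channel x channels / 1000.
peak_bandwidth_gbs() {
  awk -v mts="$1" -v channels="$2" \
    'BEGIN { printf "%.1f\n", mts * 8 * channels / 1000 }'
}

peak_bandwidth_gbs 3200 2   # DDR4-3200, two channels (typical desktop board)
peak_bandwidth_gbs 3200 8   # DDR4-3200, eight channels (EPYC 7343 maximum)
```

The two calls print 51.2 and 204.8, a fourfold difference at the same memory speed, which is why channel count matters as much as the MT/s rating when the experts live in system RAM.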