Qwen3.5 27B in 4-bit fits on a 24 GB GPU with headroom up to roughly 86k context and sits right at the 24 GB limit at 131k, but it becomes memory-heavy at 262k. Qwen3.5 35B MoE in 4-bit is the more practical long-context model for 24 GB cards, and it generates tokens significantly faster despite having more total parameters. VRAM is still the main constraint, but memory bandwidth determines how responsive the model feels at 65k+ context.
This article is focused on local inference with quantized GGUF models using llama.cpp. No model quality claims. Only hardware behavior.
All tests were done on Ubuntu 24.04, CUDA 12.8, NVIDIA driver 590.48.01, llama.cpp build 8150, using llama-bench unless otherwise stated. Test CPU was AMD EPYC 7601 with 32 GB system RAM.
Model Architecture Differences: Dense vs MoE
Qwen3.5 27B is a dense model. Every token activates the full parameter set. This makes prompt processing slower and token generation bandwidth-bound.
Qwen3.5 35B MoE is a mixture-of-experts model. Only a subset of experts is active per token. Even though the total parameter count is higher, active compute per token is lower. In practice, 35B MoE is much faster than 27B dense at the same quantization level.
For local users running agents, long coding sessions, or 100k+ context, the 35B MoE architecture makes more sense per dollar.
Memory Requirements by Context Length (All Variants Compared)
This section determines your GPU or unified system choice more than anything else. Below is a single consolidated table comparing all three configurations:
- Qwen3.5 27B (4-bit, dense)
- Qwen3.5 35B MoE (4-bit)
- Qwen3.5 35B MoE (8-bit weights + 8-bit KV cache)
Measured VRAM Usage
| Context (tokens) | 27B Q4 (GB) | 35B Q4 (GB) | 35B Q8 + Q8 KV (GB) |
|---|---|---|---|
| 4k | 16 | 19 | 39 |
| 8k | 16 | 19 | 40 |
| 16k | 17 | 19 | 40 |
| 32k | 18 | 20 | 40 |
| 45k | 19 | 20 | 40 |
| 57k | 19 | 20 | 40 |
| 65k | 20 | 20 | 40 |
| 86k | 21 | 21 | 41 |
| 131k | 24 | 22 | 42 |
| 262k | 33 | 25 | 43 |
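To see how quickly each configuration grows at long context, the last two rows of the table can be turned into a rough growth rate. This is only a sketch on the rounded table values; the function name and the 1 GB = 1000 MB conversion are illustrative choices, not something measured in the benchmarks.

```python
# Measured VRAM (GB) copied from the table above, keyed by context in thousands of tokens.
VRAM_GB = {
    "27B Q4": {131: 24, 262: 33},
    "35B Q4": {131: 22, 262: 25},
    "35B Q8 + Q8 KV": {131: 42, 262: 43},
}


def gb_per_1k_tokens(points: dict, lo_k: int, hi_k: int) -> float:
    """Average VRAM growth per 1k tokens of context between two measured rows."""
    return (points[hi_k] - points[lo_k]) / (hi_k - lo_k)


for name, points in VRAM_GB.items():
    rate = gb_per_1k_tokens(points, 131, 262)
    # Convert GB to MB with a factor of 1000 purely for readability.
    print(f"{name}: about {rate * 1000:.0f} MB per 1k tokens between 131k and 262k")
```

Because the table is rounded to whole gigabytes, treat the output as an order-of-magnitude comparison between configurations rather than a precise per-token cost.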
Practical Interpretation
For 24 GB GPUs, the picture is clear.
Qwen3.5 27B dense fits with headroom up to about 86k context (21 GB). At 131k it measures 24 GB, which is right at the limit of a 24 GB card. At 262k it jumps to 33 GB, which moves you into unified memory, workstation, or multi-GPU territory.
Qwen3.5 35B MoE in 4-bit is more VRAM efficient at high context. Even 262k is around 25 GB. With KV cache quantization or --fit, it can be made to work on 24 GB consumer cards, though without much headroom.
The 8-bit 35B MoE configuration is workstation-class. You are realistically looking at 48 GB+ GPUs or multi-GPU setups. It is not a practical path for standard 24 GB cards.
For most price-conscious local builders, 35B MoE in 4-bit is the most flexible option across context lengths.
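The interpretation above can be checked mechanically against the measured table. The sketch below looks up the largest measured context that fits a given VRAM budget; the `headroom_gb` margin of 1 GB is a hypothetical safety allowance, not a value from the tests.

```python
# Measured VRAM in GB from the table above; context in thousands of tokens.
CONTEXTS_K = [4, 8, 16, 32, 45, 57, 65, 86, 131, 262]
VRAM_GB = {
    "27B Q4": [16, 16, 17, 18, 19, 19, 20, 21, 24, 33],
    "35B Q4": [19, 19, 19, 20, 20, 20, 20, 21, 22, 25],
    "35B Q8 + Q8 KV": [39, 40, 40, 40, 40, 40, 40, 41, 42, 43],
}


def largest_context_k(config, vram_gb, headroom_gb=1.0):
    """Largest measured context (in k tokens) that fits with headroom, or None."""
    fitting = [
        ctx
        for ctx, used in zip(CONTEXTS_K, VRAM_GB[config])
        if used + headroom_gb <= vram_gb
    ]
    return max(fitting) if fitting else None


for card_gb in (24, 32, 48, 96):
    for config in VRAM_GB:
        print(f"{card_gb} GB card, {config}: {largest_context_k(config, card_gb)}")
```

Setting `headroom_gb=0` shows what fits on paper; a nonzero margin is closer to practice, since the desktop, CUDA context, and compute buffers also need memory.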
GPU Benchmarks – Qwen3.5 27B Dense (Q4_K)
NVIDIA GeForce RTX 3090
Measured prompt processing (pp) and token generation (tg). Time to first token (TTFT) is approximated as:
TTFT ≈ Context / Prompt Processing t/s
| Context | Prompt t/s | Gen t/s | Approx TTFT |
|---|---|---|---|
| 4k | 1104 | 33.5 | 3.7 s |
| 16k | 977 | 32.3 | 16.3 s |
| 32k | 848 | 31.0 | 37.7 s |
| 65k | 678 | 28.8 | 95.9 s |
| 86k | 599 | 27.5 | 143.6 s |
Conclusion.
At 86k+, TTFT becomes slow on a 3090. Token generation is acceptable for interactive use, but prompt ingestion dominates latency.
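The TTFT approximation used in the table is simple enough to reproduce. This is a minimal sketch, assuming "k" means 1000 tokens; the article does not state whether the benchmark used 1000- or 1024-token multiples, so small differences from the table are expected.

```python
def approx_ttft_seconds(context_tokens: int, prompt_tps: float) -> float:
    """TTFT ~ context / prompt processing rate, as defined in the text."""
    if prompt_tps <= 0:
        raise ValueError("prompt_tps must be positive")
    return context_tokens / prompt_tps


# RTX 3090, 27B dense rows from the table above: (context tokens, prompt t/s).
# Assumption: each "k" is 1000 tokens.
rows = [(4_000, 1104), (16_000, 977), (32_000, 848), (65_000, 678), (86_000, 599)]

for ctx, pp in rows:
    print(f"{ctx:>6} tokens: {approx_ttft_seconds(ctx, pp):6.1f} s")
```

The estimate ignores model load time, tokenization, and any prompt caching, so it is best read as a lower bound for a cold prompt.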
NVIDIA GeForce RTX 5090
| Context | Prompt t/s | Gen t/s | Approx TTFT |
|---|---|---|---|
| 4k | 3004 | 58.8 | 1.3 s |
| 16k | 2721 | 55.8 | 6.0 s |
| 32k | 2341 | 53.8 | 13.6 s |
| 65k | 1606 | 50.1 | 40.5 s |
| 131k | 1019 | 44.1 | 128.6 s |
Conclusion.
Compared to the 3090, prompt processing is roughly 2.4–2.8x faster at matching contexts, and generation is about 1.7–1.8x faster. 131k context is usable but still heavy.
NVIDIA RTX PRO 6000 Blackwell Workstation Edition
| Context | Prompt t/s | Gen t/s | Approx TTFT |
|---|---|---|---|
| 4k | 3338 | 60.9 | 1.2 s |
| 32k | 2526 | 55.1 | 12.7 s |
| 131k | 1404 | 45.1 | 93.3 s |
| 262k | 903 | 36.3 | 290.2 s |
Conclusion.
Even with 96 GB VRAM, dense 27B at 262k context has a nearly 5-minute prompt ingestion. Dense models scale poorly in long-context scenarios.
AMD Ryzen AI Max+ 395 (Strix Halo)
The Strix Halo APU (Ryzen AI Max+ 395) was tested using ROCm (Kernel 6.18.12).
Unlike the discrete GPU tests above, which used 4-bit (Q4) quantization, the data below uses 8-bit (Q8) quantization. The higher precision places a significantly heavier load on memory bandwidth and capacity.
Performance on the dense 27B model is heavily constrained by memory bandwidth. At 8-bit precision, the token generation rate is low (~7 t/s), and prompt processing is slow because every token activates the full parameter set.
| Context | Prompt t/s | Gen t/s | Approx TTFT |
|---|---|---|---|
| 4k | 270 | 7.2 | 15.2 s |
| 16k | 215 | 7.0 | 76.2 s |
| 32k | 180 | 6.8 | 182.2 s |
| 65k | 135 | 6.4 | 485.1 s |
| 131k | 90 | 5.8 | 1456.6 s |
Conclusion.
The dense architecture of the 27B model combined with high-precision Q8 weights saturates the APU’s bandwidth. With generation speeds hovering around 6–7 t/s and slow ingestion, this configuration is not ideal for interactive tasks.
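The bandwidth argument can be sanity-checked with a back-of-the-envelope bound: if every generated token has to stream all weights from memory once, generation cannot exceed bandwidth divided by model size. The sketch below makes two assumptions that are not measured here: ~256 GB/s for Strix Halo (the figure quoted in the hardware section of this article) and about 1 byte per parameter for Q8, ignoring quantization overhead and KV cache reads.

```python
def bandwidth_bound_tps(bandwidth_gb_s: float, gb_read_per_token: float) -> float:
    """Upper bound on tokens/s if each token must stream this many GB from memory."""
    return bandwidth_gb_s / gb_read_per_token


# Assumptions (not measured in these benchmarks):
#  - ~256 GB/s memory bandwidth for Strix Halo
#  - 27B parameters at ~1 byte each for Q8, no overhead, no KV cache traffic
STRIX_HALO_GB_S = 256.0
DENSE_27B_Q8_GB = 27.0

bound = bandwidth_bound_tps(STRIX_HALO_GB_S, DENSE_27B_Q8_GB)
measured_4k = 7.2  # Gen t/s at 4k from the table above

print(f"bound ~{bound:.1f} t/s, measured {measured_4k} t/s "
      f"({measured_4k / bound:.0%} of bound)")
```

The measured rate landing at roughly three quarters of this crude ceiling is consistent with a bandwidth-bound workload, though the bound itself is only as good as the two assumptions above.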
GPU Benchmarks – Qwen3.5 35B MoE (Q4_K)
This is where things change.
RTX 3090
| Context | Prompt t/s | Gen t/s | Approx TTFT |
|---|---|---|---|
| 4k | 2622 | 111.2 | 1.5 s |
| 16k | 2381 | 107.1 | 6.7 s |
| 32k | 2121 | 101.2 | 15.1 s |
| 65k | 1749 | 93.1 | 37.2 s |
| 131k | 1288 | 79.4 | 101.7 s |
Conclusion.
Compared to 27B dense on the same card, 35B MoE is roughly 3.2–3.3x faster in generation and about 2.4–2.6x faster in prompt ingestion at matching contexts. For 24 GB owners, this is the better model.
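The dense versus MoE comparison on the RTX 3090 can be recomputed directly from the two tables. The snippet below is a small sketch that divides the MoE throughput by the dense throughput at every context both tables share; the function and variable names are illustrative.

```python
# (prompt t/s, gen t/s) on the RTX 3090 at the contexts both tables share.
DENSE_27B = {"4k": (1104, 33.5), "16k": (977, 32.3), "32k": (848, 31.0), "65k": (678, 28.8)}
MOE_35B = {"4k": (2622, 111.2), "16k": (2381, 107.1), "32k": (2121, 101.2), "65k": (1749, 93.1)}


def speedups(fast, slow):
    """Per-context (prompt, generation) throughput ratios of `fast` over `slow`."""
    return {
        ctx: (fast[ctx][0] / slow[ctx][0], fast[ctx][1] / slow[ctx][1])
        for ctx in fast
        if ctx in slow
    }


for ctx, (pp_ratio, tg_ratio) in speedups(MOE_35B, DENSE_27B).items():
    print(f"{ctx}: prompt {pp_ratio:.2f}x, generation {tg_ratio:.2f}x")
```

The same helper works for any pair of tables in this article, for example 5090 versus 3090 on the same model, as long as only shared context rows are compared.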
RTX 5090
| Context | Prompt t/s | Gen t/s | Approx TTFT |
|---|---|---|---|
| 4k | 6605 | 165.2 | 0.6 s |
| 16k | 6142 | 148.3 | 2.6 s |
| 32k | 5611 | 143.2 | 5.8 s |
| 65k | 4624 | 133.5 | 14.1 s |
| 131k | 3242 | 118.2 | 40.4 s |
| 262k | 2003 | 97.3 | 130.8 s |
Conclusion.
This is a strong pairing. Even 131k context has a ~40 second TTFT, which is manageable for agentic workflows. Generation speed remains high.
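For agentic use, what matters is the full turn time, not just TTFT. A rough way to split it is prompt ingestion plus generation of the reply. The sketch below uses the 131k row from the table; the 500-token reply length is a hypothetical workload, and "131k" is taken as 131,000 tokens.

```python
def response_latency_seconds(context_tokens, prompt_tps, gen_tps, output_tokens):
    """Approximate wall time: prompt ingestion (TTFT) plus token generation."""
    ttft = context_tokens / prompt_tps
    generation = output_tokens / gen_tps
    return ttft, generation, ttft + generation


# RTX 5090, 35B MoE, 131k row from the table above.
# Assumptions: 131,000 context tokens and a hypothetical 500-token reply.
ttft, gen, total = response_latency_seconds(131_000, 3242, 118.2, 500)
print(f"TTFT {ttft:.1f} s + generation {gen:.1f} s = {total:.1f} s")
```

At long context the ingestion term dominates, which is why prompt caching across turns matters more than raw generation speed in this regime.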
Running 262k on 24 GB GPUs with --fit
For 35B MoE on 24 GB GPUs, we tested llama-server with:
--fit on
--fit-ctx 262144
--fit-target 128
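For reference, the flags above could be assembled into a launch command as follows. This is a sketch only: the model path is a hypothetical placeholder, and the script prints the command rather than starting the server.

```python
import shlex


def build_llama_server_cmd(model_path: str, fit_ctx: int = 262144, fit_target: int = 128) -> list:
    """Assemble a llama-server argument list using the --fit flags listed above."""
    return [
        "llama-server",
        "-m", model_path,
        "--fit", "on",
        "--fit-ctx", str(fit_ctx),
        "--fit-target", str(fit_target),
    ]


# Hypothetical model path; substitute the GGUF file you actually downloaded.
cmd = build_llama_server_cmd("models/qwen3.5-35b-moe-q4_k.gguf")
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd)` would start the server, assuming `llama-server` is on your `PATH`.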
Measured on RTX 3090:
| Context | Prompt t/s | Gen t/s | Total Time |
|---|---|---|---|
| 38k | 1509 | 55.6 | 36.6 s |
| 100k | 1092 | 49.1 | 105.3 s |
| 260k | 1045 | 46.3 | 273.8 s |
Conclusion.
With --fit, 262k context is technically possible on 24 GB cards. Prompt processing drops but remains usable. For batch agent workloads this is acceptable. For chat, it is slow but workable.
AMD Ryzen AI Max+ 395 (Strix Halo)
The Mixture-of-Experts architecture demonstrates the strength of the Strix Halo platform. Despite running at Q8 (high precision) and having a larger total parameter count, the 35B MoE model is significantly faster than the 27B dense model.
| Context | Prompt t/s | Gen t/s | Approx TTFT |
|---|---|---|---|
| 4k | 960 | 38.5 | 4.3 s |
| 16k | 730 | 37.0 | 22.5 s |
| 32k | 600 | 35.0 | 54.7 s |
| 65k | 410 | 32.0 | 159.7 s |
| 131k | 250 | 27.0 | 524.4 s |
Conclusion.
The 35B MoE model is the clear winner on Strix Halo. Even at Q8 precision, it maintains excellent generation speeds (starting near 40 t/s and holding ~27 t/s at 131k context). While prompt processing slows down at extreme contexts, the generation performance remains highly usable for chat and coding assistance.
What Hardware Makes Sense Per Budget Tier
24 GB GPUs like an NVIDIA GeForce RTX 3090 remain viable for Qwen3.5 35B MoE up to ~131k context without hacks. Performance per dollar is still strong if you buy used, and prompt processing and token generation are reasonable for mid-range agentic or coding use.
Stepping up to an NVIDIA GeForce RTX 5090 class card significantly improves prompt processing and reduces time-to-first-token at higher contexts. If your workflow involves retrieval-augmented generation (RAG), multi-turn coding, or agentic use, the increased memory bandwidth and extra VRAM headroom make sense for the cost. Prompt ingestion on 5090 is often 2–3× faster than a 3090 at the same context length.
Workstation GPUs like an NVIDIA RTX PRO 6000 Blackwell Workstation Edition only become necessary if you want to run 8-bit weights with 8-bit KV cache configurations or clean 262k context without relying on --fit strategies or quantization workarounds. Those configurations require VRAM beyond typical consumer cards and are not a good value for most hobbyist builds.
If the 35B MoE model proves performant for agentic use, non-GPU unified memory platforms are another path. Relatively affordable machines like a 48 GB MacBook Pro with M3 Max (with ~400 GB/s memory bandwidth) or a 64 GB Strix Halo (with ~256 GB/s bandwidth) make sense for local inference. These systems can load the model and run long context lengths, but expect slower prompt processing, especially near max context, compared to dedicated discrete GPUs. They excel in value and simplicity, but raw throughput will lag behind similarly priced discrete GPU hardware.
For value-focused local builders who care about performance per dollar and workable agentic coding performance, 35B MoE in 4-bit on 24–32 GB GPUs, or unified memory systems with high bandwidth, currently represent the best balance of cost, VRAM headroom, and usable speeds.