RTX 3090
-
Apr. 3, 2026 / Featured
What Hardware for Gemma 4 26B and 31B LLM Local Use
The new Gemma 4 models from Google DeepMind have landed, and for local LLM users this is one of the more practical releases in a while. The lineup gives us two interesting mid-size targets: a 26B MoE model (A4B) and a 31B dense model. Both support up to 256K context, tool calling, and personal agent-style...
-
Feb. 26, 2026 / Hardware Insights
Qwen3.5 27B and Qwen3.5 35B: What Hardware Do You Actually Need? (GPU Benchmarks Inside)
Qwen3.5 27B fits comfortably on a 24 GB GPU up to 131k context in 4-bit, but becomes memory-heavy at 262k. Qwen3.5 35B MoE in 4-bit is the more practical long-context model for 24 GB cards, and it is significantly faster in token generation despite having more total parameters. VRAM is still the main constraint,...
-
Feb. 4, 2026 / Hardware Insights
Qwen3 Coder Next 80B A3B: what it takes to run it locally
Direct answer first: Qwen3 Coder Next 80B A3B is one of the most hardware-friendly 80B-class coding models released so far. Thanks to its MoE design with roughly 3B active parameters, a single high-VRAM GPU can run it at full 256k context, and even dual consumer GPUs can handle the 3-bit version comfortably. VRAM, not raw...
-
Nov. 10, 2025 / Hardware Insights
GPT-OSS 120B: Offloading MoE Layers to CPU Boosts RTX 3090 and 5090 Performance
I’ve been testing the --n-cpu-moe flag in llama.cpp to see how much it improves performance with large Mixture-of-Experts (MoE) models. The standard method of splitting layers between the GPU and CPU can be slow for these models. This flag offers a more targeted approach by moving just the expert layers to system RAM while...