Product: RTX 5090 | Hardware Corner

Feb. 26, 2026 / Hardware Insights

Qwen3.5 27B and Qwen3.5 35B: What Hardware Do You Actually Need? (GPU Benchmarks Inside)

Qwen3.5 27B fits comfortably on a 24 GB GPU up to 131k context in 4-bit, but becomes memory heavy at 262k. Qwen3.5 35B MoE in 4-bit is the more practical long-context model for 24 GB cards, and it is significantly faster in token generation despite having more total parameters. VRAM is still the main constraint,...

RTX 3090 on a test bench running Qwen3.5 35B MoE
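
The 24 GB fit described in this teaser comes down to simple arithmetic on weight storage. A rough sketch follows, assuming the nominal parameter counts implied by the model names (27e9 and 35e9) and a flat 4.0 bits per weight. Real 4-bit quantization formats typically use somewhat more than 4 bits per weight, and the KV cache for long contexts comes on top, so treat these figures as a lower bound rather than a measurement.

```python
def weight_memory_gb(total_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return total_params * bits_per_weight / 8 / 1e9


# Parameter counts are assumptions taken from the model names, not official specs.
# An MoE model keeps every expert in memory, so its total parameter count is
# what matters for storage, even though only a fraction is active per token.
for name, params in [("27B dense", 27e9), ("35B MoE", 35e9)]:
    gb = weight_memory_gb(params, bits_per_weight=4.0)
    print(f"{name}: ~{gb:.1f} GB of weights at 4.0 bits/weight")
```

Whatever remains of the 24 GB after the weights are loaded is the budget for context, which is why context length drives the hardware requirement.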

Feb. 4, 2026 / Hardware Insights

Qwen3 Coder Next 80B A3B: what it takes to run it locally

Direct answer first: Qwen3 Coder Next 80B A3B is one of the most hardware-friendly 80B-class coding models released so far. Thanks to its MoE design with roughly 3B active parameters, a single high-VRAM GPU can run it at full 256k context, and even dual consumer GPUs can handle the 3-bit version comfortably. VRAM, not raw...

Qwen3 Coder Next: building a PC for local use

Nov. 10, 2025 / Hardware Insights

GPT-OSS 120B: Offloading MoE Layers to CPU Boosts RTX 3090 and 5090 Performance

I’ve been testing the --n-cpu-moe flag in llama.cpp to see how much it improves performance with large Mixture of Experts models. The standard method of splitting layers between the GPU and CPU can be slow for these models. This flag offers a more targeted approach by moving just the expert layers to system RAM while...

RTX 3090 and RTX 5090 standing on top of MoE layers
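
For readers who have not used the flag, here is a minimal sketch of what an invocation might look like, written as a small Python helper that assembles the command line. The model path, the context size, and the value 24 for `--n-cpu-moe` are hypothetical placeholders, not the settings used in the article. The idea is to request full GPU offload with `-ngl` and then use `--n-cpu-moe` to keep the expert weights of the first N layers in system RAM.

```python
import shlex


def build_llama_server_cmd(model_path: str, n_cpu_moe: int,
                           ctx_size: int = 8192, n_gpu_layers: int = 99) -> list[str]:
    """Assemble a llama-server command that offloads all layers to the GPU
    except the MoE expert weights of the first `n_cpu_moe` layers."""
    return [
        "llama-server",
        "-m", model_path,               # path to the GGUF file (hypothetical here)
        "-ngl", str(n_gpu_layers),      # offload as many layers as possible to the GPU
        "--n-cpu-moe", str(n_cpu_moe),  # keep expert weights of the first N layers on CPU
        "-c", str(ctx_size),            # context window size
    ]


# Placeholder values for illustration only.
cmd = build_llama_server_cmd("models/example-moe.gguf", n_cpu_moe=24)
print(shlex.join(cmd))
```

A common way to tune this is to start with a high value and lower it step by step until VRAM is nearly full, since every expert layer moved back to the GPU reduces traffic to system RAM.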

Oct. 16, 2025 / LLM Benchmarks

RTX 5090 LLM Benchmark Results: 10K Tokens/sec Prompt Processing, 139K Context

I recently completed extensive local LLM inference benchmarks on the NVIDIA RTX 5090 32 GB. My primary focus was gathering raw performance data on critical metrics for the local enthusiast: prompt processing speed (PP), token generation throughput (TG), and the maximum context window I could sustain using 4-bit quantization (Q4_K_XL). My goal here is to...
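
Both speed metrics named above are plain token-per-second rates over different phases of a request. The sketch below shows the calculation, assuming you have already timed the two phases yourself. The token counts and timings are made-up illustrative values, not results from this benchmark.

```python
def tokens_per_second(n_tokens: int, seconds: float) -> float:
    """Throughput for one phase of inference."""
    if seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return n_tokens / seconds


# Illustrative numbers only, not measurements.
pp = tokens_per_second(n_tokens=4096, seconds=0.5)  # prompt processing (PP)
tg = tokens_per_second(n_tokens=256, seconds=4.0)   # token generation (TG)
print(f"PP: {pp:.0f} tok/s, TG: {tg:.0f} tok/s")
```

PP is usually far higher than TG because the prompt is processed in parallel, while generation produces one token at a time.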