LLM Benchmarks

  • Jan. 22, 2026 / Hardware Insights

    We Tested GLM-4.7 Flash 30B MoE — Here’s the GPU You Actually Need

    Z.ai released GLM-4.7 Flash only a few days ago, but meaningful local testing had to wait. The initial llama.cpp support was incomplete, and without proper fixes it was not possible to measure real performance. Those fixes have now landed, and with the latest llama.cpp build we were finally able to test the model properly...

  • Dec. 11, 2025 / Hardware Insights

    We Tested Devstral 2 (24B & 123B) — Here’s the Hardware You Actually Need

    Mistral AI has just released its new coding model, Devstral 2. We’ve been using its predecessor, Devstral Small, locally for code completion and have been very impressed with its performance. Early reports on Devstral 2 put it on par with other top models like Kimi K2 and DeepSeek V3.2, so we were eager to get...

  • Nov. 10, 2025 / Hardware Insights

    GPT-OSS 120B: Offloading MoE Layers to CPU Boosts RTX 3090 and 5090 Performance

    I’ve been testing the --n-cpu-moe flag in llama.cpp to see how much it improves performance with large Mixture of Experts models. The standard method of splitting layers between the GPU and CPU can be slow for these models. This flag offers a more targeted approach by moving just the expert layers to system RAM while...

  • Nov. 2, 2025 / Hardware Insights

    The Definitive GPU Ranking for LLMs: Token Generation & Prompt Processing Performance

    At Hardware Corner, we set out to create a data-driven benchmark hierarchy for local LLM inference – focusing on the two workloads that define real-world performance: prompt processing and token generation. Using llama.cpp’s latest llama-bench on Ubuntu 24.04 with CUDA 12.8, we measured a wide range of GPUs across model sizes, context lengths, and quantization...

  • Oct. 28, 2025 / LLM Benchmarks

    LLM VRAM Usage Compared: Benchmarking Popular 8B–123B Models Across 4K–256K Contexts

    As someone who runs language models locally, I know that VRAM is the one resource we can never have enough of. Every parameter, every token of context, and the growing KV cache all chip away at that precious memory. To cut through the speculation and get hard data, I decided to benchmark some of today’s...

  • Oct. 17, 2025 / LLM Benchmarks

    RTX 4090 LLM Benchmarks: Performance Across 4K – 131K Context Sizes

    I tested the RTX 4090 with five quantized models to measure real-world inference performance for local LLM workloads. This is the second article in my GPU benchmark series, following my recent RTX 5090 tests. I ran these benchmarks to provide concrete performance data across different model sizes and context lengths using llama.cpp...

  • Oct. 16, 2025 / LLM Benchmarks

    RTX 5090 LLM Benchmark Results: 10K Tokens/sec Prompt Processing, 139K Context

    I recently completed extensive local LLM inference benchmarks on the NVIDIA RTX 5090 32 GB. My primary focus was gathering raw performance data on critical metrics for the local enthusiast: prompt processing speed (PP), token generation throughput (TG), and the maximum context window I could sustain using 4-bit quantization (Q4_K_XL). My goal here is to...

  • Sep. 16, 2025 / LLM Benchmarks

    Local LLM Models and Their Max Context Windows: A Reference Table

    When choosing a local LLM, one of the first specifications to check is its context window. The context size determines how many tokens you can feed into the model at once, which directly affects practical use cases like long-form reasoning, document analysis, or multi-turn conversations. For hardware enthusiasts running quantized models on limited VRAM, knowing...

  • Sep. 10, 2025 / LLM Benchmarks

    Can Three RTX 3090s Really Run GPT-OSS 120B with Max Context? I Put It to the Test

    After testing the gpt-oss-20B model on a single RTX 3090, I had to push things further and see what the new heavyweight could do. In addition to the 20B model, OpenAI also released gpt-oss-120B, a massive 120-billion parameter open-weight Mixture-of-Experts (MoE) model with 5.1 billion active parameters. I first ran some experiments on an RTX...
