How to Run LLMs Locally

  • Feb. 4, 2026 / Hardware Insights

    Qwen3 Coder Next 80B A3B: what it takes to run it locally

    Direct answer first: Qwen3 Coder Next 80B A3B is one of the most hardware-friendly 80B-class coding models released so far. Thanks to its MoE design with roughly 3B active parameters, a single high-VRAM GPU can run it at full 256k context, and even dual consumer GPUs can handle the 3-bit version comfortably. VRAM, not raw...

    qwen3 coder next building pc for local use
  • Oct. 9, 2025 / How to Run LLMs Locally

    How Multi-Token Prediction Makes Local LLMs Faster – Without Extra VRAM.

    For anyone running LLMs locally, the goal is always more performance for less cost. We obsess over VRAM, memory bandwidth, and squeezing every last token per second out of our hardware. While prompt processing (TTFT) is often fast, the token generation that follows can be a bottleneck, especially on memory-bandwidth-limited systems. This one-token-at-a-time process, called...

    multi token prediction in local llm
  • Oct. 8, 2025 / How to Run LLMs Locally

    I optimized my Strix Halo for local LLMs: Here are the benchmarks and findings.

    If you’ve gotten your hands on an AMD Ryzen AI Max+ 395 (Strix Halo) system, you already know the raw hardware is impressive. That massive pool of unified LPDDR5x memory is a game-changer for running large models locally. But unlocking its full potential isn’t just plug-and-play. The key to getting the best possible performance lies...

  • Sep. 29, 2025 / How to Run LLMs Locally

    Speculative Decoding Explained: Faster Inference Without Quality Loss

    Unlock significant speed gains for large language models on your own hardware without sacrificing quality. Here’s how it works and how to set it up in popular inference engines. Why Local LLMs Run Slow: If you run large language models on your own hardware, you know the biggest challenge is inference speed. Getting high-quality models...

    illustration of speculative decoding in llm inference - the main model and the draft model working together
  • Sep. 18, 2025 / How to Run LLMs Locally

    Memory Bandwidth: How Does It Boost Tokens per Second in Local LLM Inference?

    You’ve spent weeks picking out the parts for a powerful new computer. It has a top-tier CPU, plenty of fast storage, and maybe even a respectable graphics card. You download your first large language model (LLM), excited to run it locally, only to find the experience is agonizingly slow. The text trickles out one word...

  • Sep. 15, 2025 / How to Run LLMs Locally

    What Is Context Length in LLMs and How It Impacts Your VRAM (and Speed)

    For local LLM enthusiasts, the race for models with larger “context lengths” feels like the next frontier. While developers boast models that can “remember” entire novels, the practical reality for anyone running hardware at home is that a bigger context window directly translates to a massive hit on your system’s resources, especially your precious VRAM....

    graphic representation of llm context
  • Sep. 5, 2025 / How to Run LLMs Locally

    Quantization for Local LLMs: How It Works and Which Formats Fit Your Setup

    Running large language models locally requires smart resource management. Quantization is the key technique that makes this possible by reducing memory requirements and improving inference speed. This practical guide focuses on what you need to know for local LLM deployment, not the mathematical theory[1] behind it. For the technical mathematical details of quantization, check out...

    graphical representation of quantization llm layers weights and activations