Running LLMs Locally

Running LLMs locally differs from cloud use in that it relies on your own hardware — GPUs, CPUs, and memory — to handle the workload. Local deployment is key for enthusiasts and professionals who want performance tuning, cost control, and data privacy without depending on external servers.

Topics:

Hardware Insights, How to Run LLMs Locally, LLM Benchmarks, Local Agents

Products:

GPUs, Strix Halo, Apple Silicon, RTX 5090, RTX 3090, RTX Pro 6000, OpenClaw, RTX 4090, DGX Spark, Qwen3

Tech:

Unified Memory, VRAM

Beginner's Guide to Running LLMs

Running LLMs Locally Explained: An Introduction

By Allan Witt

Quantization for Local LLMs: How It Works and Which Formats Fit Your Setup

By Allan Witt

What Is Context Length in LLMs and How It Impacts Your VRAM (and Speed)

By Allan Witt

Memory Bandwidth: How Does It Boost Tokens per Second in Local LLM Inference?

By Chavy Levi

I optimized my Strix Halo for local LLMs: Here are the benchmarks and findings.

By Allan Witt

Frequently Asked Questions

Can I run an LLM without a GPU?

Yes. LLMs can run on a CPU, but both prompt processing and token generation will be slower than on a GPU. Generation speed depends heavily on memory bandwidth, so faster memory and more memory channels improve performance. This is why servers with multiple memory channels are popular among enthusiasts.
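The link between memory bandwidth and generation speed can be sketched with a common rule of thumb: for a dense model, each generated token has to read roughly all of the weights once, so bandwidth divided by model size gives a rough ceiling on tokens per second. The function name and the bandwidth figures below are illustrative assumptions, not measurements, and the estimate ignores compute limits, KV cache reads, and mixture-of-experts models that read only part of their weights per token.

```python
def max_tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Rough ceiling on decode speed if every token reads all weights once."""
    if model_size_gb <= 0:
        raise ValueError("model_size_gb must be positive")
    return bandwidth_gb_s / model_size_gb


# Hypothetical systems; bandwidth values are round illustrative numbers.
systems = {
    "dual-channel desktop RAM": 80.0,
    "8-channel server RAM": 300.0,
    "high-end GPU VRAM": 1000.0,
}

model_size_gb = 8.0  # assumed size of the quantized weights in memory

for name, bandwidth in systems.items():
    ceiling = max_tokens_per_second(bandwidth, model_size_gb)
    print(f"{name}: about {ceiling:.1f} tokens/s at most")
```

Real throughput is usually below this ceiling, but the ratio between systems tends to follow the ratio of their memory bandwidths.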

How much RAM do I need for local LLMs?

The memory a model needs is the same whether it is held in system RAM (CPU inference) or VRAM (GPU inference). General guidance: at least 16GB for smaller models, 32GB for medium-sized models, and 64GB or more for large models, unless you use memory-saving optimizations such as quantization.
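As a rough way to turn this guidance into numbers, the memory taken by the weights alone can be estimated from the parameter count and the bits stored per weight. This is a minimal sketch: the function name is hypothetical, and the result excludes the KV cache, activations, and runtime overhead, so actual usage will be higher.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in gigabytes (1 GB = 1e9 bytes)."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# Model sizes chosen for illustration only.
for params in (8, 14, 70):
    for bits in (16, 8, 4):
        size = weight_memory_gb(params, bits)
        print(f"{params}B parameters at {bits}-bit: about {size:.1f} GB of weights")
```

For example, an 8B model needs about 16 GB for 16-bit weights but about 4 GB at 4-bit, which is why quantization changes which tier of the guidance applies.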

What are the most popular open-source LLMs?

Popular models by size include:

Small: Gemma 3, Phi-4
Medium: Llama 3.1 8B, Qwen 3 14B
Large: Qwen 3 30B A3B, GPT-OSS 20B
Extra Large: GPT-OSS 120B, Llama 3.3 70B
Massive: Kimi K2, DeepSeek R1 & V3

How do I optimize LLM inference for speed?

To improve inference performance:

Use appropriate quantization (4-bit or 8-bit).
Enable GPU acceleration when available.
Adjust context length to the minimum necessary.
Use flash attention for faster computation.
Apply KV cache quantization (4-bit or 8-bit).
Consider smaller, specialized models instead of larger, general-purpose ones.
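
The effect of context length and KV cache quantization on memory can be estimated with the standard size formula for a transformer KV cache: two tensors (keys and values) per layer, each holding one vector per KV head per token. The model shape below is an assumed example, and real quantized caches also store per-block scale factors, so they come out slightly larger than this estimate.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bits_per_value: int) -> float:
    """Approximate KV cache size in bytes for a single sequence."""
    # Factor 2: one key tensor and one value tensor per layer.
    total_values = 2 * n_layers * n_kv_heads * head_dim * context_len
    return total_values * bits_per_value / 8


# Assumed model shape for illustration: 32 layers, 8 KV heads, head size 128.
shape = dict(n_layers=32, n_kv_heads=8, head_dim=128)

for context_len in (8192, 32768):
    for bits in (16, 8, 4):
        size_gib = kv_cache_bytes(context_len=context_len,
                                  bits_per_value=bits, **shape) / 2**30
        print(f"context {context_len}, {bits}-bit cache: {size_gib:.2f} GiB")
```

The cache grows linearly with context length, so quartering the context or moving from a 16-bit to a 4-bit cache each cut this part of memory use to a quarter.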

Local LLM Hardware

What hardware you need for MiniMax-M2.7 230B (A10B) in 4-bit

16.04.2026

What GPU for Running OpenClaw Locally

07.04.2026

OpenClaw (local) — Hardware and LLM Overview

05.04.2026

Best LLM for MacBook Pro with M5 Max and 32GB

05.04.2026

What Hardware for Gemma 4 26B and 31B LLM Local Use

03.04.2026

Best Laptop for Running OpenClaw AI Agent Locally

05.04.2026

Best Mini Computer (PC/Mac) for Running OpenClaw AI Agent

05.04.2026

Your RTX Pro 6000 Blackwell Does Not Support FlashAttention-4

24.03.2026

This Desktop Machine Runs 1T Parameter LLMs Locally

19.03.2026

How Memory Chips Determine GPU Memory Bandwidth for Local LLM Inference

26.02.2026

Qwen3.5 27B and Qwen3.5 35B: What Hardware Do You Actually Need? (GPU Benchmarks Inside)

13.03.2026

Qwen3 Coder Next 80B A3B: what it takes to run it locally

04.03.2026

LLM Hardware News

New Intel B70 GPU for local LLM: first benchmarks and RTX 3090 comparison

26.03.2026

Apple M5 Max for Local LLMs: First Benchmarks vs RTX Pro 6000 and RTX 5090

03.04.2026

Local LLM Hardware Deal: 48GB Blackwell GPU Workstation Priced Near GPU Cost

05.03.2026

M5 Pro and M5 Max Local LLM Users Get 4x Faster Prefill, But Modest Token Gains

11.03.2026

LLM GPUs for Local AI Builds Jump in Price Across All VRAM Tiers

17.02.2026

Ditch the Mac Mini: PicoClaw and ZeroClaw Run OpenClaw on $10 Boards

04.03.2026

Key Terms

Quantization

A technique to reduce the memory footprint of LLMs by representing weights with fewer bits. Standard models use 16-bit or 32-bit precision, while quantized models can use 8-bit, 4-bit, or even lower precision, significantly reducing VRAM requirements, often with only minor quality loss.
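
To make the idea concrete, here is a toy sketch of symmetric "absmax" quantization: weights are scaled so the largest one maps to the biggest representable integer, rounded, and later multiplied back by the scale. This is a simplified illustration only. The function names are hypothetical, and practical formats typically quantize small blocks of weights separately with more elaborate schemes.

```python
def quantize_absmax(weights: list[float], bits: int) -> tuple[list[int], float]:
    """Map floats to signed integers in [-qmax, qmax] using a single scale."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8-bit, 7 for 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    quantized = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: list[int], scale: float) -> list[float]:
    """Recover approximate float weights from the integers and the scale."""
    return [q * scale for q in quantized]


weights = [0.12, -0.53, 0.98, -0.07, 0.31]  # made-up example values

for bits in (8, 4):
    quantized, scale = quantize_absmax(weights, bits)
    restored = dequantize(quantized, scale)
    max_error = max(abs(w - r) for w, r in zip(weights, restored))
    print(f"{bits}-bit: ints={quantized}, max error={max_error:.4f}")
```

Fewer bits mean a coarser grid of representable values, so the rounding error grows as precision drops.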

Inference

The process of running a trained model to generate responses. Inference speed depends on hardware, model size, and optimization techniques. Inference needs far fewer computing resources than training, and various optimization methods can reduce the requirement further.

GGUF/GGML

A library and file formats for running LLMs locally on CPUs and GPUs. GGML is a tensor library designed for machine learning that allows efficient inference on consumer hardware; it also gave its name to an older model file format. GGUF is the successor file format, with improved metadata handling and compatibility.
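
The metadata handling mentioned above starts with a small fixed header at the beginning of a GGUF file. The sketch below parses that header from raw bytes. The layout (4-byte magic "GGUF", a 32-bit version, then 64-bit tensor and metadata counts, little-endian) reflects the GGUF specification for version 2 and later as best understood here, so treat it as an assumption and check the spec before relying on it. The function name and the sample values are hypothetical.

```python
import struct

GGUF_MAGIC = b"GGUF"
HEADER_FORMAT = "<4sIQQ"  # magic, version, tensor count, metadata key/value count
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)  # 24 bytes


def read_gguf_header(data: bytes) -> dict:
    """Parse the fixed-size header fields from the first bytes of a GGUF file."""
    if len(data) < HEADER_SIZE:
        raise ValueError("not enough bytes for a GGUF header")
    magic, version, tensor_count, kv_count = struct.unpack(
        HEADER_FORMAT, data[:HEADER_SIZE]
    )
    if magic != GGUF_MAGIC:
        raise ValueError("missing GGUF magic bytes")
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }


# Synthetic header built in memory, so no model file is needed to try this.
sample = struct.pack(HEADER_FORMAT, GGUF_MAGIC, 3, 291, 24)
print(read_gguf_header(sample))
```

With a real file, the same function could be given the first 24 bytes read from the file opened in binary mode.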

Tokenizer

A component that breaks down text into manageable chunks (tokens) for the LLM. Tokenizers convert raw text into numerical representations that models can process, and influence how efficiently the model handles different languages and special characters.
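
A byte-level toy tokenizer shows the basic text-to-numbers step and why token counts differ across languages. This is only an illustration under simplifying assumptions: production tokenizers use learned vocabularies that merge frequent byte sequences into single tokens, and the function names here are hypothetical.

```python
def encode_bytes(text: str) -> list[int]:
    """Turn text into token IDs, one ID per UTF-8 byte (vocabulary size 256)."""
    return list(text.encode("utf-8"))


def decode_bytes(token_ids: list[int]) -> str:
    """Turn byte-level token IDs back into text."""
    return bytes(token_ids).decode("utf-8")


for sample in ("hello", "こんにちは"):
    token_ids = encode_bytes(sample)
    assert decode_bytes(token_ids) == sample  # encoding is lossless
    print(f"{sample!r}: {len(sample)} characters -> {len(token_ids)} tokens")
```

Both samples have five characters, but the Japanese text needs three bytes per character, so it costs three times as many tokens in this scheme. Learned vocabularies narrow this gap to varying degrees, which is one way a tokenizer affects how efficiently a model handles different languages.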