Running LLMs locally differs from cloud use in that it relies on your own hardware — GPUs, CPUs, and memory — to handle the workload. Local deployment is key for enthusiasts and professionals who want performance tuning, cost control, and data privacy without depending on external servers.
Yes, running LLMs on a CPU is possible, but both prompt processing and token generation will be slower than on a GPU. Token generation speed depends heavily on memory bandwidth, so faster memory and more memory channels improve performance. This is why server platforms with many memory channels are popular among enthusiasts.
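The link between memory bandwidth and generation speed can be put into rough numbers. The sketch below assumes that producing each token requires reading every model weight from memory once, so bandwidth divided by model size gives an upper bound on tokens per second. The function names and the example figures (dual-channel DDR5-5600, eight-channel DDR5-4800, a 4 GB model file) are illustrative assumptions, and real throughput will be lower than these ceilings.

```python
def peak_bandwidth_gb_s(channels: int, mega_transfers_per_s: float, bus_bytes: int = 8) -> float:
    """Theoretical peak bandwidth of a DDR memory system in GB/s.

    Assumes each channel moves `bus_bytes` (8 bytes = 64 bits) per transfer.
    """
    return channels * mega_transfers_per_s * 1e6 * bus_bytes / 1e9


def max_tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on generation speed, assuming every weight is read once per token."""
    return bandwidth_gb_s / model_size_gb


# Illustrative configurations (assumed values, not benchmarks).
desktop = peak_bandwidth_gb_s(channels=2, mega_transfers_per_s=5600)
server = peak_bandwidth_gb_s(channels=8, mega_transfers_per_s=4800)

model_size_gb = 4.0  # hypothetical quantized model file
for name, bandwidth in (("desktop", desktop), ("server", server)):
    ceiling = max_tokens_per_second(bandwidth, model_size_gb)
    print(f"{name}: {bandwidth:.1f} GB/s, at most ~{ceiling:.1f} tokens/s")
```

Under these assumptions, quadrupling the channel count raises the ceiling far more than a modest bump in memory clock would.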
RAM requirements are the same whether you run on a CPU or a GPU. As general guidance, plan for at least 16GB of RAM for smaller models, 32GB for medium-sized models, and 64GB or more for large models, unless you use memory-saving optimizations such as quantization.
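As a rough way to see where such tiers come from, the weights alone take about (parameter count × bits per weight ÷ 8) bytes. The sketch below applies that arithmetic. The parameter counts are example sizes, and the 20% overhead factor for the context cache and runtime buffers is an assumption for illustration, not a measured value.

```python
def estimate_memory_gb(params_billion: float, bits_per_weight: int, overhead: float = 0.2) -> float:
    """Approximate memory needed to load a model, in GB.

    `overhead` is an assumed allowance for context cache and runtime buffers.
    """
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes * (1 + overhead) / 1e9


# Example model sizes in billions of parameters (illustrative).
for params in (7, 13, 70):
    for bits in (16, 8, 4):
        needed = estimate_memory_gb(params, bits)
        print(f"{params}B parameters at {bits}-bit: ~{needed:.1f} GB")
```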
Popular models by size include:
To improve inference performance:
Quantization is a technique that reduces the memory footprint of LLMs by representing weights with fewer bits. Standard models use 16-bit or 32-bit precision, while quantized models use 8-bit, 4-bit, or even lower precision, which significantly reduces VRAM requirements, usually with little quality loss at 8-bit and 4-bit.
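To make the idea concrete, here is a minimal sketch of symmetric quantization: each weight is stored as a small signed integer plus one shared scale factor, and multiplied back out when needed. This is a simplified illustration with made-up weights, not the scheme any particular library uses; real formats are more elaborate, for example keeping a separate scale per block of weights.

```python
def quantize(weights: list[float], bits: int = 8) -> tuple[list[int], float]:
    """Symmetric quantization: map floats to signed integers plus one scale."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8-bit, 7 for 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the representable range.
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q: list[int], scale: float) -> list[float]:
    """Recover approximate float weights from the integers and the scale."""
    return [v * scale for v in q]


def worst_error(weights: list[float], bits: int) -> float:
    """Largest absolute difference after a quantize/dequantize round trip."""
    q, scale = quantize(weights, bits)
    restored = dequantize(q, scale)
    return max(abs(a - b) for a, b in zip(weights, restored))


weights = [0.12, -0.5, 0.33, 0.9, -0.07]  # made-up example values
for bits in (8, 4):
    print(f"{bits}-bit worst-case error: {worst_error(weights, bits):.4f}")
```

The round-trip error grows as the bit width shrinks, which is the trade-off between memory savings and quality.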
Inference is the process of running a trained model to generate responses. Inference speed depends on hardware, model size, and optimization techniques. Unlike training, inference can be done with fewer computing resources by using various optimization methods.
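Generating a response is typically done one token at a time, with each new token fed back in as input for the next step. The toy sketch below shows that loop with greedy selection. `TOY_MODEL` is a hypothetical hand-written score table standing in for a real network, and unlike a real LLM it looks only at the last token rather than the whole sequence.

```python
# Hypothetical stand-in for a trained model: scores for each possible next token.
TOY_MODEL = {
    "<s>": {"local": 0.6, "cloud": 0.4},
    "local": {"models": 0.7, "hardware": 0.3},
    "models": {"run": 0.8, "</s>": 0.2},
    "run": {"offline": 0.9, "</s>": 0.1},
    "offline": {"</s>": 1.0},
}


def generate(max_new_tokens: int = 10) -> list[str]:
    """Greedy generation: repeatedly pick the highest-scoring next token."""
    tokens = ["<s>"]  # start-of-sequence marker
    for _ in range(max_new_tokens):
        scores = TOY_MODEL[tokens[-1]]
        next_token = max(scores, key=scores.get)
        if next_token == "</s>":  # end-of-sequence marker stops generation
            break
        tokens.append(next_token)
    return tokens[1:]  # drop the start marker


print(" ".join(generate()))
```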
GGML and GGUF are formats optimized for running LLMs locally on CPUs and GPUs. GGML is a tensor library for machine learning that enables efficient inference on consumer hardware, and it also gave its name to an earlier model file format. GGUF is the successor format, with improved metadata handling and compatibility.
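Part of GGUF's metadata handling is visible in the first bytes of a file. The sketch below parses the fixed-size header, assuming the little-endian layout described in the published GGUF specification for versions 2 and 3: a 4-byte magic `GGUF`, a 32-bit version, then 64-bit tensor and metadata counts. The function name is my own, and the sample bytes are synthetic rather than taken from a real model.

```python
import struct


def read_gguf_header(data: bytes) -> dict:
    """Parse the fixed 24-byte GGUF header (layout assumed from the v2/v3 spec)."""
    magic, version = struct.unpack_from("<4sI", data, 0)
    if magic != b"GGUF":
        raise ValueError("not a GGUF file: bad magic bytes")
    tensor_count, metadata_kv_count = struct.unpack_from("<QQ", data, 8)
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": metadata_kv_count,
    }


# Synthetic header for demonstration; the counts are arbitrary.
sample = struct.pack("<4sIQQ", b"GGUF", 3, 291, 24)
print(read_gguf_header(sample))
```

With a real file, the same function could be given the first 24 bytes read in binary mode.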
A tokenizer is a component that breaks text into smaller units (tokens) for the LLM. Tokenizers convert raw text into numerical representations that models can process, and they influence how efficiently the model handles different languages and special characters.
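The text-to-numbers step can be sketched with a toy word-level tokenizer. This is an illustration only: the function names and the tiny corpus are made up, and real LLM tokenizers generally use learned subword schemes such as byte-pair encoding rather than whole words.

```python
import re


def split_text(text: str) -> list[str]:
    """Lowercase the text and split it into words and punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def build_vocab(corpus: list[str]) -> dict[str, int]:
    """Assign an integer ID to every distinct token; ID 0 is reserved for unknowns."""
    vocab = {"<unk>": 0}
    for text in corpus:
        for token in split_text(text):
            vocab.setdefault(token, len(vocab))
    return vocab


def encode(text: str, vocab: dict[str, int]) -> list[int]:
    """Convert text to token IDs, mapping unseen tokens to <unk>."""
    return [vocab.get(token, vocab["<unk>"]) for token in split_text(text)]


def decode(ids: list[int], vocab: dict[str, int]) -> list[str]:
    """Convert token IDs back to their token strings."""
    inverse = {i: token for token, i in vocab.items()}
    return [inverse[i] for i in ids]


vocab = build_vocab(["Run models locally.", "Local models need memory."])
ids = encode("Run local models!", vocab)
print(ids, decode(ids, vocab))
```

Anything outside the vocabulary collapses to a single unknown ID here, which hints at why vocabulary design affects how well a model copes with other languages and unusual characters.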