Computer Hardware Required to Run LLaMA AI Model Locally (GPU, CPU, RAM, SSD)


Large language models (LLMs) are powerful tools capable of generating natural language texts for various tasks such as programming, text summarization, role-playing, or serving as general AI assistants. One of the most advanced LLMs is LLaMA (Large Language Model Meta AI), a 70-billion-parameter model developed by Meta AI, a research division of Facebook.

To run the LLaMA model at home, you will need a computer equipped with a powerful GPU, capable of handling the substantial data and computational demands required for inferencing. In this article, we will discuss some of the hardware requirements necessary to run LLaMA and Llama-2 locally.

There are different methods for running LLaMA models on consumer hardware. The most common approach involves using a single NVIDIA GeForce RTX 3090 GPU. With its 24 GB of memory, this GPU is sufficient for running a quantized Llama model. For instance, with an RTX 3090, the ExLlamaV2 model loader, and a 4-bit quantized 30B-class LLaMA model, you can achieve approximately 30 to 40 tokens per second, which is very fast.
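To make this concrete, here is a minimal single-GPU inference sketch. It uses the llama-cpp-python bindings with a 4-bit GGUF file rather than the ExLlamaV2 loader mentioned above, and the model path is a placeholder, so treat it as an illustration of the general workflow rather than the exact setup benchmarked here.

```python
# Minimal single-GPU inference sketch using llama-cpp-python (pip install llama-cpp-python).
# The GGUF file name below is a placeholder; any 4-bit model that fits in 24 GB of VRAM will do.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-30b.Q4_K_M.gguf",  # hypothetical local file
    n_gpu_layers=-1,  # offload every layer to the GPU (requires a CUDA-enabled build)
    n_ctx=2048,       # context window
)

output = llm("Explain the difference between VRAM and system RAM.", max_tokens=200)
print(output["choices"][0]["text"])
```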

However, to run the larger 65B model, a dual GPU setup is necessary. This configuration allows the model weights to fit within the VRAM. Popular combinations include 2x RTX 3090s or an RTX 3090 paired with an RTX 4090.

Dual-GPU PC build for running Llama large language models in 2024:

Type Item Price
CPU AMD Ryzen 5 7600 3.8 GHz 6-Core $216.66
CPU Cooler Noctua NH-D15 $119.95
Motherboard Asus PROART B650-CREATOR ATX AM5 $229.99
Memory G.Skill Flare X5 64 GB (2 x 32 GB) DDR5-5600 CL36 $114.99
Storage Samsung 980 Pro 1 TB M.2-2280 PCIe 4.0 X4 NVME $93.98
Video Card RTX 3090 24 GB Video Card (Secondhand) ~ $800 @ Ebay
Video Card RTX 3090 24 GB Video Card (Secondhand) ~ $800 @ Ebay
Case Lian Li O11 Dynamic EVO XL $234.99
Power Supply Corsair HX1200 Platinum 1200 W 80+ Platinum $405.00
Case fans 5x be quiet! Pure Wings 2 87 CFM 120 mm Fan $54.50
Total $3070.06

Running the LLaMA model on a CPU is also an option. This approach requires a GGML/GGUF version of the model (LLaMA, Vicuna, Alpaca, etc.) and software called llama.cpp. Suitable CPUs for running LLaMA include the Core i9-12900K and Ryzen 9 5900X. For more information on this topic, refer to the CPU section.

Before we continue, I want to mention that this guide is tailored toward PC users. We also have a guide about the best Mac for large language models, which you can check out if you are a Mac user.

Let’s examine some of the PC hardware requirements necessary to operate the LLaMA model on a desktop PC:

GPU for running LLaMA and Llama-2

The GPU is the most crucial component of computer hardware for running LLaMA on a consumer-grade machine, as it handles the majority of the processing needed to operate the model. The performance of the GPU directly influences the inference speed.

While different variations and implementations of the model might demand less powerful hardware, the GPU remains the most vital part of the system.

GPU Requirements for 4-Bit Quantized LLaMA Models:

LLaMA Model Minimum VRAM Requirement Recommended GPU Examples
LLaMA / Llama-2 7B 6GB RTX 3060, GTX 1660, 2060, AMD 5700 XT, RTX 3050
LLaMA / Llama-2 13B 10GB AMD 6900 XT, RTX 2060 12GB, 3060 12GB, 3080, A2000
LLaMA / Llama-2 33B 20GB RTX 3080 20GB, A4500, A5000, 3090, 4090, 6000, Tesla V100, Tesla P40
LLaMA / Llama-2 65B/70B 40GB A100 40GB, 2x3090, 2x4090, A40, RTX A6000, 8000

Example of inference speed using ExLlama, an RTX 4090, and an Intel i9-12900K CPU:

Model Size Context VRAM used Speed
LLaMA / Llama-2 7B 2,048 t 5 GB 175 t/s
LLaMA / Llama-2 13B 2,048 t 9 GB 90 t/s
LLaMA / Llama-2 33B 2,048 t 21 GB 41 t/s

LLaMA-7B

To run LLaMA-7B effectively, it is recommended to have a GPU with a minimum of 6GB VRAM. A suitable example is the RTX 3060, which is available with 8GB or 12GB of VRAM. Other GPUs such as the GTX 1660, RTX 2060, AMD 5700 XT, or RTX 3050, which offer 6GB of VRAM or more, can also serve as good options for LLaMA-7B.

LLaMA-13B

For optimal performance with LLaMA-13B, a GPU with at least 10GB VRAM is suggested. Examples of GPUs that meet this requirement include the AMD 6900 XT, RTX 2060 12GB, 3060 12GB, 3080, or A2000. These GPUs provide the necessary VRAM capacity to handle the demands of a 13B model.

LLaMA-30B

To ensure smooth operation of LLaMA-30B, it is advisable to use a GPU with a minimum of 20GB VRAM. The RTX 3080 20GB, A4500, A5000, 3090, 4090, 6000, or Tesla V100 are examples of GPUs that offer the required VRAM capacity. These GPUs enable efficient processing and memory management for LLaMA-30B.

LLaMA-65B and 70B

LLaMA-65B and 70B perform optimally when paired with a GPU that has a minimum of 40GB VRAM. Suitable examples include the A100 40GB, 2x RTX 3090, 2x RTX 4090, A40, RTX A6000, or Quadro RTX 8000. These GPUs provide the VRAM capacity to hold the LLaMA-65B and Llama-2 70B weights.

CPU for LLaMA

In addition to the GPU, you will need a CPU that can support the GPU and handle other tasks such as data loading and preprocessing. If you are running a Llama model entirely on the GPU (for example, a GPTQ model), the CPU is less critical: its main job is to load your prompt faster, while the inference itself runs entirely on the GPU.

Good CPUs for LLaMA include the Intel Core i9-10900K, i7-12700K, and i7-13700K, or the AMD Ryzen 9 5900X, 7900X, and 7950X.

It is important to note that this article focuses on a build that uses the GPU for inference. However, there are LLaMA model formats specifically optimized for CPU use. For example, GGML/GGUF provides a way to overcome the limitations posed by GPU memory when dealing with large models. Therefore, if you are constrained by budget for a GPU, or if you already own one but wish to test models that exceed your VRAM capacity, it is advisable to run llama.cpp with GGUF model files. However, it is crucial to treat CPU inference as a combination of processor and RAM: it is bottlenecked by memory bandwidth, so it benefits from a CPU that supports higher memory bandwidth and more memory channels. Opting for a CPU with these specifications is therefore the preferred choice.
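As a rough sketch of that hybrid approach, llama.cpp (here through the llama-cpp-python bindings) lets you offload only as many layers as your VRAM allows and keep the rest in system RAM, where memory bandwidth becomes the limiting factor. The file name, layer count, and thread count below are assumptions to adjust for your own hardware.

```python
# Hybrid CPU/GPU inference sketch with llama.cpp via llama-cpp-python.
# The values below are assumptions -- tune them for your own system.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-65b.Q4_K_M.gguf",  # hypothetical GGUF file larger than your VRAM
    n_gpu_layers=40,  # offload as many layers as fit in VRAM; set to 0 for pure CPU inference
    n_threads=8,      # CPU threads for the layers that stay in system RAM
    n_ctx=2048,
)

print(llm("Why does memory bandwidth limit CPU inference?", max_tokens=150)["choices"][0]["text"])
```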

CPUs for Llama CPU-based inference:

  • Core i9-13900K (2 channels, works with DDR5-6000 @ 96 GB/s)
  • Ryzen 9 7950X (2 channels, works with DDR5-6000 @ 96 GB/s)

This is an example of running llama.cpp with a Ryzen 7 3700X and 128GB RAM @ 3600 MHz.

GGML Model Memory per Token Load Time Sample Time Predict Time Total Time
LLaMA-7B 4-bit 14434244 bytes 1270.15 ms 325.76 ms 15147.15 ms / 8.5 t/s 17077.88 ms
LLaMA-13B 4-bit 22439492 bytes 2946.00 ms 86.11 ms 7358.48 ms / 4.62 t/s 11019.28 ms
LLaMA-30B 4-bit 43387780 bytes 6666.53 ms 332.71 ms 68779.27 ms / 1.87 t/s 77333.97 ms
LLaMA-65B 4-bit 70897348 bytes 14010.35 ms 335.09 ms 140527.48 ms / 0.91 t/s 157951.48 ms

Memory (RAM) for LLaMA computer

Besides the GPU and CPU, you will also need sufficient RAM (Random Access Memory) and storage space to store the model parameters and data. The RAM requirement for the 4-bit LLaMA-30B is 32 GB, which allows the entire model to be held in memory without swapping to disk. However, if you don’t have the required amount, the model will still load, but not as quickly. For larger models or longer texts, you may want to consider using more RAM, such as 64 GB.

In situations where you use the CPU for inference, the bandwidth between the CPU and memory is a critical factor, and I would like to emphasize its importance. When generating a single token, the entire model needs to be read from memory once. Suppose you have a Core i9-10900X (4-channel memory support) with DDR4-3600, which gives a throughput of about 115 GB/s, and your model is 13 GB in size. In that case, the inference speed will be at most around 8 tokens per second, regardless of how fast your CPU is or how many parallel cores it has.
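A back-of-the-envelope calculation makes this limit concrete; the sketch below simply divides memory bandwidth by model size and ignores OS overhead and caching effects.

```python
# Rough upper bound on CPU inference speed: each generated token requires reading
# the whole model from RAM once, so tokens/s <= memory bandwidth / model size.
def max_tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    return bandwidth_gb_s / model_size_gb

# Example from the text: ~115 GB/s (4-channel DDR4-3600) and a 13 GB quantized model.
print(round(max_tokens_per_second(115, 13), 1))  # ~8.8 tokens/s theoretical ceiling
```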

The amount of RAM depends on the type of GGML/GGUF quantization and the model (LLaMA, Alpaca, Wizard, Vicuna, etc.) you are using.

These are the memory (RAM) requirements for LLaMA models run on the CPU:

GGUF Model Original size Quantized size (4-bit) Quantized size (5-bit) Quantized size (8-bit)
7B 13 GB 3.9 – 7.5 GB 7.5 – 8.5 GB 8.5 – 10.0 GB
13B 24 GB 7.8 – 11 GB 11.5 – 13.5 GB 13.5 – 17.5 GB
30B 60 GB 19.5 – 23.0 GB 23.5 – 27.5 GB 28.5 – 38.5 GB
65B 120 GB 38.5 – 47.0 GB 47.0 – 52.0 GB 71.0 – 80.0 GB
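These figures follow roughly from the parameter count multiplied by the bits per weight, plus overhead for context and runtime buffers. The sketch below is a hedged estimate; the 20% overhead factor is an assumption, not an exact rule.

```python
# Rough RAM estimate for a quantized model: parameters * bits per weight / 8,
# plus an assumed ~20% overhead for context and runtime buffers.
def estimated_ram_gb(params_billion: float, bits_per_weight: float, overhead: float = 1.2) -> float:
    return params_billion * bits_per_weight / 8 * overhead

for params in (7, 13, 30, 65):
    print(f"{params}B @ 4-bit: ~{estimated_ram_gb(params, 4):.1f} GB")
# Prints ~4.2, ~7.8, ~18.0, ~39.0 GB -- in the same ballpark as the table above.
```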

Inference speed for 13B model with 4-bit quantization, based on memory (RAM) speed when running on CPU:

RAM speed CPU Memory channels Bandwidth *Inference speed
DDR4-3600 Ryzen 5 3600 2 56 GB/s ~ 7 tokens/s
DDR4-3200 Ryzen 5 5600X 2 51 GB/s ~ 6.3 tokens/s
DDR5-5600 Core i9-13900K 2 89.6 GB/s ~ 11.2 tokens/s
DDR4-2666 Core i5-10400f 2 41.6 GB/s ~ 5.1 tokens/s

*These are theoretical peak numbers. The actual speed will be lower and will depend on the OS and system load.

Motherboard

For those of you setting up a single-GPU system, you can pretty much go with a solid mid-range motherboard that matches your CPU type, without sweating the small stuff too much. But here's where it gets interesting for anyone eyeing a dual-GPU setup, especially with GPUs like dual RTX 3090s: you've got to ensure that the motherboard has two PCIe x16-size slots with enough space between them. Since the RTX 3090 is a 3-slot GPU, spacing is crucial. Also, check that the board can run the two slots at x8/x8 (often listed as PCIe bifurcation support), splitting the CPU's sixteen lanes between the GPUs. This is essential for getting both cards running optimally without bottlenecking performance.

Storage

The minimum storage requirement for LLaMA is a 1 TB NVMe SSD, which can store the model files and data files with fast read and write speeds. However, if you need room for more models or backups, you may want more storage space, such as 2 TB or 4 TB of SSD capacity.

Choose high-speed storage. Opt for a PCIe 4.0 NVMe SSD with excellent sequential speeds to facilitate fast data transfer between storage and system RAM.

How does model quantization affect the choice of GPU?

Quantized LLMs use fewer bits to store and process the model's weights and activations. This makes them faster and more efficient to deploy on a GPU.

4-bit quantized LLMs use only 4 bits per weight. This means they take up much less memory and computation time than full-precision models. They can run smoothly on GPUs with low VRAM capacities.

8-bit quantized LLMs use 8 bits per weight. This still reduces memory and computation costs compared to full-precision models, but not as much as 4-bit quantization. They need more GPU memory and computational power to run well. They are more suitable for GPUs with high VRAM capacities and computational capabilities.

To sum up, 4-bit quantized LLMs are more efficient and can run on GPUs with low VRAM capacities. 8-bit quantized LLMs are slightly less efficient and need GPUs with high VRAM capacities and computational capabilities.

LLaMA Precision GPU Memory Requirements Computational Demands Suitable GPU
Native (32-bit) Higher requirements Higher computational demands GPUs with larger VRAM capacities and high computational capabilities
16-bit Quantized Moderate requirements Moderate computational demands GPUs with moderate VRAM capacities and good computational capabilities
8-bit Quantized Relatively higher requirements Slightly higher computational demands GPUs with larger VRAM capacities and higher computational capabilities
4-bit Quantized Lower requirements Lower computational demands GPUs with limited VRAM capacities

As you can see, the precision of a LLaMA model has a direct impact on its GPU memory requirements and computational demands. Native (32-bit) models require the most GPU memory and computational power, while 4-bit and lower quantized models require the least.

The suitable GPU for a LLaMA model will depend on its precision and the specific tasks you want to use it for. If you need to run a large model (30B+), you will need a GPU with a large VRAM capacity. If you only need to run a small model (7B), you can get away with a GPU that has less VRAM.

It is important to note that the accuracy of the model will also decrease as the quantization level decreases. This is because the reduced precision can lead to errors in the model’s predictions.

The best quantization level for you will depend on your specific needs and requirements. If you need a model that is small and efficient, you may want to consider a 2-bit or 3-bit quantized model. However, if you need a model that is highly accurate, you should use a model quantized to 5 bits or more.

Does a dual GPU setup deliver better performance than a single one when used with LLaMA?

Adding a second GPU may not speed up text generation as much as expected. Some tests have shown surprising results where lower-end GPUs were faster than higher-end GPUs in generating tokens/second.

Dual GPU setups have more VRAM in total, but each GPU still has its own VRAM limit. The 30B LLaMA needs about 20GB of VRAM, which already fits on a single RTX 3090 (24GB), so splitting the model between two cards will not increase the speed. To reach maximum speed, the model should fit within the VRAM of a single GPU.

However, if the model is large (65B+) and cannot fit within the VRAM of a single GPU, using multiple GPUs can indeed speed up inference compared to running the same model split between the GPU and system RAM. In these instances, each GPU manages a segment of the model, with the model weights distributed among them.

Therefore, multiple GPUs are commonly used when dealing with large models.
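If you do split a large model across two cards, the loader has to be told how to divide the weights. Below is a minimal sketch using llama-cpp-python; the split ratio and file name are assumptions, and other loaders (ExLlama, for example) expose similar split settings.

```python
# Sketch: splitting a 4-bit 70B GGUF model across two 24 GB GPUs with llama-cpp-python.
# The model path and 50/50 split are assumptions; adjust them to your cards and file.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-2-70b.Q4_K_M.gguf",  # hypothetical file, ~40 GB of weights
    n_gpu_layers=-1,          # offload everything; layers are spread across both GPUs
    tensor_split=[0.5, 0.5],  # fraction of the model assigned to GPU 0 and GPU 1
    n_ctx=2048,
)
```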

What is faster for 65B Llama inference – dual RTX 3090/4090 or Mac M2 Pro/Max/Ultra?

Using an Apple M1, M2, or M3 Pro/Max is the recommended option for running large language models on a laptop. These processors use Apple's unified memory architecture (UMA), offering fast, low-latency memory access, which improves output speed. The M1/M2 Pro supports up to 200 GB/s of unified memory bandwidth, the M1/M2 Max up to 400 GB/s, and the M1/M2 Ultra up to 800 GB/s. For example, a MacBook Pro with the M2 Max running llama.cpp can generate around 65 t/s with a 7B model, 30 t/s with a 13B model, and 5 t/s with a 65B model.

However, in terms of inference speed, a dual RTX 3090/4090 setup is faster than the Mac M2 Pro/Max/Ultra. Two RTX 4090s can run 65B models at a speed of about 20 tokens per second, while two affordable secondhand RTX 3090s achieve around 15 tokens per second with ExLlama. Additionally, the Mac evaluates prompts more slowly, making the dual-GPU setup more appealing.

Hints and Tips when choosing PC hardware for LLaMA

Build around the GPU

Create a platform that includes the motherboard, CPU, and RAM. The GPU handles training and inference, while the CPU, RAM, and storage manage data loading. Select a motherboard with PCIe 4.0 (or 5.0) support, multiple NVMe drive slots, x16 GPU slots, and four memory slots. CPUs with high single-threaded speed, like Ryzen 5000/7000 or Intel’s 12th/13th gen, are recommended.

LLM and VRAM

For optimal performance in terms of response quality, it is recommended to run an 8-bit 13B model or a 4-bit 30B model on a GPU with at least 20GB of VRAM. Both provide similar quality responses, so VRAM availability should be the deciding factor. Consider RTX 30 or RTX 40 series cards, such as the RTX 3090 24GB or RTX 4090 24GB, for the best local performance.

Speed Comparison

The 13B model generally runs faster than the 30B model in terms of tokens generated per second. While the exact speed difference may vary, the 13B model tends to offer a noticeable improvement in generation speed compared to the 30B model.

RAM requirements

Aim for high-speed DDR5 memory with a capacity of at least 1.5 times your total VRAM, or ideally double the VRAM, for optimal performance.

PCIe 4.0 NVMe SSD

The importance of a high sequential speed PCIe 4.0 NVMe SSD is mainly for the initial model load into VRAM. Once the model is loaded, the SSD’s impact on generation speed (tokens/second) is minimal.

Sufficient Regular RAM

Having enough regular RAM, preferably double the VRAM capacity, is essential for the initial model load. Once the model is loaded, its impact on the actual generation speed is limited.

CPU Single-Threaded Speed

The CPU’s single-threaded speed is important primarily for the initial model load rather than running the model during generation. The CPU’s role is more prominent in tasks such as prompt preprocessing, model loading, and other non-GPU-dependent operations.

Power supply and case

Invest in a high-quality power supply with sufficient capacity to power all components. Choose a spacious case with good airflow for optimal thermals.

DDR5 and future platforms

While DDR5 and future platforms like Zen 4 or AM5 offer advantages, stability and compatibility can vary. Consider investing in a high-end motherboard with good PCIe slot layout and memory support for future upgradeability.

Remember, while these hints and tips provide insights based on experience, individual system configurations and performance may vary. It’s always advisable to experiment and benchmark different setups to find the most suitable solution for your specific needs.

Allan Witt

Allan Witt is co-founder and editor in chief of Hardware-corner.net. Computers and the web have fascinated me since I was a child. In 2011 I started training as an IT specialist in a medium-sized company and launched a blog at the same time. I really enjoy blogging about tech. After successfully completing my training, I worked as a system administrator in the same company for two years. As a part-time job I started tinkering with pre-built PCs and building custom gaming rigs at a local hardware shop. The desire to build PCs full-time grew stronger, and now this is my full-time job.
