Computer Hardware Required to Run LLaMA AI Model Locally (GPU, CPU, RAM, SSD)

Large language models (LLMs) are powerful tools that can generate natural-language text for a wide range of tasks and domains. One of the most advanced LLMs is LLaMA (Large Language Model Meta AI), a 65-billion-parameter model developed by Meta AI, the research division of Meta (Facebook).
To run a LLaMA model at home, you will need a computer built around a powerful GPU, capable of handling the large amount of data and computation required for inference. In this article we discuss the hardware you need to run LLaMA and Llama 2 locally.
There are different ways to run LLaMA models on consumer hardware. The most common is to use a single NVIDIA GeForce RTX 3090 GPU. With 24 GB of VRAM, it has enough memory to run a LLaMA model. For example, an RTX 3090 with the ExLlama model loader can run a 4-bit quantized LLaMA 30B model at around 30 to 40 tokens per second, which is very fast.
However, if you want to run the larger 65B model, you need a dual-GPU setup so the model weights fit in VRAM. Combinations like two RTX 3090s, or an RTX 3090 paired with an RTX 4090, are popular.
You can also run LLaMA models on the CPU. This option requires a GGML version of the model (LLaMA, Vicuna, Alpaca, etc.) and software called llama.cpp. Decent CPUs for running LLaMA this way are the Core i9-12900K and Ryzen 9 5900X. Check the CPU section for more on this topic.
Let's look at the hardware requirements you need to cover in order to run a LLaMA model on a desktop PC:
GPU for running LLaMA
The GPU is the most important piece of hardware when running LLaMA on a consumer-grade machine, because it is responsible for the majority of the processing required to run the model. The GPU's performance has a direct impact on inference speed.
Different variations and implementations of the model may require less powerful hardware. However, the GPU will still be the most important part of the system.
GPU requirements for 4-bit quantized LLaMA models
| LLaMA Model | Minimum VRAM Requirement | Recommended GPU Examples |
|---|---|---|
| LLaMA-7B | 6GB | RTX 3060, GTX 1660, 2060, AMD 5700 XT, RTX 3050 |
| LLaMA-13B | 10GB | AMD 6900 XT, RTX 2060 12GB, 3060 12GB, 3080, A2000 |
| LLaMA-30B | 20GB | RTX 3080 20GB, A4500, A5000, 3090, 4090, 6000, Tesla V100, Tesla P40 |
| LLaMA-65B | 40GB | A100 40GB, 2x3090, 2x4090, A40, RTX A6000, 8000 |
Example of inference speed using ExLlama, RTX 4090, and Intel i9-12900K CPU
| Model | Size | Seq. length | VRAM used | Speed |
|---|---|---|---|---|
| LLaMA | 7B | 2,048 tokens | 5 GB | 138 t/s |
| LLaMA | 13B | 2,048 tokens | 9 GB | 85 t/s |
| LLaMA | 33B | 2,048 tokens | 20 GB | 35 t/s |
LLaMA-7B
To run LLaMA-7B effectively, it is recommended to have a GPU with a minimum of 6GB VRAM. A suitable example is the RTX 3060, which is also available in an 8GB version. Other GPUs with at least 6GB of VRAM, such as the GTX 1660, RTX 2060, AMD RX 5700 XT, or RTX 3050, can also serve as good options for LLaMA-7B.
LLaMA-13B
For optimal performance with LLaMA-13B, a GPU with at least 10GB VRAM is suggested. Examples of GPUs that meet this requirement include the AMD 6900 XT, RTX 2060 12GB, 3060 12GB, 3080, or A2000. These GPUs provide the necessary VRAM capacity to handle the computational demands of LLaMA-13B effectively.
LLaMA-30B
To ensure smooth operation of LLaMA-30B, it is advisable to use a GPU with a minimum of 20GB VRAM. The RTX 3080 20GB, A4500, A5000, 3090, 4090, 6000, or Tesla V100 are examples of GPUs that offer the required VRAM capacity. These GPUs enable efficient processing and memory management for LLaMA-30B.
LLaMA-65B
LLaMA-65B performs optimally when paired with a GPU that has a minimum of 40GB VRAM. Suitable examples of GPUs for this model include the A100 40GB, 2x3090, 2x4090, A40, RTX A6000, or 8000. These GPUs provide ample VRAM capacity to handle the intensive computational tasks associated with LLaMA-65B.
Each LLaMA model has specific VRAM requirements, and the suggested GPUs are chosen based on their ability to meet or exceed those requirements, ensuring smooth and efficient performance for the corresponding LLaMA model.
CPU for LLaMA
In addition to the GPU, you will also need a CPU that can support the GPU and handle other tasks such as data loading and preprocessing. The CPU requirement for GPTQ (GPU-based) models is lower than for models that are optimized to run on the CPU.
Good CPUs for LLaMA are the Intel Core i9-10900K, i7-12700K, or Ryzen 9 5900X. However, for better performance, you may want to use a more powerful CPU, such as an AMD Ryzen Threadripper 3990X with 64 cores and 128 threads. When it comes to choosing between an expensive server CPU and a high-end gaming CPU, the latter takes the lead.
Note that this article focuses on a build targeted toward GPUs, but there are LLaMA models optimized for the CPU. For example, GGML is a format that addresses the limitations posed by GPU memory when working with large models. If you prefer to use the CPU, it is recommended to run GGML-format model files.
You can then use software called llama.cpp (an interface to the LLaMA model) to run inference on your CPU. A recent update to llama.cpp introduced the ability to distribute the model's workload between the CPU and GPU. This not only lets you load significantly larger models but also increases the tokens/s speed.
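To make this concrete, here is a minimal sketch of CPU inference with partial GPU offloading, using the llama-cpp-python bindings for llama.cpp (the bindings are just one of several ways to drive llama.cpp; the model path and parameter values below are placeholders, not a tested configuration):

```python
# Minimal sketch: run a GGML-quantized LLaMA model with llama.cpp via the
# llama-cpp-python bindings, offloading part of the model to the GPU.
# The model path and parameter values are illustrative placeholders.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-13b.ggmlv3.q4_0.bin",  # hypothetical 4-bit model file
    n_ctx=2048,       # context window in tokens
    n_threads=8,      # CPU threads for the layers that stay on the CPU
    n_gpu_layers=20,  # layers offloaded to the GPU; 0 = pure CPU inference
)

output = llm("Explain what VRAM is in one sentence.", max_tokens=64)
print(output["choices"][0]["text"])
```

Raising `n_gpu_layers` moves more of the model into VRAM and usually increases tokens/s, up to the point where the whole model fits on the GPU.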
Keep in mind that prompt processing with llama.cpp is highly dependent on CPU performance. Specifically, it scales with the number of CPU cores and threads used. This indicates that prompt processing is a CPU-bound workload – the speed is limited by raw CPU compute throughput rather than memory bandwidth or latency. In summary, prompt processing performance can be readily improved by using faster CPUs with more cores/threads.
This is an example of running llama.cpp with a Ryzen 7 3700X and 128GB of RAM.
| GGML Model | Memory per Token | Load Time | Sample Time | Predict Time | Total Time |
|---|---|---|---|---|---|
| LLaMA-7B 4-bit | 14434244 bytes | 1270.15 ms | 325.76 ms | 15147.15 ms / 117.42 ms per token | 17077.88 ms |
| LLaMA-13B 4-bit | 22439492 bytes | 2946.00 ms | 86.11 ms | 7358.48 ms / 216.43 ms per token | 11019.28 ms |
| LLaMA-30B 4-bit | 43387780 bytes | 6666.53 ms | 332.71 ms | 68779.27 ms / 533.17 ms per token | 77333.97 ms |
| LLaMA-65B 4-bit | 70897348 bytes | 14010.35 ms | 335.09 ms | 140527.48 ms / 1089.36 ms per token | 157951.48 ms |
Memory (RAM) for a LLaMA computer
Besides the GPU and CPU, you will also need sufficient RAM (random access memory) and storage space to store the model parameters and data. The minimum RAM requirement for 4-bit LLaMA-30B is 32 GB, which can hold the entire model in memory without swapping to disk. However, for larger datasets or longer texts, you may want to use more RAM, such as 64 GB or 128 GB.
In situations where you use the CPU for inference, the bandwidth between the CPU and memory is a critical factor, and I'd like to emphasize its importance. To generate a single token, the entire model has to be read from memory once. Suppose you have a Core i9-10900X (quad-channel memory support) with DDR4-3600: that gives a theoretical throughput of about 115 GB/s. If your model is 13 GB in size, the inference speed will be around 9 tokens per second, regardless of how fast your CPU is or how many parallel cores it has.
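The arithmetic behind that estimate is simple: theoretical bandwidth is memory channels × transfer rate × 8 bytes per transfer, and tokens per second is roughly bandwidth divided by model size. A quick sketch using the numbers from the example above:

```python
# Back-of-the-envelope estimate of CPU inference speed from memory bandwidth.
def memory_bandwidth_gb_s(channels: int, mt_per_s: int, bus_bytes: int = 8) -> float:
    """Theoretical bandwidth: channels * transfer rate * bus width (8 bytes for DDR4/DDR5)."""
    return channels * mt_per_s * bus_bytes / 1000  # MT/s * bytes -> GB/s

def estimated_tokens_per_s(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Each generated token requires reading the entire model from memory once."""
    return bandwidth_gb_s / model_size_gb

bw = memory_bandwidth_gb_s(channels=4, mt_per_s=3600)     # Core i9-10900X + DDR4-3600
print(f"Bandwidth: {bw:.0f} GB/s")                        # ~115 GB/s
print(f"~{estimated_tokens_per_s(bw, 13):.1f} tokens/s")  # ~8.9 tokens/s for a 13 GB model
```

Actual throughput also depends on OS and system load, as noted under the table further below.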
The amount of RAM depends on the type of GGML quantization and the model (LLaMA, Alpaca, Wizard, Vicuna etc.) you are using.
These are the memory (RAM) requirements for LLaMA models run on the CPU:
| GGML Model | Original size | Quantized size (4-bit) | Quantized size (5-bit) | Quantized size (8-bit) |
|---|---|---|---|---|
| 7B | 13 GB | 3.9 – 7.5 GB | 7.5 – 8.5 GB | 8.5 – 10.0 GB |
| 13B | 24 GB | 7.8 – 11 GB | 11.5 – 13.5 GB | 13.5 – 17.5 GB |
| 30B | 60 GB | 19.5 – 23.0 GB | 23.5 – 27.5 GB | 28.5 – 38.5 GB |
| 65B | 120 GB | 38.5 – 47.0 GB | 47.0 – 52.0 GB | 71.0 – 80.0 GB |
Inference speed for 13B model with 4-bit quantization, based on memory (RAM) speed when running on CPU:
| RAM speed | CPU | Memory channels | Bandwidth | Inference speed* |
|---|---|---|---|---|
| DDR4-3600 | Ryzen 5 3600 | 2 | 56 GB/s | ~7 tokens/s |
| DDR4-3200 | Ryzen 5 5600X | 2 | 51 GB/s | ~6.3 tokens/s |
| DDR5-5600 | Core i9-13900K | 2 | 89.6 GB/s | ~11.2 tokens/s |
| DDR4-2666 | Core i5-10400F | 2 | 41.6 GB/s | ~5.1 tokens/s |
*The speed will depend on OS and system load.
Storage
The minimum storage requirement for LLaMA is a 1 TB NVMe SSD, which can store the model files and data files with fast read and write speeds. However, for more data or backup purposes, you may want more storage space, such as a 2 TB or 4 TB SSD.
Choose high-speed storage. Opt for a PCIe 4.0 NVMe SSD with excellent sequential speeds to facilitate fast data transfer between storage and system RAM.
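To see why sequential read speed matters for the initial load, here is a rough load-time estimate (the drive speeds are typical ballpark figures used only for illustration, not benchmarks):

```python
# Rough model-load-time estimate: load time ≈ model size / sequential read speed.
def load_time_seconds(model_size_gb: float, seq_read_gb_s: float) -> float:
    return model_size_gb / seq_read_gb_s

model_gb = 20  # e.g. a 4-bit quantized LLaMA-30B
for name, speed_gb_s in [("SATA SSD", 0.55), ("PCIe 3.0 NVMe", 3.5), ("PCIe 4.0 NVMe", 7.0)]:
    print(f"{name}: ~{load_time_seconds(model_gb, speed_gb_s):.0f} s")
```

Once the model is resident in VRAM or RAM, the SSD no longer affects generation speed.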
How does model quantization affect the choice of GPU?
Quantized LLMs use fewer bits to store and process the model's weights and activations. This makes them faster and more efficient for GPU deployment.
4-bit quantized LLMs use only 4 bits per weight or activation. This means they take up much less memory and computation time than full-precision models. They can run smoothly on GPUs with low VRAM capacities.
8-bit quantized LLMs use 8 bits per weight or activation. This still reduces memory and computation costs compared to full-precision models, but not as much as 4-bit quantization. They need more GPU memory and computational power to run well. They are more suitable for GPUs with high VRAM capacities and computational capabilities.
To sum up, 4-bit quantized LLMs are more efficient and can run on GPUs with low VRAM capacities. 8-bit quantized LLMs are slightly less efficient and need GPUs with high VRAM capacities and computational capabilities.
| LLaMA Precision | GPU Memory Requirements | Computational Demands | Suitable GPU |
|---|---|---|---|
| Native (32-bit) | Higher requirements | Higher computational demands | GPUs with larger VRAM capacities and high computational capabilities |
| 16-bit Quantized | Moderate requirements | Moderate computational demands | GPUs with moderate VRAM capacities and good computational capabilities |
| 8-bit Quantized | Relatively higher requirements | Slightly higher computational demands | GPUs with larger VRAM capacities and higher computational capabilities |
| 4-bit Quantized | Lower requirements | Lower computational demands | GPUs with limited VRAM capacities |
As you can see, the precision of a LLaMA model has a direct impact on its GPU memory requirements and computational demands. Native (32-bit) LLMs require the most GPU memory and computational power, while 4-bit quantized LLMs require the least.
The suitable GPU for a LLaMA model will depend on its precision and the specific tasks you want to use it for. If you need to run a large LLaMA model on a variety of tasks, you will need a GPU with a large VRAM capacity and high computational capabilities. If you only need to run a small LLaMA model on a few specific tasks, you can get away with a GPU that has a smaller VRAM capacity and lower computational capabilities.
It is important to note that the accuracy of the model will also decrease as the quantization level decreases. This is because the reduced precision can lead to errors in the model’s predictions.
The best quantization level for you will depend on your specific needs and requirements. If you need a model that is small and efficient, then you may want to consider using a 4-bit or 8-bit quantized model. However, if you need a model that is highly accurate, then you may want to use a 16-bit model.
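As a quick sanity check of how precision translates into memory, you can estimate the weight footprint as parameter count × bits per weight, plus some allowance for activations and the KV cache. A rough sketch (the 20% overhead factor is a loose assumption; actual usage varies by loader and context length):

```python
# Rough estimate of model weight memory at different precisions (illustrative only).
def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Weights only: parameters * bits per weight / 8, reported in GB (1 GB = 1e9 bytes)."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

for bits in (32, 16, 8, 4):
    weights = weight_memory_gb(params_billion=30, bits_per_weight=bits)
    # Add ~20% as a loose allowance for activations and KV cache (assumption).
    print(f"LLaMA-30B at {bits}-bit: ~{weights:.0f} GB weights, ~{weights * 1.2:.0f} GB with overhead")
```

At 4-bit this lands around 15–18 GB for the 30B model, which is consistent with the ~20GB VRAM recommendation above.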
Does a dual GPU setup deliver better performance than a single one when used with LLaMA?
Adding a second GPU may not speed up text generation as much as expected. A bottleneck seems to block the simple solution of adding more compute power. Some tests have shown surprising results where lower-end GPUs were faster than higher-end GPUs in generating tokens/second. The reason for this is unclear, and text generation programs may need better optimization to use dual GPU setups well.
Dual-GPU setups have more VRAM in total, but the two cards do not act as a single pool: each GPU is still limited to its own VRAM. The 30B LLaMA needs about 20GB of VRAM, so with two RTX 3090s (24GB each) you still effectively work within a 24GB limit per card. The model should fit in the VRAM of a single GPU to run well.
However, if the model is too large (65B) to fit within the VRAM of a single GPU and needs to utilize system RAM, using multiple GPUs can indeed speed up the process. In such cases, each GPU can handle a portion of the model, and the computational load is distributed among them. This parallelization can lead to speed improvements for large models that exceed the VRAM capacity of a single GPU.
Therefore, multiple GPUs are commonly employed when dealing with large models that have high VRAM requirements.
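As an illustration of how a large model can be sharded across two GPUs (and spill into system RAM if needed), here is a minimal sketch using Hugging Face Transformers with Accelerate's `device_map="auto"`. This is just one possible tool for multi-GPU inference, not the only one, and the checkpoint name below is a placeholder:

```python
# Minimal sketch: split a large LLaMA checkpoint across available GPUs (and CPU RAM
# if VRAM runs out) with Hugging Face Transformers + Accelerate.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "huggyllama/llama-65b"  # placeholder checkpoint name

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # 16-bit weights to halve memory vs. 32-bit
    device_map="auto",          # distribute layers across GPU 0, GPU 1, then CPU RAM
)

prompt = "The main bottleneck for local LLM inference is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

With `device_map="auto"`, Accelerate places as many layers as fit on each GPU in turn, which is the kind of split loading described above.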
What is faster for inference speed for 65B Llama model – Dual RTX 3090/4090 or Mac M2 Pro/Max/Ultra?
Apple's M1 and M2 Pro/Max/Ultra chips are a recommended option for running LLaMA on a laptop. These processors use Apple's unified memory architecture (UMA), which offers fast, low-latency memory access and therefore improves output speed. The M1/M2 Pro supports up to 200 GB/s of unified memory bandwidth, while the M1/M2 Max supports up to 400 GB/s. For example, a MacBook with the M2 Max running llama.cpp can run a 7B model at 38 t/s, a 13B model at 22 t/s, and a 65B model at 5 t/s.
However, in terms of inference speed, a dual RTX 3090/4090 setup is faster than the Mac M2 Pro/Max/Ultra. Two RTX 4090s can run 65B models at about 20 tokens per second, while two affordable secondhand RTX 3090s achieve about 15 tokens per second with ExLlama. The Mac also evaluates prompts more slowly, making the dual-GPU setup more appealing.
Hints and Tips when choosing PC hardware for LLaMA
Build around the GPU
Create a platform that includes the motherboard, CPU, and RAM. The GPU handles training and inference, while the CPU, RAM, and storage manage data loading. Select a motherboard with PCIe 4.0 (or 5.0) support, multiple NVMe drive slots, x16 GPU slots, and ample memory DIMM slots. CPUs with high single-threaded speed, like Ryzen 5000 or Intel's 12th/13th gen, are recommended.
Model Choice and VRAM
For optimal performance in terms of response quality, it is recommended to run the 8-bit 13B model or the 4-bit 30B model on a GPU with at least 20GB of VRAM. Both models provide similar-quality responses, so available VRAM should be the deciding factor. Invest in an NVIDIA GPU with tensor cores to enhance performance. Consider options from the RTX 30 or RTX 40 series, such as the RTX 3090 24GB or RTX 4090 24GB.
Speed Comparison
The 13B model generally runs faster than the 30B model in terms of tokens generated per second. While the exact speed difference may vary, the 13B model tends to offer a noticeable improvement in generation speed compared to the 30B model.
RAM requirements
Aim for at least 1.5 times the VRAM capacity or double the VRAM for optimal performance. Motherboard and CPU selection become critical when working with 128GB or more RAM.
PCIe 4.0 NVMe SSD
The importance of a high sequential speed PCIe 4.0 NVMe SSD is mainly for the initial model load into VRAM. Once the model is loaded, the SSD’s impact on generation speed (tokens/second) is minimal.
Sufficient Regular RAM
Having enough regular RAM, preferably double the VRAM capacity, is essential for the initial model load. Once the model is loaded, its impact on the actual generation speed is limited. Ensuring sufficient regular RAM during the initial load is crucial for a smooth experience.
CPU Single-Threaded Speed
The CPU’s single-threaded speed is important primarily for the initial model load rather than running the model during generation. The CPU’s role is more prominent in tasks such as data preprocessing, model loading, and other non-GPU-dependent operations.
Single GPU performance
A single GPU typically offers faster performance than a multi-GPU setup due to the internal bandwidth advantages within the GPU itself.
Power supply and case
Invest in a high-quality power supply with sufficient capacity to power all components. Choose a spacious case with good airflow for optimal thermals.
DDR5 and future platforms
While DDR5 and future platforms like Zen 4 or AM5 offer advantages, stability and compatibility can vary. Consider investing in a high-end motherboard with good PCIe slot layout and memory support for future upgradeability.
Remember, while these hints and tips provide insights based on experience, individual system configurations and performance may vary. It’s always advisable to experiment and benchmark different setups to find the most suitable solution for your specific needs.