Best Computer to Run LLaMA AI Model at Home (GPU, CPU, RAM, SSD)
Large language models (LLMs) are powerful tools that can generate natural language text for a wide range of tasks and domains. One of the most advanced LLM families is LLaMA (Large Language Model Meta AI), whose largest version is a 65-billion-parameter model developed by Meta AI, the research division of Meta (formerly Facebook).
To run a LLaMA model at home, you will need a computer built around a powerful GPU that can handle the large amount of data and computation required for inference. In this article, we will discuss the hardware requirements for running LLaMA locally.
There are many different ways to run LLaMA models on consumer hardware. The most common is to use a single NVIDIA GeForce RTX 3090 GPU. Its 24 GB of memory is enough to run a 4-bit quantized LLaMA 30B model at around 4 to 10 tokens per second. 24GB of VRAM seems to be the sweet spot for a single-GPU consumer desktop PC.
However, if you need faster performance or want to run a larger model, you can use a dual-GPU setup to fit the model weights inside VRAM. You can also use a data-center GPU such as the NVIDIA A100. This GPU is very expensive, but its 40 GB of memory can handle even larger models.
You can also run LLaMA models on the CPU. To do so, you need a GGML version of the model (available for LLaMA, Vicuna, Alpaca, and GPT4All) and software called llama.cpp. Decent CPUs for running LLaMA are the Intel Core i9-12900K and AMD Ryzen 9 5900X. Check the CPU section for more on this topic.
Keep in mind that training or fine-tuning a LLaMA model requires considerably more VRAM than running one for inference. This is because training must hold in VRAM not only the model weights but also the gradients, optimizer states, and activations for the training batches. The exact amount depends on the size of the model and the training setup.
Let's look at the hardware requirements you need to cover in order to run a LLaMA model on a desktop PC:
GPU for running LLaMA
The GPU is the most important piece of hardware when running LLaMA on a consumer-grade machine, because it performs the majority of the computation needed to run the model. The GPU's performance has a direct impact on inference speed.
Different variations and implementations of the model may require less powerful hardware. However, the GPU will still be the most important part of the system.
GPU requirements for 4-bit quantized LLaMA models
| LLaMA Model | Minimum VRAM Requirement | Recommended GPU Examples |
| --- | --- | --- |
| LLaMA-7B | 6GB | RTX 3060, GTX 1660, 2060, AMD 5700 XT, RTX 3050 |
| LLaMA-13B | 10GB | AMD 6900 XT, RTX 2060 12GB, 3060 12GB, 3080, A2000 |
| LLaMA-30B | 20GB | RTX 3080 20GB, A4500, A5000, 3090, 4090, 6000, Tesla V100 |
| LLaMA-65B | 40GB | A100 40GB, 2x3090, 2x4090, A40, RTX A6000, 8000 |
To run LLaMA-7B effectively, it is recommended to have a GPU with a minimum of 6GB VRAM. A suitable GPU example for this model is the RTX 3060, which offers an 8GB VRAM version. Other GPUs such as the GTX 1660, 2060, AMD 5700 XT, or RTX 3050, which also have 6GB of VRAM, can serve as good options to support LLaMA-7B.
For optimal performance with LLaMA-13B, a GPU with at least 10GB VRAM is suggested. Examples of GPUs that meet this requirement include the AMD 6900 XT, RTX 2060 12GB, 3060 12GB, 3080, or A2000. These GPUs provide the necessary VRAM capacity to handle the computational demands of LLaMA-13B effectively.
To ensure smooth operation of LLaMA-30B, it is advisable to use a GPU with a minimum of 20GB VRAM. The RTX 3080 20GB, A4500, A5000, 3090, 4090, 6000, or Tesla V100 are examples of GPUs that offer the required VRAM capacity. These GPUs enable efficient processing and memory management for LLaMA-30B.
LLaMA-65B performs optimally when paired with a GPU that has a minimum of 40GB VRAM. Suitable examples of GPUs for this model include the A100 40GB, 2x3090, 2x4090, A40, RTX A6000, or 8000. These GPUs provide ample VRAM capacity to handle the intensive computational tasks associated with LLaMA-65B.
Each LLaMA model has specific VRAM requirements, and the suggested GPUs are chosen based on their ability to meet or exceed those requirements, ensuring smooth and efficient performance for the corresponding LLaMA model.
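As a rule of thumb, these VRAM figures can be approximated from the parameter count. The sketch below assumes roughly 0.5 bytes per weight for 4-bit quantization plus an illustrative ~20% overhead for activations and the KV cache; the overhead factor is an assumption for illustration, not a measured value.

```python
# Rough VRAM estimate for running a 4-bit quantized LLaMA model.
# Assumption: weights take bits/8 bytes per parameter, and activations
# plus KV cache add ~20% on top (illustrative overhead factor).

def estimate_vram_gb(n_params_billion: float, bits: int = 4, overhead: float = 1.2) -> float:
    weight_bytes = n_params_billion * 1e9 * bits / 8
    return weight_bytes * overhead / 1e9  # decimal GB

for name, params in [("7B", 7), ("13B", 13), ("30B", 30), ("65B", 65)]:
    print(f"LLaMA-{name}: ~{estimate_vram_gb(params):.1f} GB VRAM")
```

Each estimate lands just below the corresponding minimum in the table above, which leaves a little headroom for longer contexts.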
CPU for LLaMA
In addition to the GPU, you will also need a CPU that can feed the GPU and handle other tasks such as data loading and preprocessing. The CPU requirements for GPTQ (GPU-based) models are much lower than for the versions optimized to run on the CPU.
Good CPUs for LLaMA are the Intel Core i9-10900K, i7-12700K, or Ryzen 9 5900X. However, for better performance, you may want to use a more powerful CPU, such as an AMD Ryzen Threadripper 3990X with 64 cores and 128 threads, which runs at a base frequency of 2.9 GHz and a turbo frequency of 4.3 GHz.
We should note that the models discussed in this article target GPUs, but there are LLaMA models optimized for the CPU. For example, GGML is a format that addresses the limitations posed by GPU memory when working with large models. If you prefer using a CPU, it is recommended to run GGML-format model files.
You can then use software called llama.cpp (an interface to the LLaMA model) to utilize your CPU. A recent update to llama.cpp introduced an enhancement that lets users distribute the model's workload between the CPU and GPU. This not only makes it possible to load significantly larger models but also increases the token/s speed.
Here is an example of running llama.cpp on a Ryzen 7 3700X with 128GB of RAM.
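To illustrate the CPU/GPU split, here is a rough sketch of how one might decide how many transformer layers to offload to the GPU (llama.cpp exposes this as the `--n-gpu-layers` option). The layer counts are the published LLaMA depths; dividing the file size evenly per layer is a simplifying assumption for illustration.

```python
# Sketch: how many transformer layers fit in a given VRAM budget when
# splitting a GGML model between CPU and GPU with llama.cpp.
# Assumption: each layer takes an equal share of the model's size.

LAYERS = {"7B": 32, "13B": 40, "30B": 60, "65B": 80}  # LLaMA layer counts

def layers_that_fit(model: str, model_size_gb: float, vram_budget_gb: float) -> int:
    per_layer_gb = model_size_gb / LAYERS[model]
    return min(LAYERS[model], int(vram_budget_gb / per_layer_gb))

# e.g. a 4-bit 30B model (~19.5 GB) with 8 GB of free VRAM:
print(layers_that_fit("30B", 19.5, 8.0))  # roughly 24 of 60 layers
```

Offloading even a fraction of the layers this way can noticeably raise tokens/second compared with a pure CPU run.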
| GGML Model | Memory per Token | Load Time | Sample Time | Predict Time | Total Time |
| --- | --- | --- | --- | --- | --- |
| LLaMA-7B 4-bit | 14434244 bytes | 1270.15 ms | 325.76 ms | 15147.15 ms / 117.42 ms per token | 17077.88 ms |
| LLaMA-13B 4-bit | 22439492 bytes | 2946.00 ms | 86.11 ms | 7358.48 ms / 216.43 ms per token | 11019.28 ms |
| LLaMA-30B 4-bit | 43387780 bytes | 6666.53 ms | 332.71 ms | 68779.27 ms / 533.17 ms per token | 77333.97 ms |
| LLaMA-65B 4-bit | 70897348 bytes | 14010.35 ms | 335.09 ms | 140527.48 ms / 1089.36 ms per token | 157951.48 ms |
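For easier comparison, the "ms per token" column in the benchmark above converts directly into tokens per second:

```python
# Convert the per-token latencies from the benchmark table above
# into generation throughput (tokens per second).
per_token_ms = {
    "LLaMA-7B 4-bit": 117.42,
    "LLaMA-13B 4-bit": 216.43,
    "LLaMA-30B 4-bit": 533.17,
    "LLaMA-65B 4-bit": 1089.36,
}

for model, ms in per_token_ms.items():
    print(f"{model}: {1000 / ms:.1f} tokens/s")
```

On this CPU the 7B model manages about 8.5 tokens/s, while the 65B model drops below one token per second.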
Memory (RAM) for LLaMA computer
Besides the GPU and CPU, you will also need sufficient RAM (random access memory) and storage space to store the model parameters and data. The minimum RAM requirement for 4-bit LLaMA-30B is 32 GB, which can hold the entire model in memory without swapping to disk. However, for larger datasets or longer texts, you may want to use more RAM, such as 64 GB, 128 GB or 512 GB.
If you are using the CPU-optimized version of LLaMA (via llama.cpp), the model needs to be fully loaded into memory.
The RAM requirements depend on the type of GGML quantization and the model (LLaMA, Alpaca, Wizard, Vicuna, etc.) you are using.
These are the memory (RAM) requirements for LLaMA models run on the CPU:
| GGML Model | Original size | Quantized size (4-bit) | Quantized size (5-bit) | Quantized size (8-bit) |
| --- | --- | --- | --- | --- |
| 7B | 13 GB | 3.9 – 7.5 GB | 7.5 – 8.5 GB | 8.5 – 10.0 GB |
| 13B | 24 GB | 7.8 – 11 GB | 11.5 – 13.5 GB | 13.5 – 17.5 GB |
| 30B | 60 GB | 19.5 – 23.0 GB | 23.5 – 27.5 GB | 28.5 – 38.5 GB |
| 65B | 120 GB | 38.5 – 47.0 GB | 47.0 – 52.0 GB | 71.0 – 80.0 GB |
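The lower ends of these ranges are consistent with a back-of-the-envelope estimate: parameters × bits per weight / 8. GGML block quantization stores scale factors alongside the weights, so the effective bit width sits a little above the nominal one; the ~0.5 extra bits used here is an approximation for the q4_0 format.

```python
# Back-of-the-envelope quantized model size.
# Assumption: block-wise quantization adds ~0.5 effective bits per
# weight on top of the nominal bit width (approximation for q4_0).

def quantized_size_gb(n_params_billion: float, nominal_bits: int,
                      effective_extra: float = 0.5) -> float:
    bits = nominal_bits + effective_extra
    return n_params_billion * 1e9 * bits / 8 / 1e9

print(f"7B at 4-bit: ~{quantized_size_gb(7, 4):.1f} GB")   # matches the table's 3.9 GB lower bound
print(f"30B at 4-bit: ~{quantized_size_gb(30, 4):.1f} GB")
```

The upper ends of the ranges reflect less aggressive quantization variants and runtime buffers on top of the file size.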
Storage (SSD) for LLaMA
The recommended storage for LLaMA is a 1 TB NVMe SSD, which can hold the model files and data files with fast read and write speeds. However, for more data or backup purposes, you may want more storage space, such as a 2 TB or 4 TB SSD.
Choose high-speed storage. Opt for a PCIe 4.0 NVMe SSD with excellent sequential speeds to facilitate fast data transfer between storage and system RAM.
How does model quantization affect the choice of GPU?
Quantized LLMs use fewer bits to store and process the model's weights and activations. This makes them smaller and more efficient to deploy on a GPU.
4-bit quantized LLMs use only 4 bits per weight or activation. This means they take up much less memory and computation time than full-precision models. They can run smoothly on GPUs with low VRAM capacities.
8-bit quantized LLMs use 8 bits per weight or activation. This still reduces memory and computation costs compared to full-precision models, but not as much as 4-bit quantization. They need more GPU memory and computational power to run well. They are more suitable for GPUs with high VRAM capacities and computational capabilities.
To sum up, 4-bit quantized LLMs are more efficient and can run on GPUs with low VRAM capacities. 8-bit quantized LLMs are slightly less efficient and need GPUs with high VRAM capacities and computational capabilities.
| LLaMA Precision | GPU Memory Requirements | Computational Demands | Suitable GPU |
| --- | --- | --- | --- |
| Native (32-bit) | Highest requirements | Highest computational demands | GPUs with the largest VRAM capacities and high computational capabilities |
| 16-bit (half precision) | Moderate requirements | Moderate computational demands | GPUs with moderate VRAM capacities and good computational capabilities |
| 8-bit Quantized | Lower requirements | Lower computational demands | GPUs with mid-range VRAM capacities |
| 4-bit Quantized | Lowest requirements | Lowest computational demands | GPUs with limited VRAM capacities |
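The table's qualitative tiers follow from simple arithmetic: the weights alone need parameters × bits / 8 bytes, with the KV cache and activations coming on top, so treat these figures as lower bounds.

```python
# Weight memory at each precision for a 13B model: params * bits / 8.
# KV cache and activations are not included, so these are lower bounds.

def weight_memory_gb(n_params_billion: float, bits: int) -> float:
    return n_params_billion * 1e9 * bits / 8 / 1e9

for bits in (32, 16, 8, 4):
    print(f"13B at {bits}-bit: {weight_memory_gb(13, bits):.1f} GB")
```

Halving the bit width halves the weight memory, which is why a 13B model that needs 52 GB at full precision fits in 6.5 GB at 4-bit.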
As you can see, the precision of a LLaMA model has a direct impact on its GPU memory requirements and computational demands. Native (32-bit) models require the most GPU memory and computational power, while 4-bit quantized models require the least.
The suitable GPU for an LLaMA will depend on its precision and the specific tasks that you want to use it for. If you need to run a large LLaMA on a variety of tasks, then you will need a GPU with a large VRAM capacity and high computational capabilities. If you only need to run a small LLaMA on a few specific tasks, then you can get away with using a GPU with a smaller VRAM capacity and lower computational capabilities.
It is important to note that the accuracy of the model will also decrease as the quantization level decreases. This is because the reduced precision can lead to errors in the model’s predictions.
The best quantization level for you will depend on your specific needs and requirements. If you need a model that is small and efficient, then you may want to consider using a 4-bit or 8-bit quantized model. However, if you need a model that is highly accurate, then you may want to use a 16-bit model.
Does a dual-GPU setup deliver better performance than a single GPU when used with LLaMA?
Adding a second GPU may not speed up text generation as much as expected. A bottleneck seems to block the simple solution of adding more compute power. Some tests have shown surprising results where lower-end GPUs were faster than higher-end GPUs in generating tokens/second. The reason for this is unclear, and text generation programs may need better optimization to use dual GPU setups well.
Dual-GPU setups have more VRAM in total, but each GPU still has its own VRAM limit. The 30B LLaMA needs about 20GB of VRAM, so with two RTX 3090s (24GB each), a model that cannot be split still has only 24GB available to it. The model should fit in the VRAM of a single GPU to run well.
However, if the model is too large to fit within the VRAM of a single GPU and needs to utilize system RAM, using multiple GPUs can indeed speed up the process. In such cases, each GPU can handle a portion of the model, and the computational load is distributed among them. This parallelization can lead to speed improvements for large models that exceed the VRAM capacity of a single GPU.
Therefore, multiple GPUs are commonly employed when dealing with large models that have high VRAM requirements. It allows for efficient utilization of resources and accelerates the training or inference process.
Splitting a big language model like the 65B LLaMA over multiple GPUs with model parallelism can be hard and may cause communication delays. Splitting and syncing the model’s parameters and computations over GPUs needs careful coding and may not always improve performance much.
Dual GPU setups may not work well with some software. Some machine learning frameworks or libraries may not use multiple GPUs fully, and it may take extra work to set up and optimize the system for dual GPU use.
These limitations mean that it’s important to compare the possible benefits with the difficulty and potential problems of using a dual GPU setup for a 30B LLaMA. Sometimes, getting a stronger single GPU or trying other optimization methods may be a better way.
Hints and Tips when choosing PC hardware for LLaMA
Build around the GPU
Create a platform that includes the motherboard, CPU, and RAM. The GPU handles training and inference, while the CPU, RAM, and storage manage data loading. Select a motherboard with PCIe 4.0 (or 5.0) support, multiple NVMe drive slots, x16 GPU slots, and ample memory DIMMs. CPUs with high single-threaded speed, like Ryzen 5000 or Intel’s 12th/13th gen, are recommended.
Model Choice and VRAM
For optimal response quality, it is recommended to run the 8-bit 13B model or the 4-bit 30B model on a GPU with at least 20GB of VRAM. Both models provide similar-quality responses, so VRAM availability should be the deciding factor. Invest in an Nvidia GPU with tensor cores to enhance performance; consider the RTX 30 or RTX 40 series, such as the RTX 3090 24GB or RTX 4090 24GB.
The 13B model generally runs faster than the 30B model in terms of tokens generated per second. While the exact speed difference may vary, the 13B model tends to offer a noticeable improvement in generation speed compared to the 30B model.
Aim for system RAM of at least 1.5 times your VRAM capacity, or double the VRAM for optimal performance. Motherboard and CPU selection becomes critical when working with 128GB or more of RAM.
PCIe 4.0 NVMe SSD
The importance of a high sequential speed PCIe 4.0 NVMe SSD is mainly for the initial model load into VRAM. Once the model is loaded, the SSD’s impact on generation speed (tokens/second) is minimal.
Sufficient Regular RAM
Having enough regular RAM, preferably double the VRAM capacity, is essential for the initial model load. Once the model is loaded, its impact on the actual generation speed is limited. Ensuring sufficient regular RAM during the initial load is crucial for a smooth experience.
CPU Single-Threaded Speed
The CPU’s single-threaded speed is important primarily for the initial model load rather than running the model during generation. The CPU’s role is more prominent in tasks such as data preprocessing, model loading, and other non-GPU-dependent operations.
Scaling for Increased Speed
If you need to increase the speed of text generation from 15 tokens/second to 30 tokens/second, setting up a literal clone of the entire PC may be more effective than adding a second 3090 card. Doubling the overall system resources, including CPU and RAM, may yield better results in increasing the text generation speed.
Single GPU performance
A single GPU typically offers faster performance than a multi-GPU setup due to the internal bandwidth advantages within the GPU itself.
Power supply and case
Invest in a high-quality power supply with sufficient capacity to power all components. Choose a spacious case with good airflow for optimal thermals.
DDR5 and future platforms
While DDR5 and future platforms like Zen 4 or AM5 offer advantages, stability and compatibility can vary. Consider investing in a high-end motherboard with good PCIe slot layout and memory support for future upgradeability.
Remember, while these hints and tips provide insights based on experience, individual system configurations and performance may vary. It’s always advisable to experiment and benchmark different setups to find the most suitable solution for your specific needs.
Allan Witt is co-founder and editor in chief of Hardware-corner.net. Computers and the web have fascinated him since childhood. In 2011 he began training as an IT specialist at a medium-sized company and started a blog at the same time. After completing his training, he worked as a system administrator at the same company for two years, while building custom gaming rigs at a local hardware shop as a part-time job. Building PCs is now his full-time job.