GPU and Apple Silicon Benchmarks with Large Language Models
This chart showcases a range of benchmarks for GPU performance while running large language models like LLaMA and Llama-2, using various quantizations. The data covers a set of GPUs, from Apple Silicon M series chips to Nvidia GPUs, helping you make an informed decision if you’re considering using a large language model locally.
To switch GPUs on or off, use the legend below the graph. When you hover over a data point, you’ll see additional details about each model, such as an estimated system price.
Memory Boundary Conditions for GPUs Running Large Language Models
Our benchmarks emphasize the crucial role of VRAM capacity when running large language models. Even if a GPU can handle a given model size and quantization at a short context, for instance 512 tokens, it may struggle or fail with larger contexts due to VRAM limitations. Take the RTX 3090, which comes with 24 GB of VRAM, as an example. It handled the 30 billion (30B) parameter Airoboros Llama-2 model with 5-bit quantization (Q_5), consuming around 23 GB of VRAM. However, expanding the context caused the GPU to run out of memory. This illustrates why users need to balance model size, quantization level, and context length.
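To make the context-length effect concrete, here is a minimal sketch of how the key/value cache grows with context. It is not how the benchmark measured memory. It assumes a dense model without grouped-query attention, a 16-bit cache, and LLaMA-class 30B dimensions (60 layers, embedding width 6656). The function names are hypothetical, and compute buffers and driver overhead are ignored.

```python
# Rough KV-cache estimate for a dense transformer (no grouped-query attention).
# The default layer count and embedding width are ASSUMPTIONS for a
# LLaMA-class 30B model; they are not taken from the benchmark itself.

GIB = 1024 ** 3


def kv_cache_bytes(n_layers: int, n_embd: int, n_ctx: int,
                   bytes_per_value: int = 2) -> int:
    """Bytes needed to cache keys and values for n_ctx tokens."""
    # Factor 2: one key tensor and one value tensor per layer.
    return 2 * n_layers * n_ctx * n_embd * bytes_per_value


def fits_in_vram(weights_gib: float, n_ctx: int, vram_gib: float,
                 n_layers: int = 60, n_embd: int = 6656) -> bool:
    """True if weights plus KV cache fit; ignores compute buffers and overhead."""
    kv_gib = kv_cache_bytes(n_layers, n_embd, n_ctx) / GIB
    return weights_gib + kv_gib <= vram_gib


if __name__ == "__main__":
    for n_ctx in (512, 2048, 4096):
        kv_gib = kv_cache_bytes(60, 6656, n_ctx) / GIB
        print(f"context {n_ctx:>5}: KV cache ~{kv_gib:.2f} GiB")
```

Under these assumptions the cache grows linearly, from about 0.76 GiB at 512 tokens to about 6.1 GiB at 4,096 tokens, which is why a model that barely fits at a short context can run out of memory at a longer one.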
Key Components of the Benchmark
GPUs Tested
We’ve included a variety of mostly consumer-grade GPUs that are suitable for local setups. For comparison, a data-center card such as the Nvidia A100 80GB sells for around $15,000 on the second-hand market, whereas a brand-new dual RTX 4090 setup, which can run 70B models at a reasonable speed, costs about $4,000.
GPU | VRAM | Memory Bandwidth | TFLOPS | 7B 4-bit TG (token generation) |
---|---|---|---|---|
RTX 3060 | 12GB | 360 GB/s | 12.74 | 59.86 t/s |
RTX 3080 Ti | 12GB | 912 GB/s | 34.10 | 108.46 t/s |
RTX 4080 | 16GB | 716 GB/s | 48.74 | 112.85 t/s |
RTX 4000 Ada | 20GB | 360 GB/s | 26.73 | 64.53 t/s |
RTX 3090 | 24GB | 936 GB/s | 35.58 | 120.6 t/s |
RTX 4090 | 24GB | 1,008 GB/s | 82.58 | 139.37 t/s |
(2x) RTX 3060 | 24GB | 360 GB/s | 12.74 | 59.86 t/s |
(2x) RTX 3090 | 48GB | 936 GB/s | 35.58 | 120.6 t/s |
(2x) RTX 4090 | 48GB | 1,008 GB/s | 82.58 | 139.37 t/s |
RTX A6000 | 48GB | 768 GB/s | 38.71 | 107.11 t/s |
M3 Max 40-GPU | 48GB | 400 GB/s | 13.6 | 66.31 t/s |
(3x) RTX 3090 | 72GB | 936 GB/s | 35.58 | 120.6 t/s |
M2 Ultra 76-GPU | 192GB | 800 GB/s | 27.2 | 93.86 t/s |
Model Quantization
The benchmark includes model sizes ranging from 7 billion (7B) to 120 billion (120B) parameters, illustrating the influence of various quantizations on processing speed. Our tests were conducted on the LLaMA, Llama-2 and Mixtral MoE models; however, you can make rough estimates of the inference speed for other models, such as Mistral and Yi, based on the size of their weights in gigabytes. The table below lists the models we used, sorted by size, along with their quantization.
Model Name | Parameter Count | Model Quantization | Model Size (GB) |
---|---|---|---|
7B_q4_0 | 7B | 4-bit | 3.8 |
7B_q5_0 | 7B | 5-bit | 4.6 |
7B_q8_0 | 7B | 8-bit | 7.1 |
13B_q4_0 | 13B | 4-bit | 7.8 |
13B_q5_0 | 13B | 5-bit | 8.9 |
7B_f16 | 7B | 16-bit | 13.4 |
13B_q8_0 | 13B | 8-bit | 13.8 |
30B_q4_0 | 30B | 4-bit | 19.1 |
30B_q5_0 | 30B | 5-bit | 23.2 |
13B_f16 | 13B | 16-bit | 24.2 |
8x7B_q5_0 | 8x7B (46B) | 5-bit | 32.2 |
30B_q8_0 | 30B | 8-bit | 35.9 |
65B_q4_0 | 65B | 4-bit | 36.8 |
70B_q4_0 | 70B | 4-bit | 38.9 |
30B_f16 | 30B | 16-bit | 60 |
120B_Q4 | 120B | 4-bit | 66 |
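One way to turn weight size into a rough speed estimate is sketched below. It rests on a common rule of thumb rather than the method used for this chart: token generation is largely memory-bandwidth-bound, so speed scales roughly inversely with the size of the weights. Both function names are hypothetical, and real results depend on the backend, the quantization kernel, and the context length.

```python
# Rough token-generation estimates from weight size.
# ASSUMPTION: generation is memory-bandwidth-bound, so every token requires
# reading all weights once. This is a rule of thumb, not a measurement.


def bandwidth_bound_tps(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Theoretical upper bound on tokens/s if weights are read once per token."""
    return bandwidth_gb_s / model_size_gb


def scaled_estimate_tps(measured_tps: float, measured_size_gb: float,
                        target_size_gb: float) -> float:
    """Scale a measured speed to another model size on the same GPU."""
    return measured_tps * measured_size_gb / target_size_gb


if __name__ == "__main__":
    # RTX 3090 figures from the tables above: 936 GB/s, 7B_q4_0 = 3.8 GB, 120.6 t/s.
    bound = bandwidth_bound_tps(936, 3.8)
    print(f"upper bound for 7B_q4_0: {bound:.0f} t/s (measured: 120.6 t/s)")
    # Estimate for 13B_q4_0 (7.8 GB) on the same card.
    print(f"estimate for 13B_q4_0: {scaled_estimate_tps(120.6, 3.8, 7.8):.1f} t/s")
```

The measured 7B result sits well below the theoretical bound, so scaling from a measured figure on the same GPU is usually the more realistic of the two estimates.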
Speed Measurement
Performance is quantified as tokens per second (t/s), averaged over three test runs, each with a 512-token context and 1,024 generated tokens.
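The averaging itself is simple arithmetic, sketched below. The function names are hypothetical and the timings in the usage example are purely illustrative, not benchmark results.

```python
from statistics import mean


def tokens_per_second(n_tokens: int, elapsed_s: float) -> float:
    """Generation speed for a single run."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return n_tokens / elapsed_s


def average_tps(n_tokens: int, timings_s: list[float]) -> float:
    """Mean tokens/s across several runs that each generate n_tokens."""
    return mean(tokens_per_second(n_tokens, t) for t in timings_s)


if __name__ == "__main__":
    # Illustrative timings only (seconds to generate 1,024 tokens in each run).
    print(f"{average_tps(1024, [8.4, 8.6, 8.5]):.2f} t/s")
```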
Pricing Information
The chart includes estimated prices for computer systems equipped with the tested GPUs or Apple Silicon chips. Prices reflect the cost of complete systems, incorporating both new, second-hand, and refurbished units, to account for market availability.
Benchmarking Environment
Data was gathered from user benchmarks across the web and our personal benchmarks. We used Ubuntu 22.04, CUDA 12.1, and llama.cpp (build: 8504d2d0, 2097).
For the dual GPU setups, we used both the -sm row and -sm layer options in llama.cpp. With -sm row, the dual RTX 3090 was about 3 tokens per second (t/s) faster, whereas the dual RTX 4090 performed better with -sm layer, gaining about 5 t/s. Note, however, that the -sm row option reduces prompt processing speed by approximately 60%.
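To compare the two split modes on your own hardware, one option is llama.cpp's llama-bench tool. The sketch below only assembles the command line. It assumes llama-bench is the tool in use (the article does not say which binary produced the results), and the binary location, model path, and helper name are hypothetical. Note that llama-bench reports prompt processing (-p) and token generation (-n) as separate tests.

```python
import shlex


def build_bench_command(model_path: str, split_mode: str,
                        binary: str = "./llama-bench") -> list[str]:
    """Assemble a llama-bench invocation for one multi-GPU split mode."""
    if split_mode not in ("layer", "row"):
        raise ValueError("split_mode must be 'layer' or 'row'")
    return [
        binary,
        "-m", model_path,   # path to a GGUF model (hypothetical in the example)
        "-ngl", "99",       # offload all layers to the GPUs
        "-p", "512",        # prompt-processing test with 512 tokens
        "-n", "1024",       # token-generation test with 1,024 tokens
        "-r", "3",          # repeat each test three times
        "-sm", split_mode,  # how the model is split across GPUs
    ]


if __name__ == "__main__":
    for mode in ("layer", "row"):
        cmd = build_bench_command("models/llama-2-70b.Q4_0.gguf", mode)
        print(shlex.join(cmd))
        # To actually run it: subprocess.run(cmd, check=True)
```

Running both variants on the same machine shows the trade-off between generation speed and prompt processing speed for that particular pair of GPUs.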
Tests were conducted on systems sourced from two cloud GPU providers, vast.ai and runpod.io.
This GPU benchmark graph is a work in progress and will be updated with more GPUs regularly.
Allan Witt
Allan Witt is co-founder and editor in chief of Hardware-corner.net. Computers and the web have fascinated me since I was a child. In 2011, I started training as an IT specialist in a medium-sized company and started a blog at the same time. I really enjoy blogging about tech. After successfully completing my training, I worked as a system administrator in the same company for two years. As a part-time job, I started tinkering with pre-built PCs and building custom gaming rigs at a local hardware shop. The desire to build PCs full-time grew stronger, and now this is my full-time job.