RTX 3090 and Local LLMs: What Fits in 24GB VRAM, from Model Size to Context Limits
For the technically savvy enthusiast building systems for local LLM inference on a budget, the NVIDIA GeForce RTX 3090 presents a compelling value proposition, particularly on the second-hand market. Its generous 24GB GDDR6X VRAM buffer is the critical enabler, allowing increasingly sophisticated models to run entirely within the GPU’s memory and bypassing the performance bottlenecks of system RAM offloading. Compared with newer, pricier options such as the RTX 4090, which offers the same VRAM capacity, or with the anticipated cost of upcoming generations, the RTX 3090 delivers substantial memory capacity and bandwidth (936 GB/s) at a price point that aligns well with the performance-per-dollar focus of experienced builders comfortable with system tuning and hardware management.
Successfully leveraging this 24GB capacity requires a firm grasp of the trade-offs between LLM parameter count, quantization levels, and context length. Quantization, especially to 4-bit formats, is essential for fitting larger models (up to ~32B parameters) by drastically reducing their VRAM footprint. However, the desired context length also significantly consumes VRAM via the KV cache. Therefore, running models effectively on the RTX 3090 involves a careful balancing act: selecting the appropriate quantization to fit the base model, and then adjusting the context length to stay within the 24GB limit, understanding that higher fidelity (less quantization) or longer context windows necessitate compromises, particularly with larger models near the card’s VRAM ceiling.
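As a rough starting point for that balancing act, the weight footprint of a GGUF file can be estimated from the parameter count and the average bits per weight of the chosen quantization. The sketch below is a back-of-the-envelope estimator; the bits-per-weight averages are approximations for common GGUF quant types, not exact values for any particular model:

```python
def estimate_weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate GGUF weight footprint in GiB: parameters x average bits per weight."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1024**3

# Approximate average bits per weight for common GGUF quantization types.
BPW = {"Q4_K_M": 4.85, "Q5_K_M": 5.7, "Q6_K": 6.6, "Q8_0": 8.5}

VRAM_GIB = 24  # RTX 3090 memory budget
weights = estimate_weight_gib(32, BPW["Q4_K_M"])
print(f"32B @ Q4_K_M: ~{weights:.1f} GiB of weights, "
      f"~{VRAM_GIB - weights:.1f} GiB left for KV cache and compute buffers")
```

Whatever the estimate leaves over is shared by the KV cache and the backend’s compute buffers, which is why context length becomes the main adjustable knob once a quantization level is chosen.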
What You Can Run on an RTX 3090 (24GB VRAM, No Offloading)
Before diving into specific model sizes, it’s crucial to understand that the following VRAM usage figures are estimates. Actual memory consumption can vary slightly based on the specific model architecture (number of layers, hidden dimensions, attention heads), the inference software being used (like llama.cpp, LM Studio), and driver overhead.
Furthermore, the context length calculations presented here assume an unquantized KV cache. Advanced users can often enable KV cache quantization (to 8-bit or even 4-bit), which can reduce the VRAM impact of long contexts, allowing for even larger context windows than estimated below within the same 24GB budget. Our figures provide a solid baseline assuming standard KV cache handling.
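To see where the context cost comes from, note that the KV cache grows linearly with context length: every token stores one key and one value vector per layer. Below is a minimal sketch assuming a grouped-query-attention model with illustrative architecture values (64 layers, 8 KV heads, head dimension 128, roughly the shape of a dense 30B-class model; check the actual model config) and treating a q8_0 cache as about 1 byte per element:

```python
def kv_cache_gib(context_tokens: int, n_layers: int, n_kv_heads: int,
                 head_dim: int, bytes_per_elem: float = 2.0) -> float:
    """KV cache size: 2 (K and V) x layers x KV heads x head_dim x context tokens."""
    return (2 * n_layers * n_kv_heads * head_dim
            * context_tokens * bytes_per_elem) / 1024**3

# bytes_per_elem: 2.0 for an unquantized FP16 cache, ~1.0 for q8_0,
# ~0.5 for q4_0 (ignoring small per-block overheads).
for ctx in (4_096, 12_288, 32_768):
    fp16 = kv_cache_gib(ctx, n_layers=64, n_kv_heads=8, head_dim=128)
    q8 = kv_cache_gib(ctx, n_layers=64, n_kv_heads=8, head_dim=128, bytes_per_elem=1.0)
    print(f"{ctx:>6} tokens: ~{fp16:.1f} GiB FP16 cache, ~{q8:.1f} GiB q8_0 cache")
```

Halving the per-element size is why an 8-bit KV cache roughly doubles the context that fits in the same leftover VRAM.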
Note that the listed VRAM requirements reflect only the LLM and its context cache, excluding operating system overhead; achieving these maximums often requires dedicating the RTX 3090 solely to compute by using an integrated GPU for display output or operating the system headlessly via a text terminal.
The following tests were conducted on a headless Ubuntu 22.04.5 LTS server using llama.cpp build 5258 with Open WebUI as the frontend.
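For readers reproducing similar measurements, actual usage is easiest to confirm from the driver itself rather than from estimates. Here is a minimal sketch, assuming the NVIDIA driver’s nvidia-smi utility is on the PATH, that polls per-GPU memory use from Python:

```python
import subprocess

def gpu_memory_used_mib() -> list[int]:
    """Return memory currently in use on each GPU, in MiB, as reported by nvidia-smi."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        text=True,
    )
    return [int(line) for line in out.strip().splitlines()]

if __name__ == "__main__":
    for idx, mib in enumerate(gpu_memory_used_mib()):
        print(f"GPU {idx}: {mib / 1024:.2f} GiB in use")
```

Polling this once after loading a model and again after a long prompt makes the split between the weight footprint and KV cache growth easy to see.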
30-32 Billion Parameter Models
Models in the 30-32 billion parameter range represent the upper limit of what can comfortably fit entirely within an RTX 3090’s 24GB VRAM using common quantization levels like 4-bit or 5-bit. This class includes capable models such as various Qwen variants like Qwen_Qwen3-32B, Qwen_QwQ-32B, the coding-focused Qwen2.5-Coder-32B-Instruct, and the compact Qwen_Qwen3-30B-A3B. Fitting these models requires careful consideration of quantization and context length.
Model Tested | Quantization | Context | TG* | PP* | VRAM (GB) |
---|---|---|---|---|---|
Qwen3 32B | Q4_K_M | 12K | 22 t/s | 590 t/s | 23.40 |
Qwen3 32B | Q5_K_M | 4K | 25 t/s | 770 t/s | 23.86 |
Qwen3 30B A3B | Q4_K_M | 32K | 35 t/s | 750 t/s | 23.91 |
Qwen3 30B A3B | Q5_K_M | 16K | 54 t/s | 1035 t/s | 23.93 |
*TG – Token Generation | *PP – Prompt Processing
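As a rough guide to how TG and PP figures like these can be gathered, llama.cpp’s bundled server reports per-request timing data. The sketch below assumes a llama-server instance already running on its default local port and queries its native /completion endpoint; the exact timing field names can vary between builds, so treat it as illustrative rather than canonical:

```python
import requests  # assumes the third-party 'requests' package is installed

# Hypothetical local llama-server instance on its default host/port; adjust as needed.
URL = "http://127.0.0.1:8080/completion"

payload = {
    "prompt": "Explain the difference between GDDR6 and GDDR6X in two sentences.",
    "n_predict": 256,   # number of tokens to generate
    "temperature": 0.7,
}

resp = requests.post(URL, json=payload, timeout=600)
resp.raise_for_status()

# Recent llama-server builds include a "timings" object in the response.
timings = resp.json().get("timings", {})
print("Prompt processing (PP):", timings.get("prompt_per_second"), "t/s")
print("Token generation (TG): ", timings.get("predicted_per_second"), "t/s")
```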
These results show that 30-32B models are viable on a 24GB card, but compromises are necessary. Using 4-bit quantization (Q4_K_M) offers more headroom for context, allowing roughly 12k on the dense Qwen3 32B and up to 32k on the Qwen3 30B A3B, while staying under the 24GB limit. Stepping up to 5-bit quantization (Q5_K_M) for potentially better quality significantly restricts the usable context length, to around 4k on the dense 32B model and 16k on the 30B A3B. It’s worth noting that for complex reasoning tasks, models like Qwen_Qwen3-30B-A3B or Qwen_QwQ-32B often benefit significantly from larger context windows; users should be aware that achieving a practical context size like 16k might necessitate using 4-bit quantization.
20-27 Billion Parameter Models
Dropping down to the 20-27B parameter range provides considerably more flexibility within the 24GB VRAM envelope. This category includes interesting models like Google’s Gemma 3 27B Instruct (google_gemma-3-27b-it), Mistral’s compact yet capable Mistral-Small-24B, and the specialized coding model Codestral-22B-v0.1. With these models, users can often achieve longer context lengths even with higher quality quantization levels.
Model Tested | Parameters | Quantization | Context | VRAM Used (GB) |
---|---|---|---|---|
Gemma 3 27b | 27B | Q4_K_M | 12K | ~23.90 |
Gemma 3 27b | 27B | Q5_K_M | 8K | ~23.79 |
Gemma 3 27b | 27B | Q6_K | 4K | ~23.87 |
Mistral Small 24B | 24B | Q4_K_M | 36K | ~23.63 |
Mistral Small 24B | 24B | Q5_K_M | 26K | ~23.61 |
Mistral Small 24B | 24B | Q6_K | 18K | ~23.76 |
Mistral Small 24B | 24B | Q8_0 | 1K | ~24.00 |
Codestral 22B v0.1 | 22B | Q4_K_M | 36K | ~23.63 |
Codestral 22B v0.1 | 22B | Q5_K_M | 26K | ~23.91 |
Codestral 22B v0.1 | 22B | Q6_K | 18K | ~23.80 |
Codestral 22B v0.1 | 22B | Q8_0 | 1K | ~23.37 |
As demonstrated, the 20-27B models offer a better balance on the RTX 3090. Using 4-bit quantization allows for substantial context lengths, up to 36k tokens on Mistral Small 24B and Codestral 22B, though the heavier Gemma 3 27B tops out around 12k. Moving up to 5-bit or 6-bit quantization still permits usable context sizes, roughly 4k-8k on Gemma and 18k-26k on the 22-24B models. While 8-bit quantization (Q8_0) pushes the VRAM limit even with minimal context, it remains a possibility for specific use cases where model fidelity is paramount and context length is not critical. This size class provides a great blend of capability and operational flexibility on 24GB.
12-16 Billion Parameter Models
Stepping into the 12-16B parameter range unlocks significant headroom on an RTX 3090, allowing for very long context windows or the use of higher-fidelity quantization methods without VRAM anxiety. Notable models here include the DeepSeek Coder V2 Lite Instruct, Qwen3 14B, Qwen2.5 Coder 14B, the distilled DeepSeek-R1-Distill-Qwen-14B, and Google’s Gemma 3 12B Instruct (google_gemma-3-12b-it).
Model Tested | Parameters | Quantization | Context | VRAM Used (GB) |
---|---|---|---|---|
DeepSeek Coder V2 Lite | 16B | Q4_K_M | 60K | ~23.20 |
DeepSeek Coder V2 Lite | 16B | Q5_K_M | 52K | ~23.79 |
DeepSeek Coder V2 Lite | 16B | Q6_K | 43K | ~23.87 |
DeepSeek Coder V2 Lite | 16B | Q8_0 | 30K | ~23.84 |
Qwen3 14B | 14B | Q4_K_M | 62K | ~23.70 |
Qwen3 14B | 14B | Q5_K_M | 52K | ~23.73 |
Qwen3 14B | 14B | Q6_K | 50K | ~23.87 |
Qwen3 14B | 14B | Q8_0 | 32K | ~23.16 |
With 12-16B models, the 24GB VRAM of the RTX 3090 feels capacious. Users can comfortably employ 4-bit quantization and achieve extremely long context lengths, often exceeding 60k tokens. Even higher quality 5-bit and 6-bit quantizations allow for context lengths easily surpassing 40-50k tokens. Remarkably, even 8-bit quantization becomes practical, supporting substantial context lengths of around 30k tokens while staying within the 24GB budget. This makes the 12-16B class highly versatile for tasks demanding both quality and extensive context memory on this hardware.
7-8 Billion Parameter Models
For users prioritizing maximum context length or the absolute highest quality quantization possible within 24GB, the 7-8 billion parameter models are an excellent choice. This popular category includes mainstays like Qwen3 8B, Meta Llama 3.1 8B Instruct, and specialized distilled models like DeepSeek-R1-Distill-Qwen-7B.
Model Tested | Parameters | Quantization | Context | VRAM Used (GB) |
---|---|---|---|---|
Qwen3 8B | 8B | Q4_K_M | 90K | ~23.57 |
Qwen3 8B | 8B | Q5_K_M | 86K | ~23.55 |
Qwen3 8B | 8B | Q6_K | 82K | ~23.80 |
Qwen3 8B | 8B | Q8_0 | 72K | ~23.60 |
Running 7-8B models on an RTX 3090 leaves significant VRAM headroom. This allows for pushing context lengths towards the theoretical maximums supported by the models themselves (often 80k tokens or more) even with higher quality 5-bit or 6-bit quantization. Crucially, even full 8-bit quantization (Q8_0) fits comfortably alongside very large context windows (around 70k+ tokens). This makes the 7-8B class ideal for applications needing to process very large documents or maintain extensive conversation history without sacrificing model fidelity due to aggressive quantization.
Conclusion
The NVIDIA GeForce RTX 3090, particularly sourced from the second-hand market, represents a compelling hardware choice for the price-conscious enthusiast aiming to run LLMs entirely within GPU memory. Its 24GB VRAM buffer provides substantial capacity, capable of fully hosting quantized models up to the 30-32 billion parameter range with careful management of context length and quantization level. As model size decreases, the flexibility increases dramatically, allowing for extensive context windows or higher-fidelity quantization (like Q6_K or Q8_0) with models in the 7B to 16B range. While newer GPUs exist and future releases promise more, the RTX 3090 delivers a potent combination of VRAM capacity, memory bandwidth, and value that is hard to beat today for local LLM inference without resorting to slower system memory offloading. For users looking to eventually run even larger models, the experience gained managing VRAM on a 3090 provides a solid foundation for considering multi-GPU configurations or evaluating the cost-benefit of future higher-VRAM graphics cards.
Allan Witt
Allan Witt is Co-founder and editor in chief of Hardware-corner.net. Computers and the web have fascinated me since I was a child. In 2011 I started training as an IT specialist at a medium-sized company and launched a blog at the same time. I really enjoy blogging about tech. After successfully completing my training, I worked as a system administrator in the same company for two years. As a part-time job I started tinkering with pre-built PCs and building custom gaming rigs at a local hardware shop. The desire to build PCs full-time grew stronger, and now this is my full-time job.