RTX 3090 and Local LLMs: What Fits in 24GB VRAM, from Model Size to Context Limits
For the technically savvy enthusiast building systems for local LLM inference on a budget, the NVIDIA GeForce RTX 3090 presents a compelling value proposition, particularly on the second-hand market. Its generous 24GB GDDR6X VRAM buffer is the critical enabler, allowing increasingly sophisticated models to run entirely within the GPU’s memory and bypassing the performance bottlenecks of system RAM offloading. Compared with newer, pricier options such as the RTX 4090, which offers the same VRAM capacity, or with the anticipated cost of upcoming generations, the RTX 3090 delivers substantial memory capacity and bandwidth (936 GB/s) at a price point that aligns well with the performance-per-dollar focus of experienced builders comfortable with system tuning and hardware management.
Successfully leveraging this 24GB capacity requires a firm grasp of the trade-offs between LLM parameter count, quantization levels, and context length. Quantization, especially to 4-bit formats, is essential for fitting larger models (up to ~32B parameters) by drastically reducing their VRAM footprint. However, the desired context length also significantly consumes VRAM via the KV cache. Therefore, running models effectively on the RTX 3090 involves a careful balancing act: selecting the appropriate quantization to fit the base model, and then adjusting the context length to stay within the 24GB limit, understanding that higher fidelity (less quantization) or longer context windows necessitate compromises, particularly with larger models near the card’s VRAM ceiling.
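As a rough starting point for that balancing act, the weight footprint of a GGUF file can be estimated from the parameter count and the average bits per weight of the chosen quantization. The sketch below is a back-of-the-envelope estimator; the bits-per-weight averages are approximations for common GGUF quant types, not exact values for any particular model:

```python
def estimate_weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate GGUF weight footprint in GiB: parameters x average bits per weight."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1024**3

# Approximate average bits per weight for common GGUF quantization types.
BPW = {"Q4_K_M": 4.85, "Q5_K_M": 5.7, "Q6_K": 6.6, "Q8_0": 8.5}

VRAM_GIB = 24  # RTX 3090 memory budget
weights = estimate_weight_gib(32, BPW["Q4_K_M"])
print(f"32B @ Q4_K_M: ~{weights:.1f} GiB of weights, "
      f"~{VRAM_GIB - weights:.1f} GiB left for KV cache and compute buffers")
```

Whatever the estimate leaves over is shared by the KV cache and the backend’s compute buffers, which is why context length becomes the main adjustable knob once a quantization level is chosen.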
What You Can Run on an RTX 3090 (24GB VRAM, No Offloading)
Before diving into specific model sizes, it’s crucial to understand that the following VRAM usage figures are estimates. Actual memory consumption can vary slightly based on the specific model architecture (number of layers, hidden dimensions, attention heads), the inference software being used (like llama.cpp, LM Studio), and driver overhead.
Furthermore, the context length calculations presented here assume an unquantized KV cache. Advanced users can often enable KV cache quantization (to 8-bit or even 4-bit), which can reduce the VRAM impact of long contexts, allowing for even larger context windows than estimated below within the same 24GB budget. Our figures provide a solid baseline assuming standard KV cache handling.
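To see where the context cost comes from, note that the KV cache grows linearly with context length: every token stores one key and one value vector per layer. Below is a minimal sketch assuming a grouped-query-attention model with illustrative architecture values (64 layers, 8 KV heads, head dimension 128, roughly the shape of a dense 30B-class model; check the actual model config) and treating a q8_0 cache as about 1 byte per element:

```python
def kv_cache_gib(context_tokens: int, n_layers: int, n_kv_heads: int,
                 head_dim: int, bytes_per_elem: float = 2.0) -> float:
    """KV cache size: 2 (K and V) x layers x KV heads x head_dim x context tokens."""
    return (2 * n_layers * n_kv_heads * head_dim
            * context_tokens * bytes_per_elem) / 1024**3

# bytes_per_elem: 2.0 for an unquantized FP16 cache, ~1.0 for q8_0,
# ~0.5 for q4_0 (ignoring small per-block overheads).
for ctx in (4_096, 12_288, 32_768):
    fp16 = kv_cache_gib(ctx, n_layers=64, n_kv_heads=8, head_dim=128)
    q8 = kv_cache_gib(ctx, n_layers=64, n_kv_heads=8, head_dim=128, bytes_per_elem=1.0)
    print(f"{ctx:>6} tokens: ~{fp16:.1f} GiB FP16 cache, ~{q8:.1f} GiB q8_0 cache")
```

Halving the per-element size is why an 8-bit KV cache roughly doubles the context that fits in the same leftover VRAM.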
Note that the listed VRAM requirements reflect only the LLM and its context cache, excluding operating system overhead; achieving these maximums often requires dedicating the RTX 3090 solely to compute by using an integrated GPU for display output or operating the system headlessly via a text terminal.
The following tests were conducted on a headless Ubuntu 22.04.5 LTS server using llama.cpp build 5258 with Open WebUI as the frontend.
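For readers reproducing similar measurements, actual usage is easiest to confirm from the driver itself rather than from estimates. Here is a minimal sketch, assuming the NVIDIA driver’s nvidia-smi utility is on the PATH, that polls per-GPU memory use from Python:

```python
import subprocess

def gpu_memory_used_mib() -> list[int]:
    """Return memory currently in use on each GPU, in MiB, as reported by nvidia-smi."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        text=True,
    )
    return [int(line) for line in out.strip().splitlines()]

if __name__ == "__main__":
    for idx, mib in enumerate(gpu_memory_used_mib()):
        print(f"GPU {idx}: {mib / 1024:.2f} GiB in use")
```

Polling this once after loading a model and again after a long prompt makes the split between the weight footprint and KV cache growth easy to see.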
30-32 Billion Parameter Models
Models in the 30-32 billion parameter range represent the upper limit of what can comfortably fit entirely within an RTX 3090’s 24GB VRAM using common quantization levels like 4-bit or 5-bit. This class includes capable models such as various Qwen variants like Qwen_Qwen3-32B, Qwen_QwQ-32B, the coding-focused Qwen2.5-Coder-32B-Instruct, and the compact Qwen_Qwen3-30B-A3B. Fitting these models requires careful consideration of quantization and context length.
Model Tested | Quantization | Context | TG* | PP* | VRAM (GB) |
---|---|---|---|---|---|
Qwen3 32B | Q4_K_M | 12K | 22 t/s | 590 t/s | 23.40 |
Qwen3 32B | Q5_K_M | 4K | 25 t/s | 770 t/s | 23.86 |
Qwen3 30B A3B | Q4_K_M | 32K | 35 t/s | 750 t/s | 23.91 |
Qwen3 30B A3B | Q5_K_M | 16K | 54 t/s | 1035 t/s | 23.93 |
*TG – Token Generation | *PP – Prompt Processing
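As a rough guide to how TG and PP figures like these can be gathered, llama.cpp’s bundled server reports per-request timing data. The sketch below assumes a llama-server instance already running on its default local port and queries its native /completion endpoint; the exact timing field names can vary between builds, so treat it as illustrative rather than canonical:

```python
import requests  # assumes the third-party 'requests' package is installed

# Hypothetical local llama-server instance on its default host/port; adjust as needed.
URL = "http://127.0.0.1:8080/completion"

payload = {
    "prompt": "Explain the difference between GDDR6 and GDDR6X in two sentences.",
    "n_predict": 256,   # number of tokens to generate
    "temperature": 0.7,
}

resp = requests.post(URL, json=payload, timeout=600)
resp.raise_for_status()

# Recent llama-server builds include a "timings" object in the response.
timings = resp.json().get("timings", {})
print("Prompt processing (PP):", timings.get("prompt_per_second"), "t/s")
print("Token generation (TG): ", timings.get("predicted_per_second"), "t/s")
```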
These results show that 30-32B models are viable on a 24GB card, but compromises are necessary. Using 4-bit quantization (Q4_K_M) offers more headroom for context, allowing roughly 12k on the dense Qwen3 32B and up to 32k on the Qwen3 30B A3B, while staying under the 24GB limit. Stepping up to 5-bit quantization (Q5_K_M) for potentially better quality significantly restricts the usable context length, to around 4k on the dense 32B model and 16k on the 30B A3B. It’s worth noting that for complex reasoning tasks, models like Qwen_Qwen3-30B-A3B or Qwen_QwQ-32B often benefit significantly from larger context windows; users should be aware that achieving a practical context size like 16k might necessitate using 4-bit quantization.
20-27 Billion Parameter Models
Dropping down to the 20-27B parameter range provides considerably more flexibility within the 24GB VRAM envelope. This category includes interesting models like Google’s Gemma 3 27B Instruct (google_gemma-3-27b-it), Mistral’s compact yet capable Mistral-Small-24B, and the specialized coding model Codestral-22B-v0.1. With these models, users can often achieve longer context lengths even with higher quality quantization levels.
Model Tested | Parameters | Quantization | Context | VRAM Used (GB) |
---|---|---|---|---|
Gemma 3 27b | 27B | Q4_K_M | 12K | ~23.90 |
Gemma 3 27b | 27B | Q5_K_M | 8K | ~23.79 |
Gemma 3 27b | 27B | Q6_K | 4K | ~23.87 |
Mistral Small 24B | 24B | Q4_K_M | 36K | ~23.63 |
Mistral Small 24B | 24B | Q5_K_M | 26K | ~23.61 |
Mistral Small 24B | 24B | Q6_K | 18K | ~23.76 |
Mistral Small 24B | 24B | Q8_0 | 1K | ~24.00 |
Codestral 22B v0.1 | 22B | Q4_K_M | 36K | ~23.63 |
Codestral 22B v0.1 | 22B | Q5_K_M | 26K | ~23.91 |
Codestral 22B v0.1 | 22B | Q6_K | 18K | ~23.80 |
Codestral 22B v0.1 | 22B | Q8_0 | 1K | ~23.37 |
As demonstrated, the 20-27B models offer a better balance on the RTX 3090. Using 4-bit quantization allows for substantial context lengths, up to 36k tokens on Mistral Small 24B and Codestral 22B, though the heavier Gemma 3 27B tops out around 12k. Moving up to 5-bit or 6-bit quantization still permits usable context sizes, roughly 4k-8k on Gemma and 18k-26k on the 22-24B models. While 8-bit quantization (Q8_0) pushes the VRAM limit even with minimal context, it remains a possibility for specific use cases where model fidelity is paramount and context length is not critical. This size class provides a great blend of capability and operational flexibility on 24GB.
12-16 Billion Parameter Models
Stepping into the 12-16B parameter range unlocks significant headroom on an RTX 3090, allowing for very long context windows or the use of higher-fidelity quantization methods without VRAM anxiety. Notable models here include the DeepSeek Coder V2 Lite Instruct, Qwen3 14B, Qwen2.5 Coder 14B, the distilled DeepSeek-R1-Distill-Qwen-14B, and Google’s Gemma 3 12B Instruct (google_gemma-3-12b-it).
Model Tested | Parameters | Quantization | Context | VRAM Used (GB) |
---|---|---|---|---|
DeepSeek Coder V2 Lite | 16B | Q4_K_M | 60K | ~23.20 |
DeepSeek Coder V2 Lite | 16B | Q5_K_M | 52K | ~23.79 |
DeepSeek Coder V2 Lite | 16B | Q6_K | 43K | ~23.87 |
DeepSeek Coder V2 Lite | 16B | Q8_0 | 30K | ~23.84 |
Qwen3 14B | 14B | Q4_K_M | 62K | ~23.70 |
Qwen3 14B | 14B | Q5_K_M | 52K | ~23.73 |
Qwen3 14B | 14B | Q6_K | 50K | ~23.87 |
Qwen3 14B | 14B | Q8_0 | 32K | ~23.16 |
With 12-16B models, the 24GB VRAM of the RTX 3090 feels capacious. Users can comfortably employ 4-bit quantization and achieve extremely long context lengths, often exceeding 60k tokens. Even higher quality 5-bit and 6-bit quantizations allow for context lengths easily surpassing 40-50k tokens. Remarkably, even 8-bit quantization becomes practical, supporting substantial context lengths of around 30k tokens while staying within the 24GB budget. This makes the 12-16B class highly versatile for tasks demanding both quality and extensive context memory on this hardware.
7-8 Billion Parameter Models
For users prioritizing maximum context length or the absolute highest quality quantization possible within 24GB, the 7-8 billion parameter models are an excellent choice. This popular category includes mainstays like Qwen3 8B, Meta Llama 3.1 8B Instruct, and specialized distilled models like DeepSeek-R1-Distill-Qwen-7B.
Model Tested | Parameters | Quantization | Context | VRAM Used (GB) |
---|---|---|---|---|
Qwen3 8B | 8B | Q4_K_M | 90K | ~23.57 |
Qwen3 8B | 8B | Q5_K_M | 86K | ~23.55 |
Qwen3 8B | 8B | Q6_K | 82K | ~23.80 |
Qwen3 8B | 8B | Q8_0 | 72K | ~23.60 |
Running 7-8B models on an RTX 3090 leaves significant VRAM headroom. This allows for pushing context lengths towards the theoretical maximums supported by the models themselves (often 80k tokens or more) even with higher quality 5-bit or 6-bit quantization. Crucially, even full 8-bit quantization (Q8_0) fits comfortably alongside very large context windows (around 70k+ tokens). This makes the 7-8B class ideal for applications needing to process very large documents or maintain extensive conversation history without sacrificing model fidelity due to aggressive quantization.
Conclusion
The NVIDIA GeForce RTX 3090, particularly sourced from the second-hand market, represents a compelling hardware choice for the price-conscious enthusiast aiming to run LLMs entirely within GPU memory. Its 24GB VRAM buffer provides substantial capacity, capable of fully hosting quantized models up to the 30-32 billion parameter range with careful management of context length and quantization level. As model size decreases, the flexibility increases dramatically, allowing for extensive context windows or higher-fidelity quantization (like Q6_K or Q8_0) with models in the 7B to 16B range. While newer GPUs exist and future releases promise more, the RTX 3090 delivers a potent combination of VRAM capacity, memory bandwidth, and value that is hard to beat today for local LLM inference without resorting to slower system memory offloading. For users looking to eventually run even larger models, the experience gained managing VRAM on a 3090 provides a solid foundation for considering multi-GPU configurations or evaluating the cost-benefit of future higher-VRAM graphics cards.
Allan Witt
Allan Witt is Co-founder and editor in chief of Hardware-corner.net. Computers and the web have fascinated me since I was a child. In 2011 I started training as an IT specialist at a medium-sized company and launched a blog at the same time. I really enjoy blogging about tech. After successfully completing my training, I worked as a system administrator in the same company for two years. As a part-time job I started tinkering with pre-built PCs and building custom gaming rigs at a local hardware shop. The desire to build PCs full-time grew stronger, and now this is my full-time job.