RTX 4070 Series for LLMs: A Technical Guide


For enthusiasts exploring large language models (LLMs) such as Llama-2 and Mistral, the NVIDIA RTX 4070 is a compelling option for local inference. It delivers roughly double the performance of the previous-generation RTX 3060 12GB.

Running LLMs with RTX 4070’s Hardware

Key to the RTX 4070’s ability to handle LLMs is its 12 GB of VRAM paired with 504 GB/s of memory bandwidth. This combination is particularly effective with 7B models: GGUF and EXL2 models at 8-bit quantization (Q8) generate more than 40 tokens per second.
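
As a rough illustration of why 12 GB and 504 GB/s suit 7B models, the sketch below estimates the weight footprint and a bandwidth-bound ceiling on generation speed. It assumes that each generated token reads every weight once, and it ignores the KV cache, quantization metadata, and compute overhead. The function names are illustrative and do not come from any library.

```python
def weight_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def bandwidth_bound_tps(bandwidth_gb_s: float, model_gb: float) -> float:
    """Upper bound on tokens/s if every token streams all weights once."""
    return bandwidth_gb_s / model_gb


# Assumption: a nominal 8 bits per weight for Q8; real files carry extra metadata.
size_7b_q8 = weight_size_gb(7, 8)
ceiling_7b_q8 = bandwidth_bound_tps(504, size_7b_q8)
print(f"7B @ Q8: ~{size_7b_q8:.1f} GB, ceiling ~{ceiling_7b_q8:.0f} tokens/s")
```

Under these assumptions the weights take about 7 GB and the ceiling is about 72 tokens per second, so a measured rate above 40 tokens per second is plausible for a memory-bound workload.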

When stepping up to 13B models, the RTX 4070 continues to impress. Here, 4-bit quantized versions in GGUF or GPTQ format are the best choice, leaving you the option to use a larger 4K context. You can expect around 10 tokens per second from this setup.
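
To see why a 4K context is feasible next to 4-bit 13B weights, here is a minimal sketch of the key/value cache size. The model shape is an assumption based on Llama-2-13B (40 layers, hidden size 5120, no grouped-query attention) and a 16-bit cache. None of these figures come from this article, and `kv_cache_gib` is a hypothetical helper.

```python
def kv_cache_gib(n_layers: int, hidden_size: int, context_len: int,
                 bytes_per_value: int = 2) -> float:
    """Size of the key/value cache in GiB for a single sequence."""
    # Factor 2 covers keys and values; each layer stores one
    # hidden_size-wide vector of each kind per token.
    total_bytes = 2 * n_layers * hidden_size * context_len * bytes_per_value
    return total_bytes / 2**30


# Assumed Llama-2-13B shape with a 16-bit (2-byte) cache.
cache = kv_cache_gib(n_layers=40, hidden_size=5120, context_len=4096)
print(f"13B KV cache at 4K context: ~{cache:.1f} GiB")
```

That works out to roughly 3.1 GiB. Added to a nominal 6.5 GB of 4-bit weights (13 billion parameters at half a byte each), the total stays under 12 GB, although real 4-bit files are somewhat larger and the margin is tighter in practice.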

Scaling Up: Handling Larger Models

The RTX 4070 can also run 22B models at 3-bit quantization (Q3); Llama2-22B-Daydreamer-v3 at Q3 is a good choice. Larger 33B models, however, are typically around 17GB in their 4-bit version, so they cannot be loaded fully into VRAM. The workaround is to offload 25 to 30 layers onto the GPU and keep the remainder in system memory. This setup is slower than a fully GPU-loaded model, but it still generates 5 to 6 tokens per second.
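
The layer split can be estimated with simple arithmetic. The sketch below assumes a 33B model with 60 transformer layers of equal size (the layer count is an assumption based on the LLaMA 33B architecture, not stated in this article) and reserves part of the VRAM for the context and working buffers. `gpu_layers_that_fit` and the 4 GB reserve are hypothetical choices for illustration.

```python
import math


def gpu_layers_that_fit(model_gb: float, n_layers: int,
                        vram_gb: float, reserve_gb: float) -> int:
    """Number of equally sized layers that fit in the VRAM left after a reserve."""
    per_layer_gb = model_gb / n_layers
    budget_gb = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, math.floor(budget_gb / per_layer_gb))


# 17 GB 4-bit 33B model on a 12 GB card, keeping an assumed 4 GB free.
layers = gpu_layers_that_fit(model_gb=17, n_layers=60, vram_gb=12, reserve_gb=4)
print(f"Offload about {layers} of 60 layers to the GPU")
```

With these inputs the estimate is 28 layers, inside the 25 to 30 range mentioned above. The result would be passed to the loader's GPU layer setting, for example the `--n-gpu-layers` option in llama.cpp, and tuned downward if the card runs out of memory.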

Comparing the RTX 4070 Ti and RTX 4070 Ti SUPER

The RTX 4070 Ti performs remarkably similarly to the RTX 4070 when running LLMs, largely because both cards have the same memory bandwidth of 504 GB/s.

RTX 4070 Ti Specifications:

  • GPU: AD104
  • Cores: 7680
  • TMUs: 240
  • ROPs: 80
  • Memory Size: 12 GB
  • Memory Type: GDDR6X
  • Bus Width: 192 bit

Although the RTX 4070 Ti has more cores, TMUs, and ROPs, its LLM performance is limited by its memory configuration, which mirrors that of the RTX 4070.
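
The link between bus width and bandwidth can be checked with a short calculation. This sketch derives the per-pin data rate implied by the 192-bit bus and 504 GB/s quoted above, then rebuilds the bandwidth from it. The helper names are illustrative, and the formula is the usual peak-bandwidth approximation rather than a measured figure.

```python
def per_pin_rate_gbps(bandwidth_gb_s: float, bus_width_bits: int) -> float:
    """Effective data rate per pin (Gbit/s) implied by a peak bandwidth."""
    return bandwidth_gb_s * 8 / bus_width_bits


def peak_bandwidth_gb_s(bus_width_bits: int, rate_gbps: float) -> float:
    """Peak memory bandwidth in GB/s: bus width times per-pin rate, in bytes."""
    return bus_width_bits * rate_gbps / 8


rate = per_pin_rate_gbps(504, 192)
print(f"Implied data rate: {rate:.0f} Gbit/s per pin")
print(f"192-bit bus at that rate: {peak_bandwidth_gb_s(192, rate):.0f} GB/s")
```

Because token generation is largely limited by how fast the weights can be read, two cards with the same bus width and memory speed end up close in LLM throughput even when their core counts differ.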

Advancing Further: RTX 4070 Ti SUPER

The RTX 4070 Ti SUPER raises the specifications and improves LLM performance. The boost comes primarily from its higher memory bandwidth of 672 GB/s, supported by more cores and a larger 16 GB memory pool.

RTX 4070 Ti SUPER Specifications:

  • GPU: AD103
  • Cores: 8448
  • TMUs: 264
  • ROPs: 96
  • Memory Size: 16 GB
  • Memory Type: GDDR6X
  • Bus Width: 256 bit

This increase in memory bandwidth offers a 15 to 20% improvement in token generation speed. Although 16 GB of memory still cannot hold a full 33B model, it provides more flexibility for larger context lengths and layer offloading.
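
For context, the following sketch compares the raw bandwidth gain with the reported speedup and checks whether a 17GB 4-bit 33B model fits in 16 GB. It uses only figures quoted in this guide, and `relative_gain` is a hypothetical helper.

```python
def relative_gain(new_value: float, old_value: float) -> float:
    """Fractional increase of new_value over old_value."""
    return new_value / old_value - 1.0


# Bandwidth of the RTX 4070 Ti SUPER versus the RTX 4070 / 4070 Ti.
bw_gain = relative_gain(672, 504)
print(f"Bandwidth gain: {bw_gain:.0%}")

# A 4-bit 33B model (about 17 GB) compared with the 16 GB of VRAM.
fits_33b = 17 <= 16
print(f"33B at 4-bit fits entirely in VRAM: {fits_33b}")
```

The bandwidth gain is about 33%, so it acts as a ceiling. A measured improvement of 15 to 20% sits below it, which suggests that generation speed is not determined by memory bandwidth alone.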