What Is The Right LLM For RTX 3060 (12GB)?

Answer:

For those looking to run large language models like Llama-2 and Mistral on a PC built around an RTX 3060 (12 GB VRAM), here’s a concise guide to the models you can run.

For the RTX 3060 (12GB), you can use any 7B GPTQ model. Models in GGUF format can be run with up to 8-bit (Q8) quantization while still offloading all of the layers to the GPU. On the RTX 3060, a 7B 8-bit model runs at around 25 tokens/second on Windows and 37 tokens/second on Linux.
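As a minimal sketch of what full GPU offload looks like in practice, here is how you might load a 7B Q8 GGUF model with llama-cpp-python. The model path is a placeholder for whatever GGUF file you have downloaded, and the context size is just an example value:

```python
# Minimal sketch: full GPU offload of a 7B Q8_0 GGUF model with llama-cpp-python.
# The model path is a placeholder; point it at whichever GGUF file you downloaded.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/mistral-7b-instruct.Q8_0.gguf",  # placeholder path
    n_gpu_layers=-1,  # -1 offloads every layer to the RTX 3060
    n_ctx=4096,       # context window; larger values use more VRAM for the KV cache
)

output = llm(
    "Q: Name the planets in the solar system. A:",
    max_tokens=64,
)
print(output["choices"][0]["text"])
```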

A 13B model requires adjustments to layers and quantization. For instance, a 6-bit (Q6) GGUF quantization is roughly the largest 13B model you can fit on the RTX 3060. However, the best approach is to use a 4-bit 13B model in GGUF or GPTQ format to strike a balance between speed (7-8 t/s) and inference quality, as the rough estimate below illustrates.
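To see why Q6 is about the ceiling for 13B on 12 GB, a back-of-envelope estimate helps: weight memory is roughly parameters × bits-per-weight ÷ 8, plus some headroom for the KV cache and CUDA overhead. The overhead figure below is an assumption, and real GGUF quant types (e.g. Q4_K_M ≈ 4.8 bpw, Q6_K ≈ 6.6 bpw) use slightly more bits per weight than their names suggest; if the estimate comes out too tight, you can lower the number of GPU-offloaded layers and keep the rest on the CPU.

```python
# Rough VRAM estimate: parameters * bits-per-weight / 8 for the weights,
# plus an assumed ~1.5 GB of headroom for the KV cache, scratch buffers,
# and CUDA context. Treat this as a rule of thumb, not an exact figure.
def estimate_vram_gb(params_billion: float, bits_per_weight: float,
                     overhead_gb: float = 1.5) -> float:
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb + overhead_gb

print(estimate_vram_gb(13, 6))  # ~11.3 GB -> tight fit on 12 GB
print(estimate_vram_gb(13, 4))  # ~8.0 GB  -> comfortable, leaves room for context
print(estimate_vram_gb(7, 8))   # ~8.5 GB  -> 7B Q8 fits with every layer on the GPU
```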

Other Models and Quantization:

You might also explore 22B models at 3-bit quantization (Q3). For example, the Llama2-22B-Daydreamer-v3 model at Q3 will fit on the RTX 3060.
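The same rule of thumb suggests why this works; the 3 bits-per-weight and 1.5 GB overhead used here are approximations:

```python
# Back-of-envelope check for a 22B model at ~3 bits per weight.
weights_gb = 22 * 3 / 8       # ~8.25 GB of weights
print(weights_gb + 1.5)       # ~9.75 GB with the assumed overhead, under 12 GB
```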

Which models to run?

Some quality 7B models to run on the RTX 3060 are the Mistral-based Zephyr and Mistral-7B-Claude-Chat models, and the Llama-2 based airoboros-l2-7B-3.0 from the Airoboros family. For a 13B LLM, you can try Athena for roleplay and WizardCoder for coding.

And finally, instead of just seeking the largest model, it’s crucial to focus on what you intend to achieve with the LLM. Smaller models with specific tweaks can perform exceptionally well for certain tasks.

Hope this guide helps you find the right model and configuration for your setup!