Falcon LLM: Versions, Prompt Templates & Hardware Requirements

Updated: 2023-08-31 | Base model

Explore all versions of the model, their file formats like GGML, GPTQ, and HF, and understand the hardware requirements for local inference.

Hardware requirements

The performance of a Falcon model depends heavily on the hardware it's running on. For recommendations on the best computer hardware configurations to handle Falcon models smoothly, check out this guide: Best Computer for Running LLaMA and LLama-2 Models.

Below are the Falcon hardware requirements for 4-bit quantization:

For 7B Parameter Models

If the 7B model is what you're after, you need to think about hardware in two ways. First, for the GPTQ version, you'll want a decent GPU with at least 6GB of VRAM. The GTX 1660 or 2060, AMD 5700 XT, or RTX 3050 or 3060 would all work nicely. For the GGML / GGUF format, it's more about having enough RAM: you'll need around 4GB free to run it smoothly.

Format                                              RAM Requirements      VRAM Requirements
GPTQ (GPU inference)                                6GB (Swap to Load*)   6GB
GGML / GGUF (CPU inference)                         4GB                   300MB
Combination of GPTQ and GGML / GGUF (offloading)    2GB                   2GB

*RAM needed to load the model initially. Not required for inference. If your system doesn't have quite enough RAM to fully load the model at startup, you can create a swap file to help with the loading.
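As a rough illustration, the table can be turned into a small lookup that lists which formats fit a given machine. This is only a sketch: the thresholds are copied from the 7B table above, while the function name, its parameters, and the way the swap footnote is handled are assumptions made for the example.

```python
# Thresholds copied from the 7B table above (4-bit quantization), in GB.
REQUIREMENTS_7B = [
    # (format label, RAM needed, VRAM needed)
    ("GPTQ (GPU inference)", 6.0, 6.0),
    ("GGML / GGUF (CPU inference)", 4.0, 0.3),
    ("GPTQ + GGML / GGUF (offloading)", 2.0, 2.0),
]


def viable_formats(free_ram_gb, free_vram_gb, swap_available=False):
    """Return the labels of the table rows whose RAM and VRAM needs are met.

    Hypothetical helper for illustration; not part of any library.
    """
    viable = []
    for label, ram_gb, vram_gb in REQUIREMENTS_7B:
        # Per the footnote, the GPTQ RAM figure is only needed while loading,
        # so this sketch assumes a swap file can cover it.
        covered_by_swap = swap_available and label == "GPTQ (GPU inference)"
        ram_ok = free_ram_gb >= ram_gb or covered_by_swap
        if ram_ok and free_vram_gb >= vram_gb:
            viable.append(label)
    return viable


if __name__ == "__main__":
    # Example: 8GB of free RAM and a GPU with only 0.5GB of free VRAM.
    print(viable_formats(free_ram_gb=8, free_vram_gb=0.5))
```

With 8GB of free RAM and very little VRAM, only the GGML / GGUF row qualifies, which matches the advice above for CPU inference.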

Memory speed

When running Falcon models, you need to pay attention to how RAM bandwidth and model size impact inference speed. To generate each new token (piece of text), the full set of model weights has to be read from RAM or VRAM, so memory bandwidth sets the ceiling on generation speed. For example, a 4-bit 13B-parameter model takes up around 7.5GB of RAM.

So if your RAM bandwidth is 50 GB/s (DDR4-3200 with a Ryzen 5 5600X), you can generate roughly 6 tokens per second. For faster speeds like 11 tokens per second, you'd need more bandwidth: DDR5-5600 with around 90 GB/s. For reference, top-end GPUs like the Nvidia RTX 3090 have about 930 GB/s of bandwidth to their VRAM, while the latest DDR5 RAM can provide up to 100 GB/s. So understanding bandwidth is key to running models like Falcon efficiently.
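The arithmetic behind these figures is simply memory bandwidth divided by model size. The sketch below reproduces it using the 7.5GB model size and the bandwidth numbers quoted above. The function name is made up for the example, and the result is best read as a theoretical ceiling, since real-world throughput (like the roughly 6 and 11 tokens per second quoted above) lands somewhat lower.

```python
def estimate_tokens_per_second(bandwidth_gb_s, model_size_gb):
    """Upper-bound estimate: every token requires reading all weights once.

    Hypothetical helper; ignores compute time, caches, and other overhead.
    """
    return bandwidth_gb_s / model_size_gb


MODEL_SIZE_GB = 7.5  # 4-bit 13B model size quoted in the text

# Bandwidth figures quoted in the text, in GB/s.
SYSTEMS = [
    ("DDR4-3200", 50),
    ("DDR5-5600", 90),
    ("RTX 3090 VRAM", 930),
]

if __name__ == "__main__":
    for name, bandwidth in SYSTEMS:
        ceiling = estimate_tokens_per_second(bandwidth, MODEL_SIZE_GB)
        print(f"{name}: up to ~{ceiling:.1f} tokens/s")
```

The same division works for any model size: swap in the size of the quantized file you plan to run to get a quick ceiling for your own hardware.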


  1. For Best Performance: Opt for a machine with a high-end GPU (like NVIDIA's RTX 3090 or RTX 4090) or a dual-GPU setup to accommodate the largest models (40B). A system with adequate RAM (16 GB minimum, ideally 64 GB) would be optimal.
  2. For Budget Constraints: If you're limited by budget, focus on Falcon GGML/GGUF models that fit within the system RAM. Remember, while you can offload some weights to the system RAM, it will come at a performance cost.

Remember, these are recommendations, and the actual performance will depend on several factors, including the specific task, model implementation, and other system processes.

CPU requirements

For best performance, a modern multi-core CPU is recommended. An Intel Core i7 from 8th gen onward or an AMD Ryzen 5 from 3rd gen onward will work well. A CPU with 6 or 8 cores is ideal. Higher clock speeds also improve prompt processing, so aim for 3.6GHz or more.

CPU instruction sets like AVX, AVX2, and AVX-512 can further improve performance if available. The key is to have a reasonably modern consumer-level CPU with a decent core count and clock speed, along with baseline vector processing through AVX2 (required for CPU inference with llama.cpp). With those specs, the CPU should be able to handle Falcon models.
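To see which of these instruction sets a machine reports, one option on Linux is to read the flags line in /proc/cpuinfo. The sketch below assumes an x86 Linux system; the helper names are made up for the example, and on other operating systems or CPU architectures the file is missing or formatted differently, so the check simply reports nothing found.

```python
from pathlib import Path


def parse_cpu_flags(cpuinfo_text):
    """Return the set of flags from the first 'flags' line of cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def simd_support(flags):
    """Map the instruction sets named above to whether the CPU reports them."""
    return {
        "AVX": "avx" in flags,
        "AVX2": "avx2" in flags,
        # AVX-512 shows up as several flags, e.g. avx512f, avx512bw.
        "AVX-512": any(flag.startswith("avx512") for flag in flags),
    }


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")  # Linux only
    if cpuinfo.exists():
        print(simd_support(parse_cpu_flags(cpuinfo.read_text())))
    else:
        print("/proc/cpuinfo not found; this check only works on Linux.")
```

If AVX2 comes back as False, CPU inference is likely to be slow or unsupported, and a GPU-based format is the safer choice.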