How To Install Llama-2 Locally On Windows Computer – llama.cpp, Exllama, KoboldCpp

LLaMA (both the original LLaMA and Llama-2) is a super powerful and flexible open-source language model. Developed by Meta AI Research, Llama offers a scalable solution for tasks like text generation, question answering, and natural language understanding.

The first version of LLaMA comes in four sizes: 7 billion, 13 billion, 30 billion, and 65 billion parameters. The second version, Llama-2, was released in three sizes: 7 billion, 13 billion, and 70 billion parameters.

Unlike the first gen, each Llama-2 model has two versions: a regular (uncensored) version and a chat-optimized (aligned) version.

In this post, I’ll show you how to install Llama-2 on Windows – the requirements, steps involved, and how to test and use Llama.

System requirements for running Llama-2 on Windows

The hardware required to run Llama-2 on a Windows machine depends on which Llama-2 model you want to use.

The smaller 7 billion and 13 billion parameter models can run on most modern laptops and desktops with at least 8GB of RAM and a decent CPU. For the larger 30 billion parameter model (available only in the first LLaMA generation), a system with 16GB of RAM and a recent multi-core processor is recommended.

To use the massive 70 billion parameter Llama-2 model, more powerful hardware is ideal – a desktop with 64GB RAM or a dual Nvidia RTX 3090 graphics card setup.

While the smaller models will work fine on mid-range consumer hardware, the faster memory and GPU acceleration of high-end systems will significantly speed up performance when working with Llama-2’s models.
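As a rough illustration of why the hardware needs grow with model size, the sketch below estimates how much memory the weights alone occupy at a given precision (parameters × bits per weight ÷ 8). It is only a back-of-the-envelope calculation: it ignores the context cache and runtime overhead, so real usage is higher. The function name is made up for this example.

```python
def estimate_weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in gigabytes (1 GB = 1e9 bytes).

    Assumption: every parameter is stored at the same precision and nothing
    else (context cache, scratch buffers) is counted.
    """
    total_bytes = n_params * bits_per_weight / 8
    return total_bytes / 1e9


# Parameter counts from the article; precisions are common choices, not a complete list.
MODEL_SIZES = {"7B": 7e9, "13B": 13e9, "70B": 70e9}

for name, n_params in MODEL_SIZES.items():
    for bits in (16, 8, 4):
        size_gb = estimate_weight_memory_gb(n_params, bits)
        print(f"{name} at {bits}-bit: roughly {size_gb:.1f} GB for the weights")
```

Under these assumptions a 4-bit 7B model needs a few gigabytes for its weights, while a 4-bit 70B model needs tens of gigabytes, which is why the largest model calls for much more RAM or multiple GPUs.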

We have a special dedicated article discussing the hardware requirements for running the LLaMA model locally on a computer.

LLaMA and Llama-2 installation process for Windows

To run Llama-2 locally on Windows you need two things. The first is an implementation of the model: software containing the code that defines the structure and operations of the LLaMA model.

For this software to produce any meaningful output, you also need to download the pretrained model file, which contains the weights and parameters for the specific Llama variant you want to use.

The learned parameters (weights) are stored in a separate file and loaded into the model during runtime to enable it to perform inference or make predictions on new data. The combination of the implementation code and the loaded weights allows the model to function as intended and produce meaningful outputs.

Currently, there are a couple of Llama implementations available that offer users the convenience of running the AI model locally.

Installing LM Studio on Windows

In the realm of local Llama inference, LM Studio is quickly becoming a favored choice among Windows users, thanks to its remarkable blend of user-friendliness and powerful features.

At the heart of LM Studio’s appeal is its intuitive chat interface, which simplifies querying and interacting with LLMs. It is especially useful for those who are new to LLMs or prefer a more straightforward user experience. LM Studio also comes with essential features such as:

  • Model search and download.
  • GPU offloading, a feature that splits the model between the GPU and RAM for faster inference.
  • Predefined prompt formats for different models.
  • Markdown formatting for code outputs.
  • OpenAI compatible local server.

All of these features are built on top of llama.cpp, which is why you have to download models in the GGUF file format.

To install LM Studio and run inference, do the following:

  • Visit the official LM Studio website.
  • Download the software to your local machine.
  • Run the Installer.
  • After the installation is complete, open LM Studio.
  • Use the search bar on the home screen to search, browse, and download any desired model.
  • After the model is downloaded, click the chat icon on the left to navigate to the chat screen.
  • In the chat screen, use the purple dropdown menu at the top to select the model.
  • Use the right sidebar to set up the model prompt format, context length, model parameters, and markdown or plaintext chat appearance.
  • Load the model and chat with it.
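Besides the chat screen, the OpenAI-compatible local server from the feature list can be queried from a script. The sketch below is a minimal example using only the Python standard library. It assumes the server has been started inside LM Studio and listens on http://localhost:1234/v1, and it uses "local-model" as a placeholder model name; check both against what your LM Studio version shows. The helper names are made up for this example.

```python
import json
import urllib.request

# Assumption: the address shown in LM Studio's server tab; adjust if yours differs.
BASE_URL = "http://localhost:1234/v1"


def build_chat_request(user_message, system_message="You are a helpful assistant."):
    """Build an OpenAI-style chat completion request without sending it."""
    payload = {
        "model": "local-model",  # placeholder name, an assumption for this sketch
        "messages": [
            {"role": "system", "content": system_message},
            {"role": "user", "content": user_message},
        ],
        "temperature": 0.7,
        "stream": False,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(user_message):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(user_message), timeout=120) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Example (requires the LM Studio server to be running with a model loaded):
# print(ask("Explain GPU offloading in one sentence."))
```

Because the request follows the OpenAI chat format, existing OpenAI client code can usually be pointed at the local address instead of rewriting it.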

Using LM Studio with Pre-downloaded Models:

  • On the left vertical menu of LM Studio, look for a file folder icon and click on it.
  • In this section, find the “Local Models Folder” field.
  • Click “Change” and navigate to the top folder where your local LLM files (GGUF) are stored.
  • It’s important to ensure that the models are organized in the correct directory structure for LM Studio to recognize them.
  • The models should be placed in a directory following this structure: /models-folder/publisher/repository/model-name.gguf, where publisher/repository is the model’s Hugging Face repo name.
  • For example, if you’re using the 8-bit “TheBloke/OpenHermes-2.5-Mistral-7B-GGUF” model, it should be located in: /models-folder/TheBloke/OpenHermes-2.5-Mistral-7B-GGUF/openhermes-2.5-mistral-7b.Q8_0.gguf.
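The folder layout can be sketched in a few lines of Python. This example assumes that the Hugging Face repo part of the path is the full publisher/repository pair, with each half becoming its own folder level. The function name and the C:\models-folder location are hypothetical.

```python
from pathlib import PureWindowsPath


def expected_model_path(models_folder, repo_id, filename):
    """Return where a GGUF file would sit under the local models folder.

    Assumption: repo_id has the form "publisher/repository" and each part
    becomes one folder level below the models folder.
    """
    publisher, repository = repo_id.split("/", 1)
    return PureWindowsPath(models_folder) / publisher / repository / filename


path = expected_model_path(
    r"C:\models-folder",  # hypothetical location of the local models folder
    "TheBloke/OpenHermes-2.5-Mistral-7B-GGUF",
    "openhermes-2.5-mistral-7b.Q8_0.gguf",
)
print(path)
```

If LM Studio does not list a model you copied in by hand, comparing its actual location against a path built this way is a quick first check.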

Installing llama.cpp for GPU and CPU inference

llama.cpp is a port of Facebook’s LLaMA model in C/C++. It is optimized for various platforms and architectures, such as Apple silicon, Metal, AVX, AVX2, AVX512, CUDA, MPI and more, and it supports 4-bit integer quantization. It is a good choice for running the LLaMA model on the CPU with minimal resources.

llama.cpp runs in a simple command window (Windows PowerShell or Command Prompt), without a graphical user interface or other convenience features. It works only with weights converted to GGML or GGUF, depending on the llama.cpp version, so look for those in the file name.

The latest version of llama.cpp no longer supports GGML models. It now uses a new format called GGUF. So for llama.cpp, GGML is deprecated, though other clients/libraries may continue supporting it.

For now, to use GGML with llama.cpp you’ll need to downgrade to an older, pre-GGUF binary release, or use a third-party client (KoboldCpp, LM Studio, text-generation-webui) that still supports GGML.

You can also convert GGML models yourself using the ggml_to_gguf.py script now included with llama.cpp.
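If you are not sure which format a downloaded file is in, you can look at its first bytes: GGUF files begin with the four ASCII characters "GGUF". The sketch below only answers "is this GGUF or not"; it does not try to identify older GGML variants, and the function name is made up for this example.

```python
def is_gguf(path):
    """Return True if the file starts with the 4-byte GGUF magic."""
    with open(path, "rb") as f:
        return f.read(4) == b"GGUF"


# Example (the file name is a placeholder):
# print(is_gguf(r".\models\model_file_name.gguf"))
```

A file that fails this check is not necessarily broken; it may simply be an older GGML file that needs conversion or an older client.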

The easiest way to install llama.cpp on Windows is to use a pre-built executable from the project’s releases page on GitHub.

There are several builds to choose from, depending on your hardware.

For pure CPU inference, choose one of the AVX builds (typically AVX or AVX2), which are suitable for most processors. For GPU offloading, you have two options: cuBLAS for NVIDIA GPUs or CLBlast for AMD GPUs.

For example, offloading a 35-layer 7B parameter model using cuBLAS on an RTX 3060 (12GB) can speed up prompt evaluation and inference by more than 3 times.

Installing AVX version

  1. Download the AVX/AVX2 zip file and extract its contents into a folder of your choice.
  2. Within the extracted folder, create a new folder named “models.”
  3. Download the specific Llama-2 model weights you want to use (for example, Llama-2-7B-Chat-GGML) and place the file inside the “models” folder. Download only files in the format your llama.cpp build supports: GGUF for current releases, GGML for older pre-GGUF releases.
  4. Open the Windows Command Prompt by pressing the Windows Key + R, typing “cmd,” and pressing “Enter.”
  5. Navigate to the main llama.cpp folder using the cd command.
  6. Run the following command in the Command Prompt, replacing model_file_name.bin with the name of the file you downloaded: main.exe -m .\models\model_file_name.bin --in-prefix " [INST] " --in-suffix " [/INST]" -i -p "[INST] <<SYS>> You are a helpful, respectful, and honest assistant. <</SYS>> [/INST]" -ins --color

This command will start the llama.cpp AI model in interactive chat mode with the specified model and allow you to interact with it using the provided input.
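The command above is long and easy to mistype, so it can help to assemble it in a small script. The sketch below builds the same argument list in Python, using only the flags that appear in the command above. The function name, the default system prompt argument, and the model path are placeholders for illustration.

```python
import subprocess


def build_main_command(model_path, system_prompt, gpu_layers=0, exe="main.exe"):
    """Assemble the llama.cpp chat command as a list of arguments."""
    args = [
        exe,
        "-m", model_path,
        "--in-prefix", " [INST] ",
        "--in-suffix", " [/INST]",
        "-i",
        "-p", f"[INST] <<SYS>> {system_prompt} <</SYS>> [/INST]",
        "-ins",
        "--color",
    ]
    if gpu_layers > 0:
        # Only meaningful for GPU-enabled builds such as the cuBLAS one.
        args += ["--n-gpu-layers", str(gpu_layers)]
    return args


cmd = build_main_command(
    r".\models\model_file_name.bin",  # placeholder file name
    "You are a helpful, respectful, and honest assistant.",
)
# Show the command as a single line that can be pasted into the Command Prompt.
print(subprocess.list2cmdline(cmd))

# To launch it from Python instead (run from the llama.cpp folder):
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids quoting problems with the spaces and angle brackets in the prompt.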

Installing cuBLAS version for NVIDIA GPU

  1. Download the llama-master-eb542d3-bin-win-cublas-[version]-x64.zip file from the llama.cpp releases page and extract its contents into a folder of your choice.
  2. Download the cuBLAS runtime files for the same version, cudart-llama-bin-win-[version]-x64.zip, and extract them into the llama.cpp main directory.
  3. Update your NVIDIA drivers.
  4. Within the extracted folder, create a new folder named “models.”
  5. Download the specific Llama-2 model (Llama-2-7B-Chat-GGML) you want to use and place it inside the “models” folder.
  6. Open the Windows Command Prompt by pressing the Windows Key + R, typing “cmd,” and pressing “Enter.”
  7. Navigate to the main llama.cpp folder using the cd command.
  8. Run the following command in the Command Prompt, replacing model_file_name.bin with the name of the file you downloaded: main.exe -m .\models\model_file_name.bin --in-prefix " [INST] " --in-suffix " [/INST]" -i -p "[INST] <<SYS>> You are a helpful, respectful, and honest assistant. <</SYS>> [/INST]" --n-gpu-layers 32 -ins --color

This command will start llama.cpp in interactive chat mode with the specified model (in our case Llama-2-7B-Chat-GGML) and 32 layers offloaded to the GPU. With 32 layers offloaded, a 7B model uses around 3700 MB of VRAM and a 13B model around 5800 MB.
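The VRAM figures above can be turned into a rough per-layer cost to guess how many layers fit on a smaller card. The sketch below simply divides the quoted totals by 32 layers, which is a crude assumption: real usage also depends on quantization, context size, and fixed overhead, so treat the result as a starting point for experimentation. The function name is made up for this example.

```python
def layers_that_fit(vram_budget_mb, mb_per_layer, max_layers):
    """Estimate how many layers fit in a VRAM budget, capped at max_layers."""
    if mb_per_layer <= 0:
        raise ValueError("mb_per_layer must be positive")
    return max(0, min(max_layers, int(vram_budget_mb // mb_per_layer)))


# Per-layer cost derived from the article's figures (32 layers offloaded).
MB_PER_LAYER_7B = 3700 / 32   # about 116 MB per layer
MB_PER_LAYER_13B = 5800 / 32  # about 181 MB per layer

for budget in (2000, 4000, 8000):
    n7 = layers_that_fit(budget, MB_PER_LAYER_7B, max_layers=32)
    n13 = layers_that_fit(budget, MB_PER_LAYER_13B, max_layers=32)
    print(f"{budget} MB free: try --n-gpu-layers {n7} (7B) or {n13} (13B)")
```

The cap of 32 here only mirrors the example command; leave some VRAM headroom for the desktop and the context cache when choosing a budget.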

Use llama.cpp help (main.exe --help) to learn about other command line arguments.

Installing KoboldCpp on Windows

KoboldCpp is a fantastic combination of KoboldAI and llama.cpp, offering a lightweight and super fast way to run various LLAMA.CPP and ALPACA models locally.

The best part is that it’s self-contained and distributable, making it easy to get started. With KoboldCpp, you get accelerated CPU/GPU text generation and a fancy writing UI, along with cool features like persistent stories, editing tools, and save formats.

It supports multiple versions of GGML LLAMA.CPP and ALPACA models, as well as GPT-J/JT, GPT2, and GPT4ALL models. The download size is just around 15 MB (excluding model weights), and it has some neat optimizations to speed up inference.

Here’s a step-by-step guide to install and use KoboldCpp on Windows:

  1. Download the latest koboldcpp.exe release from the official KoboldCpp releases page.
  2. Download the weights from other sources like TheBloke’s Hugging Face page. For example, Llama-2-7B-Chat-GGML.
  3. Execute “koboldcpp.exe” directly.
  4. Launching KoboldCpp without command-line arguments displays a GUI with a subset of configurable settings.
  5. Navigate to the Quick Launch tab, click the Browse button under the Model section, and choose the model/weights you have downloaded.
  6. Adjust settings like Presets and GPU Layers to suit your preferences and requirements.
  7. Click the Launch button.
  8. Open your browser and enter http://localhost:5001 in the address bar.

Command-Line Options:

Similar to llama.cpp, you can start KoboldCpp with command-line arguments for more control.

For example: koboldcpp.exe [ggml_model.bin] [port] allows you to specify the model and port.

If you need to adjust the context size, use --contextsize [value]. Use --smartcontext to reduce how often the prompt is reprocessed if a large context is too slow. For GPU acceleration, consider --useclblast together with --gpulayers to offload entire layers to the GPU.

Experiment to find the right balance of layers to offload based on VRAM availability.

For further details, run the program with the --help flag to access additional information and options.
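The options described above can be combined in a small helper that assembles a KoboldCpp command line. This is a sketch under a few assumptions: flags are written with two dashes, --useclblast is given a platform ID and a device ID (0 and 0 in the example), and the model file name is a placeholder. Confirm the exact options for your version with --help.

```python
import subprocess


def build_koboldcpp_command(model_path, port=5001, context_size=None,
                            smart_context=False, clblast_ids=None,
                            gpu_layers=None, exe="koboldcpp.exe"):
    """Assemble a KoboldCpp command as a list of arguments."""
    args = [exe, model_path, str(port)]
    if context_size is not None:
        args += ["--contextsize", str(context_size)]
    if smart_context:
        args.append("--smartcontext")
    if clblast_ids is not None:
        platform_id, device_id = clblast_ids  # assumption: two numeric IDs
        args += ["--useclblast", str(platform_id), str(device_id)]
    if gpu_layers is not None:
        args += ["--gpulayers", str(gpu_layers)]
    return args


cmd = build_koboldcpp_command(
    "ggml_model.bin",       # placeholder model file
    context_size=4096,
    clblast_ids=(0, 0),
    gpu_layers=20,          # starting guess; tune to your VRAM
)
print(subprocess.list2cmdline(cmd))
```

Starting with a modest --gpulayers value and raising it until VRAM runs out is one simple way to find the balance mentioned above.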

Installing Exllama on Windows

Exllama is a standalone Python/C++/CUDA implementation of Llama for use with 4-bit GPTQ weights, designed to be fast and memory-efficient on modern GPUs. If you have a GPU with enough VRAM, this is the fastest option to run Llama-2 locally.

1. Install Visual Studio Toolkit

First, you have to install the Visual Studio C++ toolchain. You can either install the full Visual Studio IDE or only the Build Tools. If you don’t have the IDE installed already, use the Build Tools. To download them, go to the Microsoft Visual Studio downloads page, scroll down to Tools for Visual Studio, and download Build Tools for Visual Studio 2022.

When you use the Build Tools option, make sure to select the “Desktop development with C++” workload on the Workloads tab.

After the installation, add cl.exe to your system’s PATH to enable easy access to the compiler. You have to add the folder that contains the cl.exe file:

  • With the IDE: C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\<version-number>\bin\Hostx64\x64
  • With Build Tools: C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\<version-number>\bin\Hostx64\x64
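After editing PATH (and opening a new Command Prompt), you can confirm that the compiler is actually reachable. The sketch below uses Python's standard shutil.which to look up executables the same way the command prompt does. The helper name is made up, and it needs Python, which is installed in the next step.

```python
import shutil


def find_tools(names):
    """Map each tool name to its full path on PATH, or None if not found."""
    return {name: shutil.which(name) for name in names}


# cl is the MSVC compiler; python is included for the next step.
for tool, location in find_tools(["cl", "python"]).items():
    status = location if location else "NOT FOUND on PATH"
    print(f"{tool}: {status}")
```

If cl shows as not found, double-check that the folder you added is the one that really contains cl.exe for your installed version number.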

2. Install Python

To get started with Python, first, install the latest version from the official Python website. Once the installation is complete, ensure Python is added to your system’s PATH to enable easy access through the command prompt.

Additionally, to prevent Windows store from starting when you type python in the command prompt, disable Python App Installer from the Manage App Execution Aliases. Finally, to apply all the changes and ensure smooth functionality, restart your PC.

Following these steps will set up Python on your system, allowing you to run Python scripts.

3. Install PyTorch and CUDA Toolkit

Step three is to install PyTorch along with the compatible CUDA 11.8 compute platform. The PyTorch website has a convenient selector tool for building the actual install command, but for Python installed on Windows through the official site, use this one: python -m pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

The next step is to download and install CUDA Toolkit version 11.8, matching the PyTorch compute platform. During installation you will be prompted to install NVIDIA display drivers, HD Audio drivers, and PhysX drivers; install them if they are newer than the versions you already have. If you already have more recent drivers, choose a custom installation with only the CUDA components.

The download size of CUDA Toolkit version 11.8 is approximately 3GB.
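Once PyTorch and the CUDA Toolkit are installed, a quick check from Python shows whether PyTorch can see the GPU. The sketch below reports whether torch is importable, whether CUDA is available, and which CUDA version the installed PyTorch build targets (11.8 is expected for the cu118 wheels). The function name is made up for this example.

```python
import importlib.util


def describe_torch_cuda():
    """Report basic PyTorch/CUDA status without failing if torch is missing."""
    if importlib.util.find_spec("torch") is None:
        return {"torch_installed": False, "cuda_available": False, "cuda_version": None}
    import torch
    return {
        "torch_installed": True,
        "cuda_available": bool(torch.cuda.is_available()),
        "cuda_version": torch.version.cuda,  # CUDA version the build targets, or None
    }


for key, value in describe_torch_cuda().items():
    print(f"{key}: {value}")
```

If cuda_available comes back False even though torch is installed, the usual suspects are a CPU-only PyTorch build or outdated NVIDIA drivers.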

4. Download and Extract Exllama Project

Get the Exllama project from GitHub by clicking the green “Code” button and selecting “Download ZIP.”
Extract the downloaded ZIP file to access the Exllama project files.

5. Install Python Dependencies

Open the Windows Command Prompt and navigate to the Exllama project directory.
Install Python dependencies using the commands: python -m pip install -r requirements.txt and python -m pip install -r requirements-web.txt

6. Create Model Directory

Within the Exllama main directory, create a new directory named “models.”

7. Download and Place Model Files

Now download all the necessary model files from the Hugging Face repository (e.g., https://huggingface.co/TheBloke/Llama-2-7b-Chat-GPTQ/tree/main).
Create a new directory inside the models folder. Give it an appropriate name, for example Llama-2-7b-Chat-GPTQ, and place the downloaded model files there.
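A missing file is a common reason for a model failing to load, so it can be worth checking the folder before launching. The sketch below assumes that Exllama expects a config.json, a tokenizer.model, and at least one .safetensors file in the model directory; verify that list against the project's README. The function name and the example path are placeholders.

```python
from pathlib import Path


def missing_model_files(model_dir):
    """Return the expected files that are absent from model_dir."""
    folder = Path(model_dir)
    # Assumed minimum set of files; adjust if the README lists others.
    missing = [name for name in ("config.json", "tokenizer.model")
               if not (folder / name).is_file()]
    if not any(folder.glob("*.safetensors")):
        missing.append("*.safetensors")
    return missing


# Example (placeholder path):
# print(missing_model_files(r".\models\Llama-2-7b-Chat-GPTQ") or "All expected files found")
```

An empty list means all three kinds of files were found; anything else names what still needs to be downloaded.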

8. Launch Exllama and configure Web UI:

In the Command Prompt, type python webui/app.py -d .\models\model-name\ to start Exllama, where “model-name” is the name of the directory containing the model you downloaded. This command will run your default browser and load Exllama in Web UI mode, where you can set some basic parameters.

Don’t forget to customize the “Fixed Prompt” section according to your desired character setting.

Allan Witt

Allan Witt is co-founder and editor in chief of Hardware-corner.net. Computers and the web have fascinated me since I was a child. In 2011 I started training as an IT specialist in a medium-sized company and started a blog at the same time. I really enjoy blogging about tech. After successfully completing my training, I worked as a system administrator in the same company for two years. As a part-time job I started tinkering with pre-built PCs and building custom gaming rigs at a local hardware shop. The desire to build PCs full-time grew stronger, and now this is my full-time job.
