Mastering Llama-2 Setup: A Comprehensive Guide to Installing and Running llama.cpp Locally

Welcome to the exciting world of Llama-2 models! In today’s blog post, we’re diving into the process of installing and running these advanced models. Whether you’re a seasoned AI enthusiast or just starting out, understanding Llama-2 models is crucial. These models, known for their efficiency and versatility in handling large-scale data, are a game-changer in the field of machine learning.

In this Shortcut, I give you a step-by-step process to install and run Llama-2 models on your local machine with or without GPUs by using llama.cpp. As I mention in Run Llama-2 Models, this is one of the preferred options.

Here are the steps:

Step 1. Clone the repositories

You should clone the Meta Llama-2 repository as well as llama.cpp:

$ git clone
$ git clone 

Step 2. Request access to download Llama models

Complete the request form, then navigate to the directory where you downloaded the GitHub code provided by Meta. Run the download script, enter the key you received, and adhere to the given instructions:

$ cd llama
$ ./download 

Step 3. Create a virtual environment

To prevent compatibility issues, it’s advisable to create a separate Python environment using Conda. If Conda isn’t already installed on your system, you can refer to this installation guide or use the Miniconda commands below. Once Conda is set up, create a new environment with the conda create command at the end of this block:


# These four commands quickly and quietly install the latest 64-bit version
# of the installer and then clean up after themselves. To install a different
# version or architecture of Miniconda for Linux, change the name of the .sh
# installer in the wget command.

mkdir -p ~/miniconda3
wget -O ~/miniconda3/
bash ~/miniconda3/ -b -u -p ~/miniconda3
rm -rf ~/miniconda3/

# After installing, initialize your newly-installed Miniconda. The following
# commands initialize for bash and zsh shells:

~/miniconda3/bin/conda init bash
~/miniconda3/bin/conda init zsh
$ conda create -n llamacpp python=3.11

Step 4. Activate the environment

Use this command to activate the environment:

$ conda activate llamacpp

Step 5. Go to the llama.cpp folder and install the required dependencies

Type the following commands to continue the installation:

$ cd ..
$ cd llama.cpp
$ python3 -m pip install -r requirements.txt

Step 6. Compile the source code

Option 1:

Enter this command to compile with only CPU support:

$ make 

Option 2:

To compile with CPU and GPU support, you need to have the official CUDA libraries from Nvidia installed.



sudo mv /etc/apt/preferences.d/cuda-repository-pin-600


sudo dpkg -i cuda-repo-ubuntu2204-12-3-local_12.3.0-545.23.06-1_amd64.deb

sudo cp /var/cuda-repo-ubuntu2204-12-3-local/cuda-*-keyring.gpg /usr/share/keyrings/

sudo apt-get update

sudo apt-get -y install cuda-toolkit-12-3

sudo apt-get install -y cuda-drivers

Double-check that the CUDA toolkit and the NVIDIA driver are installed by running:

$ nvcc --version
$ nvidia-smi
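
If you prefer to script this check, here is a minimal sketch in Python. It only looks for the two tools above on your PATH; the function name find_cuda_tools is my own invention and not part of any NVIDIA tooling, and a tool being found does not prove that the driver actually works.

```python
import shutil


def find_cuda_tools(names=("nvcc", "nvidia-smi")):
    """Return a mapping of tool name -> full path, or None if not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for tool, path in find_cuda_tools().items():
        print(f"{tool}: {path or 'NOT FOUND'}")
```

If either tool is reported as not found, revisit the CUDA installation before compiling with GPU support.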

Export the CUDA_HOME environment variable:

$ export CUDA_HOME=/usr/lib/cuda

Then, you can launch the compilation by typing:

$ make clean && LLAMA_CUBLAS=1 CUDA_DOCKER_ARCH=all make -j

Step 7. Perform a model conversion

Run the following commands to convert the original models to f16 format (the examples cover the 7b-chat, 13b-chat, and 70b-chat models):

$ mkdir models/7B
$ python3 --outfile models/7B/ggml-model-f16.gguf --outtype f16 ../llama/llama-2-7b-chat --vocab-dir ../llama/llama

$ mkdir models/13B
$ python3 --outfile models/13B/ggml-model-f16.gguf --outtype f16 ../llama/llama-2-13b-chat --vocab-dir ../llama/llama

$ mkdir models/70B
$ python3 --outfile models/70B/ggml-model-f16.gguf --outtype f16 ../llama/llama-2-70b-chat --vocab-dir ../llama/llama

If the conversion runs successfully, you should have the converted model stored in models/* folders. You can double check this with the ls command:

$ ls models/7B/ 
$ ls models/13B/ 
$ ls models/70B/ 
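
Beyond listing the files, you can check that a converted file at least starts like a GGUF file. The sketch below relies on the fact that GGUF files begin with the four ASCII bytes "GGUF"; the helper name looks_like_gguf is my own, and the demo uses throwaway temporary files rather than a real model. It does not validate the rest of the file.

```python
import os
import tempfile

GGUF_MAGIC = b"GGUF"  # GGUF files begin with these four bytes


def looks_like_gguf(path):
    """Return True if the file starts with the GGUF magic bytes."""
    with open(path, "rb") as f:
        return f.read(4) == GGUF_MAGIC


# Demo on throwaway files (not real models).
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.gguf")
    bad = os.path.join(tmp, "bad.gguf")
    with open(good, "wb") as f:
        f.write(GGUF_MAGIC + b"\x00" * 16)
    with open(bad, "wb") as f:
        f.write(b"not a model")
    good_result = looks_like_gguf(good)
    bad_result = looks_like_gguf(bad)

print(good_result, bad_result)
```

For a real check, call looks_like_gguf("models/7B/ggml-model-f16.gguf") from the llama.cpp folder.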

Step 8. Quantize

Now you can quantize the model, for example to 4 bits, by using the following commands (note the q4_0 parameter at the end):

$ ./quantize models/7B/ggml-model-f16.gguf models/7B/ggml-model-q4_0.gguf q4_0 

$ ./quantize models/13B/ggml-model-f16.gguf models/13B/ggml-model-q4_0.gguf q4_0 

$ ./quantize models/70B/ggml-model-f16.gguf models/70B/ggml-model-q4_0.gguf q4_0 
$ ./quantize models/70B/ggml-model-f16.gguf models/70B/ggml-model-q2_k.gguf Q2_K

The built-in help lists all options and the allowed quantization types:

$ ./quantize -h
usage: ./quantize [--help] [--allow-requantize] [--leave-output-tensor] [--pure] model-f32.gguf [model-quant.gguf] type [nthreads]

  --allow-requantize: Allows requantizing tensors that have already been quantized. Warning: This can severely reduce quality compared to quantizing from 16bit or 32bit
  --leave-output-tensor: Will leave output.weight un(re)quantized. Increases model size but may also increase quality, especially when requantizing
  --pure: Disable k-quant mixtures and quantize all tensors to the same type

Allowed quantization types:
   2  or  Q4_0   :  3.56G, +0.2166 ppl @ LLaMA-v1-7B
   3  or  Q4_1   :  3.90G, +0.1585 ppl @ LLaMA-v1-7B
   8  or  Q5_0   :  4.33G, +0.0683 ppl @ LLaMA-v1-7B
   9  or  Q5_1   :  4.70G, +0.0349 ppl @ LLaMA-v1-7B
  10  or  Q2_K   :  2.63G, +0.6717 ppl @ LLaMA-v1-7B
  12  or  Q3_K   : alias for Q3_K_M
  11  or  Q3_K_S :  2.75G, +0.5551 ppl @ LLaMA-v1-7B
  12  or  Q3_K_M :  3.07G, +0.2496 ppl @ LLaMA-v1-7B
  13  or  Q3_K_L :  3.35G, +0.1764 ppl @ LLaMA-v1-7B
  15  or  Q4_K   : alias for Q4_K_M
  14  or  Q4_K_S :  3.59G, +0.0992 ppl @ LLaMA-v1-7B
  15  or  Q4_K_M :  3.80G, +0.0532 ppl @ LLaMA-v1-7B
  17  or  Q5_K   : alias for Q5_K_M
  16  or  Q5_K_S :  4.33G, +0.0400 ppl @ LLaMA-v1-7B
  17  or  Q5_K_M :  4.45G, +0.0122 ppl @ LLaMA-v1-7B
  18  or  Q6_K   :  5.15G, -0.0008 ppl @ LLaMA-v1-7B
   7  or  Q8_0   :  6.70G, +0.0004 ppl @ LLaMA-v1-7B
   1  or  F16    : 13.00G              @ 7B
   0  or  F32    : 26.00G              @ 7B
          COPY   : only copy tensors, no quantizing
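
To get a feel for where these sizes come from, here is a rough back-of-the-envelope sketch. It assumes the block layouts of the simple formats: Q4_0 stores 32 weights in 18 bytes (16 bytes of 4-bit values plus a 2-byte fp16 scale), and Q8_0 stores 32 weights in 34 bytes. The function name and the table of bits per weight are my own; the K-quant types mix several formats and are not covered.

```python
GIB = 1024 ** 3

# Bits per weight, derived from the block layouts:
#   Q4_0: 32 weights -> 16 bytes of 4-bit values + 2-byte fp16 scale = 18 bytes
#   Q8_0: 32 weights -> 32 bytes of 8-bit values + 2-byte fp16 scale = 34 bytes
BITS_PER_WEIGHT = {
    "F32": 32.0,
    "F16": 16.0,
    "Q8_0": 34 * 8 / 32,
    "Q4_0": 18 * 8 / 32,
}


def estimate_size_gib(n_params, quant_type):
    """Estimate model size in GiB if every tensor used the given type."""
    return n_params * BITS_PER_WEIGHT[quant_type] / 8 / GIB


# 6.74 B is the parameter count llama-bench reports for the 7B model.
estimate = estimate_size_gib(6.74e9, "Q4_0")
print(f"7B at Q4_0: about {estimate:.2f} GiB")
```

The estimate lands slightly below the 3.56 GiB that llama-bench reports for the "mostly Q4_0" 7B model, which is plausible because not every tensor is stored as Q4_0.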

Step 9. Run one of the prompts

Option 1:

You can execute one of the example prompts using only CPU computation by typing the following command:

$ ./main -m ./models/7B/ggml-model-q4_0.gguf -n 1024 --repeat_penalty 1.0 --color -i -r "User:" -f ./prompts/chat-with-bob.txt

This example will initiate the chat in interactive mode in the console, starting with the chat-with-bob.txt prompt example.

Option 2:

If you compiled llama.cpp with GPU support in Step 6, then you can use the following commands:

$ ./main -m ./models/7B/ggml-model-q4_0.gguf -n 1024 --repeat_penalty 1.0 --color -i -r "User:" -f ./prompts/chat-with-bob.txt --n-gpu-layers 3800

$ ./main -m ./models/13B/ggml-model-q4_0.gguf -n 1024 --repeat_penalty 1.0 --color -i -r "User:" -f ./prompts/chat-with-bob.txt --n-gpu-layers 3800

$ ./main -m ./models/70B/ggml-model-q4_0.gguf -n 1024 --repeat_penalty 1.0 --color -i -r "User:" -f ./prompts/chat-with-bob.txt --n-gpu-layers 1000

If you have GPU enabled, set the number of layers to offload to the GPU based on your VRAM capacity. You can increase the number gradually and watch memory usage with the nvtop command until you find the highest value that still fits.
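
As a starting point for that search, here is a very rough sketch that guesses how many layers fit. It assumes all layers are the same size and ignores the KV cache, so treat the result only as a first value to try. The helper name, the 2 GiB reserve, and the 36 GiB size for a 70B Q4_0 file are my assumptions; the layer counts (32 for 7B, 80 for 70B) are those of the Llama-2 architectures.

```python
import math


def layers_that_fit(model_size_gib, n_layers, vram_gib, reserve_gib=2.0):
    """Rough guess at how many layers fit in VRAM (equal-sized layers assumed)."""
    per_layer_gib = model_size_gib / n_layers
    usable_gib = max(0.0, vram_gib - reserve_gib)
    return min(n_layers, math.floor(usable_gib / per_layer_gib))


# 7B Q4_0 (3.56 GiB as reported by llama-bench) on a 24 GiB card
small = layers_that_fit(3.56, 32, 24.0)
# 70B Q4_0, assuming roughly 36 GiB for the file (an estimate, not measured)
large = layers_that_fit(36.0, 80, 24.0)
print(small, large)
```

Passing a value larger than the real layer count to --n-gpu-layers, as in the commands above, simply offloads everything, so the estimate only matters when the model does not fit entirely.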

You can see the full list of parameters of the main command by typing the following:

$ ./main --help

Private AI in HomeLAB: Affordable GPU Solution with NVIDIA Tesla P40

For Private AI in HomeLAB, I was searching for budget-friendly GPUs with a minimum of 24GB RAM. Recently, I came across the refurbished NVIDIA Tesla P40 on eBay, which boasts some intriguing specifications:

  • GPU Chip: GP102
  • Cores: 3840
  • TMUs: 240
  • ROPs: 96
  • Memory Size: 24 GB
  • Memory Type: GDDR5
  • Bus Width: 384 bit

Since the NVIDIA Tesla P40 comes in a full-profile form factor, we needed to acquire a PCIe riser card.

A PCIe riser card, commonly known as a “riser card,” is a hardware component essential in computer systems for facilitating the connection of expansion cards to the motherboard. Its primary role comes into play when space limitations or specific orientation requirements prevent the direct installation of expansion cards into the PCIe slots on the motherboard.

Furthermore, I needed to ensure adequate cooling, but this posed no issue. I utilized a 3D model created by MiHu_Works for a Tesla P100 blower fan adapter, which you can find at this link: Tesla P100 Blower Fan Adapter.

As for the fan, the Titan TFD-B7530M12C served the purpose effectively. You can find it on Amazon: Titan TFD-B7530M12C.

Currently, I am using a single VM with PCIe pass-through. However, it was necessary to implement specific Advanced VM parameters:

  • pciPassthru.use64bitMMIO = true
  • pciPassthru.64bitMMIOSizeGB = 64

Now, you might wonder about the performance. It’s outstanding for prompt processing (pp 512), which runs 16x to 26x faster than on the CPU; token generation (tg 128) improves far less in these runs. To give you an idea of the performance, I ran a llama-bench test:

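| pp 512                | CPU t/s | GPU t/s | Acceleration |
| --------------------- | ------: | ------: | -----------: |
| llama 7B mostly Q4_0  |    9.50 |  155.37 |          16x |
| llama 13B mostly Q4_0 |    5.18 |  134.74 |          26x |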
./llama-bench -t 8
| model                          |       size |     params | backend    |    threads | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ---------: | ---------- | ---------------: |
| llama 7B mostly Q4_0           |   3.56 GiB |     6.74 B | CPU        |          8 | pp 512     |      9.50 ± 0.07 |
| llama 7B mostly Q4_0           |   3.56 GiB |     6.74 B | CPU        |          8 | tg 128     |      8.74 ± 0.12 |

./llama-bench -ngl 3800
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: Tesla P40, compute capability 6.1
| model                          |       size |     params | backend    |  ngl | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ---: | ---------- | ---------------: |
| llama 7B mostly Q4_0           |   3.56 GiB |     6.74 B | CUDA       | 3800 | pp 512     |    155.37 ± 1.26 |
| llama 7B mostly Q4_0           |   3.56 GiB |     6.74 B | CUDA       | 3800 | tg 128     |      9.31 ± 0.19 |

./llama-bench -t 8 -m ./models/13B/ggml-model-q4_0.gguf
| model                          |       size |     params | backend    |    threads | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ---------: | ---------- | ---------------: |
| llama 13B mostly Q4_0          |   6.86 GiB |    13.02 B | CPU        |          8 | pp 512     |      5.18 ± 0.00 |
| llama 13B mostly Q4_0          |   6.86 GiB |    13.02 B | CPU        |          8 | tg 128     |      4.63 ± 0.14 |

./llama-bench -ngl 3800 -m ./models/13B/ggml-model-q4_0.gguf
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: Tesla P40, compute capability 6.1
| model                          |       size |     params | backend    |  ngl | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ---: | ---------- | ---------------: |
| llama 13B mostly Q4_0          |   6.86 GiB |    13.02 B | CUDA       | 3800 | pp 512     |    134.74 ± 1.29 |
| llama 13B mostly Q4_0          |   6.86 GiB |    13.02 B | CUDA       | 3800 | tg 128     |      8.42 ± 0.10 |
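
To make the comparison explicit, here is a small sketch that computes the GPU/CPU ratio for each test. The numbers are copied from the llama-bench runs above (mean t/s only, ignoring the ± spread); the dictionary layout and function name are my own.

```python
# (CPU t/s, GPU t/s) copied from the llama-bench runs above
results = {
    ("7B", "pp 512"): (9.50, 155.37),
    ("7B", "tg 128"): (8.74, 9.31),
    ("13B", "pp 512"): (5.18, 134.74),
    ("13B", "tg 128"): (4.63, 8.42),
}


def speedup(cpu_tps, gpu_tps):
    """Ratio of GPU throughput to CPU throughput."""
    return gpu_tps / cpu_tps


speedups = {key: speedup(*value) for key, value in results.items()}
for (model, test), factor in speedups.items():
    print(f"llama {model} {test}: {factor:.2f}x")
```

The prompt-processing ratios match the 16x and 26x in the summary table, while the token-generation ratios stay below 2x in these runs.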

Feel free to explore this setup for your Private AI in HomeLAB.

DELL R610 or R710: How to Convert an H200A to H200I for Dedicated Slot Use

For my project involving the AI tool llama.cpp, I needed to free up a PCI slot for an NVIDIA Tesla P40 GPU. I found an excellent guide and a useful video from ArtOfServer.

Based on this helpful video from ArtOfServer:

ArtOfServer wrote a short tutorial on how to modify an H200A (external) into an H200I (internal) for use in the dedicated slot (e.g., instead of a Perc6i).


Install the compiler and build tools (these can be removed later):

# apt install build-essential unzip

Compile and install lsirec and lsiutil:

# mkdir lsi
# cd lsi
# wget
# wget
# tar -zxvvf lsiutil-1.72.tar.gz
# unzip
# cd lsirec-master
# make
# chmod +x
# cp -p lsirec /usr/bin/
# cp -p /usr/bin/
# cd ../lsiutil
# make -f Makefile_Linux

Modify SBR to match an internal H200I

Get bus address:

# lspci -Dmmnn | grep LSI
0000:05:00.0 "Serial Attached SCSI controller [0107]" "LSI Logic / Symbios Logic [1000]" "SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon] [0072]" -r03 "Dell [1028]" "6Gbps SAS HBA Adapter [1f1c]"

Bus address 0000:05:00.0

We are going to change id 0x1f1c to 0x1f1e
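
If you want to pull the bus address and the current subsystem ID out of that line programmatically, a minimal sketch could look like this. It parses the exact lspci -Dmmnn line shown above; the function name is my own, and it assumes the subsystem ID is the bracketed four-digit hex value at the end of the last quoted field.

```python
import re
import shlex

# The lspci -Dmmnn line from above, copied verbatim.
LINE = (
    '0000:05:00.0 "Serial Attached SCSI controller [0107]" '
    '"LSI Logic / Symbios Logic [1000]" '
    '"SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon] [0072]" -r03 '
    '"Dell [1028]" "6Gbps SAS HBA Adapter [1f1c]"'
)


def parse_lspci_line(line):
    """Return (bus_address, subsystem_id) from one lspci -Dmmnn line."""
    fields = shlex.split(line)  # honours the double-quoted fields
    match = re.search(r"\[([0-9a-fA-F]{4})\]$", fields[-1])
    return fields[0], (match.group(1) if match else None)


bus_address, subsys_id = parse_lspci_line(LINE)
print(bus_address, subsys_id)
```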

Unbind and halt card:

# lsirec 0000:05:00.0 unbind
Trying unlock in MPT mode...
Device in MPT mode
Kernel driver unbound from device
# lsirec 0000:05:00.0 halt
Device in MPT mode
Resetting adapter in HCB mode...
Trying unlock in MPT mode...
Device in MPT mode

Read sbr:

# lsirec 0000:05:00.0 readsbr h200.sbr
Device in MPT mode
Using I2C address 0x54
Using EEPROM type 1
Reading SBR...
SBR saved to h200.sbr

Transform binary sbr to text file:

# parse h200.sbr h200.cfg

Modify the PID in line 9 (e.g., using vi or vim):
from this:
SubsysPID = 0x1f1c
to this:
SubsysPID = 0x1f1e

Important: if in the cfg file you find a line with:
SASAddr = 0xfffffffffffff
remove it!
Save and close file.
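
If you would rather not edit the file by hand, the same two changes can be sketched in Python. This is only an illustration on a made-up three-line sample: the SubsysVID line is hypothetical, a real cfg file has more entries, and I assume an SASAddr value made up only of f digits is the one to drop. The function name patch_sbr_cfg is my own.

```python
def patch_sbr_cfg(text, new_pid="0x1f1e"):
    """Set SubsysPID to new_pid and drop an all-f SASAddr line."""
    patched = []
    for line in text.splitlines():
        key, _, value = line.partition("=")
        key, value = key.strip(), value.strip().lower()
        # Assumption: an SASAddr consisting only of 'f' digits is the line to remove.
        if key == "SASAddr" and value.startswith("0x") and set(value[2:]) == {"f"}:
            continue
        if key == "SubsysPID":
            line = f"SubsysPID = {new_pid}"
        patched.append(line)
    return "\n".join(patched) + "\n"


# Made-up sample; only SubsysPID and SASAddr come from the steps above.
SAMPLE = "SubsysVID = 0x1028\nSubsysPID = 0x1f1c\nSASAddr = 0xfffffffffffff\n"
result = patch_sbr_cfg(SAMPLE)
print(result)
```

Review the patched text before building the new sbr file, since a wrong SBR can leave the card unusable.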

Build new sbr file:

# build h200.cfg h200-int.sbr

Write it back to card:

# lsirec 0000:05:00.0 writesbr h200-int.sbr
Device in MPT mode
Using I2C address 0x54
Using EEPROM type 1
Writing SBR...
SBR written from h200-int.sbr

Reset the card and rescan the bus:

# lsirec 0000:05:00.0 reset
Device in MPT mode
Resetting adapter...
# lsirec 0000:05:00.0 info
Trying unlock in MPT mode...
Device in MPT mode
DOORBELL: 0x10000000
DIAG: 0x000000b0
DCR_I2C_SELECT: 0x80030a0c
DCR_SBR_SELECT: 0x2100001b
CHIP_I2C_PINS: 0x00000003
# lsirec 0000:05:00.0 rescan
Device in MPT mode
Removing PCI device...
Rescanning PCI bus...
PCI bus rescan complete.

Verify new id (H200I):

# lspci -Dmmnn | grep LSI
0000:05:00.0 "Serial Attached SCSI controller [0107]" "LSI Logic / Symbios Logic [1000]" "SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon] [0072]" -r03 "Dell [1028]" "PERC H200 Integrated [1f1e]"

You can now move the card to the dedicated slot 🙂

Thanks to ArtOfServer for a great video.

CUDA Support for WSL 2

For more efficient testing of Llama-2, I recommend taking advantage of GPU acceleration in WSL 2, which is available on notebooks. This approach significantly increases performance when working with Llama-2. In my latest blog post, you will find a detailed guide on how to quickly set up GPU acceleration in WSL 2 on your notebook.

  • First, install the latest NVIDIA Windows GPU driver, which fully supports WSL 2. With CUDA support in the driver, existing applications (compiled elsewhere on a Linux system for the same target GPU) can run unmodified within the WSL environment.
  • Once a Windows NVIDIA GPU driver is installed on the system, CUDA becomes available within WSL 2. The CUDA driver installed on the Windows host is stubbed inside WSL 2; therefore, users must not install any NVIDIA GPU Linux driver within WSL 2.
  • Installation of Linux x86 CUDA Toolkit using WSL-Ubuntu Package
sudo mv /etc/apt/preferences.d/cuda-repository-pin-600
sudo dpkg -i cuda-repo-wsl-ubuntu-12-3-local_12.3.1-1_amd64.deb
sudo cp /var/cuda-repo-wsl-ubuntu-12-3-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-3



VMware Private AI with Intel brings very interesting support for OpenVINO. To illustrate what OpenVINO enables, here is a brief summary, shared with her permission, of the thesis ACCELERATION OF FACE RECOGNITION ALGORITHM WITH NEURAL COMPUTE STICK 2 by my daughter Eva Micankova. Accelerated with the NCS2 on a Maxtang NX6412 NUC, the FaceNet model (accuracy 97.08%) achieved a frame rate of 15.35 FPS and a latency of 65.164 ms.

OpenVINO tools

Intel’s OpenVINO is an open-source suite of tools for the optimization and deployment of deep learning models. It delivers better performance for vision, audio, and language models from popular frameworks such as TensorFlow, Caffe, PyTorch, and others. OpenVINO optimizes deep learning pipelines through memory reuse, graph fusion, load balancing, and inference parallelism across CPUs, GPUs, VPUs, and others, as seen in the figure. Accelerators can have additional pre-processing and post-processing operations transferred or integrated to reduce end-to-end latency and improve throughput.

Popular algorithms FaceNet, SphereFace, and ArcFace, which differ in architecture and training procedures, all aim to learn a vector representation of the face that is robust to changes in conditions.


The FaceNet model was developed by Google’s research group in 2015. The model maps face images to points in a Euclidean space, referred to as embeddings, where the distance between two points serves as a measure of how similar or different the faces are.
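
To make the idea of comparing embeddings concrete, here is a toy sketch. The three-dimensional vectors are invented for illustration (the original FaceNet paper uses 128-dimensional embeddings), and the helper names are my own; it only shows that two embeddings of the same person should end up closer together than embeddings of different people.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def euclidean_distance(a, b):
    """Euclidean distance between two equally long vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Invented toy embeddings, not produced by a real model.
anchor = l2_normalize([1.0, 0.0, 0.0])
same_person = l2_normalize([0.9, 0.1, 0.0])
other_person = l2_normalize([0.0, 1.0, 0.0])

d_same = euclidean_distance(anchor, same_person)
d_other = euclidean_distance(anchor, other_person)
print(f"same person: {d_same:.3f}, other person: {d_other:.3f}")
```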


The authors of SphereFace introduced a loss function A-Softmax, derived from the softmax loss function, in their work published in 2017. The A-Softmax (Angular Softmax) loss function was designed to learn discriminative facial features with a clear geometric interpretation, which no available face recognition algorithm offered until then.


ArcFace (Additive Angular Margin loss) is a loss function first introduced in 2018. It builds on SphereFace, which introduced the concept of an angular margin to improve class separability and thereby the performance of face recognition. However, the SphereFace loss function required a series of approximations, which led to unstable network training. In addition, the standard softmax loss dominated training, so the angular margin was not fully utilized. ArcFace addresses these shortcomings with the Additive Angular Margin loss, which allows for better class separability and more stable training without the approximations used in SphereFace.
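
As an illustration of the additive angular margin, here is a minimal single-sample sketch in plain Python. It assumes the standard formulation, in which the margin m is added to the angle of the target class before scaling by s; the defaults s=64 and m=0.5 follow the ArcFace paper, while the function name and the toy cosine values are my own. It is not the code used in the thesis.

```python
import math


def arcface_loss(cosines, target, s=64.0, m=0.5):
    """Cross-entropy over scaled cosines, with margin m added to the target angle."""
    logits = []
    for j, cosine in enumerate(cosines):
        if j == target:
            theta = math.acos(max(-1.0, min(1.0, cosine)))
            logits.append(s * math.cos(theta + m))
        else:
            logits.append(s * cosine)
    # Numerically stable log-sum-exp
    largest = max(logits)
    log_sum = largest + math.log(sum(math.exp(z - largest) for z in logits))
    return log_sum - logits[target]


# Toy cosine similarities between one embedding and three class centres.
cosines = [0.5, 0.4, 0.3]
without_margin = arcface_loss(cosines, target=0, m=0.0)
with_margin = arcface_loss(cosines, target=0, m=0.5)
print(without_margin, with_margin)
```

With m=0 the function reduces to a softmax over scaled cosines; the margin makes the same sample look much harder, which is what pushes the classes further apart during training.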


Vision Processing Unit (VPU) accelerators are chips created to accelerate image processing using computer vision and deep learning algorithms. The Intel Neural Compute Stick 2 is a powerful, affordable, and compact solution, with low power consumption, for accelerating neural networks. It is designed to run deep neural networks at high speeds with low energy consumption without losing accuracy, enabling real-time computer vision processing.


One Intel NCS2 was used for accelerating face detection and a second Intel NCS2 was used for face recognition.

Figure A.4

Graph showing the accuracy of all validated models as a function of the threshold. ArcFace reached its highest accuracy of 0.84 at a threshold of 0.79, and SphereFace reached its highest accuracy of 0.77 at a threshold of 0.74. FaceNet achieved the best accuracy of all the compared models, 0.982 at a threshold of 0.57.
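
A threshold sweep like the one in this graph can be sketched as follows. The scores and labels are invented toy data, and I assume that a pair counts as "same person" when its similarity score is at or above the threshold; the thesis may define the score and the sweep differently.

```python
def accuracy_at(threshold, scores, labels):
    """Fraction of pairs classified correctly at the given threshold."""
    correct = sum((score >= threshold) == label for score, label in zip(scores, labels))
    return correct / len(scores)


def best_threshold(scores, labels, candidates):
    """Return the first candidate threshold with the highest accuracy."""
    return max(candidates, key=lambda t: accuracy_at(t, scores, labels))


# Invented similarity scores; True means the pair shows the same person.
scores = [0.91, 0.83, 0.62, 0.58, 0.45, 0.30]
labels = [True, True, True, False, False, False]
candidates = [i / 100 for i in range(0, 101)]

best = best_threshold(scores, labels, candidates)
best_accuracy = accuracy_at(best, scores, labels)
print(best, best_accuracy)
```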

Figure 6.13

Graph comparing the achieved frame rate for each series.

  • 1st series – ArcFace model, one person
  • 2nd series – ArcFace model, two people
  • 3rd series – FaceNet model, one person
  • 4th series – FaceNet model, two people
  • 5th series – SphereFace model, one person
  • 6th series – SphereFace model, two people


The best results were achieved by the FaceNet model, which reached an accuracy of 97.08%. The second part of the experiments focused on evaluating the speed of recognition on different platforms. The experiments provided answers to questions about the accuracy achieved by each model, the results of system acceleration using CPU, GPU, and NCS2, which configuration is most suitable for each model, and which configuration achieves the highest frame rate and lowest latency among the compared models. The best frame rates and latency were achieved by the SphereFace model, accelerated using NCS2, with 17.15 FPS and a latency of 58.293 ms. The FaceNet model achieved a frame rate of 15.35 FPS and a latency of 65.164 ms. For achieving a balance between accuracy and speed, the best system configuration is with the FaceNet model, accelerated using NCS2.
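
The reported frame rates and latencies are consistent with one frame being processed at a time, so that FPS is roughly 1000 divided by the latency in milliseconds. Here is a quick sketch of that check; the relation is my own reading of the numbers, not something stated in the thesis.

```python
def fps_from_latency_ms(latency_ms):
    """Frames per second if frames are processed one after another."""
    return 1000.0 / latency_ms


# (latency in ms, reported FPS) taken from the paragraph above
reported = {
    "SphereFace": (58.293, 17.15),
    "FaceNet": (65.164, 15.35),
}

for model, (latency_ms, reported_fps) in reported.items():
    computed = fps_from_latency_ms(latency_ms)
    print(f"{model}: computed {computed:.2f} FPS, reported {reported_fps} FPS")
```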


The thesis ACCELERATION OF FACE RECOGNITION ALGORITHM WITH NEURAL COMPUTE STICK 2 focuses on recognizing faces in images using neural networks and on accelerating that process. It provides an overview of previously used techniques and addresses the use of the currently dominant convolutional neural networks for this task. The work also covers the acceleration mechanisms that can be used in this area. Based on this background, a system built on the concept of edge computing was created. It can be used as a home security system connected to an IP camera, sending a notification when an unknown person is present in a guarded area.