
Mastering Llama-2 Setup: A Comprehensive Guide to Installing and Running llama.cpp Locally

Welcome to the exciting world of Llama-2 models! In today’s blog post, we’re diving into the process of installing and running these advanced models. Whether you’re a seasoned AI enthusiast or just starting out, understanding Llama-2 models is crucial. These models, known for their efficiency and versatility in handling large-scale data, are a game-changer in the field of machine learning.

In this Shortcut, I give you a step-by-step process to install and run Llama-2 models on your local machine, with or without a GPU, using llama.cpp. As I mention in Run Llama-2 Models, this is one of the preferred options.

Here are the steps:

Step 1. Clone the repositories

You should clone the Meta Llama-2 repository as well as llama.cpp:

$ git clone https://github.com/facebookresearch/llama.git
$ git clone https://github.com/ggerganov/llama.cpp.git 

Step 2. Request access to download Llama models

Complete the request form, then navigate to the directory where you downloaded the GitHub code provided by Meta. Run the download script, enter the key you received, and adhere to the given instructions:

$ cd llama
$ ./download.sh
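
Before moving on, it can help to confirm that the download actually completed. The sketch below assumes the layout that Meta's script typically produces (a params.json plus one or more consolidated.*.pth shards in each model folder). check_model_dir is a hypothetical helper name of mine, not part of either repository.

```shell
# Hypothetical helper: succeeds when a downloaded model folder looks complete.
# Assumed layout: <dir>/params.json plus at least one <dir>/consolidated.*.pth
check_model_dir() {
    cmd_dir=$1
    [ -f "$cmd_dir/params.json" ] || { echo "missing params.json in $cmd_dir" >&2; return 1; }
    for cmd_shard in "$cmd_dir"/consolidated.*.pth; do
        # If the glob matched nothing, the literal pattern is kept and -f fails.
        [ -f "$cmd_shard" ] && return 0
    done
    echo "no consolidated.*.pth shards in $cmd_dir" >&2
    return 1
}
```

For example, from inside the llama folder: check_model_dir llama-2-7b-chat && echo "7b-chat looks complete"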

Step 3. Create a virtual environment

To prevent any compatibility issues, it’s advisable to create a separate Python environment using Conda. If Conda isn’t already installed on your system, you can install Miniconda by following the installation guide:

#https://docs.conda.io/projects/miniconda/en/latest/

#These four commands quickly and quietly install the latest 64-bit version
#of the installer and then clean up after themselves. To install a different
#version or architecture of Miniconda for Linux, change the name of the .sh
#installer in the wget command.

mkdir -p ~/miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
rm -rf ~/miniconda3/miniconda.sh

#After installing, initialize your newly-installed Miniconda. The following
#commands initialize for bash and zsh shells:

~/miniconda3/bin/conda init bash
~/miniconda3/bin/conda init zsh

Once Conda is set up (open a new shell so that the initialization takes effect), create a new environment by entering the command below:

$ conda create -n llamacpp python=3.11

Step 4. Activate the environment

Use this command to activate the environment:

$ conda activate llamacpp
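
If you script the remaining steps, a small guard can stop them from running in the wrong environment. This is a minimal sketch that relies on the CONDA_DEFAULT_ENV variable, which conda activate sets to the name of the active environment. The function name in_conda_env is my own.

```shell
# Succeeds when the currently active Conda environment matches the given name.
# CONDA_DEFAULT_ENV is set by `conda activate`; it is unset outside Conda.
in_conda_env() {
    [ "${CONDA_DEFAULT_ENV:-}" = "$1" ]
}
```

A typical use would be: in_conda_env llamacpp || echo "run: conda activate llamacpp"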

Step 5. Go to the llama.cpp folder and install the required dependencies

Type the following commands to continue the installation:

$ cd ..
$ cd llama.cpp
$ python3 -m pip install -r requirements.txt

Step 6. Compile the source code

Option 1:

Enter this command to compile with only CPU support:

$ make 

Option 2:

To compile with CPU and GPU support, you need to have the official CUDA libraries from NVIDIA installed. The commands below install CUDA 12.3 on Ubuntu 22.04 (x86_64):

#https://developer.nvidia.com/cuda-12-3-0-download-archive?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=22.04&target_type=deb_local

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-ubuntu2204.pin

sudo mv cuda-ubuntu2204.pin /etc/apt/preferences.d/cuda-repository-pin-600

wget https://developer.download.nvidia.com/compute/cuda/12.3.0/local_installers/cuda-repo-ubuntu2204-12-3-local_12.3.0-545.23.06-1_amd64.deb

sudo dpkg -i cuda-repo-ubuntu2204-12-3-local_12.3.0-545.23.06-1_amd64.deb

sudo cp /var/cuda-repo-ubuntu2204-12-3-local/cuda-*-keyring.gpg /usr/share/keyrings/

sudo apt-get update

sudo apt-get -y install cuda-toolkit-12-3

sudo apt-get install -y cuda-drivers

Double-check that the CUDA toolkit and the NVIDIA driver are installed by running:

$ nvcc --version
$ nvidia-smi

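Export the CUDA_HOME environment variable. NVIDIA's packages installed above normally place the toolkit under /usr/local/cuda, so adjust the path if /usr/lib/cuda does not exist on your system: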

$ export CUDA_HOME=/usr/lib/cuda

Then, you can launch the compilation by typing:

$ make clean && LLAMA_CUBLAS=1 CUDA_DOCKER_ARCH=all make -j
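
Since the right CUDA_HOME depends on how CUDA was installed, here is a minimal sketch that looks for the toolkit instead of hard-coding the path. find_cuda_home is a hypothetical helper, and the two default candidate directories are assumptions about common install locations, not an exhaustive list.

```shell
# Hypothetical helper: print the first directory that contains an executable bin/nvcc.
# With no arguments it tries two common locations (assumed, not exhaustive).
find_cuda_home() {
    [ "$#" -gt 0 ] || set -- /usr/local/cuda /usr/lib/cuda
    for fch_dir in "$@"; do
        if [ -x "$fch_dir/bin/nvcc" ]; then
            printf '%s\n' "$fch_dir"
            return 0
        fi
    done
    return 1
}
```

You could then set the variable with: export CUDA_HOME=$(find_cuda_home)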

Step 7. Perform a model conversion

Run the following commands to convert the original models to f16 format (the examples cover the 7b-chat, 13b-chat, and 70b-chat models):

$ mkdir models/7B
$ python3 convert.py --outfile models/7B/ggml-model-f16.gguf --outtype f16 ../llama/llama-2-7b-chat --vocab-dir ../llama/llama

$ mkdir models/13B
$ python3 convert.py --outfile models/13B/ggml-model-f16.gguf --outtype f16 ../llama/llama-2-13b-chat --vocab-dir ../llama/llama

$ mkdir models/70B
$ python3 convert.py --outfile models/70B/ggml-model-f16.gguf --outtype f16 ../llama/llama-2-70b-chat --vocab-dir ../llama/llama

If the conversion runs successfully, the converted models are stored in the models/7B, models/13B, and models/70B folders. You can double-check this with the ls command:

$ ls models/7B/ 
$ ls models/13B/ 
$ ls models/70B/ 
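
Beyond ls, a quick sanity check is to look at the file header: GGUF files start with the four ASCII bytes "GGUF". The sketch below only checks that magic value, so it catches an empty or wrong-format file but does not validate the rest of the model. is_gguf is my own helper name.

```shell
# Succeeds when the file exists and starts with the 4-byte ASCII magic "GGUF".
is_gguf() {
    [ -f "$1" ] && [ "$(head -c 4 "$1" 2>/dev/null)" = "GGUF" ]
}
```

For example: is_gguf models/7B/ggml-model-f16.gguf && echo "header looks fine"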

Step 8. Quantize

Now, you can quantize the models to 4 bits, for example, by using the following commands (please note the q4_0 parameter at the end):

$ ./quantize models/7B/ggml-model-f16.gguf models/7B/ggml-model-q4_0.gguf q4_0 

$ ./quantize models/13B/ggml-model-f16.gguf models/13B/ggml-model-q4_0.gguf q4_0 

$ ./quantize models/70B/ggml-model-f16.gguf models/70B/ggml-model-q4_0.gguf q4_0 
$ ./quantize models/70B/ggml-model-f16.gguf models/70B/ggml-model-q2_k.gguf Q2_K

To list all the supported quantization types, run:

$ ./quantize -h
usage: ./quantize [--help] [--allow-requantize] [--leave-output-tensor] [--pure] model-f32.gguf [model-quant.gguf] type [nthreads]

  --allow-requantize: Allows requantizing tensors that have already been quantized. Warning: This can severely reduce quality compared to quantizing from 16bit or 32bit
  --leave-output-tensor: Will leave output.weight un(re)quantized. Increases model size but may also increase quality, especially when requantizing
  --pure: Disable k-quant mixtures and quantize all tensors to the same type

Allowed quantization types:
   2  or  Q4_0   :  3.56G, +0.2166 ppl @ LLaMA-v1-7B
   3  or  Q4_1   :  3.90G, +0.1585 ppl @ LLaMA-v1-7B
   8  or  Q5_0   :  4.33G, +0.0683 ppl @ LLaMA-v1-7B
   9  or  Q5_1   :  4.70G, +0.0349 ppl @ LLaMA-v1-7B
  10  or  Q2_K   :  2.63G, +0.6717 ppl @ LLaMA-v1-7B
  12  or  Q3_K   : alias for Q3_K_M
  11  or  Q3_K_S :  2.75G, +0.5551 ppl @ LLaMA-v1-7B
  12  or  Q3_K_M :  3.07G, +0.2496 ppl @ LLaMA-v1-7B
  13  or  Q3_K_L :  3.35G, +0.1764 ppl @ LLaMA-v1-7B
  15  or  Q4_K   : alias for Q4_K_M
  14  or  Q4_K_S :  3.59G, +0.0992 ppl @ LLaMA-v1-7B
  15  or  Q4_K_M :  3.80G, +0.0532 ppl @ LLaMA-v1-7B
  17  or  Q5_K   : alias for Q5_K_M
  16  or  Q5_K_S :  4.33G, +0.0400 ppl @ LLaMA-v1-7B
  17  or  Q5_K_M :  4.45G, +0.0122 ppl @ LLaMA-v1-7B
  18  or  Q6_K   :  5.15G, -0.0008 ppl @ LLaMA-v1-7B
   7  or  Q8_0   :  6.70G, +0.0004 ppl @ LLaMA-v1-7B
   1  or  F16    : 13.00G              @ 7B
   0  or  F32    : 26.00G              @ 7B
          COPY   : only copy tensors, no quantizing
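
The sizes in this table are for a 7B model. To get a rough idea of the disk space needed for a larger model, one option is to scale its F16 file size by the same ratio. The sketch below does that for a handful of types. It assumes that size scales linearly with the F16 size, which is only an approximation, and estimate_quant_size is a hypothetical helper.

```shell
# Rough estimate: scale an F16 file size by the 7B ratio from the table above.
# Usage: estimate_quant_size <f16_size_in_G> <type>
estimate_quant_size() {
    case $2 in
        Q2_K)   eqs_size=2.63 ;;
        Q4_0)   eqs_size=3.56 ;;
        Q4_K_M) eqs_size=3.80 ;;
        Q5_K_M) eqs_size=4.45 ;;
        Q8_0)   eqs_size=6.70 ;;
        *) echo "type not covered by this sketch: $2" >&2; return 1 ;;
    esac
    # 13.00 is the F16 size of the 7B model in the same table.
    awk -v f16="$1" -v q="$eqs_size" 'BEGIN { printf "%.2f\n", f16 * q / 13.00 }'
}
```

For example, estimate_quant_size 26 Q4_0 estimates the q4_0 size of a model whose f16 file is 26G.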

Step 9. Run one of the prompts

Option 1:

You can execute one of the example prompts using only CPU computation by typing the following command:

$ ./main -m ./models/7B/ggml-model-q4_0.gguf -n 1024 --repeat_penalty 1.0 --color -i -r "User:" -f ./prompts/chat-with-bob.txt

This example will initiate the chat in interactive mode in the console, starting with the chat-with-bob.txt prompt example.

Option 2:

If you compiled llama.cpp with GPU enabled in Step 6, then you can use the following command:

$ ./main -m ./models/7B/ggml-model-q4_0.gguf -n 1024 --repeat_penalty 1.0 --color -i -r "User:" -f ./prompts/chat-with-bob.txt --n-gpu-layers 3800

$ ./main -m ./models/13B/ggml-model-q4_0.gguf -n 1024 --repeat_penalty 1.0 --color -i -r "User:" -f ./prompts/chat-with-bob.txt --n-gpu-layers 3800

$ ./main -m ./models/70B/ggml-model-q4_0.gguf -n 1024 --repeat_penalty 1.0 --color -i -r "User:" -f ./prompts/chat-with-bob.txt --n-gpu-layers 1000

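If you have GPU enabled, you need to set the number of layers to offload to the GPU based on your VRAM capacity. A value larger than the model's layer count, as in the commands above, simply offloads every layer. If the model does not fit, start with a small number, increase it gradually, and check VRAM usage with the nvtop command until you find a good balance.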
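
To get a starting value instead of guessing, you can divide the free VRAM by an approximate per-layer size. The sketch below is a back-of-the-envelope estimate. It assumes that all layers are the same size and reserves a fixed amount of memory (1024 MiB by default) for the context and scratch buffers. Both are simplifications, and estimate_gpu_layers is a hypothetical helper. For reference, Llama-2 7B has 32 layers, 13B has 40, and 70B has 80.

```shell
# Hypothetical helper: how many layers fit in the given VRAM?
# Usage: estimate_gpu_layers <model_file_MiB> <n_layers> <free_vram_MiB> [reserve_MiB]
# Assumes equally sized layers and keeps reserve_MiB (default 1024) free.
estimate_gpu_layers() {
    egl_reserve=${4:-1024}
    egl_per_layer=$(( ($1 + $2 - 1) / $2 ))   # MiB per layer, rounded up
    egl_usable=$(( $3 - egl_reserve ))
    if [ "$egl_usable" -le 0 ]; then
        echo 0
        return 0
    fi
    egl_fit=$(( egl_usable / egl_per_layer ))
    # Never report more layers than the model has.
    if [ "$egl_fit" -gt "$2" ]; then
        egl_fit=$2
    fi
    echo "$egl_fit"
}
```

The free VRAM in MiB can be read with nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits, and the result passed to --n-gpu-layers as a first guess to refine with nvtop.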

You can see the full list of parameters of the main command by typing the following:

$ ./main --help

VMware vCenter Server 8.0 Update 3c: Fixing vSphere Client Idle Session Issue

VMware has released vCenter Server 8.0 Update 3c, bringing several key improvements and bug fixes. Among these, one notable issue addressed in this release relates to the vSphere Client’s behavior when left idle for extended periods.

PR 3439359: vSphere Client Session Becomes Unresponsive After 50 Minutes of Inactivity

In previous versions, particularly starting from vSphere 8.0 Update 3b, users encountered a frustrating issue with the vSphere Client. If a session remained idle for more than 50 minutes, the client would become unresponsive, making it impossible to log in or log out. Attempting to resume work in the same browser would yield no results unless all browser cookies were cleared. This was not only an inconvenience but also a disruption for administrators managing their vSphere environments.

Cause of the Issue: Apache Tomcat 9.0.91 Upgrade

The root of the problem was traced back to an upgrade to Apache Tomcat 9.0.91, introduced in vSphere 8.0 Update 3b. This upgrade brought with it a change in the default value of the property org.apache.catalina.connector.RECYCLE_FACADES. Previously set to FALSE, this value was altered to TRUE, causing sessions to become invalid after extended inactivity. This meant that any session left idle for over 50 minutes would not properly refresh, effectively locking the user out until they manually cleared cookies from their browser.

Quick Tip – Using PowerCLI to query VMware…

One of the most powerful and versatile VM management capabilities in vSphere is the Guest Operations API, providing a rich set of operations from transferring files to/from the guest to running commands directly on the guest as if you were logged in! An easy way to consume the Guest Operations API […]


Broadcom Social Media Advocacy

Quick Tip – SSH Server, Client & Authorized Key Configurations for ESXi 7.0 Update 1 and later

The general best practice is to disable SSH on your ESXi host by default and if/when you need access, you can turn it on temporarily and disable it when you have completed your task. For users that need to modify the default SSH configurations whether that is on the server side, client side or setting […]


Intel Skylake CPUs Reaching End of Support in Future vSphere Releases after 8.x

As the IT industry continues to evolve, so do the platforms and hardware that support our digital infrastructure. One significant upcoming change is related to Intel’s Skylake generation of processors, which has entered the End of Servicing Update (ESU) and End of Servicing Lifetime (EOSL) phase. By December 31, 2023, Intel will officially stop providing updates for Skylake server-class processors, including the Xeon Scalable Processors (SP) series. This change is set to impact future VMware vSphere releases, as VMware plans to discontinue support for Intel Skylake CPUs in its next major release following vSphere 8.x.

Why Skylake CPUs are Being Phased Out

Intel’s Skylake architecture, introduced in 2015, has been widely adopted in server environments for its balance of performance and power efficiency. The Xeon Scalable Processor series, which is part of the Skylake generation, has been foundational in many data centers around the world. However, as technology progresses, older generations of processors become less relevant in the context of modern workloads and new advancements in CPU architectures.

Impact on VMware vSphere Users

With VMware announcing plans to drop support for Skylake CPUs in a future major release after vSphere 8.x, organizations relying on these processors need to start planning for hardware refreshes. As VMware’s virtualization platform evolves, it is optimized for more modern CPU architectures that offer enhanced performance, security, and energy efficiency.

More info: CPU Support Deprecation and Discontinuation In vSphere Releases

Cisco UCS Manager Release 4.3(4a): New Optimized Adapter Policies for VIC Series Adapters

Starting with Cisco UCS Manager release 4.3(4a), Cisco has introduced optimized adapter policies for Windows, Linux, and VMware operating systems, including a new policy for VMware environments called “VMware-v2.” This update affects the Cisco UCS VIC 1400, 14000, and 15000 series adapters, promising improved performance and flexibility.

This release is particularly interesting for those managing VMware infrastructures, as many organizations—including ours—have been using similar settings for years. However, one notable difference is that the default configuration in the new policy sets Interrupts to 11, while in our environment, we’ve historically set it to 12.

Key Enhancements in UCS 4.3(4a)

  1. Optimized Adapter Policies: The new “VMware-v2” policy is tailored to enhance performance in VMware environments, specifically for the Cisco UCS VIC 1400, 14000, and 15000 adapters. It adjusts parameters such as the number of interrupts, queue depths, and receive/transmit buffers to achieve better traffic handling and lower latency.
  2. Receive Side Scaling (RSS): A significant feature available on the Cisco UCS VIC series is Receive Side Scaling (RSS). RSS is crucial for servers handling large volumes of network traffic as it allows the incoming network packets to be distributed across multiple CPU cores, enabling parallel processing. This distribution improves the overall throughput and reduces bottlenecks caused by traffic being handled by a single core. In high-performance environments like VMware, this can lead to a noticeable improvement in network performance. RSS is enabled on a per-vNIC basis, meaning administrators have granular control over which virtual network interfaces benefit from the feature. Given the nature of modern server workloads, enabling RSS on vNICs handling critical traffic can substantially improve performance, particularly in environments with multiple virtual machines.
  3. Maximizing Ring Size: Another important recommendation for administrators using the VIC 1400 adapters is to set the ring size to the maximum, which for these adapters is 4096. The ring size determines how much data can be queued for processing by the NIC (Network Interface Card) before being handled by the CPU. A larger ring size allows for better performance, especially when dealing with bursts of high traffic. In environments where high throughput and low latency are critical, setting the ring size to its maximum value ensures that traffic can be handled more efficiently, reducing the risk of packet drops or excessive buffering.

vSphere Client Instability and Session Timeouts After vCenter Server 8.0.3.00200 Upgrade: How to Resolve

After upgrading to vCenter Server 8.0.3.00200, some users have reported issues with the vSphere Client becoming unstable, particularly after long periods of session idleness (typically 1-2 hours). This instability may manifest in a variety of ways, including session timeouts, continuous loading indicators, and errors when browsing the inventory.

Root Cause

The root cause of this instability appears to be related to a misconfiguration in how the vSphere Client handles facade recycling within the Apache Catalina Connector.

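Solution: Updating the Catalina Configuration

Back up the catalina.properties file, append the RECYCLE_FACADES property with the value false, and restart the vsphere-ui service: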

root@vcsa-home [ ~ ]# cp /usr/lib/vmware-vsphere-ui/server/conf/catalina.properties /root/catalina.properties.bak

root@vcsa-home [ ~ ]# echo "org.apache.catalina.connector.RECYCLE_FACADES=false" >> /usr/lib/vmware-vsphere-ui/server/conf/catalina.properties

root@vcsa-home [ ~ ]# service-control --restart vsphere-ui
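
Appending with echo adds a second line if the command is run twice or if the property is already present. The following is a minimal sketch of an idempotent alternative. set_property is a hypothetical helper, it only recognizes lines written exactly as key=value (no spaces around the equals sign), and it has not been tested on a vCenter Server appliance, so treat it as an illustration rather than a supported procedure.

```shell
# Hypothetical helper: set key=value in a properties file without duplicating the key.
# Usage: set_property <file> <key> <value>
set_property() {
    sp_file=$1
    sp_key=$2
    sp_value=$3
    sp_tmp=$(mktemp) || return 1
    # Drop any existing "key=..." line, append the desired setting, and write
    # the result back in place so the file keeps its owner and permissions.
    if awk -v k="$sp_key=" 'index($0, k) != 1' "$sp_file" > "$sp_tmp" &&
       printf '%s=%s\n' "$sp_key" "$sp_value" >> "$sp_tmp" &&
       cat "$sp_tmp" > "$sp_file"; then
        rm -f "$sp_tmp"
        return 0
    fi
    rm -f "$sp_tmp"
    return 1
}
```

With the backup from the first command in place, the call would be: set_property /usr/lib/vmware-vsphere-ui/server/conf/catalina.properties org.apache.catalina.connector.RECYCLE_FACADES false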

Quick Tip – Monitoring ESXi remote syslog…

When an ESXi host is unable to forward its logs to a remote syslog server, a VMkernel Observation (VOB) is automatically raised by the host and it can be used to proactively alert administrators, which has been possible since ESXi 5.0 …. per this blog post from 2012 after some Googling! 😅😂 […]


ESXi on ASUS NUC 14 Pro (Revel Canyon)

The ASUS NUC 14 Pro (formally known as Revel Canyon) is the first ASUS-based NUC since the acquisition of the NUC Division from Intel last fall. I know many of my readers have been requesting a review of the new ASUS NUCs, but to be honest, it has been pretty difficult to get samples directly […]


Useful NVMe Tiering reporting using vSphere 8.0 Update 3 APIs

After successfully enabling the NVMe Tiering feature, which was introduced in vSphere 8.0 Update 3, you can find some useful details about your NVMe Tiering configuration by navigating to a specific ESXi host and under Configure->Hardware and under the Memory section as shown in the screenshot below. There is quite a bit of information that […]

