What We Didn’t Tell You About Fine-Tuning SAM 2 at VMware Explore

If you’re attending VMware Explore 2025 in Las Vegas, don’t miss our joint lightning talk: “Homelab Meets SAM2: Fine-Tune Like a Pro” – a father-daughter session exploring how the Segment Anything Model 2 (SAM2) can be fine-tuned for real-world image and video segmentation tasks.

We’ll show how to take this powerful open-world segmentation model and adapt it for specific domains. Whether you’re a machine learning enthusiast, a homelab tinkerer, or someone looking to bring SAM2 into production, we’d love to see you there and share what we’ve learned.

This post is a detailed companion to our talk — perfect for those who want to dive deeper into the technical side of SAM2 fine-tuning, or for anyone simply curious about how to adapt SAM2 for their own image and video segmentation tasks.

📝 Note: Many of the insights shared here are based on the practical experience and research from my Master’s thesis, which focused on interactive segmentation of medical video data.

SAM 2: How It’s Built

Before we dive into training details, let’s take a quick look at how Segment Anything Model 2 (SAM 2) is actually structured.

SAM 2 builds on its predecessor, Segment Anything Model (SAM), which was released by Meta AI in April 2023. While SAM was a breakthrough in interactive open-world image segmentation, it was limited to single-image inputs and lacked native support for video or sequential data. SAM 2 extends this concept by introducing architectural changes that enable it to handle both images and videos, making it suitable for tasks that require temporal consistency, such as video object segmentation and tracking.

SAM 2 is an open-source foundation model for open-world segmentation, developed and released by Meta AI in July 2024.
It was specifically designed to support interactive segmentation for both images and videos, making it one of the first models to handle temporal and spatial segmentation in a unified framework.

As an interactive model, SAM 2 accepts a wide range of user prompts — including positive and negative points, bounding boxes, and even segmentation masks. These prompts guide the model to focus on specific objects or regions, enabling precise segmentation in complex and dynamic scenes.

The Architecture That Powers SAM 2

SAM 2 consists of several specialized modules that work together to enable precise and temporally consistent segmentation across both images and videos. The architecture overview from the official Segment Anything Model 2 release shows how these components interact — from user prompts to final mask predictions, across time.

  • Image Encoder: At its core, SAM 2 uses a transformer-based encoder to extract high-level visual features from each frame. Whether it’s a single image or a sequence of video frames, this encoder captures the spatial context and object structures at each point in time.
  • Prompt Encoder: This module translates user-provided inputs – such as positive/negative clicks, bounding boxes, or coarse masks – into prompt embeddings. These embeddings guide the model to focus on specific objects or regions.
  • Memory Mechanism: To enable tracking over time, SAM 2 includes a memory system composed of a memory encoder, memory bank, and attention module. This system stores relevant features from previous frames, helping the model maintain object identity even during motion, deformation, or partial occlusion.
  • Mask Decoder: The decoder fuses all information – visual features, prompt embeddings, and memory context – to produce the final segmentation mask. In video mode, it leverages stored memory to ensure that objects are segmented consistently across frames.

From General to Specific: Fine-Tuning SAM 2

While SAM 2 is trained on the massive and diverse SA-V dataset, making it a powerful generalist, real-world applications often require greater precision and consistency. When the model is applied to a domain that was not well-represented in the original training data — such as medical imaging or other highly specialized fields — its out-of-the-box performance may fall short.

That’s where fine-tuning comes in.

The Meta AI team provides official training and fine-tuning code for SAM 2, including detailed instructions in the training README. However, based on my experience, this codebase is quite complex and optimized for large-scale multi-GPU setups. Since I was working with a domain-specific dataset and didn’t require distributed training, I needed more granular control over the training loop and fine-tuning process. That’s why I chose to write my own simplified fine-tuning pipeline — tailored to single-GPU setups and easier to adapt to custom datasets. I’m sharing this approach here in the hope that it will be helpful for others facing similar challenges.

Fine-Tuning SAM 2 for Interactive Image Segmentation with Bounding Box Prompts

Before we begin fine-tuning, several preparation steps are required.
First, we need to load the pretrained SAM 2 model. After that, we selectively enable the components we wish to fine-tune (e.g., image encoder, prompt encoder, or mask decoder). Next, we choose an appropriate loss function and optimizer based on our task and dataset. Once all components are in place, we can implement the training loop, including the forward pass and backpropagation.

For a practical introduction to how SAM 2 can be used on static images, the official SAM2 GitHub repository provides ready-to-run Jupyter notebooks. A good starting point is the image_predictor_example.ipynb, which demonstrates how to load the model, apply prompts, and generate segmentation masks.

1. Loading the Pretrained SAM 2 Model

To use Segment Anything Model 2 (SAM 2), we first need to build the model architecture and load its pretrained weights. This is done using the build_sam2 function and the SAM2ImagePredictor wrapper:

from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

sam2_checkpoint = "../checkpoints/sam2.1_hiera_small.pt"
model_cfg = "configs/sam2.1/sam2.1_hiera_s.yaml"

sam2_model = build_sam2(model_cfg, sam2_checkpoint, device=device)
predictor = SAM2ImagePredictor(sam2_model)
  • model_cfg: Path to the YAML configuration file describing the architecture of the desired SAM 2 variant
  • sam2_checkpoint: Path to the corresponding pretrained model weights (.pt checkpoint).
  • build_sam2(...): Constructs the SAM 2 model and loads its weights.
  • SAM2ImagePredictor(...): Wraps the model in a high-level inference class that simplifies interaction.

2. Enabling Model Components for Fine-Tuning

When fine-tuning SAM 2 for image segmentation, the three essential components that can be trained are:

  • predictor.model.image_encoder
  • predictor.model.sam_prompt_encoder
  • predictor.model.sam_mask_decoder

Each of these modules can be selectively enabled for training based on your configuration and GPU capacity.

These components are implemented in PyTorch and can be selectively enabled for training. To fine-tune any of them, you need to set the module to training mode using .train(True) and ensure all parameters have requires_grad = True. For example:

predictor.model.sam_mask_decoder.train(True)
for p in predictor.model.sam_mask_decoder.parameters():
    p.requires_grad = True
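
If you prefer to drive this from a configuration, a minimal sketch could iterate over the three modules and toggle each one according to a flag. The train_flags dictionary below is my own illustrative construct, not part of the SAM 2 codebase:

# Hypothetical config flags — adjust to your own training setup and GPU memory
train_flags = {
    "image_encoder": False,        # large and memory-hungry; often kept frozen
    "sam_prompt_encoder": True,
    "sam_mask_decoder": True,
}

for name, trainable in train_flags.items():
    module = getattr(predictor.model, name)
    module.train(trainable)
    for p in module.parameters():
        p.requires_grad = trainable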

⚠️ Note: Be sure to comment out any @torch.no_grad() or @torch.inference_mode() decorators in the source code — particularly in the ./sam2 folder (sam2_image_predictor.py) of the repository. These decorators are designed to speed up inference by disabling gradient tracking and certain computation paths, but they will prevent training from working correctly.

3. Defining the Loss Function

To compute the difference between predicted masks and ground-truth masks, we need a suitable loss function. A convenient option is to use the segmentation_models_pytorch (SMP) library, which provides ready-to-use implementations of popular segmentation loss functions such as DiceLoss, FocalLoss, or combinations thereof.

In my own experiments, I used Dice Loss, as it is well-suited for segmentation tasks with imbalanced foreground and background pixels—a common scenario in medical imaging. However, depending on your dataset and goals, you may want to experiment with other loss functions like Binary Cross-Entropy, Focal Loss, or custom hybrids.
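
For reference, a DiceLoss instance from segmentation_models_pytorch can be set up as follows. Since the forward pass shown later already applies a sigmoid, from_logits is set to False here — adjust this if you feed raw logits instead:

import segmentation_models_pytorch as smp

# Binary segmentation: one foreground class vs. background
loss = smp.losses.DiceLoss(mode="binary", from_logits=False)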

4. Selecting the Optimizer

Once the model and loss function are defined, the next step is to select an optimizer that will update the model’s parameters during training. In most PyTorch-based workflows, popular choices include SGD, Adam, and AdamW.

For my experiments, I used AdamW, which is a variant of the Adam optimizer with decoupled weight decay. It often works well in computer vision tasks and provides better generalization, especially when fine-tuning large pretrained models like SAM 2.

In my fine-tuning setup, I assigned different learning rates and weight decay values to individual components of the SAM 2 model. Specifically, I used a lower learning rate (1e-6) for the image encoder, since it is a large and sensitive component pretrained on diverse data. For the prompt encoder and the mask decoder, I used a slightly higher learning rate (5e-6), which allowed them to adapt more quickly to the target task. For all components, I applied a weight decay of 1e-4 to encourage regularization and improve generalization. This configuration proved effective in practice and led to stable training across multiple datasets.
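
In code, this per-component configuration can be expressed with PyTorch parameter groups — a short sketch of the setup described above:

import torch

optimizer = torch.optim.AdamW(
    [
        {"params": predictor.model.image_encoder.parameters(),      "lr": 1e-6},
        {"params": predictor.model.sam_prompt_encoder.parameters(), "lr": 5e-6},
        {"params": predictor.model.sam_mask_decoder.parameters(),   "lr": 5e-6},
    ],
    weight_decay=1e-4,  # applied to all parameter groups
)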

5. Implementing the Training Loop

Once all components are prepared—model, optimizer, and loss function—we can implement the training loop:

preds = sam2_forward_pass(...)

preds = preds.to(dtype=torch.float32)
masks = masks.to(dtype=torch.float32)

loss_value = loss(preds, masks)

optimizer.zero_grad()
loss_value.backward()
optimizer.step()

During each iteration we first perform a forward pass using sam2_forward_pass(...) to obtain predicted masks. Both the predictions and ground-truth masks are then cast to float32, ensuring they are compatible for loss computation.

The loss is computed using the selected loss function (e.g., DiceLoss), and we then proceed with the standard PyTorch optimization routine: we reset gradients with optimizer.zero_grad(), perform backpropagation via loss_value.backward(), and update the model parameters using optimizer.step().

This training step is typically wrapped inside an epoch loop and may be extended with gradient accumulation, learning rate scheduling, or mixed-precision training as needed.
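
For context, a bare-bones epoch loop around this step might look like the following. The train_loader batches and the sam2_forward_pass signature are assumptions specific to your own data pipeline; predictor, loss, and optimizer come from the previous steps:

num_epochs = 10  # illustrative value

for epoch in range(num_epochs):
    epoch_loss = 0.0
    for images_list, bbox_coords, masks in train_loader:  # hypothetical DataLoader
        preds = sam2_forward_pass(predictor, images_list, bbox_coords)

        preds = preds.to(dtype=torch.float32)
        masks = masks.to(dtype=torch.float32).to(preds.device)

        loss_value = loss(preds, masks)

        optimizer.zero_grad()
        loss_value.backward()
        optimizer.step()

        epoch_loss += loss_value.item()
    print(f"epoch {epoch}: mean loss {epoch_loss / len(train_loader):.4f}")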

6. Writing the SAM 2 Forward Pass

Now let's dive into the forward pass implementation.


First, we pass the input image batch through the SAM 2 image encoder. This step extracts high-level visual features from each image, which will later be combined with prompt embeddings to guide the segmentation process.

predictor.set_image_batch(images_list)

The images_list is expected to be a list of NumPy arrays, where each image has shape [H, W, C] (Height, Width, Channels) and is typically of type float32.
The list itself emulates the batch dimension, allowing multiple images to be processed together.

Next, we prepare the prompt input — in this case, bounding boxes:

_, _, _, unnorm_box = predictor._prep_prompts(
                 point_coords=None, 
                 point_labels=None, 
                 box=bbox_coords,
                 mask_logits=None, 
                 normalize_coords=True
                 )

The variable bbox_coords contains the box coordinates in unnormalized format [x1, y1, x2, y2], where the values are in pixel units relative to the original input image size. These coordinates will be normalized internally by the model before being passed to the prompt encoder.
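
For example, a single box prompt for a batch of one image could look like this (the coordinates are purely illustrative):

import numpy as np

# One box per image: [x1, y1, x2, y2] in pixels of the original image
bbox_coords = np.array([[50, 30, 220, 180]], dtype=np.float32)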

The prepared box coordinates are passed into the prompt encoder to obtain both sparse and dense embeddings:

sparse_embeddings, dense_embeddings = predictor.model.sam_prompt_encoder(
                 points=None, 
                 boxes=unnorm_box, 
                 masks=None
                 )

We also extract high-resolution image features, which are used to improve the accuracy of mask decoding:

high_res_features = [feat_level[-1].unsqueeze(0) for feat_level in predictor._features["high_res_feats"]]

The mask decoder combines all inputs to predict segmentation masks:

low_res_masks, _, _, _ = predictor.model.sam_mask_decoder(
                image_embeddings=predictor._features["image_embed"],
                image_pe=predictor.model.sam_prompt_encoder.get_dense_pe(),
                sparse_prompt_embeddings=sparse_embeddings,
                dense_prompt_embeddings=dense_embeddings,
                multimask_output=True,
                repeat_image=False,
                high_res_features=high_res_features,
                )

Finally, the predicted masks are upsampled to the desired resolution (e.g. 512×512) and passed through a sigmoid activation to obtain per-pixel probabilities:

preds = predictor._transforms.postprocess_masks(low_res_masks, img_size) # output image size (e.g. 512)
preds = torch.sigmoid(preds[:, 0])

Since the model is configured with multimask_output=True, the mask decoder produces three candidate masks for each image. For fine-tuning, we only use the first output mask (index 0), which is usually the most relevant. This is done by slicing with preds[:, 0], reducing the shape from [B, N, H, W] to [B, H, W], where N is the number of masks per image. Applying torch.sigmoid then converts the raw logits into probability values.
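
Putting the pieces above together, a minimal sam2_forward_pass(...) could look like this. The function signature and the img_size argument are my own assumptions — adapt them to your training loop:

import torch

def sam2_forward_pass(predictor, images_list, bbox_coords, img_size=512):
    # Encode the image batch (fills predictor._features)
    predictor.set_image_batch(images_list)

    # Prepare the (unnormalized) box prompt
    _, _, _, unnorm_box = predictor._prep_prompts(
        point_coords=None, point_labels=None,
        box=bbox_coords, mask_logits=None, normalize_coords=True)

    # Encode the prompt into sparse and dense embeddings
    sparse_embeddings, dense_embeddings = predictor.model.sam_prompt_encoder(
        points=None, boxes=unnorm_box, masks=None)

    # High-resolution features for more accurate mask decoding
    high_res_features = [feat_level[-1].unsqueeze(0)
                         for feat_level in predictor._features["high_res_feats"]]

    # Decode the masks
    low_res_masks, _, _, _ = predictor.model.sam_mask_decoder(
        image_embeddings=predictor._features["image_embed"],
        image_pe=predictor.model.sam_prompt_encoder.get_dense_pe(),
        sparse_prompt_embeddings=sparse_embeddings,
        dense_prompt_embeddings=dense_embeddings,
        multimask_output=True,
        repeat_image=False,
        high_res_features=high_res_features)

    # Upsample to the target resolution and convert logits to probabilities
    preds = predictor._transforms.postprocess_masks(low_res_masks, img_size)
    return torch.sigmoid(preds[:, 0])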

Domain-Specific Video Segmentation with SAM 2 and Box Prompts

Alright, now that we’ve covered the essential components of fine-tuning SAM 2 for static images, let’s dive into video segmentation. Fortunately, many of the building blocks from the previous chapter — such as the loss function and optimizer — remain applicable. In this section, we’ll explore how to adapt the fine-tuning pipeline to work with video data using bounding box prompts.

For a practical introduction to how SAM 2 can be applied to video data, the official SAM 2 GitHub repository also includes a ready-to-run notebook for video segmentation. The video_predictor_example.ipynb demonstrates how to process a sequence of frames, apply prompts, and generate segmentation masks consistently across time. It’s a great starting point for understanding how SAM 2 handles video inputs and temporal context.

Similarly to the image-based setup, we need to go through several key steps: loading the same pretrained checkpoint used by the video predictor, enabling the desired model components for training, and defining a forward pass tailored for video inputs. Each of these steps builds upon the foundations we covered earlier, but with adjustments to handle sequential data and temporal dynamics.

⚠️ Note: In my own experiments, I fine-tuned SAM 2 for video segmentation using a custom approach tailored to my specific dataset. The recommendations in this section are based on what worked best for me in practice. I was working with sparsely annotated video sequences, where only a few frames had ground-truth masks available. To address this, I created training sequences of variable length — typically between 1 and 4 frames — and computed the loss only on the final frame of each sequence. This setup allowed the model to learn prompt propagation over time, while making the most out of limited annotations. In my experience, this approach led to more stable training and better generalization than using fixed-length sequences or computing loss on all frames.

1. Loading the Pretrained SAM 2 Checkpoint for Video

To begin fine-tuning SAM 2 for video segmentation, we first load the pretrained model in video mode using the build_sam2_video_predictor function. This function initializes the model with support for sequential frame processing and enables video-specific features such as temporal embeddings.

from sam2.build_sam import build_sam2_video_predictor

sam2_checkpoint = "../checkpoints/sam2.1_hiera_small.pt"
model_cfg = "configs/sam2.1/sam2.1_hiera_s.yaml"

predictor = build_sam2_video_predictor(model_cfg, sam2_checkpoint, device=device)

2. Trainable Modules in SAM 2 for Video Segmentation

When fine-tuning SAM 2 for video segmentation, the model includes several additional components specifically designed to handle temporal information. The following modules can be selectively enabled for training:

  • predictor.image_encoder – extracts visual features from each frame (same as in the image version),
  • predictor.sam_prompt_encoder – encodes prompts such as boxes or points,
  • predictor.sam_mask_decoder – generates segmentation masks from combined image features, prompt embeddings, and temporal memory,
  • predictor.memory_encoder – encodes past frame information to provide temporal context,
  • predictor.memory_attention – applies attention over temporal memory to support consistent object tracking,
  • predictor.obj_ptr_proj – projects memory features for object pointer modeling,
  • predictor.obj_ptr_tpos_proj – encodes temporal positional embeddings for object-level temporal reasoning.

These components work together to enable temporally consistent predictions across video frames. Since the model is based on PyTorch, fine-tuning any of these parts requires setting them to training mode using .train(True) and ensuring their parameters have requires_grad = True. This allows gradients to flow and weights to update during backpropagation:

predictor.memory_encoder.train(True)
for p in predictor.memory_encoder.parameters():
    p.requires_grad = True
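
To avoid repeating this for every module, you can loop over the components you want to train — a short sketch (which modules to include is up to your configuration and GPU memory):

modules_to_train = [
    predictor.sam_prompt_encoder,
    predictor.sam_mask_decoder,
    predictor.memory_encoder,
    predictor.memory_attention,
]

for module in modules_to_train:
    module.train(True)
    for p in module.parameters():
        p.requires_grad = True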

⚠️ Note: Be sure to comment out any @torch.no_grad() or @torch.inference_mode() decorators in the source code — particularly in the ./sam2 folder (sam2_video_predictor.py) of the repository. These decorators are designed to speed up inference by disabling gradient tracking and certain computation paths, but they will prevent training from working correctly.

3. Forward Pass for Video Segmentation with SAM 2

The video forward pass in SAM 2 requires more preparation than static image inference, as it handles multiple frames, memory management, and prompt propagation across time. The process is encapsulated in two functions: create_inference_state(...) and sam2_forward_pass(...).

Understanding and Customizing init_state()

Before implementing the forward pass, I highly recommend taking time to understand how the SAM2VideoPredictor works internally — especially the init_state() function. This method is responsible for creating the inference state, a central structure that stores everything SAM 2 needs to track and segment objects across video frames.

For training purposes, you’ll likely need to adapt this logic — especially to:

  • load frames and annotations from tensors or memory instead of a video path,
  • work with variable-length sequences, not full videos,
  • and adjust the input resolution (image_size) to match your training pipeline.

In my case, I implemented a simplified version of init_state() that accepts a preloaded batch of frames (from sequence_data) and integrates seamlessly with my training loop. This gave me full control over how sequences are formed, annotated, and processed during fine-tuning.
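
As an illustration, the frame-preprocessing part of such a custom create_inference_state(...) might look like the sketch below. It assumes the frames arrive as a [T, H, W, C] uint8 tensor from the training batch, and uses the same ImageNet statistics as SAM 2's default frame loader; the remaining bookkeeping fields can then mirror the original init_state() implementation:

import torch
import torch.nn.functional as F

def preprocess_frames(frames, image_size=1024,
                      img_mean=(0.485, 0.456, 0.406),
                      img_std=(0.229, 0.224, 0.225)):
    # frames: [T, H, W, C] uint8 tensor already loaded in memory (assumption)
    frames = frames.permute(0, 3, 1, 2).float() / 255.0           # -> [T, C, H, W] in [0, 1]
    frames = F.interpolate(frames, size=(image_size, image_size),
                           mode="bilinear", align_corners=False)  # resize to the model input size
    mean = torch.tensor(img_mean).view(1, 3, 1, 1)
    std = torch.tensor(img_std).view(1, 3, 1, 1)
    return (frames - mean) / std                                  # normalized frame tensor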

⚠️ Code Modification
In order to get fine-tuning working correctly, I also had to modify line 794 in SAM2VideoPredictor.
Originally, the code assumes that object_score_logits is always present:

object_score_logits = current_out["object_score_logits"]

However, during training with custom data or partial inputs, this key may be missing. To prevent errors and ensure the forward pass still works, I replaced it with a safe fallback:

object_score_logits = current_out.get("object_score_logits", torch.ones(1, 64, device=current_out["pred_masks"].device))

This creates a dummy tensor when the score logits are not available, allowing the training process to proceed without crashing.

Putting It All Together: The Forward Pass for Video

Before running the forward pass, we first initialize a fresh inference state using our custom create_inference_state(...) function — a reimplementation of SAM 2’s original init_state(), adapted for training. This prepares all necessary structures for tracking, memory, and prompt inputs.

inference_state = create_inference_state(...)
predictor.reset_state(inference_state)

After creating the state, we call predictor.reset_state(...) to ensure the model starts with a clean internal memory. This is important to avoid any residual data from previous sequences during training or evaluation.

With the inference state initialized and reset, we can now run the forward pass for the current video sequence. We start by injecting the initial prompt — in this case, a bounding box — into the first frame (ann_frame_idx = 0) using predictor.add_new_points_or_box(...). This establishes the object we want to segment and track throughout the video.

ann_frame_idx = 0
ann_obj_id = 1
    
_, out_obj_ids, out_mask_logits = predictor.add_new_points_or_box(
            inference_state=inference_state,
            frame_idx=ann_frame_idx,
            obj_id=ann_obj_id,
            box=sequence_data["prompt_bbox"]
)
    
video_segments = {}  # Video_segments contains the per-frame segmentation results

If the sequence contains multiple frames, the model then propagates the object across the video using memory attention via predictor.propagate_in_video(...). This produces a set of predicted masks for each frame.

Finally, all masks are stacked and passed through a sigmoid to obtain per-pixel probabilities. The output preds has the shape [T, C, H, W], where T is the number of frames.

if inference_state["num_frames"] > 1:
    for out_frame_idx, out_obj_ids, out_mask_logits in predictor.propagate_in_video(inference_state):
        video_segments[out_frame_idx] = {
            out_obj_id: out_mask_logits[i]
            for i, out_obj_id in enumerate(out_obj_ids)
        }
        
    # Return video segments in shape: [T,C,H,W]
    preds = torch.stack([torch.sigmoid(segment[ann_obj_id]) for segment in video_segments.values()]).to(device)
else:
    # Return video segments in shape: [T,C,H,W]
    preds = torch.sigmoid(out_mask_logits).to(device)
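
With preds in shape [T, C, H, W], the sparse-annotation strategy described earlier — computing the loss only on the final frame of each sequence — boils down to slicing the last frame before the loss. Here, masks is assumed to hold the ground-truth mask for that last frame:

# Supervise only the last frame of the sequence (sparse annotations)
loss_value = loss(preds[-1:].to(dtype=torch.float32),
                  masks.to(dtype=torch.float32))

optimizer.zero_grad()
loss_value.backward()
optimizer.step()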

Final Notes

While implementing the forward pass, you may have noticed a potential limitation — all frames in the sequence are loaded into memory at once. This is not ideal for longer videos or real-time applications. Fortunately, there is an alternative implementation of SAM 2 designed for streaming input:
segment-anything-2-real-time

This real-time version of SAM 2 is optimized for sequential frame-by-frame processing and significantly reduces memory usage. The fine-tuned model you trained following this guide can be integrated into that pipeline, making it suitable for deployment in low-latency or resource-constrained environments.

I hope this walkthrough helped clarify how SAM2 can be fine-tuned for both image and video segmentation — and that it gave you a clearer path for applying it to your own project or research. If you have any questions, run into unexpected issues, or just want to share what you’re working on, feel free to reach out. I’ll be happy to help or discuss further. If you’re interested in a deeper dive into the methods and experiments behind this work — particularly in the context of medical video segmentation — feel free to check out my Master’s thesis, where many of these insights originated.

Good luck, and happy segmenting!

My TOP 10 Sessions You Can’t Miss at VMware Explore 2025

🚀 VMware Explore 2025 is here! If you haven’t checked the Content Catalog yet, grab a coffee, block your calendar, and don’t miss these carefully picked 10 sessions.


1️⃣ Deploying Minimal VMware Cloud Foundation 9.0 Lab [CLOB1201LV]

[CLOB1201LV] – William Lam
If you ever dreamt of getting hands-on with VMware Cloud Foundation 9.0 but lack enterprise-grade hardware—this session is your ticket. Learn how to run a fully functional VCF lab with minimal resources, plus hardware tips and tricks to stretch every CPU cycle and gigabyte of RAM.


2️⃣ Real-World Lessons in Rightsizing VMware Cloud Foundation for On-Premises AI Workloads [INVB1300LV]

[INVB1300LV] – Frank Denneman & Johan van Amersfoort
Designing infrastructure for AI is no joke. Frank and Johan share hard-earned insights on balancing model size, inference speed, GPU allocation, and backup pipelines. Expect practical strategies and an AI sizing tool demo—perfect for anyone serious about AI on-premises.


3️⃣ Unleashing AI Inference at Scale: NVIDIA Dynamo on VMware Private AI Foundation with NVIDIA [INVB1070LV]

[INVB1070LV] – Frank Denneman & Shawn Kelly
Learn how NVIDIA Dynamo supercharges LLM inference—up to 30× more tokens per GPU! If you’re integrating large AI models into your VMware Private AI Foundation stack, this session will open your eyes to what’s possible.


4️⃣ VMware Cloud Foundation and VMware vSphere Kubernetes Service: A Holistic Approach to Encryption, Keys, and Secrets – Part 1 [CMTYQT1284LV]

[CMTYQT1284LV] – Waldemar Pera
In the cloud era, your secrets are gold. Waldemar shows how VCF and vSphere Kubernetes Service work together to lock down your data, manage encryption keys, and protect Kubernetes secrets holistically.


5️⃣ VMware Cloud Foundation and VMware vSphere Kubernetes Service: A Holistic Approach to Encryption, Keys, and Secrets – Part 2 [CMTYQT1285LV]

[CMTYQT1285LV] – Waldemar Pera
Continue the journey from Part 1—this session gets hands-on with Thales CipherTrust Data Security Platform and shows how to apply airtight data governance across your hybrid cloud.


6️⃣ Six Innovations Redefining Storage and Disaster Recovery for VMware Cloud Foundation [CLOB1028LV]

[CLOB1028LV] – Duncan Epping
Ready for a storage revolution? Duncan walks you through the future of multitenant storage, S3 integration, and automated ransomware recovery—straight from the R&D lab. Storage and DR pros, mark this one with five stars!


7️⃣ Three Times the Performance, Half the Latency: VMware vSAN Express Storage Architecture Deep Dive for VMware Geeks [CLOB1067LV]

[CLOB1067LV] – Duncan Epping & John Nicholson
This one is for true storage geeks. Dive into how the new vSAN Express Storage Architecture squeezes out max NVMe performance, inline dedupe, and insane latency improvements. You’ll want to rewrite your vSAN notes after this.


8️⃣ Accelerating AI Workloads: Mastering vGPU Management in VMware Environments [INVB1158LV]

[INVB1158LV] – Shawn Kelly & Justin Murray
GPUs are gold in AI—learn how to virtualize and manage them right. From Multi-Instance GPUs to vMotion gotchas, this session packs benchmarks, demos, and best practices so you get the most out of your expensive silicon.


9️⃣ Security and Resilience for Sovereign AI [CLOB1262LV]

[CLOB1262LV] – Bob Plankers & Justin Murray
This session blends tech, process, and people factors for securing AI workloads. If your org is building sovereign or regulated AI solutions, this talk will sharpen your strategy.


🔟 What Is New in VMware Cloud Foundation 9.0 Security and Compliance [CLOB1261LV]

[CLOB1261LV] – Bob Plankers
Stay ahead of security and compliance changes in VCF 9.0—features, deprecations, and future directions. Bob makes security updates practical and clear, so you stay audit-ready.


+1: SPECIAL INVITE Homelab Meets SAM2: Fine-Tune Like a Pro [CMTYQT1227LV]

[CMTYQT1227LV] – Me & Eva Micankova
And here’s my shameless plug: join us to see how we built an AI-ready homelab for fine-tuning cutting-edge segmentation models like SAM2. From setup tips to a real use case on polyp segmentation in colonoscopy videos—this is AI tinkering at its finest, done at home!


🔗 Don’t Miss Out – See You There!

I hope this shortlist sparks your excitement and helps you plan your VMware Explore 2025 days wisely. Got your own must-watch sessions? See you in Vegas!

AI Without GPUs – Harnessing CPU Power for AI Workloads

At VMware Explore EU 2024, the session “AI Without GPUs: Using Your Existing CPU Resources to Run AI Workloads” showcased innovative approaches to AI and machine learning using CPUs. Presented by Earl Ruby from Broadcom and Keith Bradley from Nature Fresh Farms, this session emphasized the potential of leveraging Intel Xeon CPUs with Advanced Matrix Extensions (AMX) to run AI workloads efficiently without the need for GPUs.

Key Highlights:

  1. Introduction to AMX:
    • AMX (Advanced Matrix Extensions), available in Intel’s Sapphire Rapids processors, enables high-performance matrix computations directly on CPUs, making them more capable of handling AI/ML tasks traditionally reserved for GPUs.
  2. Why Use CPUs for AI?:
    • Cost Efficiency: Lower operating costs compared to GPUs.
    • Energy Efficiency: Ideal for environments where power consumption is a concern.
    • Sufficient Performance for Specific Use Cases: CPUs can efficiently handle tasks like inferencing and batch-processing ML workloads with models under 15-20 billion parameters.
  3. Software Stack:
    • OpenVINO Toolkit: Optimizes AI/ML workloads on CPUs by compressing neural networks, improving inference performance with minimal accuracy loss.
    • Intel oneAPI: Provides a unified software environment for both CPU and GPU workloads.
  4. Real-World Application:
    • Nature Fresh Farms: Demonstrated how AI-driven automation using CPUs effectively manages complex agricultural processes, including plant lifecycle control in greenhouses.

When to Choose CPUs Over GPUs:

  • Inferencing and Batch Processing: When real-time responses are not critical.
  • Sustainability Goals: Lower power consumption makes CPUs a viable option.
  • Cost-Conscious Environments: For scenarios where reducing operational costs is a priority.

Unlocking Your Data with VMware by Broadcom and NVIDIA — RAG Deep Dive

At VMware Explore EU 2024, the session “Unlocking Your Data with VMware by Broadcom and NVIDIA — RAG Deep Dive” delivered fascinating insights into the power of Retrieval Augmented Generation (RAG). Led by Frank Denneman and Shawn Kelly, this session explored how combining large language models (LLMs) with proprietary organizational data can revolutionize data utilization in enterprises.

What is RAG?

RAG combines the strengths of LLMs with a Vector Database to enhance AI applications by integrating them with an organization’s proprietary data. This synergy allows for more precise and context-aware responses, crucial for business-critical operations.

Why RAG Matters:

  • Enhanced Accuracy: Unlike traditional LLMs prone to “hallucinations” or inaccuracies, RAG provides validated, up-to-date answers by sourcing information directly from relevant databases.
  • Contextual Relevance: It seamlessly blends general knowledge from LLMs with specific proprietary data, delivering highly relevant insights.
  • Traceability and Transparency: RAG solutions can cite the documents used to generate responses, addressing one of the significant limitations of traditional LLMs.

How RAG Works:

  1. Data Indexing: Proprietary data is pre-processed and stored in a vector database.
  2. Question Processing: When a query is made, it is semantically embedded and matched against the vector database.
  3. Answer Generation: The most relevant data is retrieved and used to generate a precise answer.
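
To make these three steps concrete, here is a deliberately minimal, framework-agnostic sketch in Python. The embedding model, the in-memory "vector database", and the commented-out generate() call are all stand-ins for whatever your actual stack (for example NVIDIA NIM and a managed vector database) provides:

import numpy as np
from sentence_transformers import SentenceTransformer  # any embedding model works here

embedder = SentenceTransformer("all-MiniLM-L6-v2")

# 1. Data indexing: embed proprietary documents into an in-memory "vector database"
documents = ["Internal runbook: how to rotate vSAN encryption keys ...",
             "HR policy: remote work guidelines ..."]
doc_vectors = embedder.encode(documents, normalize_embeddings=True)

# 2. Question processing: embed the query and retrieve the most similar document
query = "How do I rotate encryption keys?"
query_vector = embedder.encode([query], normalize_embeddings=True)[0]
scores = doc_vectors @ query_vector            # cosine similarity (vectors are normalized)
best_doc = documents[int(np.argmax(scores))]

# 3. Answer generation: hand the retrieved context plus the question to an LLM
prompt = f"Context:\n{best_doc}\n\nQuestion: {query}\nAnswer:"
# answer = llm.generate(prompt)                # stand-in for your LLM inference endpoint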

Integration with NVIDIA:

NVIDIA’s Inference Microservice (NIM) accelerates this process by optimizing LLMs for rapid inference, leveraging GPU-accelerated infrastructure to enhance throughput and reduce latency.

Demystifying DPUs and GPUs in VMware Cloud Foundation

At VMware Explore EU 2024, the session “Demystifying DPUs and GPUs in VMware Cloud Foundation” provided deep insights into how these advanced technologies are transforming modern data centers. Presented by Dave Morera and Peter Flecha, the session highlighted the integration and benefits of Data Processing Units (DPUs) and Graphics Processing Units (GPUs) in VMware Cloud Foundation (VCF).

Key Highlights:

  1. Understanding DPUs:
    • Offloading and Acceleration: DPUs enhance performance by offloading network and communication tasks from the CPU, allowing more efficient resource usage and better performance for data-heavy operations.
    • Enhanced Security: By isolating security tasks, DPUs contribute to a stronger zero-trust security model, essential for protecting modern cloud environments.
    • Dual DPU Support: This feature offers high availability and increased network offload capacity, simplifying infrastructure management and boosting resilience.
  2. Leveraging GPUs:
    • Accelerated AI and ML Workloads: GPUs in VMware environments significantly speed up data-intensive tasks like AI model training and inference.
    • Optimized Resource Utilization: VMware’s vSphere enables efficient GPU resource sharing through virtual GPU (vGPU) profiles, accommodating various workloads, including graphics, compute, and machine learning.
  3. Distributed Services Engine:
    • This engine simplifies infrastructure management and enhances performance by integrating DPU-based services, creating a more secure and efficient data center architecture.

LLM Inference Sizing and Performance Guidance

When planning to deploy a chatbot or simple Retrieval-Augmented Generation (RAG) pipeline on VMware Private AI Foundation with NVIDIA (PAIF-N) [1], you may have questions about sizing (capacity) and performance based on your existing GPU resources or potential future GPU acquisitions. For […]


Broadcom Social Media Advocacy

Mastering Llama-2 Setup: A Comprehensive Guide to Installing and Running llama.cpp Locally

Welcome to the exciting world of Llama-2 models! In today’s blog post, we’re diving into the process of installing and running these advanced models. Whether you’re a seasoned AI enthusiast or just starting out, understanding Llama-2 models is crucial. These models, known for their efficiency and versatility in handling large-scale data, are a game-changer in the field of machine learning.

In this Shortcut, I give you a step-by-step process to install and run Llama-2 models on your local machine with or without GPUs by using llama.cpp. As I mention in Run Llama-2 Models, this is one of the preferred options.

Here are the steps:

Step 1. Clone the repositories

You should clone the Meta Llama-2 repository as well as llama.cpp:

$ git clone https://github.com/facebookresearch/llama.git
$ git clone https://github.com/ggerganov/llama.cpp.git 

Step 2. Request access to download Llama models

Complete the request form, then navigate to the directory where you downloaded the GitHub code provided by Meta. Run the download script, enter the key you received, and adhere to the given instructions:

$ cd llama
$ ./download.sh

Step 3. Create a virtual environment

To prevent any compatibility issues, it’s advisable to establish a separate Python environment using Conda. In case Conda isn’t already installed on your system, you can refer to this installation guide. Once Conda is set up, you can create a new environment by entering the command below:

#https://docs.conda.io/projects/miniconda/en/latest/

#These four commands quickly and quietly install the latest 64-bit version #of the installer and then clean up after themselves. To install a different #version or architecture of Miniconda for Linux, change the name of the .sh #installer in the wget command.

mkdir -p ~/miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
rm -rf ~/miniconda3/miniconda.sh

#After installing, initialize your newly-installed Miniconda. The following #commands initialize for bash and zsh shells:

~/miniconda3/bin/conda init bash
~/miniconda3/bin/conda init zsh
$ conda create -n llamacpp python=3.11

Step 4. Activate the environment

Use this command to activate the environment:

$ conda activate llamacpp

Step 5. Go to llama.cpp folder and Install the required dependencies

Type the following commands to continue the installation:

$ cd ..
$ cd llama.cpp
$ python3 -m pip install -r requirements.txt

Step 6. Compile the source code

Option 1:

Enter this command to compile with only CPU support:

$ make 

Option 2:

To compile with CPU and GPU support, you need to have the official CUDA libraries from Nvidia installed.

#https://developer.nvidia.com/cuda-12-3-0-download-archive?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=22.04&target_type=deb_local

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-ubuntu2204.pin

sudo mv cuda-ubuntu2204.pin /etc/apt/preferences.d/cuda-repository-pin-600

wget https://developer.download.nvidia.com/compute/cuda/12.3.0/local_installers/cuda-repo-ubuntu2204-12-3-local_12.3.0-545.23.06-1_amd64.deb

sudo dpkg -i cuda-repo-ubuntu2204-12-3-local_12.3.0-545.23.06-1_amd64.deb

sudo cp /var/cuda-repo-ubuntu2204-12-3-local/cuda-*-keyring.gpg /usr/share/keyrings/

sudo apt-get update

sudo apt-get -y install cuda-toolkit-12-3

sudo apt-get install -y cuda-drivers

Double check that you have Nvidia installed by running:

$ nvcc --version
$ nvidia-smi

Export the CUDA_HOME environment variable:

$ export CUDA_HOME=/usr/lib/cuda

Then, you can launch the compilation by typing:

$ make clean && LLAMA_CUBLAS=1 CUDA_DOCKER_ARCH=all make -j

Step 7. Perform a model conversion

Run the following command to convert the original models to f16 format (the examples below use the 7b-chat, 13b-chat, and 70b-chat models):

$ mkdir models/7B
$ python3 convert.py --outfile models/7B/ggml-model-f16.gguf --outtype f16 ../llama/llama-2-7b-chat --vocab-dir ../llama/llama

$ mkdir models/13B
$ python3 convert.py --outfile models/13B/ggml-model-f16.gguf --outtype f16 ../llama/llama-2-13b-chat --vocab-dir ../llama/llama

$ mkdir models/70B
$ python3 convert.py --outfile models/70B/ggml-model-f16.gguf --outtype f16 ../llama/llama-2-70b-chat --vocab-dir ../llama/llama

If the conversion runs successfully, you should have the converted model stored in models/* folders. You can double check this with the ls command:

$ ls models/7B/ 
$ ls models/13B/ 
$ ls models/70B/ 

Step 8. Quantize

Now, you can quantize the model to 4-bits, for example, by using the following command (please note the q4_0 parameter at the end):

$ ./quantize models/7B/ggml-model-f16.gguf models/7B/ggml-model-q4_0.gguf q4_0 

$ ./quantize models/13B/ggml-model-f16.gguf models/13B/ggml-model-q4_0.gguf q4_0 

$ ./quantize models/70B/ggml-model-f16.gguf models/70B/ggml-model-q4_0.gguf q4_0 
$ ./quantize models/70B/ggml-model-f16.gguf models/70B/ggml-model-q2_0.gguf Q2_K
$ ./quantize -h
usage: ./quantize [--help] [--allow-requantize] [--leave-output-tensor] [--pure] model-f32.gguf [model-quant.gguf] type [nthreads]

  --allow-requantize: Allows requantizing tensors that have already been quantized. Warning: This can severely reduce quality compared to quantizing from 16bit or 32bit
  --leave-output-tensor: Will leave output.weight un(re)quantized. Increases model size but may also increase quality, especially when requantizing
  --pure: Disable k-quant mixtures and quantize all tensors to the same type

Allowed quantization types:
   2  or  Q4_0   :  3.56G, +0.2166 ppl @ LLaMA-v1-7B
   3  or  Q4_1   :  3.90G, +0.1585 ppl @ LLaMA-v1-7B
   8  or  Q5_0   :  4.33G, +0.0683 ppl @ LLaMA-v1-7B
   9  or  Q5_1   :  4.70G, +0.0349 ppl @ LLaMA-v1-7B
  10  or  Q2_K   :  2.63G, +0.6717 ppl @ LLaMA-v1-7B
  12  or  Q3_K   : alias for Q3_K_M
  11  or  Q3_K_S :  2.75G, +0.5551 ppl @ LLaMA-v1-7B
  12  or  Q3_K_M :  3.07G, +0.2496 ppl @ LLaMA-v1-7B
  13  or  Q3_K_L :  3.35G, +0.1764 ppl @ LLaMA-v1-7B
  15  or  Q4_K   : alias for Q4_K_M
  14  or  Q4_K_S :  3.59G, +0.0992 ppl @ LLaMA-v1-7B
  15  or  Q4_K_M :  3.80G, +0.0532 ppl @ LLaMA-v1-7B
  17  or  Q5_K   : alias for Q5_K_M
  16  or  Q5_K_S :  4.33G, +0.0400 ppl @ LLaMA-v1-7B
  17  or  Q5_K_M :  4.45G, +0.0122 ppl @ LLaMA-v1-7B
  18  or  Q6_K   :  5.15G, -0.0008 ppl @ LLaMA-v1-7B
   7  or  Q8_0   :  6.70G, +0.0004 ppl @ LLaMA-v1-7B
   1  or  F16    : 13.00G              @ 7B
   0  or  F32    : 26.00G              @ 7B
          COPY   : only copy tensors, no quantizing

Step 9. Run one of the prompts

Option 1:

You can execute one of the example prompts using only CPU computation by typing the following command:

$ ./main -m ./models/7B/ggml-model-q4_0.gguf -n 1024 --repeat_penalty 1.0 --color -i -r "User:" -f ./prompts/chat-with-bob.txt

This example will initiate the chat in interactive mode in the console, starting with the chat-with-bob.txt prompt example.

Option 2:

If you compiled llama.cpp with GPU support in Step 6, then you can use one of the following commands:

$ ./main -m ./models/7B/ggml-model-q4_0.gguf -n 1024 --repeat_penalty 1.0 --color -i -r "User:" -f ./prompts/chat-with-bob.txt --n-gpu-layers 3800

$ ./main -m ./models/13B/ggml-model-q4_0.gguf -n 1024 --repeat_penalty 1.0 --color -i -r "User:" -f ./prompts/chat-with-bob.txt --n-gpu-layers 3800

$ ./main -m ./models/70B/ggml-model-q4_0.gguf -n 1024 --repeat_penalty 1.0 --color -i -r "User:" -f ./prompts/chat-with-bob.txt --n-gpu-layers 1000

If you have GPU enabled, you need to set the number of layers to offload to the GPU based on your vRAM capacity. You can increase the number gradually and check with the nvtop command until you find a good spot.

You can see the full list of parameters of the main command by typing the following:

$ ./main --help

Private AI in HomeLAB: Affordable GPU Solution with NVIDIA Tesla P40

For Private AI in HomeLAB, I was searching for budget-friendly GPUs with a minimum of 24GB RAM. Recently, I came across the refurbished NVIDIA Tesla P40 on eBay, which boasts some intriguing specifications:

  • GPU Chip: GP102
  • Cores: 3840
  • TMUs: 240
  • ROPs: 96
  • Memory Size: 24 GB
  • Memory Type: GDDR5
  • Bus Width: 384 bit

Since the NVIDIA Tesla P40 comes in a full-profile form factor, we needed to acquire a PCIe riser card.

A PCIe riser card, commonly known as a “riser card,” is a hardware component essential in computer systems for facilitating the connection of expansion cards to the motherboard. Its primary role comes into play when space limitations or specific orientation requirements prevent the direct installation of expansion cards into the PCIe slots on the motherboard.

Furthermore, I needed to ensure adequate cooling, but this posed no issue. I utilized a 3D model created by MiHu_Works for a Tesla P100 blower fan adapter, which you can find at this link: Tesla P100 Blower Fan Adapter.

As for the fan, the Titan TFD-B7530M12C served the purpose effectively. You can find it on Amazon: Titan TFD-B7530M12C.

Currently, I am using a single VM with PCIe pass-through. However, it was necessary to implement specific Advanced VM parameters:

  • pciPassthru.use64bitMMIO = true
  • pciPassthru.64bitMMIOSizeGB = 64

Now, you might wonder about the performance. It’s outstanding, delivering speeds 16x-26x faster than the CPU. To give you an idea, I ran a llama-bench test:

| pp 512                 | CPU t/s | GPU t/s | Acceleration |
| ---------------------- | ------: | ------: | -----------: |
| llama 7B mostly Q4_0   |    9.50 |  155.37 |          16x |
| llama 13B mostly Q4_0  |    5.18 |  134.74 |          26x |
./llama-bench -t 8
| model                          |       size |     params | backend    |    threads | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ---------: | ---------- | ---------------: |
| llama 7B mostly Q4_0           |   3.56 GiB |     6.74 B | CPU        |          8 | pp 512     |      9.50 ± 0.07 |
| llama 7B mostly Q4_0           |   3.56 GiB |     6.74 B | CPU        |          8 | tg 128     |      8.74 ± 0.12 |

./llama-bench -ngl 3800
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: Tesla P40, compute capability 6.1
| model                          |       size |     params | backend    |  ngl | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ---: | ---------- | ---------------: |
| llama 7B mostly Q4_0           |   3.56 GiB |     6.74 B | CUDA       | 3800 | pp 512     |    155.37 ± 1.26 |
| llama 7B mostly Q4_0           |   3.56 GiB |     6.74 B | CUDA       | 3800 | tg 128     |      9.31 ± 0.19 |

./llama-bench -t 8 -m ./models/13B/ggml-model-q4_0.gguf
| model                          |       size |     params | backend    |    threads | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ---------: | ---------- | ---------------: |
| llama 13B mostly Q4_0          |   6.86 GiB |    13.02 B | CPU        |          8 | pp 512     |      5.18 ± 0.00 |
| llama 13B mostly Q4_0          |   6.86 GiB |    13.02 B | CPU        |          8 | tg 128     |      4.63 ± 0.14 |

./llama-bench -ngl 3800 -m ./models/13B/ggml-model-q4_0.gguf
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: Tesla P40, compute capability 6.1
| model                          |       size |     params | backend    |  ngl | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ---: | ---------- | ---------------: |
| llama 13B mostly Q4_0          |   6.86 GiB |    13.02 B | CUDA       | 3800 | pp 512     |    134.74 ± 1.29 |
| llama 13B mostly Q4_0          |   6.86 GiB |    13.02 B | CUDA       | 3800 | tg 128     |      8.42 ± 0.10 |

Feel free to explore this setup for your Private AI in HomeLAB.

DELL R610 or R710: How to Convert an H200A to H200I for Dedicated Slot Use

For my project involving the AI tool llama.cpp, I needed to free up a PCI slot for an NVIDIA Tesla P40 GPU. I found an excellent guide and a useful video from ArtOfServer.

Based on this helpful video from ArtOfServer:

ArtOfServer wrote a small tutorial on how to modify an H200A (external) into an H200I (internal) so it can be used in the dedicated storage slot (e.g. instead of a Perc6i).

Install compiler and build tools (those can be removed later)

# apt install build-essential unzip

Compile and install lsirec and lsitool

# mkdir lsi
# cd lsi
# wget https://github.com/marcan/lsirec/archive/master.zip
# wget https://github.com/exactassembly/meta-xa-stm/raw/master/recipes-support/lsiutil/files/lsiutil-1.72.tar.gz
# tar -zxvvf lsiutil-1.72.tar.gz
# unzip master.zip
# cd lsirec-master
# make
# chmod +x sbrtool.py
# cp -p lsirec /usr/bin/
# cp -p sbrtool.py /usr/bin/
# cd ../lsiutil
# make -f Makefile_Linux

Modify SBR to match an internal H200I

Get bus address:

# lspci -Dmmnn | grep LSI
0000:05:00.0 "Serial Attached SCSI controller [0107]" "LSI Logic / Symbios Logic [1000]" "SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon] [0072]" -r03 "Dell [1028]" "6Gbps SAS HBA Adapter [1f1c]"

Bus address 0000:05:00.0

We are going to change id 0x1f1c to 0x1f1e

Unbind and halt card:

# lsirec 0000:05:00.0 unbind
Trying unlock in MPT mode...
Device in MPT mode
Kernel driver unbound from device
# lsirec 0000:05:00.0 halt
Device in MPT mode
Resetting adapter in HCB mode...
Trying unlock in MPT mode...
Device in MPT mode
IOC is RESET

Read sbr:

# lsirec 0000:05:00.0 readsbr h200.sbr
Device in MPT mode
Using I2C address 0x54
Using EEPROM type 1
Reading SBR...
SBR saved to h200.sbr

Transform binary sbr to text file:

# sbrtool.py parse h200.sbr h200.cfg

Modify the PID in line 9 (e.g. using vi or vim):
from this:
SubsysPID = 0x1f1c
to this:
SubsysPID = 0x1f1e

Important: if in the cfg file you find a line with:
SASAddr = 0xfffffffffffff
remove it!
Save and close file.

Build new sbr file:

# sbrtool.py build h200.cfg h200-int.sbr

Write it back to card:

# lsirec 0000:05:00.0 writesbr h200-int.sbr
Device in MPT mode
Using I2C address 0x54
Using EEPROM type 1
Writing SBR...
SBR written from h200-int.sbr

Reset the card and rescan the bus:

# lsirec 0000:05:00.0 reset
Device in MPT mode
Resetting adapter...
IOC is RESET
IOC is READY
# lsirec 0000:05:00.0 info
Trying unlock in MPT mode...
Device in MPT mode
Registers:
DOORBELL: 0x10000000
DIAG: 0x000000b0
DCR_I2C_SELECT: 0x80030a0c
DCR_SBR_SELECT: 0x2100001b
CHIP_I2C_PINS: 0x00000003
IOC is READY
# lsirec 0000:05:00.0 rescan
Device in MPT mode
Removing PCI device...
Rescanning PCI bus...
PCI bus rescan complete.

Verify new id (H200I):

# lspci -Dmmnn | grep LSI
0000:05:00.0 "Serial Attached SCSI controller [0107]" "LSI Logic / Symbios Logic [1000]" "SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon] [0072]" -r03 "Dell [1028]" "PERC H200 Integrated [1f1e]"

You can now move the card to the dedicated slot 🙂

Thanks to ArtOfServer for a great video.

CUDA Support for WSL 2

For more efficient testing of LLAMA 2, I recommend taking advantage of GPU acceleration in WSL 2, which is available on notebooks. This approach significantly increases performance and efficiency. In my latest blog post, you will find a detailed guide on how to quickly and easily set up GPU acceleration in WSL 2 on your notebook.

  • First, install the latest NVIDIA Windows GPU driver, which fully supports WSL 2. With CUDA support in the driver, existing applications (compiled elsewhere on a Linux system for the same target GPU) can run unmodified within the WSL environment.
  • Once a Windows NVIDIA GPU driver is installed on the system, CUDA becomes available within WSL 2. The CUDA driver installed on Windows host will be stubbed inside the WSL 2 as libcuda.so, therefore users must not install any NVIDIA GPU Linux driver within WSL 2.
  • Installation of the Linux x86 CUDA Toolkit using the WSL-Ubuntu package:
wget https://developer.download.nvidia.com/compute/cuda/repos/wsl-ubuntu/x86_64/cuda-wsl-ubuntu.pin
sudo mv cuda-wsl-ubuntu.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/12.3.1/local_installers/cuda-repo-wsl-ubuntu-12-3-local_12.3.1-1_amd64.deb
sudo dpkg -i cuda-repo-wsl-ubuntu-12-3-local_12.3.1-1_amd64.deb
sudo cp /var/cuda-repo-wsl-ubuntu-12-3-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-3

Links