Highlights from VMware Explore 2024 Barcelona

General Session Insights:

  • Private Cloud as the Future of Innovation: Broadcom’s President and CEO, Hock Tan, emphasized that private cloud has become the foundation for enterprise innovation.
  • Balancing AI with Compliance: Chris Wolf highlighted the importance of aligning AI advancements with organizational privacy and compliance needs.
  • Community Acknowledgments: Joe Baguley celebrated the vibrant VMware community, from 1,600+ vExperts to 150,000 VMUG members, for their global contributions.
  • Agile IT for Business Success: Paul Turner emphasized the need for IT agility to rapidly deliver applications and services, crucial for achieving business goals.
  • VeloRAIN Architecture: Sanjay Uppal introduced the VeloRAIN (Robust AI Networking) architecture, leveraging AI/ML to enhance distributed AI workloads’ performance and security.

For more details, check the general session recap.

Notable Sessions and Presentations:

  • Distributed Security Simplified: Dive into the intricacies of distributed security in VMware Cloud Foundation.
  • VMware vSAN ESA: Explore how VMware vSAN ESA serves as a robust storage platform for VMware Cloud Foundation.
  • Demystifying DPUs and GPUs: Understand the role of DPUs and GPUs in VMware Cloud Foundation for advancing AI and data workloads.
  • AI Without GPUs: Discover innovative ways to harness CPU power for AI workloads in GPU-limited environments.
  • Data Unlocking with VMware and NVIDIA: Broadcom and NVIDIA offer deep insights into unlocking data potential through AI-powered solutions.
  • VMware Fusion and Workstation: Exciting news: VMware Fusion and Workstation are now available for free.

For a complete list of VMware Explore 2024 Barcelona presentations, visit this link.

An inspiring evening of networking and meaningful conversations at the Community Leadership Reception at Explore in Barcelona.
The VMware Community truly is the heart of Explore.
It was an honor to meet CEO Hock Tan and engage with incredible leaders like Corey Romero, the #vExpert leader, and Josef Zach, our #VMUG Czech leader.

AI Without GPUs – Harnessing CPU Power for AI Workloads

At VMware Explore EU 2024, the session “AI Without GPUs: Using Your Existing CPU Resources to Run AI Workloads” showcased innovative approaches to AI and machine learning using CPUs. Presented by Earl Ruby from Broadcom and Keith Bradley from Nature Fresh Farms, this session emphasized the potential of leveraging Intel Xeon CPUs with Advanced Matrix Extensions (AMX) to run AI workloads efficiently without the need for GPUs.

Key Highlights:

  1. Introduction to AMX:
    • AMX (Advanced Matrix Extensions), available in Intel’s Sapphire Rapids processors, enables high-performance matrix computations directly on CPUs, making them more capable of handling AI/ML tasks traditionally reserved for GPUs.
  2. Why Use CPUs for AI?:
    • Cost Efficiency: Lower operating costs compared to GPUs.
    • Energy Efficiency: Ideal for environments where power consumption is a concern.
    • Sufficient Performance for Specific Use Cases: CPUs can efficiently handle tasks like inferencing and batch-processing ML workloads with models under 15-20 billion parameters.
  3. Software Stack:
    • OpenVINO Toolkit: Optimizes AI/ML workloads on CPUs by compressing neural networks, improving inference performance with minimal accuracy loss.
    • Intel oneAPI: Provides a unified software environment for both CPU and GPU workloads.
  4. Real-World Application:
    • Nature Fresh Farms: Demonstrated how AI-driven automation using CPUs effectively manages complex agricultural processes, including plant lifecycle control in greenhouses.

When to Choose CPUs Over GPUs:

  • Inferencing and Batch Processing: When real-time responses are not critical.
  • Sustainability Goals: Lower power consumption makes CPUs a viable option.
  • Cost-Conscious Environments: For scenarios where reducing operational costs is a priority.
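
To make the "models under 15-20 billion parameters" guideline more tangible, here is a rough back-of-the-envelope sketch in Python. It estimates only the memory taken by the model weights (parameter count times bytes per parameter). The precision table and the function name are illustrative assumptions, not figures from the session, and real deployments also need memory for activations and the KV cache.

```python
# Approximate bytes needed to store one parameter at common precisions.
# These are generic storage sizes, not numbers quoted in the session.
BYTES_PER_PARAM = {"fp32": 4.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(params_billion: float, precision: str) -> float:
    """Return the approximate memory (GB, 10^9 bytes) used by the weights alone."""
    # (params_billion * 1e9 parameters) * bytes per parameter / 1e9 bytes per GB
    return params_billion * BYTES_PER_PARAM[precision]


if __name__ == "__main__":
    for size in (7, 15, 20):
        row = ", ".join(
            f"{p}: {weight_memory_gb(size, p):.1f} GB" for p in BYTES_PER_PARAM
        )
        print(f"{size}B parameters -> {row}")
```

Under these assumptions, lower-precision formats shrink the footprint considerably, which is one reason quantized models in this size range can fit comfortably in ordinary server RAM.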

Unlocking Your Data with VMware by Broadcom and NVIDIA — RAG Deep Dive

At VMware Explore EU 2024, the session “Unlocking Your Data with VMware by Broadcom and NVIDIA — RAG Deep Dive” delivered fascinating insights into the power of Retrieval Augmented Generation (RAG). Led by Frank Denneman and Shawn Kelly, this session explored how combining large language models (LLMs) with proprietary organizational data can revolutionize data utilization in enterprises.

What is RAG?

RAG combines the strengths of LLMs with a Vector Database to enhance AI applications by integrating them with an organization’s proprietary data. This synergy allows for more precise and context-aware responses, crucial for business-critical operations.

Why RAG Matters:

  • Enhanced Accuracy: Unlike traditional LLMs prone to “hallucinations” or inaccuracies, RAG provides validated, up-to-date answers by sourcing information directly from relevant databases.
  • Contextual Relevance: It seamlessly blends general knowledge from LLMs with specific proprietary data, delivering highly relevant insights.
  • Traceability and Transparency: RAG solutions can cite the documents used to generate responses, addressing one of the significant limitations of traditional LLMs.

How RAG Works:

  1. Data Indexing: Proprietary data is pre-processed and stored in a vector database.
  2. Question Processing: When a query is made, it is semantically embedded and matched against the vector database.
  3. Answer Generation: The most relevant data is retrieved and used to generate a precise answer.
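
The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline shown in the session: the bag-of-words `embed` function stands in for a real embedding model, `ToyVectorStore` stands in for a real vector database, the sample documents are made up, and the final prompt would be sent to an LLM rather than printed.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: bag-of-words counts (assumption: stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class ToyVectorStore:
    """In-memory stand-in for a vector database (hypothetical helper)."""

    def __init__(self):
        self.items = []  # list of (doc_id, text, vector)

    def index(self, doc_id, text):
        # Step 1: data indexing
        self.items.append((doc_id, text, embed(text)))

    def search(self, query, k=1):
        # Step 2: embed the question and match it against the stored vectors
        q = embed(query)
        ranked = sorted(self.items, key=lambda it: cosine(q, it[2]), reverse=True)
        return [(doc_id, text) for doc_id, text, _ in ranked[:k]]


def build_prompt(question, hits):
    # Step 3: the retrieved text becomes context for answer generation
    context = "\n".join(f"[{doc_id}] {text}" for doc_id, text in hits)
    return (
        "Answer using only the context below and cite the source id.\n"
        f"{context}\nQuestion: {question}"
    )


# Made-up sample documents, for illustration only.
store = ToyVectorStore()
store.index("hr-001", "Employees receive 25 days of paid vacation per year.")
store.index("it-007", "Password resets are handled through the self-service portal.")
store.index("fin-003", "Travel expenses must be submitted within 30 days.")

question = "How many vacation days do employees get?"
hits = store.search(question, k=1)
prompt = build_prompt(question, hits)

if __name__ == "__main__":
    print(prompt)
```

Keeping the document id next to each retrieved passage is what makes the traceability mentioned earlier possible: the generated answer can point back to its sources.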

Integration with NVIDIA:

NVIDIA NIM (NVIDIA Inference Microservices) accelerates this process by optimizing LLMs for rapid inference, leveraging GPU-accelerated infrastructure to increase throughput and reduce latency.

Demystifying DPUs and GPUs in VMware Cloud Foundation

At VMware Explore EU 2024, the session “Demystifying DPUs and GPUs in VMware Cloud Foundation” provided deep insights into how these advanced technologies are transforming modern data centers. Presented by Dave Morera and Peter Flecha, the session highlighted the integration and benefits of Data Processing Units (DPUs) and Graphics Processing Units (GPUs) in VMware Cloud Foundation (VCF).

Key Highlights:

  1. Understanding DPUs:
    • Offloading and Acceleration: DPUs enhance performance by offloading network and communication tasks from the CPU, allowing more efficient resource usage and better performance for data-heavy operations.
    • Enhanced Security: By isolating security tasks, DPUs contribute to a stronger zero-trust security model, essential for protecting modern cloud environments.
    • Dual DPU Support: This feature offers high availability and increased network offload capacity, simplifying infrastructure management and boosting resilience.
  2. Leveraging GPUs:
    • Accelerated AI and ML Workloads: GPUs in VMware environments significantly speed up data-intensive tasks like AI model training and inference.
    • Optimized Resource Utilization: VMware’s vSphere enables efficient GPU resource sharing through virtual GPU (vGPU) profiles, accommodating various workloads, including graphics, compute, and machine learning.
  3. Distributed Services Engine:
    • This engine simplifies infrastructure management and enhances performance by integrating DPU-based services, creating a more secure and efficient data center architecture.

LLM Inference Sizing and Performance Guidance

When planning to deploy a chatbot or a simple Retrieval-Augmented Generation (RAG) pipeline on VMware Private AI Foundation with NVIDIA (PAIF-N) [1], you may have questions about sizing (capacity) and performance based on your existing GPU resources or potential future GPU acquisitions. For […]


CUDA Support for WSL 2

For more efficient testing of LLAMA 2, I recommend taking advantage of GPU acceleration in WSL 2, which is available on notebooks with an NVIDIA GPU. This approach significantly increases performance and efficiency when working with LLAMA 2. In my latest blog post, you will find a detailed guide on how to set up GPU acceleration in WSL 2 on your notebook quickly and easily.

  • First, install the latest NVIDIA Windows GPU driver, which fully supports WSL 2. With CUDA support in the driver, existing applications (compiled elsewhere on a Linux system for the same target GPU) can run unmodified within the WSL environment.
  • Once a Windows NVIDIA GPU driver is installed on the system, CUDA becomes available within WSL 2. The CUDA driver installed on the Windows host is stubbed inside WSL 2 as libcuda.so, so you must not install any NVIDIA GPU Linux driver within WSL 2.
  • Install the Linux x86 CUDA Toolkit using the WSL-Ubuntu package:
wget https://developer.download.nvidia.com/compute/cuda/repos/wsl-ubuntu/x86_64/cuda-wsl-ubuntu.pin
sudo mv cuda-wsl-ubuntu.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/12.3.1/local_installers/cuda-repo-wsl-ubuntu-12-3-local_12.3.1-1_amd64.deb
sudo dpkg -i cuda-repo-wsl-ubuntu-12-3-local_12.3.1-1_amd64.deb
sudo cp /var/cuda-repo-wsl-ubuntu-12-3-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-3
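
After the installation, a quick sanity check can confirm that the pieces described above are visible from inside WSL 2. The following Python sketch is only an illustration: the default paths `/usr/lib/wsl/lib` (where the driver stub is typically exposed) and `/usr/local/cuda` (the usual toolkit location) are assumptions and may differ on your system, and `check_cuda_in_wsl` is a hypothetical helper name.

```python
import os
import shutil


def check_cuda_in_wsl(wsl_lib_dir="/usr/lib/wsl/lib", cuda_home="/usr/local/cuda"):
    """Report which parts of a CUDA-on-WSL setup are visible from this shell.

    Both default paths are assumptions about a typical setup, not guarantees.
    """
    return {
        # nvidia-smi comes from the Windows driver, not from a Linux driver
        "nvidia_smi_on_path": shutil.which("nvidia-smi") is not None,
        # libcuda.so is the stub exposed by the Windows host driver
        "libcuda_stub_present": os.path.exists(
            os.path.join(wsl_lib_dir, "libcuda.so")
        ),
        # nvcc is installed by the CUDA Toolkit package
        "nvcc_present": shutil.which("nvcc") is not None
        or os.path.exists(os.path.join(cuda_home, "bin", "nvcc")),
    }


if __name__ == "__main__":
    for name, ok in check_cuda_in_wsl().items():
        print(f"{name}: {'yes' if ok else 'no'}")
```

If `nvcc_present` is reported as missing even though the toolkit installed cleanly, the toolkit's `bin` directory may simply not be on your `PATH`.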

OpenVINO

A very interesting aspect of VMware Private AI with Intel is its support for OpenVINO. To illustrate what OpenVINO enables, here is a brief summary, shared with her permission, of the thesis ACCELERATION OF FACE RECOGNITION ALGORITHM WITH NEURAL COMPUTE STICK 2 by my daughter Eva Micankova. Accelerated with the NCS2, the FaceNet model (accuracy 97.08%) achieved a frame rate of 15.35 FPS and a latency of 65.164 ms on a NUC Maxtang NX6412.

OpenVINO tools

Intel’s OpenVINO is an open-source suite of development tools for optimizing and deploying deep learning models. It delivers better performance for vision, audio, and language models from popular frameworks such as TensorFlow, Caffe, PyTorch, and others. OpenVINO optimizes deep learning pipelines through memory reuse, graph fusion, load balancing, and inference parallelism across CPUs, GPUs, VPUs, and other devices, as seen in the figure. Additional pre-processing and post-processing operations can be offloaded to or integrated with the accelerators to reduce end-to-end latency and improve throughput.

The popular algorithms FaceNet, SphereFace, and ArcFace differ in architecture and training procedure, but all aim to learn a vector representation of the face that is robust to changes in conditions.

FaceNet

The FaceNet model was developed by Google’s research group in 2015. The model maps face images to points in a Euclidean space, referred to as embeddings, in which the distance between two points reflects how similar or different the faces are.
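
The idea of comparing faces by embedding distance can be sketched as follows. This is a simplified illustration rather than code from the thesis: the three-dimensional vectors are made-up stand-ins for real embeddings (which have many more dimensions), and the threshold value of 0.6 is an arbitrary assumption.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so distances are comparable."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def same_person(emb_a, emb_b, threshold):
    """Treat two embeddings as the same identity when they are closer than threshold."""
    return euclidean_distance(l2_normalize(emb_a), l2_normalize(emb_b)) < threshold


# Hypothetical toy embeddings: two images of person A, one image of person B.
person_a_img1 = [0.90, 0.10, 0.20]
person_a_img2 = [0.85, 0.15, 0.25]
person_b_img1 = [0.10, 0.90, 0.30]

THRESHOLD = 0.6  # assumed value, chosen only for this example

if __name__ == "__main__":
    print(same_person(person_a_img1, person_a_img2, THRESHOLD))
    print(same_person(person_a_img1, person_b_img1, THRESHOLD))
```

In a real system the embeddings would come from the trained network, and the threshold would be tuned on validation data.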

SphereFace

The authors of SphereFace introduced a loss function A-Softmax, derived from the softmax loss function, in their work published in 2017. The A-Softmax (Angular Softmax) loss function was designed to learn discriminative facial features with a clear geometric interpretation, which no available face recognition algorithm offered until then.

ArcFace

ArcFace (Additive Angular Margin loss) is a loss function first introduced in 2018. It builds on SphereFace, which introduced the concept of an angular margin to improve class separability and thereby face recognition performance. However, the SphereFace loss function required a series of approximations, which led to unstable network training. In addition, the standard softmax loss function dominated training, so the angular margin was not fully utilized. ArcFace addresses these shortcomings with the Additive Angular Margin loss, which provides better class separability and more stable training without the approximations used in SphereFace.
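
To show what "additive angular margin" means in practice, here is a minimal single-sample sketch in plain Python. It assumes the inputs are already the cosines between a normalized embedding and the normalized class weights. The default scale `s=64.0` and margin `m=0.5` are commonly used values taken here as assumptions, and the sketch skips the special handling that full implementations apply when the angle plus margin exceeds π.

```python
import math


def arcface_loss(cosines, target, s=64.0, m=0.5):
    """Additive angular margin loss for one sample.

    cosines[j] is cos(theta_j) between the normalized embedding and the
    normalized weight vector of class j. The margin m is added to the angle
    of the target class only, then all logits are scaled by s.
    Simplification: no special case for theta + m > pi.
    """
    logits = []
    for j, c in enumerate(cosines):
        c = max(-1.0, min(1.0, c))  # guard acos against rounding errors
        if j == target:
            c = math.cos(math.acos(c) + m)  # cos(theta + m) for the true class
        logits.append(s * c)
    # Numerically stable log-sum-exp for the softmax cross-entropy
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[target]


if __name__ == "__main__":
    cosines = [0.8, 0.1, -0.2]  # made-up values; class 0 is the true identity
    print("without margin:", arcface_loss(cosines, target=0, s=10.0, m=0.0))
    print("with margin:   ", arcface_loss(cosines, target=0, s=10.0, m=0.5))
```

With the margin set to zero the function reduces to an ordinary scaled softmax cross-entropy. A positive margin makes the true class harder to satisfy, which pushes the network toward more compact, better separated classes.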

VPU

Vision Processing Unit (VPU) accelerators are chips created to accelerate image processing using computer vision and deep learning algorithms. The Intel Neural Compute Stick 2 is a powerful, affordable, and compact solution, with low power consumption, for accelerating neural networks. It is designed to run deep neural networks at high speeds with low energy consumption without losing accuracy, enabling real-time computer vision processing.

Result

One Intel NCS2 was used to accelerate face detection, and a second Intel NCS2 was used for face recognition.

Figure A.4

Graph showing the accuracy of all validated models depending on the changing threshold. ArcFace achieved the highest accuracy of 0.84 at a threshold value of 0.79. SphereFace achieved the highest accuracy of 0.77 for a threshold of 0.74. FaceNet achieved the best accuracy of all the compared models, at 0.982 for a threshold value of 0.57.
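
The threshold dependence described for Figure A.4 can be illustrated with a small sweep. This sketch is not the thesis code: the pairs below are made-up toy data, and it assumes the score is a distance (a pair counts as "same person" when the distance is below the threshold). A similarity-based score would simply flip the comparison.

```python
def accuracy_at(threshold, pairs):
    """pairs holds (distance, is_same_person); predict 'same' when distance < threshold."""
    correct = sum((dist < threshold) == same for dist, same in pairs)
    return correct / len(pairs)


def best_threshold(pairs, candidates):
    """Return the candidate threshold with the highest accuracy (first one on ties)."""
    return max(candidates, key=lambda t: accuracy_at(t, pairs))


# Hypothetical validation pairs, for illustration only.
pairs = [
    (0.30, True), (0.45, True), (0.55, True), (0.62, True),
    (0.58, False), (0.75, False), (0.90, False), (1.10, False),
]
candidates = [0.4, 0.5, 0.6, 0.7, 0.8]

if __name__ == "__main__":
    for t in candidates:
        print(f"threshold {t:.1f}: accuracy {accuracy_at(t, pairs):.3f}")
    print("best threshold:", best_threshold(pairs, candidates))
```

A curve like the one in the figure is what you get when the per-threshold accuracies are plotted for each model.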

Figure 6.13

Graph comparing the achieved frame rate for each series.

  • 1st series – ArcFace model, one person
  • 2nd series – ArcFace model, two people
  • 3rd series – FaceNet model, one person
  • 4th series – FaceNet model, two people
  • 5th series – SphereFace model, one person
  • 6th series – SphereFace model, two people

Conclusion

The best accuracy was achieved by the FaceNet model, which reached 97.08%. The second part of the experiments focused on evaluating the speed of recognition on different platforms. The experiments provided answers to questions about the accuracy achieved by each model, the results of system acceleration using CPU, GPU, and NCS2, which configuration is most suitable for each model, and which configuration achieves the highest frame rate and lowest latency among the compared models. The best frame rate and latency were achieved by the SphereFace model, accelerated using NCS2, with 17.15 FPS and a latency of 58.293 ms. The FaceNet model achieved a frame rate of 15.35 FPS and a latency of 65.164 ms. For a balance between accuracy and speed, the best system configuration is the FaceNet model accelerated using NCS2.
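
The reported frame rates and latencies are consistent with each other if frames are processed one at a time, in which case the frame rate is roughly 1000 divided by the per-frame latency in milliseconds. The short sketch below checks this using the figures quoted above. The sequential-processing assumption is mine and is not stated in the thesis summary.

```python
def fps_from_latency(latency_ms: float) -> float:
    """Frames per second when each frame takes latency_ms and frames run sequentially."""
    return 1000.0 / latency_ms


# (latency in ms, reported FPS) as quoted in the conclusion above
reported = {
    "SphereFace + NCS2": (58.293, 17.15),
    "FaceNet + NCS2": (65.164, 15.35),
}

if __name__ == "__main__":
    for name, (latency_ms, fps) in reported.items():
        computed = fps_from_latency(latency_ms)
        print(f"{name}: computed {computed:.2f} FPS, reported {fps} FPS")
```

With pipelining or batching the two numbers would no longer be tied together this simply, so the relationship is a useful sanity check rather than a general rule.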

Abstract

The thesis ACCELERATION OF FACE RECOGNITION ALGORITHM WITH NEURAL COMPUTE STICK 2 focuses on recognizing faces in images using neural networks and on accelerating that process. It provides an overview of previously used techniques and addresses the use of the currently dominant convolutional neural networks to solve this problem. The work also covers acceleration mechanisms that can be used in this area. Based on this knowledge, a system built on the concept of edge computing was created. It can be used as a home security system connected to an IP camera and sends a notification when an unknown person is present in a guarded area.

https://www.vut.cz/studenti/zav-prace/detail/141562