Private AI in HomeLAB: Affordable GPU Solution with NVIDIA Tesla P40

For Private AI in my HomeLAB, I was searching for a budget-friendly GPU with at least 24 GB of VRAM. Recently, I came across a refurbished NVIDIA Tesla P40 on eBay, which boasts some intriguing specifications:

  • GPU Chip: GP102
  • Cores: 3840
  • TMUs: 240
  • ROPs: 96
  • Memory Size: 24 GB
  • Memory Type: GDDR5
  • Bus Width: 384 bit

Since the NVIDIA Tesla P40 comes in a full-height form factor, I needed to acquire a PCIe riser card.

A PCIe riser card, commonly known simply as a “riser card,” is a hardware component that connects expansion cards to the motherboard. It comes into play when space limitations or specific orientation requirements prevent installing expansion cards directly into the motherboard’s PCIe slots.

Furthermore, I needed to ensure adequate cooling, but this posed no issue. I utilized a 3D model created by MiHu_Works for a Tesla P100 blower fan adapter, which you can find at this link: Tesla P100 Blower Fan Adapter.

As for the fan, the Titan TFD-B7530M12C served the purpose effectively. You can find it on Amazon: Titan TFD-B7530M12C.

Currently, I am using a single VM with PCIe pass-through. However, it was necessary to set specific advanced VM parameters (see the .vmx sketch after the list):

  • pciPassthru.use64bitMMIO = true
  • pciPassthru.64bitMMIOSizeGB = 64
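These two parameters end up as the following .vmx entries; a minimal sketch (the same keys can also be added in the vSphere Client under VM Options / Advanced / Edit Configuration):

pciPassthru.use64bitMMIO = "TRUE"
pciPassthru.64bitMMIOSizeGB = "64"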

Now, you might wonder about the performance. It’s outstanding, delivering speeds roughly 16x to 26x faster than the CPU (155.37 / 9.50 ≈ 16x for the 7B model, 134.74 / 5.18 ≈ 26x for the 13B model). To give you an idea of the performance, I ran a llama-bench test:

| pp 512                | CPU t/s | GPU t/s | Acceleration |
| --------------------- | ------: | ------: | -----------: |
| llama 7B mostly Q4_0  |    9.50 |  155.37 |          16x |
| llama 13B mostly Q4_0 |    5.18 |  134.74 |          26x |
./llama-bench -t 8
| model                          |       size |     params | backend    |    threads | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ---------: | ---------- | ---------------: |
| llama 7B mostly Q4_0           |   3.56 GiB |     6.74 B | CPU        |          8 | pp 512     |      9.50 ± 0.07 |
| llama 7B mostly Q4_0           |   3.56 GiB |     6.74 B | CPU        |          8 | tg 128     |      8.74 ± 0.12 |
./llama-bench -ngl 3800
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: Tesla P40, compute capability 6.1
| model                          |       size |     params | backend    |  ngl | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ---: | ---------- | ---------------: |
| llama 7B mostly Q4_0           |   3.56 GiB |     6.74 B | CUDA       | 3800 | pp 512     |    155.37 ± 1.26 |
| llama 7B mostly Q4_0           |   3.56 GiB |     6.74 B | CUDA       | 3800 | tg 128     |      9.31 ± 0.19 |
./llama-bench -t 8 -m ./models/13B/ggml-model-q4_0.gguf
| model                          |       size |     params | backend    |    threads | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ---------: | ---------- | ---------------: |
| llama 13B mostly Q4_0          |   6.86 GiB |    13.02 B | CPU        |          8 | pp 512     |      5.18 ± 0.00 |
| llama 13B mostly Q4_0          |   6.86 GiB |    13.02 B | CPU        |          8 | tg 128     |      4.63 ± 0.14 |
./llama-bench -ngl 3800 -m ./models/13B/ggml-model-q4_0.gguf
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: Tesla P40, compute capability 6.1
| model                          |       size |     params | backend    |  ngl | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ---: | ---------- | ---------------: |
| llama 13B mostly Q4_0          |   6.86 GiB |    13.02 B | CUDA       | 3800 | pp 512     |    134.74 ± 1.29 |
| llama 13B mostly Q4_0          |   6.86 GiB |    13.02 B | CUDA       | 3800 | tg 128     |      8.42 ± 0.10 |

Feel free to explore this setup for your Private AI in HomeLAB.

Exciting Update: Cisco Unveils UCS Manager VMware vSphere 8U2 HTML Client Plugin Version 4.0(0)

I am thrilled to share my experience with the latest UCSM plugin 4.0 for VMware vSphere 8U2, a remarkable tool that has significantly enhanced our virtualization management capabilities. I have tested its functionality across an extensive network of approximately 13 UCSM domains and 411 ESXi 8U2 hosts. A notable instance of its efficacy was Alert F1236, where the Proactive HA feature seamlessly transitioned the blade into Quarantine mode, showcasing the plugin’s advanced automation capabilities.

However, I did encounter a challenge with the configuration of Custom Alerts, particularly Alert F1705. Despite my efforts, Proactive HA failed to activate, suggesting a potential misconfiguration on my part. To streamline this process, I propose the integration of Alert F1705 into the default alert settings, thereby simplifying the setup and ensuring more efficient system monitoring.

The release of Cisco’s 4.0(0) version of the UCS Manager VMware vSphere 8U2 HTML remote client plugin marks a significant advancement in the field of virtualization administration. This plugin not only offers a comprehensive physical view of the UCS hardware inventory through the HTML client but also enhances the overall management and monitoring of the Cisco UCS physical infrastructure.

Key functionalities provided by this plugin include:

  1. Detailed Physical Hierarchy View: Gain a clear understanding of the Cisco UCS physical structure.
  2. Comprehensive Inventory Insights: Access detailed information on inventory, installed firmware, faults, and power and temperature statistics.
  3. Physical Server to ESXi Host Mapping: Easily correlate your ESXi hosts with their corresponding physical servers.
  4. Firmware Management: Efficiently manage firmware for both B and C series servers.
  5. Direct Access to Cisco UCS Manager GUI: Launch the Cisco UCS Manager GUI directly from the plugin.
  6. KVM Console Integration: Instantly launch KVM consoles of UCS servers for immediate access and control.
  7. Locator LED Control: Switch the state of the locator LEDs as needed for enhanced hardware identification.
  8. Proactive HA Fault Configuration: Customize and configure faults used in Proactive HA for improved system resilience.

Links

Detailed Release Notes

Software download link

Please see the User Guide for specific information on installing and using the plugin with the vSphere HTML client.

Add F1705 Alert to Cisco UCS Manager Plugin 4.0(0)

New Cisco UCS firmware makes it possible to receive notifications about F1705 alerts – Rank VLS.

In the latest version of the Cisco UCS Manager Plugin for VMware vSphere HTML Client (version 4.0(0)), we can add custom faults for Proactive HA monitoring. Here is how to do it:

Cisco UCS / Proactive HA Registration / Registered Fault / Add / ADDDC_Memory_Rank_VLS

If you can’t click Add, it is necessary to unregister the UCSM plugin first and then re-register it:

Cisco UCS / Proactive HA Registration / Registered Fault / Add
Cisco UCS / Proactive HA Registration / vCenter server credentials / Register
Cisco UCS / Proactive HA Registration / Register
How can I check it? Edit Proactive HA / Providers.
It is better to use the name “ADDDC_Memory_Rank_VLS” without spaces. In my screenshot, I used “My F1705 Alerts”.

Adding a custom alert is only possible while the Cisco UCS provider is unregistered, so it is best to do this immediately after installing the Cisco UCS Manager Plugin.

Now I can decide whether to block F1705 or not. I personally prefer to keep the F1705 alert under Proactive HA; then I only restart blades that report F1705. During the reboot, Hard-PPR permanently remaps accesses from the designated faulty row to a designated spare row.
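To check whether a domain is currently reporting this fault before deciding, you can query UCS Manager from its CLI. This is a minimal sketch, assuming SSH access to the UCSM virtual IP (the hostname below is illustrative):

ssh admin@ucsm.example.local
# At the UCSM CLI prompt, list fault details and filter for the F1705 code:
show fault detail | grep F1705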


vSphere 8.0 Update 2: Introducing Azure Active Directory Federated Authentication

At the 2023 VMware Explore event in Barcelona, I attended a presentation with Viviana Miranda about the way users authenticate to vCenter.

vSphere 8.0 Update 2 heralds a number of groundbreaking features, with one standout enhancement in user authentication for vCenter: vCenter Server 8.0 Update 2 now incorporates federated authentication with Azure Active Directory.

Dive into the details of this integration and discover how to activate it.

The Advantages of External Identity Providers

Organizations that leverage external identity providers can expect to reap substantial benefits:

- Integration with existing identity provider infrastructure to streamline processes.
- Implementation of Single Sign-On to simplify access across services.
- Adherence to the best practices of role separation between infrastructure management and identity administration.
- Utilization of robust multi-factor authentication options that come with their chosen identity providers.

Supported Identity Providers in vSphere 8.0 U2

While our focus here is the Azure Active Directory integration, it’s essential to highlight the comprehensive range of authentication methods now supported with vCenter Server 8.0 U2.

More info: https://core.vmware.com/resource/vCenterAzureADFederation#Intro

Host Cache can significantly extend reboot times

Exploring the vSphere environment, I’ve found that configuring a large Host Cache on a VMFS datastore can significantly extend reboot times.

It’s a delicate balance of performance gains versus system availability. For an in-depth look at my findings and the impact on your VMware setup, stay tuned.


Downfall CVE-2022-40982 – Intel’s Downfall Mitigation

Security Fixes in Release 4.3(2b)

Defect ID – CSCwf30468

Cisco UCS M5 C-series servers are affected by vulnerabilities identified by the following Common Vulnerability and Exposures (CVE) IDs:

  • CVE-2022-40982—Information exposure through microarchitectural state after transient execution in certain vector execution units for some Intel® Processors may allow an authenticated user to potentially enable information disclosure through local access
  • CVE-2022-43505—Insufficient control flow management in the BIOS firmware for some Intel® Processors may allow a privileged user to potentially enable denial of service through local access

https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/gather-data-sampling.html

Workaround: EVC Intel “Broadwell” Generation or gather_data_sampling

wget https://raw.githubusercontent.com/speed47/spectre-meltdown-checker/master/spectre-meltdown-checker.sh
sh spectre-meltdown-checker.sh --variant downfall --explain

EVC Intel “Skylake” Generation

CVE-2022-40982 aka 'Downfall, gather data sampling (GDS)'
> STATUS:  VULNERABLE  (Your microcode doesn't mitigate the vulnerability, and your kernel doesn't support mitigation)
> SUMMARY: CVE-2022-40982:KO

EVC Intel “Broadwell” Generation

CVE-2022-40982 aka 'Downfall, gather data sampling (GDS)'
> STATUS:  NOT VULNERABLE  (your CPU vendor reported your CPU model as not affected)
> SUMMARY: CVE-2022-40982:OK

Mitigation with an updated kernel

When a microcode update is not available via a firmware update package, you can update the kernel to a version that implements a way to shut off AVX instruction set support. This is achieved by adding the following kernel command-line parameter:

gather_data_sampling=force
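On a Linux guest that boots via GRUB, applying the parameter could look like the following sketch (the file location and regeneration command vary by distribution and are assumptions here; gather_data_sampling=off opts out of the mitigation instead):

# 1) Append gather_data_sampling=force to GRUB_CMDLINE_LINUX in /etc/default/grub
# 2) Regenerate the GRUB configuration and reboot:
sudo update-grub   # on RHEL-family systems: grub2-mkconfig -o /boot/grub2/grub.cfg
sudo reboot
# 3) Verify the mitigation state reported by the kernel:
cat /sys/devices/system/cpu/vulnerabilities/gather_data_sampling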

Mitigation Options

When the mitigation is enabled, there is additional latency before results of the gather load can be consumed. Although the performance impact to most workloads is minimal, specific workloads may show performance impacts of up to 50%. Depending on their threat model, customers can decide to opt-out of the mitigation.

Intel® Software Guard Extensions (Intel® SGX)

There will be an Intel SGX TCB Recovery for those Intel SGX-capable affected processors. This TCB Recovery will only attest as up-to-date when the patch has been FIT-loaded (for example, with an updated BIOS), Intel SGX has been enabled by BIOS, and hyperthreading is disabled. In this configuration, the mitigation will be locked to the enabled state. If Intel SGX is not enabled or if hyperthreading is enabled, the mitigation will not be locked, and system software can choose to enable or disable the GDS mitigation.


How to Configure NVMe/TCP with vSphere 8.0 Update 1 and ONTAP 9.13.1 for VMFS Datastores

vSphere 8U1 – Deep dive on configuring NVMe-oF (Non-Volatile Memory Express over Fabrics) for VMware vSphere datastores.
What’s new

With vSphere 8.0 update 1, VMware has completed their journey to a completely native end-to-end NVMe storage stack. Prior to 8.0U1, there was a SCSI translation layer which added some complexity to the stack and slightly decreased some of the efficiencies inherent in the NVMe protocol.

ONTAP 9.12.1 added support for secure authentication over NVMe/TCP as well as increasing NVMe limits (viewable on the NetApp Hardware Universe [HWU]).
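For orientation, the host-side steps look roughly like this sketch (adapter names, VMkernel interface, IP address, and port are illustrative assumptions; verify the exact options with esxcli nvme fabrics --help on your host):

# Tag a VMkernel port for NVMe/TCP traffic
esxcli network ip interface tag add -i vmk1 -t NVMeTCP
# Create a software NVMe/TCP adapter on a physical NIC
esxcli nvme fabrics enable --protocol TCP --device vmnic0
# Confirm the new adapter, then discover the ONTAP subsystem
esxcli nvme adapter list
esxcli nvme fabrics discover -a vmhba65 -i 192.168.10.50 -p 8009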

For more information, see the excellent source post: How to Configure NVMe/TCP with vSphere 8.0 Update 1 and ONTAP 9.13.1 for VMFS Datastores.

How to run Secure Boot Validation Script on an ESXi Host

Help for validation script:

/usr/lib/vmware/secureboot/bin/secureBoot.py -h
usage: secureBoot.py [-h] [-a | -c | -s]

optional arguments:
  -h, --help            show this help message and exit
  -a, --acceptance-level-check
                        Validate acceptance levels for installed vibs
  -c, --check-capability
                        Check if the host is ready to enable secure boot
  -s, --check-status    Check if UEFI secure boot is enabled

Check if the host is ready to enable secure boot

/usr/lib/vmware/secureboot/bin/secureBoot.py -c
Secure boot can be enabled: All vib signatures verified. All tardisks validated. All acceptance levels validated

Check if UEFI secure boot is disabled

/usr/lib/vmware/secureboot/bin/secureBoot.py -s
Disabled

Create Cisco UCS Boot Policy

Check if UEFI secure boot is enabled and working

/usr/lib/vmware/secureboot/bin/secureBoot.py -s
Enabled
vSphere Secure Boot

Deprecation of legacy BIOS support in vSphere 8.0 (84233) + Booting vSphere ESXi 8.0 may fail with “Error 10 (Out of resources)” (89682)

UCSX-TPM2-002 Trusted Platform Module 2.0 for UCS servers

    Personally, here are my recommendations for new ESXi 8.0 installations:

    • VMware only supports UEFI boot in new installations
    • When purchasing new servers, choose models with TPM 2.0
    • When upgrading to ESXi 8.0, verify that UEFI boot is enabled

    Booting vSphere ESXi 8.0 may fail with “Error 10 (Out of resources)” (89682)

    • The hardware machine is configured to boot in legacy BIOS mode.
    • Booting stops early in the boot process with messages displayed in red on black, with wording similar to “Error 10 (Out of resources) while loading module”, “Requested malloc size failed”, or “No free memory”.

    VMware’s recommended workaround is to transition the machine to UEFI boot mode permanently, as discussed in KB article 84233. There will not be a future ESXi change to allow legacy BIOS to work on this machine again.

    Deprecation of legacy BIOS support in vSphere (84233)

    VMware plans to deprecate support for legacy BIOS in server platforms.

    If you upgrade a server that was certified and running successfully with legacy BIOS to a newer release of ESXi, it is possible the server will no longer function with that release. For example, some servers may fail to boot with an “Out of resources” message because the newer ESXi release is too large to boot in legacy BIOS mode. Generally, VMware will not provide any fix or workaround for such issues besides switching the server to UEFI boot mode.

    Motivation

    UEFI provides several advantages over legacy BIOS and aligns with VMware’s goal of being “secure by default”. UEFI enables:

    • UEFI Secure Boot, a security standard that helps ensure that the server boots using only software that is trusted by the server manufacturer.
    • Automatic update of the system boot order during ESXi installation.
    • Persistent memory
    • TPM 2.0
    • Intel SGX Registration
    • Upcoming support for DPU/SmartNIC
    Securing ESXi Hosts with Trusted Platform Module
    vSphere 6.7 Support for ESXi and TPM 2.0

    List of vSphere 8.0 Knowledge base articles and Important Links (89756)

    [Main KB] – List of vSphere 8.0 Knowledge base articles and Important Links (89756)