Exciting Update: Cisco Unveils UCS Manager VMware vSphere 8U2 HTML Client Plugin Version 4.0(0)

I am thrilled to share my experience with the latest UCSM plugin 4.0 for VMware vSphere 8U2, a remarkable tool that has significantly enhanced our virtualization management capabilities. I have tested its functionality across an extensive environment of approximately 13 UCSM domains and 411 ESXi 8U2 hosts. A notable instance of its efficacy was Alert F1236, where the Proactive HA feature seamlessly transitioned the blade into Quarantine mode, showcasing the plugin’s advanced automation capabilities.

However, I did encounter a challenge with the configuration of Custom Alerts, particularly Alert F1705. Despite my efforts, Proactive HA failed to activate, suggesting a potential misconfiguration on my part. To streamline this process, I propose the integration of Alert F1705 into the default alert settings, thereby simplifying the setup and ensuring more efficient system monitoring.

The release of Cisco’s 4.0(0) version of the UCS Manager VMware vSphere 8U2 HTML remote client plugin marks a significant advancement in the field of virtualization administration. This plugin not only offers a comprehensive physical view of the UCS hardware inventory through the HTML client but also enhances the overall management and monitoring of the Cisco UCS physical infrastructure.

Key functionalities provided by this plugin include:

  1. Detailed Physical Hierarchy View: Gain a clear understanding of the Cisco UCS physical structure.
  2. Comprehensive Inventory Insights: Access detailed information on inventory, installed firmware, faults, and power and temperature statistics.
  3. Physical Server to ESXi Host Mapping: Easily correlate your ESXi hosts with their corresponding physical servers.
  4. Firmware Management: Efficiently manage firmware for both B and C series servers.
  5. Direct Access to Cisco UCS Manager GUI: Launch the Cisco UCS Manager GUI directly from the plugin.
  6. KVM Console Integration: Instantly launch KVM consoles of UCS servers for immediate access and control.
  7. Locator LED Control: Switch the state of the locator LEDs as needed for enhanced hardware identification.
  8. Proactive HA Fault Configuration: Customize and configure faults used in Proactive HA for improved system resilience.

Links

Detailed Release Notes

Software download link

Please see the User Guide for specific information on installing and using the plugin with the vSphere HTML client.

Add F1705 Alert to Cisco UCS Manager Plugin 4.0(0)

New Cisco UCS firmware makes it possible to receive notifications about F1705 alerts (Rank VLS).

In the latest version of the Cisco UCS Manager Plugin for VMware vSphere HTML Client (Version 4.0(0)), we can add custom faults for Proactive HA monitoring. How is it done?

Cisco UCS / Proactive HA Registration / Registered Fault / Add / ADDDC_Memory_Rank_VLS

If you can’t click Add, it is necessary to unregister the UCS Manager plugin provider first.

Cisco UCS / Proactive HA Registration / Registered Fault / Add
Cisco UCS / Proactive HA Registration / vCenter server credentials / Register
Cisco UCS / Proactive HA Registration / Register
How can I verify it? Edit Proactive HA / Providers.
It is better to use a name without spaces, such as “ADDDC_Memory_Rank_VLS”; in my screenshot I used “My F1705 Alerts”.

Adding a custom alert is only possible while the Cisco UCS provider is unregistered, so it is best to do this immediately after installing the Cisco UCS Manager plugin.

Now I can decide whether or not to block F1705. I personally prefer to have the F1705 alert under Proactive HA; then I only restart blades with F1705. During the reboot, hard PPR permanently remaps accesses from a designated faulty row to a designated spare row.

Links:

Downfall CVE-2022-40982 – Intel’s Downfall Mitigation

Security Fixes in Release 4.3(2b)

Defect ID – CSCwf30468

Cisco UCS M5 C-series servers are affected by vulnerabilities identified by the following Common Vulnerability and Exposures (CVE) IDs:

  • CVE-2022-40982—Information exposure through microarchitectural state after transient execution in certain vector execution units for some Intel® Processors may allow an authenticated user to potentially enable information disclosure through local access
  • CVE-2022-43505—Insufficient control flow management in the BIOS firmware for some Intel® Processors may allow a privileged user to potentially enable denial of service through local access

https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/gather-data-sampling.html

Workaround: EVC at the Intel “Broadwell” generation, or the gather_data_sampling kernel parameter

wget https://raw.githubusercontent.com/speed47/spectre-meltdown-checker/master/spectre-meltdown-checker.sh
sh spectre-meltdown-checker.sh --variant downfall --explain

EVC Intel “Skylake” Generation

CVE-2022-40982 aka 'Downfall, gather data sampling (GDS)'
> STATUS:  VULNERABLE  (Your microcode doesn't mitigate the vulnerability, and your kernel doesn't support mitigation)
> SUMMARY: CVE-2022-40982:KO

EVC Intel “Broadwell” Generation

CVE-2022-40982 aka 'Downfall, gather data sampling (GDS)'
> STATUS:  NOT VULNERABLE  (your CPU vendor reported your CPU model as not affected)
> SUMMARY: CVE-2022-40982:OK

Mitigation with an updated kernel

When a microcode update is not available via a firmware update package, you may update the kernel to a version that implements a way to shut off AVX instruction set support. This can be achieved by adding the following kernel command-line parameter:

gather_data_sampling=force
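
As a sketch, on a GRUB-based Linux guest the parameter can be made persistent like this (the file path and the regeneration command are distro-specific assumptions; adapt them to your environment):

```shell
# Append the mitigation flag to the kernel command line
# (GRUB-based distros; paths are assumptions, check your distribution).
sed -i 's/^GRUB_CMDLINE_LINUX="\(.*\)"/GRUB_CMDLINE_LINUX="\1 gather_data_sampling=force"/' /etc/default/grub
update-grub   # on RHEL-family systems: grub2-mkconfig -o /boot/grub2/grub.cfg
reboot
# After the reboot, verify the mitigation state:
cat /sys/devices/system/cpu/vulnerabilities/gather_data_sampling
```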

Mitigation Options

When the mitigation is enabled, there is additional latency before results of the gather load can be consumed. Although the performance impact to most workloads is minimal, specific workloads may show performance impacts of up to 50%. Depending on their threat model, customers can decide to opt-out of the mitigation.

Intel® Software Guard Extensions (Intel® SGX)

There will be an Intel SGX TCB Recovery for those Intel SGX-capable affected processors. This TCB Recovery will only attest as up-to-date when the patch has been FIT-loaded (for example, with an updated BIOS), Intel SGX has been enabled by BIOS, and hyperthreading is disabled. In this configuration, the mitigation will be locked to the enabled state. If Intel SGX is not enabled or if hyperthreading is enabled, the mitigation will not be locked, and system software can choose to enable or disable the GDS mitigation.

Links:

How to run Secure Boot Validation Script on an ESXi Host

Help for validation script:

/usr/lib/vmware/secureboot/bin/secureBoot.py -h
usage: secureBoot.py [-h] [-a | -c | -s]

optional arguments:
  -h, --help            show this help message and exit
  -a, --acceptance-level-check
                        Validate acceptance levels for installed vibs
  -c, --check-capability
                        Check if the host is ready to enable secure boot
  -s, --check-status    Check if UEFI secure boot is enabled

Check if the host is ready to enable secure boot

/usr/lib/vmware/secureboot/bin/secureBoot.py -c
Secure boot can be enabled: All vib signatures verified. All tardisks validated. All acceptance levels validated

Check if UEFI secure boot is disabled

/usr/lib/vmware/secureboot/bin/secureBoot.py -s
Disabled
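
To audit many hosts at once, a small loop over SSH can collect the status from each host (the hostnames below are placeholders; this assumes SSH access to the ESXi shell is enabled):

```shell
# Hypothetical host list -- replace with your own ESXi hostnames.
HOSTS="esxi01 esxi02 esxi03"
for h in $HOSTS; do
  # Query each host's UEFI secure boot status via the validation script.
  status=$(ssh root@"$h" /usr/lib/vmware/secureboot/bin/secureBoot.py -s)
  echo "$h: $status"
done
```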

Create Cisco UCS Boot Policy

Check if UEFI secure boot is enabled and working

/usr/lib/vmware/secureboot/bin/secureBoot.py -s
Enabled
vSphere Secure Boot

Field Notice: FN – 72368 – Some DIMMs Might Fail Prematurely Due to a Manufacturing Deviation – Hardware Upgrade Available

Cisco has announced Field Notice FN-72368: some DIMMs might fail prematurely due to a manufacturing deviation, and a hardware upgrade is available.

My personal recommendation: use ADDDC and PPR – they can help prevent hardware failures. Note that UCS-ML-128G4RT-H has been in its second revision since 28-Oct-22.

Problem Description

A limited number of DIMMs shipped from Cisco are impacted by a known deviation in the memory supplier’s manufacturing process. This deviation might result in a higher rate of failure.

Background

DIMM manufacturers compose their DIMMs of multiple memory modules to reach the desired capacity. A 16GB DIMM might be composed of the same modules that a 32GB DIMM is composed of. In this case, a manufacturing deviation in specific modules impacts 16GB, 32GB, 64GB, and 128GB DIMMs. This deviation was contained to a specific date range, and the DIMMs which use these chips were manufactured during the middle to end of 2020. Since the discovery of this deviation, additional limits have been imposed on the manufacturing process to ensure that future DIMMs are not exposed to this process variation.

Problem Symptom

Most DIMMs with this manufacturing deviation will exhibit persistent correctable memory errors. If left untreated, the DIMMs might eventually encounter an uncorrectable memory event. If encountered during runtime, uncorrectable errors will cause a sudden unexpected server reset. If encountered during Power-On Self-Test (POST), the DIMM will be mapped out and the total available memory reduced. In some cases a boot error might be seen.

Various DIMM Reliability, Availability, and Serviceability (RAS) features or even operating system features might mask the extent of these correctable errors. It is recommended to check your DIMMs for exposure using the Serial Number Validation Tool described in the Serial Number Validation section of this field notice. Only specific DIMMs are impacted by this issue, so do not rely solely on the DIMM error count to judge exposure.

Workaround/Solution

This is a hardware failure. A replacement is strongly recommended in order to avoid potential for unexpected server failure.

How to Boot ESXi 7.0 on UCS-M2-HWRAID Boot-Optimized M.2 RAID Controller

VMware strongly advises that you move away completely from using SD card/USB as a boot device option on any future server hardware.

SD cards can continue to be used for the bootbank partition provided that a separate persistent local device to store the OSDATA partition (32GB min., 128GB recommended) is available in the host.
Preferably, the SD cards should be replaced with an M.2 or another local persistent device as the standalone boot option.

vSphere 7 – ESXi System Storage Changes

Please refer to the following blog:
https://core.vmware.com/resource/esxi-system-storage-changes

How to set up ESXi boot on UCS-M2-HWRAID?

Create Disk Group Policies – Storage / Storage Policies / root / Disk Group Policies / M.2-RAID1

Create Storage Profile – Storage / Storage Profiles / root / Storage Profile M.2-RAID1

Create Local LUNs – Storage / Storage Profiles / root / Storage Profile M.2-RAID1

Modify Storage Profile inside Service Profile

Change Boot Order to Local Disk
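
Once the service profile is applied and ESXi is installed, a quick sanity check from the ESXi shell can confirm that the boot volume is the local M.2 RAID1 device (display names vary by controller; treat the output matching as an assumption to verify against your inventory):

```shell
# From the ESXi shell: list local storage devices and look for the
# M.2 HW RAID volume among the display names.
esxcli storage core device list | grep -i "Display Name"
# The bootbank symlink should resolve to a partition on that device.
ls -l /bootbank
```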

Links

Add F1705 Alert to Cisco UCS Manager Plugin

New Cisco UCS firmware makes it possible to receive notifications about F1705 alerts (Rank VLS).

In the latest version of the Cisco UCS Manager Plugin for VMware vSphere HTML Client (Version 3.0(6)), we can add custom faults for Proactive HA monitoring. How is it done?

Cisco UCS / Proactive HA Registration / Fault monitoring details / Add / ADDDC_Memory_Rank_VLS
Cisco UCS / Proactive HA Registration / Fault monitoring details / Add
Cisco UCS / Proactive HA Registration / vCenter server credentials / Register
Cisco UCS / Proactive HA Registration / Register
How can I verify it? Edit Proactive HA / Providers.
It is better to use a name without spaces, such as “ADDDC_Memory_Rank_VLS”; in my screenshot I used “My F1705 Alerts”.

Adding a custom alert is only possible while the Cisco UCS provider is unregistered, so it is best to do this immediately after installing the Cisco UCS Manager plugin.

Now I can decide whether or not to block F1705. I personally prefer to have the F1705 alert under Proactive HA; then I only restart blades with F1705. During the reboot, hard PPR permanently remaps accesses from a designated faulty row to a designated spare row.

Links:

NSX-T Edge design guide for Cisco UCS

How do you design an NSX-T Edge inside Cisco UCS? I couldn’t find it in the Cisco design guides, but I did find useful topologies in the Dell EMC VxBlock™ Systems documentation, the VMware® NSX-T Reference Design, and the NSX-T 3.0 Edge Design Step-by-Step UI workflow. Thanks, Dell and VMware!

VMware® NSX-T Reference Design

  • VDS Design update – New capability of deploying NSX on top of VDS with NSX
  • VSAN Baseline Recommendation for Management and Edge Components
  • VRF Based Routing and other enhancements
  • Updated security functionality
  • Design changes that go with VDS with NSX
  • Performance updates

NSX-T 3.0 Edge Design Step-by-Step UI workflow

This is an informal document that walks through the step-by-step deployment and configuration workflow for the NSX-T Edge Single N-VDS Multi-TEP design. It uses the NSX-T 3.0 UI to show the workflow, which is broken down into the following three sub-workflows:

  1. Deploy and configure the Edge node (VM & BM) with Single-NVDS Multi-TEP.
  2. Preparing NSX-T for Layer 2 External (North-South) connectivity.
  3. Preparing NSX-T for Layer 3 External (North-South) connectivity.

NSX-T Design with EDGE VM

  • Under Teamings, add two teaming policies: one with the active uplink set to “uplink-1” and the other to “uplink-2”.
  • Make a note of the policy names used, as we will use them in the next section. In this example they are “PIN-TO-TOR-LEFT” and “PIN-TO-TOR-RIGHT”.

How to design NSX-T Edge inside Cisco UCS?

Connect through the Cisco Fabric Interconnects using a port channel – you need high bandwidth for the NSX-T Edge load.

A C220 M5 rack server can solve this.

The edge node physical NIC definition includes the following

  • VMNIC0 and VMNIC1: Cisco VIC 1457
  • VMNIC2 and VMNIC3: Intel XXV710 adapter 1 (TEP and Overlay)
  • VMNIC4 and VMNIC5: Intel XXV710 adapter 2 (N/S BGP Peering)
NSX-T transport nodes with Cisco UCS C220 M5
Logical topology of the physical edge host

Or for PoC or Lab – Uplink Eth Interfaces

For a PoC or home lab, we could use uplink Ethernet interfaces and create a vNIC template linked to these uplinks.

Links:

Cisco UCS Manager Plugin for VMware vSphere HTML Client (Version 3.0(6))

Cisco has released the 3.0(6) version of the Cisco UCS Manager VMware vSphere HTML client plugin. The UCS Manager vSphere HTML client plugin enables a virtualization administrator to view, manage, and monitor the Cisco UCS physical infrastructure. The plugin provides a physical view of the UCS hardware inventory on the HTML client.

I reported the bug “Host not going into monitoring state after vCenter restart”. Thank you for the fix!

Release 3.0(6)

Here are the new features in Release 3.0(6):

  • Custom fault addition for proactive HA monitoring
  • Resolved: host not going into monitoring state after vCenter restart
  • Included defect fixes

VMware vSphere HTML Client Releases

Cisco UCS Manager plug-in is compatible with the following vSphere HTML Client releases:

VMware vSphere HTML Client Version | Cisco UCS Manager Plugin for VMware vSphere Version
6.7                                | 3.0(1), 3.0(2), 3.0(3), 3.0(4), 3.0(5), 3.0(6)
7.0                                | 3.0(4), 3.0(5), 3.0(6)
7.0u1, 7.0u2                       | 3.0(5), 3.0(6)

Note
VMware vSphere HTML Client Version 7.0u3 is not supported.
More info here.

Field Notice: FN – 70432 – Improved Memory RAS Features for UCS M5 Platforms – Software Upgrade Recommended – 4.2(1i)

I recommend upgrading to a server firmware bundle that includes ADDDC Sparing to expand the memory error coverage – 4.2(1i). More info:

Handling RAS events

When BANK-level or RANK-level RAS events are observed (and PPR is enabled):

  1. Verify that no other DIMM faults are present (for example, an uncorrectable error).
  2. Schedule a maintenance window (MW).
  3. During MW, put the host in maintenance mode and reboot the server to attempt a permanent repair of the DIMM using Post Package Repair (PPR).
    1. If no errors occur after reboot, PPR was successful, and the server can be put back into use.
    2. If new ADDDC events occur, repeat the reboot process to perform additional permanent repairs with PPR.
  4. If an uncorrectable error occurs after reboot, replace the DIMM.
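
Step 3 above, executed from the ESXi shell of the affected host, can look like the sketch below (this assumes DRS or Proactive HA has already evacuated the VMs; adapt it to your change process):

```shell
# Enter maintenance mode, then reboot so PPR can attempt a permanent
# repair of the DIMM during POST.
esxcli system maintenanceMode set --enable true
reboot
# After the host comes back and no new DIMM faults appear,
# return it to service:
esxcli system maintenanceMode set --enable false
```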

Release 4.1(1) firmware generates a Major severity fault for all BANK and RANK RAS events so that proactive action can be taken against the critical ADDDC defect CSCvr79388.

Release 4.1(2) and 4.1(3) firmware generates a Major severity fault for RANK RAS events on advanced CPU SKUs; BANK RAS events generate a fault on standard CPU SKUs.

Problem Symptom

Due to memory DIMM errors and architectural changes in memory error handling on Intel Xeon Scalable processors (formerly code-named “Skylake Server”) and 2nd Gen Intel Xeon Scalable processors (formerly code-named “Cascade Lake Server”), Cisco UCS M5 customers that experience memory DIMM errors might experience a higher rate of runtime uncorrectable memory errors than they experienced on previous generations with default SDDC Memory RAS mode.

Workaround/Solution

Cisco recommends that you upgrade to a Server Firmware Bundle that includes ADDDC Sparing to expand the memory error coverage. Refer to this table for supported and recommended firmware that includes ADDDC Sparing.

Server Platform                                  | Server Firmware That Supports ADDDC Sparing          | Recommended Server Firmware
UCS M5 Blades and Integrated UCS M5 Rack Servers | 3.2(3p) or later, 4.0(4i) or later, 4.1(1d) or later | 4.1(3d) or later

Defect ID  | Headline
CSCvq38078 | UCSM: Default option for “Select Memory RAS configuration” changed to ADDDC sparing
Links