Exciting Update: Cisco Unveils UCS Manager VMware vSphere 8U2 HTML Client Plugin Version 4.0(0)

I am thrilled to share my experience with the latest UCSM plugin 4.0 for VMware vSphere 8U2, a tool that has significantly improved our virtualization management. I tested it across approximately 13 UCSM domains and 411 ESXi 8U2 hosts. A notable example was Alert F1236, where the Proactive HA feature moved the blade into Quarantine mode without any manual intervention.

However, I did encounter a challenge with the configuration of Custom Alerts, particularly Alert F1705. Despite my efforts, Proactive HA failed to activate, suggesting a potential misconfiguration on my part. To streamline this process, I propose the integration of Alert F1705 into the default alert settings, thereby simplifying the setup and ensuring more efficient system monitoring.

The release of Cisco’s 4.0(0) version of the UCS Manager VMware vSphere 8U2 HTML remote client plugin marks a significant advancement in the field of virtualization administration. This plugin not only offers a comprehensive physical view of the UCS hardware inventory through the HTML client but also enhances the overall management and monitoring of the Cisco UCS physical infrastructure.

Key functionalities provided by this plugin include:

  1. Detailed Physical Hierarchy View: Gain a clear understanding of the Cisco UCS physical structure.
  2. Comprehensive Inventory Insights: Access detailed information on inventory, installed firmware, faults, and power and temperature statistics.
  3. Physical Server to ESXi Host Mapping: Easily correlate your ESXi hosts with their corresponding physical servers.
  4. Firmware Management: Efficiently manage firmware for both B and C series servers.
  5. Direct Access to Cisco UCS Manager GUI: Launch the Cisco UCS Manager GUI directly from the plugin.
  6. KVM Console Integration: Instantly launch KVM consoles of UCS servers for immediate access and control.
  7. Locator LED Control: Switch the state of the locator LEDs as needed for enhanced hardware identification.
  8. Proactive HA Fault Configuration: Customize and configure faults used in Proactive HA for improved system resilience.

Links

Detailed Release Notes

Software download link

Please see the User Guide for specific information on installing and using the plugin with the vSphere HTML client.

Add F1705 Alert to Cisco UCS Manager Plugin 4.0(0)

New Cisco UCS firmware makes it possible to receive notifications for F1705 alerts (Rank VLS).

In the latest version of the Cisco UCS Manager Plugin for VMware vSphere HTML Client (version 4.0(0)), we can add custom faults for Proactive HA monitoring. How do you do it?

Cisco UCS / Proactive HA Registration / Registered Fault / Add / ADDDC_Memory_Rank_VLS

If you can’t add the fault, you need to unregister the UCS Manager plugin first.

Cisco UCS / Proactive HA Registration / Registered Fault / Add
Cisco UCS / Proactive HA Registration / vCenter server credentials / Register
Cisco UCS / Proactive HA Registration / Register
How can I check it? Edit Proactive HA / Providers
It is better to use a name without spaces, such as “ADDDC_Memory_Rank_VLS”. In my screenshot I used “My F1705 Alerts”.

Adding a custom alert is only possible while the Cisco UCS Provider is unregistered, so it is best to do it immediately after installing the Cisco UCS Manager Plugin.

Now I can decide whether to block on F1705 or not. I personally prefer to have the F1705 alert under Proactive HA. Then I only restart blades that report F1705. During the reboot, Hard-PPR permanently remaps accesses from a designated faulty row to a designated spare row.

Links:

Field Notice: FN – 72368 – Some DIMMs Might Fail Prematurely Due to a Manufacturing Deviation – Hardware Upgrade Available

Cisco announced Field Notice: FN – 72368 – Some DIMMs Might Fail Prematurely Due to a Manufacturing Deviation – Hardware Upgrade Available

My personal recommendation: please use ADDDC and PPR. It could prevent hardware failures … UCS-ML-128G4RT-H is in its 2nd revision, dated 28-Oct-22.

Problem Description

A limited number of DIMMs shipped from Cisco are impacted by a known deviation in the memory supplier’s manufacturing process. This deviation might result in a higher rate of failure.

Background

DIMM manufacturers compose their DIMMs of multiple memory modules to reach the desired capacity. A 16GB DIMM might be composed of the same modules that a 32GB DIMM is composed of. In this case, a manufacturing deviation in specific modules impacts 16GB, 32GB, 64GB, and 128GB DIMMs. This deviation was contained to a specific date range, and the DIMMs which use these chips were manufactured during the middle to end of 2020. Since the discovery of this deviation, additional limits have been imposed on the manufacturing process to ensure that future DIMMs are not exposed to this process variation.

Problem Symptom

Most DIMMs with this manufacturing deviation will exhibit persistent correctable memory errors. If left untreated, the DIMMs might eventually encounter an uncorrectable memory event. If encountered during runtime, uncorrectable errors will cause a sudden unexpected server reset. If encountered during Power-On Self-Test (POST), the DIMM will be mapped out and the total available memory reduced. In some cases a boot error might be seen.

Various DIMM Reliability, Availability, and Serviceability (RAS) features or even operating system features might mask the extent of these correctable errors. It is recommended to check your DIMMs for exposure using the Serial Number Validation Tool described in the Serial Number Validation section of this field notice. Only specific DIMMs are impacted by this issue, so do not rely solely on the DIMM error count to judge exposure.

Workaround/Solution

This is a hardware failure. A replacement is strongly recommended in order to avoid potential for unexpected server failure.

Field Notice: FN – 70432 – Improved Memory RAS Features for UCS M5 Platforms – Software Upgrade Recommended – 4.2(1i)

I recommend upgrading to a Server Firmware Bundle that includes ADDDC Sparing, 4.2(1i), to expand the memory error coverage. More info:

Handling RAS events

When BANK-level or RANK-level RAS events are observed (and PPR is enabled):

  1. Verify that no other DIMM faults are present (for example, an uncorrectable error).
  2. Schedule a maintenance window (MW).
  3. During MW, put the host in maintenance mode and reboot the server to attempt a permanent repair of the DIMM using Post Package Repair (PPR).
    1. If no errors occur after reboot, PPR was successful, and the server can be put back into use.
    2. If new ADDDC events occur, repeat the reboot process to perform additional permanent repairs with PPR.
  4. If an uncorrectable error occurs after reboot, replace the DIMM.
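The procedure above can be read as a small decision flow. Here is a minimal Python sketch of it. The names `Action`, `before_reboot` and `after_reboot` are made up for illustration and are not part of any Cisco tool. What to do when other DIMM faults are already present is my own reading, because the steps only say to verify that none exist.

```python
from enum import Enum


class Action(Enum):
    """Hypothetical labels for the outcomes named in the steps above."""
    RESOLVE_OTHER_FAULTS_FIRST = "resolve other DIMM faults first"
    REBOOT_FOR_PPR = "reboot in a maintenance window to attempt PPR"
    RETURN_TO_SERVICE = "PPR succeeded, put the server back into use"
    REPLACE_DIMM = "replace the DIMM"


def before_reboot(other_dimm_faults: bool) -> Action:
    """Steps 1-3: check for other faults, then plan the PPR reboot."""
    if other_dimm_faults:
        # Assumption: the procedure does not spell out this branch.
        return Action.RESOLVE_OTHER_FAULTS_FIRST
    return Action.REBOOT_FOR_PPR


def after_reboot(uncorrectable_error: bool, new_adddc_events: bool) -> Action:
    """Steps 3.1, 3.2 and 4: evaluate the result of the PPR reboot."""
    if uncorrectable_error:
        return Action.REPLACE_DIMM
    if new_adddc_events:
        # Repeat the reboot so PPR can perform additional repairs.
        return Action.REBOOT_FOR_PPR
    return Action.RETURN_TO_SERVICE


if __name__ == "__main__":
    print(before_reboot(other_dimm_faults=False).value)
    print(after_reboot(uncorrectable_error=False, new_adddc_events=True).value)
```

An uncorrectable error is checked first after the reboot, since step 4 calls for replacement regardless of any further ADDDC events.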

Release 4.1(1) firmware generates a Major severity fault for all BANK and RANK RAS events so that proactive action can be taken relative to a critical ADDDC defect CSCvr79388.

Release 4.1(2) and 4.1(3) firmware generates a Major severity fault for RANK RAS events on advanced CPU SKUs. BANK RAS events will generate a fault for standard CPU SKUs.

Problem Symptom

Due to memory DIMM errors and architectural changes in memory error handling on Intel Xeon Scalable processors (formerly code-named “Skylake Server”) and 2nd Gen Intel Xeon Scalable processors (formerly code-named “Cascade Lake Server”), Cisco UCS M5 customers that experience memory DIMM errors might experience a higher rate of runtime uncorrectable memory errors than they experienced on previous generations with default SDDC Memory RAS mode.

Workaround/Solution

Cisco recommends that you upgrade to a Server Firmware Bundle that includes ADDDC Sparing to expand the memory error coverage. Refer to this table for supported and recommended firmware that includes ADDDC Sparing.

Columns: Server Firmware That Supports ADDDC Sparing | Recommended Server Firmware
UCS M5 Blades and Integrated UCS M5 Rack Servers:
3.2(3p) or later
4.0(4i) or later
4.1(1d) or later
4.1(3d) or later

Defect ID | Headline
CSCvq38078 | UCSM: Default option for “Select Memory RAS configuration” changed to ADDDC sparing
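Checking an installed firmware string against an "or later" threshold like the ones above is easy to get wrong by eye. Here is a minimal Python sketch. The helpers `parse_ucs_version` and `is_at_least` are hypothetical names. The pattern assumes versions always look like `4.1(3d)` or `4.0(0)`, and comparing patch letters alphabetically is my own assumption. The thresholds are given per release train, so only compare versions from the same train (for example 4.1 with 4.1).

```python
import re
from typing import Tuple

# Assumed format: major.minor(maintenance + optional patch letters)
_VERSION = re.compile(r"^(\d+)\.(\d+)\((\d+)([a-z]*)\)$")


def parse_ucs_version(text: str) -> Tuple[int, int, int, str]:
    """Split a string such as '4.1(3d)' into (4, 1, 3, 'd')."""
    match = _VERSION.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    major, minor, maintenance, patch = match.groups()
    return int(major), int(minor), int(maintenance), patch


def is_at_least(installed: str, minimum: str) -> bool:
    """True if 'installed' is the same as or later than 'minimum'."""
    # Tuples compare element by element; an empty patch suffix sorts
    # before any letter, so '4.1(1)' counts as older than '4.1(1a)'.
    return parse_ucs_version(installed) >= parse_ucs_version(minimum)


if __name__ == "__main__":
    print(is_at_least("4.1(3d)", "4.1(1d)"))
    print(is_at_least("4.0(4c)", "4.0(4i)"))
```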

Fault Resilient Memory (FRM) for Cisco UCS

We can see that the annual incidence of uncorrectable errors is rising. Here is one way to address it with FRM.

ESXi supports reliable memory.

Some systems have reliable memory, which is a part of memory that is less likely to have hardware memory errors than other parts of the memory in the system. If the hardware exposes information about the different levels of reliability, ESXi might be able to achieve higher system reliability.

How to enable in Cisco UCS

Configuration is in BIOS policy / Advanced / RAS Memory

8 GB could be enough for the ESXi hypervisor …

This forces the Hypervisor and some core kernel processes to be mirrored between DIMMs so ESXi itself can survive the complete and total failure of a memory DIMM.

# esxcli hardware memory get
    Physical Memory: 540800864256 Bytes
    Reliable Memory: 8589934592 Bytes
    NUMA Node Count: 2
# esxcli system settings kernel list | grep useReliableMem
    useReliableMem  Bool  TRUE  TRUE  TRUE  System is aware of reliable memory.
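To put the numbers above in perspective, here is a minimal Python sketch that parses the `esxcli hardware memory get` output shown in this post and works out how much of the physical memory is reliable. The function name `parse_memory_report` is made up for illustration, and the sample text is simply copied from the output above.

```python
from typing import Dict

# Output of "esxcli hardware memory get", copied from this post.
SAMPLE = """\
Physical Memory: 540800864256 Bytes
Reliable Memory: 8589934592 Bytes
NUMA Node Count: 2
"""


def parse_memory_report(text: str) -> Dict[str, int]:
    """Map each 'Name: <number> ...' line to its integer value."""
    values: Dict[str, int] = {}
    for line in text.splitlines():
        key, separator, rest = line.partition(":")
        parts = rest.split()
        if separator and parts and parts[0].isdigit():
            values[key.strip()] = int(parts[0])
    return values


report = parse_memory_report(SAMPLE)
reliable_gib = report["Reliable Memory"] / 2**30
reliable_share = report["Reliable Memory"] / report["Physical Memory"]

if __name__ == "__main__":
    # prints "8 GiB reliable, 1.59% of physical memory"
    print(f"{reliable_gib:.0f} GiB reliable, {reliable_share:.2%} of physical memory")
```

So the 8 GB configured in the BIOS policy shows up as exactly 8 GiB of reliable memory, a small slice of the host's total.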

Configuring Reliable Memory in Per-virtual machine basis (2146595)

I can decide to configure reliable memory for individual VMs as well, not only the 8 GB for the hypervisor.

To turn on the feature per VM:

  1. Edit the .vmx file using a text editor
  2. Add the parameter:
    sched.mem.reliable = "True"
  3. Save and close the file
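If you have many VMs, the manual edit above can be scripted. Here is a minimal Python sketch that works on the text of a .vmx file and makes sure the parameter appears exactly once. The function name `enable_reliable_memory` and the sample .vmx content are illustrative only. Reading and writing the real file on the datastore is left out on purpose.

```python
KEY = "sched.mem.reliable"


def enable_reliable_memory(vmx_text: str) -> str:
    """Return the .vmx text with sched.mem.reliable = "True" set once."""
    kept = [
        line
        for line in vmx_text.splitlines()
        # Drop any existing entry for the key so it is not duplicated.
        if line.split("=", 1)[0].strip().lower() != KEY
    ]
    kept.append(f'{KEY} = "True"')
    return "\n".join(kept) + "\n"


if __name__ == "__main__":
    # Illustrative .vmx fragment, not taken from a real VM.
    sample = 'displayName = "demo-vm"\nmemSize = "4096"\n'
    print(enable_reliable_memory(sample), end="")
```

Running the function twice gives the same result as running it once, so it is safe to apply to files that were already edited by hand.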

Conclusion:

  • To enable Fault Resilient Memory (FRM) I had to disable ADDDC Sparing in BIOS policy / Advanced / RAS Memory / Memory RAS configuration
  • With ADDDC and Proactive HA I can avoid about 95% of failures. Personally, I prefer to use ADDDC
  • The best option would be to have both in a future firmware …

Interesting links:

Field Notice: FN – 70432 – Improved Memory RAS Features for UCS M5 Platforms – Software Upgrade Recommended

Memory Errors and Dell EMC PowerEdge YX4X Server Memory RAS Features

Memory Controller May Hang While in Virtual Lockstep – fix in UCSM 4.1(1c)

SAP HANA is very memory-intensive. ADDDC Sparing improves system reliability by holding memory in reserve so that it can be used in case other DIMMs fail. But it can bring another problem with it.

Memory Controller May Hang While in Virtual Lockstep

For more information, see the Intel® Xeon® Processor Scalable Family Specification Update, erratum SKX108:

Problem: Under complex microarchitectural conditions, a memory controller that is in Virtual Lockstep (VLS) may hang on a partial write transaction.

Workaround: It is possible for BIOS to contain a workaround; see below.

Implication: The memory controller hangs with a mesh-to-mem timeout Machine Check Exception (MSCOD=20h, MCACOD=400h). The memory controller hang may lead to other machine check timeouts that can lead to an unexpected system shutdown.

Cisco UCS Manager Release 4.1(1c) fixes it

Cisco applied a BIOS workaround for this scenario.

Defect ID | Symptom
CSCvr79388 | Cisco UCS servers stop responding and reboot after ADDDC virtual lockstep is activated. This results in #IERR and M2M timeout in the memory system. This issue is resolved.
CSCvr79396 | On Cisco UCS M5 servers, the Virtual lock step (VLS) sparing copy finishes early, leading to incorrect values in the lock step region. This issue is resolved.
Resolved Caveats in Release 4.1(1c)

I recommend updating ASAP; firmware 4.1(1c) is stable. Thanks, Cisco!

Proactive HA is working in VCSA 6.7 with Cisco UCS Manager Plugin for VMware vSphere HTML Client (beta Version 3.0(2))

Cisco has released the 3.0(2) beta version of the Cisco UCS Manager VMware vSphere HTML client plugin. This version works with vSphere 6.7. It’s currently running and enabled on 9 different clusters (290 hosts). It works great so far.

Here are the new and changed features in Release 3.0(2):

  • Included defect fixes
  • Added a new fault (F1706) to the Cisco UCS Provider failure conditions list
  • Added support for proactive High Availability for more than 100 hosts in vCenter

It is great to combine it with the new Cisco UCS 4.1.1 because of Intel Post Package Repair (PPR).

  • Intel Post Package Repair (PPR) uses additional spare capacity within the DDR4 DRAM to remap and replace faulty cell areas detected during system boot time. Remapping is permanent and persists through power-down and reboot.
  • Newer memories, such as double data rate version 4 (DDR4), include so-called post-package repair (PPR) capabilities. PPR capabilities enable a compatible memory controller to remap accesses from a faulty row of a memory module to a spare row of the memory module that is not faulty.
    • Hard-PPR permanently remaps accesses from a designated faulty row to a designated spare row. A Hard-PPR row remapping survives power cycles.
    • Soft-PPR remapping temporarily maps accesses from a faulty row to a designated spare row. A Soft-PPR row remapping will survive a “warm” reboot, but does not survive a power cycle.
  • You can enable it in BIOS policy / Memory RAS configuration / Select PPR type configuration / Hard PPR

  • To support “Alert F1706 – ADDDC Memory RAS Problem”, the following Memory RAS mode is necessary:
    ADDDC Sparing: System reliability is optimized by holding memory in reserve so that it can be used in case other DIMMs fail. This mode provides some memory redundancy, but does not provide as much redundancy as mirroring.
  • Cisco recommends upgrading to 4.0(4c) or later to expand memory fault coverage. Beginning with 4.0(4c) an additional RAS feature, Adaptive Double Device Data Correction (ADDDC Sparing) is available. It will be enabled and configured as “Platform Default” for Memory RAS configuration.
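The difference between Hard-PPR and Soft-PPR described above is easier to see in a toy model. Here is a minimal Python sketch. The class `DimmRowMap` and its methods are purely illustrative. It only tracks which remappings survive which kind of restart and does not reflect how a memory controller or the BIOS actually implements PPR.

```python
class DimmRowMap:
    """Toy model of faulty-row to spare-row remapping on one DIMM."""

    def __init__(self) -> None:
        self.hard = {}  # permanent remappings (Hard-PPR)
        self.soft = {}  # temporary remappings (Soft-PPR)

    def hard_ppr(self, faulty_row: int, spare_row: int) -> None:
        self.hard[faulty_row] = spare_row

    def soft_ppr(self, faulty_row: int, spare_row: int) -> None:
        self.soft[faulty_row] = spare_row

    def warm_reboot(self) -> None:
        # Both kinds of remapping survive a warm reboot.
        pass

    def power_cycle(self) -> None:
        # Only Hard-PPR remappings survive a power cycle.
        self.soft.clear()

    def resolve(self, row: int) -> int:
        """Return the row that an access to 'row' actually reaches."""
        if row in self.hard:
            return self.hard[row]
        return self.soft.get(row, row)


if __name__ == "__main__":
    dimm = DimmRowMap()
    dimm.hard_ppr(faulty_row=10, spare_row=1000)
    dimm.soft_ppr(faulty_row=20, spare_row=1001)
    dimm.power_cycle()
    print(dimm.resolve(10), dimm.resolve(20))
```

After the power cycle, row 10 still resolves to its spare row while row 20 falls back to the faulty row, which is why I select Hard PPR in the BIOS policy.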