Proactive HA Archives

Exciting Update: Cisco Unveils UCS Manager VMware vSphere 8U2 HTML Client Plugin Version 4.0(0)

I am thrilled to share my experience with the latest UCSM-plugin 4.0 for VMware vSphere 8U2, a remarkable tool that has significantly enhanced our virtualization management capabilities. I have tested its functionality across approximately 13 UCSM domains and 411 ESXi 8U2 hosts. A notable instance of its efficacy was observed with Alert F1236, where the Proactive HA feature seamlessly transitioned the Blade into Quarantine mode, showcasing the plugin’s advanced automation capabilities.

However, I did encounter a challenge with the configuration of Custom Alerts, particularly Alert F1705. Despite my efforts, Proactive HA failed to activate, suggesting a potential misconfiguration on my part. To streamline this process, I propose the integration of Alert F1705 into the default alert settings, thereby simplifying the setup and ensuring more efficient system monitoring.

The release of Cisco’s 4.0(0) version of the UCS Manager VMware vSphere 8U2 HTML remote client plugin marks a significant advancement in the field of virtualization administration. This plugin not only offers a comprehensive physical view of the UCS hardware inventory through the HTML client but also enhances the overall management and monitoring of the Cisco UCS physical infrastructure.

Key functionalities provided by this plugin include:

Detailed Physical Hierarchy View: Gain a clear understanding of the Cisco UCS physical structure.
Comprehensive Inventory Insights: Access detailed information on inventory, installed firmware, faults, and power and temperature statistics.
Physical Server to ESXi Host Mapping: Easily correlate your ESXi hosts with their corresponding physical servers.
Firmware Management: Efficiently manage firmware for both B and C series servers.
Direct Access to Cisco UCS Manager GUI: Launch the Cisco UCS Manager GUI directly from the plugin.
KVM Console Integration: Instantly launch KVM consoles of UCS servers for immediate access and control.
Locator LED Control: Switch the state of the locator LEDs as needed for enhanced hardware identification.
Proactive HA Fault Configuration: Customize and configure faults used in Proactive HA for improved system resilience.

Links

Detailed Release Notes

Software download link

Please see the User Guide for specific information on installing and using the plugin with the vSphere HTML client.

Add F1705 Alert to Cisco UCS Manager Plugin 4.0(0)

New Cisco UCS firmware makes it possible to receive notifications about F1705 alerts – Rank VLS.

In the latest version of the Cisco UCS Manager Plugin for VMware vSphere HTML Client (Version 4.0(0)) we can add custom faults for Proactive HA monitoring. How to do it?

Cisco UCS / Proactive HA Registration / Registered Fault / Add / ADDDC_Memory_Rank_VLS

If you can’t add it, you need to unregister the UCS Manager Plugin first.

Cisco UCS / Proactive HA Registration / vCenter server credentials / Register

Cisco UCS / Proactive HA Registration / Register

How can I check it? Edit Proactive HA / Providers

It is better to use the name “ADDDC_Memory_Rank_VLS” without spaces. In my screenshot I used “My F1705 Alerts”.

Adding a custom alert is only possible while the Cisco UCS provider is unregistered, so it is better to do it immediately after installing the Cisco UCS Manager Plugin.

Now I can decide whether to block F1705 or not. I personally prefer to have the F1705 alert under Proactive HA. Then I only restart the blades with F1705. During reboot, Hard-PPR permanently remaps accesses from a designated faulty row to a designated spare row.

Links:

Field Notice: FN – 72368 – Some DIMMs Might Fail Prematurely Due to a Manufacturing Deviation – Hardware Upgrade Available

Cisco announced Field Notice: FN – 72368 – Some DIMMs Might Fail Prematurely Due to a Manufacturing Deviation – Hardware Upgrade Available

My personal recommendation: please use ADDDC and PPR. It could prevent hardware failures … UCS-ML-128G4RT-H is in its 2nd revision from 28-Oct-22.

Problem Description

A limited number of DIMMs shipped from Cisco are impacted by a known deviation in the memory supplier’s manufacturing process. This deviation might result in a higher rate of failure.

Background

DIMM manufacturers compose their DIMMs of multiple memory modules to reach the desired capacity. A 16GB DIMM might be composed of the same modules that a 32GB DIMM is composed of. In this case, a manufacturing deviation in specific modules impacts 16GB, 32GB, 64GB, and 128GB DIMMs. This deviation was contained to a specific date range, and the DIMMs which use these chips were manufactured during the middle to end of 2020. Since the discovery of this deviation, additional limits have been imposed on the manufacturing process to ensure that future DIMMs are not exposed to this process variation.

Problem Symptom

Most DIMMs with this manufacturing deviation will exhibit persistent correctable memory errors. If left untreated, the DIMMs might eventually encounter an uncorrectable memory event. If encountered during runtime, uncorrectable errors will cause a sudden unexpected server reset. If encountered during Power-On Self-Test (POST), the DIMM will be mapped out and the total available memory reduced. In some cases a boot error might be seen.

Various DIMM Reliability, Availability, and Serviceability (RAS) features or even operating system features might mask the extent of these correctable errors. It is recommended to check your DIMMs for exposure using the Serial Number Validation Tool described in the Serial Number Validation section of this field notice. Only specific DIMMs are impacted by this issue, so do not rely solely on the DIMM error count to judge exposure.

Workaround/Solution

This is a hardware failure. A replacement is strongly recommended in order to avoid potential for unexpected server failure.

Cisco UCS Manager Plugin for VMware vSphere HTML Client (Version 3.0(6))

Cisco has released the 3.0(6) version of the Cisco UCS Manager VMware vSphere HTML client plugin. The UCS Manager vSphere HTML client plugin enables a virtualization administrator to view, manage, and monitor the Cisco UCS physical infrastructure. The plugin provides a physical view of the UCS hardware inventory on the HTML client.

I reported the bug “Host not going into monitoring state vCenter restart”. Thank you for the fix.

Release 3.0(6)

Here are the new features in Release 3.0(6):

Custom fault addition for proactive HA monitoring
Resolved host not going into monitoring state vCenter restart
Included defect fixes

VMware vSphere HTML Client Releases

Cisco UCS Manager plug-in is compatible with the following vSphere HTML Client releases:

VMware vSphere HTML Client Version	Cisco UCS Manager Plugin for VMware vSphere Version
6.7	3.0(1), 3.0(2), 3.0(3), 3.0(4), 3.0(5), 3.0(6)
7.0	3.0(4), 3.0(5), 3.0(6)
7.0u1, 7.0u2	3.0(5), 3.0(6)

Note

VMware vSphere HTML Client Version 7.0u3 is not supported.
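The compatibility table above can be turned into a small lookup, for example as a pre-upgrade check. This is only a sketch: the dictionary is transcribed from the table, and the function name `is_supported` is my own, not part of any Cisco or VMware tool.

```python
# Compatibility matrix transcribed from the table above.
# Keys: vSphere HTML Client version. Values: plugin releases listed for it.
COMPATIBILITY = {
    "6.7": {"3.0(1)", "3.0(2)", "3.0(3)", "3.0(4)", "3.0(5)", "3.0(6)"},
    "7.0": {"3.0(4)", "3.0(5)", "3.0(6)"},
    "7.0u1": {"3.0(5)", "3.0(6)"},
    "7.0u2": {"3.0(5)", "3.0(6)"},
}


def is_supported(vsphere_version: str, plugin_version: str) -> bool:
    """Return True if the table lists this plugin release for this client version.

    Versions that are not in the table (for example 7.0u3) are treated
    as unsupported.
    """
    return plugin_version in COMPATIBILITY.get(vsphere_version, set())


if __name__ == "__main__":
    print(is_supported("7.0", "3.0(4)"))
    print(is_supported("7.0u3", "3.0(6)"))
```

Keeping the matrix as plain data makes it easy to extend when a new plugin release is published.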

More info here.

Field Notice: FN – 70432 – Improved Memory RAS Features for UCS M5 Platforms – Software Upgrade Recommended – 4.2(1i)

I recommend upgrading to a Server Firmware Bundle that includes ADDDC Sparing to expand the memory error coverage – 4.2(1i). More info:

Handling RAS events

When BANK-level or RANK-level RAS events are observed (and PPR is enabled):

Verify that no other DIMM faults are present (for example, an uncorrectable error).
Schedule a maintenance window (MW).
During the MW, put the host in maintenance mode and reboot the server to attempt a permanent repair of the DIMM using Post Package Repair (PPR).
1. If no errors occur after reboot, PPR was successful, and the server can be put back into use.
2. If new ADDDC events occur, repeat the reboot process to perform additional permanent repairs with PPR.
If an uncorrectable error occurs after reboot, replace the DIMM.
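The handling steps above can be written down as a small decision helper. This is a minimal sketch for a runbook, not Cisco tooling: the `Action` names and both functions are hypothetical, and what to do when other DIMM faults are already present is not covered by the steps, so the sketch only flags that case.

```python
from enum import Enum


class Action(Enum):
    # Hypothetical labels for the outcomes described in the steps above.
    HANDLE_OTHER_FAULTS_FIRST = "other DIMM faults present, outside this procedure"
    REBOOT_IN_MAINTENANCE_WINDOW = "schedule MW, enter maintenance mode, reboot for PPR"
    RETURN_TO_SERVICE = "PPR successful, put server back into use"
    REPEAT_REBOOT = "new ADDDC events, reboot again for another PPR pass"
    REPLACE_DIMM = "uncorrectable error after reboot, replace the DIMM"


def before_reboot(other_dimm_faults: bool) -> Action:
    """Decide what to do when a BANK or RANK RAS event is seen (PPR enabled)."""
    if other_dimm_faults:
        return Action.HANDLE_OTHER_FAULTS_FIRST
    return Action.REBOOT_IN_MAINTENANCE_WINDOW


def after_reboot(uncorrectable_error: bool, new_adddc_events: bool) -> Action:
    """Decide the next step once the server is back up after the PPR reboot."""
    if uncorrectable_error:
        return Action.REPLACE_DIMM
    if new_adddc_events:
        return Action.REPEAT_REBOOT
    return Action.RETURN_TO_SERVICE


if __name__ == "__main__":
    print(before_reboot(other_dimm_faults=False).value)
    print(after_reboot(uncorrectable_error=False, new_adddc_events=True).value)
```

The uncorrectable-error check comes first because it ends the repair loop regardless of any further ADDDC events.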

Release 4.1(1) firmware generates a Major severity fault for all BANK and RANK RAS events so that proactive action can be taken relative to a critical ADDDC defect CSCvr79388.

Releases 4.1(2) and 4.1(3) firmware generate a Major severity fault for RANK RAS events on advanced CPU SKUs. BANK RAS events will generate a fault for standard CPU SKUs.

Problem Symptom

Due to memory DIMM errors and architectural changes in memory error handling on Intel Xeon Scalable processors (formerly code-named “Skylake Server”) and 2nd Gen Intel Xeon Scalable processors (formerly code-named “Cascade Lake Server”), Cisco UCS M5 customers that experience memory DIMM errors might experience a higher rate of runtime uncorrectable memory errors than they experienced on previous generations with default SDDC Memory RAS mode.

Workaround/Solution

Cisco recommends that you upgrade to a Server Firmware Bundle that includes ADDDC Sparing to expand the memory error coverage. Refer to this table for supported and recommended firmware that includes ADDDC Sparing.

	Server Firmware That Supports ADDDC Sparing	Recommended Server Firmware
UCS M5 Blades and Integrated UCS M5 Rack Servers	3.2(3p) or later, 4.0(4i) or later, 4.1(1d) or later	4.1(3d) or later
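To check a whole inventory against this table, the version strings can be parsed and compared. The sketch below assumes the `major.minor(patch+letter)` format seen in the table and assumes that each "or later" applies within its own release train. Treating trains newer than 4.1 as supported is my own assumption, and all function names are hypothetical.

```python
import re

# Matches strings such as "4.1(3d)": major.minor(patch + optional letters).
_VERSION_RE = re.compile(r"^(\d+)\.(\d+)\((\d+)([a-z]*)\)$")


def parse_version(text: str) -> tuple:
    """Parse a UCS firmware version string into a comparable tuple."""
    match = _VERSION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized firmware version: {text!r}")
    major, minor, patch, letter = match.groups()
    return (int(major), int(minor), int(patch), letter)


# Minimum release per train, transcribed from the table above.
MIN_ADDDC = {
    (3, 2): parse_version("3.2(3p)"),
    (4, 0): parse_version("4.0(4i)"),
    (4, 1): parse_version("4.1(1d)"),
}
RECOMMENDED_MIN = parse_version("4.1(3d)")


def supports_adddc_sparing(text: str) -> bool:
    """True if the release meets the per-train minimum from the table."""
    version = parse_version(text)
    train = version[:2]
    if train in MIN_ADDDC:
        return version >= MIN_ADDDC[train]
    # Assumption: trains newer than 4.1 include ADDDC Sparing, older ones do not.
    return train > (4, 1)


def is_recommended(text: str) -> bool:
    """True if the release is at or above the recommended 4.1(3d)."""
    return parse_version(text) >= RECOMMENDED_MIN


if __name__ == "__main__":
    for fw in ("3.2(3p)", "4.0(4a)", "4.1(3d)"):
        print(fw, supports_adddc_sparing(fw), is_recommended(fw))
```

The patch letter is compared as a plain string, which is enough for single-letter suffixes but may need refinement for other naming schemes.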

Defect ID	Headline
CSCvq38078	UCSM:Default option for “SelectMemory RAS configuration” changed to ADDDC sparing

Cisco UCS Manager Plugin for VMware vSphere HTML Client (Version 3.0(4))

Cisco has released the 3.0(4) version of the Cisco UCS Manager VMware vSphere HTML client plugin. The UCS Manager vSphere HTML client plugin enables a virtualization administrator to view, manage, and monitor the Cisco UCS physical infrastructure. The plugin provides a physical view of the UCS hardware inventory on the HTML client.

Here are the new features in Release 3.0(4):

Support for VMware vSphere HTML Client 7.0
Included defect fix

Resolved Bugs

Defect ID	Symptom	First Affected in Release	Resolved in Release
CSCvv01618	UCSM domain not visible in HTML plug-in 3.0(2).	3.0(2)	3.0(4)

More info here.

Memory Controller May Hang While in Virtual Lockstep – fix in UCSM 4.1(1c)

SAP HANA is very memory-intensive. With ADDDC Sparing we can increase system reliability. It works by holding memory in reserve so that it can be used in case other DIMMs fail. But there could be another problem with it.

Memory Controller May Hang While in Virtual Lockstep

For more information – Intel® Xeon® Processor Scalable Family Specification Update, # SKX108:

Problem: Under complex microarchitectural conditions, a memory controller that is in VirtualLockstep (VLS) may hang on a partial write transaction.

Workaround: It is possible for the BIOS to contain a workaround; see below.

Implication: The memory controller hangs with a mesh-to-mem timeout Machine Check Exception (MSCOD=20h, MCACOD=400h). The memory controller hang may lead to other machine check timeouts that can lead to an unexpected system shutdown.
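When a raw machine check status value is available from a crash log, the two codes named above can be extracted from it. This is a minimal sketch that assumes the standard IA32_MCi_STATUS layout (MCACOD in bits 15:0, MSCOD in bits 31:16, valid flag in bit 63). The function names and the sample value are made up for illustration.

```python
def decode_mc_status(status: int) -> dict:
    """Split a 64-bit IA32_MCi_STATUS value into the fields used here."""
    return {
        "valid": bool((status >> 63) & 1),   # bit 63: register holds a valid error
        "mscod": (status >> 16) & 0xFFFF,    # bits 31:16: model-specific error code
        "mcacod": status & 0xFFFF,           # bits 15:0: MCA error code
    }


def matches_skx108_signature(status: int) -> bool:
    """True if the status carries the MSCOD=20h / MCACOD=400h pair from SKX108."""
    fields = decode_mc_status(status)
    return fields["valid"] and fields["mscod"] == 0x20 and fields["mcacod"] == 0x400


if __name__ == "__main__":
    # Hypothetical sample value built from the codes above, not taken from a real log.
    sample = (1 << 63) | (0x20 << 16) | 0x400
    print(decode_mc_status(sample))
    print(matches_skx108_signature(sample))
```

A matching code pair is only a hint. Other conditions might report the same codes, so the firmware release and the Cisco defect IDs should still be checked.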

Cisco UCS Manager, Release 4.1(1c) fixes it

Cisco applied a BIOS workaround for this scenario.

Defect ID	Symptom
CSCvr79388	Cisco UCS servers stop responding and reboot after ADDDC virtual lockstep is activated. This results in #IERR and M2M timeout in the memory system. This issue is resolved.
CSCvr79396	On Cisco UCS M5 servers, the Virtual lock step (VLS) sparing copy finishes early, leading to incorrect values in the lock step region. This issue is resolved.

Resolved Caveats in Release 4.1(1c)

I recommend updating ASAP; firmware 4.1(1c) is stable. Thanks, Cisco!

How to Configure vSphere 6.7 Proactive HA with Cisco UCS Manager Plugin for VMware vSphere?

Proactive HA is working in VCSA 6.7 with Cisco UCS Manager Plugin for VMware vSphere HTML Client (beta Version 3.0(2))

As I wrote in a previous blog post, the latest Cisco UCS Manager Plugin works with vCenter 6.7 U3b.

Install Cisco UCS Manager Plugin

Install Cisco UCS Manager Plugin for VMware vSphere HTML Client (beta Version 3.0(2))
User guide is here: UCSM_Plugin_VMware_vSphere_Web_Client_User_Guide_3_x.pdf

vSphere Web Client – Enable Proactive HA

From vSphere Web Client -> Cluster Properties -> Configure -> vSphere Availability -> Proactive HA is Turned OFF – click Edit. Note that vSphere Proactive HA is disabled by default.

Automation Level – Determine whether host quarantine or maintenance mode and VM migrations are recommendations or automatic.
- Manual – vCenter Server suggests migration recommendations for virtual machines.
- Automated – Virtual machines are migrated to healthy hosts and degraded hosts are entered into quarantine or maintenance mode depending on the configured Proactive HA automation level.

Remediation – Determine what happens to partially degraded hosts.
- Quarantine mode – for all failures. Balances performance and availability, by avoiding the usage of partially degraded hosts provided that virtual machine performance is unaffected.
- Mixed mode – Quarantine mode for moderate and Maintenance mode for severe failure (Mixed). Balances performance and availability, by avoiding the usage of moderately degraded hosts provided that virtual machine performance is unaffected. Ensures that virtual machines do not run on severely failed hosts.
- Maintenance mode – for all failures. Ensures that virtual machines do not run on partially failed hosts.
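The three remediation options above boil down to a mapping from failure severity to host action. The following sketch models that mapping in plain Python. It does not call the vSphere API, and the mode and severity strings are labels I chose for illustration.

```python
# Remediation mode -> {failure severity -> host action}, as described above.
REMEDIATION = {
    "quarantine": {"moderate": "quarantine", "severe": "quarantine"},
    "mixed": {"moderate": "quarantine", "severe": "maintenance"},
    "maintenance": {"moderate": "maintenance", "severe": "maintenance"},
}


def remediation_action(mode: str, severity: str) -> str:
    """Return the host action for a given remediation mode and failure severity."""
    try:
        return REMEDIATION[mode][severity]
    except KeyError:
        raise ValueError(f"unknown mode/severity: {mode!r}/{severity!r}") from None


if __name__ == "__main__":
    for mode in REMEDIATION:
        for severity in ("moderate", "severe"):
            print(mode, severity, "->", remediation_action(mode, severity))
```

With Automation Level set to Manual the result would only be a recommendation. With Automated, vCenter applies it and migrates the virtual machines.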

Select the Cisco UCS Provider – do NOT block failure conditions

How does Proactive HA work?

With Automation Level set to Automated and Remediation set to Mixed mode, after a HW failure Proactive HA puts the host into Quarantine mode and migrates all VMs off the ESXi host with the HW failure: