Field Notice: FN – 70432 – Improved Memory RAS Features for UCS M5 Platforms – Software Upgrade Recommended – 4.2(1i)

I recommend upgrading to Server Firmware Bundle 4.2(1i), which includes ADDDC Sparing to expand memory error coverage. More info:

Handling RAS events

When BANK-level or RANK-level RAS events are observed (and PPR is enabled):

  1. Verify that no other DIMM faults are present (for example, an uncorrectable error).
  2. Schedule a maintenance window (MW).
  3. During the MW, put the host in maintenance mode and reboot the server to attempt a permanent repair of the DIMM using Post Package Repair (PPR).
    1. If no errors occur after reboot, PPR was successful, and the server can be put back into use.
    2. If new ADDDC events occur, repeat the reboot process to perform additional permanent repairs with PPR.
  4. If an uncorrectable error occurs after reboot, replace the DIMM.
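
The procedure above can be sketched as a small decision function. This is only an illustration of the listed steps, not a Cisco tool: the names `Action` and `next_action` and the boolean inputs are hypothetical, and the "stop" outcome for a pre-existing DIMM fault is an assumption, since the notice only says to verify that none is present.

```python
from enum import Enum


class Action(Enum):
    # Hypothetical labels for the outcomes described in the steps above.
    STOP_OTHER_FAULT = "another DIMM fault is present; handle it first (assumption)"
    REBOOT_FOR_PPR = "schedule a MW, enter maintenance mode, reboot to attempt PPR"
    RETURN_TO_SERVICE = "PPR succeeded; put the server back into use"
    REPLACE_DIMM = "replace the DIMM"


def next_action(other_dimm_fault: bool,
                rebooted: bool,
                uncorrectable_after_reboot: bool = False,
                new_adddc_events: bool = False) -> Action:
    """Return the next step after a BANK- or RANK-level RAS event (PPR enabled)."""
    if not rebooted:
        # Step 1: check for other DIMM faults before planning the reboot.
        if other_dimm_fault:
            return Action.STOP_OTHER_FAULT
        # Steps 2 and 3: maintenance window, then reboot to attempt PPR.
        return Action.REBOOT_FOR_PPR
    # Step 4: an uncorrectable error after the reboot means replacement.
    if uncorrectable_after_reboot:
        return Action.REPLACE_DIMM
    # Step 3.2: new ADDDC events mean another reboot for additional repairs.
    if new_adddc_events:
        return Action.REBOOT_FOR_PPR
    # Step 3.1: no errors after the reboot, so PPR was successful.
    return Action.RETURN_TO_SERVICE
```

For example, `next_action(other_dimm_fault=False, rebooted=True, new_adddc_events=True)` returns `Action.REBOOT_FOR_PPR`, which matches the "repeat the reboot process" step.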

Release 4.1(1) firmware generates a Major severity fault for all BANK and RANK RAS events so that proactive action can be taken against the critical ADDDC defect CSCvr79388.

Release 4.1(2) and 4.1(3) firmware generates a Major severity fault for RANK RAS events on advanced CPU SKUs. On standard CPU SKUs, BANK RAS events generate a fault.

Problem Symptom

Because of architectural changes in memory error handling on Intel Xeon Scalable processors (formerly code-named “Skylake Server”) and 2nd Gen Intel Xeon Scalable processors (formerly code-named “Cascade Lake Server”), Cisco UCS M5 customers who encounter DIMM errors might see a higher rate of runtime uncorrectable memory errors than on previous generations when the default SDDC Memory RAS mode is used.

Workaround/Solution

Cisco recommends that you upgrade to a Server Firmware Bundle that includes ADDDC Sparing to expand the memory error coverage. Refer to this table for supported and recommended firmware that includes ADDDC Sparing.

Server Firmware That Supports ADDDC Sparing / Recommended Server Firmware

UCS M5 Blades and Integrated UCS M5 Rack Servers:
  3.2(3p) or later
  4.0(4i) or later
  4.1(1d) or later
  4.1(3d) or later
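
To check an installed version against "X or later" entries like these, a version string can be parsed and compared within its release train. The sketch below is a minimal illustration under stated assumptions: the helper names `parse_version` and `confirmed_by_list` are hypothetical, version strings are assumed to look like `4.1(3d)` with a single lowercase patch letter, and a version is only compared against entries in the same train (for example 4.1). Trains that do not appear in the list are reported as not confirmed rather than guessed.

```python
import re

# The four entries from the list above, without the "or later" suffix.
LISTED = ["3.2(3p)", "4.0(4i)", "4.1(1d)", "4.1(3d)"]

# Assumed format: major.minor(maintenance + one lowercase patch letter).
_VERSION = re.compile(r"^(\d+)\.(\d+)\((\d+)([a-z])\)$")


def parse_version(text):
    """Split '4.1(3d)' into a train (4, 1) and a build (3, 'd')."""
    match = _VERSION.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    major, minor, maintenance, patch = match.groups()
    return (int(major), int(minor)), (int(maintenance), patch)


def confirmed_by_list(version, minimums=LISTED):
    """True if some listed entry in the same train is at or below `version`."""
    train, build = parse_version(version)
    for entry in minimums:
        entry_train, entry_build = parse_version(entry)
        # Builds compare as (number, letter) tuples, so (3, 'd') < (3, 'h').
        if train == entry_train and build >= entry_build:
            return True
    return False
```

With this, `confirmed_by_list("4.0(4h)")` is `False` because 4.0(4h) precedes 4.0(4i), while `confirmed_by_list("4.1(2a)")` is `True` because it is later than 4.1(1d).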

Defect ID: CSCvq38078
Headline: UCSM: Default option for “Select Memory RAS configuration” changed to ADDDC sparing


Author: Daniel Micanek

Senior Service Architect, SAP Platform Services Team at TietoEVRY | vExpert ⭐⭐ | vExpert NSX | VCIX-DCV | VCAP-NV Design | VCAP-DCV Design+Deploy | VCP-DCV/NV/CMA | NCIE-DP | OCP | Azure Solutions Architect | Certified Kubernetes Administrator (CKA)