I recommend upgrading to a Server Firmware Bundle that includes ADDDC Sparing, specifically 4.2(1i), to expand the memory error coverage. More info:
Handling RAS events
When BANK-level or RANK-level RAS events are observed (and PPR is enabled):
- Verify that no other DIMM faults are present (for example, an uncorrectable error).
- Schedule a maintenance window (MW).
- During the MW, put the host in maintenance mode and reboot the server to attempt a permanent repair of the DIMM using Post Package Repair (PPR).
- If no errors occur after reboot, PPR was successful, and the server can be put back into use.
- If new ADDDC events occur, repeat the reboot process to perform additional permanent repairs with PPR.
- If an uncorrectable error occurs after reboot, replace the DIMM.
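The steps above amount to a small decision procedure, sketched here in Python. This is only an illustration, not a Cisco tool: the `Action` enum, the `next_action` function, and its two boolean inputs are hypothetical names, and how the flags are gathered (fault queries, log review) is left out. The sketch assumes PPR is enabled and assumes that an uncorrectable error at any point leads to DIMM replacement.

```python
from enum import Enum


class Action(Enum):
    REPLACE_DIMM = "replace the DIMM"
    REBOOT_FOR_PPR = "schedule a maintenance window and reboot to attempt PPR"
    RETURN_TO_SERVICE = "return the server to service"


def next_action(uncorrectable_error: bool, adddc_event: bool) -> Action:
    """Pick the next step for one DIMM from its currently observed state.

    Both flags are hypothetical inputs; collecting them is outside this sketch.
    """
    if uncorrectable_error:
        # Assumption: any uncorrectable error means the DIMM is replaced.
        return Action.REPLACE_DIMM
    if adddc_event:
        # A BANK- or RANK-level RAS event: reboot so PPR can attempt a repair.
        # New ADDDC events after a reboot land here again (repeat the reboot).
        return Action.REBOOT_FOR_PPR
    # No errors after the reboot: PPR was successful.
    return Action.RETURN_TO_SERVICE
```

Calling `next_action` again after each reboot mirrors the repeat-until-clean loop in the list: the server goes back into use only once no ADDDC event and no uncorrectable error is reported.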
Release 4.1(1) firmware generates a Major severity fault for all BANK and RANK RAS events so that proactive action can be taken relative to a critical ADDDC defect CSCvr79388.
Release 4.1(2) and 4.1(3) firmware generate a Major severity fault for RANK RAS events on advanced CPU SKUs. BANK RAS events will generate a fault for standard CPU SKUs.
Problem Symptom
Due to memory DIMM errors and architectural changes in memory error handling on Intel Xeon Scalable processors (formerly code-named “Skylake Server”) and 2nd Gen Intel Xeon Scalable processors (formerly code-named “Cascade Lake Server”), Cisco UCS M5 customers who encounter memory DIMM errors might see a higher rate of runtime uncorrectable memory errors than they saw on previous generations with the default SDDC Memory RAS mode.
Workaround/Solution
Cisco recommends that you upgrade to a Server Firmware Bundle that includes ADDDC Sparing to expand the memory error coverage. Refer to this table for supported and recommended firmware that includes ADDDC Sparing.
| Server | Server Firmware That Supports ADDDC Sparing | Recommended Server Firmware |
|---|---|---|
| UCS M5 Blades and Integrated UCS M5 Rack Servers | 3.2(3p) or later, 4.0(4i) or later, 4.1(1d) or later | 4.1(3d) or later |

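To check an installed version string against the firmware table above, a comparison along the following lines could be used. This is a rough sketch under stated assumptions: the helper names are hypothetical, the parser only handles versions shaped like `4.1(3d)`, each "or later" entry is read as a minimum within its own release train, and any train newer than 4.1 is assumed to include ADDDC Sparing.

```python
import re

# Matches versions such as "4.1(3d)": major.minor(patch + letter suffix).
_VERSION_RE = re.compile(r"^(\d+)\.(\d+)\((\d+)([a-z]*)\)$")


def parse_version(text):
    """Turn '4.1(3d)' into a tuple that sorts in release order."""
    m = _VERSION_RE.match(text.strip())
    if m is None:
        raise ValueError(f"unrecognized firmware version: {text!r}")
    major, minor, patch, letters = m.groups()
    # Suffix length is compared before the suffix itself, so 'z' < 'aa'.
    return (int(major), int(minor), int(patch), len(letters), letters)


# Minimums taken from the table; keyed by (major, minor) release train.
MIN_ADDDC = {
    (3, 2): parse_version("3.2(3p)"),
    (4, 0): parse_version("4.0(4i)"),
    (4, 1): parse_version("4.1(1d)"),
}
RECOMMENDED = parse_version("4.1(3d)")


def supports_adddc_sparing(text):
    version = parse_version(text)
    train = version[:2]
    if train in MIN_ADDDC:
        return version >= MIN_ADDDC[train]
    # Assumption: trains newer than the last listed one include the feature.
    return train > max(MIN_ADDDC)


def is_recommended(text):
    return parse_version(text) >= RECOMMENDED
```

For example, `supports_adddc_sparing("4.0(4h)")` is false because it falls below the 4.0 train minimum, while `is_recommended("4.2(1i)")` is true.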
| Defect ID | Headline |
|---|---|
| CSCvq38078 | UCSM:Default option for “SelectMemory RAS configuration” changed to ADDDC sparing |