Memory Controller May Hang While in Virtual Lockstep – fixed in UCSM 4.1(1c)

SAP HANA is very memory-intensive. ADDDC Sparing improves system reliability by holding memory in reserve so that it can be used if other DIMMs fail. However, it can run into another problem:

Memory Controller May Hang While in Virtual Lockstep

For more information, see the Intel® Xeon® Processor Scalable Family Specification Update, erratum SKX108:

Problem: Under complex microarchitectural conditions, a memory controller that is in Virtual Lockstep (VLS) may hang on a partial write transaction.

Workaround: It is possible for the BIOS to contain a workaround; see below.

Implication: The memory controller hangs with a mesh-to-mem timeout Machine Check Exception (MSCOD=20h, MCACOD=400h). The memory controller hang may lead to other machine check timeouts that can lead to an unexpected system shutdown.
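To recognize this signature in a logged machine check, the two codes can be pulled out of the raw status value. The following is a minimal sketch, assuming the standard architectural layout of IA32_MCi_STATUS (MCACOD in bits 15:0, MSCOD in bits 31:16, valid flag in bit 63). The function names and the example value are hypothetical and are not taken from the erratum or from any Cisco tool.

```python
# Sketch only: field positions follow the architectural IA32_MCi_STATUS layout.
MCACOD_MASK = 0xFFFF      # bits 15:0, MCA error code
MSCOD_SHIFT = 16          # bits 31:16, model-specific error code
MSCOD_MASK = 0xFFFF
VALID_BIT = 1 << 63       # bit 63, the register holds a valid error


def decode_mci_status(status: int) -> dict:
    """Split a 64-bit machine check status value into the fields used here."""
    return {
        "valid": bool(status & VALID_BIT),
        "mcacod": status & MCACOD_MASK,
        "mscod": (status >> MSCOD_SHIFT) & MSCOD_MASK,
    }


def matches_skx108(status: int) -> bool:
    """True if the status carries the codes quoted above (MSCOD=20h, MCACOD=400h)."""
    fields = decode_mci_status(status)
    return fields["valid"] and fields["mcacod"] == 0x400 and fields["mscod"] == 0x20


# Hypothetical status value built from the two codes, for illustration only.
example_status = VALID_BIT | (0x20 << MSCOD_SHIFT) | 0x400
print(matches_skx108(example_status))
```

A match only says that the logged codes are the same as those in the erratum. It does not prove that the hang was caused by Virtual Lockstep.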

Cisco UCS Manager Release 4.1(1c) fixes it

Cisco applied a BIOS workaround for this scenario.

Defect ID | Symptom
CSCvr79388 | Cisco UCS servers stop responding and reboot after ADDDC virtual lockstep is activated. This results in #IERR and M2M timeout in the memory system. This issue is resolved.
CSCvr79396 | On Cisco UCS M5 servers, the Virtual lock step (VLS) sparing copy finishes early, leading to incorrect values in the lock step region. This issue is resolved.

Resolved Caveats in Release 4.1(1c)
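To check whether a given firmware string is at or above the fixed release, the version can be parsed and compared. This is a rough sketch, assuming versions follow the `major.minor(maintenance + patch letter)` pattern seen in 4.1(1c). The helper names are hypothetical, and the comparison does not account for fixes that may have been backported to older release trains.

```python
import re

# Assumed pattern, e.g. "4.1(1c)": major.minor(maintenance + patch letters)
_VERSION_RE = re.compile(r"^(\d+)\.(\d+)\((\d+)([a-z]+)\)$")


def parse_ucs_version(text: str) -> tuple:
    """Turn a string such as '4.1(1c)' into a tuple that sorts in release order."""
    match = _VERSION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    major, minor, maintenance, patch = match.groups()
    # The patch length comes first so that a two-letter patch sorts after 'z'.
    return (int(major), int(minor), int(maintenance), len(patch), patch)


FIXED_IN = parse_ucs_version("4.1(1c)")


def has_vls_fix(version: str) -> bool:
    """True if the version is 4.1(1c) or later in plain version order."""
    return parse_ucs_version(version) >= FIXED_IN


for candidate in ("4.0(4g)", "4.1(1a)", "4.1(1c)", "4.1(2a)"):
    print(candidate, has_vls_fix(candidate))
```

The release notes remain the authoritative source; a check like this is only a quick filter when looking over many domains.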

I recommend updating as soon as possible; firmware 4.1(1c) is stable. Thanks, Cisco!

Author: Daniel Micanek

Senior Service Architect, SAP Platform Services Team at Tietoevry | SUSE SCA | vExpert ⭐⭐⭐⭐⭐ | vExpert NSX | VCIX-DCV/NV | VCAP-DCV/NV Design+Deploy | VCP-DCV/NV/CMA/TKO/DTM | NCIE-DP | OCP | Azure Solutions Architect | Certified Kubernetes Administrator (CKA)