Fault Resilient Memory (FRM) for Cisco UCS

The annual incidence of uncorrectable memory errors is rising. One way to reduce their impact is FRM; here is how to set it up.

ESXi supports reliable memory.

Some systems have reliable memory, which is a part of memory that is less likely to have hardware memory errors than other parts of the memory in the system. If the hardware exposes information about the different levels of reliability, ESXi might be able to achieve higher system reliability.

How to enable it in Cisco UCS

The setting is in BIOS policy / Advanced / RAS Memory.

8 GB could be enough for the ESXi hypervisor …

This forces the hypervisor and some core kernel processes to be mirrored between DIMMs, so ESXi itself can survive the complete failure of a DIMM.

# esxcli hardware memory get
    Physical Memory: 540800864256 Bytes
    Reliable Memory: 8589934592 Bytes
    NUMA Node Count: 2 
# esxcli system settings kernel list | grep useReliableMem
    useReliableMem  Bool  TRUE  TRUE  TRUE  System is aware of reliable memory.
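
To put the numbers above into perspective, here is a minimal Python sketch that parses the text printed by `esxcli hardware memory get` and reports how much of the physical memory is reliable. The function names `parse_memory_report` and `reliable_summary` are made up for illustration, and the sample text is simply the output shown above; on a real host the layout may differ, so treat this as a sketch, not a supported tool.

```python
import re


def parse_memory_report(text):
    """Collect 'Key: <integer> [Bytes]' lines into a dict of integers."""
    values = {}
    for line in text.splitlines():
        match = re.match(r"\s*(.+?):\s*(\d+)(?:\s+Bytes)?\s*$", line)
        if match:
            values[match.group(1)] = int(match.group(2))
    return values


def reliable_summary(text):
    """Return physical and reliable memory in GiB plus the reliable share."""
    report = parse_memory_report(text)
    physical = report["Physical Memory"]
    reliable = report["Reliable Memory"]
    gib = 1024 ** 3
    return {
        "physical_gib": physical / gib,
        "reliable_gib": reliable / gib,
        "reliable_fraction": reliable / physical,
    }


# Sample taken from the esxcli output shown above.
SAMPLE = """\
    Physical Memory: 540800864256 Bytes
    Reliable Memory: 8589934592 Bytes
    NUMA Node Count: 2
"""

summary = reliable_summary(SAMPLE)
print(
    f"{summary['reliable_gib']:.1f} GiB of "
    f"{summary['physical_gib']:.1f} GiB is reliable "
    f"({summary['reliable_fraction']:.2%})"
)
```

For this host the reliable region is exactly 8 GiB, which is only a small share of the total memory.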

Configuring Reliable Memory on a per-virtual machine basis (2146595)

I can also decide to use reliable memory for individual VMs, not only the 8 GB for the hypervisor.

To turn on the feature per VM:

  1. Edit the .vmx file using a text editor
  2. Add the parameter:
    sched.mem.reliable = "True"
  3. Save and close the file
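
The manual steps above could also be scripted. The following is a minimal Python sketch, assuming the .vmx file is plain `key = "value"` text; the helper name `set_vmx_option` and the sample content are hypothetical, and matching keys case-insensitively is my assumption, not something the KB article states. It works on a string, so reading and writing the actual file is left to the caller.

```python
def set_vmx_option(vmx_text, key, value):
    """Return vmx_text with key = "value" set exactly once.

    An existing entry for the key is replaced; duplicates are dropped;
    if the key is missing, the entry is appended at the end.
    """
    new_line = f'{key} = "{value}"'
    out = []
    found = False
    for line in vmx_text.splitlines():
        name = line.split("=", 1)[0].strip()
        # Assumption: compare keys case-insensitively to avoid duplicates.
        if name.lower() == key.lower():
            if not found:
                out.append(new_line)
                found = True
        else:
            out.append(line)
    if not found:
        out.append(new_line)
    return "\n".join(out) + "\n"


# Hypothetical .vmx content, for illustration only.
sample = 'displayName = "demo-vm"\nmemSize = "4096"\n'
updated = set_vmx_option(sample, "sched.mem.reliable", "True")
print(updated)
```

Running the function a second time on its own output changes nothing, so it is safe to apply repeatedly. As with any manual .vmx edit, it is sensible to keep a backup of the file and make the change while the VM is powered off.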

Conclusion:

  • To enable Fault Resilient Memory (FRM), I had to disable ADDDC Sparing in BIOS policy / Advanced / RAS Memory / Memory RAS configuration
  • With ADDDC and Proactive HA I can avoid about 95% of failures. Personally, I prefer to use ADDDC
  • The best option would be to have both in a future firmware …

Interesting links:

Field Notice: FN – 70432 – Improved Memory RAS Features for UCS M5 Platforms – Software Upgrade Recommended

Memory Errors and Dell EMC PowerEdge YX4X Server Memory RAS Features