We can see that the annual incidence of uncorrectable memory errors is rising. Here is one possible way to address it: Fault Resilient Memory (FRM).
ESXi supports reliable memory.
Some systems have reliable memory, which is a part of memory that is less likely to have hardware memory errors than other parts of the memory in the system. If the hardware exposes information about the different levels of reliability, ESXi might be able to achieve higher system reliability.
How to enable it in Cisco UCS
Configuration is in BIOS policy / Advanced / RAS Memory
8 GB could be enough for the ESXi hypervisor …
This forces the hypervisor and some core kernel processes to be mirrored between DIMMs, so ESXi itself can survive the complete failure of a DIMM.
# esxcli hardware memory get
Physical Memory: 540800864256 Bytes
Reliable Memory: 8589934592 Bytes
NUMA Node Count: 2
# esxcli system settings kernel list | grep useReliableMem
useReliableMem Bool TRUE TRUE TRUE System is aware of reliable memory.
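To double-check how much of the host's memory is actually reported as reliable, the `esxcli hardware memory get` output can be parsed and converted. This is a minimal Python sketch, not an official tool. It assumes the command prints one `Key: value` field per line, the `parse_memory_info` helper is a hypothetical name, and the sample text is copied from the output above.

```python
def parse_memory_info(output: str) -> dict:
    """Parse 'Key: <number> [unit]' lines into a {key: int} dictionary."""
    info = {}
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # skip lines without a 'Key: value' shape
        parts = value.split()
        if parts and parts[0].isdigit():
            info[key.strip()] = int(parts[0])
    return info


# Sample copied from the esxcli output shown in this post.
SAMPLE = """\
Physical Memory: 540800864256 Bytes
Reliable Memory: 8589934592 Bytes
NUMA Node Count: 2
"""

info = parse_memory_info(SAMPLE)
reliable_gib = info["Reliable Memory"] / 2**30
reliable_pct = 100 * info["Reliable Memory"] / info["Physical Memory"]
print(f"{reliable_gib:.1f} GiB reliable ({reliable_pct:.2f}% of physical memory)")
```

For this host the reliable region works out to exactly 8 GiB, which is only a small fraction of the total physical memory.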
Configuring Reliable Memory in Per-virtual machine basis (2146595)
I can decide to configure more reliable memory for VMs, not only the 8 GB for the hypervisor.
To turn on the feature per VM:
- Edit the .vmx file using a text editor
- Add the parameter:
sched.mem.reliable = "True"
- Save and close the file
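If many VMs need the setting, the manual edit above could be scripted. The following is a minimal Python sketch under a few assumptions: the VM is powered off while its .vmx file is changed, the file uses the usual `name = "value"` lines, and `set_reliable_memory` is a hypothetical helper name, not part of any VMware tooling.

```python
KEY = "sched.mem.reliable"


def set_reliable_memory(vmx_text: str, value: str = "True") -> str:
    """Return the .vmx content with sched.mem.reliable set exactly once."""
    new_line = f'{KEY} = "{value}"'
    out = []
    found = False
    for line in vmx_text.splitlines():
        name = line.split("=", 1)[0].strip()
        if name.lower() == KEY.lower():
            if not found:
                out.append(new_line)  # replace the first existing entry
                found = True
            # any further duplicates of the key are dropped
        else:
            out.append(line)
    if not found:
        out.append(new_line)  # key was missing, so append it
    return "\n".join(out) + "\n"


# Example with an in-memory snippet (the values are made up for illustration).
sample = 'displayName = "vm1"\nmemSize = "4096"\n'
print(set_reliable_memory(sample), end="")
```

The function only transforms text, so it can be tried on a copy first. Writing the result back to the real .vmx file (for example with `pathlib.Path.read_text` and `write_text`) is left to the caller.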
Conclusion:
- To enable Fault Resilient Memory (FRM), I had to disable ADDDC Sparing in BIOS policy / Advanced / RAS Memory / Memory RAS configuration
- With ADDDC and Proactive HA I can avoid about 95% of failures. Personally, I prefer to use ADDDC
- The best option would be to have both in a future firmware …
Interesting links:
Memory Errors and Dell EMC PowerEdge YX4X Server Memory RAS Features