NSX-T Edge design guide for Cisco UCS

How do you design NSX-T Edge inside Cisco UCS? I couldn't find it in the Cisco design guides, but I did find useful topologies in the Dell EMC VxBlock™ Systems documentation, the VMware® NSX-T Reference Design and the NSX-T 3.0 Edge Design Step-by-Step UI workflow. Thanks, Dell and VMware…

VMware® NSX-T Reference Design

  • VDS Design update – New capability of deploying NSX on top of VDS with NSX
  • VSAN Baseline Recommendation for Management and Edge Components
  • VRF Based Routing and other enhancements
  • Updated security functionality
  • Design changes that go with VDS with NSX
  • Performance updates

NSX-T 3.0 Edge Design Step-by-Step UI workflow

This is an informal document that walks through the step-by-step deployment and configuration workflow for the NSX-T Edge Single N-VDS Multi-TEP design. It uses the NSX-T 3.0 UI to show the workflow, which is broken down into the following three sub-workflows:

  1. Deploy and configure the Edge node (VM & BM) with Single-NVDS Multi-TEP.
  2. Preparing NSX-T for Layer 2 External (North-South) connectivity.
  3. Preparing NSX-T for Layer 3 External (North-South) connectivity.

NSX-T Design with EDGE VM

  • Under Teamings, add two teaming policies: one with Active Uplink “uplink-1” and the other with “uplink-2”.
  • Make a note of the policy names used, as we will need them in the next section. In this example they are “PIN-TO-TOR-LEFT” and “PIN-TO-TOR-RIGHT”.
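
The same uplink profile with the two named teaming policies can also be created through the NSX-T Manager API instead of the UI. A minimal sketch, assuming an NSX-T 3.0 manager; the manager address, admin user, MTU, transport VLAN and profile name are example values, not taken from the original workflow:

# Create an uplink profile with a load-balancing default teaming for overlay traffic
# and two named failover-order teamings that pin traffic to one ToR switch each
curl -k -u admin -X POST https://nsx-mgr.lab.local/api/v1/host-switch-profiles \
  -H 'Content-Type: application/json' \
  -d '{
    "resource_type": "UplinkHostSwitchProfile",
    "display_name": "edge-uplink-profile",
    "mtu": 9000,
    "transport_vlan": 200,
    "teaming": {
      "policy": "LOADBALANCE_SRCID",
      "active_list": [
        { "uplink_name": "uplink-1", "uplink_type": "PNIC" },
        { "uplink_name": "uplink-2", "uplink_type": "PNIC" }
      ]
    },
    "named_teamings": [
      { "name": "PIN-TO-TOR-LEFT", "policy": "FAILOVER_ORDER",
        "active_list": [ { "uplink_name": "uplink-1", "uplink_type": "PNIC" } ] },
      { "name": "PIN-TO-TOR-RIGHT", "policy": "FAILOVER_ORDER",
        "active_list": [ { "uplink_name": "uplink-2", "uplink_type": "PNIC" } ] }
    ]
  }'

The named teamings are later applied to the VLAN segments used for North/South BGP peering, so each peering VLAN leaves the edge on a predictable uplink.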

How to design NSX-T Edge inside Cisco UCS?

The Cisco Fabric Interconnects use port channels, and you need high bandwidth for the NSX-T Edge load.

A Cisco UCS C220 M5 rack server could solve it.

The edge node physical NIC definition includes the following:

  • VMNIC0 and VMNIC1: Cisco VIC 1457
  • VMNIC2 and VMNIC3: Intel XXV710 adapter 1 (TEP and Overlay)
  • VMNIC4 and VMNIC5: Intel XXV710 adapter 2 (N/S BGP Peering)
NSX-T transport nodes with Cisco UCS C220 M5
Logical topology of the physical edge host
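
To confirm which vmnic maps to which physical adapter on the edge host, you can check the NIC inventory from the ESXi shell. A quick verification sketch (not part of the referenced design guides); vmnic2 is just an example name:

# List all physical NICs; the Description column shows the adapter model,
# so the Cisco VIC 1457 ports and the two Intel XXV710 cards are easy to tell apart
esxcli network nic list

# Show driver, firmware and link details for a single uplink
esxcli network nic get -n vmnic2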

Or for PoC or Lab – Uplink Eth Interfaces

For a PoC or home lab, we could use uplink Ethernet interfaces and create vNIC templates linked to these uplinks.

Links:

Cisco UCS Manager Plugin for VMware vSphere HTML Client (Version 3.0(6))

Cisco has released the 3.0(6) version of the Cisco UCS Manager VMware vSphere HTML client plugin. The UCS Manager vSphere HTML client plugin enables a virtualization administrator to view, manage, and monitor the Cisco UCS physical infrastructure. The plugin provides a physical view of the UCS hardware inventory on the HTML client.

I reported the bug “Host not going into monitoring state after vCenter restart”. Thank you for the fix.

Release 3.0(6)

Here are the new features in Release 3.0(6):

  • Custom fault addition for proactive HA monitoring
  • Resolved: host not going into monitoring state after vCenter restart
  • Included defect fixes

VMware vSphere HTML Client Releases

Cisco UCS Manager plug-in is compatible with the following vSphere HTML Client releases:

VMware vSphere HTML Client Version | Cisco UCS Manager Plugin for VMware vSphere Version
6.7                                | 3.0(1), 3.0(2), 3.0(3), 3.0(4), 3.0(5), 3.0(6)
7.0                                | 3.0(4), 3.0(5), 3.0(6)
7.0u1, 7.0u2                       | 3.0(5), 3.0(6)

Note
VMware vSphere HTML Client Version 7.0u3 is not supported.
More info here.

Field Notice: FN – 70432 – Improved Memory RAS Features for UCS M5 Platforms – Software Upgrade Recommended – 4.2(1i)

I recommend upgrading to a server firmware bundle that includes ADDDC Sparing to expand the memory error coverage – 4.2(1i). More info:

Handling RAS events

When BANK-level or RANK-level RAS events are observed (and PPR is enabled):

  1. Verify that no other DIMM faults are present (for example, an uncorrectable error).
  2. Schedule a maintenance window (MW).
  3. During MW, put the host in maintenance mode and reboot the server to attempt a permanent repair of the DIMM using Post Package Repair (PPR).
    1. If no errors occur after reboot, PPR was successful, and the server can be put back into use.
    2. If new ADDDC events occur, repeat the reboot process to perform additional permanent repairs with PPR.
  4. If an uncorrectable error occurs after reboot, replace the DIMM.
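
If you handle step 3 from the ESXi command line, a minimal sketch looks like this (note that esxcli does not evacuate VMs the way vCenter maintenance mode with DRS does, so migrate the VMs off the host first):

# Enter maintenance mode on the affected host (VMs already evacuated)
esxcli system maintenanceMode set --enable true

# Reboot so the BIOS can attempt the permanent DIMM repair with PPR
esxcli system shutdown reboot --reason "PPR repair after ADDDC RAS event"

# After the reboot, if no new DIMM faults are reported, bring the host back
esxcli system maintenanceMode set --enable false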

Release 4.1(1) firmware generates a Major severity fault for all BANK and RANK RAS events so that proactive action can be taken relative to a critical ADDDC defect CSCvr79388.

Releases 4.1(2) and 4.1(3) firmware generate a Major severity fault for RANK RAS events on advanced CPU SKUs. BANK RAS events will generate a fault for standard CPU SKUs.

Problem Symptom

Due to memory DIMM errors and architectural changes in memory error handling on Intel Xeon Scalable processors (formerly code-named “Skylake Server”) and 2nd Gen Intel Xeon Scalable processors (formerly code-named “Cascade Lake Server”), Cisco UCS M5 customers that experience memory DIMM errors might experience a higher rate of runtime uncorrectable memory errors than they experienced on previous generations with default SDDC Memory RAS mode.

Workaround/Solution

Cisco recommends that you upgrade to a Server Firmware Bundle that includes ADDDC Sparing to expand the memory error coverage. Refer to this table for supported and recommended firmware that includes ADDDC Sparing.

Platform                                         | Server Firmware That Supports ADDDC Sparing          | Recommended Server Firmware
UCS M5 Blades and Integrated UCS M5 Rack Servers | 3.2(3p) or later, 4.0(4i) or later, 4.1(1d) or later | 4.1(3d) or later
Defect ID  | Headline
CSCvq38078 | UCSM: Default option for “Select Memory RAS configuration” changed to ADDDC sparing
Links

Fault Resilient Memory (FRM) for Cisco UCS

We can see that the annual incidence of uncorrectable memory errors is rising. Here is one possibility – how to address it with FRM.

ESXi supports reliable memory.

Some systems have reliable memory, which is a part of memory that is less likely to have hardware memory errors than other parts of the memory in the system. If the hardware exposes information about the different levels of reliability, ESXi might be able to achieve higher system reliability.

How to enable in Cisco UCS

Configuration is in BIOS policy / Advanced / RAS Memory

8 GB could be enough for the ESXi hypervisor…

This forces the Hypervisor and some core kernel processes to be mirrored between DIMMs so ESXi itself can survive the complete and total failure of a memory DIMM.

# esxcli hardware memory get
    Physical Memory: 540800864256 Bytes
    Reliable Memory: 8589934592 Bytes
    NUMA Node Count: 2 
#  esxcli system settings kernel list | grep useReliableMem
 useReliableMem Bool TRUE TRUE TRUE System is aware of reliable memory. 

Configuring Reliable Memory on a per-virtual-machine basis (2146595)

I can decide to configure more reliable memory for a VM – not only the 8 GB for the hypervisor.

To turn on the feature per VM:

  1. Edit the .vmx file using a text editor
  2. Add the parameter:
    sched.mem.reliable = "True"
  3. Save and close the file
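
The same change can also be made from the ESXi shell. A minimal sketch, assuming the VM is powered off; the datastore path and VM name are placeholders:

# Append the setting to the VM's configuration file
echo 'sched.mem.reliable = "True"' >> /vmfs/volumes/datastore1/test-vm/test-vm.vmx

# Find the VM ID and reload the configuration so hostd picks up the change
vim-cmd vmsvc/getallvms | grep test-vm
vim-cmd vmsvc/reload <vmid>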

Conclusion:

  • To enable Fault Resilient Memory (FRM), I had to disable ADDDC Sparing in BIOS policy / Advanced / RAS Memory / Memory RAS configuration
  • With ADDDC and Proactive HA I can avoid about 95% of failures – personally, I prefer to use ADDDC
  • The best option would be to have both in a future firmware…

Interesting links:

Field Notice: FN – 70432 – Improved Memory RAS Features for UCS M5 Platforms – Software Upgrade Recommended

Memory Errors and Dell EMC PowerEdge YX4X Server Memory RAS Features

Driver for Cisco nenic 1.0.35.0 – Enabled Geneve Offload support

The Cisco Virtual Interface Card native Ethernet driver (nenic) 1.0.35.0 enables Geneve offload support for VIC 14xx adapters.

Bugs Fixed (since nenic 1.0.29.0):

  • CSCvw37990: Set Tx queue count to 1 in all cases except Netq
  • CSCvw39021: Add tx_budget mod param to nenic
  • CSCvo36323: Fix the issue of spurious drop counts with VIC 13XX in standalone rack servers.
  • CSCvq26550: Fix added in the driver to notify the VIC Firmware about any WQ/RQ errors.

Dependencies:
Cisco UCS Virtual Interface Card 1280 firmware version: 4.0
Cisco UCS Virtual Interface Card 1240 firmware version: 4.0
Cisco UCS Virtual Interface Card 1225 firmware version: 4.0
Cisco UCS Virtual Interface Card 1225T firmware version: 4.0
Cisco UCS Virtual Interface Card 1285 firmware version: 4.0
Cisco UCS Virtual Interface Card 1380 firmware version: 4.0
Cisco UCS Virtual Interface Card 1385 firmware version: 4.0
Cisco UCS Virtual Interface Card 1387 firmware version: 4.0
Cisco UCS Virtual Interface Card 1340 firmware version: 4.0
Cisco UCS Virtual Interface Card 1227 firmware version: 4.0
Cisco UCS Virtual Interface Card 1440 firmware version: 5.x
Cisco UCS Virtual Interface Card 1455 firmware version: 5.x
Cisco UCS Virtual Interface Card 1457 firmware version: 5.x
Cisco UCS Virtual Interface Card 1480 firmware version: 5.x
Cisco UCS Virtual Interface Card 1495 firmware version: 5.x
Cisco UCS Virtual Interface Card 1497 firmware version: 5.x

New Features:
Enabled Geneve Offload support for VIC 14xx adapters
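
To check which nenic version a host is currently running before and after the driver update, a quick check from the ESXi shell:

# Installed nenic driver VIB and its version
esxcli software vib list | grep nenic

# Uplinks that are using the nenic driver
esxcli network nic list | grep nenic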

More info:

https://my.vmware.com/en/group/vmware/downloads/details?downloadGroup=DT-ESXI67-CISCO-NENIC-10350&productId=742

Cisco UCS Manager Plugin for VMware vSphere HTML Client (Version 3.0(4))

Cisco has released the 3.0(4) version of the Cisco UCS Manager VMware vSphere HTML client plugin. The UCS Manager vSphere HTML client plugin enables a virtualization administrator to view, manage, and monitor the Cisco UCS physical infrastructure. The plugin provides a physical view of the UCS hardware inventory on the HTML client.

Here are the new features in Release 3.0(4):

  • Support for VMware vSphere HTML Client 7.0
  • Included defect fix

Resolved Bugs

Defect ID  | Symptom                                         | First Affected in Release | Resolved in Release
CSCvv01618 | UCSM domain not visible in HTML plug-in 3.0(2). | 3.0(2)                    | 3.0(4)
More info here.

FIX: The virtual machine is configured for too much PMEM. 6.0 TB main memory and x PMEM exceeds the maximum allowed 6.0 TB

VMware supports Intel Optane PMem with vSphere 6.7 and 7.0. VMware and Intel worked with SAP to complete the validation of SAP HANA with Optane PMem enabled VMs.

I configured a PoC test VM with 1 TB RAM and 1 TB PMem. I was unable to power it on. Error: “The virtual machine is configured for too much PMem. 6.0 TB main memory and 1024 GB PMem exceeds the maximum allowed 6.0 TB.”

The problem was that Memory Hot Plug was enabled, because the limit is calculated as follows:

With Memory Hot Plug enabled:
* size of RAM x 16 + size of PMem must stay within the 6 TB limit

With Memory Hot Plug disabled:
* size of RAM + size of PMem must stay within the 6 TB limit

SAP HANA does not support hot-add memory. Because of this, hot-add memory was not validated by SAP and VMware with SAP HANA and is therefore not supported, according to the SAP HANA on VMware vSphere guide.

Example of how to reproduce the error:

6 TB limit, 1 TB PMem:
(6144 GB - 1024 GB) / 16 = 320 GB

With Memory Hot Plug enabled, I can't power on the VM with more than 320 GB of RAM (it fails at 321 GB).

Cisco Custom ISO MISSING_DEPENDENCY_VIBS ERROR during upgrade ESXi 6.7 -> 7.0

I found a problem and a workaround for the Cisco Custom ISO MISSING_DEPENDENCY_VIBS error during an ESXi 6.7 -> 7.0 upgrade.

It happened during an upgrade from VMware_ESXi_6.7.0_13006603_Custom_Cisco_6.7.2.1.iso to VMware_ESXi_7.0.0_15843807_Custom_Cisco_4.1.1a.iso.

The workaround is to remove the VIBs with the dependency collision:

# esxcli software vib list | grep QLC
qcnic                          1.0.22.0-1OEM.670.0.0.8169922         QLC 
qedentv                        3.9.31.0-1OEM.670.0.0.8169922         QLC 
qedrntv                        3.9.31.1-1OEM.670.0.0.8169922         QLC 
qfle3                          1.0.77.2-1OEM.670.0.0.8169922         QLC 
qfle3f                         1.0.63.0-1OEM.670.0.0.8169922         QLC 
qfle3i                         1.0.20.0-1OEM.670.0.0.8169922         QLC 
scsi-qedil                     1.2.13.0-1OEM.600.0.0.2494585         QLC 

# esxcli software vib remove -f -n scsi-qedil
# esxcli software vib remove -f -n qfle3f
# reboot
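
After the reboot you can confirm that the two conflicting VIBs are gone before retrying the upgrade (a quick check, not part of the original workaround):

# Should return no output once both VIBs have been removed
esxcli software vib list | grep -E "qedil|qfle3f"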

Memory Controller May Hang While in Virtual Lockstep – fix in UCSM 4.1(1c)

SAP HANA is very memory-intensive. With ADDDC Sparing we can add system reliability: it works by holding memory in reserve so that it can be used in case other DIMMs fail. But there could be another problem with it.

Memory Controller May Hang While in Virtual Lockstep

For more information – Intel® Xeon® Processor Scalable Family Specification Update, # SKX108:

Problem: Under complex microarchitectural conditions, a memory controller that is in Virtual Lockstep (VLS) may hang on a partial write transaction.

Workaround: It is possible for the BIOS to contain a workaround; see below.

Implication: The memory controller hangs with a mesh-to-mem timeout Machine Check Exception (MSCOD=20h, MCACOD=400h). The memory controller hang may lead to other machine check timeouts that can lead to an unexpected system shutdown.

Cisco UCS Manager Release 4.1(1c) fixes it

Cisco applied a BIOS workaround for this scenario.

Defect ID  | Symptom
CSCvr79388 | Cisco UCS servers stop responding and reboot after ADDDC virtual lockstep is activated. This results in #IERR and M2M timeout in the memory system. This issue is resolved.
CSCvr79396 | On Cisco UCS M5 servers, the Virtual Lockstep (VLS) sparing copy finishes early, leading to incorrect values in the lockstep region. This issue is resolved.
Resolved Caveats in Release 4.1(1c)

I recommend updating ASAP; firmware 4.1(1c) is stable. Thanks, Cisco!

How to Configure vSphere 6.7 Proactive HA with Cisco UCS Manager Plugin for VMware vSphere?

I wrote in a previous blog post that the latest Cisco UCS Manager plugin works with vCenter 6.7 U3b.

Install Cisco UCS Manager Plugin

vSphere Web Client – Enable Proactive HA

From the vSphere Web Client -> Cluster -> Configure -> vSphere Availability, Proactive HA is turned off – click Edit. You can see that vSphere Proactive HA is disabled by default.

  • Automation Level – Determine whether host quarantine or maintenance mode and VM migrations are recommendations or automatic.
    • Manual – vCenter Server suggests migration recommendations for virtual machines.
    • Automated – Virtual machines are migrated to healthy hosts and degraded hosts are entered into quarantine or maintenance mode depending on the configured Proactive HA automation level.
  • Remediation – Determine what happens to partially degraded hosts.
    • Quarantine mode – for all failures. Balances performance and availability, by avoiding the usage of partially degraded hosts provided that virtual machine performance is unaffected.
    • Mixed mode – Quarantine mode for moderate and Maintenance mode for severe failure (Mixed). Balances performance and availability, by avoiding the usage of moderately degraded hosts provided that virtual machine performance is unaffected. Ensures that virtual machines do not run on severely failed hosts.
    • Maintenance mode – for all failures. Ensures that virtual machines do not run on partially failed hosts.
The best option is Automated + Mixed mode.
Select the Cisco UCS provider – do NOT block any failure conditions.

How is Proactive HA working?

With Automation Level set to Automated and Remediation set to Mixed mode, after a hardware failure Proactive HA enters the host into quarantine mode and migrates all VMs from the ESXi host with the hardware failure:

After 4 minutes 10 seconds, Proactive HA had migrated all VMs from the failed ESXi host.