Lorenzo Moglie - Notes: novembre 2024

Issue

Recently I faced out on this issue:

Description Edge NIC fp-eth1 transmit ring buffer has overflowed by 100.000000% on Edge node cd5c7792-1e62-48bb-b6da-40e8eea154b7. The missed packet count is 420560 and processed packet count is 0.

Recommended Action 1. If a lot of VMs are accommodated along with edge by the hypervisor then edge VM might not get time to run, hence the packets might not be retrieved by hypervisor. Then probably migrating the edge VM to a host with fewer VMs. 2. Increase the ring size by 1024 using the command `set dataplane ring-size tx `. If even after increasing the ring size, the issue persists then contact VMware Support as the ESX side transmit ring buffer might be of lower value. If there is no issue on ESX side, it indicates the edge needs to be scaled to a larger form factor deployment to accommodate the traffic. 3. If the alarm keeps on flapping, i.e., triggers and resolves very soon, then it is due to bursty traffic. In this case check if tx pps using the command `get dataplane cpu stats`. If it is not high during the alarm active period then contact VMware Support. If pps is high it confirms bursty traffic. Consider suppressing the alarm. NOTE - There is no specific benchmark to decide what is regarded as a high pps value. It depends on infrastructure and type of traffic. The comparison can be made by noting down when alarm is inactive and when it is active.

Solution

Checking for recommended actions ...

edge-bridge-cluster3-A> get dataplane | find ring
Wed Nov 20 2024 UTC 12:16:02.831
Bfd_ring_size      : 512   
Lacp_ring_size     : 512   
Learning_ring_size : 512   
Livetrace_ring_size: 512   
Rx_ring_size       : 4096  
Slowpath_ring_size : 512   
Tx_ring_size       : 4096

... rx and tx ring_size was already at 4096. Looking for flow-cache ...

edge-bridge-cluster3-A> get dataplane flow-cache config
Wed Nov 20 2024 UTC 12:16:50.970
Enabled            : true
Mega_hard_timeout_ms: 4955
Mega_size          : 262144
Mega_soft_timeout_ms: 4904
Micro_size         : 262144

... we saw that the value can be incremented up to 524288. We incremented it and restarted the dataplane service

edge-bridge-cluster3-A> set dataplane flow-cache-size 524288
edge-bridge-cluster3-A> restart service dataplane

edge-bridge-cluster3-A> get dataplane flow-cache config
Wed Nov 20 2024 UTC 12:25:38.810
Enabled            : true
Mega_hard_timeout_ms: 4955
Mega_size          : 524288
Mega_soft_timeout_ms: 4904
Micro_size         : 524288

What does flow-cache do?
Flow Cache helps reduce CPU cycles spent on known traffic flows. NSX Edge node uses flow cache to achieve high packet throughput. This feature records actions applied on each flow when the first packet in the flow is processed so that subsequent packets can be processed using a match-and-action procedure.

When the key collisions rates are high, increasing the flow cache size help process the packets most efficiently. However, increasing the cache size might impact memory consumption. Typically, the higher the hit rates, the better the performance.

After this change I also proceeded to free up the host where the Edge was running by migrating the VMs elsewhere.
This was enough to solve my problem. The VMs on the bridged segments became newly available.

Looking around; we saw a nice analysis done by Giuliano on a similar issue at this link.

Further helpful information on the same topics can be found at the following links:

https://docs.vmware.com/en/VMware-Telco-Cloud-Platform/3.0/telco-cloud-platform-5g-edition-data-plane-performance-tuning-guide/GUID-64EEE4A0-23C1-49DB-AE4D-F235F8AB8EAB.html

https://knowledge.broadcom.com/external/article/330475/edge-nic-out-of-receive-buffer-alarm.html

https://knowledge.broadcom.com/external/article?legacyId=80233

That's it.