Monday, August 14, 2023

[NAPP] Deployment gets stuck on "Create Guest Cluster"

Issue


The NAPP deployment gets stuck on "Create Guest Cluster - Waiting for Guest cluster napp-cluster-01 to be available for login ...".
Looking at vCenter, we can see that the SupervisorControlPlaneVM(s) have been created correctly, as well as the namespace and the napp-cluster-01-control-plane VM.
What we don't see are the worker VMs.

Solution


To investigate and troubleshoot the issue we connect via SSH to the SupervisorControlPlaneVM. I will explain in another post how to get the credentials to access the SV CP.

Describing the NAPP TKC ...
# kubectl describe tkc napp-cluster-01 -n nsx-01
we found 2 errors:

Message:          2 errors occurred:
                         * failed to configure DNS for /, Kind= nsx-01/napp-cluster-01: unable to reconcile kubeadm ConfigMap's CoreDNS info: unable to retrieve kubeadm Configmap from the guest cluster: configmaps "kubeadm-config" not found
                         * failed to configure kube-proxy for /, Kind= nsx-01/napp-cluster-01: unable to retrieve kube-proxy daemonset from the guest cluster: daemonsets.apps "kube-proxy" not found


Looking at the deployment state of the worker nodes ...
# kubectl get wcpmachine,machine,kcp,vm -n nsx-01

... we saw that the worker nodes were still in Pending state. We described the worker node ...
# kubectl describe wcpmachine.infrastructure.cluster.vmware.com/napp-cluster-01-workers-qlpm6-7h2qr -n nsx-01
We also debugged Kubernetes with the crictl command, looking inside the container logs ... and so on.
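For reference, the kind of crictl calls we mean are along these lines (the container ID is just a placeholder, not taken from this environment):
# crictl ps -a
# crictl logs <CONTAINER-ID>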

We tried to ping from the Supervisor cluster to the TKC VIP:
# kubectl get svc -A | grep -i napp-cluster-01
nsx-01                                      napp-cluster-01-control-plane-service                           LoadBalancer   10.96.1.25    192.168.100.25   6443:32296
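With the VIP in hand (192.168.100.25 above), the ping test from the Supervisor Control Plane shell looks like the first line below; the curl line is an extra reachability check of the API endpoint, added here only as a sketch and not part of the original session:
# ping -c 3 192.168.100.25
# curl -vk https://192.168.100.25:6443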
In the end, we discovered that we were unable to:
  • ping from the SupervisorControlPlane to the Tanzu Kubernetes Cluster VIP
  • ping from the TKC CP to the Supervisor CP
After allowing the connections on the firewall from the SV CP to the TKC VIP and from the TKC CP to the Supervisor CP, the error no longer appeared, but the state was still Pending.

So, we removed the namespace and re-deployed; now the control-plane and worker nodes are up and running and we can continue with the NAPP installation.

That's it.

Wednesday, August 9, 2023

[NSX-T] Stale logical-port(s) still connected in NSX-T 3.x

Issue


I was cleaning up a customer's NSX-T configuration to make some changes when I saw a lot of logical ports still connected, more than a hundred, even though the VMs were no longer present in vCenter.

Solution


I immediately thought of creating a script with REST API calls to remove the logical ports from NSX-T Manager. All the NSX-T API calls can be found here.

For the REST API calls within the bash script I will be using cURL, following the suggestions provided here.
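A quick note on the -n flag used below: it tells cURL to read the credentials from a ~/.netrc file, so the password never appears on the command line. A minimal sketch of that file, with username and password as placeholders:
machine {NSX-T MANAGER IP} login <NSX-USERNAME> password <NSX-PASSWORD>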

First, let's see the REST APIs to use:
  1. to retrieve the IDs of the Logical Ports
  2. to delete the connection

To get the list of Logical-Ports:
GET /api/v1/logical-ports

Below is how the command line looks ...
lorenzo@ubuntu:~$ curl -ksn -X GET https://{NSX-T MANAGER IP}/api/v1/logical-ports
... piping the previous command into jq and sed, we can extract only the ID field we are interested in.
lorenzo@ubuntu:~$ curl -ksn -X GET https://{NSX-T MANAGER IP}/api/v1/logical-ports | jq '.results[] | .id' | sed 's/"//g'
Outcome in the image below.


To delete a Logical Port:
DELETE /api/v1/logical-ports/<LogicalPort-ID>?detach=true
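For a single port, the corresponding cURL call would look like the one below (the Logical Port ID is the placeholder from above; the URL is quoted so the shell does not try to expand the ? in the query string):
lorenzo@ubuntu:~$ curl -ksn -X DELETE "https://{NSX-T MANAGER IP}/api/v1/logical-ports/<LogicalPort-ID>?detach=true"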

We now have all the elements to build the bash script, which looks like the one below...

WARNING: it is provided without warranty. Use it at your own risk and only if you are aware of what you are doing.

#!/bin/bash
# Retrieve all Logical Port IDs, then delete each one with detach=true
curl -ksn -X GET "https://{NSX-T MANAGER IP}/api/v1/logical-ports" | jq '.results[] | .id' | sed 's/"//g' | while read -r LP_ID
do
 curl -ksn -X DELETE "https://{NSX-T MANAGER IP}/api/v1/logical-ports/${LP_ID}?detach=true"
 echo " -> ${LP_ID} removed"
done
... launch the script as shown below ...
lorenzo@ubuntu:~$ bash remove_all_logical_port.sh 
... the result is the following. All Logical Ports have been deleted.
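To double-check, you can re-run the GET call and count what is left; result_count should be the counter field returned by the NSX-T list API (a sketch, not part of the original run):
lorenzo@ubuntu:~$ curl -ksn -X GET "https://{NSX-T MANAGER IP}/api/v1/logical-ports" | jq '.result_count'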


That's it.