Monday, 14 August 2023

[NAPP] Deployment gets stuck on "Create Guest Cluster"

Issue


The NAPP deployment gets stuck on "Create Guest Cluster - Waiting for Guest cluster napp-cluster-01 to be available for login ..."
Looking at vCenter, we can see that the SupervisorControlPlaneVM(s) have been created correctly, as well as the namespace and the napp-cluster-01-control-plane VM.
What we don't see are the worker VMs.

Solution


To investigate and troubleshoot the issue, we connect via SSH to the SupervisorControlPlaneVM (SV CP). I will explain in another post how to get the credentials to access the SV CP.

Describing the NAPP TKC (Tanzu Kubernetes Cluster) ...
# kubectl describe tkc napp-cluster-01 -n nsx-01
... we found two errors:

Message:          2 errors occurred:
                         * failed to configure DNS for /, Kind= nsx-01/napp-cluster-01: unable to reconcile kubeadm ConfigMap's CoreDNS info: unable to retrieve kubeadm Configmap from the guest cluster: configmaps "kubeadm-config" not found
                         * failed to configure kube-proxy for /, Kind= nsx-01/napp-cluster-01: unable to retrieve kube-proxy daemonset from the guest cluster: daemonsets.apps "kube-proxy" not found


Looking at the deployment state of the worker nodes ...
# kubectl get wcpmachine,machine,kcp,vm -n nsx-01

... we saw that the worker nodes were still in the Pending state. We described the worker node ...
# kubectl describe wcpmachine.infrastructure.cluster.vmware.com/napp-cluster-01-workers-qlpm6-7h2qr -n nsx-01
We also debugged Kubernetes with the crictl command, looking inside the container logs,
... and so on.
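
The check for machines that are not yet running can be scripted. The following is a minimal sketch, not the exact commands we ran: it assumes the Cluster API Machine objects expose their state in .status.phase, and the helper name not_running is made up for illustration.

```shell
# Print the names of machines whose phase is not "Running".
# Input on stdin: one "name<TAB>phase" pair per line.
# not_running is a hypothetical helper, not part of kubectl.
not_running() {
  awk -F'\t' 'NF && $2 != "Running" { print $1 }'
}

# Assumed usage on the Supervisor control plane (namespace taken from this post):
# kubectl get machine -n nsx-01 \
#   -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.phase}{"\n"}{end}' \
#   | not_running
```

Any name printed by the helper is a candidate for kubectl describe, as done above for the worker node.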

We then tried to ping the TKC VIP from the Supervisor cluster. The VIP is the external IP of the control-plane service:
# kubectl get svc -A | grep -i napp-cluster-01
nsx-01                                      napp-cluster-01-control-plane-service                           LoadBalancer   10.96.1.25    192.168.100.25   6443:32296
In the end, we discovered that we were unable to:
  • ping from the SupervisorControlPlane to the Tanzu Kubernetes Cluster VIP
  • ping from the TKC CP to the Supervisor CP
After allowing connections on the firewall from the SV CP to the TKC VIP and from the TKC CP to the Supervisor CP, the errors no longer appeared, but the deployment state was still Pending.
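
To re-check connectivity after a firewall change, the VIP lookup and the reachability test can be combined in a small script. This is a sketch under assumptions: the helper names extract_vip and check_reachable are made up, the service name pattern "<cluster>-control-plane-service" is taken from the output shown above, and the TCP test assumes bash and the timeout utility are available on the machine where it runs.

```shell
# Read `kubectl get svc -A` output on stdin and print the EXTERNAL-IP (VIP)
# of the LoadBalancer service named "<cluster>-control-plane-service".
# extract_vip is a hypothetical helper.
extract_vip() {
  awk -v cluster="$1" \
    '$2 == (cluster "-control-plane-service") && $3 == "LoadBalancer" { print $5 }'
}

# Test ICMP and TCP reachability of host $1 on port $2 (default 6443).
# Assumes Linux ping options, bash, and the coreutils timeout command.
check_reachable() {
  host=$1
  port=${2:-6443}
  if ping -c 3 -W 2 "$host" >/dev/null 2>&1; then
    echo "ping $host: ok"
  else
    echo "ping $host: FAILED"
  fi
  if timeout 3 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
    echo "tcp $host:$port: ok"
  else
    echo "tcp $host:$port: FAILED"
  fi
}

# Assumed usage on the Supervisor control plane:
# vip=$(kubectl get svc -A | extract_vip napp-cluster-01)
# check_reachable "$vip" 6443
```

The same check_reachable call can be run from the TKC control plane against the Supervisor CP address to cover the opposite direction.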

So we removed the namespace and re-deployed. Now the control-plane and worker nodes are up and running, and we can continue with the NAPP installation.

That's it.
