Tuesday, 24 December 2024

My experience with the VCP - VMware Cloud Foundation 5.2 Administrator (2V0-11.24) Exam

How to Prepare for the VMware Cloud Foundation Administrator 2024 Exam

The VMware Cloud Foundation Administrator (VCP-VCF) exam is a critical step for IT professionals who want to certify their skills in managing modern cloud infrastructures. This certification not only validates technical knowledge but also opens doors to higher-level roles in the IT field. Here is a practical guide to help you tackle this challenge effectively.

1. Understanding the Exam Structure

Before you begin your preparation, it’s essential to understand the exam content and the skills required. The exam focuses on the management of VMware Cloud Foundation (VCF), combining key components like vSphere, NSX-T, vSAN, and SDDC Manager.
You can expect both theoretical and scenario-based questions designed to test your day-to-day operational and troubleshooting skills. Core topics include the VCF lifecycle, component upgrades, and infrastructure orchestration. A recurring theme is the importance of studying the exam blueprint: the 2V0-11.24 exam, based on VCF 5.2, tests your ability to work with VCF's key components (vSphere, vSAN, NSX-T, and SDDC Manager) as well as with the Aria Suite and Tanzu.

2. Building a Solid Study Plan

Strategic preparation is crucial. Here are some key steps to follow:
  • Hands-On Lab Practice: Nothing beats real-world experience. Setting up a home lab or accessing a test environment is the best way to understand key operations such as workload domain creation, resource allocation, and common troubleshooting.
  • Official Study Materials: Dive into VMware documentation and take official training courses. These resources provide the foundational theory you need for the exam.
    The VMware Cloud Foundation: Deploy, Manage, Configure course is highly recommended.
  • Mock Exams and Practice Tests: Test yourself with quizzes and simulated exams to gauge your preparedness. This helps you become familiar with the exam format and improve time management.

3. Focus on Technical Aspects

Some topics deserve special attention during your preparation:
  • VCF Architecture: Understand how the key components (vSphere, vSAN, NSX-T) integrate within the framework.
  • Lifecycle Management: Be capable of performing updates, patches, and troubleshooting using SDDC Manager.
  • Security and Networking: Configure NSX-T for secure traffic across domains.
  • Troubleshooting Skills: Tackle scenarios that require rapid diagnosis and corrective action.

4. Final Tips

Taking the VMware Cloud Foundation Administrator exam is a challenging yet rewarding journey. Here are some last-minute tips for exam day:
  • Time Management: Don’t spend too much time on a single question. Answer the easier ones first and return to more complex questions later.
  • Stay Calm: A composed approach is crucial for clear thinking and handling unexpected issues.
  • Believe in Your Abilities: Deep preparation builds confidence, which is key to a successful exam experience.

A Step Toward the Future

The VCP-VCF certification is not just a technical achievement but an opportunity to distinguish yourself in the job market and contribute to modernizing cloud infrastructures. Prepare diligently and face this challenge with determination: success is within reach.

Good luck!

Wednesday, 4 December 2024

[NSX] Edge VM Present In NSX Inventory Not Present In vCenter

Issue


Today while I was deleting edge bridges from NSX Manager I got this error message:

Edge VM Present In NSX Inventory Not Present In vCenter
Description The VM edge-bridge-cluster1-B with moref id vm-1370430 corresponding to the Edge Transport node a060574d-4e93-4b7e-83b4-7eb8464a645d vSphere placement parameters is found in NSX inventory but is not present in vCenter. Please check if the VM has been removed in vCenter or is present with a different VM moref id.

Recommended Action The managed object reference moref id of a VM has the form vm-number, which is visible in the URL on selecting the Edge VM in vCenter UI. Example vm-12011 in https:///ui/app/vm;nav=h/urn:vmomi:VirtualMachine:vm-12011:164ff798-c4f1-495b-a0be-adfba337e5d2/summary Please find the VM edge-bridge-cluster1-B with moref id vm-1370430 in vCenter for this Edge Transport Node a060574d-4e93-4b7e-83b4-7eb8464a645d. If the Edge VM is present in vCenter with a different moref id, please follow the below action. Use NSX add or update placement API with JSON request payload properties vm_id and vm_deployment_config to update the new vm moref id and vSphere deployment parameters. POST https:///api/v1/transport-nodes/?action=addOrUpdatePlacementReferences. If the Edge VM with name edge-bridge-cluster1-B is not present in vCenter, use the NSX Redeploy API to deploy a new VM for the Edge node. POST https:///api/v1/transport-nodes/?action=redeploy.
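
For reference, this is roughly how the placement-update call from the recommended action can be issued with cURL. The transport node UUID is the one reported in the alarm, while the NSX Manager FQDN, the credentials, the new moref id and the vm_deployment_config payload are placeholders to be completed from the NSX API reference:

# Hypothetical sketch: point the Edge transport node at the new VM moref id
curl -sk -u admin:password -X POST \
  -H 'Content-Type: application/json' \
  -d '{ "vm_id": "vm-<new-moref-id>", "vm_deployment_config": { } }' \
  'https://<nsx-manager>/api/v1/transport-nodes/a060574d-4e93-4b7e-83b4-7eb8464a645d?action=addOrUpdatePlacementReferences'

# If the Edge VM no longer exists in vCenter at all, the alarm suggests redeploying it instead:
# POST https://<nsx-manager>/api/v1/transport-nodes/a060574d-4e93-4b7e-83b4-7eb8464a645d?action=redeploy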

Solution


Googling around I found various solutions, but none of them fit my situation exactly.

For example I found these:
VMware NSX Edge VMs not present in both NSX and vCenter
Edge VM present in NSX inventory not present in vCenter alarm

As described in the first link, I tried the command from Scenario 3, without success.
I also checked whether the Transport Nodes were still present in NSX by querying the transport node list.
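A check of this kind can be done, for example, against the transport-nodes API (the NSX Manager FQDN and credentials are placeholders; the grep pattern is simply the Edge name taken from the alarm):
curl -sk -u admin:password -X GET 'https://<nsx-manager>/api/v1/transport-nodes' | jq -r '.results[].display_name' | grep edge-bridge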
No trace of them. It seemed as if the deletion process had not finished. I waited 30 minutes, but the problem was still there.

I solved the issue by restarting the NSX Manager appliances one by one.
I rebooted the first appliance, waited for the cluster to return to a "stable" state, and continued with the next appliance until I had restarted them all. After the last reboot the alarm changed from "Open" to "Resolved" and was no longer present.
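Between one reboot and the next, it is worth confirming that the management cluster is stable again before touching the next node. As an example, the cluster status can be read via the API (FQDN and credentials are placeholders):
curl -sk -u admin:password -X GET 'https://<nsx-manager>/api/v1/cluster/status' | jq -r '.mgmt_cluster_status.status'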

That's it.

Monday, 2 December 2024

[NSX] - API Authentication Using a Session Cookie on PowerShell

Issue


Recently I had to create a PowerShell script that grabs some information from NSX via REST API calls. To do so, I had to write a few lines of code to authenticate against NSX.
To reduce the number of times the username and password have to be entered and/or transit over the network, I used the NSX session-based authentication method to generate a JSESSIONID cookie when using the API, as described here.
The documentation describes how to create a new session cookie and how to use the X-XSRF-TOKEN header for subsequent requests with cURL in a Linux environment. Below I wrote a few lines of code to use the same method in a PowerShell environment.
Let's see below how it works in PowerShell...

Solution


The script must run on a Windows machine, so I decided to write a PowerShell script. Information regarding the API calls can be found at the following link: https://developer.vmware.com/apis
I thought it would be useful to share how to do it, so let's look at the script:
#
# Create a Session Token 
#
# LM v. 0.2
#
# This script is an example on how to create a session token on NSX and reuse for subsequent requests.
# 


#The script accepts as input the FQDN of the NSX Manager to connect to; leave it blank to use the default "nsx-mgr.vcf.sddc.lab"
param(
    [string] $nsx_manager = 'nsx-mgr.vcf.sddc.lab'
)

#Used to manage/skip certificates
add-type @"
using System.Net;
using System.Security.Cryptography.X509Certificates;
public class TrustAllCertsPolicy : ICertificatePolicy {
    public bool CheckValidationResult(
        ServicePoint srvPoint, X509Certificate certificate,
        WebRequest request, int certificateProblem) {
        return true;
    }
}
"@
$AllProtocols = [System.Net.SecurityProtocolType]'Ssl3,Tls,Tls11,Tls12'
[System.Net.ServicePointManager]::SecurityProtocol = $AllProtocols
[System.Net.ServicePointManager]::CertificatePolicy = New-Object TrustAllCertsPolicy


function createSession {
    $script:session = New-Object Microsoft.PowerShell.Commands.WebRequestSession
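    # The WebRequestSession keeps the JSESSIONID cookie returned by NSX and automatically re-sends it on subsequent requests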
    $script:headers = @{}
    $script:nsx_uri = "https://" + $nsx_manager
    $uri = $nsx_uri + "/api/session/create"
    $private:body = "j_username=$($nsx_user)&j_password=$($nsx_pass)" 
    try {
        $response = invoke-webrequest -contentType "application/x-www-form-urlencoded" -WebSession $session -uri $uri -Method 'POST' -Body $body -usebasicparsing -Erroraction Stop
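        # NSX returns the anti-CSRF token in the X-XSRF-TOKEN response header; it must be sent back as a header on every subsequent request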
        $xsrftoken = $response.headers["X-XSRF-TOKEN"]
 
        #$response
        $script:loginSuccess = $true
        $script:headers.Add("X-XSRF-TOKEN", $xsrftoken)
        $script:headers.Add("Accept", "application/json")
        $script:headers.Add('Content-Type','application/x-www-form-urlencoded')
    }
    catch {
        Write-Host "Failed" -ForegroundColor Red
        Write-Host "$($_.Exception)" -ForegroundColor Red
        write-host "Error Details:" $_.ErrorDetails.Message -ForegroundColor Magenta
        $script:loginSuccess = $false
    }
}

#If you want to enter credentials on the fly, uncomment the three lines below and comment out the hardcoded credentials
#$MyCredential = Get-Credential -Message "Insert $nsx_manager "
#$nsx_user = $MyCredential.UserName
#$nsx_pass = [Runtime.InteropServices.Marshal]::PtrToStringAuto([Runtime.InteropServices.Marshal]::SecureStringToBSTR($MyCredential.Password))

#Hardcoded credentials; leave them uncommented if you don't want to use the Get-Credential prompt above, or comment them out otherwise
$nsx_user = 'admin'
$nsx_pass = 'VMware123!VMware123!'

#Create the cookie session 
createSession


#Example of subsequent requests using the session cookie and the X-XSRF-TOKEN header
#List of segments
$response_q1 = Invoke-webrequest -WebSession $session -uri $($nsx_uri + "/policy/api/v1/infra/segments") -Method 'GET' -Headers $headers -usebasicparsing -Erroraction Stop

#List of tier-1s
$response_q2 = invoke-webrequest -WebSession $session -uri $($nsx_uri + "/policy/api/v1/infra/tier-1s") -Method 'GET' -Headers $headers -usebasicparsing -Erroraction Stop

write-host " ----- Segments ----- " -ForegroundColor Green
write-host $response_q1.Content
write-host 
write-host 
write-host " ----- Tier-1s ----- " -ForegroundColor Green
write-host $response_q2.Content

# END #

That's it.

Friday, 29 November 2024

[NSX Edge] NIC ring out of buffer

Issue


Recently I ran into this issue:
Description Edge NIC fp-eth1 transmit ring buffer has overflowed by 100.000000% on Edge node cd5c7792-1e62-48bb-b6da-40e8eea154b7. The missed packet count is 420560 and processed packet count is 0.

Recommended Action 1. If a lot of VMs are accommodated along with edge by the hypervisor then edge VM might not get time to run, hence the packets might not be retrieved by hypervisor. Then probably migrating the edge VM to a host with fewer VMs. 2. Increase the ring size by 1024 using the command `set dataplane ring-size tx `. If even after increasing the ring size, the issue persists then contact VMware Support as the ESX side transmit ring buffer might be of lower value. If there is no issue on ESX side, it indicates the edge needs to be scaled to a larger form factor deployment to accommodate the traffic. 3. If the alarm keeps on flapping, i.e., triggers and resolves very soon, then it is due to bursty traffic. In this case check if tx pps using the command `get dataplane cpu stats`. If it is not high during the alarm active period then contact VMware Support. If pps is high it confirms bursty traffic. Consider suppressing the alarm. NOTE - There is no specific benchmark to decide what is regarded as a high pps value. It depends on infrastructure and type of traffic. The comparison can be made by noting down when alarm is inactive and when it is active.

Solution


Checking for recommended actions ...
edge-bridge-cluster3-A> get dataplane | find ring
Wed Nov 20 2024 UTC 12:16:02.831
Bfd_ring_size      : 512   
Lacp_ring_size     : 512   
Learning_ring_size : 512   
Livetrace_ring_size: 512   
Rx_ring_size       : 4096  
Slowpath_ring_size : 512   
Tx_ring_size       : 4096
... the rx and tx ring sizes were already at 4096. Looking at the flow cache ...
edge-bridge-cluster3-A> get dataplane flow-cache config
Wed Nov 20 2024 UTC 12:16:50.970
Enabled            : true
Mega_hard_timeout_ms: 4955
Mega_size          : 262144
Mega_soft_timeout_ms: 4904
Micro_size         : 262144
... we saw that the value can be increased up to 524288. We increased it and restarted the dataplane service:
edge-bridge-cluster3-A> set dataplane flow-cache-size 524288
edge-bridge-cluster3-A> restart service dataplane

edge-bridge-cluster3-A> get dataplane flow-cache config
Wed Nov 20 2024 UTC 12:25:38.810
Enabled            : true
Mega_hard_timeout_ms: 4955
Mega_size          : 524288
Mega_soft_timeout_ms: 4904
Micro_size         : 524288
What does flow-cache do?
Flow Cache helps reduce CPU cycles spent on known traffic flows. NSX Edge node uses flow cache to achieve high packet throughput. This feature records actions applied on each flow when the first packet in the flow is processed so that subsequent packets can be processed using a match-and-action procedure.

When key collision rates are high, increasing the flow cache size helps process packets more efficiently. However, increasing the cache size might impact memory consumption. Typically, the higher the hit rate, the better the performance.

After this change I also proceeded to free up the host where the Edge was running by migrating the VMs elsewhere.
This was enough to solve my problem. The VMs on the bridged segments became available again.

Looking around, we also found a nice analysis by Giuliano of a similar issue at this link.

Further helpful information on the same topics can be found at the following links:

https://docs.vmware.com/en/VMware-Telco-Cloud-Platform/3.0/telco-cloud-platform-5g-edition-data-plane-performance-tuning-guide/GUID-64EEE4A0-23C1-49DB-AE4D-F235F8AB8EAB.html

https://knowledge.broadcom.com/external/article/330475/edge-nic-out-of-receive-buffer-alarm.html

https://knowledge.broadcom.com/external/article?legacyId=80233

That's it.

Wednesday, 20 November 2024

Failed to deploy OVF due to ThrowableProxy.cause response code 407

Issue


New OVA appliance deployment fails with the following error message:

Failed to deploy OVF package. ThrowableProxy.cause A general system error occurred: Transfer failed: Invalid response code: 407, note that HTTP/s proxy is configured for the transfer.


Solution


Check whether a proxy is configured on the vCenter side.
If so, connect to the VAMI (https://<vCenter_IP>:5480), log in, and click EDIT under Networking in the Proxy Settings area...
... disable it (as in the picture below) and click SAVE.
Then try again to deploy the OVA. Now it should work.
Once the deployment is complete, remember to re-enable the proxy.
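If you prefer the command line, the same check can also be done against the appliance management REST API. A minimal sketch, assuming the standard vCenter REST endpoints and with FQDN and credentials as placeholders:
# Obtain an API session token, then read the proxy configuration
TOKEN=$(curl -sk -u 'administrator@vsphere.local:<password>' -X POST 'https://<vCenter_FQDN>/rest/com/vmware/cis/session' | jq -r '.value')
curl -sk -H "vmware-api-session-id: ${TOKEN}" 'https://<vCenter_FQDN>/rest/appliance/networking/proxy' | jq .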

That's it.

Thursday, 31 October 2024

[NSX] Dump of the VMs list and its Security TAGs

Issue


Recently I had to create a script that periodically takes a snapshot of the current situation regarding the list of Virtual Machines and their associated NSX Security Tags, via REST API calls.
To do this, I had to write some lines of code. Let's see below the call to get the list of VMs and how it works...

Solution


The script must run on an Ubuntu machine, so I decided to write a bash script and use cURL. Information regarding the API calls can be found at the following link: https://developer.vmware.com/apis
More specific information regarding NSX can be found here.

Our call to retrieve the information looks like this:
curl -sk -u username:password -X GET 'https://{nsx_manager}/policy/api/v1/infra/realized-state/virtual-machines' > vms-all.json
... and from the terminal, using the "jq" command (with the option -r '.results[]') ...
jq -r ' .results[]' vms-all.json
... the output is as below ...


Now that we have the whole VM list in a file, we can query it with 'jq' and retrieve the information we need, such as:
  • we can retrieve the whole JSON configuration of a specific VM (for example "db-01a") ...
    jq -r ' .results[] | select(.display_name=="db-01a") ' vms-all.json

  • we can search for the tags of a specific VM (for example "db-01a")
    jq -r ' .results[] | select(.display_name=="db-01a") | .tags' vms-all.json

  • we can count how many VMs are present in the list
    jq -c ' .results[] | [.display_name, .tags]' vms-all.json | wc -l

  • we can list all VMs with their tags
    jq -c ' .results[] | [.display_name, .tags]' vms-all.json

  • .... and so on.
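
Putting it all together, below is a minimal sketch of the periodic snapshot script (the NSX Manager FQDN, credentials and output directory are placeholders; schedule it with cron or a systemd timer as you prefer):

#!/usr/bin/env bash
# Hypothetical sketch: dump the VM list with its NSX Security Tags to timestamped files
NSX_MANAGER="nsx-mgr.vcf.sddc.lab"   # placeholder FQDN
CREDS="username:password"            # placeholder credentials
OUTDIR="/var/backups/nsx-vm-tags"    # placeholder output directory
TS=$(date +%Y%m%d-%H%M%S)

mkdir -p "${OUTDIR}"

# Full dump of the realized virtual machines, as shown above
curl -sk -u "${CREDS}" -X GET \
  "https://${NSX_MANAGER}/policy/api/v1/infra/realized-state/virtual-machines" \
  > "${OUTDIR}/vms-all-${TS}.json"

# One line per VM: display name plus its tags
jq -c '.results[] | [.display_name, .tags]' "${OUTDIR}/vms-all-${TS}.json" > "${OUTDIR}/vm-tags-${TS}.jsonl"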

That's it.

Wednesday, 4 September 2024

[vSAN - SRM] - Reduced availability without rebuild

Issue


One month ago I came across this issue. After a disaster recovery test performed via SRM, we encountered the error "Reduced availability without rebuild" on 27 objects; we tried to click "Repair objects immediately", without success.
There were no resyncing objects in progress.
The source and target infrastructures consist of two VCF 5.x environments based on vSAN, where Site Recovery Manager is used to replicate VMs.
This issue reduced the Health score to 60%.
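As a side note, the same condition can also be checked from the command line of any ESXi host in the cluster; for example (assuming a reasonably recent ESXi build):
esxcli vsan debug object health summary get
The command prints a per-health-state object count, including the "reduced availability with no rebuild" bucket.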
Below is the GSS analysis

Issue Clarification:
Customer has 27 objects, all using the same policy, showing reduced availability with no rebuild.

Issue Verification:
We verified that the cluster has 27 objects that are in reduced availability with no rebuild.

Cause Identification:
We found that the customer is using a ftt2 policy with force provisioning for these objects.

Cause Justification:
As we see in the chart on https://docs.vmware.com/en/VMware-Cloud-on-AWS/services/com.vmware.vmc-aws-operations/GUID-EDBB551B-51B0-421B-9C44-6ECB66ED660B.html
In order to satisfy a ftt2 policy we will need 5 hosts.
Customer is using force provisioning, but this only provisions the object when the primary level of failures to tolerate cannot be met; it does not make the policy compliant until the policy requirements can be satisfied.

Force provisioning:
If the option is set to Yes, the object is provisioned even if the Primary level of failures to tolerate, Number of disk stripes per object, and Flash read cache reservation policies specified in the storage policy cannot be satisfied by the datastore. Use this parameter in bootstrapping scenarios and during an outage when standard provisioning is no longer possible.

The default No is acceptable for most production environments. vSAN fails to provision a virtual machine when the policy requirements are not met, but it successfully creates the user-defined storage policy.

Solution Recommendation:
Change the policy to match the current host configuration or add a host to match the policy

Solution Justification:
Once the policy is set to match the cluster the objects will be compliant and will no longer be in the reduced availability with no rebuild status.
Unfortunately, the recommendations did not solve my problem in this case.
I also modified the policy, created a new one, and re-applied the storage policy to those VMs/objects, without success.

Solution


We were able to solve the issue by performing the following steps:

Short Answer
  • Create a Protection Group and a new Recovery Plan on SRM;
  • Check that the VMs(/objects with the issue) were correctly replicated (in sync) with the target;
  • Migrate the VMs(/objects with the issue) to the new Protection Group;
  • Check the configuration of the VMs with Edit Settings (no disks connected);
  • Activate the Recovery Plan test; the VMs should power on correctly;
  • Check again via Edit Settings whether the disks are present, and indeed they are correctly attached.
  • Verify in "vSAN Object Health" there are no more objects in "reduced availability with no rebuild".
  • Perform the test clean up and migrate the VMs back to the original Protection Group.
  • Re-check the configuration of the VMs with Edit Settings (no disks connected; maybe they are managed by SRM and connected only when required).
  • Perform a double check and make sure everything is working fine.


Long Answer (with screenshots and details)
  • Create a Protection Group and a new Recovery Plan on SRM:
     - Connect to the Site Recovery Manager where the VMs with the error are present
     - Create a new PG; in my case I named it "BA-Test-vSAN-Issue"
     - Create a new RP; in my case I named it "TEST-BA-MGMT_RecoveryPlan"
     - In "Virtual Objects" we can see that the VM is in the "Reduced availability without rebuild" state

  • Check that the VMs(/objects with the issue) were correctly replicated (in sync) with the target:
     - Check on the current PG that the virtual machine is synchronized

  • Migrate the VMs(/objects with the issue) to the new Protection Group (BA-Test-vSAN-Issue):
     - Edit the original PG
     - Unflag the VM (to remove it from the PG)
     - Edit the new PG and Add the VM

  • Check the configuration of the VMs with Edit Settings (no disks connected):
     - Edit Settings on the VM and check it

  • Activate the Recovery Plan test; the VMs should power on correctly:
     - The virtual machine is synchronized
     - Go to the new Recovery Plan (in my case "TEST-BA-MGMT_RecoveryPlan") and activate it

  • Check again via Edit Settings whether the disks are present, and indeed they are correctly attached:
     - When the RP is in progress and the VM is turning on, check the presence of the disk on the VM via Edit Settings
     - Wait until the test is completed
     - Check that the machine is up and running

  • Verify in "vSAN Object Health" there are no more objects in "reduced availability with no rebuild":
     - Verify that the VM is no longer present in the object list with "reduced availability with no rebuild"

  • Perform the test clean up and migrate the VMs back to the original Protection Group:
     - As soon as the Cleanup procedure is completed ...
     - ... and the VM is in Ready state, move it back to the original Protection Group

  • Re-check the configuration of the VMs with Edit Settings (no disks connected; maybe they are managed by SRM and connected only when required):

  • Perform a double check and make sure everything is working fine:
     - Once you have performed the above steps for all the VMs with the problem, you should see the "Cluster Health score" at 100% as shown in the image below



Reactivating the entire Recovery Plan would probably have solved the "Reduced availability without rebuild" issue.
However, this approach is more granular and aims to solve the problem for a single VM without negatively impacting the performance of the entire target environment. It is not mandatory to proceed one VM at a time.
Obviously, it is possible to migrate multiple VMs into the temporary Protection Group at the same time, power them on together via the recovery plan, and then bring them back into the original Protection Group once the problem has been resolved.

That's it.