Pre-upgrade considerations in Multi-vCenter environments

With vSphere 7.0 released on April 2nd, 2020 and vSphere 6.0 reaching its end of general support on March 12th, 2020, many environments are currently upgrading their vSphere version, either from 6.0 to 6.5/6.7 (to stay supported) or to 7.0 to take advantage of the new features, such as native Kubernetes integration.

However, we have been receiving an increasing number of Support Requests for issues after upgrades in Multi-vCenter environments that use Enhanced Linked Mode (from now on, ELM), especially when the environment has more than one Platform Services Controller (from now on, PSC), whether embedded or external.

The goal of this article is to help you understand your upgrade roadblocks PRIOR to actually doing the upgrade, so you don’t incur any downtime and can proactively fix everything that needs fixing before upgrading.

For the purposes of this article, I will demonstrate everything with a Demo Environment, so that everything is clearer.

Demo Environment

Super simple environment!


Two vCenter Server Appliances with Embedded PSC, in a single SSO domain.
I’m going to demonstrate the issues we could run into if we upgrade an environment whose PSC replication is not in a healthy state.

What’s PSC Replication?

As you know, data replicates between the PSC instances (embedded in this scenario) when Enhanced Linked Mode is configured.

What data is replicated?

  • Users and roles
  • Trusted Roots store certificates
  • Lookup Service service registrations
  • Computer accounts
  • Domain controller accounts

And many, many more things. VMDIR (VMware Directory Service) is a Multi-master LDAP database.

I did mention Lookup Service service registrations… what are those?

Lookup Service

The Lookup Service is a component that registers the location of vSphere components so they can securely find and communicate with each other. This includes every internal service as well as some 2nd Party Tools (such as NSX, vSphere Replication, SRM) and 3rd Party Tools (storage plugins, for example).

This is the number of Service Registrations per Service Type in our Demo environment:

  2         Service Type: applmgmt
  2         Service Type: certificatemanagement
  2         Service Type: cis.cls
  2         Service Type: cis.vmonapi
  2         Service Type: client
  2         Service Type: com.vmware.vsan.dp
  2         Service Type: com.vmware.vsphere.client
  2         Service Type: cs.authorization
  2         Service Type: cs.componentmanager
  2         Service Type: cs.ds
  2         Service Type: cs.eam
  2         Service Type: cs.identity
  2         Service Type: cs.inventory
  2         Service Type: cs.keyvalue
  2         Service Type: cs.license
  2         Service Type: cs.perfcharts
  2         Service Type: cs.vapi
  2         Service Type: cs.vsm
  2         Service Type: imagebuilder
  2         Service Type: messagebus.config
  2         Service Type: mixed
  2         Service Type: phservice
  2         Service Type: rbd
  2         Service Type: sca
  2         Service Type: sms
  2         Service Type: sso:admin
  2         Service Type: sso:groupcheck
  2         Service Type: sso:sts
  2         Service Type: topologysvc
  2         Service Type: vcenterserver
  2         Service Type: vcha
  2         Service Type: vcIntegrity
  2         Service Type: vsan-dps
  2         Service Type: vsan-health
  2         Service Type: vsphereclient
  2         Service Type: vsphereui

You can see services such as vsphereclient (vSphere Flash Client), vsphereui (vSphere HTML5 Client) and vcenterserver (vCenter Server), among others.

You can also see that there are two of every registration. Every PSC has its own Lookup Service, but they replicate the data through VMDIR, so every registration exists on every PSC.
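The per-type count above can be reproduced from raw Lookup Service output with a few lines of Python. This is only a minimal sketch: it assumes each registration in the lstool.py output contains a line starting with "Service Type:", as in the listings in this article, and the function names and the shortened sample text are made up for illustration. A count that differs from the number of nodes is only a hint worth investigating, not proof of a problem.

```python
from collections import Counter


def count_service_types(lstool_output):
    """Count 'Service Type:' lines per type in lstool.py list output."""
    counts = Counter()
    for raw_line in lstool_output.splitlines():
        line = raw_line.strip()
        if line.lower().startswith("service type:"):
            # Split only on the first colon so types like "sso:admin" stay whole.
            counts[line.split(":", 1)[1].strip()] += 1
    return counts


def unexpected_counts(counts, expected):
    """Return the service types whose count differs from the expected one."""
    return {stype: n for stype, n in counts.items() if n != expected}


# Shortened, hand-written sample in the same line format as the article's output.
SAMPLE = """\
Service Type: vcenterserver
Service Type: sso:admin
Service Type: vcenterserver
Service Type: sso:admin
Service Type: vsphereui
"""

counts = count_service_types(SAMPLE)
print(unexpected_counts(counts, expected=2))
```

In the two-node demo environment, every type is expected twice, so the sample flags vsphereui as the only type with a single registration.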

Let’s take a look at the vCenter Server registrations:

I’m running the following command on one of the vCenter Servers (with Embedded PSC)

/usr/lib/vmidentity/tools/scripts/lstool.py list --url http://localhost:7080/lookupservice/sdk | grep -i "Service type: vCenterServer" -A9 | egrep "Service Type:|Version|URL"

For the purposes of this article, I’m only interested in the Service Type, Version and URL. However, a service registration contains much more data than that, such as the Service Registration ID, the Node ID, and the URLs for the different endpoints, each with its own SSL certificate. We’re not going to dive into that here.

Service Type: vcenterserver
Version: 6.7
URL: https://vcsa1.gsslabs.org:443/sdk
Service Type: vcenterserver
Version: 6.7
URL: https://vcsa2.gsslabs.org:443/sdk

We can see that every registration has a URL and a version. This is really important! Keep it in the back of your mind, because we’re going to come back to it!
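If you want to keep these values around for later comparison, the filtered output can be turned into a small URL-to-version mapping. The following is a rough sketch, assuming the three-line "Service Type / Version / URL" layout produced by the grep pipeline above; the function name is hypothetical and not part of any VMware tool.

```python
def versions_by_url(filtered_output):
    """Map each vcenterserver endpoint URL to its registered version."""
    versions = {}
    service_type = None
    version = None
    for raw_line in filtered_output.splitlines():
        # partition() splits on the first colon only, so URLs stay intact.
        key, sep, value = raw_line.strip().partition(":")
        if not sep:
            continue
        key, value = key.strip().lower(), value.strip()
        if key == "service type":
            # A new registration starts; forget the previous version.
            service_type, version = value, None
        elif key == "version":
            version = value
        elif key == "url" and service_type == "vcenterserver":
            versions[value] = version
    return versions


# Output copied from the demo environment shown in this article.
SAMPLE = """\
Service Type: vcenterserver
Version: 6.7
URL: https://vcsa1.gsslabs.org:443/sdk
Service Type: vcenterserver
Version: 6.7
URL: https://vcsa2.gsslabs.org:443/sdk
"""

for url, version in versions_by_url(SAMPLE).items():
    print(version, url)
```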

PSC Replication Status

As mentioned previously, the VMDIR database replicates between the PSCs.

You can check the replication status of any PSC instance with the following command:

/usr/lib/vmware-vmdir/bin/vdcrepadmin -f showpartnerstatus -h localhost -u administrator -w SSO_Password

This is the output on vcsa1.gsslabs.org in our Demo environment:


Partner: vcsa2.gsslabs.org
Host available: Yes
Status available: Yes
My last change number: 10360
Partner has seen my change number: 10360
Partner is 0 changes behind.

This means that outgoing replication for this node is working. However, it does not mean that replication is working correctly in both directions. To confirm that, you need to run the same command on its replication partner:

Partner: vcsa1.gsslabs.org
Host available: Yes
Status available: Yes
My last change number: 10351
Partner has seen my change number: 10351
Partner is 0 changes behind.

This is good: our environment is healthy, replication-wise!
But what if it wasn’t?

  • Host available: No would mean that the replication partner is not reachable through the network.
  • Status available: No would mean that the replication partner is reachable through the network, but its VMDIR state is either read-only or null (this is bad!).
  • A large number of changes behind that never decreases could mean that the local node is in a read-only or null state (this is also really bad!).
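The three conditions above can be checked automatically against saved showpartnerstatus output. This is only a sketch: it assumes the output always follows the six-line layout shown in this article, the function names are invented for illustration, and the unhealthy sample is hand-written rather than captured from a real system.

```python
import re


def parse_partner_status(text):
    """Parse showpartnerstatus output into one dict per replication partner."""
    partners = []
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if line.startswith("Partner:"):
            current = {"partner": line.split(":", 1)[1].strip()}
            partners.append(current)
        elif current is None:
            continue
        elif line.startswith("Host available:"):
            current["host_available"] = line.split(":", 1)[1].strip().lower() == "yes"
        elif line.startswith("Status available:"):
            current["status_available"] = line.split(":", 1)[1].strip().lower() == "yes"
        else:
            match = re.match(r"Partner is (\d+) changes? behind", line)
            if match:
                current["changes_behind"] = int(match.group(1))
    return partners


def find_problems(partners):
    """Return one human-readable finding per unhealthy partner."""
    problems = []
    for p in partners:
        if not p.get("host_available", False):
            problems.append(f"{p['partner']}: host not reachable")
        elif not p.get("status_available", False):
            problems.append(f"{p['partner']}: VMDIR status unavailable")
        elif p.get("changes_behind", 0) > 0:
            problems.append(f"{p['partner']}: {p['changes_behind']} changes behind")
    return problems


# Output copied from the demo environment in this article.
HEALTHY = """\
Partner: vcsa2.gsslabs.org
Host available: Yes
Status available: Yes
My last change number: 10360
Partner has seen my change number: 10360
Partner is 0 changes behind.
"""

# Hypothetical unhealthy output, written by hand for illustration only.
UNHEALTHY = """\
Partner: vcsa2.gsslabs.org
Host available: No
"""

print(find_problems(parse_partner_status(HEALTHY)))
print(find_problems(parse_partner_status(UNHEALTHY)))
```

Remember that a clean result on one node only covers that node's outgoing replication, so the same check has to be run against the output collected on every partner.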

So how do we check the VMDIR state if the “showpartnerstatus” command shows any of these errors?
By running the following command:

echo 6 | /usr/lib/vmware-vmdir/bin/vdcadmintool
You will get an output similar to:
VmDir State is - Normal
The state could also be Null, Read-Only or Standalone. For the purposes of this document, all three are bad!
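If you collect this output from several nodes, a tiny helper can pull out the state and treat anything other than Normal as a blocker. This is a sketch under the assumption that the state line always looks like the "VmDir State is - Normal" line shown above; the exact spelling of the other states in real output may differ, and the function names are made up.

```python
import re


def parse_vmdir_state(tool_output):
    """Extract the state from a 'VmDir State is - <state>' line, or None."""
    match = re.search(r"VmDir State is - (.+)", tool_output)
    return match.group(1).strip() if match else None


def is_safe_to_proceed(state):
    """Only the Normal state is treated as healthy; a missing state is not."""
    return state is not None and state.lower() == "normal"


state = parse_vmdir_state("VmDir State is - Normal")
print(state, is_safe_to_proceed(state))
```

Treating a missing or unrecognized state as unsafe is a deliberate choice here: for a pre-upgrade check, a false alarm is much cheaper than a missed problem.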

But how does a PSC get into this state?

After a PSC (either embedded or external) is restored from a snapshot, image-level backup, file-level backup, or VM-based backup, its Update Sequence Number (USN) value is lower than that of its replication partners. This results in the replication partners being out of synchronization with the restored node.

This is why, when snapshotting a Multi-vCenter environment, you should always do it with all nodes powered off. And if you restore one of the nodes to a snapshot, you have to restore all of the involved nodes. This also applies to backups!

What can broken replication affect?

Replication issues are often called a “Silent Killer” because you don’t notice that replication is broken until you want to make a change in the SSO environment. Such changes include adding a new 2nd or 3rd party tool, creating local users or roles in the SSO domain, installing a new vCenter or PSC, converging from External PSC to Embedded PSC, and the one we’re discussing in this document: upgrading!

So let’s go back to our demo environment…

We’re now going to upgrade vcsa1.gsslabs.org. The upgrade succeeds…
Remember what we discussed about the versions?
This is what vcsa1.gsslabs.org (the upgraded one) now sees in Lookup Service:

Service Type: vcenterserver
Version: 7.0
URL: https://vcsa1.gsslabs.org:443/sdk
Service Type: vcenterserver
Version: 6.7
URL: https://vcsa2.gsslabs.org:443/sdk

This is only a simple change to demonstrate the issue. The same happens for every other internal service and, in the case of an upgrade from vSphere 6.7 to 7.0, the upgrade also creates, renames and re-registers a number of other services, since the whole VMDIR structure changed.

This is fine! When we log in to vcsa1.gsslabs.org, we see both vCenter Servers…
But what happens if we log in to vcsa2.gsslabs.org? We see that vcsa1.gsslabs.org is not showing up!

So we go to check the Lookup Service entries, and we find the following…

Service Type: vcenterserver
Version: 6.7
URL: https://vcsa1.gsslabs.org:443/sdk
Service Type: vcenterserver
Version: 6.7
URL: https://vcsa2.gsslabs.org:443/sdk


Since replication was not working, vcsa2.gsslabs.org never received the changes that vcsa1.gsslabs.org made during the upgrade. So when the Web Client on vcsa2.gsslabs.org tries to contact the vCenter instance on vcsa1.gsslabs.org, there is a version mismatch, and therefore it does not load it.
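This kind of divergence is easy to spot once the registrations seen by each node are side by side. Here is a minimal sketch, assuming you have already collected a URL-to-version mapping from the Lookup Service on each node (the values below are the ones from this article's demo environment; the function name is hypothetical).

```python
def diff_views(view_a, view_b):
    """Return {url: (version_in_a, version_in_b)} for every entry that differs."""
    mismatches = {}
    for url in sorted(set(view_a) | set(view_b)):
        a, b = view_a.get(url), view_b.get(url)
        if a != b:
            mismatches[url] = (a, b)
    return mismatches


# What each node sees after upgrading vcsa1 with replication broken.
seen_by_vcsa1 = {
    "https://vcsa1.gsslabs.org:443/sdk": "7.0",
    "https://vcsa2.gsslabs.org:443/sdk": "6.7",
}
seen_by_vcsa2 = {
    "https://vcsa1.gsslabs.org:443/sdk": "6.7",
    "https://vcsa2.gsslabs.org:443/sdk": "6.7",
}

for url, (v1, v2) in diff_views(seen_by_vcsa1, seen_by_vcsa2).items():
    print(f"{url}: vcsa1 sees {v1}, vcsa2 sees {v2}")
```

With healthy replication both views are identical and the function returns an empty dict; any entry in the result means the two Lookup Service views have drifted apart.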

If you now upgrade vcsa2.gsslabs.org, the same thing is going to happen, and both are going to show something like this…

VCSA1
Service Type: vcenterserver
Version: 7.0
URL: https://vcsa1.gsslabs.org:443/sdk
Service Type: vcenterserver
Version: 6.7
URL: https://vcsa2.gsslabs.org:443/sdk

VCSA2
Service Type: vcenterserver
Version: 6.7
URL: https://vcsa1.gsslabs.org:443/sdk
Service Type: vcenterserver
Version: 7.0
URL: https://vcsa2.gsslabs.org:443/sdk


Effective immediately, ELM is officially broken. These vCenters won’t see each other in the Web Client, let alone replicate VMDIR changes.

And this state is not easily fixable. It would likely involve cleaning up VMDIR on both sides and then executing a cross-domain repoint between the two nodes. Now imagine that, instead of this simple environment, you have a 6 vCenter environment and you run into these issues. Can you imagine the trouble you would get into?

OK, so now what do we do?

Now that the impact of broken PSC replication on upgrades is clear (it also affects convergence and many other SSO operations), here is what you can do to avoid being sucker punched by the upgrade process.

  • Check that replication between all your PSC instances is working correctly and showing 0 changes behind across the board. This is done using the vdcrepadmin command shown earlier.
  • If you run into any issue such as the ones already mentioned, check the VMDIR state using the vdcadmintool command shown earlier.
  • If you get any of the errors detailed in this article, please open a Support Request with VMware -> https://kb.vmware.com/s/article/2006985
    We have a multitude of internal tools that can help you fix the replication issues and get you into a healthy state before attempting any other disruptive process, such as upgrading!
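The checklist above can be condensed into a simple go/no-go gate once you have gathered the facts from every node. This is only an illustrative sketch: the data structure, the field names and the unhealthy values are invented for this example, and the facts themselves still have to come from the vdcrepadmin and vdcadmintool commands run on each node.

```python
def upgrade_blockers(nodes):
    """List every reason not to upgrade yet; an empty list means go."""
    blockers = []
    for name, facts in sorted(nodes.items()):
        if facts["vmdir_state"].lower() != "normal":
            blockers.append(f"{name}: VMDIR state is {facts['vmdir_state']}")
        for partner, behind in sorted(facts["changes_behind"].items()):
            if behind is None:
                # None stands for "no status could be obtained for this partner".
                blockers.append(f"{name}: no replication status for {partner}")
            elif behind > 0:
                blockers.append(f"{name}: {partner} is {behind} changes behind")
    return blockers


# Facts matching the healthy demo environment in this article.
healthy = {
    "vcsa1.gsslabs.org": {"vmdir_state": "Normal", "changes_behind": {"vcsa2.gsslabs.org": 0}},
    "vcsa2.gsslabs.org": {"vmdir_state": "Normal", "changes_behind": {"vcsa1.gsslabs.org": 0}},
}

# Hypothetical unhealthy variant, made up for illustration.
unhealthy = {
    "vcsa1.gsslabs.org": {"vmdir_state": "Normal", "changes_behind": {"vcsa2.gsslabs.org": 250}},
    "vcsa2.gsslabs.org": {"vmdir_state": "Read-Only", "changes_behind": {"vcsa1.gsslabs.org": None}},
}

print(upgrade_blockers(healthy))
for reason in upgrade_blockers(unhealthy):
    print(reason)
```

If the list is not empty, stop there and open a Support Request instead of proceeding with the upgrade.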

Closing Note

I hope this blog post (my debug blog post!) is helpful for everyone who is running into these situations. The idea was to demonstrate a very real issue you might hit during an upgrade, using something as simple as the version change in the vCenter Server Service Registration.

Hopefully this will help you avoid many critical issues in Multi-vCenter environments.

Regards,
