VMware ESXi shutdown incomplete for newer hosts

APC UPS Data Center & Enterprise Solutions Forum

ChrisHaag_apc

This was originally posted on APC forums on 10/20/2015


There are other threads about VMware shutdowns not working as expected. I cannot provide a solution here, but this post may contain a hint.

History and Symptoms

After a power outage, two of our VMware ESXi hosts did not shut down completely. All VMs were shut down and the hosts were put into maintenance mode. However, the hosts themselves were not shut down.

Other VMware ESXi hosts were completely shut down, as expected.

Power came back very soon, while the UPS was still running. I manually took the two hosts out of maintenance mode and started their VMs by hand. The other hosts booted fine automatically.

Lots of login errors were logged in the vSphere Web Client after the hosts and VMs were powered up again.  The error message was like “user@[IP vMA with PCNS] could not log in. Insufficient permissions”. The errors were only for the two vSphere hosts.

Analysis Part 1 – Login Errors in vSphere Web Client

With the information from the thread

http://forums.apc.com/spaces/7/ups-management-devices-powerchute-software/forums/general/11365/power...

the login error issue visible in the vSphere Web client was fixed quickly.

PCNS error log after the reboot, multiple entries of the same type:

ERROR pool-80-thread-1 com.apcc.m11.components.Shutdowner.Hosts.ESXManagedHost - Failed to take host out of maintenance mode: myhost14.my.domain

Obviously PCNS was trying to take the two hosts out of maintenance mode. I could stop this by emptying /opt/APC/PowerChute/group1/VirtualizationFileStore.properties. From the thread above I learned that this file is the “things to be done after reboot” list.
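For anyone who wants to repeat that step with a safety net, here is a minimal Python sketch that keeps a copy of the file before emptying it. The path is the one quoted above. The helper name `empty_with_backup` and the `.bak` suffix are my own choices, and I am assuming (the post does not say) that it is sensible to do this while PowerChute is not actively processing the list.

```python
import shutil
from pathlib import Path

# Path as quoted in the post; adjust it for your own installation.
STORE = Path("/opt/APC/PowerChute/group1/VirtualizationFileStore.properties")


def empty_with_backup(path: Path) -> Path:
    """Copy `path` to `<name>.bak` next to it, then truncate the original.

    Returns the path of the backup copy.
    """
    backup = path.with_name(path.name + ".bak")
    shutil.copy2(path, backup)  # keep the pending-task list for later inspection
    path.write_text("")         # leave an empty file behind
    return backup
```

Calling `empty_with_backup(STORE)` would leave the old content in `VirtualizationFileStore.properties.bak` in the same directory, so it can be restored if emptying the file turns out to be a mistake.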

The question is why PCNS could not log into the two hosts. There is no more information about that in the error log of PCNS. I assume that PCNS was trying to log into the two hosts directly – although the vCenter Server Appliance was up and running. Another option is that – for whatever reason – the vCenter Server Appliance could not “forward” the command of PCNS.

Analysis Part 2 – No complete shutdown of two hosts

PCNS error log at the time of the shutdown:

ERROR Thread-33 com.apcc.m11.components.WebServer.util.virtualization.VMWareConnection - getESXiHostConnection, Host myhost14.my.domain - (NoPermission or InvalidLogin) [no details available]

ERROR Thread-33 com.apcc.m11.components.Shutdowner.Hosts.ESXManagedHost - checkForVCSAVMAndHostInCriticalHosts - cannot obtain HostSystem using findByIP and findByDnsName for critical host myhost14.my.domain

I assume the second line is just a consequence of the first line. Again I assume that PCNS tried to log into the vSphere host directly to shut down the host – although the vCenter Server appliance was still up and running. Still the other alternative I see is that the vCenter Server Appliance did not forward the command.
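To see quickly which hosts are affected, the PCNS error log can be scanned for the messages quoted in this post. The following Python sketch is only an illustration: the two patterns are derived from the log lines shown here, the function name `hosts_with_errors` is made up, and a real log may contain other message formats that it does not recognise. The "cannot obtain HostSystem" message is deliberately not counted, since it looks like a follow-up of the login failure.

```python
import re
from collections import Counter

# Patterns built from the two host-related messages quoted in this post.
HOST_PATTERNS = [
    re.compile(r"getESXiHostConnection, Host (?P<host>\S+) - \((?P<reason>[^)]*)\)"),
    re.compile(r"Failed to take host out of maintenance mode: (?P<host>\S+)"),
]


def hosts_with_errors(lines):
    """Count matching ERROR lines per host name."""
    counts = Counter()
    for line in lines:
        if "ERROR" not in line:
            continue
        for pattern in HOST_PATTERNS:
            match = pattern.search(line)
            if match:
                counts[match.group("host")] += 1
                break  # one line counts once, even if several patterns match
    return counts
```

Hosts that never show up in the result either shut down cleanly or failed in a way these two patterns do not cover.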

 

Analysis Part 3 – Differences of the Servers

So far the issue is narrowed down to a login problem. The question is why the login problem happened on two servers, but not on the others.

The two affected hosts are newer. I compared the permission settings of the older and the two newer hosts.

The user configured in PCNS is “vmware-pcns@my.win.domain”. This user is a member of the group “VMware-Administrators@my.win.domain”, which we use to set permissions.

In the vSphere Web Client the group is displayed as “MY.WIN.DOMAIN\VMware-Administrators” in the permissions tab. This is displayed for the older and the two newer hosts.

I checked the permissions again using the (old) vSphere C# Client. For the older host the entry is “MYWINDOMAIN\vmware-administrators”. For the two newer hosts there is no such entry at all.

We write installation and configuration blogs for each machine in our IT landscape. Looking back through them showed that we added the older hosts to the Active Directory using the vSphere C# Client, while the two newer ones were added using the vSphere Web Client.

The difference in capitalization between the two displayed entries (mixed case vs. lower case only) may be another indication.

 

Idea for PCNS Improvement

This is just a side note on the case. I think it is essential to show whether PCNS can log into each vSphere host individually. The individual login is used when the vCenter Server (Appliance) is not available, so it is important to know whether it works.

There is a “Check Details” button to verify whether PCNS can connect to the UPS. And I can check indirectly whether PCNS can log into the vCenter Server (Appliance) by clicking the menu item “Host Protection”. There is no way to check the individual host login from the PCNS web interface.

A button “Check Connectivity” would make sense. It might display:

1) Connection to the UPS

2) Connection to the vCenter Server (Appliance)

3) Connection to each vSphere ESXi host individually

A green “OK” checkmark or a red “X” behind each line would indicate the connectivity.
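The proposed output could look roughly like the following Python sketch. It is not how PCNS works internally; `Target`, `check` and `connectivity_report` are hypothetical names, and each `check` stands for whatever probe is appropriate (a UPS query, a vCenter login, a direct ESXi login) and is expected to raise an exception on failure.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Target:
    label: str                  # e.g. "UPS" or "ESXi host myhost14.my.domain"
    check: Callable[[], None]   # hypothetical probe; raises on failure


def connectivity_report(targets: Iterable[Target]) -> list[str]:
    """Run every probe and return one 'OK' or 'X' line per target."""
    lines = []
    for target in targets:
        try:
            target.check()
            lines.append(f"OK  {target.label}")
        except Exception as exc:  # any failure is reported, never raised
            lines.append(f"X   {target.label} ({exc})")
    return lines
```

The important design point is that one failing host does not stop the remaining checks, so the administrator sees the state of every connection in one pass.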

 

Environment

We use PCNS 4.1, vSphere 6 and the latest version of the vCenter Server Appliance. The vCenter Server Appliance and the vMA with PCNS run on one of the older hosts. There is no other (global) user to set permissions. The group mentioned above is set globally.

BillP

This reply was originally posted by Bill on APC forums on 10/20/2015


Hi,

Can you log into each of the hosts using vmware-pcns@my.win.domain credentials? The following sequence is triggered when a critical UPS event such as UPS on Battery occurs.

1. PowerChute reports that the UPS is on battery.

2. Shutdown delay for the On Battery event elapses. PowerChute starts a maintenance mode task on each Host. At the same time it sends a command to turn off the UPS or Outlet Group.

3. PowerChute starts VM shutdown followed by vApp shutdown.

4. VM/vApp shutdown durations elapse and PowerChute gracefully shuts down the vCenter Server VM.

5. vCenter VM shutdown duration elapses. PowerChute starts executing the shutdown command file.

6. Shutdown command file duration elapses.

7. PowerChute shuts down the VMware hosts using the order on the VMware Host Protection page. (The host running vCenter is shut down second to last, before the host running PowerChute.)

8. UPS waits for the greater of Low Battery Duration/Maximum Required Delay (non-outlet-aware UPSs) or the Outlet Group Power Off Delay (initiated during step 2).

9. UPS turns off after the user-configurable Shutdown Delay time has elapsed or the Outlet Group turns off after the power off Delay elapses.

Once vCenter is powered down, PCNS communicates directly with each of the hosts. If it cannot communicate with a host, that host cannot be powered down. Common causes are a credential mismatch and/or AD being offline.
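The host ordering in step 7 can be illustrated with a short Python sketch. This is only my reading of the description above, not PowerChute code; the function name `shutdown_order` and the host names in the usage note are made up. It also covers the case where vCenter and PowerChute run on the same host, which then simply goes last.

```python
def shutdown_order(hosts, vcenter_host, powerchute_host):
    """Return hosts in their configured order, with the vCenter host
    second to last and the PowerChute host last."""
    special = {vcenter_host, powerchute_host}
    # All ordinary hosts keep the order from the Host Protection page.
    ordered = [h for h in hosts if h not in special]
    if vcenter_host in hosts and vcenter_host != powerchute_host:
        ordered.append(vcenter_host)
    if powerchute_host in hosts:
        ordered.append(powerchute_host)
    return ordered
```

For example, `shutdown_order(["h11", "h12", "h13", "h14"], "h11", "h12")` gives `["h13", "h14", "h11", "h12"]`. Every host in that list has to be reachable by direct login, because vCenter is already down by the time step 7 runs.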

ChrisHaag_apc

This was originally posted on APC forums on 10/23/2015


Hi Bill,

Thanks for the info about the shutdown procedure.

Check

I cannot log into the two newer hosts (internal #12 and #14) with the user “vmware-pcns@my.win.domain” using the vSphere C# client. I did not think about that test, because logging in over vCenter Server, also with VMware Workstation over vCenter Server, works fine. Thanks for the advice.

This is now a VMware issue, not an APC PCNS issue. I will keep writing, because someone might benefit from the solution.

Fix

For whatever reason it was not possible to add the group “VMware-Administrators@my.win.domain” to the permissions in the vSphere C# Client of hosts #12 and #14. The AD domain “my.win.domain” did not show up in the dropdown. Procedure to fix:

1) Remove host from AD using vSphere C# Client

That might work with the vSphere Web Client as well, I did not check it.

2) Remove (resource) entry for the host from AD on one of the Windows AD servers

Without removing the resource entry of the host from the AD, selecting the Windows domain is possible, but instead of the list of users/groups an error is displayed: ‘Call "UserDirectory.RetrieveUserGroups" for object "ha-user-directory" on ESXi "" failed’. This error has been discussed in the VMware forum. Removing the host from the AD seems to help.

3) Wait 2 minutes

4) Add host to AD using vSphere C# Client

This has to be done using the vSphere C# Client! Otherwise I was not able to select a Windows group from the dropdown when setting permissions. Actually VMware advises so, see:

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=207536...

(Didn't they say the C# Client is deprecated and advise using the Web Client?)

5) Wait 2 minutes

AD needs to think about changes…

6) Add group “VMware-Administrators@my.win.domain” to the permission tab with administrator role using vSphere C# Client.

Result: Now logging into the vSphere C# Client with the user “vmware-pcns@my.win.domain” works. I have not tested a simulated power outage shutdown yet.

Side note: Also the “gss acquire cred failed” error, when trying to log in with vSphere C# Client using the “Use Windows Credentials” checkbox, disappeared after the procedure. This has been discussed in a couple of threads in the VMware forums without a solution.

VMware

It appears that VMware is using two kinds of “permission boxes”: one for the vCenter Server and one for what is set by the vSphere C# Client. See:

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=200633...

New Problem with Host #11

We had a broken SD card for our host #11. All of our hosts boot from SD card. When we received the replacement SD card, we set up a blank ESXi and restored the config backup using vicfg-cfgbackup.

The host was part of the AD. The group “VMware-Administrators@my.win.domain” showed up in the permission tab in the vSphere C# Client. All looked good.

However, I could not log into the host using the vSphere C# Client and Windows credentials anymore. And of course not with the user “vmware-pcns@my.win.domain”. It seems that the AD registration was broken after the restore.

After applying the procedure above all worked fine.

Feature Wish List PCNS

The additional experience with host #11 shows that we cannot rely on the login of the PCNS Windows user. It might break without anybody noticing. The VMware Windows authentication does not appear bulletproof to me. That is not good, but it is no risk when Windows authentication is only used by real users. But since PCNS uses Windows authentication during the shutdown on power outages, it is a risk.

It might be a good idea to build a regular self-test into PCNS, similar to the UPS self-test, which runs on a schedule and can also be started manually. I think PCNS should check the connectivity of the UPS, the vCenter and all hosts automatically on a daily basis. Errors should be reported to the administrator by email. That would make the process much safer: at least the administrator would be informed when something is broken.
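Until something like that exists, the reporting part could be approximated outside PCNS. The Python sketch below only builds the notification mail from a set of failed checks; how the checks themselves are performed, the function name `build_selftest_mail` and the addresses in the usage note are assumptions of mine, not anything PCNS provides.

```python
from email.message import EmailMessage


def build_selftest_mail(failures, sender, recipient):
    """Return an EmailMessage listing failed checks, or None if all passed.

    `failures` maps a target name (UPS, vCenter, host) to an error text.
    """
    if not failures:
        return None  # nothing to report, so no mail is built
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = f"PCNS self-test: {len(failures)} check(s) failed"
    body = "\n".join(f"- {target}: {reason}" for target, reason in failures.items())
    msg.set_content(body)
    return msg
```

A daily job (for example from cron, which is again my assumption) could call this with something like `build_selftest_mail({"myhost14.my.domain": "InvalidLogin"}, "pcns@my.domain", "admin@my.domain")` and, if the result is not `None`, hand it to `smtplib.SMTP(...).send_message(msg)`.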

Christian