APC UPS Data Center & Enterprise Solutions Forum
Schneider, APC support forum to share knowledge about installation and configuration for Data Center and Business Power UPSs, Accessories, Software, Services.
Posted: 2024-03-21 12:03 PM . Last Modified: 2024-03-22 06:05 AM
Hi all, posting this here to see if anyone has heard of this behavior or experienced something like it. In the past, when swapping an NMC between an old UPS and a new UPS, I have seen the management network get flooded with some type of traffic from the card, which causes communication issues. This usually shows up as all monitored devices dropping off the network for a minute or two and then reconnecting. Previously it was isolated to the UPS management network, so there was no real issue. Until today.
The UPS management card at a remote site crashed or otherwise went offline. We received these log events (newest to oldest):
03/21/2024 11:49:45 Device UPS: A discharged battery condition no longer exists.
03/21/2024 11:47:56 Device UPS: The battery power is too low to support the load; if power fails, the UPS will be shut down immediately.
03/21/2024 11:47:53 Device UPS: Restored the local network management interface-to-UPS communication.
03/21/2024 11:47:39 Device Environment: Restored the local network management interface-to-integrated Environmental Monitor (Universal I/O at Port 1) communication.
03/21/2024 11:47:32 System Network service started. System IP is xxx from manually configured settings.
03/21/2024 11:47:24 System Network Interface coldstarted.
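Because the events are listed newest first, it can help to reorder them and look at the time between entries. The following is only a minimal sketch: it assumes each line has the shape `MM/DD/YYYY HH:MM:SS message` as pasted above, and the names `RAW_EVENTS` and `parse_events` are made up for illustration, not part of any APC tool.

```python
from datetime import datetime

# Events copied from the post (newest first); the IP stays redacted as "xxx".
RAW_EVENTS = """\
03/21/2024 11:49:45 Device UPS: A discharged battery condition no longer exists.
03/21/2024 11:47:56 Device UPS: The battery power is too low to support the load; if power fails, the UPS will be shut down immediately.
03/21/2024 11:47:53 Device UPS: Restored the local network management interface-to-UPS communication.
03/21/2024 11:47:39 Device Environment: Restored the local network management interface-to-integrated Environmental Monitor (Universal I/O at Port 1) communication.
03/21/2024 11:47:32 System Network service started. System IP is xxx from manually configured settings.
03/21/2024 11:47:24 System Network Interface coldstarted.
"""

def parse_events(raw):
    """Return (timestamp, message) tuples sorted oldest first."""
    events = []
    for line in raw.strip().splitlines():
        # Assumed layout: date, time, then the free-text message.
        date_part, time_part, message = line.split(" ", 2)
        stamp = datetime.strptime(f"{date_part} {time_part}", "%m/%d/%Y %H:%M:%S")
        events.append((stamp, message))
    return sorted(events, key=lambda event: event[0])

events = parse_events(RAW_EVENTS)
first_stamp = events[0][0]
for stamp, message in events:
    # Offset in seconds from the earliest event in the excerpt.
    offset = (stamp - first_stamp).total_seconds()
    print(f"+{offset:4.0f}s  {message}")
```

Read this way, the excerpt starts with the network interface cold start and ends a little over two minutes later when the discharged battery condition clears.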
Immediately after these events, the switch this UPS was connected to (a Cisco 3650) crashed. The switch came back online at 11:54, and we lost all connectivity to our virtual server and storage infrastructure. Communication was lost between clusters, and VMs were brought up in failover mode. The virtual hosts (Nutanix) are reporting errors about cluster management services crashing. The backup appliance (Cohesity) is also complaining about unhealthy services.
It seems like the NMC was the cause of these communication issues. Maybe I am way off base, but I am just collecting information at this point. I have tickets open to check the logs on the switch and the Nutanix hosts to see if we can figure anything out there.
Posted: 2024-03-27 06:03 AM
Looking at the logs you provided, it appears that something shut the UPS down, since the card "cold started". Since you did not mention a power outage, though, I cannot be sure of that. The fact that the log also mentions a discharged battery suggests that either you have a bad battery that is dropping out and causing the UPS to reboot or go down, or the UPS is losing power for some reason. If the card had simply rebooted, the log would say "warm start".
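To make the cold start versus warm start check repeatable across a longer event log, something like the sketch below could be used. It is only an illustration: the helper name `classify_restart` is made up, and the exact wording of a warm start entry is an assumption (only the "coldstarted" line appears in the log posted in this thread), so the match is deliberately loose.

```python
def classify_restart(message):
    """Return "cold", "warm", or None for a single NMC event message.

    Assumption: restart entries contain "cold start"/"coldstarted" or
    "warm start"/"warmstarted" in some spacing or capitalization.
    """
    text = message.lower().replace(" ", "")
    if "coldstart" in text:
        return "cold"   # card came up from a powered-off state
    if "warmstart" in text:
        return "warm"   # card rebooted without losing power
    return None         # not a restart entry

# The one restart line from the posted log:
print(classify_restart("System Network Interface coldstarted."))
```

A "cold" result supports the idea that the card actually lost power, while a "warm" result would point more toward a card reboot on its own.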
Moving on: the Cisco 3650s were decent switches, but they have also been out of production for almost 3 years. I have seen in aging UPSs (especially in the on-line series (not double conversion)) that the switching mechanism from utility power to battery sometimes gets weak and slow. This can cause disruptions, but usually it is only a voltage drop that is not enough to shut down gear, unless that gear is already seeing low input voltage (100-107 volts) and/or the PSU in the switch is getting weak and cannot compensate for voltage swings.
That said, you mentioned network flooding. I have never seen that from an NMC, and I have swapped many, many cards from older, failing, and failed UPSs to new ones. Those cards generally do not put out a lot of traffic unless you have them sending a ton of emails to various people or groups, or a bunch of logs to a syslog server. Even then, the Cisco 3650 should be more than capable of handling that.
Do you have any logs from the switch in question for that time frame? Unless I miss my guess, the UPS is going into a self-test, and either you have a weak battery that is causing the UPS to drop out and go down, or a failing PSU in the switch. I think the logs will show a loss of power to the switch, which would not be an NMC issue.