APC UPS Data Center & Enterprise Solutions Forum
Schneider, APC support forum to share knowledge about installation and configuration for Data Center and Business Power UPSs, Accessories, Software, Services.
Posted: 2021-07-01 02:13 AM . Last Modified: 2024-03-05 11:48 PM
The strangest thing just happened to me. I have four NMC2 cards on two different UPS models. These cards have all been up and running without a problem for the last 10 months or so. We monitor them via an SNMP app, and all at once all four dropped off the network. They continued to respond to pings, but SNMP, telnet and HTTP access all failed. Initially I assumed a network problem, but the cards are on at least two different switches. After eliminating all possible network issues, I hit the reset button on one of the NMCs. After booting, it started responding normally; the logs show no event at the time it stopped responding. I then did the same on the other three cards and everything is back to normal.
Does anyone have any ideas on why four independent cards would all apparently lock up simultaneously like that?
Thanks,
John
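For anyone who hits the same symptom (ping answers, but the management services do not), here is a minimal Python sketch for separating "card unreachable" from "IP stack alive but services hung". It is only an illustration, not an APC tool. Assumptions: telnet and HTTP are on their default TCP ports 23 and 80 (an NMC can be configured differently), the ping result comes from your existing monitoring because ICMP needs raw-socket privileges, and SNMP is left out because it runs over UDP 161 and cannot be checked with a plain TCP connect. The names `tcp_port_open`, `probe` and `classify` are made up for this example.

```python
import socket


def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def probe(host, ports=None):
    """Check each named TCP service; returns a dict of name -> bool."""
    # Default ports are an assumption; adjust to the card's configuration.
    ports = ports or {"telnet": 23, "http": 80}
    return {name: tcp_port_open(host, port) for name, port in ports.items()}


def classify(ping_ok, service_ok):
    """Summarise reachability from a ping result and per-service results."""
    if not ping_ok:
        return "unreachable"
    if all(service_ok.values()):
        return "healthy"
    if not any(service_ok.values()):
        # The case described in this thread: pings answered, services dead.
        return "ip-stack-only"
    return "partial"
```

With the ping result taken from the monitoring system, `classify(True, probe("192.0.2.10"))` (a documentation-range address used here as a placeholder) would report `"ip-stack-only"` for a card in the state described above.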
Posted: 2021-07-01 02:13 AM . Last Modified: 2024-03-05 11:48 PM
I was thinking the same thing, and I suppose anything is possible, but it was very odd. I waited about 45 minutes before rebooting them, by which point I had exhausted every other possibility I could think of.
Posted: 2021-07-01 02:13 AM . Last Modified: 2024-03-05 11:48 PM
I was only asking because I was curious whether the watchdog timer outlined in the user's guide and knowledge base kicked in:
The default gateway can be any valid node's IP on the network management card's network. The Management Card implements internal watchdog mechanisms to protect itself from becoming inaccessible over the network. For example, if the Management Card does not receive any network traffic for 9.5 minutes (either direct traffic, such as SNMP, or broadcast traffic, such as an Address Resolution Protocol [ARP] request), it assumes that there is a problem with its network interface and restarts to prevent further problems. To ensure that the Management Card does not restart if the network is quiet for 9.5 minutes, the Management Card attempts to contact the default gateway every 4.5 minutes. If the gateway is present, it responds to the Management Card, and that response restarts the 9.5-minute timer. If your application does not require or have a gateway, specify the IP address of a computer that is running on the network most of the time and is on the same subnet. The network traffic of that computer will restart the 9.5-minute timer frequently enough to prevent the Management Card from restarting.
I was thinking this might have kicked in and the NMC would have rebooted itself, but it is hard to say without knowing what caused the lockup in the first place. I have never really seen that before, but as I am typing this, something did come to mind. Do you have IPv6 enabled on these cards? (It comes enabled by default on the newer firmware revisions, as does IPv4.) One customer I've worked with in the past claimed this caused a problem, but I am not sure how. Perhaps other devices on the network were using IPv6 and caused some type of storm, but to me it seems like that would bring a lot of things down, not just the NMCs.
What firmware revisions do the cards have? (They are listed under Administration->General->About in the web UI if you aren't sure.) I'm also curious whether they are all on the same AOS rev.
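To make the watchdog timing quoted above concrete, here is a rough Python model of it. This is a toy simulation built only from the two figures in the user's guide excerpt (9.5-minute inactivity timeout, gateway contact every 4.5 minutes); it is not the firmware's actual logic, and the function name `first_restart_time` and its parameters are invented for the example. It also assumes the card's timer starts at time 0 and that any inbound packet, including a ping, counts as traffic.

```python
WATCHDOG_TIMEOUT_S = 9.5 * 60  # restart after this long with no traffic (from the guide)
GATEWAY_PROBE_S = 4.5 * 60     # how often the card contacts its default gateway (from the guide)


def first_restart_time(traffic_times, gateway_responds, horizon_s):
    """Return the time in seconds of the first watchdog restart, or None.

    traffic_times    -- times (s) at which inbound traffic reaches the card
    gateway_responds -- True if the default gateway answers the card's probes
    horizon_s        -- length of the simulated period in seconds
    """
    events = {t for t in traffic_times if 0 <= t <= horizon_s}
    if gateway_responds:
        # Each answered gateway probe counts as traffic and resets the timer.
        t = GATEWAY_PROBE_S
        while t <= horizon_s:
            events.add(t)
            t += GATEWAY_PROBE_S

    last_seen = 0.0
    for t in sorted(events):
        if t - last_seen >= WATCHDOG_TIMEOUT_S:
            return last_seen + WATCHDOG_TIMEOUT_S
        last_seen = t
    # Check the quiet period between the last event and the end of the run.
    if horizon_s - last_seen >= WATCHDOG_TIMEOUT_S:
        return last_seen + WATCHDOG_TIMEOUT_S
    return None


# One-minute pings alone keep the timer from expiring over an hour.
print(first_restart_time(range(0, 3601, 60), False, 3600))  # None
# No traffic at all and no gateway: restart at 570 s (9.5 minutes).
print(first_restart_time([], False, 3600))  # 570.0
```

Under this model, a card that is still receiving and answering pings every minute would never trip the inactivity watchdog, which would fit with cards that hung for about 45 minutes without restarting themselves.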
Posted: 2021-07-01 02:14 AM . Last Modified: 2024-03-05 11:48 PM
Yeah, I thought about the watchdog timer too, but the cards didn't restart and, as you pointed out, the time before I rebooted them was far greater than the 9.5 minutes. Also, we have two different monitoring systems. One does SNMP polls every 5 minutes, so in theory if there was a delay there the timer could expire, but the other system does a ping every minute, so that would stop the timer from expiring. Three of the cards are on AOS 5.1.7 and one is on 5.1.6. The IPv6 idea is an interesting thought; if it recurs I might turn that off, but as it stands this is the only time it's happened in almost a year, so it's definitely not something I can repeat and test against. I'm not going to lose any sleep over this, but it was just such a strange thing that I had to ask about it.