APC UPS Data Center & Enterprise Solutions Forum
Schneider, APC support forum to share knowledge about installation and configuration for Data Center and Business Power UPSs, Accessories, Software, Services.
Posted: 2021-07-26 02:56 AM . Last Modified: 2024-02-14 02:37 AM
I have a SMT1500RM2U UPS with an AP9631 at a customer site. The AP9631 has been restarting with "Failsafe Reset" every few days. The NMC2 has been running 6.4.0 since it was installed. The UPS had been at ID18 9.2 and was updated this past weekend to 9.3 in case that was the problem.
The start of the dump.txt file is:
07/26/2016 11:42:05 Failsafe Reset
Specific code = 201
AOS v6.4.0 sumx v6.4.0
Serial Number: 5A1116T0xxxx
AOS Binary Date/Time: Dec 18 2015 15:04:27
APP Binary Date/Time: Dec 18 2015 15:14:26
Task Dump Task ID 167
OSIntNesting 1
inUioFlag 0
uioErr 0
Current stack at _SS:_SP 03ad: 2510
The complete dump.txt is attached. Can you take a look at this and let me know if it looks like a hardware problem or a software bug, and what steps I should take to investigate further?
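For anyone collecting several of these dumps, the header fields quoted above can be extracted with a short script so that resets are easier to compare. This is only a sketch: the field layout is inferred from the one excerpt in this post, not from any APC documentation, and the function name `parse_dump_header` is made up for illustration.

```python
import re
from datetime import datetime

# First lines of dump.txt as quoted in the post (serial number already masked).
SAMPLE = """\
07/26/2016 11:42:05 Failsafe Reset
Specific code = 201
AOS v6.4.0 sumx v6.4.0
Serial Number: 5A1116T0xxxx
AOS Binary Date/Time: Dec 18 2015 15:04:27
APP Binary Date/Time: Dec 18 2015 15:14:26
Task Dump Task ID 167
OSIntNesting 1
inUioFlag 0
uioErr 0
Current stack at _SS:_SP 03ad: 2510
"""


def parse_dump_header(text):
    """Return a dict of the fields visible in the quoted dump.txt header.

    Assumption: the layout matches the excerpt above. Fields that are
    not found are simply left out of the result.
    """
    fields = {}
    m = re.search(r"^(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}) (.+ Reset)\s*$", text, re.M)
    if m:
        fields["timestamp"] = datetime.strptime(m.group(1), "%m/%d/%Y %H:%M:%S")
        fields["reset_type"] = m.group(2)
    m = re.search(r"^Specific code = (\d+)", text, re.M)
    if m:
        fields["specific_code"] = int(m.group(1))
    # "AOS v6.4.0 sumx v6.4.0": AOS version, then application name and version.
    m = re.search(r"^AOS v(\S+) (\S+) v(\S+)", text, re.M)
    if m:
        fields["aos_version"], fields["app_name"], fields["app_version"] = m.groups()
    m = re.search(r"^Task Dump Task ID (\d+)", text, re.M)
    if m:
        fields["task_id"] = int(m.group(1))
    return fields


if __name__ == "__main__":
    for key, value in parse_dump_header(SAMPLE).items():
        print(f"{key}: {value}")
```

Comparing `reset_type`, `specific_code`, and `task_id` across dumps is one quick way to see whether each reset points at the same task.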
Posted: 2021-07-26 02:56 AM . Last Modified: 2024-01-31 02:57 AM
Hi Terry, we need the entire .tar file/bundle please. You can sanitize it before posting, but dump.txt together with the config, event log, and debug.txt is the most helpful for debugging why this is occurring.
Posted: 2021-07-26 02:56 AM . Last Modified: 2024-02-14 02:37 AM
Here you go. It isn't worth unzipping it to sanitize.
Posted: 2021-07-26 02:56 AM . Last Modified: 2024-02-14 02:37 AM
Posted: 2021-07-26 02:57 AM . Last Modified: 2024-01-31 02:57 AM
Sorry Terry, I've been on vacation. I'll try to look at this later today while playing catch up.
Posted: 2021-07-26 02:57 AM . Last Modified: 2024-01-31 02:57 AM
Hi Terry,
To analyze a reset, we usually need to see what the event log shows at the same time, and the dump is only kept from the last reboot. The event log here only goes back to 7/22, so we won't be able to research the first two of these:
07/03/2016 06:41:23 Failsafe Reset
07/11/2016 08:41:31 Failsafe Reset
07/23/2016 14:03:36 Netsafe Reset - Netsafe reset is the watchdog mechanism the NMC uses to reboot itself when network traffic is too little or too much, in an effort to rule out a problem on its own side that keeps it from talking on the network. This is normal behavior.
07/26/2016 11:42:05 Failsafe Reset - this one I will investigate a little more and update you when I know anything further.
Posted: 2021-07-26 02:57 AM . Last Modified: 2024-02-14 02:37 AM
Right - I'm interested in the last one. The netsafe was expected as I was working on the network at the time.
Thanks!
Posted: 2021-07-26 02:57 AM . Last Modified: 2024-01-31 02:57 AM
Hi Terry,
This looks like it may be related to the email format task/process. Most likely something else caused this task to crash, but there could also be a problem in the task itself. I haven't seen this specific failure before either.
One question: were all of the expected emails received around the time of the FTP login at 11:42 on 7/26 (which is shown in the event log)?
I would see whether this crash can be replicated. Finding the cause may be like looking for a needle in a haystack: the NMC runs a real-time OS, and whatever was happening on the system at that exact moment could have contributed to a one-off. But if we can replicate it multiple times and all of the failsafes generate the same dump.txt, then I will certainly log a bug and provide the log files for review to see what can be done in future releases.
Posted: 2021-07-26 02:57 AM . Last Modified: 2024-02-14 02:37 AM
The FTP script runs hourly on all of my APC devices (close to 100 or so, I'd say, of which maybe 6 are NMC2) to track configuration changes. It is not expected that the FTP script would cause any email to be sent by the NMC.
What normally happens with these resets (which I am only seeing on this one device) is that I get a flurry of emails from the UPS about the NMC restarting, discovering the UIO probes, connecting with the UPS, and so on. My SNMP script may report a temporary inability to reach the device (it polls every 5 minutes) and I may get a "configuration change" alert from my FTP monitoring script where I will get a "UPS not discovered" change:
There was another reset on the 30th. I'm attaching that debug file. Generally, I can get you a new one every couple of days.
Posted: 2021-07-26 02:57 AM . Last Modified: 2024-02-14 02:37 AM
I should probably add that this is the only "modern" Smart-UPS that I have (with the blue LCD display). All of my other units are either LED-only Smart-UPS units or Symmetra and Matrix units. So if it is related to the new-style UPS and NMC protocol, then I wouldn't see it on any other unit.
If necessary, I can format / update / configure a replacement NMC2 and send it to the remote site for someone to swap, just to confirm it isn't a hardware problem in that card. But I'd rather keep collecting diagnostic information with this one, as long as it leads to a resolution.
Posted: 2021-07-26 02:57 AM . Last Modified: 2024-01-31 02:57 AM
Hi Terry,
We don't think this is a hardware problem, and I'd expect you to reach the same conclusion if you do end up sending a replacement card to compare. I was going to ask what's different about this card configuration-wise versus your others, but you've more or less answered that now: the UPS is a micro-link model, but the other config items are pretty much the same (right?).
We do have a similar bug logged already that is under investigation, and I am thinking of adding your debug files to it. It appears to potentially be a resource utilization problem that depends on the specific environment or configuration (including UPS type), which tasks are running, and what their priorities are.
There is not a smoking gun in your log files where I can give you a straight answer unless you know it started right after you enabled/disabled something or made a change somewhere. I imagine someone needs to look closely at your configuration and log files and comb through how tasks are prioritized to make sure resources are managed as efficiently as possible to avoid failsafes.
The other thing we can look at and note in the bug is the specific pattern - does it happen every X hours on the dot? Always after an FTP log in? Are the dump.txt files always identical with the same codes/tasks mentioned?
Posted: 2021-07-26 02:57 AM . Last Modified: 2024-02-14 02:37 AM
Right - this is the only micro-link UPS I have. Other than that, the config is the same as I use on a number of other NMC2 cards.
It has been happening randomly - since all the timestamps on the crashes are around xx:4x, I'd say that was the FTP login.
It doesn't happen on any specific timeframe. The recent history is:
07/03/2016 06:41:23 Failsafe Reset
07/11/2016 08:41:31 Failsafe Reset
07/23/2016 14:03:36 Netsafe Reset
07/26/2016 11:42:05 Failsafe Reset
07/27/2016 13:41:36 Failsafe Reset
07/29/2016 22:41:45 Failsafe Reset
07/30/2016 01:41:13 Failsafe Reset
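As a rough check on the "every X hours / always after an FTP login" question, the history above can be run through a few lines of Python. This is a sketch for illustration only; the helper names are made up, and the input is just the list as posted.

```python
from datetime import datetime

# Reset history exactly as listed in the post.
HISTORY = """\
07/03/2016 06:41:23 Failsafe Reset
07/11/2016 08:41:31 Failsafe Reset
07/23/2016 14:03:36 Netsafe Reset
07/26/2016 11:42:05 Failsafe Reset
07/27/2016 13:41:36 Failsafe Reset
07/29/2016 22:41:45 Failsafe Reset
07/30/2016 01:41:13 Failsafe Reset
"""


def parse_history(text):
    """Turn 'MM/DD/YYYY HH:MM:SS <reset kind>' lines into (datetime, kind) pairs."""
    events = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        date, time, kind = line.split(" ", 2)
        stamp = datetime.strptime(f"{date} {time}", "%m/%d/%Y %H:%M:%S")
        events.append((stamp, kind))
    return events


def failsafe_summary(events):
    """Return the distinct minute-of-hour values and the gaps (in hours)
    between consecutive failsafe resets, ignoring other reset kinds."""
    failsafes = [stamp for stamp, kind in events if kind.startswith("Failsafe")]
    minutes = sorted({stamp.minute for stamp in failsafes})
    gaps_hours = [
        round((later - earlier).total_seconds() / 3600, 1)
        for earlier, later in zip(failsafes, failsafes[1:])
    ]
    return minutes, gaps_hours


if __name__ == "__main__":
    minutes, gaps = failsafe_summary(parse_history(HISTORY))
    print("minute-of-hour values:", minutes)
    print("hours between failsafe resets:", gaps)
```

For the timestamps listed, every failsafe falls at minute 41 or 42 and each gap is within a minute of a whole number of hours. That fits an hourly job such as the FTP login, but it does not prove the login is the cause.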
I just started monitoring this UPS on 07/02/2016, so the failsafe resets started soon after that. To compare, a different UPS has been running since March 2015 and has never had a failsafe reset, being monitored via the same script. The FTP script is pretty simple - it just logs into the NMC, gets the config.ini file, and logs out. You can find it at ftp://ftp.shrubbery.net/pub/rancid/contrib/rancid-apc.tar.gz
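For readers who don't want to download the tarball, the log in / get config.ini / log out sequence looks roughly like the following. This is not the rancid-apc script itself, just a minimal Python sketch using the standard `ftplib` module. The function names, the host and credentials in the usage comment, and the assumption that config.ini sits in the FTP root directory are all illustrative.

```python
import difflib
from ftplib import FTP
from io import BytesIO


def fetch_config(host, user, password, ftp_factory=FTP, timeout=30):
    """Log in over FTP, download config.ini, log out; return the file as bytes.

    ftp_factory exists only so the function can be exercised without a
    real device; by default it is ftplib.FTP.
    Assumption: config.ini is in the FTP root directory.
    """
    buf = BytesIO()
    ftp = ftp_factory(host, timeout=timeout)
    try:
        ftp.login(user, password)
        ftp.retrbinary("RETR config.ini", buf.write)
    finally:
        try:
            ftp.quit()       # polite logout
        except Exception:
            ftp.close()      # connection already gone; just drop it
    return buf.getvalue()


def diff_configs(previous, current):
    """Return unified-diff lines between two config.ini snapshots (bytes)."""
    old_lines = previous.decode("utf-8", errors="replace").splitlines()
    new_lines = current.decode("utf-8", errors="replace").splitlines()
    return list(
        difflib.unified_diff(old_lines, new_lines, "previous", "current", lineterm="")
    )


# Hypothetical usage (address and credentials are placeholders):
#   snapshot = fetch_config("192.0.2.10", "someuser", "somepassword")
#   for line in diff_configs(last_snapshot, snapshot):
#       print(line)
```

The whole session is one login, one file transfer, and one logout per device per hour.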
Posted: 2021-07-26 02:57 AM . Last Modified: 2024-01-31 02:57 AM
Hi Terry,
Can you share a .tar of one of these different, older style UPSs for comparison, which do not show any issue?
Posted: 2021-07-26 02:57 AM . Last Modified: 2024-02-14 02:37 AM
Here you go. This is a Symmetra, but as the bug is probably in AOS and not the APP file, hopefully it will be helpful.
Posted: 2021-07-26 02:57 AM . Last Modified: 2024-01-31 02:57 AM
Thanks Terry. As I noted, it seems to be resource related, so ideally we'd want a unit running the sumx app to compare apples to apples, in case the problem is specific to resources in the sumx app; some tasks may differ between the sumx and sy apps. Or are all your other UPS units Symmetras?
Posted: 2021-07-26 02:57 AM . Last Modified: 2024-02-14 02:37 AM
2 Symmetra 1P, 2 Matrix, dozens of various generations of Smart-UPS, and 1 microlink UPS. Unfortunately, the only AP963x cards are in one of the Symmetras and the microlink UPS. Everything else is AP961x.