APC UPS Data Center & Enterprise Solutions Forum
Schneider Electric support forum for our Data Center and Business Power UPS, UPS Accessories, Software, Services, and associated commercial products designed to share knowledge, installation, and configuration.
Posted: 2021-07-26 02:56 AM
This was originally posted on APC forums on 7/26/2016
I have a SMT1500RM2U UPS with an AP9631 at a customer site. The AP9631 has been restarting with "Failsafe Reset" every few days. The NMC2 has been running 6.4.0 since it was installed. The UPS had been at ID18 9.2 and was updated this past weekend to 9.3 in case that was the problem.
The start of the dump.txt file is:
07/26/2016 11:42:05 Failsafe Reset
Specific code = 201
AOS v6.4.0 sumx v6.4.0
Serial Number: 5A1116T0xxxx
AOS Binary Date/Time: Dec 18 2015 15:04:27
APP Binary Date/Time: Dec 18 2015 15:14:26
Task Dump Task ID 167
OSIntNesting 1
inUioFlag 0
uioErr 0
Current stack at _SS:_SP 03ad: 2510
The complete dump.txt is attached. Can you take a look at this and let me know if it looks like a hardware problem or a software bug, and what steps I should take to investigate further?
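For anyone collecting several of these dumps to compare, the header fields above can be pulled out with a short script. This is only a sketch: the field layout is assumed from the excerpt quoted in this post, not from any APC documentation, and `parse_dump_header` and `SAMPLE` are hypothetical names used for illustration.

```python
import re
from datetime import datetime

# Header lines copied from the dump.txt excerpt above (layout assumed from that excerpt).
SAMPLE = """\
07/26/2016 11:42:05 Failsafe Reset
Specific code = 201
AOS v6.4.0 sumx v6.4.0
Task Dump Task ID 167
"""


def parse_dump_header(text):
    """Return the reset time, reset type, specific code, and task ID found in the text."""
    info = {}
    # First line of the dump: "MM/DD/YYYY HH:MM:SS <type> Reset"
    m = re.search(r"^(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}) (.+ Reset)\s*$", text, re.M)
    if m:
        info["when"] = datetime.strptime(m.group(1), "%m/%d/%Y %H:%M:%S")
        info["reset_type"] = m.group(2)
    m = re.search(r"Specific code = (\d+)", text)
    if m:
        info["specific_code"] = int(m.group(1))
    m = re.search(r"Task Dump Task ID (\d+)", text)
    if m:
        info["task_id"] = int(m.group(1))
    return info


if __name__ == "__main__":
    print(parse_dump_header(SAMPLE))
```

Running the same function over each saved dump.txt makes it easy to see whether every failsafe reports the same specific code and task ID.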
Posted: 2021-07-26 02:56 AM
This reply was originally posted by Angela on APC forums on 7/27/2016
Hi Terry, we need the entire .tar file/bundle please. You can sanitize it before posting, but dump.txt in conjunction with the config, event, and debug.txt files is most helpful for debugging why this is occurring.
Posted: 2021-07-26 02:56 AM
This was originally posted on APC forums on 7/27/2016
Here you go. It isn't worth unzipping it to sanitize.
Posted: 2021-07-26 02:56 AM
This was originally posted on APC forums on 8/2/2016
Posted: 2021-07-26 02:57 AM
This reply was originally posted by Angela on APC forums on 8/2/2016
Sorry Terry, I've been on vacation. I'll try to look at this later today while playing catch up.
Posted: 2021-07-26 02:57 AM
This reply was originally posted by Angela on APC forums on 8/2/2016
Hi Terry,
In looking at this, we usually need to see what was happening in the event log at the same time, since the dump is only kept from the last reboot. The event log only goes back to 7/22, so we won't be able to research the earlier resets in this list:
07/03/2016 06:41:23 Failsafe Reset
07/11/2016 08:41:31 Failsafe Reset
07/23/2016 14:03:36 Netsafe Reset - The Netsafe reset is the watchdog mechanism the NMC uses to reboot itself if network traffic is too little or too much, in an effort to rule out a problem with its own ability to talk on the network. This is normal behavior.
07/26/2016 11:42:05 Failsafe Reset - this one I will investigate a little more and update you when I know anything further.
Posted: 2021-07-26 02:57 AM
This was originally posted on APC forums on 8/2/2016
Right - I'm interested in the last one. The netsafe was expected as I was working on the network at the time.
Thanks!
Posted: 2021-07-26 02:57 AM
This reply was originally posted by Angela on APC forums on 8/2/2016
Hi Terry,
This looks like it could be related to the email format task/process. It is possible that something else caused this task to crash (most likely), or that there is a problem within this task itself. I haven't seen this specific failure before either.
One question was, were all of the expected emails received when logging in via FTP around 11:42 on 7/26 (which is shown in the event log)?
I would see whether this crash can be replicated frequently so we can log a bug on it, but it may be a needle in a haystack: the NMC runs a real-time OS, and whatever was happening on the system at that exact moment could have contributed to a one-off. If we can replicate it multiple times and all of the failsafes generate the same dump.txt, I will certainly log a bug and provide the log files for review to see what can be done in future releases.
Posted: 2021-07-26 02:57 AM
This was originally posted on APC forums on 8/2/2016
The FTP script runs hourly on all of my APC devices (close to 100 or so, I'd say, of which maybe 6 are NMC2) to track configuration changes. It is not expected that the FTP script would cause any email to be sent by the NMC.
What normally happens with these resets (which I am only seeing on this one device) is that I get a flurry of emails from the UPS about the NMC restarting, discovering the UIO probes, connecting with the UPS, and so on. My SNMP script may report a temporary inability to reach the device (it polls every 5 minutes) and I may get a "configuration change" alert from my FTP monitoring script where I will get a "UPS not discovered" change:
There was another reset on the 30th. I'm attaching that debug file. Generally, I can get you a new one every couple of days.
Posted: 2021-07-26 02:57 AM
This was originally posted on APC forums on 8/2/2016
I should probably add that this is the only "modern" Smart-UPS that I have (with the blue LCD display). All of my other units are either LED-only Smart-UPS units or Symmetra and Matrix units. So if it is related to the new-style UPS and NMC protocol, then I wouldn't see it on any other unit.
If necessary, I can format / update / configure a replacement NMC2 and send it to the remote site for someone to swap, just to confirm it isn't a hardware problem in that card. But I'd rather keep collecting diagnostic information with this one, as long as it leads to a resolution.
Posted: 2021-07-26 02:57 AM
This reply was originally posted by Angela on APC forums on 8/3/2016
Hi Terry,
We don't feel this is a hardware problem and I'd assume you'd agree with that if you did end up sending them a replacement card to compare. I was going to ask what's different about this card configuration-wise versus your others but you've sort of mentioned that now - the UPS model is micro-link but other config items are pretty much the same (right?).
We do have a similar bug logged already that is under investigation, and I am thinking of adding your debug files to it. It appears to potentially be a resource utilization problem that depends on the specific environment or configuration (including UPS type), which tasks are running, and what their priorities are.
There is not a smoking gun in your log files where I can give you a straight answer unless you know it started right after you enabled/disabled something or made a change somewhere. I imagine someone needs to look closely at your configuration and log files and comb through how tasks are prioritized to make sure resources are managed as efficiently as possible to avoid failsafes.
The other thing we can look at and note in the bug is the specific pattern - does it happen every X hours on the dot? Always after an FTP log in? Are the dump.txt files always identical with the same codes/tasks mentioned?
Posted: 2021-07-26 02:57 AM
This was originally posted on APC forums on 8/3/2016
Right - this is the only micro-link UPS I have. Other than that, the config is the same as I use on a number of other NMC2 cards.
It has been happening randomly - since all the timestamps on the crashes are around xx:4x, I'd say that was the FTP login.
It doesn't happen on any specific timeframe. The recent history is:
07/03/2016 06:41:23 Failsafe Reset
07/11/2016 08:41:31 Failsafe Reset
07/23/2016 14:03:36 Netsafe Reset
07/26/2016 11:42:05 Failsafe Reset
07/27/2016 13:41:36 Failsafe Reset
07/29/2016 22:41:45 Failsafe Reset
07/30/2016 01:41:13 Failsafe Reset
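To make the pattern question concrete, the timestamps above can be checked for minute-of-hour clustering and for a fixed interval between resets. This is a minimal sketch using only the times listed in this post; `RESETS`, `minute_counts`, and `gaps_hours` are illustrative names.

```python
from collections import Counter
from datetime import datetime

# Reset history exactly as listed in this post.
RESETS = [
    ("07/03/2016 06:41:23", "Failsafe"),
    ("07/11/2016 08:41:31", "Failsafe"),
    ("07/23/2016 14:03:36", "Netsafe"),
    ("07/26/2016 11:42:05", "Failsafe"),
    ("07/27/2016 13:41:36", "Failsafe"),
    ("07/29/2016 22:41:45", "Failsafe"),
    ("07/30/2016 01:41:13", "Failsafe"),
]


def parse(ts):
    return datetime.strptime(ts, "%m/%d/%Y %H:%M:%S")


# Only the failsafe resets are of interest; the netsafe reset was expected.
failsafes = [parse(ts) for ts, kind in RESETS if kind == "Failsafe"]

# How many failsafes fall on each minute of the hour.
minute_counts = Counter(t.minute for t in failsafes)

# Hours between consecutive failsafes, rounded to one decimal place.
gaps_hours = [
    round((later - earlier).total_seconds() / 3600, 1)
    for earlier, later in zip(failsafes, failsafes[1:])
]

if __name__ == "__main__":
    print("minute of hour:", dict(minute_counts))
    print("gaps in hours:", gaps_hours)
```

For this data, every failsafe lands at minute 41 or 42 and the gaps come out as whole numbers of hours with no repeating period, which fits a trigger tied to an hourly job rather than a fixed uptime interval.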
I just started monitoring this UPS on 07/02/2016, so the failsafe resets started soon after that. To compare, a different UPS has been running since March 2015 and has never had a failsafe reset, being monitored via the same script. The FTP script is pretty simple - it just logs into the NMC, gets the config.ini file, and logs out. You can find it at ftp://ftp.shrubbery.net/pub/rancid/contrib/rancid-apc.tar.gz
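For readers who want to reproduce the load without installing rancid, the login / fetch config.ini / logout sequence described above can be approximated with Python's standard `ftplib`. This is not the rancid-apc script itself, only a stand-in sketch; the `fetch_config` name, the `ftp_factory` parameter, and the host and credentials in the usage note are placeholders.

```python
from ftplib import FTP
from io import BytesIO


def fetch_config(host, user, password, ftp_factory=FTP, timeout=30):
    """Log in over FTP, download config.ini into memory, and log out.

    ftp_factory is a hypothetical hook (defaulting to ftplib.FTP) so the
    function can be exercised without a real NMC on the network.
    """
    buf = BytesIO()
    ftp = ftp_factory(host, timeout=timeout)  # connects when a host is given
    try:
        ftp.login(user, password)
        # File name taken from the post; its location on the card is assumed
        # to be the FTP login directory.
        ftp.retrbinary("RETR config.ini", buf.write)
    finally:
        ftp.quit()  # always send QUIT, even if the transfer failed
    return buf.getvalue()
```

A call such as `fetch_config("nmc.example.net", "apc", "secret")`, run hourly from cron with the result compared against the previous copy, would roughly mirror the monitoring described here.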
Posted: 2021-07-26 02:57 AM
This reply was originally posted by Angela on APC forums on 8/9/2016
Hi Terry,
Can you share a .tar of one of these different, older style UPSs for comparison, which do not show any issue?
Posted: 2021-07-26 02:57 AM
This was originally posted on APC forums on 8/9/2016
Here you go. This is a Symmetra, but as the bug is probably in AOS and not the APP file, hopefully it will be helpful.
Posted: 2021-07-26 02:57 AM
This reply was originally posted by Angela on APC forums on 8/9/2016
Thanks Terry. It seems to be resource related, as I noted, so ideally we'd want something using the sumx app to compare apples to apples, in case the problem is specific to resources in the sumx app, since some tasks differ between the sumx and sy apps. Or are all your other UPS units Symmetras?
Posted: 2021-07-26 02:57 AM
This was originally posted on APC forums on 8/11/2016
2 Symmetra 1P, 2 Matrix, dozens of various generations of Smart-UPS, and 1 microlink UPS. Unfortunately, the only AP963x cards are in one of the Symmetras and the microlink UPS. Everything else is AP961x.