DCE 7.2.7 Device view shows a status of critical after a power outage

DCIM_Support · ‎2020-07-03

I monitor over 400 remote APC UPS devices: 1500s, 2200s and some 3000 XLs. The DCE device view screen continues to show a device in critical status after a power outage at the remote location. When I browse to the UPS local web interface, the device is green and shows no current problems. I can see in the logs that over the last week the device experienced a loss of AC power and the UPS shut down after the battery ran out.

The APC UPS is now back up and showing green in the local interface yet DCE is still showing a Critical status for the device. I can request a device rescan of the UPS yet it does not refresh to the new status. If I request to view device sensors, the Link Status shows as being "Online" but it is RED in color like there is a problem.

The only thing I have been able to do to reset the status in DCE back to normal is to delete the device and rediscover it. The status then shows up as normal.

This is happening on most of the UPS devices that I have tracked down. I go in once a week and delete and rescan all the UPS that have lost power over the last week to clear the critical status.

The UPS Clients are mostly the 9630 and 9631 NMC with v6.4.0 application running.

Any help or direction would be most appreciated. It is time consuming to reset all my remote locations that lose power over a week.

Thanks for your great forum,

Chuck

DCIM_Support · ‎2020-07-03

Hi Chuck,

It's not a known issue. Although I have seen it before, it's not something common and you definitely should not have to remove and re-add the devices. I've also not seen it with so many devices.

The first thing I would try is right-clicking the device and requesting a device scan. If this works, what is the polling time for these devices?

When is the last time the server itself was rebooted? That would possibly be a next step.

You mentioned having to do this with remote systems. Are there local systems as well and if so, do they react the same way?

Another thing you might want to think about is contacting support directly to see if you can get an update. We're at 7.4.1 now. As I mentioned, this isn't a known issue so there's nothing added specifically for this but it can't hurt to be up to date.

I'm wondering if this only happens to systems that go to battery and run out, as you mentioned in your initial paragraph, or any time they go to battery. I'd also be curious to know, when a unit does this, whether causing another error and clearing it (a temperature violation would do it, if you have environmental monitoring) clears the error(s).

Steve

DCIM_Support · ‎2020-07-03

Chuck, another thing to try before deleting the device (which you will not want to do, as you will lose the historical data for the device) is to reset the web interface card (NIC) itself. Note that you only need to reboot the management interface, NOT the UPS. Launch to the UPS, log on, then go to Control -> Network -> Reset/Reboot and ensure that the selected button is "Reboot Management Interface". Good luck.

DCIM_Support · ‎2020-07-03

Steve,

Thank you for the timely response. I was hit with a lot of overtime and needed to drop my projects.

Back to my DCE problem: I tried doing a right click and request device scan with no success. I also rebooted the server after you mentioned it. This too did not seem to help.

I've performed an AC outage in my test lab with an APC Smart 1500 and 2200 with an NMC 9630 and the latest firmware. I did not seem to have the same problem with these test lab devices. They go into a red X status and then return to normal after the device returns from the loss of AC power. I am not sure what this means, as we are on a segmented network. The DCE server is running on a VM across the WAN in the data center. We do not have many ACLs on the remote router because they all terminate on our own network.

One more thing to note: since I have a wide number of APC UPS models and ages, I was constantly being hammered with false alerts using the default alarm policy. I created a new alarm policy for critical alerts only. I am only tracking some battery alarms as they relate to AC power, and communication issues with the NMC indicating when the device lost battery power. All the other alerts have been configured to a warning status. They will continue to log but not send alerts. I am not sure how this could have any bearing on my issue but thought I would bring it up.

I receive around a dozen power outages per week, so I need to track whether this happens to all the devices or only a certain model or age of device. This seems like the only logical course to track. I think I have tried just rebooting the NMC, but I need to try that again before I say that did not work.

Any additional thoughts would be most appreciated.

Chuck
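For the tracking Chuck describes, one possible approach is to log each outage and tally how often the status stayed critical per model, or per whether the battery ran out. The sketch below is only an illustration in Python. The record fields (`model`, `ran_to_empty`, `stuck_critical`) and the sample entries are hypothetical placeholders, not data from this thread or a DCE export format.

```python
from collections import defaultdict


def stuck_rate_by_group(records, key="model"):
    """Group outage records by `key` and count how many stayed critical in DCE."""
    totals = defaultdict(lambda: {"outages": 0, "stuck": 0})
    for rec in records:
        group = totals[rec[key]]
        group["outages"] += 1
        if rec["stuck_critical"]:
            group["stuck"] += 1
    return dict(totals)


# Hypothetical log entries, one per outage (made up for illustration only).
sample = [
    {"model": "Smart-UPS 1500", "ran_to_empty": True, "stuck_critical": True},
    {"model": "Smart-UPS 1500", "ran_to_empty": False, "stuck_critical": False},
    {"model": "Smart-UPS 2200", "ran_to_empty": True, "stuck_critical": True},
    {"model": "Smart-UPS 3000 XL", "ran_to_empty": False, "stuck_critical": False},
]

by_model = stuck_rate_by_group(sample)
by_depletion = stuck_rate_by_group(sample, key="ran_to_empty")
```

Grouping the same log by `ran_to_empty` would also speak to Steve's earlier question of whether only units that fully drain the battery get stuck.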

DCIM_Support · ‎2020-07-03

Hi Chuck,

I agree. If you have changed the alarm policy or configurations in any way, are the ones that are not clearing any different from the ones that are clearing? If you change it back to its defaults (just for testing), does the issue still occur? Maybe even disable and re-enable the alarm configurations just to be sure.

Did you enable or disable return-to-normal alerts in the alarm action? Did you simply change the alarms under device alarm configurations or did you actually create new thresholds? If you created new thresholds, did you select the option "Return to Normal Requires User Input"?

If you can poll the device directly using SNMP, can you see if any value is returned from this OID (not in the MIB): .1.3.6.1.4.1.318.1.4.2.11.1.3.1

If no value is returned, no alarm should be seen.

Steve
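With several hundred devices, the SNMP check on that OID could be scripted rather than run by hand. The following is a minimal Python sketch that wraps the Net-SNMP `snmpget` command-line tool. It assumes `snmpget` is installed on the polling host, and the SNMP version (`1`) and community string (`public`) are placeholder assumptions to replace with whatever the cards actually have enabled. The helper names are hypothetical.

```python
import subprocess

# OID quoted in the post above; per the post, it is not in the published MIB.
ALARM_OID = ".1.3.6.1.4.1.318.1.4.2.11.1.3.1"

# Phrases Net-SNMP prints when the agent has nothing at the requested OID.
NO_VALUE_MARKERS = ("no such object", "no such instance", "no such variable name")


def has_value(snmpget_output: str) -> bool:
    """Return True if a line of snmpget output carries an actual value."""
    text = snmpget_output.strip().lower()
    if not text or "=" not in text:
        return False
    return not any(marker in text for marker in NO_VALUE_MARKERS)


def poll_alarm_oid(host: str, community: str = "public", version: str = "1") -> bool:
    """Run snmpget against one UPS and report whether the OID returned a value."""
    cmd = ["snmpget", "-v", version, "-c", community, "-On", host, ALARM_OID]
    result = subprocess.run(cmd, capture_output=True, text=True)
    # A non-zero exit code covers timeouts and SNMPv1 noSuchName errors.
    return result.returncode == 0 and has_value(result.stdout)
```

Calling `poll_alarm_oid("10.0.0.15")` (an illustrative address) for each UPS stuck in critical would show whether the card itself still reports a value at that OID or whether the stale status lives only in DCE.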
