Solved: 2 SUA1500RM2 dropping loads simultaneously?

Anonymous user · ‎2021-06-28

Hello,
I have an interesting problem at a remote site, where we have a rack with multiple Smart-UPS units in it. The oldest is an SU1400RM2, and it doesn't seem to be causing any problems. We then have 3 SUA1500RM2s of various ages. One of these has an NMC in it, and we use PCNS to shut down the various servers; we don't have anything plugged into the USB or serial ports.

First: Full disclosure - these are the Dell 'branded' version of these units, so their part number really starts with DLA. I hope that's not the problem here... and here's a status dump from the NMC:

On Line, No Alarms Present

Reason For Last Transfer To Battery: Unacceptable utility voltage rate of change.
Internal Temperature: 88.7 °F
Runtime: 28 minutes

Utility power status

Input Voltage: 113.0 VAC
Input Frequency: 60.00 Hz
Maximum Line Voltage: 117.3 VAC
Minimum Line Voltage: 112.3 VAC

Output power status

Output Voltage: 113.0 VAC
Output Frequency: 60.00 Hz
Load Power: 044.2 % Watts

Battery status

Battery Capacity: 100.0 %
Battery Voltage: 27.40 VDC
Number of External Batteries: 000
Self-Test Result: Passed
Self-Test Date: 11/05/2008
Calibration Result: Invalid
Calibration Date: Unknown

About UPS

Model: Smart-UPS 1500 RM
Firmware Revision: 617.3.D
Manufacture Date: 08/12/05
Serial Number: AS0532133715

The battery was replaced in March of '08. Anyhow, most of our servers are Dell 2950s with redundant power supplies, and we mix and match them between the 4 UPSs for redundancy. We started to notice that during times of bad power our servers were ending up at the login prompt with a warning about an unexpected shutdown. There would be no evidence of PowerChute warnings in the event logs, and the UPSs would show momentary on-battery events in the log, but since this site has a backup generator, blackouts would typically last less than 30 seconds or so. These servers are dumping within the first second of the event, which should be unlikely with each being plugged into more than one UPS.

Yes, we tested each UPS recently, pulling the plug for up to 2 minutes each, with no ill effect. We did not try pulling them all simultaneously... When the network connected UPS was tested, we received the appropriate messages on the servers in question. And no, the servers were not all in common UPSs. They were all spread over the 3 DLA1500RM2s. As I said, the oldest unit is the most robust.

Any known issues with these units? These are located in a remote NH location at the end of utility lines, and we have had the power company come in to check line quality, and they always come back saying they are 'within spec.' Probably not by much.

Erasmus_apc · ‎2021-06-28

Assumed answered due to lack of customer update/response.


Anonymous user · ‎2021-06-28

I have attached a number of logs and have done a bit more to clarify what's going on. The zip file includes:
-config.ini, data and event logs for the 9619 card in one UPS. (UPS3)
-PCNS event logs for 5 servers attached to the 9619.
-PCBE data and event logs for 2 servers directly connected to UPSs.

For reference, the event occurred at or around 14:03 on Jan 25. (note: due to DST and clock drift, the 9619 logs are off and show 14:59 instead. I corrected this within the hour using NTP.)
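When lining up the 9619 entries with the PCNS and PCBE logs, the clock offset above can be applied mechanically. This is only a minimal sketch: the two times are the ones quoted in this post, the year is a placeholder because it is not stated here, and `to_site_time` is a hypothetical helper name.

```python
from datetime import datetime, timedelta

# Both times come from the post. The year is a placeholder (not stated in the thread).
ACTUAL_EVENT = datetime(2000, 1, 25, 14, 3)   # when the event really happened
NMC_LOGGED = datetime(2000, 1, 25, 14, 59)    # what the 9619 card logged for it

# How far ahead the card's clock was running before the NTP fix.
NMC_OFFSET: timedelta = NMC_LOGGED - ACTUAL_EVENT


def to_site_time(nmc_timestamp: datetime) -> datetime:
    """Shift a pre-fix 9619 log timestamp back onto the site clock."""
    return nmc_timestamp - NMC_OFFSET


print(NMC_OFFSET)                        # 0:56:00
print(to_site_time(NMC_LOGGED).time())   # 14:03:00
```

The same shift only applies to entries logged before the clock was corrected; anything after the NTP sync should be left alone.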

Here is a quick map of what's going on:

UPS 1: DLA1400RM2 (The oldest RM unit)
UPS 2: DLA1500RM2
UPS 3: DLA1500RM2 (9619 card installed, USB to Anubis2 for PCBE)
UPS 4: DLA1500RM2 (USB to Toral2 for PCBE)

UPS 5: SU1000BX120 (deskmount next to rack)
UPS 6: SU1400BX120 (ditto)

5 PCNS Servers
- Anubis3 - dual PS on 5,6
- Grima - dual PS on 2,4
- Legolas - single PS on 1
- Ozawa - dual PS on 1,2
- Peregrine - dual PS on 3,4

2 PCBE Servers
- Anubis2 - dual PS on 1,3
- Toral2 - dual PS on 3,4

1 unprotected dev server
- Mothra - dual PS on 1,4

OK those are the Powerchute Players. Now here comes the strange part. Here is the list of servers that rebooted without any powerchute messages:
-Grima (2,4)
-Ozawa (1,2)
-Peregrine (3,4)
-Toral2 (3,4)
-Mothra (1,4)

The ones that did NOT reboot and got messages:
-Legolas (1)
-Anubis2 (1,3)
-Anubis3 (5,6)

The darndest thing is that Legolas, a single-PS unit, lived through it on UPS1. Anubis3 wasn't on any of the RM UPSs at all; it was attached to the older desktop units (not Dell rebrands).
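The lists above can be cross-checked with a few lines of Python. This is a rough sketch, not a diagnosis: the feed map and the rebooted list are copied from this post, and the one assumption is that a dual-supply server should stay up as long as at least one of its UPSs keeps its output.

```python
# UPS feeds per server, copied from the map above.
feeds = {
    "Anubis3": {5, 6}, "Grima": {2, 4}, "Legolas": {1}, "Ozawa": {1, 2},
    "Peregrine": {3, 4}, "Anubis2": {1, 3}, "Toral2": {3, 4}, "Mothra": {1, 4},
}
rebooted = {"Grima", "Ozawa", "Peregrine", "Toral2", "Mothra"}
survived = set(feeds) - rebooted

# A survivor with a single feed shows that its UPS held its output.
held = set()
for server in survived:
    if len(feeds[server]) == 1:
        held |= feeds[server]

# Assumption: with working redundancy, a server fed by a UPS that held
# should not have rebooted. Flag the ones that did anyway.
unexplained = sorted(s for s in rebooted if feeds[s] & held)

print(sorted(held))   # [1]
print(unexplained)    # ['Mothra', 'Ozawa']
```

Under that assumption, Ozawa and Mothra are the odd ones out: both have a supply on UPS1, which carried Legolas through the event, yet both rebooted.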

For the record, UPS4 also has a PoE switch attached to it, and we have evidence it rebooted as well at the time. I'm fairly sure the software is not an issue here.

I did want to add that this event also affected a newer DLA1500 tower unit in a different building that had its battery replaced just a couple of months ago, and self tests fine and can be unplugged for 10 minutes at a time.

I also see that I have a strange config in that UPS3 has both the NMC and a USB connection to a PCBE client. While I agree this might look strange, the messaging seems to go fine and certainly doesn't explain the sudden drops.

I look forward to your comments.

Message was edited by: amcmis

BillP · ‎2021-06-28

Okay. There's a lot of information here and I'm trying to get an idea of your setup and everything to hopefully figure out where things went wrong. Just letting you know I haven't forgotten about you. 🙂

BillP · ‎2021-06-28

First, I will agree it's unusual (and unsupported) to use PCBE in conjunction with an NMC. However, as you said, PCBE seems to be working completely fine so I'm not suspecting that's related or an issue. Nor do I suspect a software/comm issue at this time.

Looking at the layout of servers to UPSes and how everything is connected helps. According to the logs for UPS #3 (NMC logs & Anubis2's PCBE logs), the 44% load somehow matches exactly the 44% load on UPS #4 (per Toral's PCBE logs). This is perplexing since UPS #3 has 3 servers connected and UPS #4 has 4 servers. Normally I would think the load on UPS #4 should be higher than that of UPS #3, but that doesn't seem to be the case.

Because of this, I would check the redundant power supplies themselves - this load consistency might make sense logically (aside from coincidence I suppose) if some of the redundant supplies aren't working. Assuming the lights for each power supply are on, I would actually pull the plug on one of the supplies at a time to make sure the other kicks in fully and the server stays up.

When working with redundant power supplies on multiple UPSes, it's important to remember every UPS must be able to support the entire load assuming one power supply in each server fails. That means UPS #4 needs to be able to support all 4 servers at full load, assuming no redundancy. If it can't, it will become overloaded and turn off, which may cause an overload chain reaction on the other units. Granted that doesn't entirely explain the erratic sequence of shutdown servers... Unless some of your servers don't show reboot messages after they come back up?
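To make the sizing rule concrete, here is a minimal sketch of the worst-case calculation. The server-to-UPS map comes from the layout posted earlier in this thread; the per-server wattage and the UPS capacity are placeholders I made up for illustration, so substitute measured draw and the rating from the unit's datasheet.

```python
from collections import defaultdict

# Rack UPS feeds per server, from the layout posted earlier in the thread.
feeds = {
    "Grima": {2, 4}, "Legolas": {1}, "Ozawa": {1, 2}, "Peregrine": {3, 4},
    "Anubis2": {1, 3}, "Toral2": {3, 4}, "Mothra": {1, 4},
}

# PLACEHOLDER values, not measurements: replace with real figures.
draw_w = {name: 300 for name in feeds}   # hypothetical full draw per server, watts
UPS_CAPACITY_W = 980                     # assumed rating; check the datasheet


def worst_case_watts(feeds, draw_w):
    """Load on each UPS if every server pulls its full draw from that UPS alone."""
    totals = defaultdict(int)
    for server, upses in feeds.items():
        for ups in upses:
            totals[ups] += draw_w[server]
    return dict(totals)


for ups, watts in sorted(worst_case_watts(feeds, draw_w).items()):
    status = "OVER" if watts > UPS_CAPACITY_W else "ok"
    print(f"UPS {ups}: {watts} W worst case ({status})")
```

Other loads, such as a network switch, would be added to `feeds` and `draw_w` the same way. The point of the exercise is that the normal load reading on each UPS can look comfortable while the worst-case figure does not.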

Additionally, I'm interested to see how everything is wired in this room. Are all of the UPSes on one circuit? Two? How are the circuits grounded? We need to eliminate the possibility of ground loops. I understand remote sites don't always have the best wiring/power available.

You said the battery in UPS #3 was replaced in 2008. How old are the other batteries? I assume none of the units had replace battery warnings when you were testing onsite.

Finally, the reports from Toral don't seem to sync with everything else. It shows a loss of communication at 14:05, around the time of the outage, which is expected for a rebooted server. However, it also reports a loss of power at 14:54 the same day, which is not reported in UPS #3's logs (NMC or PCBE). Are we sure Toral dropped? Does it have the wrong timestamps? Or is this UPS just part of a separate circuit that may have lost power independently from the others?

That should be everything for now. Look forward to hearing some replies. 🙂

Anonymous user · ‎2021-06-28

Hmmm, I'll have to research a bit on the distribution of circuits in the room. As I said, this is a remote site, so I'll need to get someone to map it out. In answer to the other questions...
- I made a mistake, the PoE switch I said was on UPS4 is actually on UPS3. That meshes with your comment, that switch likely uses the equivalent power of a server.
- I believe the PS redundancy tests were done and good. It's not too likely that more than one server has bad PSs, but I will schedule another full 'pull test' soon. (The Dell OpenManage software would email me if it knew there was a problem...)
- Every server that dumped asked for the reason it was unexpectedly shut down, including 4 VMs running inside of Mothra, a Hyper-V host. Only servers that survived the event showed Powerchute messages.
- Agreed on the quality of power, from both an internal wiring and a utility perspective. There is likely circuit sharing going on, but definitely not all 4 on one!
- All batteries are replaced as they go bad, I will try to get the exact dates. As far as I know, the bi-weekly tests all work, although UPS3 at least hasn't self-tested in a while.

I'll post again next week with more detail.

Message was edited by: amcmis
