DCE 7.4.2 Unable to retrieve data from server

DCIM_Support · ‎2020-07-04

We look after a particular VM DCE server that has been running happily for a few years and has had no changes made to it in at least the last few months (since shortly after the 7.4.2 update was released).

Last week (about 9 days ago) and again today, it has come up with the error "Unable to retrieve data from server" in the web browser, and I am unable to log in using the client at all. I have managed to download the capture logs from the web interface, but I am not sure how to interpret them.

Please can a Schneider SXW expert assist with this and advise what may be wrong?

The server is quite low resource:-

VM Version: 10
CPU: 1 vCPU
Memory: 1024 MB
Hard Disk 1: 18 GB
Hard Disk 2: 60 GB
Hard disk controller: LSI Logic Parallel
Network Adapter 1: E1000

Links to the capturelogs files from both incidents are below

2017-10-09

2017-10-16

DCIM_Support · ‎2020-07-04

Hi Garry,

Exactly when did the issue(s) occur? The more accurate you can get with the time/date, the more easily I can find specific time references in the logs.

How did the server again become responsive, did you reboot or simply wait it out?

How many devices are you monitoring and of what type (SNMP, NetBotz, Modbus)?

Are you using surveillance?

If you look at the time this happened, can you check backups, purges, reports, etc to see if something else specific was happening at this time?

Looking at the messages log, I can see most of the messages for today look like this:

Oct 16 08:30:03 vlp0722 nbSNMPScan: failScan: Marking scan of 10.36.96.7 with SCAN_FAILED resultcode.
Oct 16 08:30:03 vlp0722 nbSNMPScan: failScan: Marking scan of 10.36.96.9 with SCAN_FAILED resultcode.

Usually failed SNMP scans won't cause the system to lock up but it can't hurt to try to correct any comm issues if you can.
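If you want to see which devices are behind the failed scans, a short script can tally them from the messages log. This is only a sketch: the pattern is based on the two lines quoted above, and the function name and sample list are made up for illustration.

```python
import re
from collections import Counter

# Pattern modelled on the nbSNMPScan lines quoted above (an assumption:
# other log versions may word the message differently).
FAILED_SCAN = re.compile(
    r"nbSNMPScan: failScan: Marking scan of "
    r"(\d{1,3}(?:\.\d{1,3}){3}) with SCAN_FAILED"
)

def count_failed_scans(lines):
    """Count SCAN_FAILED entries per device IP address."""
    counts = Counter()
    for line in lines:
        match = FAILED_SCAN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

# Sample input: the two lines from the messages log shown above.
sample = [
    "Oct 16 08:30:03 vlp0722 nbSNMPScan: failScan: Marking scan of 10.36.96.7 with SCAN_FAILED resultcode.",
    "Oct 16 08:30:03 vlp0722 nbSNMPScan: failScan: Marking scan of 10.36.96.9 with SCAN_FAILED resultcode.",
]

for ip, failures in count_failed_scans(sample).most_common():
    print(ip, failures)
```

Run against the whole messages file (for example by passing an open file object instead of the sample list), this gives a quick list of the addresses worth checking for comms problems.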

Looking at the top output, I can see that the memory is not being overworked:

Mem: 1020020k total, 874388k used,

Swap: 2097148k total, 275756k used,

At least at the time this capture was run, everything appeared to be low enough. Thing is, I'm guessing that since you couldn't get to the web during whatever the issue was, this type of reading could have been drastically different. I'm assuming you got the logs from the web page, correct?
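To put those two top lines in proportion, here is a rough sketch that turns them into percentages. The helper name is hypothetical, and keep in mind that on older versions of top the "used" figure on the Mem line generally includes buffers and cache, so a high percentage there does not by itself prove memory pressure; the swap figure is usually the more telling one.

```python
import re

def used_fraction(top_line):
    """Return used/total for a top summary line like 'Mem: 1020020k total, 874388k used,'."""
    match = re.search(r"(\d+)k total,\s*(\d+)k used", top_line)
    if match is None:
        raise ValueError(f"unrecognised top summary line: {top_line!r}")
    total_kib, used_kib = int(match.group(1)), int(match.group(2))
    return used_kib / total_kib

# The two summary lines quoted from the capture above.
for line in ("Mem: 1020020k total, 874388k used,",
             "Swap: 2097148k total, 275756k used,"):
    label = line.split(":")[0]
    print(f"{label}: {used_fraction(line):.1%} used")
# Mem: 85.7% used
# Swap: 13.1% used
```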

The latest NBC.xml appears to be from after a reboot, with a timestamp of 1507563033882, equating to Monday, October 9, 2017 3:30:33.882 PM (UTC).
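For anyone wanting to check timestamps like this themselves, the conversion can be sketched as below. It assumes the value is milliseconds since the Unix epoch, which is what the date quoted above implies; the function name is just for illustration. The result is in UTC, so the server's local time may be offset from it.

```python
from datetime import datetime, timedelta, timezone

# Unix epoch as a timezone-aware datetime.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def millis_to_utc(millis):
    """Convert integer milliseconds since the Unix epoch to a UTC datetime."""
    # Adding a timedelta keeps the arithmetic exact (no float rounding).
    return EPOCH + timedelta(milliseconds=millis)

stamp = millis_to_utc(1507563033882)
print(stamp.isoformat())        # 2017-10-09T15:30:33.882000+00:00
print(stamp.strftime("%A"))     # Monday
```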

I see a number of errors based on com.apc.isxc.common.integration.rest.RestExceptionResolver

Are you using DCO or some other integration? Have you seen any errors there? Anything happen with those systems at the time of the issue?

I also see messages relating to RMS, are you using that as well?

I ask about things like DCO and RMS simply because it's more load on the system. If you say that something happened today (Oct 16), I can't see any specific errors in the log that can account for it. At least, there are no exceptions in Java that can account for this issue on that day. There was one on Oct 9, but that shouldn't be affecting the system today.

Do you use a proxy? I looked at the HTTP error logs, and there didn't seem to be much activity between the 9th, when the system was rebooted, and yesterday. Yesterday and today, however, the log is filled with errors:

[Sun Oct 15 22:40:41 2017] [error] (70007)The timeout specified has expired: ajp_ilink_receive() can't receive header
[Sun Oct 15 22:40:41 2017] [error] ajp_read_header: ajp_ilink_receive failed
[Sun Oct 15 22:40:41 2017] [error] (70007)The timeout specified has expired: proxy: read response failed from 127.0.0.1:8009 (localhost)

Note that this log starts on Nov 22nd of last year, yet half of its errors are from just yesterday and today.
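A per-day count makes that kind of spike easy to spot. The sketch below assumes the error log uses the bracketed timestamp format shown in the three lines above and that day and month names are in English; the function name is hypothetical.

```python
import re
from collections import Counter
from datetime import datetime

# Matches the leading "[Sun Oct 15 22:40:41 2017] [error]" part of each entry.
TIMESTAMP = re.compile(
    r"^\[(\w{3} \w{3} +\d{1,2} \d{2}:\d{2}:\d{2} \d{4})\] \[error\]"
)

def errors_per_day(lines):
    """Count [error] entries per calendar day."""
    counts = Counter()
    for line in lines:
        match = TIMESTAMP.match(line)
        if match:
            # %a and %b are locale-dependent; an English/C locale is assumed.
            stamp = datetime.strptime(match.group(1), "%a %b %d %H:%M:%S %Y")
            counts[stamp.date()] += 1
    return counts

# Sample input: the three entries quoted above.
sample = [
    "[Sun Oct 15 22:40:41 2017] [error] (70007)The timeout specified has expired: ajp_ilink_receive() can't receive header",
    "[Sun Oct 15 22:40:41 2017] [error] ajp_read_header: ajp_ilink_receive failed",
    "[Sun Oct 15 22:40:41 2017] [error] (70007)The timeout specified has expired: proxy: read response failed from 127.0.0.1:8009 (localhost)",
]

for day, count in sorted(errors_per_day(sample).items()):
    print(day.isoformat(), count)
```

Feeding it the full error log instead of the sample list would show how the error volume is distributed between November and the two days in question.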

When was the last backup or restore performed on this system?

Steve

DCIM_Support · ‎2020-07-04

Hi Steve - sorry for the delay - the 'private' posts don't seem to send email alerts when they have changed...

Not sure exactly when the issue occurred the second time - it was sometime between 17:30 on the Friday and 08:30 on the Monday.

I did not wait for the server to become responsive, as on the first occasion it had been like this for many hours. The customer shut down the server using the VM host. It rebooted again quite quickly.

I have attached a spreadsheet of the device list on the server.

No cameras. There are a number of devices offline, maybe 10 in total, although the device list does not reflect this - some comms alarms have been disabled.

There are some custom DDFs on the system (written by Schneider) for some non-APC cooling units - all identified as RTCS....

No DCO is connected, no RMS. We use Portal to look at the active alarms on the system though.

No proxy - the DCE server does not have access to the outside world. We access it via a secured VPN straight to the public interface.

The server was rebooted on the 9th Oct when the issue first was reported (although it may have been in this condition for a couple of days), and again on 16th Oct.

The server has never been backed up - they use some sort of VM snapshot backup system as far as I am aware, although this seems to be on an ad hoc basis rather than a regular thing.

The last known change on the server was to add 2 more sensors to a Virtual sensor. I don't recall which sensors they were though... The changes were made on the 3rd October I think.

The server is currently being updated to 7.4.3 at this very moment. It has been 'Fixing APC SmartUPS UPS sensors' for over 6 hours now... so much for the 45 minutes claimed when you start the update.

DCIM_Support · ‎2020-07-04

Hi Garry,

As I mentioned, I see tons of HTTP errors on the 15th/16th of Oct., but this didn't seem to be the same as what may have happened around the 9th. I guess this could just be another symptom. I don't see anything where I could say do this or stop that to make this issue never occur again.

I don't know if it's possible, but if it does happen again and you can do a remote session with them, contacting support directly before a reboot could help us get the logs while the issue is happening. That would show us not only the messages being logged at that time but also which resources are being used most and whether memory could be an issue.

Increasing memory from 1 gig to 2 or more and increasing processors from a single one (especially if it's shared) could be helpful if it's simply a resource issue.

Steve
