APC UPS Data Center & Enterprise Solutions Forum
Schneider, APC support forum to share knowledge about installation and configuration for Data Center and Business Power UPSs, Accessories, Software, Services.
Posted: 2021-07-01 05:07 AM . Last Modified: 2024-03-05 01:46 AM
Good morning.
At work we have 12 x ACRC103 units, a UPS, some PDUs and an InfraStruxure server.
We have a check_mk server that checks these units periodically (every two minutes) with a custom Python script (I will publish it here in the next few days for anyone who needs it) that simply performs some SNMP queries to read some values.
We noticed that there are always one or more servers (but never more than 2-3) that do not reply in time and give a timeout. The strange thing is that they are not always the same servers; they rotate every 5-10 minutes.
I tried calling snmpwalk manually and noticed that sometimes snmpwalk simply hangs and does not answer anymore.
What we did to debug this problem:
1) Changed the IP address to find out whether it was an IP conflict
2) Changed the check_mk server to find out whether it was a server problem
3) Increased the SNMP timeout to the maximum possible
4) Reduced the check frequency
My suspicion is that when the InfraStruxure server contacts the units, the unit simply hangs and no longer answers SNMP queries. Is that possible?
Another thing I noticed while checking the firmware version is that there seem to be two different versions installed on the same unit. Is this OK?
Thank you in advance
Michele
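A hung snmpwalk like the one described above can at least be bounded from the calling side, so that one stuck device does not block the whole check. The following is a minimal sketch, not the script used in this thread: it assumes the Net-SNMP command-line tools are installed, and the function names, the community string and the host address are hypothetical. The SNMP-level timeout and retries are set explicitly with `-t` and `-r`, and a separate wall-clock deadline kills the process if it still does not return.

```python
import subprocess
import time

SYS_UPTIME_OID = "1.3.6.1.2.1.1.3.0"  # sysUpTime.0 from MIB-II


def build_snmpget(host, community="public", timeout_s=2, retries=1,
                  oid=SYS_UPTIME_OID):
    """Return a Net-SNMP snmpget command line for one SNMPv1 GET."""
    return ["snmpget", "-v1", "-c", community,
            "-t", str(timeout_s), "-r", str(retries), host, oid]


def run_with_deadline(cmd, deadline_s):
    """Run cmd and kill it if it exceeds deadline_s seconds of wall-clock time.

    Returns (status, elapsed_seconds) where status is 'ok', 'error' or 'hung'.
    """
    start = time.monotonic()
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True,
                              timeout=deadline_s)
        status = "ok" if proc.returncode == 0 else "error"
    except subprocess.TimeoutExpired:
        # The process was killed by subprocess.run after the deadline.
        status = "hung"
    except OSError:
        # For example, the snmpget binary is not installed.
        status = "error"
    return status, time.monotonic() - start
```

A call such as `run_with_deadline(build_snmpget("192.0.2.10"), deadline_s=5)` would then return within roughly five seconds even if the device never answers. Logging the status and elapsed time per device on every cycle makes it easier to see which units rotate in and out of the timeout state.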
Posted: 2021-07-01 05:07 AM . Last Modified: 2024-03-05 01:45 AM
Hi Michele.
I would definitely recommend upgrading the Central server to the newest version. You would need a software support contract to receive the link for the upgrade.
Is it possible to post the python script for review?
In your initial post, you indicate that 2-3 servers do not reply and you get timeouts. Where does it show these timeouts? On Central? On another polling system? Please specify.
At the time of the timeouts, are you able to connect to the devices using a different SNMP Utility? What does this show? If the device is not reachable via SNMP (at the time of the timeout/comms loss) and there are no events on the device, then it sounds like it could point to a network issue.
What utility are you using for your SNMPWalk? Did you carry out an SNMPWalk on the APC devices? Do these devices timeout?
Do you get any timeouts/comms loss alarms on Central for the devices that are listed above? If so, is it possible to send in the logs? It would also be helpful to send in the alarm history on Central for those devices that are timing out.
What is the timeout/retries specified in Central? What is the scan interval set to?
How many devices in total are you monitoring via Central?
Are you using a proxy?
Sometimes, if you are using other applications to poll the same devices, this could interfere with our polling. When did you first notice the issue? Have any changes been made to the network? Is it possible, as a test, to disable the other polling applications and use only Central to poll the devices?
What happens when you poll these devices directly? Are there any timeouts then?
Are all the devices on the same subnet? Is there heavy traffic on this subnet that might be causing issues/timeouts? Have you tried to move one of the devices to a different subnet as a test?
Are the devices losing comms at the same time or at different times? Is it the same time every day? How many times a day? Is it random?
Is it possible to run a packet capture? If you can, we should be able to tell if the Central is polling the devices or not and if they are responding.
Regards,
B
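The questions above about whether devices lose comms together, how often, and whether the pattern is random can be answered from a simple log of timeout events. This is a minimal sketch under stated assumptions: the event format (a timestamp string plus a host name), the function name, and the 120-second window are all hypothetical, and timestamps without a timezone are treated as UTC.

```python
from collections import Counter, defaultdict
from datetime import datetime, timezone


def summarize_timeouts(events, window_s=120):
    """Summarize timeout events given as (ISO timestamp string, host) pairs."""
    per_host = Counter()
    per_window = defaultdict(set)
    for stamp, host in events:
        t = datetime.fromisoformat(stamp)
        if t.tzinfo is None:
            # Assumption: naive timestamps are UTC.
            t = t.replace(tzinfo=timezone.utc)
        per_host[host] += 1
        # Bucket by polling window so simultaneous failures land together.
        per_window[int(t.timestamp()) // window_s].add(host)
    multiple = sum(1 for hosts in per_window.values() if len(hosts) > 1)
    return {
        "per_host": dict(per_host),
        "windows_with_timeouts": len(per_window),
        "windows_with_multiple_hosts": multiple,
    }
```

If most windows contain several hosts at once, a shared cause such as the network or a concurrent poller becomes more plausible; if the failures are spread evenly over hosts and never coincide, the individual management cards are the more likely place to look. Neither reading is conclusive without a packet capture.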
Posted: 2021-07-01 05:07 AM . Last Modified: 2024-03-05 01:46 AM
Hello Angela.
Here is the requested information:
1) We have an InfraStruxure Central - Version 6.2.0
2) We have a dedicated private LAN for these devices:
12 x ACRC103 (application acrc v.3.7.0, os aos 3.7.3 - Is it ok to have different versions?)
1 x 0G-9354-01 - PDU (application xrdp v.3.7.0, os aos v.3.7.3)
2 x AP7957 - switched rack pdu (application rpdu v.3.7.0, os aos v.3.7.0)
1 x AP7853 - rack pdu (application rpdu v.2.6.5, os aos v.2.6.4)
1 x AP7853 - metered rack pdu (application rpdu v.3.5.5, os aos v.3.5.6)
3) SNMP version: v1 (v3 is disabled)
4) I don't have to restart anything. After some time the unit that was timing out comes back, but another one (or another 2-3) shows the same problem.
5) Quite a lot of OIDs are requested: 9 OIDs for PDU (every OID returns 22 entries), 30 OIDs for ACRC103 (every OID returns just one value)
Thank you in advance
Michele
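The OID counts in point 5 give a rough idea of how many SNMP requests each poll cycle generates. SNMPv1 has no GETBULK, so a walk fetches a table one GETNEXT at a time and needs one extra request to detect the end of the subtree. The sketch below is only an estimate under assumptions that are not confirmed in this thread: one variable per request, all five PDU-type devices polled with the same 9 OIDs, and a 120-second cycle. The function names are hypothetical.

```python
def walk_requests(entries):
    """GETNEXT requests for one walked OID: one per entry plus one to see the end."""
    return entries + 1


def requests_per_device(pdu_oids=9, entries_per_oid=22, acrc_oids=30):
    """Return (requests per PDU, requests per ACRC103) for one poll."""
    pdu = pdu_oids * walk_requests(entries_per_oid)
    acrc = acrc_oids  # one GET per scalar OID, assuming no varbind packing
    return pdu, acrc


def fleet_load(n_pdu=5, n_acrc=12, cycle_s=120):
    """Return (requests per cycle, average requests per second) for the fleet."""
    pdu, acrc = requests_per_device()
    total = n_pdu * pdu + n_acrc * acrc
    return total, total / cycle_s
```

Under these assumptions a PDU costs 207 requests per poll and an ACRC103 costs 30, for about 1395 requests per cycle from check_mk alone, before counting whatever InfraStruxure Central sends. Whether the management cards can sustain that alongside Central's polling is not something this arithmetic can answer, but it shows why packing several OIDs into one request or polling fewer values might be worth testing.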
Posted: 2021-07-01 05:07 AM . Last Modified: 2024-03-05 01:45 AM
QueenB - does this sound related to ISX Central at all (since it is several revisions behind the current)? I was leaning towards no since it sounds like Michele Renda has attached to the APC LAN and done an SNMPWalk on the devices which don't respond either but they come back to life by themselves.
On the AOS and APP versions, it's OK to have different numbers. Sometimes they are the same, sometimes they are not. AOS v3.7.4 and rpdu v3.7.4 are the latest versions for rpdu. v2.X.X is quite old (like 10 years old!!) and is there a reason you have not updated your AP78XX or AP79XX devices to newer versions? Your xrdp ISX PDU device is only one or two revisions behind. I certainly cannot vouch for SNMP stability on the v2.X.X firmwares but it should be OK on most of the 3.X.X firmware level stuff. I'd still consider upgrading anything that can use upgrading if you wanted to in order to rule out any issues. If there is still a problem, we at least know it is occurring on the latest production revs we offer.
Are there any APC devices on the APC LAN that are not experiencing this issue in the same way? And from what you said, it's random, correct? So it's not like they all timeout at the same time?
When they don't respond, is it a complete timeout, or do the devices return partial values, just really slowly?
From what I see, check_mk is a Nagios plugin? Is that just a computer you have on the APC LAN that pings these devices, or does it retrieve a particular set of OIDs too (in addition to the status that ISX Central can provide)?
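One way to tell a complete timeout apart from a slow, partial answer is to keep whatever output a walk produced before it was cut off. This is a minimal sketch, assuming the poll is an external command (for example a Net-SNMP snmpwalk) that prints one line per value; the function name and the three labels are hypothetical.

```python
import subprocess


def classify_poll(cmd, deadline_s):
    """Run cmd and return (verdict, lines).

    verdict is 'complete' if the command exited with code 0, 'partial' if it
    failed or hit the deadline after printing some lines, 'silent' otherwise.
    """
    try:
        proc = subprocess.run(cmd, capture_output=True, timeout=deadline_s)
        out, finished = proc.stdout, proc.returncode == 0
    except subprocess.TimeoutExpired as exc:
        # exc.stdout holds whatever was captured before the deadline, or None.
        out, finished = exc.stdout or b"", False
    lines = out.decode(errors="replace").splitlines()
    if finished:
        return "complete", lines
    return ("partial" if lines else "silent"), lines
```

A "partial" verdict would suggest the device is answering but slowly or stalling mid-walk, while "silent" means nothing came back at all within the deadline.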
Posted: 2021-07-01 05:07 AM . Last Modified: 2024-03-05 01:45 AM
Hello Angela.
Thank you for your answers. I followed your suggestion and updated the firmware of almost all the units (I still have some to complete), and the problem is still there.
Now, the only thing left to update is the "InfraStruxure Central - Version 6.2.0" unit. I looked for a firmware update but without success. Do you know if there is a publicly available update on the apc.com site? We no longer have a support contract with APC.
Thank you very much for your support and have a nice day.
Regards
Michele
Posted: 2021-07-01 05:07 AM . Last Modified: 2024-03-05 01:45 AM
Hi Michele,
As far as I know, you'll need a paid support contract in order to obtain an update for your server.