There are several symptoms you might see while interacting with DCE that suggest a potential performance problem in your environment. This list is by no means exhaustive. Some common symptoms are:
Missed sensor update values
The nbc.xml log from the DCE server contains ERROR level messages about dropped sensor updates coming from com.apc.isxc.vb.listeners.sensor.impl.SensorQProcessorRunnable or com.netbotz.server.services.repository.impl.RepositoryEventServiceImpl
Log in to the DCE web client and click Logs in the upper right corner to view the nbc.xml log. A sketch for scanning an exported copy of this log follows the symptom list below.
Delay in receiving alarm data
Alarms arrive in the system significantly later than when they were triggered on the monitored device.
Server hang, crash, or timeouts
This error is displayed on the DCE server: Hung_task_timeout
Contact technical support to gather capture server logs.
See https://www.apc.com/us/en/faqs/index?page=content&id=FA303596
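For reference, the sketch below scans an exported copy of nbc.xml for the dropped-sensor-update entries described above. The file name is an assumption; export the log from the DCE web client or obtain it from a capture server logs bundle first.

```python
# Minimal sketch, assuming nbc.xml has been exported to the working directory.
# It flags ERROR entries that mention either of the classes associated with
# dropped sensor updates.
SOURCES = (
    "com.apc.isxc.vb.listeners.sensor.impl.SensorQProcessorRunnable",
    "com.netbotz.server.services.repository.impl.RepositoryEventServiceImpl",
)

with open("nbc.xml", encoding="utf-8", errors="replace") as log:
    for line_number, line in enumerate(log, start=1):
        if "ERROR" in line and any(source in line for source in SOURCES):
            print(f"line {line_number}: {line.strip()}")
```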
Top is a standard Linux diagnostic tool used to monitor system performance. Direct access to the DCE server is not allowed. Contact technical support to capture server logs that include a top_output.
Note: Prior to DCE 7.7, the top output from captured server logs is averaged across all CPU cores. Starting with 7.7, the output per core is available, which is more insightful.
Support looks at a few different values in the top output:
The load average lists an average for the last one-, five-, and fifteen-minute periods. If this number is abnormally high relative to the number of cores defined for the system, it is a good indicator that the system is running with a lot of CPU load. The exact cause of the CPU load won’t be clear from this data alone. The system could be CPU starved if this value remains high for an extended period of time.
It is expected that this value is elevated for some period of time after a system reboot or during a large discovery. Divide this value by the number of cores, then multiply by 100 to get a percent utilization. Each physical core counts as 1; a hyperthreaded core counts as ½.
For example, an 8 core / 8 thread virtual machine should be able to sustain a load average of 8.0 without being considered oversubscribed. If you are using an 8 core, 16 thread configuration, your acceptable load average is more like 12.0 because not all 16 threads are backed by physical cores.
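As an illustration only, the following sketch applies that rule of thumb; the calculation is not part of DCE, and the function names are just placeholders:

```python
# Minimal sketch of the percent-utilization rule of thumb described above.
# Physical cores count as 1 and hyperthreaded siblings count as 1/2.
def effective_cores(physical_cores: int, total_threads: int) -> float:
    hyperthreads = max(total_threads - physical_cores, 0)
    return physical_cores + 0.5 * hyperthreads

def percent_utilization(load_average: float, physical_cores: int, total_threads: int) -> float:
    return load_average / effective_cores(physical_cores, total_threads) * 100

# 8 cores / 8 threads at a load average of 8.0 -> 100% (fully used, not oversubscribed)
print(percent_utilization(8.0, 8, 8))
# 8 cores / 16 threads at a load average of 12.0 -> 100% of the ~12 effective cores
print(percent_utilization(12.0, 8, 16))
```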
Mitigation in this case consists of either reducing the load on your DCE (fewer devices, longer poll period) or allocating more CPU resources to your DCE to get your load average to a more acceptable level.
Make sure to review the DCE sizing guide for insight on the best starting values for CPU configuration to use based on the system workload.
The wait average (%wa) represents the amount of time the system spends stalled waiting for the underlying storage device to service requests. DCE is extremely sensitive to IO path delays, so even a slightly elevated wait average that persists for an extended period of time can be an issue for the system.
Ideally, you want to see this value listed by CPU core. If any individual core shows a %wa continually over 20, your storage is likely not keeping up with DCE. If the system is allowed to stay in this state for an extended period of time, you usually start to see the missed sensor update symptom listed above. Bear in mind that if you are reviewing the averaged top output instead of the per-core output, this value can deceptively appear much lower due to averaging across the number of cores in the system.
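As a reference point, here is a minimal sketch that scans a per-core top capture (DCE 7.7 or later) for cores exceeding that threshold. The file name and the exact top line layout are assumptions, so adjust the pattern to match your capture:

```python
# Flag any CPU core whose I/O wait (%wa) exceeds 20 in a saved top output.
import re

# Matches per-core lines such as "%Cpu3  :  1.0 us, ... 25.4 wa, ..."
WA_PATTERN = re.compile(r"^(%Cpu\d+)\s*:.*?([\d.]+)\s+wa", re.MULTILINE)

with open("top_output.txt") as capture:   # hypothetical capture file
    for core, wait in WA_PATTERN.findall(capture.read()):
        if float(wait) > 20.0:
            print(f"{core}: %wa of {wait} -- storage may not be keeping up with DCE")
```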
Mitigation requires a deeper dive into your storage path. If you are using network storage, review the latency and utilization of the storage array. If you are using local ESXi storage, you can review the Host performance data in VMWare. Reducing the load on the DCE by decreasing the device count or increasing the poll period will usually help.
If the storage is truly subpar, upgrading to SSDs, removing other load from the storage system, or improving the network path between the DCE and storage may be required. Reference the DCE sizing guide for more details on appropriate storage sizing.
The sensor queue statistics (qstats) are tracked by the DCE server. They represent the amount of sensor processing the server is doing every hour. This value can be monitored at:
http://<dce server ip>/nbc/compress/support/sensorqstats
The dataset can also be retrieved by technical support with a capture server logs gather request.
Regardless of where you view the data, this statistic is published once every hour. This metric is good to monitor because it shows whether the DCE is keeping up with the current workload or falling behind.
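If you want to collect the page programmatically, a minimal sketch along these lines can work, assuming the page is reachable over HTTP from your workstation and no additional authentication is enforced in your environment (the server IP is a placeholder):

```python
# Fetch the hourly sensor queue statistics from the DCE server.
import requests

DCE_HOST = "192.0.2.10"  # hypothetical DCE server IP

url = f"http://{DCE_HOST}/nbc/compress/support/sensorqstats"
response = requests.get(url, timeout=30)
response.raise_for_status()

# Print the raw hourly statistics so they can be reviewed or archived.
print(response.text)
```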
These values are of particular interest:
Processed
This is the number of sensor updates that the server has processed in the last hour. This value is directly impacted by the number of devices in your system, your poll period, and the number of sensor changes that are occurring.
This value represents the total number of unique events that the system completed within that one-hour period. It is best observed during steady system processing. Events like discovering a large quantity of new devices can skew this number for a period or two. Use this value when you review the DCE sizing guide to determine CPU, RAM, and storage sizing.
Dropped
This value should always be zero on a healthy system. Any non-zero value indicates a sensor data point that was dropped because a component of the system cannot keep up. When this value is not zero, we often see %wa elevated in top output.
Remember, DCE is very intolerant of storage latency. If the dropped value is non-zero on a recurring basis, some amount of data is constantly being lost. If there is a non-zero value only occasionally, look into the system during those times; it is likely running near the edge of its capabilities and being pushed beyond its limits. Events such as a large alarm storm, a discovery pulling in a large number of devices, or similar high-load events can all push the system temporarily into this state.
A properly configured system should always have zero drops. Anything dropped will be lost forever, so it’s important to monitor and adjust resources accordingly to prevent this.
Remaining
This value represents the amount of sensor data still in the queue to be processed when the qstats report was run. This is not dropped data; it is data that had not yet finished being processed. On smaller systems, this will likely always be zero. As the workload increases, this value could start to become non-zero.
By itself, having some non-zero values here is not cause for alarm. If you are regularly seeing non-zero values, or the value is growing in size every hour, it’s a sign that the system is starting to have trouble keeping up.
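The health rules above can be summarized in a small sketch. The hourly values are sample data, and parsing of the sensorqstats page is not shown; only the flagging logic is illustrated:

```python
# Flag dropped sensor updates and a growing processing backlog, hour over hour.
from typing import List, Tuple

# (processed, dropped, remaining) for consecutive hourly reports -- sample data
hourly_stats: List[Tuple[int, int, int]] = [
    (120_000, 0, 0),
    (118_500, 0, 250),
    (121_300, 40, 900),   # drops plus a growing backlog
]

previous_remaining = 0
for hour, (processed, dropped, remaining) in enumerate(hourly_stats, start=1):
    if dropped > 0:
        print(f"Hour {hour}: {dropped} updates dropped -- data is lost, review storage and CPU load")
    if previous_remaining > 0 and remaining > previous_remaining:
        print(f"Hour {hour}: queue backlog grew from {previous_remaining} to {remaining}")
    previous_remaining = remaining
```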
The primary focus of this section is DCE run as a virtual machine. Information that is not hypervisor-specific also applies to DCE physical servers.
Sometimes there are delays for reasons not readily visible from the DCE or DCE OS point of view. In these cases, it helps to review performance data from the hypervisor side to see whether there are any performance issues there.
It is good to verify whether any resource limits are defined for the DCE virtual machine. Resource limits are a throttling mechanism that allows a VM administrator to restrict the amount of resources a virtual machine can consume. These limits can be imposed on CPU, RAM, and storage resources.
If there are resource limits in place, try removing them or raising them to a higher value. Monitor the system utilization values after the change to see whether performance improves.
To start, identify the DCE VM from within the hypervisor and review the details of the virtual machine. Specifically, look for the disk drive(s) of the DCE and the storage backing each drive.
If your DCE has more than one disk drive, they should ALL be located on the same storage destination. Splitting DCE drives among multiple storage backings almost always results in decreased VM performance and should be avoided as a general rule.
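One way to review both checks above from a script is the pyVmomi SDK, sketched below as an assumption-heavy example; PowerCLI or the vSphere UI show the same information. The vCenter host, credentials, and VM name are placeholders:

```python
# Review resource limits and per-disk datastore placement for the DCE VM.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

context = ssl._create_unverified_context()          # lab use only
si = SmartConnect(host="vcenter.example.local",
                  user="administrator@vsphere.local",
                  pwd="password",
                  sslContext=context)
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.VirtualMachine], True)
    dce_vm = next(vm for vm in view.view if vm.name == "DCE-VM")

    # A limit of -1 means unlimited; any other value is a throttle worth reviewing.
    print("CPU limit (MHz):", dce_vm.config.cpuAllocation.limit)
    print("Memory limit (MB):", dce_vm.config.memoryAllocation.limit)

    # All DCE virtual disks should report the same datastore.
    for device in dce_vm.config.hardware.device:
        if isinstance(device, vim.vm.device.VirtualDisk):
            print(device.deviceInfo.label, "->", device.backing.datastore.name)
finally:
    Disconnect(si)
```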
VMWare
In VMWare, you can monitor the real-time disk performance of the storage that is backing the DCE VM. The specifics of finding this data differ a bit between versions of VMWare, and whether you investigate from the ESXi host locally or from within vCenter. All versions support monitoring disk latency.
Look for the Advanced Performance Monitoring section of the ESXi host running the DCE VM. In that section, you can view the real-time latency of all IO operations that host is sending to disk.
Hyper-V
In Hyper-V, you can use the Windows Performance Monitor to track the latency of the target VM. The latency counters are found under the Hyper-V Virtual Storage Device category when you add counters to Performance Monitor.
Just like in VMWare, DCE is very sensitive to disk latency. Make sure the latency value, in ms, is less than 1 for the datastore backing the DCE VM. While some short-lived spikes can be tolerated, it is best to make sure the steady-state and average response times remain below 1 ms.
If response times exceed 1 ms, look for ways to lower that value. You can reduce the number of systems that also use the shared volume, isolate the DCE VM so it is the only system using that volume, or upgrade the target volume to have more disks, faster disks, or preferably SSDs.
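As a simple illustration of that rule, the sketch below evaluates latency samples (in milliseconds) that you have already exported from vCenter or Performance Monitor; the export step itself is not shown:

```python
# Compare exported datastore latency samples against the 1 ms guideline.
latency_ms = [0.4, 0.6, 0.5, 3.2, 0.7, 0.5]   # sample data

average = sum(latency_ms) / len(latency_ms)
spikes = [sample for sample in latency_ms if sample >= 1.0]

print(f"average latency: {average:.2f} ms")
if average >= 1.0:
    print("steady-state latency is above 1 ms -- investigate the storage path")
elif spikes:
    print(f"{len(spikes)} short-lived spike(s) above 1 ms -- tolerable if infrequent")
```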
Drilling down another level into the hypervisor, you can run esxtop, a real-time performance analysis tool provided by VMWare. This utility is very similar to Linux top, and its usage is much the same.
To start, SSH must be enabled on the ESXi host running your DCE VM, and you must have the proper credentials to SSH into it. This is a real-time analysis, so the information gathered is only applicable if your DCE is in the performance-degraded state while you run this tool. For intermittent issues, run this tool and then cause the event that triggers the degraded system state.
As an example, the following steps cover how to perform a 30-minute esxtop capture from the ESXi host. There is additional documentation about running esxtop interactively in Additional resources below.
To capture a 30-minute data set from esxtop:
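For reference, a batch-mode capture along these lines produces roughly 30 minutes of data (900 samples at a 2-second interval). The host name and output file are placeholders, and this assumes SSH access to the ESXi host:

```python
# Run esxtop in batch mode over SSH and save the CSV output locally.
import subprocess

ESXI_HOST = "root@esxi.example.local"   # hypothetical ESXi host
OUTPUT_FILE = "esxtop_capture.csv"      # hypothetical output file

# -b batch mode, -a all counters, -d 2 seconds between samples, -n 900 samples
command = ["ssh", ESXI_HOST, "esxtop -b -a -d 2 -n 900"]
with open(OUTPUT_FILE, "w") as output:
    subprocess.run(command, stdout=output, check=True)
print(f"esxtop capture written to {OUTPUT_FILE}")
```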
You can now choose which data from the log collection you want to graph to determine signs of stress from typical system resources: CPU, RAM, drives. Some values of interest are:
For additional analysis of the esxtop data, see Additional resources below.
To track CPU usage in Windows Performance Monitor, follow these steps:
Additional resources to help you better understand some of the performance tools, what they mean, and how to use them:
ESXTOP quick overview
http://www.running-system.com/wp-content/uploads/2015/04/ESXTOP_vSphere6.pdf
ESXTOP metrics
https://www.virten.net/vmware/esxtop/
ESXTOP interpretation
https://communities.vmware.com/docs/DOC-9279
VMWare KB: Troubleshooting ESXi virtual machine performance issues
https://kb.vmware.com/s/article/2001003
VMWare KB: Troubleshooting ESXi storage performance issues
https://kb.vmware.com/s/article/1008205
Troubleshooting Hyper-V
https://www.smikar.com/troubleshooting-hyper-v/
Use these questions as a starting point for data that you should gather from the site if you suspect a DCE VM performance issue. If you open a case to diagnose this problem, technical support and engineering will request this data. Proactively gathering the data will help expedite issue resolution.
The questions are written to gain a better understanding of the environment hosting the DCE virtual machine. The goal is to understand the capabilities of the hypervisor, the storage supporting DCE, and resource utilization.
If your DCE VM is using network storage for its disk backing:
Please describe the network topology where this DCE is deployed. Link speeds between nodes of the system are of specific interest.
While running your typical DCE workload, use the esxtop tool to collect a snapshot of your system. Ideally, the collection should cover the period of time when you are experiencing the performance issue.
Esxtop collection