The Home page displays details about virtual machines and hosts that are in a critical state in your environment. This is the first place to look to see if there's any trouble in your environment.
To check that you are receiving data, you can look at the Proactive Monitoring view and see that the topology tree is built from your environment, or you can look at the App Install Health page.
All of the panels in this dashboard (except for Recent Alarms) are driven by performance metrics.
Understanding the gauges
The first two panels in this dashboard report on a set of key metrics that you can use to monitor the health of the virtual machines and hosts in your environment.
Gauges provide a dynamic visualization of saved searches.
Each of the gauges in the Virtual Machine Health panel and Host System Health panel is a graphical representation of the entities (hosts and virtual machines) that are in a critical state in your environment for the specified metric. Together, the gauges measure how your entire environment performs at the critical level. Each gauge displays the percentage of the total number of virtual machines or hosts that are in a critical state for its metric over the specified time period. This value is a numerical representation of the gauge display and is based on the same search that drives the gauge. It shows the current result of that search, and the display changes as the search results change. The numerical value is mapped against a range of colors. As the value changes over time, the gauge marker changes position within this range.
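As a rough illustration of the value a gauge shows, the following Python sketch computes the percentage of entities in a critical state. It is not the saved search that the app runs. The function name, the entity names, and the state strings are assumptions made for this example.

```python
from typing import Mapping


def critical_percentage(states: Mapping[str, str]) -> float:
    """Return the percentage of entities whose state is 'critical'.

    `states` maps an entity name (hypothetical) to its state for one metric.
    """
    if not states:
        raise ValueError("no entities to measure")
    critical = sum(1 for state in states.values() if state == "critical")
    return 100.0 * critical / len(states)


# Hypothetical sample: one of four virtual machines is critical.
vm_states = {
    "vm-01": "normal",
    "vm-02": "critical",
    "vm-03": "warning",
    "vm-04": "normal",
}
print(critical_percentage(vm_states))  # 25.0
```

Only entities in the critical state count toward the value. An entity in the warning state does not raise the percentage.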
A gauge can be in one of the following states:
- Red - when virtual machines or hosts are in a critical state for the metric.
- Orange - when virtual machines or hosts are in a warning state for the metric.
- Green - when virtual machines or hosts are in a normal state for the metric.
In addition to specifying a percent value, each gauge can also display:
- 0% - A gauge that displays 0% indicates that none of the virtual machines or hosts in your environment are in a critical state for that metric.
- "no data" - A gauge displays this message when performance data is not collected from your environment and is not coming into Splunk.
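The difference between 0% and "no data" can be sketched as follows. This is a minimal illustration, assuming that a missing search result is represented as None. The function name gauge_label is hypothetical and is not part of the app.

```python
from typing import Optional


def gauge_label(critical_pct: Optional[float]) -> str:
    """Return the text a gauge would show for a given search result.

    None stands for "no performance data arrived" (an assumption of this
    sketch), which is different from a real result of 0 percent.
    """
    if critical_pct is None:
        return "no data"
    return f"{critical_pct:.0f}%"


print(gauge_label(0.0))   # 0%
print(gauge_label(25.0))  # 25%
print(gauge_label(None))  # no data
```

A value of 0% is a valid measurement that means no entity is critical, while "no data" means that nothing was measured.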
By looking at the gauges, you can identify whether hosts or virtual machines in your environment have problems that need immediate attention.
A performance metric for a virtual machine or a host system can be in one of three states: normal, warning, or critical. The state is determined by the thresholds set for the metric. You can modify the thresholds for the metrics on the Threshold Configuration page of the app.
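The relationship between a metric value, its thresholds, and the resulting state can be sketched in Python. This is an illustration only. The Threshold class, the classify function, and the values 75 and 90 are assumptions for this example and are not the app's defaults. The sketch also assumes that a higher value is worse and that reaching a threshold is enough to trigger the state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    """Warning and critical levels for one metric (hypothetical values)."""
    warning: float
    critical: float


def classify(value: float, threshold: Threshold) -> str:
    """Map a metric value to 'normal', 'warning', or 'critical'."""
    if value >= threshold.critical:
        return "critical"
    if value >= threshold.warning:
        return "warning"
    return "normal"


# Hypothetical thresholds for average_cpu_usage_percent.
cpu_threshold = Threshold(warning=75.0, critical=90.0)
print(classify(42.0, cpu_threshold))  # normal
print(classify(80.0, cpu_threshold))  # warning
print(classify(95.0, cpu_threshold))  # critical
```

The critical level is checked first so that a value above both thresholds is reported as critical rather than warning.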
For example, in this dashboard a gauge that shows a value of 0% for High CPU Usage indicates that none of the virtual machines in your environment are in a critical state for that metric. This means that, of all the performance data collected for all of the virtual machines, no virtual machine has a performance metric that meets the critical threshold level set for it. Each of the metrics measured has default thresholds defined for it in the app. You can see the default values or configure other values using the Threshold Configuration dashboard in the app. For information on how to configure critical and warning thresholds for the metrics, see "Add, edit, and delete threshold settings" in the Splunk App for VMware Configuration Guide.
Virtual Machine Health
The gauges displayed in this panel are a measure of how your virtual machines perform for the critical level.
The following key metrics, used to show the health of the virtual machines, drive the gauges in the Virtual Machine Health panel:
- High CPU usage - The threshold for the metric average_cpu_usage_percent drives this gauge. This is the virtual machine's average CPU usage as a percent value.
- High memory usage - The threshold for the metric average_mem_usage_percent drives this gauge. This is the average of the amount of memory the virtual machine uses, as a percent value.
- High CPU Sum Ready time - The threshold for the metric summation_cpu_ready_millisecond drives this gauge. This metric is measured in milliseconds and shows how long a virtual machine has been waiting for processing time from the host. The virtual machine is ready to run, but it cannot do anything because the host has not allocated any resources to it. Sometimes a virtual machine that has too many resources allocated to it does not get scheduled to run by the host and is left waiting.
- Total VMs - This is a count of the total number of virtual machines in your environment. Click on the number for Total VMs to see more details about each of the virtual machines in your environment, such as the host system that it is on and the associated vCenter.
- Total VM Migrations - This is the total number of virtual machines that migrated. Click on the number for Total VM Migrations to get more details about the virtual machines that migrated in the last four hours. To see the virtual machines that migrated the most, you can re-order this list by "TotalMigrations".
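Because CPU ready time is reported in milliseconds, it can help to relate a value to the length of the sampling interval. The following sketch uses the commonly cited vSphere conversion from a ready-time summation to a percentage. The function name is hypothetical, and the 20-second default is an assumption based on the vSphere real-time sampling interval. The interval that applies to your data may differ.

```python
def cpu_ready_percent(ready_ms: float, interval_s: float = 20.0) -> float:
    """Convert a CPU ready summation in milliseconds to a percentage
    of the sampling interval.

    interval_s defaults to 20 seconds, which is an assumption of this
    sketch and not a value stated by the app.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return 100.0 * ready_ms / (interval_s * 1000.0)


# A virtual machine that waited 2000 ms during a 20-second sample.
print(cpu_ready_percent(2000.0))  # 10.0
```

In this example the virtual machine spent one tenth of the sample waiting for the host to schedule it.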
Host System Health
The gauges displayed in this panel measure how your host systems perform for the critical level.
The following key metrics, used to show the health of the host systems, drive the gauges in the Host System Health panel:
- High Memory Ballooning - The threshold for the metric average_mem_vmmemctl_kiloBytes drives this gauge. This is the sum of all values from VMware's ballooning driver for all powered-on virtual machines. The host memory must be large enough to support the active memory of all virtual machines on the host. This number should be 0. Balloon drivers activate when memory is scarce. It's best not to have any ballooning activity.
- High Memory swapping - The threshold for the metric average_mem_llSwapUsed_kiloBytes drives this gauge. This is the amount of memory from all virtual machines that has been swapped by the host. Host-level memory swapping is always a sign that the host is in a stressed state. Whenever this threshold is triggered, the host has no free memory and cannot reclaim any through the ballooning driver. This number should be 0.
- High CPU usage - The threshold for the metric average_cpu_usage_percent drives this gauge. This is the host system's average CPU usage as a percent value.
- Total Hosts - This is a count of the total number of hosts in your environment. Click on the value displayed for Total Hosts to see more details about each individual host.
Look at this panel to get information about all of the datastores in your environment. The data is measured in megabytes, not as a percentage. The indicator shows the amount of free space and the amount of storage committed.
You can quickly see if a datastore is close to capacity and in a critical state. Datastores can be in critical, warning, or normal operational states. If the app cannot gather sufficient information about a datastore, the datastore is shown in gray, indicating that the data for the datastore is unavailable or that the entity is not powered on.
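The datastore states described above can be sketched as follows. This is an illustration under stated assumptions and not the logic the app uses. The function name, the free-space limits of 25 and 10 percent, and the use of None for missing data are all hypothetical.

```python
from typing import Optional

# Hypothetical limits, expressed as the share of capacity that is still free.
WARNING_FREE_PCT = 25.0
CRITICAL_FREE_PCT = 10.0


def datastore_state(capacity_mb: Optional[float], free_mb: Optional[float]) -> str:
    """Return 'normal', 'warning', 'critical', or 'unavailable'.

    'unavailable' corresponds to the gray indicator: the data needed to
    judge the datastore is missing.
    """
    if capacity_mb is None or free_mb is None or capacity_mb <= 0:
        return "unavailable"
    free_pct = 100.0 * free_mb / capacity_mb
    if free_pct <= CRITICAL_FREE_PCT:
        return "critical"
    if free_pct <= WARNING_FREE_PCT:
        return "warning"
    return "normal"


print(datastore_state(1000.0, 500.0))  # normal
print(datastore_state(1000.0, 50.0))   # critical
print(datastore_state(None, 50.0))     # unavailable
```

Missing data is handled before any arithmetic so that an unknown datastore is never reported as healthy.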
Recent VMware alarms
In this panel you can see events that occurred in your environment and triggered alarms. Alarms can be triggered for a number of reasons, such as memory usage reaching a critical level for a virtual machine or CPU usage reaching a critical level for a host.
For example, click on an alarm for "virtual machine memory usage" to see the event that triggered it. The Virtual Machine detail page is displayed. You can now see details about the event that triggered the alarm:
Max CPU usage during the selected time range peaked at a critical level with value of 112%. This VM may be undercommited on CPU or the host is stressed.
The source type vmware:events drives the data that is displayed in this panel.
Navigation and operation
This documentation applies to the following versions of Splunk® App for VMware: 3.4.1, 3.4.2, 3.4.3, 3.4.4, 3.4.5, 3.4.7