Splunk® App for NetApp Data ONTAP (Legacy)

Deploy and Use the Splunk App for NetApp Data ONTAP

On June 10, 2021, the Splunk App for NetApp Data ONTAP will reach its end of life and Splunk will no longer maintain or develop this product.

Reports

About reports

This topic describes each of the reports provided in this app. Searches saved as reports are listed on the Reports page. When you run a search you can save it as a report, an alert, a dashboard, or an event type. In each case, the format of the saved results determines where you can find the search in Splunk Web.

To get a list of the reports, click Reports in the app menu. You can use the default reports or you can modify the reports to generate specific results for your environment.

Connection problems in the last hour

Description

Use this search to gain insight into connection problems between this app and your NetApp filers. Connection issues can prevent data from coming into the app. When the Splunk App for NetApp Data ONTAP attempts to collect data from an ONTAP filer and experiences a connection problem, the search returns all of the events that result in a timeout or an unsuccessful login.

Search

index=_internal source=*hydra* OR source=*splunk_ta_ontap_api* ("*[Errno 8]*" OR "timed out" OR "Could not login")
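To see which kind of connection problem dominates, you can label each matching event before counting. The following is a minimal sketch and not one of the reports shipped with the app: the `problem` field and its labels are made up for illustration, and `earliest=-1h` is added to limit an ad hoc run to the last hour.

```spl
index=_internal (source=*hydra* OR source=*splunk_ta_ontap_api*) ("*[Errno 8]*" OR "timed out" OR "Could not login") earliest=-1h
| eval problem=case(like(_raw, "%timed out%"), "timeout", like(_raw, "%Could not login%"), "login failure", true(), "other")
| stats count by problem
```

Note that `like` is case-sensitive while search terms are not, so an event that matched the base search with different capitalization falls into the "other" bucket.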

Unhealthy cluster nodes in the last hour

Description

This search queries the ONTAP data for an event containing the string "Node is not healthy". The search returns the name of the "unhealthy" node and a timestamp for when the message was sent. Healthy nodes in a cluster can communicate with each other. When nodes are unhealthy, the cluster loses the ability to successfully and reliably perform cluster operations.

Search

index=_internal (source="*hydra*" OR source="*splunk_ta_ontap_api*") "Node is not healthy" node=* | table _time,node
dispatch.earliest_time = -1h
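If a node reports the message repeatedly, a summary per node can be easier to read than a raw table. This is a sketch under the assumption that the `node` field is extracted as in the search above; `messages` and `last_seen` are illustrative field names.

```spl
index=_internal (source="*hydra*" OR source="*splunk_ta_ontap_api*") "Node is not healthy" node=* earliest=-1h
| stats count AS messages, latest(_time) AS last_seen by node
| convert ctime(last_seen)
| sort - messages
```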

Missing filer capability collection errors in the past hour

Description

The search returns a count of the API permissions errors. It queries all events containing errors that relate to having an incorrect set of capabilities to invoke the NetApp API. "Missing filer capability" is a specific type of collection error that indicates that a permissions error prevents the collection of data from the filers.

Search

index=_internal source=*hydra* "does not have capability" ERROR
dispatch.earliest_time = -1h
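The `dispatch.earliest_time = -1h` setting shown with the search is the report's saved time range, not part of the search string. When you run the search ad hoc, one option is to use the `earliest` time modifier instead. The sketch below also charts the errors in 10-minute buckets; the span is an arbitrary choice.

```spl
index=_internal source=*hydra* "does not have capability" ERROR earliest=-1h
| timechart span=10m count
```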

Volume Capacity Delta Table

Description

Use this search to be proactive about storage changes in your volumes. Volume events provide information about the status of your volumes so that you can monitor for potential storage problems. This search compares the storage used on each volume at two different points in time over the last 24 hours and shows the change in capacity. Prior storage for a particular volume is taken from the event recorded earlier in time. Posterior storage for the volume is taken from the event recorded later in time. The difference shows the growth between the two points, both as a percent and as an amount of storage. For example, if your storage capacity is growing at 6% per day, then you can estimate how long it will take before you use up all available capacity. The legend on the chart indicates the data points that are computed in GB. A percent format is also provided.

Search

sourcetype=ontap:volume storage_used=* | eval name=if(isnull(name),$volume-id-attributes.name$,name) | table _time,host, storage_used,storage_used_percent,name | stats first(storage_used) as posterior_storage_used last(storage_used) as prior_storage_used first(storage_used_percent) as posterior_storage_used_percent last(storage_used_percent) as prior_storage_used_percent last(_time) as prior_time first(_time) as _time by host,name | eval percent_change=posterior_storage_used_percent-prior_storage_used_percent | eval capacity_change=posterior_storage_used-prior_storage_used | convert ctime(prior_time) | table prior_time,_time,host,name,prior_storage_used,posterior_storage_used,prior_storage_used_percent,posterior_storage_used_percent,percent_change,capacity_change
dispatch.earliest_time = -24h
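The time-to-full estimate mentioned in the description can be sketched as one more `eval` appended to the end of the search. This assumes that `storage_used_percent` is on a 0 to 100 scale and that `percent_change` (the percentage points gained between the prior and posterior events) roughly covers one day; `days_until_full` is a hypothetical field name, not a field the app provides.

```spl
| eval days_until_full=if(percent_change > 0, round((100 - posterior_storage_used_percent) / percent_change, 1), null())
```

For example, a volume that is 70% full and gained 6 percentage points would show (100 - 70) / 6 = 5.0. Volumes that did not grow get no value. If the prior and posterior events are much less than 24 hours apart, the result is in units of that shorter interval rather than days.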

Total events in the past hour

Description

This search provides a total count of the syslog and Event Management System (EMS) events processed in the last hour. You can look at system logs to proactively monitor your environment for configuration or system changes. A dramatic increase in the number of syslog events coming in from a filer can indicate a problem. Start here to check whether there is a problem with the syslog data coming in.

Search

sourcetype="ontap:syslog" OR sourcetype="ontap:ems" | stats count
dispatch.earliest_time = -1h

Total error events in the past hour

Description

This search returns a total count of the syslog and Event Management System (EMS) error events processed in the last hour. You can look at system logs to proactively monitor your environment for configuration or system changes. The search queries the ONTAP syslog and EMS data for the string "error". Monitor this count for spikes. A spike can point to error events or to filer states that are prone to errors. You can then drill down to find out more details about the problem.

Search

sourcetype="ontap:syslog" OR sourcetype="ontap:ems" error | stats count
dispatch.earliest_time = -1h
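Because the search covers two source types, it can help to see how many of the error events come from each. A minimal variation, with `earliest=-1h` standing in for the saved time range:

```spl
(sourcetype="ontap:syslog" OR sourcetype="ontap:ems") error earliest=-1h
| stats count by sourcetype
```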

Total events by filer in the past hour

Description

This search returns a total count of the number of events, broken down by filer, processed in the last hour. You can see if one filer in particular is causing problems. You can use this search to proactively monitor your environment for configuration or system changes. You can see trends or investigate spikes in the data coming in.

Search

sourcetype="ontap:syslog" OR sourcetype="ontap:ems" | stats count by host
dispatch.earliest_time = -1h
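To judge whether one filer stands out, you can add each filer's share of the total. This is a sketch, not a shipped report; `total` and `percent_of_total` are illustrative field names.

```spl
(sourcetype="ontap:syslog" OR sourcetype="ontap:ems") earliest=-1h
| stats count by host
| eventstats sum(count) AS total
| eval percent_of_total=round(100 * count / total, 1)
| sort - count
```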

Total alert and critical events in the past hour

Description

The search queries the ONTAP syslog and EMS data for the strings "alert" or "critical". It reports the number of alert or critical events that occurred in the last hour for each host.

Search

sourcetype="ontap:syslog" OR sourcetype="ontap:ems" (alert OR critical) | stats count by host 
dispatch.earliest_time = -1h
 

Count of total disk and controller events by filer in the past hour

Description

This search returns a total count of disk and controller events. Use this information to determine the health of your system. Drill down on the chart or result table to get more detailed information. Spikes in the number of controller or disk events reported can indicate problems that need to be investigated.

Search

sourcetype="ontap:syslog" OR sourcetype="ontap:ems" (controller OR disk) | stats count by host
dispatch.earliest_time = -1h

Count of disk events over time by filer

Description

Use this search to proactively monitor your environment. This search returns a count of the disk events on each filer, processed in the last hour. The search queries the syslog data for events containing the string "disk". An increase in the number of disk events coming from a filer can indicate a problem. As an admin, you have an established baseline for normal behavior in your environment. Compare the numbers reported against the baseline activity for the filer to identify a potential problem. Look at the chart for spikes in events that show disk activity outside the normal range for your environment. You can look at the trend in your data and be proactive in managing your environment. Click the chart or the table to drill down to the events.

Search

sourcetype="ontap:syslog" OR sourcetype="ontap:ems" disk | timechart count by host
dispatch.earliest_time = -1h
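If you do not have a baseline written down, one rough way to derive it is from the data itself. The sketch below assumes seven days of history are indexed and flags hours in which a filer's disk event count is more than two standard deviations above that filer's own hourly average. The seven-day window, the one-hour span, the factor of two, and the `baseline` and `spread` field names are all assumptions to adjust for your environment.

```spl
(sourcetype="ontap:syslog" OR sourcetype="ontap:ems") disk earliest=-7d
| bin _time span=1h
| stats count by _time, host
| eventstats avg(count) AS baseline, stdev(count) AS spread by host
| where count > baseline + 2 * spread
```

Hours with no disk events produce no rows, so the computed average leans high for filers that are usually quiet.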

Count of error events over time by filer

Description

This search returns a count of the number of error events, broken down by filer, processed in the last hour. Use this search to proactively monitor your environment. The search queries the syslog data for a string containing "error". Compare the numbers reported by the search against the baseline activity for the filer. Look at the chart to see spikes in error events. Drill down on the chart or values in the table to get to individual error events. Examine the error events for severity and the impact the problem has on your system. Look at the chart to see trends in your data and be proactive in managing your environment.

Search

sourcetype="ontap:syslog" OR sourcetype="ontap:ems" error | timechart count by host
dispatch.earliest_time = -1h
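When many filers report errors, the default chart can be hard to read. As a sketch, the `timechart` options below set 5-minute buckets, keep only the five series with the most events, and drop the combined OTHER series. The specific values are arbitrary.

```spl
(sourcetype="ontap:syslog" OR sourcetype="ontap:ems") error earliest=-1h
| timechart span=5m limit=5 useother=f count by host
```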

Count of disk error events over time by filer

Description

Use this search to proactively monitor your environment. This search returns a count of the disk error events (such as disk failures or problems with disk assignment) on each filer in the last hour. The search queries the syslog data for events containing both "disk" and "error". As an admin, you have an established baseline for normal behavior in your environment. Compare the numbers reported against the baseline activity for the filer. Look at the chart to see spikes in error events. Drill down on the chart or values in the table to get to individual events. Examine the error events for severity and the impact the problem has on your system. Look at the chart to see trends in your data.

Search

sourcetype="ontap:syslog" OR sourcetype="ontap:ems" error disk | timechart count by host
dispatch.earliest_time = -1h

Count of alert and critical events over time by filer

Description

Use this search to proactively monitor your environment. This search returns a count of the number of ONTAP syslog events that are of an alert or critical status, that happened in the last hour, on a particular filer. Drill down on the chart or values in the table to examine individual events. Looking at the chart you can see trends in your data and be proactive in managing your environment.

Search

sourcetype="ontap:syslog" OR sourcetype="ontap:ems" (alert OR critical) | timechart count by host
dispatch.earliest_time = -1h
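To chart alert and critical events as separate series instead of per filer, you can label each event first. This is a sketch: `level` is an illustrative field name, and an event that contains both terms is counted as critical.

```spl
(sourcetype="ontap:syslog" OR sourcetype="ontap:ems") (alert OR critical) earliest=-1h
| eval level=if(searchmatch("critical"), "critical", "alert")
| timechart count by level
```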

Count of read error events on disks by filer

Description

Use this search to proactively monitor your environment. This search returns a count of the number of disk read error events on a particular filer, in the last hour. The search queries the ONTAP syslog data for strings containing "disk", "read", and "error". Look at the chart to see trends in your data and to investigate spikes in read error events, indicating problem areas. Drill down on the chart or values in the table to get to individual events. Examine the error events for severity and the impact on your system.

Search

sourcetype="ontap:syslog" OR sourcetype="ontap:ems" disk read error | timechart count by host
dispatch.earliest_time = -1h

Count of aggregate events over time by filer

Description

Use this search to proactively monitor your environment for potential problems. The search returns a count of the events that contain a term beginning with "aggregat" (for example, "aggregate"). Aggregate events provide status information about the aggregates. Drill down on the chart or the results table to get more detailed information about the event, including host, source type, and severity details.

Search

sourcetype="ontap:syslog" OR sourcetype="ontap:ems" aggregat* | timechart count by host
dispatch.earliest_time = -1h

Count of volume events over time by filer

Description

Use this search to proactively monitor your environment for potential problems. The search returns a count of the events that contain a term beginning with "volume". Volume events provide status information about the volumes. Drill down on the chart or the results table to get more detailed information about the event, including host, source type, and severity details. See the NetApp documentation for a list of volume events.

Search

sourcetype="ontap:syslog" OR sourcetype="ontap:ems" volume* | timechart count by host
dispatch.earliest_time = -1h

Count of snapshot events on aggregates over time by filer

Description

Use this search to proactively monitor your environment for potential problems. The search returns a count of the snapshot events found on aggregates over the time range specified. Drill down on the chart or the results table to get more detailed information about the event, including host, source type, and severity details. Snapshots require storage space on volumes. You can plan to allocate space for the snapshots by looking at the chart over time to see the trend.

Search

sourcetype="ontap:syslog" OR sourcetype="ontap:ems" snapshot* aggregat* | timechart count by host
dispatch.earliest_time = -1h

Count of error snapshot events over time by filer

Description

Use this search to proactively monitor your environment for potential problems. The search queries the data for "snapshot" and "error" and returns a count of the snapshot error events found per filer. Drill down on the chart or the results table to get more detailed information about the event, including host, source type, and severity details.

Search

sourcetype="ontap:syslog" OR sourcetype="ontap:ems" snapshot error | timechart count by host
dispatch.earliest_time = -1h

Count of SnapMirror error events over time by filer

Description

Use this search to proactively monitor your environment for potential problems. The search returns a count of the SnapMirror error events found per filer over the time range specified. Drill down on the chart or the results table to get more detailed information about the event, including host, source type, and severity details.

Search

sourcetype="ontap:syslog" OR sourcetype="ontap:ems" snapmirror error | timechart count by host
dispatch.earliest_time = -1h

Count of Monitoring and Host Configuration events over time by filer

Description

Use this search to proactively monitor your environment. This search returns a count of the monitoring and host configuration events that appear in your syslog data, per filer, over a one-hour timeframe. Compare these events against the expected normal behavior for your environment. An increase in configuration events can indicate that something unexpected is happening to some element of your environment that you need to investigate. Drill down on the chart or the results table to investigate individual events, respond to configuration issues early, and prevent problems that can lead to degraded performance and system unavailability.

Search

sourcetype="ontap:syslog" OR sourcetype="ontap:ems" (monitor* OR config*) | timechart count by host
dispatch.earliest_time = -1h

Count of Backup and Restore events over time by filer

Description

Use this search to proactively monitor your environment. This search returns a count of the backup and restore events that appear in your syslog data, per filer, over a one hour timeframe. Look for spikes in the data and monitor the trend over time. An increase in events can indicate a problem in your environment that you need to investigate. Drill down on the chart or the results table to investigate individual events and proactively respond to the issue.

Search

sourcetype="ontap:syslog" OR sourcetype="ontap:ems" (backup OR restor*) | timechart count by host
dispatch.earliest_time = -1h

Count of Optimization and Migration events over time by filer

Description

Use this search to proactively monitor your environment. This search returns a count of the optimization and migration events that appear in your syslog data, per filer, over a one hour timeframe. Look for spikes in the data and monitor the trend over time. Drill down on the chart or the results table to investigate individual events and proactively respond to the issue.

Search

sourcetype="ontap:syslog" OR sourcetype="ontap:ems" (optimiz* OR migrat*) | timechart count by host
dispatch.earliest_time = -1h

Count of Provisioning and Cloning events over time by filer

Description

Use this search to proactively monitor your environment. This search returns a count of the provisioning and cloning events that appear in your syslog data, per filer, over a one hour timeframe. Look for spikes in the data and monitor the trend over time. Drill down on the chart or the results table to investigate individual events and proactively respond to the issue.

Search

sourcetype="ontap:syslog" OR sourcetype="ontap:ems" (provision* OR clon*) | timechart count by host
dispatch.earliest_time = -1h

NFS Volumes used by VMware

Description

This search correlates NetApp ONTAP data with VMware data. You must have the Splunk App for VMware installed. Run the search to get a table displaying all of the NFS volumes used in your VMware environment. See the topic Correlate NetApp and VMware data for more information about data correlation in this app.

Search

sourcetype="ontap:volume" (source=volume-get-iter OR source=volume-list-info-iter-start) | rename  volume-id-attributes.name as name | stats values(name) as volname by host | lookup dnslookup clienthost AS host OUTPUT clientip AS ip | mvexpand volname | table * | join type=inner ip, volname [search `VmwareNFSMounts`] | rename name as "Datastore name", path as "Path", volname as Volume, filer as "Filer (VMware data)", host as Filer, ip as IP, vcenter as VCenter

Aggregates with over 90% capacity used

Description

This report shows all of the Aggregates that have over 90% capacity used, in the last 24 hours.

Search

sourcetype="ontap:aggr" (source="aggr-list-info" OR source="aggr-get-iter") | `CoalesceAggrFields` | search size-percentage-used > 90 | dedup name, host |   eval "gb-total"=`BytesToGigaBytes(sz_total)` | eval "gb-free"=`BytesToGigaBytes(sz_free)` | table name, host, volume-count, size-percentage-used, "gb-total", "gb-free"

Disk block transfer rates by Filer and RPM

Description

This report shows the block transfer rates and RPM for all disks associated with a filer.

Search

index=ontap source="diskperfhandler" objname=* | stats avg(total_transfers_rate), avg(user_read_blocks_rate), avg(user_write_blocks_rate) by host, display_name, disk_speed | rename disk_speed AS rpm, display_name AS disk_name

Failed Disks

Description

This report shows the disks whose raid-state is "broken".

Search

sourcetype="ontap:disk" raid-state="broken" | rename physical-space as pspace | eval phys-space-gb=`BytesToGigaBytes(pspace)` | table host, serial-number, name, raid-state, raid-type, disk-type, firmware-revision, rpm, phys-space-gb, aggregate, shelf, bay, pool

Top 10 Busiest Filers - 7 mode and Cluster mode

Description

This report shows the ten filers with the highest total_ops_rate.

Search

sourcetype=ontap:perf source="SystemPerfHandler" | stats first(total_ops_rate) AS total_ops_rate, first(read_ops_rate) AS read_ops_rate, first(write_ops_rate) AS write_ops_rate, first(cpu_busy_percent) AS cpu_busy_percent by host | sort - total_ops_rate | head 10
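The report ranks filers by `first(total_ops_rate)`, the first value the search encounters for each host, which with the default newest-first event order is typically the most recent sample. If you would rather rank by sustained load over the selected time range, a variation along these lines may help. It is a sketch, and `avg_total_ops_rate` and `peak_total_ops_rate` are illustrative field names.

```spl
sourcetype=ontap:perf source="SystemPerfHandler"
| stats avg(total_ops_rate) AS avg_total_ops_rate, max(total_ops_rate) AS peak_total_ops_rate by host
| sort - avg_total_ops_rate
| head 10
```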

Unhealthy cluster nodes in the past hour

Description

This report shows any node that returns a message of "Node is not healthy" in the specified time period.

Search

index=_internal (source="*hydra*" OR source="*splunk_ta_ontap_api*") "Node is not healthy" node=* | table _time,node

Volumes with latency higher than 25ms over 5% of the time

Description

This report shows all volumes whose latency is higher than 25 ms more than 5% of the time.

Search

sourcetype=ontap:perf source=VolumePerfHandler objname="*" |  eval ismatch=if(latency>25, 1, 0) | stats count, sum(ismatch) AS matchCount, max(latency) AS max_latency, avg(latency) AS avg_latency by host, objname | eval percentage=round(100*matchCount/count,0) | search percentage > 5 | fields - count, matchCount | rename objname AS volume
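To see how the percentage is computed without touching real data, the calculation can be tried on generated events. The sketch below assumes your Splunk version provides the `makeresults` command. The host name, volume name, and latency values are invented: 20 samples, of which the first 3 have a latency of 40 and the rest 10.

```spl
| makeresults count=20
| streamstats count AS n
| eval host="filer01", objname="vol_demo", latency=if(n <= 3, 40, 10)
| eval ismatch=if(latency>25, 1, 0)
| stats count, sum(ismatch) AS matchCount, max(latency) AS max_latency, avg(latency) AS avg_latency by host, objname
| eval percentage=round(100*matchCount/count,0)
| search percentage > 5
```

Here 3 of the 20 samples exceed 25, so `percentage` is 15 and the volume stays in the results, with `max_latency` of 40 and `avg_latency` of 14.5. Because `percentage` is rounded to a whole number before the comparison, a volume needs a rounded value of 6 or more to pass.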

Volumes with over 75% capacity used

Description

This report shows all Volumes that have over 75% capacity used, in the last 24 hours. The table displays the name, containing aggregate, percent used, GB-total, GB-used, and Snapshot-percent-reserved.

Search

sourcetype="ontap:volume" (source=volume-get-iter) OR (source=volume-list-info-iter-start) | `CoalesceVolumeFields` | search percentage-used >= 75 | dedup name | eval "gb-total"=`BytesToGigaBytes(sz_total)` | eval "gb-used"=`BytesToGigaBytes(sz_used)` | table host, name, containing-aggregate, percentage-used, "gb-total", "gb-used", snapshot-percent-reserved | sort - percentage-used
Last modified on 03 April, 2017

This documentation applies to the following versions of Splunk® App for NetApp Data ONTAP (Legacy): 2.1.6, 2.1.7, 2.1.8, 2.1.91

