KV store dashboards
This topic is a reference for the KV store: Deployment and KV store: Instance dashboards in the Monitoring Console. See About the Monitoring Console.
What do these dashboards show?
The deployment-wide and instance-scoped KV store dashboards track many of the same statistics collected from KV store.
The KV store: Deployment dashboard in the Monitoring Console provides information aggregated across all KV stores in your Splunk Enterprise deployment. Instances are grouped by values of different metrics. For an instance to be included in this dashboard, it must be assigned the KV store server role. Assign this role on the Monitoring Console Setup page.
The KV store: Instance dashboard in the Monitoring Console shows performance information about a single Splunk Enterprise instance running the app key-value store. If you have configured the Monitoring Console in distributed mode, you can select which instance in your deployment to view.
Collection metrics come from the KVStoreCollectionStats component in the _introspection index, which is a historical record of the data at the /services/server/introspection/kvstore/collectionstats REST endpoint. The metrics are:
- Application. The application the collection belongs to.
- Collection. The name of the collection in KV store.
- Number of objects. The count of data objects stored in the collection.
- Accelerations. The count of accelerations set up on the collection. Note: These are traditional database-style indexes used for performance and search acceleration.
- Accelerations size. The size in MB of the indexes set up on the collection.
- Collection size. The size in MB of all data stored in the collection.
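To look at the current values behind these metrics, you can query the REST endpoint directly. The following is a minimal sketch that only builds the request URL. The endpoint path comes from this topic. The HTTPS scheme, management port 8089, the `output_mode` parameter, and the host name `splunk.example.com` are assumptions to adjust for your deployment, and `collection_stats_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode, urlunsplit

# Endpoint path as given in this topic.
COLLECTION_STATS_PATH = "/services/server/introspection/kvstore/collectionstats"


def collection_stats_url(host, port=8089, output_mode="json"):
    """Build the URL for the KV store collection stats endpoint.

    Assumptions: the instance listens for REST calls over HTTPS on
    management port 8089, and JSON output is requested with output_mode.
    """
    query = urlencode({"output_mode": output_mode})
    return urlunsplit(("https", f"{host}:{port}", COLLECTION_STATS_PATH, query, ""))


# Hypothetical host name, for illustration only.
print(collection_stats_url("splunk.example.com"))
```

The request itself still needs authentication and certificate handling that match your environment, which this sketch leaves out.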
Snapshots are collected through REST endpoints, which deliver the most recent information from the pertinent introspection components. The KV store instance snapshots use the endpoint
- Lock percentage. The percentage of KV store uptime that the system has held either global read or write locks. A high lock percentage affects the whole system. It can starve replication or make application calls slow, time out, or fail.
- Page fault percentage. The percentage of KV store operations that resulted in a page fault. A percentage close to 1 indicates poor system performance and is a leading indicator of continued sluggishness, because KV store is forced to fall back on disk I/O rather than access data efficiently in memory.
- Memory usage. The amount of resident, mapped, and virtual memory in use by KV store. Virtual memory usage is typically twice that of mapped memory for KV store. Virtual memory usage in excess of three times mapped memory might indicate a memory leak.
- Network traffic. Total MB of network traffic in and out of KV store.
- Flush percentage. Percentage of a minute it takes KV store to flush all writes to disk. A value closer to 1 indicates difficulty writing to disk or consistently large write operations. Some operating systems can flush data in less than 60 seconds. In that case, this number can be small even if there is a writing bottleneck.
- Operations. Count of operations issued to KV store. Includes commands, updates, queries, deletes, getmores, and inserts. The introspection process issues a command to deliver KV store stats, so the commands counter is typically higher than the counters for most other operations.
- Current connections. Count of connections open on KV store.
- Total queues. Total operations queued waiting for the lock.
- Total asserts. Total number of asserts raised by KV store. A nonzero number can indicate a need to check KV store logs.
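The memory usage guidance above (virtual memory is typically about twice mapped memory, and more than three times mapped memory might indicate a leak) can be expressed as a simple check. This is an illustrative sketch, not part of the Monitoring Console. The function names and the MB inputs are hypothetical.

```python
def virtual_to_mapped_ratio(mapped_mb, virtual_mb):
    """Return virtual memory as a multiple of mapped memory."""
    if mapped_mb <= 0:
        raise ValueError("mapped_mb must be positive")
    return virtual_mb / mapped_mb


def might_indicate_leak(mapped_mb, virtual_mb, limit=3.0):
    """Return True when virtual memory exceeds `limit` times mapped memory.

    The default limit of 3.0 follows the rule of thumb in this topic.
    """
    return virtual_to_mapped_ratio(mapped_mb, virtual_mb) > limit


# Hypothetical readings in MB: the first is typical, the second is suspicious.
print(might_indicate_leak(1000, 2000))
print(might_indicate_leak(1000, 3500))
```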
Many of the statistics in this section are present in the Snapshots section. The Historical view presents trend information for the metrics across a set span of time. These stats are collected in KVStoreServerStats. By default the Historical panels show information for the past 4 hours. Gaps in the historical graphs typically indicate a point at which KV store or Splunk Enterprise was unreachable.
- Memory usage - see above.
- Replication lag. The amount of time between the last operation recorded in the primary operations log (opLog) and the last operation applied to a secondary node. Replication lag in excess of the primary opLog window could result in data not being properly replicated across all nodes of the replication set. In standalone instances without replication, this panel does not return any results. Note: Replication lag is collected in the KVStoreReplicaSetStats component in the _introspection index.
- Operation count (average by minute) - see above. This panel shows counts for individual operation types (for example, commands, updates, and deletes) or for all operations.
- Asserts - see above. This panel allows for filtering based on type of assert: message, regular, rollovers, user, or warning.
- Lock percentage. Percentage of KV store uptime that the system has held global, read, or write locks. Filter this panel by type of lock held:
  - Read. Lock held for read operations.
  - Write. Lock held for write operations. KV store locking is "writer greedy," so write locks can make up the majority of the total locks on a collection.
  - Global. Lock held by the global system. KV store implements collection-level locks, reducing the need for aggressive use of the global lock.
- Page faults as a percentage of total operations - see above.
- Network traffic - see above. Added to this panel are requests made to the KV store.
- Queues over time. The number of queued operations, broken down by:
  - Read. Count of read operations waiting for a read lock to open.
  - Write. Count of write operations waiting for a write lock to open.
- Connections over time.
- Percent of each minute spent flushing to disk - see above.
- Slowest operations. The ten slowest operations logged by KV store in the selected time frame. If profiling is off for all collections, this panel might return no results even if you have very slow operations running. Enable profiling on a per-collection basis in collections.conf.
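To examine the raw events behind the Historical panels, you can search the _introspection index for the relevant component. The following sketch only assembles a search string. It assumes that the events expose a `component` field and that `earliest=-4h` matches the default four-hour window described above. `historical_search` is a hypothetical helper name.

```python
def historical_search(component, hours=4):
    """Return an SPL search string for one KV store introspection component.

    Assumptions: events in _introspection carry a `component` field, and a
    relative `earliest` time modifier is an acceptable way to set the window.
    """
    if hours <= 0:
        raise ValueError("hours must be positive")
    return f"search index=_introspection component={component} earliest=-{hours}h"


# Server-wide statistics for the default Historical time range.
print(historical_search("KVStoreServerStats"))
```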
Deployment Snapshot Statistics access the /services/server/introspection/kvstore/serverstatus REST endpoint.
Where do these dashboards get their data from?
KV store collects data in the _introspection index.
These statistics are broken into the following components:
- KVStoreServerStats. Information about how the KV store process is performing as a whole. Polled every 27 seconds.
- KVStoreCollectionStats. Information about collections within the KV store. Polled every 10 minutes.
- KVStoreReplicaSetStats. Information about replication data across KV store instances. Polled every 60 seconds.
- KVProfilingStats. Information about slow operations. Polled every 5 seconds. Only available when profiling is enabled. Note: Enable profiling only on development systems or for troubleshooting issues with KV store performance beyond what is available in the default panels. Profiling can negatively affect system performance, so do not enable it in production environments.
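The polling intervals above determine roughly how many events each component should contribute to a given time range, which can help when judging whether a Historical panel has gaps. This sketch is illustrative arithmetic only. It assumes polling runs continuously for the whole window, and `expected_samples` is a hypothetical helper name.

```python
# Polling intervals in seconds, taken from the list above.
POLL_INTERVAL_SECONDS = {
    "KVStoreServerStats": 27,
    "KVStoreCollectionStats": 10 * 60,
    "KVStoreReplicaSetStats": 60,
    "KVProfilingStats": 5,
}

# Default Historical panel range: the past 4 hours.
DEFAULT_WINDOW_SECONDS = 4 * 60 * 60


def expected_samples(component, window_seconds=DEFAULT_WINDOW_SECONDS):
    """Return the number of complete polling cycles that fit in the window."""
    return window_seconds // POLL_INTERVAL_SECONDS[component]


for name in POLL_INTERVAL_SECONDS:
    print(name, expected_samples(name))
```

A count that falls well below these figures for a component suggests a period when KV store or Splunk Enterprise was unreachable.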
In addition, KV store produces entries in a number of internal logs collected by Splunk Enterprise.
Interpret these dashboards
|Page faults per operation||1.3+ Reads require heavy disk I/O, which could indicate a need for more RAM.||Reads regularly require disk I/O.||Reads rarely require disk I/O.||Measures how often read requests are not satisfied by what Splunk Enterprise has in memory, requiring Splunk Enterprise to contact the disk. Windows counts soft page faults, so Windows machines exhibit more page faults. On Windows, use lock percentage and queues instead.|
|Lock percentage||50%+||30%–50%||0–30%||High lock percentage can starve replication and/or cause application calls to be slow, time out, or fail. High lock percentage typically means that heavy write activity is occurring on the node.|
|Network traffic||N/A||N/A||N/A||Network traffic should be commensurate with system use and application expectations. No default thresholds apply.|
|Replication latency||>30 seconds||10–30 seconds||0–10 seconds||Replication needs are system dependent. Generally, replica set members should not fall significantly behind the KV store captain. Replication latency over 30 seconds can indicate a mounting replication problem.|
|Primary operations log window||N/A||N/A||N/A||Provided for reference. This is the amount of data, in terms of time, a system saved in the operations log for restoration.|
|Flushing rate||50%–100%||10%–50%||0–10%||A high flush rate indicates heavy write operations or sluggish system performance.|
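The numeric thresholds in the table can be applied programmatically, for example when post-processing exported panel values. The following is a minimal sketch under stated assumptions. The severity labels `critical`, `warning`, and `normal` are assumed names for the three threshold columns, ordered from most to least severe. A value on a shared boundary is placed in the more severe band, except for replication latency, where the table requires a value strictly greater than 30 seconds. The panel keys and `classify` are hypothetical names.

```python
# (warning_floor, critical_floor, critical_floor_is_inclusive) per panel,
# read from the table above. Percentages are 0-100; latency is in seconds.
THRESHOLDS = {
    "lock_percentage": (30.0, 50.0, True),
    "replication_latency_seconds": (10.0, 30.0, False),
    "flushing_rate_percentage": (10.0, 50.0, True),
}


def classify(panel, value):
    """Return 'critical', 'warning', or 'normal' for a panel reading."""
    if value < 0:
        raise ValueError("value must not be negative")
    warning_floor, critical_floor, inclusive = THRESHOLDS[panel]
    if value > critical_floor or (inclusive and value == critical_floor):
        return "critical"
    if value >= warning_floor:
        return "warning"
    return "normal"


# Hypothetical readings, for illustration only.
print(classify("lock_percentage", 42.0))
print(classify("replication_latency_seconds", 30.0))
print(classify("flushing_rate_percentage", 5.0))
```

Panels without default thresholds, such as network traffic and the primary operations log window, are deliberately left out of the mapping.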
Troubleshoot these dashboards
The historical panels get data from the _introspection and _internal indexes. Gaps in time in these panels indicate a time when KV store or Splunk Enterprise was unreachable. If a panel is completely blank or missing data from specific Splunk Enterprise instances, check:
This documentation applies to the following versions of Splunk® Enterprise: 6.5.0, 6.5.1, 6.5.1612 (Splunk Cloud only), 6.5.2, 6.5.3, 6.5.4, 6.5.5, 6.5.6, 6.5.7, 6.5.8, 6.5.9, 6.5.10, 6.6.0, 6.6.1, 6.6.2, 6.6.3, 6.6.4, 6.6.5, 6.6.6, 6.6.7, 6.6.8, 6.6.9, 6.6.10, 6.6.11, 6.6.12, 7.0.0, 7.0.1, 7.0.2, 7.0.3, 7.0.4, 7.0.5, 7.0.6, 7.0.7, 7.0.8, 7.1.0, 7.1.1, 7.1.2, 7.1.3, 7.1.4, 7.1.5, 7.1.6, 7.2.0, 7.2.1, 7.2.2, 7.2.3, 7.2.4