How Splunk monitors Splunk Cloud Platform

The Splunk Cloud Service Level Schedule describes Splunk's service level commitment for Splunk Cloud Platform. This topic describes some of the monitoring efforts that Splunk performs in support of that service level commitment. Splunk monitors the service with the following goals:

  • Detect issues
  • Restore service as quickly as possible
  • Keep customers and their stakeholders informed about outages

The Splunk Network Operations Center (NOC) monitors Splunk Cloud Platform worldwide, 24x7. During U.S. business hours, specialized teams work to identify the causes of novel issues and resolve them.

Splunk Network Operations Center

The NOC takes action in response to automated alerts. For consistency and repeatability, the NOC uses runbooks to respond to alerts and files proactive incidents when novel issues occur.

The NOC manages more than 100 priority-one automated alerts. These alerts monitor the following components in particular. For more detail, see the Splunk Cloud Platform Monitoring Matrix.

  • Disk usage
  • Indexers, search heads, cluster manager, KV store, or Inputs Data Managers (IDMs) down
  • User Interface unresponsive
  • Search head synchronization issues

Specialized Teams

Specialized teams monitor critical product functionality and resolve issues during U.S. business hours. These teams investigate issues to determine their root causes, remediate them, and feed their findings back into the development process to improve code resilience.

Specialized teams work on critical functions such as the following:

  • Search
  • Ingest
  • Login

See Splunk Cloud Platform Service Details for more information.

Splunk Cloud Platform Monitoring Matrix

The following table lists the Splunk Cloud Platform features that Splunk monitors and the Splunk response when issues are detected. This is a representative list of monitored features. It is not exhaustive and is subject to change without notice. This document does not describe Splunk Cloud Platform for FedRAMP Moderate, Splunk Cloud Platform for FedRAMP High, or Splunk Cloud Platform for DoD IL5.

| Feature | Issue | Support | Splunk Action |
|---|---|---|---|
| Federated Search | Federated search issues | U.S. business hours | Investigate cause. |
| Inputs Data Manager (IDM) (Note 1) | IDM requires upsizing | 24x7 | Schedule maintenance window to upsize IDM. |
| Indexing | Indexer down | 24x7 | Check bundle push and detention status. Verify available disk space and potentially restart service. Create proactive incident if required. |
| | Indexing latency >5 minutes (Note 2) | U.S. business hours | Investigate cause. |
| | Indexing queues blocked | U.S. business hours | Investigate cause. |
| Infrastructure | Disk space full | 24x7 | Rotate logs to clear old backups or expand disk space (Note 3). Create proactive incident if required. |
| Ingestion | Splunk-to-Splunk (S2S) ingestion port down | 24x7 | Check bundle push and detention status. Verify available disk space and potentially restart service. Create proactive incident if required. |
| Ingestion | HTTP Event Collector | U.S. business hours | Investigate cause. |
| Ingestion | S2S connection acceptance | U.S. business hours | Investigate cause. |
| KV store | KV store down | 24x7 | Check data store health, certificates, disk space, and potentially restart service. Create proactive incident if required. |
| Login | Splunk native authentication | U.S. business hours | Investigate cause. |
| | Identity provider authentication | U.S. business hours | Investigate cause. |
| Search | Search Head Cluster (SHC) out of sync | 24x7 | Check knowledge object replication and potentially re-sync cluster members. Create proactive incident if required. |
| | Search peer isolated | 24x7 | Check for unavailable or stuck search peers, potentially restart service or remove unresponsive peer. Create proactive incident if required. |
| | Search initiation | 24x7 | Check health of the indexer running searches, bundle synchronization, down peers, and cluster manager health, and potentially restart services. Create proactive incident if required. |
| | Search execution | U.S. business hours | Investigate cause. |
| | Search performance | U.S. business hours | Note search performance reductions relative to customer's historical performance. Investigate cause. |
| | Skipped search percentage | U.S. business hours | Investigate cause. |
| | API unavailable | 24x7 | Check for system overload and disk space issues, and potentially restart the service. Create proactive incident if required. |
| | Splunk Web user interface unavailable (Search Head or Enterprise Security Search Head) | 24x7 | Check certificates and potentially restart processes or instances. Create proactive incident if required. |

Notes

  1. IDM applies to Classic Experience only.
  2. Indexing latency applies to Victoria Experience only.
  3. Disk expansion limited by entitlement.
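
The "Indexing latency >5 minutes" row in the matrix names a concrete threshold. As a rough illustration of what such a check measures, the following Python sketch assumes that latency is the gap between an event's own timestamp (the `_time` field) and the time it was indexed (the `_indextime` field). The `Event` class, the function names, and the sample values are hypothetical, and using the worst-case latency in a batch is an assumption. This topic does not describe how Splunk aggregates latency or implements its alert.

```python
from dataclasses import dataclass
from typing import Sequence

# Threshold taken from the matrix row "Indexing latency >5 minutes".
LATENCY_THRESHOLD_SECONDS = 5 * 60


@dataclass
class Event:
    """Hypothetical event record with epoch-second timestamps."""
    event_time: float  # when the event occurred (corresponds to _time)
    index_time: float  # when the event was indexed (corresponds to _indextime)


def indexing_latency(event: Event) -> float:
    """Return the seconds elapsed between event occurrence and indexing."""
    return event.index_time - event.event_time


def exceeds_threshold(
    events: Sequence[Event],
    threshold: float = LATENCY_THRESHOLD_SECONDS,
) -> bool:
    """Return True if the worst-case latency in the batch is above the threshold.

    Assumption: the check uses the maximum latency; an average or a
    percentile would be an equally plausible choice.
    """
    if not events:
        return False
    return max(indexing_latency(e) for e in events) > threshold


# Hypothetical sample: one event indexed after 30 seconds, one after 7 minutes.
sample = [
    Event(event_time=1_700_000_000, index_time=1_700_000_030),
    Event(event_time=1_700_000_000, index_time=1_700_000_420),
]
print(exceeds_threshold(sample))
```

A customer who wants a comparable view of their own deployment could apply the same subtraction to the `_indextime` and `_time` fields of recent events. Per Note 2, the latency row applies to Victoria Experience only.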
Last modified on 19 April, 2024

This documentation applies to the following versions of Splunk Cloud Platform: 9.0.2303, 9.0.2305, 9.1.2308, 9.1.2312, 9.2.2403, 9.2.2406 (latest FedRAMP release)

