Best practices for configuring your Splunk Cloud Platform environment for disaster recovery
Cross-region disaster recovery is in the Early Access release phase. In the Early Access release phase, Splunk products might have limitations on customer access, features, maturity, and regional availability. Additionally, its documentation might receive frequent updates, or be incomplete or incorrect. For additional information on Early Access, contact your Splunk representative.
You and Splunk share responsibility for disaster recovery. Splunk does not perform any kind of disaster recovery of components that exist outside of Splunk Cloud Platform. Ensuring the continuity of external Splunk components is your responsibility. You must evaluate the disaster recovery of your data and event forwarding, network egress, and firewall infrastructure that resides outside of Splunk Cloud Platform. For example, if a universal forwarder runs in a cloud service provider (CSP) region, you must properly implement its failover to achieve end-to-end system resiliency.
To achieve minimal disruption in the event of a disaster, follow these best practices when configuring and managing your cross-region disaster recovery (CRDR)-enabled Splunk Cloud Platform (SCP) deployment.
Refresh Splunk Cloud Platform IP address cache after a failover
When Splunk declares a qualified regional disaster and begins failing over your Splunk Cloud Platform deployment to a secondary cloud service provider region, it updates the DNS for your deployment to point to the secondary site. You might have cached the original IP address of your deployment in your browser, an application, or a network component. To ensure access to your SCP environment after it has failed over, refresh this cache as quickly as possible so that data ingestion and searches route to the new set of IP addresses.
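One way to notice that a cached address has gone stale is to compare it with what DNS currently returns. The following is a minimal sketch, not a Splunk-provided tool: the function names are hypothetical, and the hostname in the comment is a placeholder for your own deployment's hostname.

```python
import socket


def resolve_ips(hostname, port=443):
    """Return the set of IP addresses that DNS currently returns for hostname."""
    infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address for both IPv4 and IPv6.
    return {info[4][0] for info in infos}


def addresses_changed(cached_ips, current_ips):
    """Return True when the cached address set differs from the current one."""
    return set(cached_ips) != set(current_ips)


# Example usage (placeholder hostname, replace with your deployment's):
#   cached = {"203.0.113.10"}  # address recorded before the failover
#   current = resolve_ips("your-stack.example.com")
#   if addresses_changed(cached, current):
#       print("DNS now points elsewhere; refresh cached addresses")
```

A result from `resolve_ips` can itself come from a local resolver cache, so a check like this only helps once the operating system or upstream resolver has picked up the updated record.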
Use indexer acknowledgment on forwarders
Indexer acknowledgment is active on the ingestion path for Splunk Cloud Platform instances with the CRDR service active. Where applicable, use indexer acknowledgment on forwarders to buffer incoming data at the forwarders. This acknowledgment ensures that the forwarding tier saves the data if Splunk Cloud Platform cannot accept it due to failure before the ingestion is redirected to the secondary site. If possible, do not use intermediate universal forwarders (IUF) if you want to buffer data on forwarders, as IUFs are not good candidates for indexer acknowledgment.
Buffer incoming data during disaster recovery operations
During a failover, there is a period of time when Splunk Cloud Platform cannot ingest data. During that time, Splunk Cloud Platform does not perform indexer acknowledgment of the event data. You must configure your data collection and forwarding tiers to buffer this data. Allow for up to 4 hours of storage buffering at the data collection tier to ensure that you don't lose data.
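The 4-hour guidance can be turned into a rough storage figure for a forwarder or collection host. The sketch below assumes a steady average ingest rate, which you must measure yourself; the function name and the `headroom` safety factor are illustrative assumptions, not Splunk settings.

```python
import math


def required_buffer_bytes(ingest_bytes_per_second, outage_hours=4.0, headroom=1.0):
    """Estimate the bytes of buffer needed to hold data during an outage.

    ingest_bytes_per_second: average rate this host sends data (assumed steady)
    outage_hours: how long ingestion may be unavailable (4 hours per the guidance above)
    headroom: hypothetical multiplier for bursts, for example 1.5 for 50% extra
    """
    seconds = outage_hours * 3600
    return math.ceil(ingest_bytes_per_second * seconds * headroom)


# A host sending 1 MB/s needs about 14.4 GB to cover 4 hours with no headroom.
print(required_buffer_bytes(1_000_000))
```

Real traffic is rarely steady, so treat the result as a lower bound and size against peak rather than average rates where you can.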
Repopulate dashboard configurations
Splunk does not replicate the results of previously run searches to the secondary CSP region. As a result, any dashboards that use the results of previously run saved searches do not populate after a failover until the next scheduled run of the saved search.
Where applicable, when you design dashboards in your Splunk Cloud Platform environment, use the ref reference attribute for the search Simple XML dashboard element rather than the loadjob search command. If you use the ref reference attribute, the search runs and populates the dashboard until the next scheduled run of the saved search.
Confirm size of new indexes
If you create new indexes in your SCP environment, wait at least 30 minutes after you create the index before you send data to it. If you begin sending data to your indexes sooner, that data might end up in the "last chance" index instead of the index you specify. The "last chance" index is the index of last resort that Splunk configures for Splunk Cloud Platform instances.
After Splunk fails over your SCP environment, it enforces the maximum size limits for indexes. Any historical data that is outside of the index size limits is deleted, oldest first. Where possible, confirm that the size you configured for indexes meets the use case requirements.
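To judge whether a configured maximum size leaves historical data exposed to deletion after a failover, you can compare it with how much data you expect the index to hold. This is a back-of-the-envelope sketch under stated assumptions: it assumes a constant, known amount of stored data per day, and the function and parameter names are hypothetical rather than Splunk settings.

```python
def days_at_risk(max_index_size_gb, stored_gb_per_day, retention_days):
    """Estimate how many of the oldest days of data exceed the index size limit.

    max_index_size_gb: the maximum size configured for the index
    stored_gb_per_day: assumed constant daily growth of the index on disk
    retention_days: how many days of history the use case requires
    """
    if stored_gb_per_day <= 0:
        return 0.0
    days_that_fit = max_index_size_gb / stored_gb_per_day
    return max(0.0, retention_days - days_that_fit)


# 500 GB limit, 10 GB stored per day, 90 days required:
# only 50 days fit, so the oldest 40 days could be deleted.
print(days_at_risk(500, 10, 90))
```

A result of zero suggests the configured size covers the required history under these assumptions; any positive value is the approximate span of oldest data that enforcement of the size limit could remove.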