Splunk® Center of Excellence

Splunk Center of Excellence Handbook

In depth: Disaster recovery for the utility tier

Splunk provides product features to increase availability and recovery options for the search tier (search head clustering) and the indexing tier (indexer clusters and index replication). Administrative functions, such as the deployment server, deployer, and license server, have no equivalent built-in feature and rely on best practices for their resiliency. This article refers to those functions as the utility tier and outlines the best practices that ensure their recoverability.


Impact of failures on the utility tier

The components of the Splunk utility tier are used for Splunk administration. If any of these components is unavailable or destroyed, its functions and resources become unavailable. Note that the utility tier does not include search heads, indexers, or forwarders.

Component: Deployment server
Impact if offline: No impact to search and indexing functions.
Impact if destroyed: The source of truth for the environment's configuration is destroyed.

Component: Deployer
Impact if offline: No impact to search and indexing functions.
Impact if destroyed: The default configuration for the search head cluster is lost, but it can be mostly rebuilt from a search head cluster (SHC) member.

Component: Master node
Impact if offline: No data redundancy requirements. See Managing Indexers and Clusters of Indexers in the "Splunk Enterprise Managing Indexers and Clusters of Indexers Manual" for more information.
Impact if destroyed: The default configuration for indexer cluster members is lost, but it can be mostly rebuilt from a member.

Component: License server
Impact if offline: No impact to indexing functions. Search functions shut down if the license server is offline for 72 continuous hours. See About the connection between the license master and license slaves in the "Splunk Enterprise Admin Manual" for more information.
Impact if destroyed: The system would need to be rebuilt. No impact to end users if the rebuild happens within 72 hours.

Component: Monitoring console
Impact if offline: No impact to search and indexing functions. Health and performance visibility and monitoring for search and indexing functions are lost.
Impact if destroyed: The system would need to be rebuilt. There is a risk to operations if health and performance visibility and monitoring for search and indexing functions are offline for a long period. Built-in summary data showing insights and long-term patterns would be lost. No lasting impact to end users or the overall Splunk platform.
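
The 72-hour window in the license server entry above lends itself to a simple calculation during an incident. The following is a minimal sketch, assuming the window is measured from the moment the license server became unreachable; the function names and the sample timestamps are hypothetical and are not part of any Splunk API.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the 72-hour window described above, counted from the time
# the license server became unreachable.
GRACE_PERIOD = timedelta(hours=72)


def rebuild_deadline(outage_start: datetime) -> datetime:
    """Return the time by which the license server should be back."""
    return outage_start + GRACE_PERIOD


def hours_remaining(outage_start: datetime, now: datetime) -> float:
    """Return the hours left in the window, never less than zero."""
    remaining = rebuild_deadline(outage_start) - now
    return max(remaining.total_seconds() / 3600, 0.0)


# Hypothetical outage that started on 1 January 2024 at 08:00 UTC.
outage = datetime(2024, 1, 1, 8, 0, tzinfo=timezone.utc)
print(rebuild_deadline(outage).isoformat())  # 2024-01-04T08:00:00+00:00
print(hours_remaining(outage, outage + timedelta(hours=30)))  # 42.0
```

A calculation like this can feed an alert or a runbook step so that the rebuild is prioritized before search is affected.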

If any of these components is destroyed, you must rebuild a new instance and update references to the new host's information throughout the environment. You can mitigate this laborious and error-prone work by applying the following best practices.

Preserve component state

Many customers use virtual machines instead of bare-metal hardware for utility-tier components because virtual machines provide two features that are valuable for utility-tier components:

Dynamic resource sizing
You can change the hardware specifications of a VM host, such as CPU and memory, as load increases.
State preservation and transition
VMs provide host snapshots that preserve an image of the instance. Some virtualization technologies, such as vMotion from VMware, enable you to instantiate the host image on a new virtual machine.

If you are unable to leverage these benefits from virtual machines, consider putting a configuration backup plan in place. For more information about configuration backups, see In depth: Configuration backups for the Splunk CoE.
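
As one possible starting point for such a plan, the sketch below archives a configuration directory into a timestamped file. It assumes the configuration to preserve lives in a single directory (on a Splunk Enterprise instance this is typically the etc directory under the installation path); the function name, archive name, and paths are hypothetical.

```python
import tarfile
from datetime import datetime, timezone
from pathlib import Path


def backup_config(etc_dir: str, dest_dir: str) -> Path:
    """Archive a configuration directory into a timestamped .tar.gz file."""
    source = Path(etc_dir)
    if not source.is_dir():
        raise FileNotFoundError(f"configuration directory not found: {source}")

    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)

    # A UTC timestamp in the file name keeps successive backups side by side.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    archive = dest / f"splunk-etc-{stamp}.tar.gz"

    with tarfile.open(archive, "w:gz") as tar:
        # Store entries relative to the directory name, for example "etc/...".
        tar.add(source, arcname=source.name)
    return archive
```

A call such as backup_config("/opt/splunk/etc", "/backups/splunk") could be scheduled on each utility-tier host, with the destination on storage that survives the loss of that host. Both paths are examples only.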

Preserve networking using DNS entries

When a utility instance fails or is destroyed, updating networking details, such as host name and IP address, on every client can be impractical in large and distributed data center environments. You may be able to avoid that labor by rebuilding the utility component with the same networking details the previous one used, but this is usually not possible. A best practice is to use DNS CNAME (canonical name) records as a translation service.

When you establish DNS CNAME records for your utility instances, you can direct all clients to those DNS entries, so clients never rely on the actual host name and IP address of the underlying host. If you have to replace the host, you do not have to reuse the same host name and IP address. This also enables you to build new utility instances in parallel with the old ones and cut over with a simple DNS change.
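
The indirection described above can be illustrated with a small model of DNS resolution. This is a sketch only, not a real resolver: the record table is an in-memory dictionary, and every host name and address in it is hypothetical.

```python
# All host names and addresses below are hypothetical.
def resolve(records: dict, name: str, max_hops: int = 8) -> str:
    """Follow CNAME records until an A record yields an IP address."""
    for _ in range(max_hops):
        rtype, value = records[name]
        if rtype == "A":
            return value
        name = value  # CNAME: continue with the canonical name
    raise RuntimeError("too many CNAME hops")


records = {
    "splunk-ds.example.com": ("CNAME", "ds-host-01.example.com"),
    "ds-host-01.example.com": ("A", "10.0.0.11"),
    "ds-host-02.example.com": ("A", "10.0.0.12"),
}

# The only name that clients are configured with; it never changes.
CLIENT_TARGET = "splunk-ds.example.com"

before = resolve(records, CLIENT_TARGET)
# Cutover: repoint the alias at the replacement host.
records["splunk-ds.example.com"] = ("CNAME", "ds-host-02.example.com")
after = resolve(records, CLIENT_TARGET)
print(before, "->", after)  # 10.0.0.11 -> 10.0.0.12
```

The client-side setting stays the same before and after the cutover. For a deployment client, for example, the alias would be the host portion of the targetUri setting in deploymentclient.conf, so no client configuration has to be redistributed.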

Applications for load balancing

You can use a similar practice for load balancing on the data collection tier or search tier. In such scenarios, a single DNS name with multiple A records distributes traffic across multiple hosts, which gives you an easy way to scale. Even if you have a single instance acting as your search head or data collection tier, you can set up this kind of DNS entry now so that you can add hosts later without reconfiguring clients.
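
The following sketch models how one name backed by several addresses can spread clients across hosts. It assumes a simple round-robin rotation of the answer order and clients that connect to the first address returned; real DNS servers, resolvers, and caches vary in how they order and reuse answers. The class name and addresses are hypothetical.

```python
from collections import deque


class RoundRobinName:
    """Model of one DNS name that maps to several A records."""

    def __init__(self, addresses):
        if not addresses:
            raise ValueError("at least one address is required")
        self._addresses = deque(addresses)

    def lookup(self):
        """Return all addresses, rotating the order after each query."""
        answer = list(self._addresses)
        self._addresses.rotate(-1)
        return answer


# Hypothetical addresses of three hosts behind one name.
name = RoundRobinName(["10.0.1.21", "10.0.1.22", "10.0.1.23"])

# Each client takes the first address in the answer it receives.
first_choices = [name.lookup()[0] for _ in range(4)]
print(first_choices)
# ['10.0.1.21', '10.0.1.22', '10.0.1.23', '10.0.1.21']
```

Adding a host to the tier then amounts to adding one more address to the name, with no change on the clients.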

For load balancing the indexing tier, however, Splunk's native load balancing feature is the best practice for forwarding data to indexers. For more information, see Set up load balancing in the Splunk Enterprise Forwarding Data Manual.
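
With forwarder load balancing, the forwarder is given a list of indexers in outputs.conf and distributes data among them itself. The sketch below renders such a stanza from a list of hosts. The helper function, group name, and host names are hypothetical, and port 9997 is only a commonly used receiving port, so treat the output as a starting point to check against the Forwarding Data Manual rather than a finished configuration.

```python
def render_outputs_conf(group: str, indexers: list[str], port: int = 9997) -> str:
    """Render a minimal outputs.conf with one load-balanced target group."""
    # Assumption: every indexer listens on the same receiving port.
    servers = ", ".join(f"{host}:{port}" for host in indexers)
    lines = [
        "[tcpout]",
        f"defaultGroup = {group}",
        "",
        f"[tcpout:{group}]",
        f"server = {servers}",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical group and indexer host names.
print(render_outputs_conf("primary_indexers",
                          ["idx1.example.com", "idx2.example.com"]))
```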

Next steps

Partner with someone who oversees networking at your organization and make sure they understand the goal and the technical details. Draft the disaster recovery plan and verify it in a non-production environment before implementing it in production.

This documentation applies to the following versions of Splunk® Center of Excellence: current
