What happens when a master node goes down
The master is essential to the proper running of a cluster because it coordinates much of the cluster activity. However, if the master goes down, the peers and search head have default behaviors that allow them to function fairly normally, at least for a while. Nevertheless, you should treat a downed master as a serious failure.
When a master goes down
If a master goes down, the cluster can continue to run as usual, as long as there are no other failures. Peers can continue to ingest data, stream copies to other peers, replicate buckets, and respond to search requests from the search head.
When a peer rolls a hot bucket, it normally contacts the master to get a list of target peers to stream its next hot bucket to. However, if a peer rolls a hot bucket while the master is down, it will just start streaming its next hot bucket to the same set of peers that it used as targets for the previous hot bucket.
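The fallback described above can be pictured with a small model. This is only an illustrative sketch, not Splunk's implementation: the function `choose_targets`, the callable `ask_master`, and the peer names are all hypothetical, and an unreachable master is assumed to surface as a `ConnectionError`.

```python
from typing import Callable, List, Optional


def choose_targets(
    ask_master: Callable[[], Optional[List[str]]],
    previous_targets: List[str],
) -> List[str]:
    """Pick the target peers for the next hot bucket.

    Hypothetical model: ask the master first; if it cannot be reached
    (or returns nothing), reuse the targets of the previous hot bucket.
    """
    try:
        targets = ask_master()
    except ConnectionError:
        targets = None  # master is down
    if targets:
        return list(targets)
    # Fallback: same set of peers as the previous hot bucket.
    return list(previous_targets)


def master_down() -> Optional[List[str]]:
    """Stand-in for a master that cannot be reached."""
    raise ConnectionError("master unreachable")


def master_up() -> Optional[List[str]]:
    """Stand-in for a master that assigns a fresh target list."""
    return ["peer4", "peer5"]
```

In this model, `choose_targets(master_down, ["peer2", "peer3"])` returns the previous targets unchanged, while `choose_targets(master_up, ["peer2", "peer3"])` returns the list supplied by the master.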
Eventually, problems will begin to arise. For example, if a peer goes down and the master is still down, there will be no way to coordinate the necessary remedial bucket-fixing activity. Or if, for some reason, a peer is unable to connect with one of its target peers, it has no way of getting another target.
The search head can also continue to function without a master, although eventually the searches will be accessing incomplete sets of data. (For example, if a peer with primary bucket copies goes down, there's no way to transfer primacy to copies on other peers, so those buckets will no longer get searched.) The search head will use the last generation ID that it got before the master went down. It will display a warning if one or more peers in the last generation are down.
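To see why searches gradually become incomplete, consider the following simplified model. It is a sketch under stated assumptions rather than a description of Splunk internals: a generation is represented as a plain mapping from bucket ID to the peer holding the primary copy, and the function name, bucket IDs, and peer names are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def plan_search(
    last_generation: Dict[str, str],
    live_peers: Set[str],
) -> Tuple[List[str], List[str]]:
    """Return (buckets that get searched, warnings to display).

    last_generation maps bucket ID -> peer holding the primary copy,
    as of the last generation received before the master went down.
    Without a master, primacy cannot move to another peer, so buckets
    whose primary peer is down are simply skipped.
    """
    searched = sorted(
        bucket for bucket, peer in last_generation.items() if peer in live_peers
    )
    down_peers = sorted(
        {peer for peer in last_generation.values() if peer not in live_peers}
    )
    warnings = [f"peer {peer} from the last generation is down" for peer in down_peers]
    return searched, warnings
```

For example, with primaries `{"b1": "peerA", "b2": "peerB", "b3": "peerA"}` and only `peerA` still running, the model searches `b1` and `b3`, skips `b2`, and produces one warning about `peerB`.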
To deal with the possibility of a downed master, you can configure a stand-by master that can take over if needed. For details, see "Configure a stand-by master".
When the master comes back up
Peers continue to send heartbeats indefinitely, so that, when the master comes back up, they will be able to detect it and reconnect.
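The reconnect behavior amounts to a retry loop that never gives up. The sketch below illustrates the idea only; `send_heartbeat` and `wait` are hypothetical callables supplied by the caller, a failed heartbeat is assumed to raise `ConnectionError`, and the optional `max_attempts` limit exists purely so the example can be bounded in tests.

```python
from typing import Callable, Optional


def heartbeat_until_connected(
    send_heartbeat: Callable[[], None],
    wait: Callable[[], None],
    max_attempts: Optional[int] = None,
) -> Optional[int]:
    """Send heartbeats until one succeeds.

    Returns the number of attempts it took, or None if max_attempts
    was set and exhausted. With max_attempts=None the loop retries
    indefinitely, mirroring the behavior described in the text.
    """
    attempts = 0
    while max_attempts is None or attempts < max_attempts:
        attempts += 1
        try:
            send_heartbeat()
            return attempts  # master answered; peer can reconnect
        except ConnectionError:
            wait()  # pause before the next heartbeat
    return None
```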
When the master comes back up, it waits for a quiet period of 60 seconds, so that all peers have an opportunity to re-register with it. Once the quiet period ends, the master has a complete view of the state of the cluster, including the state of peer nodes and buckets. Provided that the number of peers registered with it is at least equal to the replication factor, the master will initiate any necessary bucket-fixing activities to ensure that the cluster is valid and complete. In addition, it will update the generation ID as needed.
This activity can take some time to complete. In particular, the process of making non-searchable copies of buckets searchable can be slow, especially if a peer node has to process a large quantity of data. For help estimating the time needed to make non-searchable copies searchable, look here.
After the 60-second quiet period is over, you can view the master dashboard for accurate information on the status of the cluster.
Note: When you restart the master, make sure that the number of running peers is at least equal to the replication factor.
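The two conditions that gate bucket fixing after a master restart (the quiet period has ended, and enough peers have registered) can be expressed as a simple check. This is a minimal sketch of the rule as described in this topic, not Splunk code; the function and parameter names are hypothetical, and times are plain seconds.

```python
from typing import Collection

QUIET_PERIOD_SECONDS = 60.0  # quiet period stated in this topic


def can_start_bucket_fixing(
    master_started_at: float,
    now: float,
    registered_peers: Collection[str],
    replication_factor: int,
    quiet_period: float = QUIET_PERIOD_SECONDS,
) -> bool:
    """Return True when the restarted master may begin bucket fixing.

    Both conditions must hold:
      1. the quiet period since the master came back up has elapsed;
      2. at least `replication_factor` peers have registered.
    """
    quiet_period_over = (now - master_started_at) >= quiet_period
    enough_peers = len(registered_peers) >= replication_factor
    return quiet_period_over and enough_peers
```

With a replication factor of 3, the check fails 30 seconds after restart even if three peers have registered, fails after 60 seconds if only two peers have registered, and passes once both conditions are met.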
This documentation applies to the following versions of Splunk® Enterprise: 5.0, 5.0.1, 5.0.2, 5.0.3, 5.0.4, 5.0.5, 5.0.6, 5.0.7, 5.0.8, 5.0.9, 5.0.10, 5.0.11, 5.0.12, 5.0.13, 5.0.14, 5.0.15, 5.0.16, 5.0.17, 5.0.18