What happens when the manager node goes down
The manager is essential to the proper running of an indexer cluster, because it coordinates much of the cluster activity. However, if the manager goes down, the peers and search head have default behaviors that allow them to function fairly normally, at least for a while. Nevertheless, you should treat a downed manager as a serious failure.
To deal with the possibility of a downed manager, you can configure a standby manager that can take over if needed. For details, see "Replace the manager node on the indexer cluster".
When a manager goes down
If a manager goes down, the cluster can continue to run as usual, as long as there are no other failures. Peers can continue to ingest data, stream copies to other peers, replicate buckets, and respond to search requests from the search head.
When a peer rolls a hot bucket, it normally contacts the manager to get a list of target peers to stream its next hot bucket to. However, if a peer rolls a hot bucket while the manager is down, it will just start streaming its next hot bucket to the same set of peers that it used as targets for the previous hot bucket.
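The fallback described above can be pictured with a minimal sketch. This is an illustration only, not Splunk code: the function name, the `ask_manager` callback, and the peer names are all hypothetical.

```python
from typing import Callable, List, Optional


def next_hot_bucket_targets(
    previous_targets: List[str],
    ask_manager: Callable[[], Optional[List[str]]],
) -> List[str]:
    """Pick the peers that receive copies of the next hot bucket.

    `ask_manager` is a hypothetical callback that returns a fresh list of
    target peers, or None when the manager cannot be reached.
    """
    fresh_targets = ask_manager()
    if fresh_targets is not None:
        # Normal case: the manager supplies the target list.
        return list(fresh_targets)
    # Manager is down: keep streaming to the targets of the previous hot bucket.
    return list(previous_targets)
```

For example, calling `next_hot_bucket_targets(["peer2", "peer3"], lambda: None)` models a roll that happens while the manager is down, and returns the previous targets unchanged.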
Eventually, problems will begin to arise. For example, if a peer goes down and the manager is still down, there will be no way to coordinate the necessary remedial bucket-fixing activity. Or if, for some reason, a peer is unable to connect with one of its target peers, it has no way of getting another target.
The search head can also continue to function without a manager, although eventually the searches will be accessing incomplete sets of data. (For example, if a peer with primary bucket copies goes down, there's no way to transfer primacy to copies on other peers, so those buckets will no longer get searched.) The search head will use the last generation ID that it got before the manager went down. It will display a warning if one or more peers in the last generation are down.
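To see why searches gradually return incomplete data, consider the following sketch. It is a simplified model under stated assumptions, not the search head's actual logic: the generation is represented as a hypothetical mapping from peer name to the bucket IDs for which that peer holds the primary copy.

```python
from typing import Dict, Optional, Set, Tuple


def plan_search(
    last_generation: Dict[str, Set[str]],
    reachable_peers: Set[str],
) -> Tuple[Set[str], Set[str], Optional[str]]:
    """Split buckets into searched and skipped, using the last known generation.

    `last_generation` (hypothetical structure) maps each peer to the buckets
    whose primary copy it held when the manager was last available.
    """
    searched: Set[str] = set()
    skipped: Set[str] = set()
    down_peers = []
    for peer in sorted(last_generation):
        buckets = last_generation[peer]
        if peer in reachable_peers:
            searched |= buckets
        else:
            # No manager means primacy cannot move to another copy,
            # so these buckets are simply not searched.
            down_peers.append(peer)
            skipped |= buckets
    warning = None
    if down_peers:
        warning = "Peers from the last generation are down: " + ", ".join(down_peers)
    return searched, skipped, warning
```

In this model, every bucket whose primary copy sits on an unreachable peer drops out of search results until the manager returns and can reassign primacy.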
When the manager comes back up
Peers continue to send heartbeats indefinitely, so that, when the manager comes back up, they will be able to detect it and reconnect.
When the manager comes back up, it waits for a quiet period, so that all peers have an opportunity to re-register with it. Once the quiet period ends, the manager has a complete view of the state of the cluster, including the state of peer nodes and buckets. Assuming that at least the replication factor number of peers have registered with it, the manager initiates any necessary bucket-fixing activities to ensure that the cluster is valid and complete. In addition, it rebalances the cluster and updates the generation ID as needed.
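The two conditions that gate bucket fixing after a restart (the quiet period has ended, and enough peers have re-registered) can be sketched as follows. The class, its field names, and the 60-second value in the usage note are assumptions made for illustration; they are not Splunk internals or documented defaults.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class RestartedManager:
    """Toy model of a manager that has just come back up (hypothetical)."""

    replication_factor: int
    quiet_period_secs: float
    started_at: float = 0.0
    registered: Set[str] = field(default_factory=set)

    def register(self, peer: str) -> None:
        # Peers re-register once they detect that the manager is back.
        self.registered.add(peer)

    def quiet_period_over(self, now: float) -> bool:
        return now - self.started_at >= self.quiet_period_secs

    def may_start_bucket_fixing(self, now: float) -> bool:
        # Both conditions must hold: the quiet period has ended and at least
        # the replication factor number of peers have registered.
        enough_peers = len(self.registered) >= self.replication_factor
        return self.quiet_period_over(now) and enough_peers
```

With `RestartedManager(replication_factor=3, quiet_period_secs=60)`, the model refuses to start bucket fixing while only two peers are registered, even after the quiet period ends, and allows it once a third peer registers.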
Bucket fixing can take some time to complete, because it involves copying buckets and making non-searchable copies searchable. For help estimating the time needed to complete the bucket-fixing activity, look here.
After the quiet period is over, you can view the manager dashboard for accurate information on the status of the cluster.
Note: When you restart the manager, make sure that at least the replication factor number of peers are running.
This documentation applies to the following versions of Splunk® Enterprise: 8.1.0, 8.1.1, 8.1.2, 8.1.3, 8.1.4, 8.1.5, 8.1.6, 8.1.7, 8.1.8, 8.1.9, 8.1.10, 8.1.11, 8.1.12, 8.2.0, 8.2.1, 8.2.2, 8.2.3, 8.2.4, 8.2.5, 8.2.6, 8.2.7, 8.2.8, 8.2.9, 9.0.0, 9.0.1, 9.0.2, 9.0.3