Perform an automated rolling upgrade of a search head cluster

Splunk Enterprise version 9.1.0 and higher supports automated rolling upgrade of search head clusters using the custom splunk-rolling-upgrade app, which comes with the Splunk Enterprise product. A rolling upgrade performs a phased upgrade of cluster members to a new version of Splunk Enterprise with minimal interruption of ongoing searches. The splunk-rolling-upgrade app automates the manual rolling upgrade steps described in Perform a rolling upgrade of a search head cluster.

Requirements and considerations

Review the following requirements and considerations before you configure and initiate an automated rolling upgrade.

The splunk-rolling-upgrade app requires Linux. macOS and Windows are not supported.
Automated rolling upgrade only applies to upgrade from version 9.1.x and higher to subsequent versions of Splunk Enterprise. To determine your upgrade path and confirm the compatibility of the upgraded search head cluster version with existing Splunk Enterprise components and applications, see the Splunk products version compatibility matrix.
Automated search head cluster rolling upgrade supports the following installation package formats:
- .tgz: default file format
- .deb and .rpm: requires a custom script that can run with elevated privileges. See Create a custom installation hook.
To use the splunk-rolling-upgrade app, you must hold a role that contains these capabilities:
- upgrade_splunk_shc
- list_search_head_clustering
- list_settings
- use_remote_proxy

The admin role contains all of the capabilities required by default. However, to limit access, it is a best practice to create a dedicated role/user with only the capabilities required to run the rolling upgrade.
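As an illustration of the dedicated-role recommendation, the following minimal sketch checks whether a set of capabilities covers everything the app requires. The capability names come from the list above; the function name and the sample role contents are hypothetical.

```python
# Capabilities the requirements list names for the splunk-rolling-upgrade app.
REQUIRED_CAPABILITIES = frozenset({
    "upgrade_splunk_shc",
    "list_search_head_clustering",
    "list_settings",
    "use_remote_proxy",
})


def missing_capabilities(role_capabilities):
    """Return the required capabilities the role does not hold, sorted by name."""
    return sorted(REQUIRED_CAPABILITIES - set(role_capabilities))


# Hypothetical dedicated role that was created with one capability missing.
example_role = ["upgrade_splunk_shc", "list_settings", "list_search_head_clustering"]
print(missing_capabilities(example_role))
```

An empty result means the role holds every capability needed to run the rolling upgrade.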

How an automated rolling upgrade works

Use the splunk-rolling-upgrade app to perform an automated rolling upgrade of a search head cluster. You initiate the rolling upgrade with a single request to a REST endpoint or by running the corresponding CLI command. The app then downloads a new Splunk Enterprise installation package and installs it on each cluster member, one member at a time. By default, the app handles only .tgz packages, which it unpacks into the $SPLUNK_HOME directory, typically /opt/splunk.

For more flexibility with installation, the splunk-rolling-upgrade app does the following:

Implements the package installation process as a custom hook (shell script). You can write and plug in the installation logic, which is required for deb and rpm package types.
Provides additional separate endpoints for monitoring the upgrade process and remediating failures.

The splunk-rolling-upgrade app provides the following REST endpoints and corresponding CLI commands to perform an automated search head cluster rolling upgrade.

For cluster upgrade, you can run these operations on any cluster member.
For deployer upgrade, you must run these operations on the deployer.

REST endpoint	CLI command	Description
`upgrade/shc/upgrade`	`splunk rolling-upgrade shc-upgrade`	Initiate the automated rolling upgrade process.
`upgrade/shc/status`	`splunk rolling-upgrade shc-status`	Monitor automated rolling upgrade status.
`upgrade/shc/recovery`	`splunk rolling-upgrade shc-recovery`	Return the cluster to a ready state after automated rolling upgrade failure.
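The pairing of endpoints and CLI commands in the table above can be captured in a small lookup, sketched here. The endpoint paths and CLI commands are taken from the table; the helper name, the default host, and the default port 8089 (borrowed from the curl examples in this topic) are assumptions for illustration.

```python
# Endpoint paths and CLI commands as listed in the table above.
OPERATIONS = {
    "upgrade": ("upgrade/shc/upgrade", "splunk rolling-upgrade shc-upgrade"),
    "status": ("upgrade/shc/status", "splunk rolling-upgrade shc-status"),
    "recovery": ("upgrade/shc/recovery", "splunk rolling-upgrade shc-recovery"),
}


def endpoint_url(operation, host="localhost", port=8089):
    """Build the management URL for one of the three rolling upgrade operations."""
    path, _cli_command = OPERATIONS[operation]
    return f"https://{host}:{port}/services/{path}?output_mode=json"


for name, (_path, cli_command) in OPERATIONS.items():
    print(f"{name}: {endpoint_url(name)}  (CLI: {cli_command})")
```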

Perform an automated rolling upgrade

This section shows you how to configure and use the splunk-rolling-upgrade app to run an automated rolling upgrade of a search head cluster.

Configure the rolling upgrade app

Before you can run an automated rolling upgrade, configure the splunk-rolling-upgrade app for your deployment:

On the deployer, create a folder in $SPLUNK_HOME/etc/shcluster/apps which contains the required configurations.
Give a meaningful name to the folder, for example, "splunk-rolling-upgrade-config".
Distribute the add-on to search head cluster members using the deployer.

The default splunk-rolling-upgrade installation script supports .tgz packages only. To perform an upgrade using .rpm or .deb package formats, take these steps:

Create a custom script that contains installation instructions for the specific package type.
Specify the path to the script in the install_script_path setting under the [hook] stanza in rolling_upgrade.conf.

For more information, see Create a custom installation hook.

To configure the splunk-rolling-upgrade app:

On the deployer, create the $SPLUNK_HOME/etc/apps/splunk-rolling-upgrade-config/default directory.
In $SPLUNK_HOME/etc/apps/splunk-rolling-upgrade-config/default, create a new rolling_upgrade.conf file containing the following contents, where package_path points to the installation package for the new version to which you are upgrading:
```
[downloader]
package_path = <path to a package>
```
The package_path setting supports URI paths to local files, for example file://path/to/package.tgz, and remote links that require no authentication.
On the deployer, in $SPLUNK_HOME/etc/apps/splunk-rolling-upgrade-config/default, create a new file called inputs.conf, containing the following scripted input, where <splunk_user> is the name of the user the app uses to send requests to REST endpoints.
```
[script://$SPLUNK_HOME/etc/apps/splunk-rolling-upgrade/bin/complete.py]
passAuth=<splunk_user>
```
Splunk Enterprise passes the authentication token for the specified user to the splunk-rolling-upgrade app and does not store the token. The specified user must hold a role that contains all of the capabilities required to run the splunk-rolling-upgrade app. For more information, see Requirements and considerations.
(Optional) If you plan to use rpm or deb packages, run the chmod +x command to set execution permissions for the associated hook (script) that you wrote. Next, create the $SPLUNK_HOME/etc/apps/splunk-rolling-upgrade-config/hooks/default directory, and copy your hook there. Then, in $SPLUNK_HOME/etc/apps/splunk-rolling-upgrade-config/default/rolling_upgrade.conf, under the [hook] stanza, set the install_script_path value to the location of the hook. For example:
```
[hook]
install_script_path = $SPLUNK_HOME/etc/apps/splunk-rolling-upgrade-config/hooks/<hook_file_name>
```
The install_script_path setting supports only local paths and environment variable expansions.
On the deployer, copy $SPLUNK_HOME/etc/apps/splunk-rolling-upgrade-config to the configuration bundle under $SPLUNK_HOME/etc/shcluster/apps.
On the deployer, distribute the configuration bundle to all search head cluster members using the following command:
```
splunk apply shcluster-bundle -target <uri-to-shc-peer>:<management port> -auth admin:<password>
```
For more information on how to apply the configuration bundle, see Use the deployer to distribute apps and configuration updates.

For detailed information on rolling_upgrade.conf settings, see the rolling_upgrade.conf.spec file located in $SPLUNK_HOME/etc/apps/splunk-rolling-upgrade/README/.
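The configuration steps above can be sketched as a short script that writes the two files into an app directory. The stanza and setting names are the ones shown in the steps; the function name, the sample user name, and the use of a temporary directory in place of $SPLUNK_HOME/etc/apps/splunk-rolling-upgrade-config are assumptions for illustration only.

```python
import tempfile
from pathlib import Path


def write_upgrade_config(app_dir, package_path, splunk_user, hook_path=None):
    """Create default/rolling_upgrade.conf and default/inputs.conf under app_dir."""
    default_dir = Path(app_dir) / "default"
    default_dir.mkdir(parents=True, exist_ok=True)

    conf = f"[downloader]\npackage_path = {package_path}\n"
    if hook_path is not None:
        # Needed only for .deb or .rpm packages, which require a custom hook.
        conf += f"\n[hook]\ninstall_script_path = {hook_path}\n"
    (default_dir / "rolling_upgrade.conf").write_text(conf)

    inputs = (
        "[script://$SPLUNK_HOME/etc/apps/splunk-rolling-upgrade/bin/complete.py]\n"
        f"passAuth={splunk_user}\n"
    )
    (default_dir / "inputs.conf").write_text(inputs)
    return default_dir


# Stand-in for the real app directory on the deployer.
demo_app_dir = tempfile.mkdtemp()
created = write_upgrade_config(demo_app_dir, "file://path/to/package.tgz", "upgrade_user")
print((created / "rolling_upgrade.conf").read_text())
```

After generating the files, you still copy the app to the configuration bundle and apply the bundle from the deployer as described in the steps.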

Run the automated rolling upgrade

After you configure the splunk-rolling-upgrade app, follow these steps to run the automated rolling upgrade of your search head cluster, using the REST API or corresponding CLI commands:

CLI commands for automated rolling upgrade do not return error messages.

Identify the URI and management port of any search head cluster member.

On any cluster member, send an HTTP POST request to the upgrade/shc/upgrade endpoint to initiate the rolling upgrade process. For example:

curl -X POST -u admin:pass -k "https://localhost:8089/services/upgrade/shc/upgrade?output_mode=json"

The request first triggers basic health checks to ensure the search head cluster is in a healthy state to perform the rolling upgrade. If all health checks pass, the endpoint initiates the rolling upgrade. For more information, see steps 1 and 2 in Perform a rolling upgrade.

A successful request returns an "Upgrade initiated" message. For example:

{
    "updated":"2022-11-24T17:25:54+0000",
    "author":"Splunk",
    "layout":"props",
    "entry":[
        {
            "title":"upgrade",
            "id":"/services/upgrade/shc/upgrade",
            "updated":"2022-11-24T17:25:54+0000",
            "links":{
                "alternate":{
                    "href":"shc/upgrade"
                }
            },
            "content":{
                "message":"Upgrade initiated",
                "status":"succeeded"
            }
        }
    ]
}
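If you script the request, you might check the response body as well as the HTTP status code. The following minimal sketch assumes the response has the shape of the example above; the function name is hypothetical, and the sample body is a trimmed copy of that example.

```python
import json


def upgrade_initiated(response_text):
    """Return True if the response body reports that the upgrade was initiated."""
    body = json.loads(response_text)
    content = body["entry"][0]["content"]
    return (
        content.get("status") == "succeeded"
        and content.get("message") == "Upgrade initiated"
    )


# Trimmed copy of the example response shown above.
sample_response = """
{
    "entry": [
        {
            "title": "upgrade",
            "content": {"message": "Upgrade initiated", "status": "succeeded"}
        }
    ]
}
"""
print(upgrade_initiated(sample_response))
```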

In some cases the request can fail and return an error, for example, if health checks fail or if a rolling upgrade is already running. To troubleshoot the cause of a failure, review the HTTP return codes and check log files for details. The upgrade/shc/upgrade endpoint returns the following HTTP status codes:

Code	Description
200	Upgrade operation successfully initiated.
400	Configuration error.
403	The request was refused for one of these reasons: an upgrade is already running, an upgrade is not required, or the search head cluster is not ready. If the cluster is not ready, wait for it to fully initialize.
500	Internal Server Error. Something went wrong with the upgrade. Check log files for more information. Possible reasons: The upgrade could not be triggered on a given member.
501	Attempted to upgrade an unsupported deployment. (Rolling upgrade supports search head clusters, search heads and deployers only.)
503	KV store is not ready.
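A script that calls the endpoint can translate the status codes into a short explanation, as in this sketch. The descriptions paraphrase the table above. Treating only 503 as a "wait and retry" case is an assumption made here, because a 403 response can have several causes; the function names are hypothetical.

```python
# Status codes for upgrade/shc/upgrade, paraphrased from the table above.
UPGRADE_RESPONSES = {
    200: "Upgrade operation successfully initiated.",
    400: "Configuration error.",
    403: "Upgrade already running, not required, or cluster not ready.",
    500: "Internal server error. Check the log files.",
    501: "Unsupported deployment.",
    503: "KV store is not ready.",
}


def describe_upgrade_response(code):
    """Return a short description for a status code from upgrade/shc/upgrade."""
    return UPGRADE_RESPONSES.get(code, f"Undocumented status code {code}.")


def should_retry_later(code):
    """Assumption: only the KV store case is safe to retry without investigation."""
    return code == 503


for status_code in (200, 403, 503):
    print(status_code, describe_upgrade_response(status_code), should_retry_later(status_code))
```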

For more troubleshooting information, including relevant log files, see Troubleshoot and recover from automated rolling upgrade failure.

For endpoint details, see upgrade/shc/upgrade in the REST API Reference Manual.

Alternatively, on any cluster member, run the splunk rolling-upgrade shc-upgrade command to initiate the automated rolling upgrade.

Monitor the status of the rolling upgrade until all cluster members are successfully upgraded. To monitor the rolling upgrade status, send an HTTP GET request to the upgrade/shc/status endpoint. For example:

curl -u admin:pass -k "https://localhost:8089/services/upgrade/shc/status?output_mode=json"

The response shows the current status of the rolling upgrade, including the upgrade status of the entire cluster, the status of each individual cluster member, and the total number and percentage of members upgraded. For example:

{
    "updated":"2022-11-24T17:33:28+0000",
    "author":"Splunk",
    "layout":"props",
    "entry":[
        {
            "title":"status",
            "id":"/services/upgrade/shc/status",
            "updated":"2022-11-24T17:33:28+0000",
            "links":{
                "alternate":{
                    "href":"shc/status"
                }
            },
            "content":{
                "message":{
                    "upgrade_status":"completed",
                    "statistics":{
                        "peers_to_upgrade":3,
                        "overall_peers_upgraded":3,
                        "overall_peers_upgraded_percentage":100
                    },
                    "peers":[
                        {
                            "name":"sh2",
                            "status":"upgraded",
                            "last_modified":"Thu Nov 24 17:29:41 2022"
                        },
                        {
                            "name":"sh1",
                            "status":"upgraded",
                            "last_modified":"Thu Nov 24 17:28:07 2022"
                        },
                        {
                            "name":"sh3",
                            "status":"upgraded",
                            "last_modified":"Thu Nov 24 17:31:15 2022"
                        }
                    ]
                }
            }
        }
    ]
}
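To track progress in a script, you can reduce the status response to a few values. This sketch assumes the response structure of the example above. The function name is hypothetical, and the "in_progress" and "ready" values in the demo are placeholders for an upgrade that has not finished.

```python
import json


def summarize_status(response_text):
    """Extract overall status, counters, and peers not yet marked as upgraded."""
    message = json.loads(response_text)["entry"][0]["content"]["message"]
    stats = message["statistics"]
    return {
        "upgrade_status": message["upgrade_status"],
        "peers_to_upgrade": stats["peers_to_upgrade"],
        "overall_peers_upgraded": stats["overall_peers_upgraded"],
        "percentage": stats["overall_peers_upgraded_percentage"],
        "not_upgraded": [
            peer["name"] for peer in message["peers"] if peer["status"] != "upgraded"
        ],
    }


def build_sample(upgrade_status, peer_statuses):
    """Build a response body shaped like the example above (demo helper only)."""
    upgraded = sum(1 for status in peer_statuses.values() if status == "upgraded")
    total = len(peer_statuses)
    return json.dumps({"entry": [{"content": {"message": {
        "upgrade_status": upgrade_status,
        "statistics": {
            "peers_to_upgrade": total,
            "overall_peers_upgraded": upgraded,
            "overall_peers_upgraded_percentage": 100 * upgraded // total,
        },
        "peers": [{"name": n, "status": s} for n, s in peer_statuses.items()],
    }}}]})


completed = build_sample("completed", {"sh2": "upgraded", "sh1": "upgraded", "sh3": "upgraded"})
print(summarize_status(completed))
```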

The upgrade/shc/status endpoint returns the following HTTP status codes:

Code	Description
200	Successfully retrieved the latest SHC status.
400	Configuration error.
500	Internal error. Check log files for more information on the error.
501	Attempted to get the status of an unsupported deployment.
503	Unable to access the KV store. The KV store is probably still initializing.

For endpoint details, see upgrade/shc/status in the REST API Reference Manual.

Alternatively, run the splunk rolling-upgrade shc-status command to monitor the automated rolling upgrade.

When you monitor the rolling upgrade status, you might get a "Couldn't connect to server" response, such as the following:

% curl -u admin:pass -k https://10.225.218.144:8089/services/upgrade/shc/status

curl: (7) Failed to connect to 10.225.218.144 port 8089 after 1212 ms: Couldn't connect to server

This response means that the cluster member is restarting as part of the upgrade process. The rolling upgrade process stops splunkd, unpacks the package, and restarts splunkd, so the member is temporarily unreachable. In this case, monitor the status from a different cluster member, or wait until that cluster member is up and running again.
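A monitoring script can tolerate these temporary connection failures by moving on to another member. The following sketch shows one way to do that. The fetch_status callable is hypothetical: it stands in for whatever code sends the GET request to upgrade/shc/status and returns the "upgrade_status" value, and it is expected to raise an OSError (for example ConnectionError) when the member cannot be reached.

```python
import time


def wait_for_completion(fetch_status, members, attempts=10, delay=0.0):
    """Poll members in turn; return "completed", "failed", or None on timeout."""
    for attempt in range(attempts):
        member = members[attempt % len(members)]
        try:
            status = fetch_status(member)
        except OSError:
            # The member is probably restarting as part of the upgrade,
            # so wait briefly and ask the next member instead.
            time.sleep(delay)
            continue
        if status in ("completed", "failed"):
            return status
        time.sleep(delay)
    return None


# Demo with a fake fetcher: sh1 is unreachable, sh2 finishes on its second poll.
calls = []


def fake_fetch(member):
    calls.append(member)
    if member == "sh1":
        raise ConnectionError("member is restarting")
    return "completed" if len(calls) >= 4 else "in_progress"


print(wait_for_completion(fake_fetch, ["sh1", "sh2"]))
```

In real use you would pass a non-zero delay so that the script does not poll the cluster in a tight loop.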

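Upgrade the deployer. When the upgrade/shc/status endpoint response shows "upgrade_status":"completed" for the entire cluster, repeat step 2 on the deployer: send an HTTP POST request to the upgrade/shc/upgrade endpoint, or run the splunk rolling-upgrade shc-upgrade command.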

Create a custom installation hook

An installation hook is a custom binary or script that installs the Splunk package on every machine. The splunk-rolling-upgrade app downloads the package specified in package_path in rolling_upgrade.conf, then sends a request to the hook to install the package on the cluster member.

The app passes the package path to the hook as the first parameter, and $SPLUNK_HOME as the second parameter. The hook must contain installation instructions for the package, and must have executable permissions, which you can set using the chmod +x command. For example, the following shows the default installation hook for .tgz packages:

#!/bin/bash
set -e
splunk_tar="$1"
dest_dir="$2"
tar -zvxf "$splunk_tar" --strip-components 1 -C "$dest_dir"
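To make the effect of --strip-components 1 concrete, the following sketch performs a comparable extraction with Python's tarfile module. It is not the app's implementation, only an illustration of what the default hook does to the archive layout; the function name and the tiny demo archive are hypothetical, and link targets inside the archive are not rewritten.

```python
import tarfile
import tempfile
from pathlib import Path, PurePosixPath


def extract_stripped(tar_path, dest_dir):
    """Extract a .tgz archive into dest_dir, dropping the top-level directory."""
    with tarfile.open(tar_path, "r:gz") as archive:
        for member in archive.getmembers():
            parts = PurePosixPath(member.name).parts[1:]
            if not parts:
                continue  # Skip the top-level directory entry itself.
            member.name = str(PurePosixPath(*parts))
            archive.extract(member, dest_dir)


# Demo: build a small archive laid out as splunk/bin/hello.txt, then extract it.
work_dir = Path(tempfile.mkdtemp())
(work_dir / "splunk" / "bin").mkdir(parents=True)
(work_dir / "splunk" / "bin" / "hello.txt").write_text("hi")
archive_path = work_dir / "package.tgz"
with tarfile.open(archive_path, "w:gz") as new_archive:
    new_archive.add(work_dir / "splunk", arcname="splunk")

target_dir = Path(tempfile.mkdtemp())
extract_stripped(archive_path, target_dir)
print(sorted(str(p.relative_to(target_dir)) for p in target_dir.rglob("*")))
```

The files land directly under the destination directory (bin/hello.txt) rather than under an extra splunk/ level, which is why the hook can unpack straight into $SPLUNK_HOME.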

Custom hooks for deb and rpm package installation

Installation of deb and rpm packages requires sudo permissions, while the Splunk instance typically runs under the 'splunk' user without those privileges. To perform an automated rolling upgrade using deb or rpm packages, create a custom installation hook. Before you run installation commands, such as sudo rpm --upgrade for rpm packages, take these steps:

Acquire elevated privileges for the installation hook for deb and rpm packages.
Install the correct package manager on your machine:
- dpkg for deb packages
- rpm for rpm packages
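As a starting point for a custom hook, the following sketch selects an installation command from the package file extension. It only builds the command and does not run it. The sudo rpm --upgrade form follows the example mentioned above; the dpkg --install form and the function name are assumptions, and a real hook also needs the elevated privileges described in this section.

```python
from pathlib import Path


def install_command(package_path):
    """Return the command a hook might run for a .rpm or .deb package."""
    suffix = Path(package_path).suffix
    if suffix == ".rpm":
        return ["sudo", "rpm", "--upgrade", package_path]
    if suffix == ".deb":
        # Assumption: dpkg --install is the matching command for deb packages.
        return ["sudo", "dpkg", "--install", package_path]
    raise ValueError(f"No custom installation command for {package_path!r}")


# Hypothetical package file names.
print(install_command("/tmp/splunk-package.rpm"))
print(install_command("/tmp/splunk-package.deb"))
```

A hook built on this idea would read the package path from its first argument, because the app passes the package path as the first parameter and $SPLUNK_HOME as the second.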

Troubleshoot and recover from automated rolling upgrade failure

After a rolling upgrade fails, you can use the splunk-rolling-upgrade app to return the search head cluster to a ready state, from which you can run the automated rolling upgrade again. Before you initiate the recovery process, make sure that the rolling upgrade has failed or crashed.

When a rolling upgrade fails, the "upgrade_status" field in the upgrade/shc/status endpoint response shows one of the following values:

"failed", in most cases
"in_progress", in some cases, for example, if the upgrade crashes while the Splunk instance is stopped.

To investigate the cause of the rolling upgrade failure, take these steps:

Find the last instance that was being upgraded at the time of failure. To do so, check the upgrade/shc/status endpoint response for the member whose "status" field is set to a value other than "READY" or "UPGRADED".
Check the logs for errors.
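The first step can be sketched as a filter over the "peers" list from the status response. The comparison ignores case as a precaution, because this topic shows status values in both uppercase and lowercase. The function name is hypothetical, and "failed" in the demo is a placeholder value rather than a documented member status.

```python
def members_needing_attention(peers):
    """Return names of peers whose status is neither READY nor UPGRADED."""
    finished = {"ready", "upgraded"}
    return [peer["name"] for peer in peers if peer["status"].lower() not in finished]


# Peers list shaped like the one in the status response. The "failed" value
# is a hypothetical placeholder for a member that did not upgrade.
example_peers = [
    {"name": "sh1", "status": "upgraded"},
    {"name": "sh2", "status": "failed"},
    {"name": "sh3", "status": "READY"},
]
print(members_needing_attention(example_peers))
```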

The splunk-rolling-upgrade app writes to three log files under $SPLUNK_HOME/var/log/splunk:

splunk_shc_upgrade_upgrader_script.log
splunk_shc_upgrade_rest_endpoints.log
splunk_shc_upgrade_completion_script.log

If the request response shows "no_upgrade", look for errors in the splunk_shc_upgrade_rest_endpoints.log file on the member where you ran the request. Address the issues that you find in the logs. Make sure the issues do not repeat on other cluster members during future rolling upgrade attempts.

After you address the issues that caused the failure, prepare the cluster for another rolling upgrade attempt, as follows:

If the cluster member where the issue occurred is down, manually install the package on that machine. Remove $SPLUNK_HOME/var/run/splunk/trigger-rolling-upgrade (if it exists), and start Splunk on that member.

Send an HTTP POST request to the upgrade/shc/recovery endpoint. For example:

curl -X POST -u admin:pass -k "https://localhost:8089/services/upgrade/shc/recovery"

This operation returns the cluster to the ready state, where you can run the automated rolling upgrade again after failure. It also sets the current upgrade status to "failed". Note that it can take some time for the KV store to initialize after startup.

The upgrade/shc/recovery endpoint returns the following HTTP status codes:

Code	Description
200	Recovery was executed successfully.
400	Configuration error.
500	Internal error. Check log files for more information on the error.
501	Attempted to run a recovery on an unsupported deployment.

For endpoint details, see upgrade/shc/recovery in the REST API Reference Manual.

Alternatively, run the splunk rolling-upgrade shc-recovery command to initiate the recovery process.

If the upgrade/shc/recovery endpoint response contains a message such as the following:
```
{
    "message":"SHC partially recovered. Please turn off manual detention mode on the following peers: ['sh1']",
    "status":"succeeded"
}
```
then send an HTTP POST request to the /shcluster/member/control/control/set_manual_detention endpoint, turning off manual detention on the search head specified in the response. For example:
```
curl -u admin:pass -k "https://localhost:8089/servicesNS/admin/search/shcluster/member/control/control/set_manual_detention" -d manual_detention=off
```
For endpoint details, see shcluster/member/control/control/set_manual_detention in the REST API Reference Manual.
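If you automate recovery, you can extract the affected members from the "partially recovered" message. This sketch assumes the message always embeds the peer names as a bracketed, quoted list, as in the example above; that format is inferred from the single example, and the function name is hypothetical.

```python
import ast
import re


def peers_in_detention(message):
    """Return the peer names listed in a 'partially recovered' message."""
    match = re.search(r"\[.*\]", message)
    if match is None:
        return []
    # The bracketed part looks like a list literal, for example ['sh1'].
    return list(ast.literal_eval(match.group(0)))


example_message = (
    "SHC partially recovered. Please turn off manual detention mode "
    "on the following peers: ['sh1']"
)
print(peers_in_detention(example_message))
```

You would then send the set_manual_detention request to each search head that the function returns.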

Resume the upgrade by sending an HTTP POST request to the upgrade/shc/upgrade endpoint. For example:

curl -X POST -u admin:pass -k "https://localhost:8089/services/upgrade/shc/upgrade?output_mode=json"

For details on how to run the automated rolling upgrade, see Run the automated rolling upgrade.
