Splunk® Enterprise

Managing Indexers and Clusters of Indexers


Upgrade a cluster

Important: All cluster nodes (master, peers, and search head) must be running the same version of Splunk Enterprise. When you upgrade one node, upgrade them all.

The upgrade process differs depending on whether you're upgrading from 5.x to 6.x or merely upgrading from one maintenance release to the next (for example, 6.0 to 6.0.1).

Upgrade from 5.x to 6.x

When you upgrade from a 5.x cluster to a 6.x cluster, you must take down all cluster nodes (master, peers, and search heads). You cannot perform a rolling, online upgrade.

Perform the following steps:

1. On the master, run the safe_restart_cluster_master script with the --get_list option:

splunk cmd python safe_restart_cluster_master.py <master_uri> --auth <username>:<password> --get_list 

Note: For the master_uri parameter, use the URI and port number of the master node. For example: https://10.152.31.202:8089
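
A complete invocation, using that master URI and the default admin credentials (substitute your own values), might look like this:

splunk cmd python safe_restart_cluster_master.py https://10.152.31.202:8089 --auth admin:changeme --get_list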

This command puts a list of all cluster bucket copies and their states into the file $SPLUNK_HOME/var/run/splunk/cluster/buckets.xml. This list gets fed back to the master after the master upgrade.

Important: To obtain a copy of this script, copy and paste it from here: "The safe_restart_cluster_master script".

For information on why this step is needed, see "Why the safe_restart_cluster_master script is necessary".

2. Stop the master.

3. Stop all the peers and search head(s).

Note: When bringing down the peers, use the splunk stop command, not splunk offline.

4. Upgrade the master node, following the normal procedure for any Splunk Enterprise upgrade, as described in "How to upgrade Splunk" in the Installation Manual. Do not upgrade the peers yet.

5. Start the master, if it is not already running, and accept any prompts.

6. Run the splunk apply cluster-bundle command, using the syntax described in "Update common peer configurations and apps". (This step is necessary to avoid extra peer restarts, due to a 6.0 change in how the configuration bundle checksum is calculated.)
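
For example, on the master:

splunk apply cluster-bundle

The command might prompt you to confirm the action; depending on your version, you can append the --answer-yes flag to skip the prompt.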

7. Run splunk enable maintenance-mode on the master. To confirm that the master has entered maintenance mode, run splunk show maintenance-mode. This step prevents unnecessary bucket fix-ups.

8. Upgrade the peer nodes and search head, following the normal procedure for any Splunk Enterprise upgrade, as described in "How to upgrade Splunk" in the Installation Manual.

9. Start the peer nodes and search head, if they're not already running.

10. On the master, run the safe_restart_cluster_master script again, this time with the --freeze_from option, specifying the location of the bucket list created in step 1:

splunk cmd python safe_restart_cluster_master.py <master_uri> --auth <username>:<password> --freeze_from <path_to_buckets_xml>

For example:

splunk cmd python safe_restart_cluster_master.py <master_uri> --auth admin:changeme --freeze_from $SPLUNK_HOME/var/run/splunk/cluster/buckets.xml 

This feeds the master the list of frozen buckets obtained in step 1.

11. Run splunk disable maintenance-mode on the master. To confirm that the master has left maintenance mode, run splunk show maintenance-mode.

You can view the master dashboard to verify that all cluster nodes are up and running.

Why the safe_restart_cluster_master script is necessary

The safe_restart_cluster_master script takes care of a problem in the way the 5.x master node handles frozen bucket copies. This problem is fixed starting with release 6.0; however, it is still an issue during master upgrades from 5.x. This section provides detail on the issue.

When a peer freezes a copy of a bucket, the master stops doing fix-ups on that bucket. It operates under the assumption that the other peers will eventually freeze their copies of that bucket as well.

This works well as long as the master continues to run. In 5.x, however, knowledge of frozen buckets is not persisted on either the master or the peers. If you subsequently restart the master, it treats frozen copies as missing copies (in the case where unfrozen copies of that bucket still exist on other peers) and performs its usual fix-up activities to return the cluster to a complete state. If the cluster has a lot of partially frozen buckets, this process can be lengthy, and until it completes, the master cannot commit the next generation.

To prevent this situation from occurring when you restart the master after upgrading to 6.0, you must run the safe_restart_cluster_master script on the master. As described in the upgrade procedure, when you initially run this script on the 5.x master with the --get_list option, it creates a list of all cluster bucket copies and their states, including whether they are frozen. When you then rerun it after upgrading the master to 6.x, using the --freeze_from option, it feeds the list to the upgraded master so that the master doesn't attempt fix-up of the frozen buckets.
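
For reference, the script re-freezes a bucket by sending a POST request to the master's /services/cluster/master/buckets/<bucket_id>/freeze endpoint, where the bucket ID takes the form <index>~<local_id>~<origin_guid>. A manual equivalent, sketched with the example master URI and default credentials used elsewhere in this topic, might look like this:

curl -k -u admin:changeme -X POST https://10.152.31.202:8089/services/cluster/master/buckets/<index>~<local_id>~<origin_guid>/freeze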

The safe_restart_cluster_master script

To perform steps 1 and 10 of the upgrade procedure, you must run the safe_restart_cluster_master script. This script does not currently ship with the product. To obtain the script, copy the listing directly below and save it to $SPLUNK_HOME/bin/safe_restart_cluster_master.py.

Important: You must also copy and save the parse_xml_v3 script, as described in the next section, "The parse_xml_v3 script".

Here are the contents of the script:

import httplib2
from urllib import urlencode
from optparse import OptionParser
import splunk, splunk.rest, splunk.rest.format
from parse_xml_v3 import *
import json
import re
import time
import os
import subprocess

# Before restarting the master, store the bucket list in $SPLUNK_HOME/var/run/splunk/cluster.
BUCKET_LIST_PATH = os.path.join(os.environ['SPLUNK_HOME'], 'var', 'run', 'splunk', 'cluster', 'buckets.xml')

def get_buckets_list(master_uri, auth):
    f = open(BUCKET_LIST_PATH,'w')
    atom_buckets = get_xml_feed(master_uri +'/services/cluster/master/buckets?count=-1',auth,'GET')
    f.write(atom_buckets)
    f.close()

def change_quiet_period(master_uri, auth):
    # Lengthen the master's quiet period to 600 seconds so peers have time to rejoin after the restart.
    return get_response_feed(master_uri + '/services/cluster/config/system?quiet_period=600', auth, 'POST')

def num_peers_up(master_uri, auth):
    # Count the peers that the master currently reports as "Up".
    count = 0
    f = open('peers.xml', 'w')
    atom_peers = get_xml_feed(master_uri + '/services/cluster/master/peers?count=-1', auth, 'GET')
    f.write(atom_peers)
    f.close()
    regex = re.compile('"status">Up')
    file = open('peers.xml', 'r')
    for line in file:
        count += len(regex.findall(line))
    file.close()
    os.remove('peers.xml')
    return count

def wait_for_peers(master_uri,auth,original_number):
    while(num_peers_up(master_uri,auth) != original_number):
        num_peers_not_up = original_number - num_peers_up(master_uri,auth)
        print "Still waiting for " +str(num_peers_not_up) +" peers to join ..."
        time.sleep(5)
    print "All peers have joined"

def get_response_feed(url, auth, method='GET', body=None):
    (user, password) = auth.split(':')
    h = httplib2.Http(disable_ssl_certificate_validation=True)
    h.add_credentials(user, password)
    if body is None:
        body = {}
    response, content = h.request(url, method, urlencode(body))
    if response.status == 401:
        raise Exception("Authorization Failed", url, response)
    elif response.status != 200:
        raise Exception(url, response)

    return splunk.rest.format.parseFeedDocument(content)

def get_xml_feed(url, auth, method='GET', body=None):
    (user, password) = auth.split(':')
    h = httplib2.Http(disable_ssl_certificate_validation=True)
    h.add_credentials(user, password)
    if body is None:
        body = {}
    response, content = h.request(url, method, urlencode(body))
    if response.status == 401:
        raise Exception("Authorization Failed", url, response)
    elif response.status != 200:
        raise Exception(url, response)

    return content

def validate_rest(master_uri, auth):
    return get_response_feed(master_uri + '/services/cluster/master/info', auth)
    
def freeze_bucket(master_uri, auth, bid):
    return get_response_feed(master_uri + '/services/cluster/master/buckets/' + bid + '/freeze', auth, 'POST')

def freeze_from_file(master_uri,auth,path=BUCKET_LIST_PATH):    
    file = open(path) #read the buckets.xml from either path supplied or BUCKET_LIST_PATH

    handler = BucketHandler()
    parse(file, handler)
    buckets = handler.getBuckets()

    fcount = 0
    fdone = 0
    for bid, bucket in buckets.iteritems():
        if bucket.frozen:
            fcount += 1
            try:
                freeze_bucket(master_uri,auth, bid)
                fdone += 1
            except Exception as e:
                print e

    print "Total bucket count:: ", len(buckets), "; number frozen: ", fcount, "; number re-frozen: ", fdone

def restart_master(master_uri,auth):
    change_quiet_period(master_uri,auth)
    original_num_peers = num_peers_up(master_uri,auth)
    
    print "\n" + "Issuing restart at the master" +"\n"
    subprocess.call([os.path.join(os.environ["SPLUNK_HOME"],"bin","splunk"), "restart"])

    print "\n"+ "Master was restarted" + "\n" 

    print "\n" + "Waiting for all " +str(original_num_peers) + " peers to come back up" +"\n"

    wait_for_peers(master_uri,auth,original_num_peers)
    
    print "\n" + "Making sure we have the correct number of frozen buckets" + "\n"

if __name__ ==  '__main__':
    usage = "usage: %prog [options] <master_uri> --auth admin:changeme"
    parser = OptionParser(usage)
    parser.add_option("-a", "--auth", dest="auth", metavar="user:password", default=':',
                    help="Splunk authentication parameters for the master instance")
    parser.add_option("-g", "--get_list", action="store_true",
                    help="get a list of cluster buckets and store it in buckets.xml")
    parser.add_option("-f", "--freeze_from", dest="freeze_from",
                    help="path to the file that contains the list of buckets to be frozen, i.e., the buckets.xml generated by the --get_list option above")

    (options, args) = parser.parse_args()

    if len(args) ==  0:
        parser.error("master_uri is required")
    elif len(args) > 1:
        parser.error("incorrect number of arguments")

    master_uri = args[0]
    try:
        validate_rest(master_uri, options.auth)
    except Exception as e:
        print "Failed to access the master info endpoint make sure you've supplied the authentication credentials"
        raise
    # Get the list of buckets and store it at BUCKET_LIST_PATH.
    if options.get_list:
        print "Only getting the list of buckets and storing it at " + BUCKET_LIST_PATH
        get_buckets_list(master_uri, options.auth)
    elif options.freeze_from:
        print "Reading the list of buckets from " + options.freeze_from + " and refreezing them"
        freeze_from_file(master_uri,options.auth,options.freeze_from)
    else:
        print "Restarting the master safely to preserve knowledge of frozen buckets"
        get_buckets_list(master_uri,options.auth)
        restart_master(master_uri,options.auth)
        freeze_from_file(master_uri,options.auth,BUCKET_LIST_PATH)
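
Outside of the upgrade procedure, you can also run the script with no options. In that mode it performs a complete safe restart in one pass: it saves the bucket list, restarts the master, waits for all the peers to rejoin, and then re-freezes the buckets from the saved list. For example:

splunk cmd python safe_restart_cluster_master.py <master_uri> --auth admin:changeme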

The parse_xml_v3 script

The parse_xml_v3 script contains certain helper functions needed by the safe_restart_cluster_master script. This script does not currently ship with the product. To obtain the script, copy the listing directly below and save it to $SPLUNK_HOME/bin/parse_xml_v3.py.

Here are the contents of the script:

import sys
from xml.sax import ContentHandler, parse
from optparse import OptionParser

class PeerBucketFlags:
    def __init__(self):
        self.primary = False
        self.searchable = False

class Bucket:
    def __init__(self):
        self.peer_flags = {}  # key is peer guid
        self.frozen = False
        
class BucketHandler(ContentHandler):
    def __init__(self):
        ContentHandler.__init__(self)
        self.buckets = {}
        self.in_entry = False
        self.in_peers = False
        self.save_title = False
        self.save_frozen = False
        self.peer_nesting = 0
        self.current_peer_flags = {}
        self.current_guid = None
        self.current_frozen_flag = ''
        self.current_peer_field = None
        self.current_peer_field_value = ''
        self.current_bucket = ''

    def getBuckets(self):
        return self.buckets
            
    def startDocument(self):
        pass

    def endDocument(self):
        pass
        
    def startElement(self, name, attrs):
        if name == 'entry':
            self.in_entry = True
        elif self.in_entry and name == 'title':
            self.save_title = True
        elif self.in_entry and name == 's:key' and attrs.get('name') == 'frozen':
            self.save_frozen = True
        elif name == 's:key' and attrs.get('name') == 'peers':
            self.in_peers = True
        elif self.in_peers and name == 's:key':
            self.peer_nesting += 1
            if self.peer_nesting == 1:
                self.current_peer_flags = PeerBucketFlags()
                self.current_guid = attrs.get('name').encode('ascii')
            elif self.peer_nesting == 2:
                self.current_peer_field = attrs.get('name').encode('ascii')
                self.current_peer_field_value = ''

    def endElement(self,name):
        if name == 'entry':
            self.in_entry = False
            self.current_bucket=''
        elif self.save_title:
            try:
                (idx, local_id, origin_guid) = self.current_bucket.split('~')
            except ValueError as e:
                print "Invalid? ", self._locator.getLineNumber()
                print self.current_bucket
                print e
                raise
            self.buckets[self.current_bucket] = Bucket()
            self.save_title = False
        elif self.save_frozen:
            if self.current_frozen_flag in [1, '1', 'True', 'true']:
                self.buckets[self.current_bucket].frozen = True
            self.current_frozen_flag = ''
            self.save_frozen = False
        elif self.peer_nesting == 2 and name == 's:key':
            if self.current_peer_field == 'bucket_flags':
                self.current_peer_flags.primary = (self.current_peer_field_value == '0xffffffffffffffff')
            elif self.current_peer_field == 'search_state':
                self.current_peer_flags.searchable = self.current_peer_field_value == 'Searchable'
            # Nesting level goes down in either case.
            self.peer_nesting -= 1
        elif self.peer_nesting == 1 and name == 's:key':
            self.buckets[self.current_bucket].peer_flags[self.current_guid] = self.current_peer_flags
            self.peer_nesting -= 1
        elif self.in_peers and self.peer_nesting == 0 and name == 's:key':
            self.in_peers = False
            
    def characters(self, content):
        if self.save_title:
            self.current_bucket += content.encode('ascii').strip()
        elif self.save_frozen:
            self.current_frozen_flag += content.encode('ascii').strip()
        if self.peer_nesting > 0:
            s = content.encode('ascii').strip()
            if s:
                self.current_peer_field_value += s
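
The handler is driven by the safe_restart_cluster_master script, but you can also use it on its own to inspect a saved bucket list. Here is a minimal sketch, assuming a buckets.xml file generated with the --get_list option (the path shown is just an example; adjust it to your installation):

from xml.sax import parse
from parse_xml_v3 import BucketHandler

handler = BucketHandler()
# Parse the bucket list that --get_list saved (example path; adjust to your $SPLUNK_HOME).
parse('/opt/splunk/var/run/splunk/cluster/buckets.xml', handler)

# Print the IDs of the bucket copies that were marked frozen.
for bid, bucket in handler.getBuckets().iteritems():
    if bucket.frozen:
        print bid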

Upgrade to a new maintenance release

To upgrade a cluster to a new maintenance release (for example, from 6.0 to 6.0.1), you do not need to bring down the entire cluster at once. Instead, you can perform a rolling, online upgrade, in which you upgrade the nodes one at a time.

Important: Even with a rolling upgrade, you must upgrade all components expeditiously. Proper functioning of the cluster depends on all components running the same version of Splunk Enterprise, as stated in "System requirements and other deployment considerations". Therefore, do not begin the upgrade process until you are ready to upgrade all cluster components. If you upgrade the master but not the peers, the cluster might start to generate various errors and warnings. This is generally okay for a short duration, but you should complete the upgrade of all nodes as quickly as possible.

To upgrade a cluster node, follow the normal procedure for any Splunk Enterprise upgrade, with the few exceptions described below. For general information on upgrading Splunk Enterprise instances, see "How to upgrade Splunk".

Perform the following steps:

1. Upgrade the master node

Upgrade the master node first.

For information on what happens when the master goes down, as well as what happens when it comes back up, read "What happens when a master node goes down".

2. Put the master into maintenance mode

Run splunk enable maintenance-mode on the master. To confirm that the master has entered maintenance mode, run splunk show maintenance-mode. This step prevents unnecessary bucket fix-ups.

3. Upgrade the peer nodes

When upgrading peer nodes, note the following:

  • Peer upgrades can disrupt ongoing searches.
  • To minimize downtime and limit any disruption to indexing and searching, upgrade the peer nodes one at a time, as in the example sequence after this list.
  • During the interim between when you upgrade the master and when you finish upgrading the peers, the cluster might generate various warnings and errors.
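
A typical sequence for each peer, sketched under the assumption of an in-place *nix upgrade (see "How to upgrade Splunk" in the Installation Manual for the authoritative steps for your platform):

splunk stop
<install the new Splunk Enterprise version over the existing installation>
splunk start --accept-license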

4. Upgrade the search head(s)

The only impact to the cluster when you upgrade the search head is the expected disruption to searches during that time.

5. Take the master out of maintenance mode

Run splunk disable maintenance-mode on the master. To confirm that the master has left maintenance mode, run splunk show maintenance-mode.

Upgrade from 5.0.1 or earlier

During an upgrade from 5.0.1 or earlier, the /cluster directory under $SPLUNK_HOME/etc/master-apps (on the master) and $SPLUNK_HOME/etc/slave-apps (on the peers) gets renamed to /_cluster. This happens automatically. For more details on this directory, see "Update common peer configurations".
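
For example, on the master:

$SPLUNK_HOME/etc/master-apps/cluster -> $SPLUNK_HOME/etc/master-apps/_cluster

and on each peer:

$SPLUNK_HOME/etc/slave-apps/cluster -> $SPLUNK_HOME/etc/slave-apps/_cluster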

When the master restarts after an upgrade from 5.0.1 or earlier, it performs a rolling restart on the set of peer nodes, to push the latest version of the configuration bundle (with the renamed /_cluster directory).
