An overview of the Splunk SOAR (On-premises) clustering feature

A cluster consists of a minimum of three instances of and its supporting external services; file shares, a PostgreSQL database or database cluster, a Splunk Enterprise or Splunk Cloud deployment, and a load balancer.

Splunk SOAR (On-premises) clusters provide horizontal scaling and redundancy. Larger clusters provide the capacity to handle larger workloads than a single Splunk SOAR (On-premises) instance, and provide benefits in terms of redundancy and reduced downtime for upgrades or other maintenance.

Splunk SOAR (On-premises) does not have the ability to automatically scale, or automatically add or remove cluster nodes through external systems such as Kubernetes, AWS, or Azure.

Elements of a Splunk SOAR (On-premises) cluster

The primary elements of a Splunk SOAR (On-premises) cluster are:

a load balancer, such as HAProxy or Elastic Load Balancer
three or more Splunk SOAR (On-premises) nodes
a PostgreSQL database
file shares
either a Splunk Enterprise or Splunk Cloud deployment

A Splunk SOAR (On-premises) cluster uses other key technologies:

Consul, for action locking and coordination.
RabbitMQ, for a messaging bus and queues.

This diagram shows an example of a cluster, with the connections marked.

The role of Consul

Splunk SOAR (On-premises) clusters use Consul to elect a cluster leader and manage distributed locks for the system. The first three Consul nodes in a Splunk SOAR cluster are servers. Any additional nodes are Consul client nodes. Consul uses a quorum of server nodes to elect a cluster leader. If a majority of the Consul server nodes are unavailable or unreachable, no cluster leader can be elected.

The Splunk SOAR (On-premises) daemons ActionD and ClusterD depend on Consul electing a cluster leader.

ClusterD relies on Consul's leader elections to set the leader for the Splunk SOAR (on-premises) cluster in the PostgreSQL database. ClusterD also routes all RabbitMQ messages from other nodes to the appropriate processes within its cluster node. If no leader has been or can be elected, no messages are routed to cluster nodes, effectively stopping all automation.
ActionD is responsible for managing action concurrency for each asset and triggering action runs across the cluster, which is accomplished using Consul's distributed locks. If no messages or actions can be processed, all automation is effectively stopped.

You can verify that a problem with actions processing or automation is related to cluster leadership by checking these things:

The System Health page shows ActionD and ClusterD services are stopped on all cluster nodes.
Reviewing the Consul logs at <$PHANTOM_HOME>/var/log/phantom/consul-stdout.log.

Recovering a Consul cluster that is unable to elect a leader

When recovering a Consul cluster, you will need to identify Consul server nodes. You can determine whether a Consul node is a server or a client by using a phenv command.

On any cluster node, run this command:

phenv cluster_management --status

The first stanza of the output will contain either Server: True or Server: False for each node in the ClusterNode table.

Example output:

[phantom@localhost phantom]$ phenv cluster_management --status
Splunk SOAR Cluster State:
ClusterNodes found in the database:

  ID: f62911b-3613-4662-b439-b67711a13927
  Name: 10.1.10.1
  Server: True
  Status: Enabled=True Online=True

  ID: e8388de0-fc65-4980-a006-57bae5d8a236
  Name: 10.1.10.2
  Server: True
  Status: Enabled=True Online=True

  ID: 0c27dd73-7318-473b-b10a-f0d960d8ffa9
  Name: 10.1.10.3
  Server: True
  Status: Enabled=True Online=True

  ID: f181c689-3833-4335-b655-d61f95919291
  Name: 10.1.10.4
  Server: False
  Status: Enabled=True Online=True
    
  ID: 07cff7d7-0aa3-4906-87b0-c85c7c83f808
  Name: 10.1.10.5
  Server: False
  Status: Enabled=True Online=True

  ...

The output in this example is truncated.

Recover the Consul cluster by doing the following steps:

Using SSH, on every cluster node, stop Consul and clear Consul's raft database.

phenv phantom_supervisord ctl stop consul

rm -r <$PHANTOM_HOME>/local_data/consul/*

Choose a cluster node defined as a Consul server node to be your bootstrap server node.
In a three-node Splunk SOAR (On-premises) cluster, all nodes are server nodes.
1. (Conditional) If your Splunk SOAR (On-premises) cluster is larger than three nodes, check <$PHANTOM_HOME>/etc/consul/config.json to identify a cluster node that Consul has configured as a server node.
On your bootstrap server node, create a bootstrap file. This file will set and keep this node as the Consul cluster leader when Consul is restarted.
```
echo '{"bootstrap": true}' > <$PHANTOM_HOME>/etc/consul/bootstrap.json
```
Using SSH, on the server node with the bootstrap file, start Consul.
```
phenv phantom_supervisord ctl start consul
```
Using SSH, on the remaining two server nodes, start Consul.
```
phenv phantom_supervisord ctl start consul
```
Using SSH, on any remaining nodes, start Consul.
```
phenv phantom_supervisord ctl start consul
```
After all Consul nodes are back online, SSH to your bootstrap server then delete the bootstrap file. This allows Consul to hold elections as normal.
```
rm <$PHANTOM_HOME>/etc/consul/bootstrap.json
```

The role of RabbitMQ

Splunk SOAR (On-premises) clusters use a RabbitMQ cluster to send messages between nodes.

Network failures between cluster nodes may result in network partitions (also sometimes called a split-brain), and by default, RabbitMQ requires manual intervention to recover from a network partition. When the RabbitMQ cluster is in a network partitioned state, messages sent between Splunk SOAR cluster nodes may be lost in undefined ways, which can disrupt automation and ingestion.

Splunk SOAR (On-premises) releases 6.1.0 and higher implements RabbitMQ's built-in cluster_partition_handling setting strategy of autoheal by default.

RabbitMQ nodes operate as either RAM nodes or DISC nodes. DISC node persist more state information to disk than RAM nodes. In Splunk SOAR (On-premises) 6.1.0 and later releases, all RabbitMQ nodes are set to DISC.

Set RabbitMQ nodes to disc mode

RabbitMQ nodes in Splunk SOAR (On-premises) releases earlier than 6.1.0 are a mix of RAM and DISC nodes.

The fist three Splunk SOAR (On-premises) cluster nodes have RabbitMQ nodes in DISC mode.
Any additional Splunk SOAR (On-premises) clusters nodes have RabbitMQ nodes in RAM mode.

Example:
In a five-node Splunk SOAR (On-premises) cluster, three RabbitMQ nodes are in DISC mode and two RabbitMQ nodes in RAM mode.

Improve reliability and recoverability for your Splunk SOAR (On-premises) cluster by setting all RabbitMQ nodes to disc mode.

On each cluster node, do the following steps:

Using SSH, connect to a Splunk SOAR (On-premises) cluster node as the user account that runs Splunk SOAR.

Run the following commands:

phenv rabbitmqctl stop_app

phenv rabbitmqctl change_cluster_node_type disc

phenv rabbitmqctl start_app

An overview of the Splunk SOAR (On-premises) clustering feature

Elements of a Splunk SOAR (On-premises) cluster

The role of Consul

Recovering a Consul cluster that is unable to elect a leader

The role of RabbitMQ

Set RabbitMQ nodes to disc mode

See also

Comments

An overview of the Splunk SOAR (On-premises) clustering feature

Was this topic useful?