Distributed Splunk overview
This manual describes how to distribute various components of Splunk functionality across multiple machines. By distributing Splunk, you can scale its functionality to handle the data needs of enterprises of any size and complexity.
In single-machine deployments, one instance of Splunk handles the entire end-to-end process, from data input through indexing to search. A single-machine deployment can be useful for testing and evaluation purposes and might serve the needs of department-sized environments. For larger environments, however, where data originates on many machines and where many users need to search the data, you'll want to distribute functionality across multiple Splunk instances. This manual describes how to deploy and use Splunk in such a distributed environment.
How Splunk scales
Splunk performs three key functions as it moves data through the data pipeline. First, Splunk consumes data from files, the network, or elsewhere. Then it indexes the data. (Actually, it first parses and then indexes the data, but for purposes of this discussion, we consider parsing to be part of the indexing process.) Finally, it runs interactive or scheduled searches on the indexed data.
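The three stages can be pictured with a toy model. The sketch below is not Splunk code: the function names (`consume`, `index_events`, `search`) and the in-memory token index are assumptions made purely to illustrate the input, indexing, and search stages described above.

```python
from collections import defaultdict

def consume(raw_text):
    """Input stage (hypothetical): split a raw stream into one event per non-empty line."""
    return [line for line in raw_text.splitlines() if line.strip()]

def index_events(events):
    """Parse-and-index stage (hypothetical): map each lowercase token to the ids of events containing it."""
    index = defaultdict(set)
    for event_id, event in enumerate(events):
        for token in event.lower().split():
            index[token].add(event_id)
    return index

def search(index, events, term):
    """Search stage (hypothetical): return the events containing the term, in input order."""
    return [events[i] for i in sorted(index.get(term.lower(), ()))]

raw = "ERROR disk full\nINFO user login\nERROR timeout\n"
events = consume(raw)
index = index_events(events)
matches = search(index, events, "error")
print(matches)
```

In a distributed deployment, each of these stages can run on a different machine, which is the idea the rest of this topic develops.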
You can split this functionality across multiple specialized instances of Splunk, ranging in number from just a few to thousands, depending on the quantity of data you're dealing with and other variables in your environment. You might, for example, create a deployment with many Splunk instances that only consume data, several other instances that index the data, and one or more instances that handle search requests. These specialized instances of Splunk are known collectively as components. There are several types of components.
For a typical mid-size deployment, for example, you can deploy lightweight versions of Splunk, called forwarders, on the machines where the data originates. The forwarders consume data locally and then forward the data across the network to another Splunk component, called the indexer. The indexer does the heavy lifting; it indexes the data and runs searches. It should reside on a machine by itself. The forwarders, on the other hand, can easily coexist on the machines generating the data, because the data-consuming function has minimal impact on machine performance.
[Diagram: several forwarders sending data to a single indexer]
As you scale up, you can add more forwarders and indexers. For a larger deployment, you might have hundreds of forwarders sending data to a number of indexers. You can use load balancing on the forwarders, so that they distribute their data across some or all of the indexers. Not only does load balancing help with scaling, but it also provides a failover capability if one of the indexers goes down. The forwarders automatically switch to sending their data to the indexers that remain available.
[Diagram: each forwarder load-balancing its data across two indexers]
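As a rough illustration of load balancing with failover, the sketch below models a forwarder that rotates across its indexers and skips any that are down. It is a simplification, not Splunk's implementation: the `Forwarder` class, the per-event round-robin rotation, and the `alive` set are all assumptions, and this section does not describe how Splunk actually selects an indexer.

```python
import itertools

class Forwarder:
    """Toy forwarder (hypothetical) that rotates across indexers and skips ones that are down."""

    def __init__(self, indexers):
        self.indexers = list(indexers)
        self._cycle = itertools.cycle(self.indexers)

    def send(self, event, alive):
        """Pick the next live indexer for this event and return its name."""
        # Try each indexer at most once per event before giving up.
        for _ in range(len(self.indexers)):
            target = next(self._cycle)
            if target in alive:
                return target
        raise RuntimeError("no indexers available")

fwd = Forwarder(["idx1", "idx2"])

# Both indexers up: events alternate between them.
both_up = [fwd.send(f"event {n}", alive={"idx1", "idx2"}) for n in range(4)]

# idx1 goes down: every event fails over to idx2.
one_down = [fwd.send(f"event {n}", alive={"idx2"}) for n in range(3)]

print(both_up)
print(one_down)
```

The point of the sketch is that the same rotation logic gives both behaviors: spreading load when all indexers are up, and failing over when one is not.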
To coordinate and consolidate search activities across multiple indexers, you can also separate out the functions of indexing and searching. In this type of deployment, called distributed search, each indexer just indexes data and performs searches across its own indexes. A Splunk instance dedicated to searching, called the search head, coordinates searches across the set of indexers, consolidating the results and presenting them to the user.
[Diagram: a search head coordinating searches across multiple indexers]
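The division of labor in distributed search resembles a scatter-gather pattern, sketched below. This is a toy model under stated assumptions, not Splunk's search protocol: the `Indexer` and `SearchHead` classes, the `(timestamp, text)` event tuples, and the newest-first merge order are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

class Indexer:
    """Toy indexer (hypothetical) that searches only its own events."""

    def __init__(self, name, events):
        self.name = name
        self.events = events  # list of (timestamp, text) tuples

    def search(self, term):
        # Tag each hit with the indexer it came from.
        return [(ts, text, self.name) for ts, text in self.events if term in text]

class SearchHead:
    """Toy search head (hypothetical) that fans a search out and merges the results."""

    def __init__(self, indexers):
        self.indexers = indexers

    def search(self, term):
        # Scatter: run the same search on every indexer concurrently.
        with ThreadPoolExecutor() as pool:
            partials = list(pool.map(lambda ix: ix.search(term), self.indexers))
        # Gather: flatten the partial results and order them newest first.
        merged = [hit for partial in partials for hit in partial]
        return sorted(merged, key=lambda hit: hit[0], reverse=True)

idx_a = Indexer("idx_a", [(1, "ERROR disk full"), (3, "INFO login")])
idx_b = Indexer("idx_b", [(2, "ERROR timeout"), (4, "ERROR retry")])
head = SearchHead([idx_a, idx_b])
results = head.search("ERROR")
print(results)
```

Each indexer sees only its own data, and the user sees a single consolidated result set, which matches the roles described above.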
For the largest environments, you can deploy a pool of several search heads sharing a single configuration set. With search head pooling, you can coordinate simultaneous searches across a large number of indexers.
[Diagram: a pool of search heads coordinating searches across a large number of indexers]
These diagrams illustrate a few basic deployment topologies. You can combine the Splunk functions of data input, indexing, and search in a great variety of ways. For example, you can set up the forwarders so that they route data to multiple indexers, based on specified criteria. You can also configure forwarders to process data locally before sending the data on to an indexer for storage. In another scenario, you can deploy a single Splunk instance that serves as both search head and indexer, searching across not only its own indexes but also the indexes on other Splunk indexers. You can mix and match Splunk components as needed. The possible scenarios are nearly limitless.
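Routing by criteria, mentioned above, can be pictured as an ordered list of rules where the first match wins. The sketch below is only an illustration: the `route` function, the rule predicates, the event fields, and the destination names such as `idx_security` are hypothetical and do not reflect Splunk's routing configuration.

```python
def route(event, rules, default):
    """Return the destination of the first rule whose predicate matches the event."""
    for predicate, destination in rules:
        if predicate(event):
            return destination
    return default

# Hypothetical criteria: route by source type first, then by message content.
rules = [
    (lambda e: e["sourcetype"] == "syslog", "idx_security"),
    (lambda e: "ERROR" in e["text"], "idx_errors"),
]

events = [
    {"sourcetype": "syslog", "text": "sshd login failed"},
    {"sourcetype": "app", "text": "ERROR timeout"},
    {"sourcetype": "app", "text": "INFO started"},
]

destinations = [route(e, rules, default="idx_main") for e in events]
print(destinations)
```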
This manual describes how to scale a deployment to fit your exact needs, whether you're managing data for a single department, a global enterprise, or anything in between.
Manage your Splunk deployment
Splunk provides a few key tools to help manage a distributed deployment:
- Deployment server. This Splunk component provides a way to centrally manage configurations and content updates across your entire deployment. See "About deployment server" for details.
- Deployment monitor. This app can help you manage and troubleshoot your deployment. It tracks the status of your forwarders and indexers and provides early warning if problems develop. See "About deployment monitor" for details.
What comes next
The rest of this Overview section covers:
- How data moves through Splunk: the data pipeline
- Scale your deployment: Splunk components
- Components and roles
It starts by describing the data pipeline, from the point where data enters Splunk to the point where it becomes available for users to search. Next, the overview describes how Splunk functionality can be split into modular components. It then correlates the available Splunk components with their roles in the data pipeline.
The remaining sections of this manual describe the Splunk components in detail, explaining how to use them to create a distributed Splunk deployment.
For information on capacity planning based on the scale of your deployment, read "Hardware capacity planning for your Splunk deployment" in the Installation manual.
How data moves through Splunk: the data pipeline