Recommendations
The current recommendations for scaling the solution are based on the size and the desired responsiveness of your environment. We measure environment size by the number of ESX/i hosts being managed by a given VC instance and by the number of VMs being monitored. We determine environment size as follows:
Scaling approaches based on environment size
- Small environments (less than 10 hosts): A single engine.conf file is recommended for small environments that do not require real-time monitoring. It is suitable for demo and test environments, and the file maintenance cost is low.
- Medium environments (10 to 50 hosts): For a larger environment with greater demand and a need for more responsiveness, organize engine.conf by action. At a minimum, separate the collection of performance data and hierarchy data (hierarchy drives many menu systems) into parallel stanzas; this improves collection of the different data types while still minimizing the number of parallel engines that must run. This approach reduces the maintenance requirements for engine.conf and the load on the forwarder appliances while still giving reasonable parallelization. You can remove ESX/i hosts that require special treatment from the action-specific stanzas and run their actions (or a subset of them) inside a host-specific stanza.
- Large environments (more than 50 hosts): You can approach a large environment in a number of ways.
- Organize your environment into smaller segments across multiple forwarder appliance environments, each with around 20-50 hosts, following the settings above.
- Split your environment by action and host to get more responsive data. This prevents long-running inventory stanzas from blocking other stanzas and lets performance-gathering stanzas respond quickly. The filters for performance and inventory are very useful because you can break up the actions to look only for specific types of data. For example, if Host1 runs a number of virtual machines, you can split up the stanzas using perfManagedEntityWhitelist, so that one Engine.pm process collects only virtual machine performance data while another collects only host data. The inventory blacklist (invBlacklist) can also limit the data that you collect from inventory, shortening the overall collection time. A sketch of this action-based split follows this list.
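The following is a minimal sketch of what such an action-based split could look like in engine.conf. It assumes the standard Splunk .conf format of [stanza] headers with key = value pairs. Apart from perfManagedEntityWhitelist and the PerfDiscovery action, which the solution uses, the stanza names, the host and actions keys, the HierarchyDiscovery action name, and the whitelist value format are illustrative assumptions; check the engine.conf shipped with your FA for the exact key names and value formats.

    # Hypothetical action-based split for one VC. Key names other than
    # perfManagedEntityWhitelist, and the value formats, are placeholders.

    # Hierarchy data drives many menu systems, so it gets its own stanza
    # and is never blocked by long-running inventory or performance runs.
    [vc1-hierarchy]
    host = vc1.example.com            # hypothetical target key
    actions = HierarchyDiscovery      # hypothetical action name

    # Performance data runs in a parallel stanza restricted to VM metrics.
    [vc1-perf-vms]
    host = vc1.example.com
    actions = PerfDiscovery
    perfManagedEntityWhitelist = VirtualMachine    # illustrative value format

    # A second parallel stanza collects host performance metrics only.
    [vc1-perf-hosts]
    host = vc1.example.com
    actions = PerfDiscovery
    perfManagedEntityWhitelist = HostSystem        # illustrative value format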
Scaling examples
In the largest environments, where each VC has more than 50 hosts under management, the scale-out structure could look something like this:
- A single FA dedicated to collecting hierarchy data for all VCs and ESX/i hosts. A dedicated FA is needed if one or more VCs are very large (each managing many more than 50 hosts).
- A single FA dedicated to collecting inventory data for each VC. This is needed if one or more VCs are very large (managing many more than 50 hosts). If fewer hosts are managed per VC, you can use one FA to collect data from multiple VCs.
- An FA dedicated to gathering performance data from some number of hosts, up to the point where CPU or memory tops out, or data-gathering time becomes too long and gaps show up in the data. If dedicated to gathering only performance data, a single FA can collect data from 30 to 50 hosts (or possibly more). The actual maximum ratio of FAs to monitored ESX/i hosts must be calculated for your environment, because the gathering time and data volumes depend on the size of each ESX/i host's inventory (the number of running VMs).
- An FA dedicated to gathering log data from some number of hosts, up to the point where CPU or memory tops out, or data-gathering time becomes too long and gaps show up in the data. If dedicated to gathering only log data, a single FA can handle 30 to 50 hosts or more. The actual maximum ratio of FAs to monitored ESX/i hosts must be calculated for your environment, because the gathering time and data volumes depend on the amount of activity on the ESX/i host (number of VMs, task execution, administrative operations, DRS activity, and so on).
- A single FA dedicated to getting task and event data from all VCs, and task data from all unmanaged ESX/i hosts. The volume of tasks and events tends to be low enough that a single FA can collect data from all VCs up to the targeted maximum solution limits (500 hosts, 10K VMs).
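As a concrete illustration, two of these dedicated FAs might carry engine.conf stanzas like the following. The action names Tasks, Events, and LogDiscovery appear in the solution; the stanza names, the host list syntax, and the host and actions keys are hypothetical placeholders.

    # FA dedicated to tasks and events: volume is low enough that one FA
    # can cover every VC up to the 500-host / 10K-VM solution limits.
    [all-vcs-tasks-events]
    host = vc1.example.com, vc2.example.com   # hypothetical list syntax
    actions = Tasks, Events

    # FA dedicated to log data for one block of 30-50 ESX/i hosts; deploy
    # another FA when gathering time grows and gaps appear in the data.
    [esx-block1-logs]
    host = esx01.example.com, esx02.example.com
    actions = LogDiscovery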
Other approaches for scaling
These are some other approaches that are not discussed in detail here:
Increase the default number of CPUs and the memory allocated to the FA VM instead of deploying multiple FAs. For example, if you find that a single instance of the engine cannot gather data from 80 ESX/i hosts fast enough (see example 5), the FA may be CPU-bound. In that case it makes sense to scale up the FA VM itself rather than running more FA VMs or engines in parallel.
Split the ESX/i host stanzas across multiple engine instances, multiple engine.conf files, or even multiple FA instances. Do this when a single instance of the engine cannot gather data fast enough because the target machines cannot return it quickly. Use actions like PerfDiscovery, LogDiscovery, Tasks, and Events to split the work.
Use invBlacklist to split inventory collection across different engine instances when InventoryDiscovery takes too long to gather data from the VC. For example, if your environment has a large number of VMs (1000) on a small number of ESX/i hosts (10), you could collect inventory data for the "VirtualMachine" managed entity in one engine instance and gather all other kinds of inventory data from a second instance, as sketched below.
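A minimal sketch of that split, assuming the standard Splunk .conf stanza format: invBlacklist and the InventoryDiscovery action come from the solution, while the comma-separated value format, the other managed-entity type names, and the host and actions keys are illustrative assumptions.

    # Engine instance A: VM inventory only. Blacklist the other entity
    # types (type names besides VirtualMachine are illustrative).
    [vc1-inventory-vms]
    host = vc1.example.com
    actions = InventoryDiscovery
    invBlacklist = HostSystem, Datastore, ResourcePool, ClusterComputeResource

    # Engine instance B: everything except the 1000 VMs, so neither run
    # has to walk the whole inventory tree by itself.
    [vc1-inventory-other]
    host = vc1.example.com
    actions = InventoryDiscovery
    invBlacklist = VirtualMachine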
There is no single way to structure engine.conf files that will scale for every kind of environment. Understanding your environment - the resources and the data you want to collect from them - is key to deciding how to structure your engine.conf files.
Also, it is the number of ESX/i hosts managed by a given VC that is the key determining factor of scale. If this number is low for each VC (< 50 hosts), it will be easier to structure the solution.