Edge Processor Validated Architecture

The Edge Processor solution is a data processing engine that works at the edge of your network. Use the Edge Processor solution to filter, mask, and transform your data close to its source before routing the processed data to Splunk and Amazon S3. Edge Processors are hosted on your infrastructure, so data doesn't leave the edge until you want it to.

Overall Benefits

The Edge Processor solution:

  • Provides the ability to process and route data at arbitrary boundaries.
  • Enables real-time processing pipelines to be applied to data en route.
  • Uses SPL2 to author processing pipelines (see the sketch following this list).
  • Can scale both vertically and horizontally to meet processing demand.
  • Can be deployed in a highly available and load-balanced configuration as part of your data availability strategy.
  • Provides flexible scaling of both data flows and infrastructure.
  • Use cases and enablers:
    • Organizational requirements for centralized data egress
    • Routing, forking, or cloning events to multiple Splunk deployments and/or Amazon S3
    • Masking, transforming, enriching, or otherwise modifying data in motion
    • Reducing certificate or security complexity
    • Removing or reducing ingestion load on the indexing tier by offloading initial data processing, such as line breaking
    • Relocating data processing away from the indexers to reduce the need for rolling restarts and other availability disruptions
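
To make the pipeline concept concrete, the following is a minimal SPL2 sketch of a pipeline that filters and masks events before routing them, in the general shape used by the Edge Processor pipeline editor. The sourcetype, regular expression, and mask value are hypothetical placeholders, not a definitive implementation:

$pipeline = | from $source
    // Keep only the events this pipeline is responsible for
    | where sourcetype == "syslog"
    // Mask values that look like US Social Security numbers before egress
    | eval _raw = replace(_raw, /\d{3}-\d{2}-\d{4}/, "XXX-XX-XXXX")
    | into $destination;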

Architecture and topology

Refer to the Splunk documentation for the latest system architecture, which provides a high-level view of the components of an Edge Processor solution deployment. However, there are several key points about the default topology that are critical to understand for a successful Edge Processor solution deployment:

  • The term Edge Processor refers to a logical grouping of instances that share the same configuration.
  • All pipelines deployed to an Edge Processor run concurrently on all instances in that Edge Processor.
  • Edge Processor instances are not aware of one another. There is no inter-instance communication, synchronization, balancing, or other coordinated behavior.
  • All Edge Processor instances are managed by the Splunk Cloud control plane.
  • Edge Processors collect and send telemetry and analytics data to Splunk Cloud separately from the processed data flow.
  • Raw data is processed and routed only as described by each Edge Processor's deployed configuration, that is, customer data isn't included in any telemetry.

Edges and domains

An edge, in the context of Edge Processor and Splunk data routing, refers to any step between the source and the destination where event control is required. Some common edges are data centers, clouds and cloud availability zones, network segments, VLANs, and infrastructure that manages regulated or compliance-related data (PII, HIPAA, PCI, GDPR, and so on). Data domains represent the relationships between the data source and the edge. When data traverses an edge, the data has left the originating data domain.

In Splunk terms, edges generally correlate with other "intermediate forwarding" concepts, so you may often see edges discussed alongside Edge Processor, heavy and universal forwarders, and OpenTelemetry (OTel) collectors, all of which can serve as intermediate forwarders and effect change on events.

Intermediate Routing - close to source

Edge Processors can be deployed close to the source, logically within the same data domain as the data source.

Benefits

  • Edge Processors are immediate "neighbors" of the data, with generally fewer firewall or network configuration considerations.
  • Easier to apply data domain-specific enrichment.
  • Less susceptible to data loss due to networking interruptions.
  • Events can be modified prior to egress from a data domain.
  • May reduce network and firewall complexity by acting as a data funnel.

Limitations

  • Provisioning hardware for Edge Processor instances may be challenging in very small or very large distributed environments.
  • Potentially more infrastructure to manage.
  • Can result in larger data payloads, which may be undesirable if next-hop networks are expensive or otherwise constrained.

Intermediate routing - close to destination

Edge Processors can be deployed close to the destination, logically in a different data domain than the data source, often within the same data domain as the destination.

Benefits

  • Reduces distributed hardware sprawl via centralized scale-up and scale-out.
  • Can act as a catch-all or final point of contact for data administrators.
  • Centralized routing and destination management.

Limitations

  • Higher risk of network disruption during transit.
  • Potentially a larger number of pipelines, or greater pipeline complexity, to account for all streams from all data domains.
  • Filtering or enriching based on the origin domain may not be possible.

Multi-hop

There is no restriction on the number of Edge Processors that events can travel through. In situations where it is desirable to have event processing near the data source as well as near the destination, Edge Processors can send to and receive from one another.

Benefits

  • The benefits of both styles of control can be utilized.
  • Fine-grained event control end to end.
  • Event processing and routing can be optimized for complex infrastructures.

Limitations

  • Additional hardware is required, potentially 2-3x.
  • Results in pipeline sprawl, requiring more administration and naming complexity.
  • More hops can complicate debugging.

When data spans multiple hops, it can be helpful to include markers in the events that indicate which systems they have traversed. For example, adding the following SPL2 to your pipeline builds a list of traversed Edge Processors in a field called "hops":

| eval hops=hops+"[your own marker here]"
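
Note that concatenating to a field that does not yet exist yields null, so it can be safer to seed the field on the first hop. A minimal sketch, using a hypothetical marker of ep-dc1:

| eval hops = coalesce(hops, "") + ",ep-dc1"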

Management Strategy

Once you have decided where Edge Processor instances will be provisioned, you must decide how to manage the deployment of pipelines to those instances.

Per Domain Configuration

In most cases, each data domain that requires event processing has its own Edge Processor, and the instances belonging to that Edge Processor are specific to that data domain. Organizing in this manner generally results in easy-to-understand naming and data flow, and offers the best reporting fidelity. When organized this way, all instances in each Edge Processor belong to the same Splunk output group for purposes of load balancing.

Benefits

  • Data domain and Edge Processor names are easily correlated.
  • Pipeline and Edge Processor metrics are aligned with the hardware deployment.
  • Pipelines can be deployed on any number of Edge Processors.

Limitations

  • Can result in a large number of Edge Processors in large distributed environments when repetitive or duplicate data domains are present.

Stretched Configuration

When the data source types, ports, processing requirements, and destinations are the same across more than one data domain, all of the instances can belong to the same Edge Processor. There is no requirement that each data domain have its own Edge Processor.

Benefits

  • Reduces Edge Processor sprawl in large distributed but identical infrastructures, such as in retail, manufacturing, or banking.

Limitations

  • Reduced reporting fidelity.
  • All Edge Processor settings must be valid for all domains, specifically network ports and certificates.

Input Load Balancing

Edge Processor supports three input types: Splunk (S2S), HTTP Event Collector (HEC), and syslog over UDP and TCP. Edge Processor has no specific configuration or support for load balancing these protocols, so best practices for each protocol should be followed, using the Edge Processor instances as targets. High-level strategies and architectures for each input type are covered below, but specific protocol optimization and tuning is out of scope for this validated architecture.

Splunk

Splunk universal and heavy forwarders should continue to use outputs.conf with the list of Edge Processor instances as a server group to enable output load balancing. Forwarders behave the same when Edge Processors are the output targets as they do when indexers or other forwarders are the targets.
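
As a minimal sketch, an outputs.conf on the forwarder might look like the following. The group name, hostnames, and port are hypothetical placeholders; adjust the TLS and load-balancing settings to fit your environment:

# outputs.conf on the universal or heavy forwarder
[tcpout]
defaultGroup = edge_processors

[tcpout:edge_processors]
# The forwarder load balances across all instances in this Edge Processor
server = ep1.example.com:9997, ep2.example.com:9997, ep3.example.com:9997
# S2S acknowledgement is not supported by Edge Processor (see the limitations below)
useACK = false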

Benefits

  • No significant changes to normal forwarder behavior.
  • Availability and queueing are managed by the forwarder.
  • Usually no certificate requirements for the forwarder.

Limitations

  • S2S acknowledgement is not supported.
  • There is no Edge Processor instance auto-discovery or auto-configuration by the forwarder.
  • Edge Processor does not support httpout from universal forwarders, only S2S.

HTTP Event Collector (HEC)

Where more than one Edge Processor instance is configured to receive HEC events, an independent load-balancing mechanism must be used. Edge Processor has no built-in mechanism for load balancing HEC traffic.

Some examples of load balancing HEC traffic to Edge Processor:

Network or Application load balancers

The most common approach to load balancing HEC uses the same technology that balances most other HTTP traffic: classic network load balancers or application load balancers. Purpose-built load balancers offer a managed solution that provides a single endpoint and intelligently distributes HEC traffic to one or more Edge Processors.

Benefits

  • Most complete and reliable solution.
  • Most load balancers include health checks to prevent connections to unhealthy servers.
  • Can offload the TLS workload.
  • Many different load-balancing strategies are available.

Limitations

  • Requires load balancer infrastructure.

DNS Round Robin

DNS Round Robin is a simple load-balancing method used to distribute network traffic across multiple servers. In this approach, DNS is configured to rotate through a list of Edge Processor IP addresses associated with a single domain name. When an HEC source makes a request to the domain name, the DNS server responds with one of the IP addresses from the list.

Benefits

  • Easy to implement.
  • Does not require extra infrastructure or complex configuration.
  • Simple solution when the source can only reference a single URL for sending events.

Limitations

  • Sources often cache DNS results, resulting in pinned connections.
  • DNS is unaware of Edge Processor instance availability. Clients may be directed to offline servers and must be tolerant of connection failures.

Client-side load balancing

In scenarios where HEC integration is scripted, or access to the sending code is otherwise available, the responsibility for load balancing, queuing, retrying, and monitoring availability can be managed by the sending client. This design moves complexity from central network infrastructure to the individual sending entities. Each sender can dynamically adjust to downstream Edge Processor instance availability by implementing retry mechanisms that respond to errors and timeouts. The client implementation can be made as reliable as required.

Benefits

  • Data handling functionality is customizable.
  • The scale and reliability of the integration can be built to fit data requirements.
  • Does not require extra routing infrastructure.
  • Edge Processor instances support a health endpoint.

Limitations

  • Managing retry and acknowledgement queues can be complex and challenging to scale.
  • Increased technical debt.

Syslog

In many cases, systems that use syslog can only specify a single server name or IP address in their syslog configuration. There are several approaches to distributing or load balancing syslog traffic among Edge Processors to avoid a single point of failure. The guidance in this document on load balancing the syslog protocol is not specific to Edge Processor and is intentionally brief. Refer to About Splunk Validated Architectures or Splunk Connect for Syslog for more information. See https://datatracker.ietf.org/doc/html/rfc5424 for the relevant syslog RFC.

DNS Round Robin

DNS Round Robin uses a single friendly name to represent a list of possible listeners. The same considerations as with DNS Round Robin for HEC apply to syslog. In particular, caching of DNS records can result in stalled data even when healthy instances are available.

Network Load Balancer

The same considerations as with Network load balancers for HEC apply to syslog with some notable differences:

  • Syslog analysis often relies on the sender's IP address, which is typically lost behind a load balancer. If the sending IP address is needed, the load balancer requires special configuration.
  • There is no specific syslog health check that can be used by a load balancer.
  • BGP or other layer 3 load balancing strategies may be considered for very large or sensitive syslog environments. Consult with a Splunk Architect or Professional Services in this case.

For further guidance and best practices for load balancing syslog traffic, refer to the Syslog Validated Architecture.

Port Mapping

Similar to Splunk heavy forwarders using TCP or UDP listeners to receive syslog data, Edge Processor can be configured to open one or more arbitrary ports on which syslog data is received. RFC compliance, source, and sourcetype are assigned per port and applied to each event arriving on that port. The data sources, use cases, pipeline structure, and destinations all need to be considered when choosing a syslog port assignment strategy, because port assignment can directly affect pipeline structure and performance.

Considerations

There are several configuration patterns to consider when building your Edge Processor syslog topology.

While the following examples are the most common implementations, keep in mind that these port and pipeline configurations are not mutually exclusive, and in practice the actual topology tends to evolve over time.

In the first configuration, each unique device and sourcetype is assigned a specific port, and each sourcetype is processed by a unique pipeline. This results in one pipeline per sourcetype, and multiple ports may supply data to the same pipeline when they are assigned the same sourcetype.

Benefits

  • 1:1 mapping allows for more granular control and management of data.
  • Can be easier to manage time zones and time zone inconsistencies.
  • Often easier to adapt to new data sources.
  • Pipeline complexity is reduced.

Limitations

  • Some syslog sources force specific ports.
  • Can lead to port and pipeline sprawl at large scale with many device types.
  • Potentially more complex syslog load-balancing implementation.

In the second configuration, one or more syslog sources share a single generic sourcetype, and a single pipeline is responsible for processing that sourcetype into distinct sourcetype events.
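
A minimal SPL2 sketch of this pattern follows. The generic sourcetype, match patterns, and final sourcetype names are hypothetical placeholders; real pipelines need patterns specific to your devices:

$pipeline = | from $source
    // All events arrive under one generic sourcetype
    | where sourcetype == "syslog"
    // Detect each device type and assign its final sourcetype
    | eval sourcetype = if(match(_raw, /%ASA-\d/), "cisco:asa", sourcetype)
    | eval sourcetype = if(match(_raw, /sshd\[/), "linux:sshd", sourcetype)
    | into $destination;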

Benefits

  • May help when syslog port availability is restricted.
  • Reduced network port and load-balancing complexity.
  • Each syslog event is processed only once.

Limitations

  • Increased pipeline complexity.
  • All syslog flows depend on the operation of a single pipeline (single point of failure).
  • Event processing debugging can be complex.

In the third configuration, as with the prior one, one or more syslog sources share a single generic sourcetype, and the pipelines are responsible for detecting and processing distinct sourcetype events. However, in this configuration a distinct pipeline is used for each unique sourcetype: the initial partition is the generic sourcetype, and an initial filter selects the events for that specific sourcetype.
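
A minimal SPL2 sketch of one such pipeline, again with hypothetical placeholder names and patterns; each specific sourcetype gets its own pipeline with a different initial filter:

// Pipeline dedicated to Cisco ASA events arriving under the generic sourcetype
$pipeline = | from $source
    | where sourcetype == "syslog" AND match(_raw, /%ASA-\d/)
    | eval sourcetype = "cisco:asa"
    | into $destination;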

Benefits

  • May help when syslog port availability is restricted.
  • Reduced network port and load-balancing complexity.
  • Per-sourcetype logic is constrained to a single pipeline.
  • Less complex pipelines and event debugging.

Limitations

  • Increased system resource consumption.
  • All data that shares the same sourcetype is processed by all pipelines that use that sourcetype.
  • Care must be taken to prevent data duplication. Event filters must be well tested.

Splunk Destination

When sending events to Splunk, both S2S and HEC are validated and supported protocols. S2S is the default protocol and is configured as part of the first-time Edge Processor setup.

S2S - Splunk To Splunk

Benefits

  • Smaller payload on the wire.
  • Events can be delivered either parsed or unparsed, depending on the initial data source.

Limitations

  • Non-standard firewall port requirements, compared to HTTP.
  • S2S acknowledgement is not supported.

HEC - HTTP Event Collector

Benefits

  • Standard, well-known port requirement (HTTP).
  • In-place support for existing HEC tokens and HEC data flows. Tokens can be passed through.

Limitations

  • Edge Processor only supports the HEC event endpoint. It does not support raw or httpout.
  • Events received by Edge Processor via HEC always arrive parsed and are not subject to TRANSFORMS operations.
  • Outbound HEC acknowledgement is not supported.

Asynchronous Load Balancing From Splunk Agents

To learn more about Splunk asynchronous load balancing, see the Splunkd intermediate forwarding validated architecture.

Asynchronous load balancing is used to spread events more evenly across all available servers in the tcpout group of a forwarder. Traditionally, the output servers are indexers, but the configuration is also valid when the output servers are Edge Processors. When configuring outputs from high-volume forwarders to Edge Processors, asynchronous load balancing can improve throughput.

There are no asynchronous load balancing settings specific to the output that Edge Processors send to Splunk.

Funnel with Caution/Indexer Starvation

To learn more about indexer starvation, see the intermediate forwarding validated architecture.

The use of Edge Processors introduces the same funneling concerns as when heavy forwarders are used as intermediate forwarders. Consolidation of forwarder connections and events into a small intermediate tier can introduce bottlenecks and performance degradation when compared to environments where intermediate forwarding is not used.

When building architectures that have large numbers of agents and indexers, consider scaling your Edge Processor infrastructure horizontally as a primary approach. Vertical scaling of the Edge Processor infrastructure will not result in a wider funnel.

Resiliency And Queueing

Pipelines, queuing, and data loss

Data received by an Edge Processor is stored in memory while it passes through the processing pipeline. Once an event has left the processor and been delivered to the exporter, it is queued on disk. The event remains in the disk-backed queue until the exporter successfully sends it to the destination.

Data from a source is only ever processed by a single Edge Processor instance. Even in scenarios where events are eligible for multiple pipelines, any given event is processed by only a single Edge Processor instance. Edge Processor instances are unaware of other Edge Processor instances, and data is never synchronized or otherwise reconciled between instances. Because of this, the scope of potential data loss is the amount of unprocessed data in memory on any given instance.


HTTP Event Collector Acknowledgement

The HEC data input for Edge Processor supports acknowledgement of events. From the client's perspective, this feature operates the same as HEC acknowledgement on Splunk. However, whereas the Splunk implementation of HEC acknowledgement can monitor the true indexing status, Edge Processor considers an event successfully acknowledged once the event has been received by the instance's exporter queue. Some time may pass between delivery of the event to the queue and receipt of the event by the destination index, so sending agents may register the event as delivered before the data is indexed or searchable.

Benefits

  • Enables compatibility with several add-ons, in particular the add-on for AWS.
  • Can address data resiliency and reporting requirements.

Limitations

  • Acknowledgements are local to each instance.
  • Requests from the client must be sticky end to end to retrieve the acknowledgement, such as client to load balancer and load balancer to Edge Processor.
  • Instances must maintain the acknowledgement queue and will consume more system resources.
  • An acknowledgement only represents delivery to the output queue and does not guarantee delivery or indexing.
  • Acknowledgements are stored in memory. If an Edge Processor instance restarts or crashes, the acknowledgement state for those events is lost, which may result in some duplicate events.

Size and scaling

Monitoring

There are many dimensions available for monitoring an Edge Processor and its pipelines. You can review all of the available metrics using mcatalog. The list of metrics grows over time, so it's best to review all available metrics and dimensions in your environment:

| mcatalog values(metric_name) WHERE index=_metrics AND sourcetype="edge-metrics"

There is no single metric that tells you it's time to scale up or down. Instead, monitor key metrics across your Edge Processors to establish baseline, expected usage. In particular:

  • Throughput in and out
  • Event counts in and out
  • Exporter queue size
  • CPU and memory consumption
  • Unique sourcetypes and agents sending data

Additionally, consider measuring event lag by comparing index time vs. event time as a general practice for GDI health, irrespective of the Edge Processor.
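
For example, a search along the following lines compares index time to event time to surface ingestion lag by sourcetype. The index name and time window are placeholders:

index=main earliest=-15m
| eval lag_seconds = _indextime - _time
| stats avg(lag_seconds) perc95(lag_seconds) max(lag_seconds) by sourcetype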

Scale up/down

As data processing requirements change, you must decide whether to scale by altering the resources available to your Edge Processor instances or by altering the number of instances doing the processing. As is the case with most technology, Edge Processor instances scale both vertically and horizontally depending on the circumstances, and scaling one way versus the other can lead to different outcomes. The following are some common scenarios and the most common scaling response:

Scale out

Scenario examples:

  • The number of data-sending clients increases.
  • You need to improve indexer event distribution and avoid funneling.
  • You want to spread out persistent queues so that less disk space is required per instance.
  • You want to improve resiliency and reduce the impact of instance failures.
  • You need to address data resiliency and reporting requirements.

Scale up

Scenario examples:

  • Data pipeline complexity increases, for example: more complex regular expressions, multi-value evals and mvexpand, branched pipelines, or more destinations.
  • Event size or event volume increases significantly.
  • Long persistent queues are required.

For most purposes, consider any substantial change to any of the following as cause to evaluate scale:

  • Event volume, both the number and size of events.
  • Number of forwarders or data sources.
  • Number and complexity of pipelines.
  • Change in target destinations.
  • Risk tolerance.

Any change to these factors will play a role in the overall resource consumption and processing speed of Edge Processor instances.
