Get data in with the Collect service and a pull-based connector

The Collect service is a scalable data collection service that pulls large volumes of data from external data sources and ingests it into your data pipeline. The Collect service can be integrated into the Splunk Data Stream Processor (DSP) through pull-based connectors.

Connector concepts and terminology

A connector is an extension of the Collect service that can extract events or metrics from a data source. To understand how a connector works, you need to know the following terms:

Scheduled job: Defines when, what, and how to collect data
Execution: One iteration of a scheduled job
Worker: A software module that collects and processes data, and then sends the results to the downstream destination

Configuration parameters

The following common configuration parameters are used by all connectors:

name: The name of your job.
connectorID: The registered connector image name. The connectorID cannot be changed once it is assigned.
schedule: The cron schedule for your job, in UTC. The configuration uses standard five-field cron syntax: minute, hour, day of month, month, day of week.
scheduled: Optional. Default true. Set this parameter to false to stop the scheduled job from automatically executing on the next CRON cycle. Jobs that are currently running are not affected.
eventExtraFields: Optional. An array of custom name-value pairs that can be used to annotate events, allowing one DSP pipeline to process events from multiple Collect service jobs. If a field in eventExtraFields conflicts with a field in an event, the eventExtraFields field takes precedence.
parameters: The parameters used to configure the connector.
workers: The number of workers you want to use to collect data.
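
As a rough illustration of how the common parameters fit together, the following Python sketch assembles a job configuration and shows the precedence rule for eventExtraFields. It is not the Collect service API. The helper names (build_job, annotate_event), the connector ID, the contents of parameters, and the name/value shape of each eventExtraFields entry are all assumptions made for this example. Check your connector's documentation for the exact payload.

```python
import json

MAX_WORKERS = 20  # job limit stated in this topic


def build_job(name, connector_id, schedule, parameters, workers=1,
              scheduled=True, event_extra_fields=None):
    """Assemble a job configuration dict from the common parameters.

    Hypothetical helper: only the key names come from this topic.
    """
    if len(schedule.split()) != 5:
        raise ValueError("schedule must have 5 cron fields")
    if not 1 <= workers <= MAX_WORKERS:
        raise ValueError(f"workers must be between 1 and {MAX_WORKERS}")
    job = {
        "name": name,
        "connectorID": connector_id,
        "schedule": schedule,
        "scheduled": scheduled,
        "parameters": parameters,
        "workers": workers,
    }
    if event_extra_fields:
        job["eventExtraFields"] = event_extra_fields
    return job


def annotate_event(event, event_extra_fields):
    """Apply the stated rule: on a name conflict, the extra field wins."""
    annotated = dict(event)
    for pair in event_extra_fields:
        # Assumed entry shape: {"name": ..., "value": ...}
        annotated[pair["name"]] = pair["value"]
    return annotated


job = build_job(
    name="my-example-job",
    connector_id="example-connector",          # hypothetical connector ID
    schedule="*/10 * * * *",                   # every 10 minutes, UTC
    parameters={"example_key": "example_value"},  # connector-specific
    workers=2,
    event_extra_fields=[{"name": "team", "value": "ops"}],
)
print(json.dumps(job, indent=2))
```

Calling annotate_event with an event that already has a team field replaces that value with the one from eventExtraFields, which is how one pipeline can tell apart events from several jobs.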

Limitations of the Collect service

The Collect service has the following job limitations:

A maximum of 20 workers per job

The Collect service has the following scheduling limitations:

Jobs must be scheduled to run at least once per week. If a job runs less often than once per week, you might see duplicate data.
Jobs must be scheduled to run no more than once every 5 minutes. If a job is scheduled to run more frequently than that, some scheduled executions might be skipped. For example, if a job is scheduled to run once per minute, it runs only once in each 5-minute period and the other 4 scheduled executions are skipped.
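
The scheduling limits above can be checked with simple arithmetic before you submit a job. The following Python sketch is illustrative only. The function names are hypothetical, and the skip estimate assumes that at most one execution runs in each 5-minute period, which matches the once-per-minute example but is not a documented formula.

```python
MIN_INTERVAL_MINUTES = 5            # run no more than once every 5 minutes
MAX_INTERVAL_MINUTES = 7 * 24 * 60  # run at least once per week


def skipped_per_period(interval_minutes):
    """Estimate executions skipped in each 5-minute period.

    Assumption: only one execution runs per 5-minute period.
    """
    if interval_minutes >= MIN_INTERVAL_MINUTES:
        return 0
    scheduled = MIN_INTERVAL_MINUTES // interval_minutes
    return scheduled - 1


def check_interval(interval_minutes):
    """Return a list of warnings for a planned run interval, in minutes."""
    warnings = []
    if interval_minutes < MIN_INTERVAL_MINUTES:
        skipped = skipped_per_period(interval_minutes)
        warnings.append(
            f"too frequent: about {skipped} execution(s) skipped "
            f"per {MIN_INTERVAL_MINUTES}-minute period"
        )
    if interval_minutes > MAX_INTERVAL_MINUTES:
        warnings.append("too infrequent: duplicate data is possible")
    return warnings


for minutes in (1, 10, 14 * 24 * 60):
    print(minutes, check_interval(minutes))
```

An interval of 10 minutes produces no warnings, while 1 minute and 2 weeks each produce one.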

Data ingested can be delayed

Data ingestion can be delayed for the following reasons:

The latency in the data provided by the data source
The volume of raw data ingested. For example, ingesting 1 GB of data takes longer than ingesting 1 MB of data.
The volume of data ingested from upstream sources such as from the Ingest REST API or a forwarder

Adding workers might improve data ingestion rates, but external factors still influence the speed of data ingestion.

Permissions

By default, users only have rights to view and change their own pipelines. Users can't see pipelines created by other users in their tenant or the user list for the tenant. Administrators have full rights to view all pipelines and users in each tenant.

See Manage users and admins for more information on the permissions assigned to the user and administrator roles.

Use a pull-based connector with Splunk DSP

Pull-based connectors collect events from external sources and send them into your DSP pipeline through the Collect service.

To use a pull-based connector, complete the following steps:

Choose the data source that you want to create a connection to.
Create your connection and use it in your data pipeline.
