Get data in with the Collect service and a pull-based connector
The Collect service is a scalable data collection service that pulls large volumes of data from external data sources and ingests it into your data pipeline. The Collect service can be integrated into the Splunk Data Stream Processor (DSP) through pull-based connectors.
Connector concepts and terminology
A connector is an extension of the Collect service that can extract events or metrics from a data source. To understand how a connector works, you need to know the following terms:
- Scheduled job: Defines when, what, and how to collect data
- Execution: One iteration of a scheduled job
- Worker: A software module that collects and processes data, and then sends the results to the downstream destination
Configuration parameters
The following common configuration parameters are used by all connectors:
- name: The name of your job.
- connectorID: The registered connector image name. The connectorID cannot be changed once it is assigned.
- schedule: The CRON job setting for your job, given in UTC format. The configuration uses standard CRON syntax: min hour day/month month day/week.
- scheduled: Optional. Default true. Set this parameter to false to stop the scheduled job from automatically executing on the next CRON cycle. Jobs that are currently running are not affected.
- eventExtraFields: Optional. An array of custom name-value pairs that can be used to annotate events, allowing one DSP pipeline to process events from multiple Collect service jobs. If a field in eventExtraFields conflicts with a field in an event, the eventExtraFields field takes precedence.
- parameters: The parameters used to configure the connector.
- workers: The number of workers you want to use to collect data.
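The common parameters above can be pictured as one job definition. The following Python snippet is a minimal sketch, not the Collect service API: only the key names come from the parameter list. Every value, the name/value shape assumed for eventExtraFields entries, and the helper functions describe_schedule and annotate are hypothetical and exist only for illustration.

```python
import json

# Hypothetical job definition. Only the key names come from the parameter list;
# all values are placeholders.
job = {
    "name": "example-job",
    "connectorID": "example-connector",
    "schedule": "0 * * * *",  # minute 0 of every hour, in UTC
    "scheduled": True,        # optional, defaults to true
    # Assumed shape: each entry is one name-value pair.
    "eventExtraFields": [{"name": "source_job", "value": "example-job"}],
    "parameters": {},         # connector-specific settings
    "workers": 2,
}

# Field order of the CRON syntax quoted in the parameter list.
CRON_FIELDS = ("min", "hour", "day/month", "month", "day/week")


def describe_schedule(schedule):
    """Split a five-field CRON expression into its named fields."""
    parts = schedule.split()
    if len(parts) != len(CRON_FIELDS):
        raise ValueError(
            f"expected {len(CRON_FIELDS)} CRON fields, got {len(parts)}"
        )
    return dict(zip(CRON_FIELDS, parts))


def annotate(event, extra_fields):
    """Copy an event and apply the extra fields; on conflict the extra field wins."""
    annotated = dict(event)
    for pair in extra_fields:
        annotated[pair["name"]] = pair["value"]
    return annotated


print(json.dumps(job, indent=2))
print(describe_schedule(job["schedule"]))
print(annotate({"source_job": "from-event", "body": "hello"}, job["eventExtraFields"]))
```

The annotate helper mirrors the precedence rule described for eventExtraFields: when an event already carries a field with the same name, the value from eventExtraFields replaces it.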
Limitations of the Collect service
The Collect service has the following job limitations:
- A maximum of 20 workers per job
The Collect service has the following scheduling limitations:
- Jobs must be scheduled to run at least once per week. If a job runs less often than once per week, you might see duplicate data.
- Jobs must be scheduled to run no more than once every 5 minutes. If a job is scheduled to run more frequently than once every 5 minutes, some scheduled executions might be skipped. For example, if a job is scheduled to run once per minute, it runs only once in each 5-minute period and the other 4 scheduled executions are skipped.
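The limits above can be checked before a job is submitted. The following Python snippet is a rough sketch under stated assumptions: the constants come from the limitations listed above, but check_limits and simulate_skips are hypothetical helpers, not part of the Collect service. The skip rule in simulate_skips (a trigger is dropped when it falls less than 5 minutes after the last executed trigger) is an assumption chosen to reproduce the once-per-minute example.

```python
MAX_WORKERS = 20                     # maximum workers per job
MIN_INTERVAL_MINUTES = 5             # run no more than once every 5 minutes
MAX_INTERVAL_MINUTES = 7 * 24 * 60   # run at least once per week


def check_limits(workers, interval_minutes):
    """Return warnings for a job that runs at a fixed interval, given in minutes."""
    warnings = []
    if workers > MAX_WORKERS:
        warnings.append(f"workers={workers} exceeds the maximum of {MAX_WORKERS}")
    if interval_minutes < MIN_INTERVAL_MINUTES:
        warnings.append("runs more often than once every 5 minutes; executions might be skipped")
    if interval_minutes > MAX_INTERVAL_MINUTES:
        warnings.append("runs less often than once per week; duplicate data is possible")
    return warnings


def simulate_skips(trigger_minutes, min_gap=MIN_INTERVAL_MINUTES):
    """Split trigger times into executed and skipped ones.

    Assumption: a trigger is skipped when it falls less than min_gap minutes
    after the most recent executed trigger.
    """
    executed, skipped = [], []
    last = None
    for minute in sorted(trigger_minutes):
        if last is None or minute - last >= min_gap:
            executed.append(minute)
            last = minute
        else:
            skipped.append(minute)
    return executed, skipped


print(check_limits(workers=21, interval_minutes=1))
print(simulate_skips(range(5)))  # one trigger per minute over a 5-minute period
```

Under this assumed rule, a once-per-minute schedule executes the trigger at minute 0 and skips the triggers at minutes 1 through 4, which matches the example in the scheduling limitations.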
Data ingested can be delayed
Data ingestion can be delayed for the following reasons:
- The latency in the data provided by the data source
- The volume of raw data ingested. For example, ingesting 1 GB of data takes longer than ingesting 1 MB of data.
- The volume of data ingested from upstream sources, such as the Ingest REST API or a forwarder
Adding more workers might improve the data ingestion rate, but external factors still influence the speed of data ingestion.
Permissions
By default, users have rights to view and change only their own pipelines. Users can't see pipelines created by other users in their tenant or the user list for the tenant. Administrators have full rights to view all pipelines and users in each tenant.
See Manage users and admins for more information on the permissions assigned to the user and administrator roles.
Use a pull-based connector with Splunk DSP
Pull-based connectors collect events from external sources and send them into your DSP pipeline through the Collect service.
To use a pull-based connector, complete the following steps:
- Choose the data source that you want to connect to.
- Create your connection and use it in your data pipeline.
This documentation applies to the following versions of Splunk® Data Stream Processor: 1.0.0