Internal metrics of the Collector
Find the complete list of the Collector’s internal metrics and what to use them for.
Use internal metrics to monitor your Collector instance
You can use the Collector’s internal metrics to monitor the behavior of the Collector and identify performance issues.
Monitor data flow and detect data loss
To verify that data is flowing correctly, use the otelcol_receiver_accepted_spans, otelcol_receiver_accepted_metric_points, and otelcol_receiver_accepted_logs metrics for information about the data ingested by the Collector, and otelcol_exporter_sent_spans, otelcol_exporter_sent_metric_points, and otelcol_exporter_sent_logs for information about exported data.
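As a minimal sketch of how these counters can be compared, the following Python snippet totals each metric from a Prometheus-style text exposition. It assumes the Collector's internal metrics are available in that text format; the sample text, label names, values, and the helper name sum_by_metric are all made up for illustration. A gap between accepted and sent totals is not necessarily loss, since data might still be batched or queued.

```python
from collections import defaultdict

# Hypothetical sample in Prometheus text format; labels and values are invented.
SAMPLE = """\
# TYPE otelcol_receiver_accepted_spans counter
otelcol_receiver_accepted_spans{receiver="otlp",transport="grpc"} 1200
otelcol_receiver_accepted_spans{receiver="jaeger",transport="grpc"} 300
# TYPE otelcol_exporter_sent_spans counter
otelcol_exporter_sent_spans{exporter="otlp"} 1450
"""


def sum_by_metric(text):
    """Sum sample values per metric name, ignoring labels and comment lines.

    Assumes each sample line ends with the value (no trailing timestamp).
    """
    totals = defaultdict(float)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The value is the last whitespace-separated field;
        # the metric name ends at '{' when labels are present.
        series, value = line.rsplit(None, 1)
        name = series.split("{", 1)[0]
        totals[name] += float(value)
    return dict(totals)


totals = sum_by_metric(SAMPLE)
accepted = totals.get("otelcol_receiver_accepted_spans", 0.0)
sent = totals.get("otelcol_exporter_sent_spans", 0.0)
print(f"accepted={accepted:.0f} sent={sent:.0f} difference={accepted - sent:.0f}")
```

The same helper works for the metric point and log counters by changing the metric names passed to totals.get.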
Use otelcol_processor_dropped_spans, otelcol_processor_dropped_metric_points, and otelcol_processor_dropped_logs to detect data loss. Small losses shouldn’t be considered outages, so depending on your requirements, set a minimum time window before alerting.
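One way to express a minimum time window is to alert only when a dropped counter keeps increasing across every scrape interval in a trailing window. The sketch below assumes you already collect (timestamp, counter value) pairs for one of the dropped metrics; the function name, the 90-second window, and the sample readings are hypothetical.

```python
def dropping_for_at_least(samples, min_window_s):
    """Return True if the counter rose in every interval of a trailing window.

    samples: list of (unix_seconds, cumulative_dropped_count), oldest first.
    min_window_s: how long drops must persist before they count as an alert.
    """
    span = 0.0
    # Walk backwards over consecutive sample pairs, newest interval first.
    for (t0, v0), (t1, v1) in zip(reversed(samples[:-1]), reversed(samples[1:])):
        if v1 <= v0:
            # No new drops in this interval (or the counter reset): streak broken.
            return False
        span += t1 - t0
        if span >= min_window_s:
            return True
    # Not enough history to cover the requested window.
    return False


# Hypothetical readings of otelcol_processor_dropped_spans, scraped every 30 s.
brief_blip = [(0, 0), (30, 0), (60, 0), (90, 0), (120, 3)]
sustained = [(0, 0), (30, 4), (60, 9), (90, 15), (120, 22)]

print(dropping_for_at_least(brief_blip, 90))
print(dropping_for_at_least(sustained, 90))
```

With these readings, the single burst in brief_blip does not trigger an alert, while the steady increase in sustained does.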
Detect receive failures
Sustained rates of otelcol_receiver_refused_spans, otelcol_receiver_refused_metric_points, and otelcol_receiver_refused_logs indicate that the Collector is returning too many errors to clients. Depending on the deployment and the clients’ resilience, this might indicate data loss at the clients.
Sustained rates of otelcol_exporter_send_failed_spans, otelcol_exporter_send_failed_metric_points, and otelcol_exporter_send_failed_logs indicate that the Collector is not able to export data as expected. This doesn’t necessarily imply data loss, since the exporter might retry, but a high rate of failures could indicate issues with the network or with the back end receiving the data.
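To make "sustained rate" concrete, the following sketch computes the average per-second increase of a cumulative counter over a window and compares it with a threshold. It assumes the refused and send_failed metrics behave as cumulative counters that restart from zero when the Collector restarts; the function name, the threshold of 0.5 items per second, and the sample readings are illustrative only.

```python
def average_rate(samples):
    """Average per-second increase of a cumulative counter.

    samples: list of (unix_seconds, counter_value), oldest first.
    A decrease between two readings is treated as a counter reset
    (for example after a restart), so the new value counts as the increase.
    """
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        increase += (cur - prev) if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else 0.0


# Hypothetical readings of otelcol_exporter_send_failed_spans over three minutes,
# including a reset between the second and third reading.
window = [(0, 100), (60, 160), (120, 10), (180, 70)]
MAX_FAILED_PER_SECOND = 0.5  # assumed threshold; tune to your deployment

rate = average_rate(window)
print(f"rate={rate:.3f}/s alert={rate > MAX_FAILED_PER_SECOND}")
```

Averaging over the whole window, rather than reacting to a single interval, keeps short bursts that retries can absorb from raising an alert.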
Control queue length
Use the queue-retry mechanism (available in most exporters) as the retry mechanism for the Collector:
To check if your queue capacity is enough, compare otelcol_exporter_queue_capacity, which indicates the capacity of the retry queue in batches, and otelcol_exporter_queue_size, which indicates the current size of the retry queue.
otelcol_exporter_enqueue_failed_spans, otelcol_exporter_enqueue_failed_metric_points, and otelcol_exporter_enqueue_failed_log_records indicate the number of spans, metric points, and log records that failed to be added to the sending queue. If your queue is full, decrease your sending rate or scale your Collectors horizontally.
The queue-retry mechanism also supports logging for monitoring. Check your logs for messages like “Dropping data because sending_queue is full”.
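The queue checks described in this section can be combined into a single status, as in the sketch below. It takes the current values of otelcol_exporter_queue_size and otelcol_exporter_queue_capacity plus the recent increase in one of the enqueue_failed counters. The function name, the status labels, and the 0.8 warning ratio are assumptions chosen for illustration, not Collector defaults.

```python
def queue_status(size, capacity, enqueue_failed_delta, warn_ratio=0.8):
    """Classify exporter queue health from internal metric readings.

    size: current otelcol_exporter_queue_size
    capacity: current otelcol_exporter_queue_capacity
    enqueue_failed_delta: increase of an enqueue_failed counter since the last check
    warn_ratio: assumed utilization level at which to warn
    """
    if capacity <= 0:
        return "unknown"
    if enqueue_failed_delta > 0:
        # Data was rejected by the sending queue: reduce the sending rate or scale out.
        return "rejecting"
    if size / capacity >= warn_ratio:
        return "near_capacity"
    return "ok"


# Hypothetical readings.
print(queue_status(size=120, capacity=1000, enqueue_failed_delta=0))
print(queue_status(size=900, capacity=1000, enqueue_failed_delta=0))
print(queue_status(size=1000, capacity=1000, enqueue_failed_delta=42))
```

Checking enqueue failures first reflects that rejected data is the more urgent signal; high utilization alone only suggests that the queue might fill up soon.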
List of internal metrics of the Collector
These are the Collector’s internal metrics.
| Metric name | Metric description |
|---|---|
| | Number of log records that failed to be added to the sending queue |
| | Number of metric points that failed to be added to the sending queue |
| | Number of spans that failed to be added to the sending queue |
| | Capacity of the exporter queue |
| | Current size of the retry queue, in batches |
| | Number of log records that failed to be sent to destination |
| | Number of metric points that failed to be sent to destination |
| | Number of log records successfully sent to destination |
| | Number of metric points successfully sent to destination |
| | Number of spans successfully sent to destination |
| | Number of namespace add events received |
| | Number of namespace update events received |
| | Number of pod add events received |
| | Number of pod delete events received |
| | Size of table containing pod info |
| | Total CPU user and system time, in seconds |
| | Total physical memory (resident set size) |
| | Bytes of allocated heap objects |
| | Total bytes of allocated objects |
| | Cumulative bytes allocated for heap objects |
| | Uptime of the process |
| | Number of log records successfully pushed into the next component in the pipeline |
| | Number of metric points successfully pushed into the next component in the pipeline |
| | Number of spans successfully pushed into the next component in the pipeline |
| | Number of units in the batch |
| | Number of units in the batch histogram bucket |
| | Number of units in the batch histogram count |
| | Number of units in the batch histogram sum |
| | Number of times the batch was sent due to a timeout trigger |
| | Number of log records that were dropped |
| | Number of metric points that were dropped |
| | Number of spans that were dropped |
| | Distribution of groups extracted for logs |
| | Distribution of groups extracted for logs bucket histogram |
| | Distribution of groups extracted for logs count histogram |
| | Distribution of groups extracted for logs sum histogram |
| | Number of refused log records |
| | Number of refused metric points |
| | Number of refused spans |
| | Number of log records successfully pushed into the pipeline |
| | Number of metric points successfully pushed into the pipeline |
| | Number of spans successfully pushed into the pipeline |
| | Number of log records that could not be pushed into the pipeline |
| | Number of metric points that could not be pushed into the pipeline |
| | Number of spans that could not be pushed into the pipeline |
| | Number of metric points that couldn’t be scraped |
| | Number of metric points successfully scraped |