Fluentd
Monitoring by Prometheus
This article describes how to monitor Fluentd via Prometheus.
Because both Prometheus and Fluentd are CNCF (Cloud Native Computing Foundation) projects, the Fluentd project recommends Prometheus as the default tool for monitoring Fluentd.

Installation

Install fluent-plugin-prometheus gem:
    $ fluent-gem install fluent-plugin-prometheus
For td-agent, use td-agent-gem for installation:
    $ sudo td-agent-gem install fluent-plugin-prometheus
This GitHub repository contains a fully working configuration for this article.

Example Fluentd Configuration

To expose Fluentd metrics to Prometheus, we need to configure three (3) parts:
    Step 1: Counting Incoming Records by Prometheus Filter Plugin
    Step 2: Counting Outgoing Records by Prometheus Output Plugin
    Step 3: Expose Metrics by Prometheus Input Plugin via HTTP

Step 1: Counting Incoming Records by Prometheus Filter Plugin

Configure the <filter> section to count the incoming records per tag:
    # source
    <source>
      @type forward
      bind 0.0.0.0
      port 24224
    </source>

    # count the number of incoming records per tag
    <filter company.*>
      @type prometheus
      <metric>
        name fluentd_input_status_num_records_total
        type counter
        desc The total number of incoming records
        <labels>
          tag ${tag}
          hostname ${hostname}
        </labels>
      </metric>
    </filter>
With this configuration, the prometheus filter plugin increments its internal counter each time a record comes in.
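To make the counting behavior concrete, here is a minimal Python sketch of a counter with labels. It only illustrates the idea that each distinct combination of label values gets its own running total; it is not how fluent-plugin-prometheus is implemented, and the `LabeledCounter` class is a hypothetical name introduced for this example.

```python
from collections import defaultdict


class LabeledCounter:
    """Toy model of a Prometheus counter with labels (hypothetical, for illustration)."""

    def __init__(self, name, label_names):
        self.name = name
        self.label_names = tuple(label_names)
        self._values = defaultdict(float)

    def increment(self, **labels):
        # One time series exists per distinct combination of label values.
        key = tuple(labels[label] for label in self.label_names)
        self._values[key] += 1.0

    def samples(self):
        # Return {label-values tuple: count}, sorted for stable output.
        return dict(sorted(self._values.items()))


counter = LabeledCounter(
    "fluentd_input_status_num_records_total", ["tag", "hostname"]
)

# Simulate three records tagged company.test1 and one tagged company.test2.
for tag in ["company.test1", "company.test1", "company.test1", "company.test2"]:
    counter.increment(tag=tag, hostname="KZK.local")

for label_values, count in counter.samples().items():
    print(label_values, count)
```

Because the counter only ever goes up, the raw value is mostly useful as input to a rate calculation on the Prometheus side.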

Step 2: Counting Outgoing Records by Prometheus Output Plugin

Configure the copy plugin with the prometheus output plugin as one of its stores to count the outgoing records per tag:
    # count the number of outgoing records per tag
    <match company.*>
      @type copy

      <store>
        @type forward
        <server>
          name myserver1
          host 192.168.1.3
          port 24224
          weight 60
        </server>
      </store>

      <store>
        @type prometheus
        <metric>
          name fluentd_output_status_num_records_total
          type counter
          desc The total number of outgoing records
          <labels>
            tag ${tag}
            hostname ${hostname}
          </labels>
        </metric>
      </store>

    </match>
With this configuration, the prometheus output plugin increments its internal counter each time a record goes out.

Step 3: Expose Metrics by Prometheus Input Plugin via HTTP

Configure the prometheus input plugin to expose the internal counter information via HTTP:
    # expose metrics in prometheus format

    <source>
      @type prometheus
      bind 0.0.0.0
      port 24231
      metrics_path /metrics
    </source>

    <source>
      @type prometheus_output_monitor
      interval 10
      <labels>
        hostname ${hostname}
      </labels>
    </source>
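The endpoint configured above returns metrics in the Prometheus text format. As a rough illustration of what that response body looks like, the following Python sketch renders one metric family as text. It is not the plugin's code: `render_metric` and `escape_label_value` are hypothetical helper names, and the TYPE-before-HELP ordering simply mirrors the sample output shown in this article.

```python
def escape_label_value(value):
    # The text format requires backslash, double quote and newline to be escaped.
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_metric(name, metric_type, help_text, samples):
    """Render one metric family; `samples` is a list of (labels dict, value)."""
    lines = [f"# TYPE {name} {metric_type}", f"# HELP {name} {help_text}"]
    for labels, value in samples:
        label_text = ",".join(
            f'{key}="{escape_label_value(val)}"' for key, val in labels.items()
        )
        lines.append(f"{name}{{{label_text}}} {float(value)}")
    return "\n".join(lines) + "\n"


text = render_metric(
    "fluentd_input_status_num_records_total",
    "counter",
    "The total number of incoming records",
    [
        ({"tag": "company.test1", "hostname": "KZK.local"}, 3),
        ({"tag": "company.test2", "hostname": "KZK.local"}, 1),
    ],
)
print(text, end="")
```

Each scrape by Prometheus is an HTTP GET that receives a body of this shape, one line per label combination.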

Check the Configuration

After you have done these three (3) changes, restart fluentd:
    # For stand-alone Fluentd installations
    $ fluentd -c fluentd.conf

    # For td-agent users
    $ sudo systemctl restart td-agent
Let's send some records:
    $ echo '{"message":"hello"}' | bundle exec fluent-cat company.test1
    $ echo '{"message":"hello"}' | bundle exec fluent-cat company.test1
    $ echo '{"message":"hello"}' | bundle exec fluent-cat company.test1
    $ echo '{"message":"hello"}' | bundle exec fluent-cat company.test2
Access http://localhost:24231/metrics to receive the metrics in Prometheus format:
    $ curl http://localhost:24231/metrics
    # TYPE fluentd_input_status_num_records_total counter
    # HELP fluentd_input_status_num_records_total The total number of incoming records
    fluentd_input_status_num_records_total{tag="company.test1",hostname="KZK.local"} 3.0
    fluentd_input_status_num_records_total{tag="company.test2",hostname="KZK.local"} 1.0
    # TYPE fluentd_output_status_num_records_total counter
    # HELP fluentd_output_status_num_records_total The total number of outgoing records
    fluentd_output_status_num_records_total{tag="company.test1",hostname="KZK.local"} 3.0
    fluentd_output_status_num_records_total{tag="company.test2",hostname="KZK.local"} 1.0
    # TYPE fluentd_output_status_buffer_queue_length gauge
    # HELP fluentd_output_status_buffer_queue_length Current buffer queue length.
    fluentd_output_status_buffer_queue_length{hostname="KZK.local",plugin_id="object:3fcbccc6d388",type="forward"} 1.0
    ....
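If you want to inspect this output from a script instead of by eye, a small parser is enough for quick checks. The Python sketch below is deliberately simplified and makes several assumptions: it ignores comment lines, does not handle optional timestamps or unescape label values, and silently skips lines it does not understand. `parse_metrics` is a hypothetical helper name, and the sample text uses the label names from the `<labels>` sections configured earlier.

```python
import re

# metric_name{label="value",...} value   (labels are optional)
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Return a list of (metric name, labels dict, float value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and # TYPE / # HELP comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # skip anything this simplified parser does not understand
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


sample_text = """\
# TYPE fluentd_input_status_num_records_total counter
fluentd_input_status_num_records_total{tag="company.test1",hostname="KZK.local"} 3.0
fluentd_output_status_buffer_queue_length{hostname="KZK.local",plugin_id="object:3fcbccc6d388",type="forward"} 1.0
"""

for name, labels, value in parse_metrics(sample_text):
    print(name, labels, value)
```

To run it against a live instance, the text could be fetched first, for example with `urllib.request.urlopen("http://localhost:24231/metrics").read().decode()`.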

Example Prometheus Configuration

Prepare the configuration file (prometheus.yml):
    global:
      scrape_interval: 10s # Set the scrape interval to every 10 seconds. Default is every 1 minute.

    # A scrape configuration containing exactly one endpoint to scrape:
    # Here it's the Fluentd metrics endpoint.
    scrape_configs:
      - job_name: 'fluentd'
        static_configs:
          - targets: ['localhost:24231']
Launch prometheus:
    $ ./prometheus --config.file="prometheus.yml"
Now open http://localhost:9090/ in your browser.

How to use Prometheus to monitor Fluentd?

List of Fluentd Nodes

Go to http://localhost:9090/targets to see the list of Fluentd nodes and their status.
[Figure: Prometheus Targets]

List of Fluentd Metrics

Visit http://localhost:9090/graph to explore Fluentd's internal metrics. You'll see eight (8) metrics in the metric list:
    fluentd_input_status_num_records_total
    fluentd_output_status_buffer_queue_length
    fluentd_output_status_buffer_total_bytes
    fluentd_output_status_emit_count
    fluentd_output_status_num_errors
    fluentd_output_status_num_records_total
    fluentd_output_status_retry_count
    fluentd_output_status_retry_wait
[Figure: Prometheus Metrics]
Pick fluentd_input_status_num_records_total and you'll see the total incoming records per tag.
[Figure: Prometheus Graph]

Example Prometheus Queries

Because fluentd_input_status_num_records_total and fluentd_output_status_num_records_total are monotonically increasing counters, their raw values are rarely meaningful on their own. A little PromQL (Prometheus Query Language) turns them into per-second rates.
Here are example PromQL queries for common metrics:
    # number of available nodes
    up

    # incoming records / sec / host
    sum(rate(fluentd_input_status_num_records_total[1m])) by (hostname)

    # incoming records / sec / tag
    sum(rate(fluentd_input_status_num_records_total[1m])) by (tag)

    # outgoing records / sec / host
    sum(rate(fluentd_output_status_num_records_total[1m])) by (hostname)

    # outgoing records / sec / tag
    sum(rate(fluentd_output_status_num_records_total[1m])) by (tag)

    # emit count / sec
    rate(fluentd_output_status_emit_count[1m])
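To build intuition for what `sum(rate(...[1m])) by (tag)` computes, the Python sketch below works through the arithmetic on hand-made sample data. It is a simplification and not Prometheus' actual algorithm: the real `rate()` also handles counter resets and extrapolates to the window boundaries, which this sketch does not. The function names and the sample values are hypothetical.

```python
from collections import defaultdict


def simple_rate(samples):
    """Per-second increase between the first and last (timestamp, value) sample."""
    if len(samples) < 2:
        return None
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    if t1 <= t0:
        return None
    return (v1 - v0) / (t1 - t0)


def sum_rate_by(series, label):
    """Sum the per-series rates, grouped by the value of one label."""
    totals = defaultdict(float)
    for labels, samples in series:
        rate = simple_rate(samples)
        if rate is not None:
            totals[labels[label]] += rate
    return dict(totals)


# Two counter series observed over a 60-second window (made-up values).
series = [
    ({"tag": "company.test1", "hostname": "KZK.local"},
     [(0, 0.0), (30, 60.0), (60, 120.0)]),
    ({"tag": "company.test2", "hostname": "KZK.local"},
     [(0, 10.0), (60, 40.0)]),
]

print(sum_rate_by(series, "tag"))
print(sum_rate_by(series, "hostname"))
```

Grouping by `tag` keeps the two series apart, while grouping by `hostname` adds their rates together because both series share the same host.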

Metrics to Monitor

In addition to the traffic metrics introduced above, it is important to monitor the buffer queue length and the error count.
If these values keep increasing, Fluentd cannot flush its buffer to the destination, and you will start losing data once the buffer becomes full.
    # maximum buffer length in last 1min
    max_over_time(fluentd_output_status_buffer_queue_length[1m])

    # maximum buffer bytes in last 1min
    max_over_time(fluentd_output_status_buffer_total_bytes[1m])

    # maximum retry wait in last 1min
    max_over_time(fluentd_output_status_retry_wait[1m])

    # retry count / sec
    rate(fluentd_output_status_retry_count[1m])
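The following Python sketch shows, on made-up data, the two checks described above: taking the maximum of a gauge over a recent window and detecting that a value keeps growing. It is only an illustration under simplifying assumptions; Prometheus' exact window-boundary handling may differ, and `max_over_window` and `is_growing` are hypothetical helper names.

```python
def max_over_window(samples, now, window_seconds):
    """Largest value among samples with now - window < timestamp <= now."""
    values = [v for t, v in samples if now - window_seconds < t <= now]
    return max(values) if values else None


def is_growing(samples):
    """True if the values never decrease and the last exceeds the first."""
    values = [v for _, v in samples]
    non_decreasing = all(a <= b for a, b in zip(values, values[1:]))
    return len(values) >= 2 and non_decreasing and values[-1] > values[0]


# (timestamp in seconds, buffer queue length), scraped every 10 seconds.
queue_length = [(10, 1.0), (20, 2.0), (30, 4.0), (40, 7.0)]

print(max_over_window(queue_length, now=40, window_seconds=60))
print(is_growing(queue_length))
```

A queue length that grows across several consecutive scrapes, as in this sample, is the pattern worth alerting on, since it suggests the output cannot keep up with the input.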

Grafana for Advanced Visualization / Alerting

For more advanced visualization and alerting, we recommend Grafana as a visualization frontend for Prometheus.
[Figure: Prometheus + Grafana]

Further Readings

If this article is incorrect or outdated, or omits critical information, please let us know. Fluentd is an open-source project under Cloud Native Computing Foundation (CNCF). All components are available under the Apache 2 License.