Monitoring by Prometheus
Since both Prometheus and Fluentd are CNCF (Cloud Native Computing Foundation) projects, the Fluentd project recommends using Prometheus by default to monitor Fluentd.
Install the fluent-plugin-prometheus gem:
$ fluent-gem install fluent-plugin-prometheus
For td-agent, use td-agent-gem for installation:
$ sudo td-agent-gem install fluent-plugin-prometheus
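You can check that the plugin is installed. For example, with a standalone gem installation the following should list the plugin (for td-agent, use td-agent-gem list instead):
$ fluent-gem list | grep fluent-plugin-prometheus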
To expose Fluentd metrics to Prometheus, we need to configure three (3) parts:
- Step 1: Counting Incoming Records by Prometheus Filter Plugin
- Step 2: Counting Outgoing Records by Prometheus Output Plugin
- Step 3: Expose Metrics by Prometheus Input Plugin via HTTP
Configure the <filter> section to count the incoming records per tag:
# source
<source>
@type forward
bind 0.0.0.0
port 24224
</source>
# count the number of incoming records per tag
<filter company.*>
@type prometheus
<metric>
name fluentd_input_status_num_records_total
type counter
desc The total number of incoming records
<labels>
tag ${tag}
hostname ${hostname}
</labels>
</metric>
</filter>
With this configuration, the prometheus filter plugin starts adding the internal counter as records come in.
Configure the copy plugin with the prometheus output plugin to count the outgoing records per tag:
# count the number of outgoing records per tag
<match company.*>
@type copy
<store>
@type forward
<server>
name myserver1
host 192.168.1.3
port 24224
weight 60
</server>
</store>
<store>
@type prometheus
<metric>
name fluentd_output_status_num_records_total
type counter
desc The total number of outgoing records
<labels>
tag ${tag}
hostname ${hostname}
</labels>
</metric>
</store>
</match>
With this configuration, the prometheus output plugin starts adding the internal counter as records go out.
Configure the prometheus input plugin to expose the internal counter information via HTTP:
# expose metrics in prometheus format
<source>
@type prometheus
bind 0.0.0.0
port 24231
metrics_path /metrics
</source>
<source>
@type prometheus_output_monitor
interval 10
<labels>
hostname ${hostname}
</labels>
</source>
After making these three (3) changes, restart Fluentd:
# For stand-alone Fluentd installations
$ fluentd -c fluentd.conf
# For td-agent users
$ sudo systemctl restart td-agent
Let's send some records:
$ echo '{"message":"hello"}' | bundle exec fluent-cat company.test1
$ echo '{"message":"hello"}' | bundle exec fluent-cat company.test1
$ echo '{"message":"hello"}' | bundle exec fluent-cat company.test1
$ echo '{"message":"hello"}' | bundle exec fluent-cat company.test2
Then, check the exposed metrics via HTTP:
$ curl http://localhost:24231/metrics
# TYPE fluentd_input_status_num_records_total counter
# HELP fluentd_input_status_num_records_total The total number of incoming records
fluentd_input_status_num_records_total{tag="company.test1",hostname="KZK.local"} 3.0
fluentd_input_status_num_records_total{tag="company.test2",hostname="KZK.local"} 1.0
# TYPE fluentd_output_status_num_records_total counter
# HELP fluentd_output_status_num_records_total The total number of outgoing records
fluentd_output_status_num_records_total{tag="company.test1",hostname="KZK.local"} 3.0
fluentd_output_status_num_records_total{tag="company.test2",hostname="KZK.local"} 1.0
# TYPE fluentd_output_status_buffer_queue_length gauge
# HELP fluentd_output_status_buffer_queue_length Current buffer queue length.
fluentd_output_status_buffer_queue_length{hostname="KZK.local",plugin_id="object:3fcbccc6d388",type="forward"} 1.0
....
Prepare the Prometheus configuration file (prometheus.yml):
global:
  scrape_interval: 10s # Set the scrape interval to every 10 seconds. Default is every 1 minute.
# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Fluentd.
scrape_configs:
  - job_name: 'fluentd'
    static_configs:
      - targets: ['localhost:24231']
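Optionally, you can validate the file before launching Prometheus. This assumes the promtool binary that ships with the Prometheus release is available in the current directory:
$ ./promtool check config prometheus.yml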
Launch prometheus:
$ ./prometheus --config.file="prometheus.yml"
Now, open http://localhost:9090/ in your browser.
Go to http://localhost:9090/targets to see the list of Fluentd nodes and their status.
Prometheus Targets
Visit http://localhost:9090/graph to explore Fluentd's internal metrics. You'll see eight (8) metrics in the metric list:
Prometheus Metrics
fluentd_input_status_num_records_total
fluentd_output_status_buffer_queue_length
fluentd_output_status_buffer_total_bytes
fluentd_output_status_emit_count
fluentd_output_status_num_errors
fluentd_output_status_num_records_total
fluentd_output_status_retry_count
fluentd_output_status_retry_wait
Pick fluentd_input_status_num_records_total and you'll see the total incoming records per tag.
Prometheus Graph
Since fluentd_input_status_num_records_total and fluentd_output_status_num_records_total are monotonically increasing counters, they require a little bit of calculation with PromQL (Prometheus Query Language) to become meaningful.
Here are example PromQL queries for common metrics:
# number of available nodes
up
# incoming records / sec / host
sum(rate(fluentd_input_status_num_records_total[1m])) by (hostname)
# incoming records / sec / tag
sum(rate(fluentd_input_status_num_records_total[1m])) by (tag)
# outgoing records / sec / host
sum(rate(fluentd_output_status_num_records_total[1m])) by (hostname)
# outgoing records / sec / tag
sum(rate(fluentd_output_status_num_records_total[1m])) by (tag)
# emit count / sec
rate(fluentd_output_status_emit_count[1m])
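If you prefer to check these values from a script rather than the web UI, the same PromQL can be sent to Prometheus' standard HTTP query API. For example, assuming Prometheus is running on localhost:9090 as configured above:
$ curl -sG http://localhost:9090/api/v1/query \
    --data-urlencode 'query=sum(rate(fluentd_input_status_num_records_total[1m])) by (tag)'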
In addition to the traffic metrics introduced above, it is important to monitor the buffer queue length and the error count.
If these values keep increasing, it means that Fluentd cannot flush its buffer to the destination, and you will lose data once the buffer becomes full.
# maximum buffer length in last 1min
max_over_time(fluentd_output_status_buffer_queue_length[1m])
# maximum buffer bytes in last 1min
max_over_time(fluentd_output_status_buffer_total_bytes[1m])
# maximum retry wait in last 1min
max_over_time(fluentd_output_status_retry_wait[1m])
# retry count / sec
rate(fluentd_output_status_retry_count[1m])
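These queries can also drive alerts. The following is a minimal sketch of a Prometheus alerting rules file; the file name, alert names, thresholds, and durations are illustrative assumptions, not values from the Fluentd documentation, so adjust them to your traffic. Reference the file from prometheus.yml via rule_files:
# alert-rules.yml (hypothetical file name, listed under rule_files in prometheus.yml)
groups:
  - name: fluentd
    rules:
      - alert: FluentdBufferQueueLength
        # Buffer queue has stayed above an arbitrary threshold for 10 minutes
        expr: max_over_time(fluentd_output_status_buffer_queue_length[5m]) > 10
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Fluentd cannot flush its buffer to the destination"
      - alert: FluentdOutputRetries
        # Output retries are occurring continuously
        expr: rate(fluentd_output_status_retry_count[5m]) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Fluentd output plugin keeps retrying"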
For more advanced visualization and alerting, we recommend using Grafana as a frontend for Prometheus.

Prometheus + Grafana
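As a starting point, Prometheus can be registered in Grafana either through the UI or with a data source provisioning file. Below is a minimal sketch of such a provisioning file; the file location and settings depend on your Grafana installation:
# e.g. <grafana-config-dir>/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090
    isDefault: true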