Available languages | en | ja |

HDFS (WebHDFS) Output Plugin

The out_webhdfs Buffered Output plugin writes records into HDFS (Hadoop Distributed File System). By default, it creates files on an hourly basis. This means that when you first import records using the plugin, no file is created immediately. The file will be created when the time_slice_format condition has been met. To change the output frequency, please modify the time_slice_format value.

Install

out_webhdfs is included in td-agent by default (v1.1.10 or later). Fluentd gem users will have to install the fluent-plugin-webhdfs gem using the following command.

$ fluent-gem install fluent-plugin-webhdfs

HDFS Configuration

Append operations are not enabled by default on CDH. To enable WebHDFS and append support, add the following properties to your hdfs-site.xml file and restart the whole cluster.

<property>
  <name>dfs.webhdfs.enabled</name>
  <value>true</value>
</property>

<property>
  <name>dfs.support.append</name>
  <value>true</value>
</property>

<property>
  <name>dfs.support.broken.append</name>
  <value>true</value>
</property>

Example Configuration

<match access.**>
  type webhdfs
  host namenode.your.cluster.local
  port 50070
  path /path/on/hdfs/access.log.%Y%m%d_%H.${hostname}.log
  flush_interval 10s
</match>

Please see the Fluentd + HDFS: Instant Big Data Collection article for real-world use cases.

Please see the Config File article for the basic structure and syntax of the configuration file.
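
As a variation on the example above, the hourly default described in the introduction could be changed to daily output. This is only a sketch: it assumes time_slice_format accepts a strftime-style pattern, as in other time-sliced output plugins, and it drops the hour placeholder from the path so that the path and the time slice agree. The host and path values are the same placeholders used in the example configuration.

<match access.**>
  type webhdfs
  host namenode.your.cluster.local
  port 50070
  # One file per day: no hour placeholder in the path
  path /path/on/hdfs/access.log.%Y%m%d.${hostname}.log
  # Assumed strftime-style pattern for daily time slices
  time_slice_format %Y%m%d
</match>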

Parameters

type (required)

The value must be webhdfs.

host (required)

The namenode hostname.

port (required)

The namenode port number.

path (required)

The path on HDFS. Please include ${hostname} in your path to avoid writing into the same HDFS file from multiple Fluentd instances. This conflict could result in data loss.

Buffer Parameters

For advanced usage, you can tune Fluentd’s internal buffering mechanism with these parameters.

buffer_type

The buffer type is memory by default (buf_memory). The file (buf_file) buffer type can be chosen as well. Unlike many other output plugins, the buffer_path parameter MUST be specified when using buffer_type file.
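
For instance, a file-buffered variant of the example configuration might look like the sketch below. The buffer_path value is a hypothetical location, not a default; choose a directory that the Fluentd process can write to.

<match access.**>
  type webhdfs
  host namenode.your.cluster.local
  port 50070
  path /path/on/hdfs/access.log.%Y%m%d_%H.${hostname}.log
  # Keep buffered chunks on local disk instead of in memory
  buffer_type file
  # Hypothetical path; must be set when buffer_type is file
  buffer_path /var/log/td-agent/buffer/webhdfs
</match>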

buffer_queue_limit, buffer_chunk_limit

The length of the chunk queue and the size of each chunk, respectively. Please see the Buffer Plugin Overview article for the basic buffer structure. The default values are 64 and 256m, respectively. The suffixes “k” (KB), “m” (MB), and “g” (GB) can be used for buffer_chunk_limit.
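
The two limits can be set together in the match section. The values below are illustrative only, not recommendations; they show the size suffix syntax described above.

<match access.**>
  type webhdfs
  host namenode.your.cluster.local
  port 50070
  path /path/on/hdfs/access.log.%Y%m%d_%H.${hostname}.log
  # Allow up to 128 chunks in the queue (illustrative value)
  buffer_queue_limit 128
  # Cap each chunk at 64 MB using the "m" suffix (illustrative value)
  buffer_chunk_limit 64m
</match>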

flush_interval

The interval between forced data flushes. The default is nil (don’t force flush and wait until the end of time slice + time_slice_wait). The suffixes “s” (seconds), “m” (minutes), and “h” (hours) can be used.

retry_wait and retry_limit

The interval between write retries, and the number of retries. The default values are 1.0 and 17, respectively. retry_wait doubles every retry (e.g. the last retry waits for 131072 sec, roughly 36 hours).
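
A sketch of how the flush and retry parameters might be combined is shown below. The numbers are arbitrary examples chosen to show the syntax, not tuned settings.

<match access.**>
  type webhdfs
  host namenode.your.cluster.local
  port 50070
  path /path/on/hdfs/access.log.%Y%m%d_%H.${hostname}.log
  # Force a flush every 30 seconds instead of waiting for the time slice to end
  flush_interval 30s
  # Start with a 2-second wait before the first retry (example value)
  retry_wait 2.0
  # Stop retrying after 10 attempts (example value)
  retry_limit 10
</match>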

Last updated: 2013-04-28 10:01:10 UTC
