In this article, we introduce several common data manipulation challenges faced by our users (such as filtering and modifying data) and explain how to solve each task using one or more Fluentd plugins.
Scenario: Filtering Data by the Value of a Field
Let's suppose our Fluentd instances are collecting data from Apache web server logs via in_tail. Our goal is to filter out all requests that returned a 200 status code, keeping only the non-200 requests.
Solution: Use fluent-plugin-grep
fluent-plugin-grep is a plugin that can "grep" data according to the different fields within Fluentd events.
By using the add_tag_prefix option, we can prepend a prefix to the tag of each filtered event so that it can be matched by a subsequent section. For example, we can send all logs with non-200 status codes to Treasure Data, as shown below:
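A minimal sketch of such a configuration follows. It assumes the apache2 format of in_tail, which stores the status code in a field named code. The log path, the apache.access tag, the filtered prefix, and the YOUR_API_KEY placeholder are illustrative assumptions. The sketch uses the older `type` keyword; newer Fluentd versions spell it `@type`.

```
# Tail the Apache access log (the path and tag are assumptions for this sketch)
<source>
  type tail
  path /var/log/apache2/access.log
  format apache2
  tag apache.access
</source>

# Drop events whose "code" field is exactly 200 and prefix the tag of the
# remaining events, so they are re-emitted as filtered.apache.access
<match apache.access>
  type grep
  input_key code
  exclude ^200$
  add_tag_prefix filtered
</match>

# Send the remaining (non-200) events to Treasure Data
<match filtered.apache.access>
  type tdlog
  apikey YOUR_API_KEY
</match>
```

Because the grep section re-emits events under a new tag, the final match section only ever sees events that survived the filter.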
fluent-plugin-grep can filter based on multiple fields as well. The config below keeps all requests with a 4xx status code that are NOT referred from yourdomain.com. (A real-world use case: figuring out how many dead links there are in the wild by filtering out internal links.)
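The two conditions can be sketched as follows, assuming the apache2 field names code and referer, a hypothetical apache.access tag, and the plugin's numbered regexpN / excludeN options, each of which takes a field name followed by a pattern:

```
<match apache.access>
  type grep
  # Keep only events whose status code is 4xx
  regexp1 code ^4\d\d$
  # Drop events whose referer points at yourdomain.com
  exclude1 referer ^https?://yourdomain\.com
  add_tag_prefix filtered
</match>
```

An event must pass both conditions to be re-emitted with the filtered prefix.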
Scenario: Adding a New Field (such as hostname)
When collecting data, we often need to add a new field or change an existing field in our log data. For example, many Fluentd users need to add the hostname of their servers to the Apache web server log data in order to compute the number of requests handled by each server (e.g., by storing the events in MongoDB/HDFS and running GROUP BY queries).
We can then add a new field with the hostname information as follows:
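One way to sketch this is with fluent-plugin-record-modifier. That plugin choice is an assumption, as are the apache.access tag, the with_hostname output tag, and the gen_host field name used here for illustration:

```
<match apache.access>
  type record_modifier
  # Tag under which the modified events are re-emitted
  tag with_hostname
  # Any other key/value pair is added to each record as a new field;
  # the embedded Ruby expression is evaluated when the config is parsed
  gen_host "#{Socket.gethostname}"
</match>
```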
Each modified event now carries the added field, with the server's hostname as its value.
NOTE: The "#{Socket.gethostname}" placeholder is interpreted during the configuration parsing phase. It inlines the hostname of the server that the Fluentd instance is running on (in this example, our server's name is "our_server").