---
layout: post
status: PUBLISHED
published: true
title: Getting Syslog Events to HBase
id: c70286b5-7349-4b1e-9425-34066513d5f1
date: '2015-12-16 13:39:24 -0500'
categories: nifi
tags:
- syslog
- hbase
- logging
permalink: nifi/entry/storing_syslog_events_in_hbase
---
<h1>Getting Syslog Events to HBase</h1>
<p>
<span class="author">Bryan Bende - </span><br />
<span class="author"><a href="mailto:bbende@gmail.com">bbende@gmail.com</a> - </span><br />
<span class="author"><a href="https://twitter.com/BBende">@bbende</a></span></p>
<hr />
<p>
In the Apache NiFi 0.4.0 release there are several new integration points<br />
including processors for interacting with Syslog and HBase. In this post we'll<br />
demonstrate how to use NiFi to receive messages from Syslog over UDP,<br />
and store those messages in HBase.</p>
<p>
The flow described in this post was created using Apache NiFi 0.4.0,<br />
rsyslog 5.8.10, and Apache HBase 1.1.2.</p>
<h2>Setting up Syslog</h2>
<p>
In order for NiFi to receive syslog messages, rsyslog needs to forward messages<br />
to a port that NiFi will be listening on. Forwarding of messages can be<br />
configured in rsyslog.conf, typically located in /etc on Linux operating<br />
systems.</p>
<p>
Edit rsyslog.conf and add the following line:</p>
<div>
<code></p>
<pre style="background-color: #f1f1f1">
*.* @localhost:7780
</pre>
<p></code>
</div></p>
<p>
This tells rsyslog to forward all messages over UDP to localhost port 7780.<br />
A double '@@' can be used to forward over TCP.</p>
<p>
Restart rsyslog for the changes to take effect:</p>
<div>
<code></p>
<pre style="background-color: #f1f1f1">
/etc/init.d/rsyslog restart
Shutting down system logger: [ OK ]
Starting system logger: [ OK ]
</pre>
<p></code>
</div></p>
<h2>Setting up HBase</h2>
<p>
In order to store the syslog messages, we'll create an HBase table called<br />
'syslog' with one column family called 'msg'. From the command line enter the<br />
following:</p>
<div>
<code></p>
<pre style="background-color: #f1f1f1">
hbase shell
create 'syslog', {NAME => 'msg'}
</pre>
<p></code>
</div></p>
<h2>Configure an HBase Client Service</h2>
<p>
The HBase processors added in Apache NiFi 0.4.0 use a controller service to<br />
interact with HBase. This allows the processors to remain unchanged when the<br />
HBase client changes. Because NARs provide class-loader isolation, a single<br />
NiFi instance can also interact with HBase instances of different versions<br />
at the same time.</p>
<p>
The HBase Client Service can be configured by providing paths to external<br />
configuration files, such as hbase-site.xml, or by providing several<br />
properties directly on the service. For this example we will take the<br />
latter approach. From the Controller Services configuration window in NiFi,<br />
add an HBase_1_1_2_ClientService with the following configuration (adjusting<br />
values appropriately for your system):</p>
<p>
<a href="https://blogs.apache.org/nifi/mediaresource/d3ea48a8-e8db-46c0-9f2b-0896ef37f493"><img src="https://blogs.apache.org/nifi/mediaresource/d3ea48a8-e8db-46c0-9f2b-0896ef37f493?" alt="client-service-config.jpg"></img></a></p>
<p>
After configuring the service, enable it in order for it to be usable by<br />
processors:</p>
<p>
<a href="https://blogs.apache.org/nifi/mediaresource/eddbfa7e-0032-48d9-bad7-21e37b7736fc"><img src="https://blogs.apache.org/nifi/mediaresource/eddbfa7e-0032-48d9-bad7-21e37b7736fc?" alt="client-service-enabled.jpg"></img></a></p>
<h2>Building the Dataflow</h2>
<p>
The dataflow we are going to build will consist of the following components:</p>
<ul>
<li><b>ListenSyslog</b> for receiving syslog messages over UDP</li>
<li><b>UpdateAttribute</b> for renaming attributes and creating a row id for HBase</li>
<li><b>AttributesToJSON</b> for creating a JSON document from the syslog attributes</li>
<li><b>PutHBaseJSON</b> for inserting each JSON document as a row in HBase</li>
</ul>
<p>
The overall flow looks like the following:</p>
<p>
<a href="https://blogs.apache.org/nifi/mediaresource/d4136c36-cea4-4e36-b072-f72944edbff3"><img src="https://blogs.apache.org/nifi/mediaresource/d4136c36-cea4-4e36-b072-f72944edbff3?" alt="syslog-hbase-flow.jpg"></img></a><br/><br />
Let's walk through the configuration of each processor...</p>
<h3>ListenSyslog</h3>
<p>
<a href="https://blogs.apache.org/nifi/mediaresource/e9f4b991-8c41-4c26-8b60-e8a7e6d4c7dc"><img src="https://blogs.apache.org/nifi/mediaresource/e9f4b991-8c41-4c26-8b60-e8a7e6d4c7dc?" alt="config-listensyslog.jpg"></img></a></p>
<p>
Set the <i>Port</i> to the same port that rsyslog is forwarding messages to,<br />
in this case 7780. Leave everything else as the default values.</p>
<p>
With a <i>Max Batch Size</i> of "1" and <i>Parse Messages</i> as "true", each syslog<br />
message will be emitted as a single FlowFile, with the content of the<br />
FlowFile being the original message, and the results of parsing the message<br />
being stored as FlowFile attributes.</p>
<p>
The attributes we will be interested in are:</p>
<ul>
<li>syslog.priority</li>
<li>syslog.severity</li>
<li>syslog.facility</li>
<li>syslog.version</li>
<li>syslog.timestamp</li>
<li>syslog.hostname</li>
<li>syslog.sender</li>
<li>syslog.body</li>
<li>syslog.protocol</li>
<li>syslog.port</li>
</ul>
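<p>
Three of these attributes are related: the syslog protocol (RFC 3164 and RFC 5424) encodes the priority as facility * 8 + severity. A minimal Python sketch of that arithmetic follows. The function name is illustrative, and the exact string format NiFi uses for these attribute values is not covered here.</p>

```python
def split_priority(priority):
    """Return (facility, severity) for a numeric syslog priority."""
    # priority = facility * 8 + severity, so divmod by 8 recovers both parts.
    facility, severity = divmod(priority, 8)
    return facility, severity


# 13 corresponds to facility 1 (user) and severity 5 (notice).
print(split_priority(13))  # prints (1, 5)
```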
<h3>UpdateAttribute</h3>
<p>
<a href="https://blogs.apache.org/nifi/mediaresource/c04e59e3-a444-4fd5-9ad3-7518a9438a4f"><img src="https://blogs.apache.org/nifi/mediaresource/c04e59e3-a444-4fd5-9ad3-7518a9438a4f?" alt="config-updateattr.jpg"></img></a></p>
<p>
The attributes produced by ListenSyslog all start with "syslog." which keeps<br />
them nicely namespaced in NiFi. However, we are going to use these attribute<br />
names as column qualifiers in HBase. We don't really need this prefix since<br />
we will already be within a syslog table.</p>
<p>
Add a property for each syslog attribute to remove the prefix, and use the<br />
<i>Delete Attributes Expression</i> to remove the original attributes. In addition,<br />
create an <i>id</i> attribute of the form "timestamp_uuid" where timestamp is the<br />
long representation of the timestamp on the syslog message, and uuid is the<br />
uuid of the FlowFile in NiFi. This id attribute will be used as the row id in<br />
HBase.</p>
<p>
The expression language for the id attribute is:</p>
<div>
<code></p>
<pre style="background-color: #f1f1f1">
${syslog.timestamp:toDate('MMM d HH:mm:ss'):toNumber()}_${uuid}
</pre>
<p> </code>
</div></p>
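<p>
To make the resulting value concrete, here is a rough Python equivalent of that expression. It rests on two assumptions that are not stated in this post: that the missing year resolves to 1970, and that the time is interpreted as UTC. Under those assumptions it reproduces the row id that appears in the scan output in the Verifying the Flow section. The function name is hypothetical.</p>

```python
import calendar
from datetime import datetime


def row_id(syslog_timestamp, flowfile_uuid):
    """Build a "timestamp_uuid" row id from a year-less syslog timestamp."""
    # Assumption: the missing year is 1970 and the time zone is UTC.
    parsed = datetime.strptime("1970 " + syslog_timestamp, "%Y %b %d %H:%M:%S")
    millis = calendar.timegm(parsed.timetuple()) * 1000
    return f"{millis}_{flowfile_uuid}"


print(row_id("Dec 10 19:20:15", "84f91b21-d35f-4a24-8e0e-aaed4a521c13"))
# prints 29704815000_84f91b21-d35f-4a24-8e0e-aaed4a521c13
```

<p>
Because the syslog timestamp carries no year, ids built this way sort by month, day, and time only.</p>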
<h3>AttributesToJSON</h3>
<p>
<a href="https://blogs.apache.org/nifi/mediaresource/e710ebb7-0289-4cb1-bc03-3a07121c2daf"><img src="https://blogs.apache.org/nifi/mediaresource/e710ebb7-0289-4cb1-bc03-3a07121c2daf" alt="config-attrstojson.jpg"></img></a></p>
<p>
Set the <i>Destination</i> to "flowfile-content" so that the JSON document<br />
replaces the FlowFile content, and set <i>Include Core Attributes</i> to<br />
"false" so that the standard NiFi attributes are not included.</p>
<h3>PutHBaseJSON</h3>
<p>
<a href="https://blogs.apache.org/nifi/mediaresource/efcfdfcf-e4f1-4894-ad79-b1deed8c42f3"><img src="https://blogs.apache.org/nifi/mediaresource/efcfdfcf-e4f1-4894-ad79-b1deed8c42f3" alt="config-puthbasejson.jpg"></img></a></p>
<p>
Select the <i>HBase Client Service</i> we configured earlier and set the<br />
<i>Table Name</i> and <i>Column Family</i> to "syslog" and "msg" based on the<br />
table we created earlier. In addition, set the <i>Row Identifier Field Name</i><br />
to "id" to instruct the processor to use the id field from the JSON for the<br />
row id.</p>
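<p>
The last three processors can be pictured together with a short Python sketch. This only illustrates the data transformation and is not how the processors are implemented. The function names are hypothetical, and the set of core attributes is a small illustrative subset.</p>

```python
import json

PREFIX = "syslog."
# Illustrative subset of NiFi's core attributes (assumption, not the full list).
CORE_ATTRIBUTES = {"uuid", "filename", "path"}


def to_json_document(attributes, row_id):
    """UpdateAttribute + AttributesToJSON: strip the prefix, add id, serialize."""
    renamed = {
        (name[len(PREFIX):] if name.startswith(PREFIX) else name): value
        for name, value in attributes.items()
        if name not in CORE_ATTRIBUTES
    }
    renamed["id"] = row_id
    return json.dumps(renamed, sort_keys=True)


def to_cells(json_document, column_family="msg", row_id_field="id"):
    """PutHBaseJSON-style mapping: one cell per field, keyed by the row id."""
    fields = json.loads(json_document)
    row = fields.pop(row_id_field)
    return [(row, f"{column_family}:{name}", value)
            for name, value in sorted(fields.items())]


attributes = {
    "syslog.body": "root: this is a test message",
    "syslog.hostname": "localhost",
    "syslog.port": "7780",
    "uuid": "84f91b21-d35f-4a24-8e0e-aaed4a521c13",
}
document = to_json_document(attributes, "29704815000_" + attributes["uuid"])
for row, column, value in to_cells(document):
    print(row, column, value)
```

<p>
Each field other than id becomes one cell in the msg column family, which is the shape of the rows shown by the HBase scan in the next section.</p>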
<h2>Verifying the Flow</h2>
<p>
From a terminal we can send a test message to syslog using the logger utility:</p>
<div>
<code></p>
<pre style="background-color: #f1f1f1">
logger "this is a test message"
</pre>
<p></code>
</div></p>
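<p>
If logger is not available, a test message can also be sent straight to the UDP port, bypassing rsyslog. The Python sketch below formats a message in the RFC 3164 layout and sends it to port 7780. The function names, tag, and default priority are illustrative choices, and this post does not confirm that ListenSyslog parses such a message identically to one forwarded by rsyslog.</p>

```python
import socket
from datetime import datetime

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def build_syslog_packet(body, when, hostname="localhost", tag="test", priority=13):
    """Format one RFC 3164 style message: priority, timestamp, host, tag, body."""
    # RFC 3164 pads single-digit days with a space rather than a zero.
    stamp = f"{MONTHS[when.month - 1]} {when.day:2d} {when:%H:%M:%S}"
    return f"<{priority}>{stamp} {hostname} {tag}: {body}".encode("utf-8")


def send_syslog(body, host="localhost", port=7780):
    """Send one message over UDP to the port ListenSyslog is listening on."""
    packet = build_syslog_packet(body, datetime.now())
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(packet, (host, port))
```

<p>
Calling send_syslog("this is a test message") sends a single datagram. UDP gives no delivery confirmation, so the HBase scan remains the way to check the result.</p>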
<p>
Using the HBase shell we can inspect the contents of the syslog table:</p>
<div>
<code></p>
<pre style="background-color: #f1f1f1">
hbase shell
hbase(main):002:0> scan 'syslog'
ROW COLUMN+CELL
29704815000_84f91b21-d35f-4a24-8e0e-aaed4a521c13 column=msg:body, timestamp=1449775215481,
value=root: this is a test message
29704815000_84f91b21-d35f-4a24-8e0e-aaed4a521c13 column=msg:hostname, timestamp=1449775215481,
value=localhost
29704815000_84f91b21-d35f-4a24-8e0e-aaed4a521c13 column=msg:port, timestamp=1449775215481,
value=7780
29704815000_84f91b21-d35f-4a24-8e0e-aaed4a521c13 column=msg:protocol, timestamp=1449775215481,
value=UDP
29704815000_84f91b21-d35f-4a24-8e0e-aaed4a521c13 column=msg:sender, timestamp=1449775215481,
value=/127.0.0.1
29704815000_84f91b21-d35f-4a24-8e0e-aaed4a521c13 column=msg:timestamp, timestamp=1449775215481,
value=Dec 10 19:20:15
29704815000_84f91b21-d35f-4a24-8e0e-aaed4a521c13 column=msg:version, timestamp=1449775215481,
value=
1 row(s) in 0.1120 seconds
</pre>
<p></code>
</div></p>
<h2>Performance Considerations</h2>
<p>
In some cases the volume of syslog messages being pushed to ListenSyslog may<br />
be very high. There are several options to help scale the processing<br />
depending on the given use-case.</p>
<h4>Concurrent Tasks</h4>
<p>
ListenSyslog has a background thread reading messages as fast<br />
as possible and placing them on a blocking queue to be de-queued and processed<br />
by the onTrigger method of the processor. By increasing the number of<br />
concurrent tasks for the processor, we can scale up the rate at which messages<br />
are processed, ensuring new messages can continue to be queued.</p>
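<p>
As a rough analogy only (this is not NiFi's implementation), the arrangement resembles one reader thread feeding a bounded blocking queue that several workers drain. In the Python sketch below all names are hypothetical and the "processing" is a placeholder.</p>

```python
import queue
import threading


def process_with_tasks(messages, concurrent_tasks=2, queue_size=100):
    """Drain a bounded queue with several workers; return processed messages."""
    pending = queue.Queue(maxsize=queue_size)  # bounded blocking queue
    processed = []
    lock = threading.Lock()

    def reader():
        # Stands in for the background thread reading from the socket.
        for message in messages:
            pending.put(message)  # blocks while the queue is full
        for _ in range(concurrent_tasks):
            pending.put(None)  # one stop marker per worker

    def worker():
        # Stands in for one concurrent task de-queuing messages.
        while True:
            message = pending.get()
            if message is None:
                return
            with lock:
                processed.append(message.upper())  # placeholder for real work

    threads = [threading.Thread(target=reader)]
    threads += [threading.Thread(target=worker) for _ in range(concurrent_tasks)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return processed


print(len(process_with_tasks([f"msg {i}" for i in range(1000)], concurrent_tasks=4)))
```

<p>
With more workers the queue drains faster, so the reader spends less time blocked on a full queue.</p>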
<h4>Parsing</h4>
<p>
One of the more expensive operations during the processing of a message is<br />
parsing the message in order to provide the attributes. Parsing messages<br />
is controlled on the processor through a property and can be turned off in<br />
cases where the attributes are not needed, and the original message just<br />
needs to be delivered somewhere.</p>
<h4>Batching</h4>
<p>
In cases where parsing the messages is not necessary, an additional option is<br />
batching many messages together during one call to onTrigger. This is<br />
controlled through the <i>Max Batch Size</i> property which defaults to "1". This<br />
would be appropriate in cases where having individual messages is not<br />
necessary, such as storing the messages in HDFS where you need them batched<br />
into appropriately sized files.</p>
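<p>
Conceptually, batching groups several raw messages into the content of one FlowFile. The Python sketch below assumes the messages are joined with newlines, which is an assumption made for illustration, and the function name is hypothetical.</p>

```python
def batch_messages(messages, max_batch_size):
    """Group messages into newline-joined batches of at most max_batch_size."""
    batches = []
    for start in range(0, len(messages), max_batch_size):
        # Assumption: messages within one batch are separated by a newline.
        batches.append("\n".join(messages[start:start + max_batch_size]))
    return batches


print(batch_messages(["m1", "m2", "m3"], 2))  # prints ['m1\nm2', 'm3']
```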
<h4>ParseSyslog</h4>
<p>
In addition to parsing messages directly in ListenSyslog, there is also a<br />
ParseSyslog processor. An alternative to the flow described in this post<br />
would be to have ListenSyslog produce batches of 100 messages at a time,<br />
followed by SplitText, followed by ParseSyslog. The advantage here is that<br />
we can scale the different components independently, and take advantage of<br />
backpressure between processors.</p>
<h2>Summary</h2>
<p>
At this point you should be able to get your syslog messages ingested into<br />
HBase and can experiment with different configurations. The template for this flow can be found<br />
<a href="https://cwiki.apache.org/confluence/download/attachments/57904847/Syslog_HBase.xml?version=1&modificationDate=1449776959701&api=v2">here</a>.</p>
<p>
We would love to hear any questions, comments, or feedback that you may have!</p>
<p>
<a href="http://nifi.apache.org">Learn more about Apache NiFi</a> and feel<br />
free to leave comments here or e-mail us at dev@nifi.apache.org.</p>