_posts/2015-11-03-new_optimization_for_time_series.html (143 lines of code) (raw):
---
layout: post
status: PUBLISHED
published: true
title: New optimization for time series data in Apache Phoenix 4.6
id: 351fbd98-e9a8-437a-b51e-e9bcc16c8727
date: '2015-11-03 20:11:10 -0500'
categories: phoenix
tags:
- time
- optimization
- sql
- phoenix
- relational
- source
- series
- hbase
- open
- row_timestamp
- apache
- database
permalink: phoenix/entry/new_optimization_for_time_series
---
<p><em>Today's blog is brought to you by <strong>Samarth Jain</strong>, PMC member of Apache Phoenix, and Lead Member of the Technical Staff at <a href="www.salesforce.com" target="_blank">Salesforce.com</a>.</em></p>
<p><a href="https://phoenix.apache.org/" target="_blank" title="Apache Phoenix">Apache Phoenix</a> 4.6 now provides the capability of mapping a Phoenix primary key column to the native row timestamp of <a href="https://hbase.apache.org/" target="_blank" title="Apache HBase">Apache HBase</a>. The mapping is denoted by<br />
the keyword <strong>ROW_TIMESTAMP</strong> in the create table statement. Such a mapping<br />
provides the following two advantages: </p>
<ul>
<li>Allows Phoenix to set the min time range on scans since this column directly maps to the HBase cell timestamp.<br />
Presence of these time ranges lets HBase figure out which store files it<br />
should be scanning and which ones to skip. This comes in handy<br />
especially for temporal data when the queries are focused towards the<br />
tail end of the data.</li>
<li>Enables Phoenix to leverage the existing optimizations in place when querying against primary key columns.</li>
</ul>
<p>
Lets look at an example with some performance numbers to understand when a ROW_TIMESTAMP column could help.</p>
<h4 id="Sample_schema:" style="font-family: inherit; color: inherit; font-size: 24px; font-weight: 500; margin: 10.5px 0px; text-rendering: optimizeLegibility;">
Sample schema:</h4>
<p>For performance analysis, we created two identical tables, one with the new <strong>ROW_TIMESTAMP<span style="font-weight: normal;"> </span></strong>qualifier and one without. </p>
<p><font face="courier new, courier, monospace">CREATE TABLE <strong>EVENTS_RTS</strong> (<br /> EVENT_ID CHAR(15) NOT NULL,<br /> EVENT_TYPE CHAR(3) NOT NULL,<br /> EVENT_DATE DATE NOT NULL,<br /> APPLICATION_TYPE VARCHAR,<br /> SOURCE_IP VARCHAR<br /> CONSTRAINT PK PRIMARY KEY (<br /> EVENT_ID, <br /> EVENT_TYPE, <br /> EVENT_DATE <strong>ROW_TIMESTAMP</strong>))</font></p>
<p>The initial data load of 500 million records created data with the<br />
event_date set to dates over the last seven days. During the load,<br />
tables went through region splits and major compactions. After the<br />
initial load, we ran a mixed read/write workload with writes (new<br />
records) happening @500K records per hour. Each new row was created<br />
with EVENT_DATE as the current date/time.</p>
<p>
Three sets of queries were executed that filtered on the EVENT_DATE column:</p></p>
<ul>
<li>Newer than last hour's event data</li>
<li> Newer than last two day's event data</li>
<li> Outside of the time range of event data</li>
</ul>
<p>
For example, the following query would return the number of rows for the last hours worth of data:</p>
<p><font face="courier new, courier, monospace">SELECT COUNT(*) FROM EVENTS_RTS<br />WHERE EVENT_DATE > CURRENT_DATE() - 1/24<br />
</font></p>
<p>Below is the graph that shows variation of query times over the tail end of data (not major compacted) for the two tables</p>
<p> <img src="http://3.bp.blogspot.com/-IcwLAFSSqSY/VjggNsFgHfI/AAAAAAAAAdU/v3Xd2uoEWS0/s1600/Rowtimestamp.png" height="400" width="700" /></p>
<p>Below is a tabular summary of the various time ranges that were tested over the non-major compacted event data</p>
<table class="bodyTable table table-striped table-hover" style="border-collapse: collapse;">
<thead>
<tr class="a">
<th style="border-top-width: 0px; padding-right: 12px; text-align: center; vertical-align: bottom;">Time</th>
<th style="border-top-width: 0px; padding-right: 12px; text-align: center; vertical-align: bottom;">#</th>
<th colspan="2" style="border-top-width: 0px; padding-right: 12px; text-align: center; vertical-align: bottom;">Duration(ms)</th>
</tr>
</thead>
<tbody>
<tr class="a">
<th style="border-top-width: 0px; padding-right: 12px; text-align: center; vertical-align: top;">Range</th>
<th style="border-top-width: 0px; padding-right: 12px; text-align: center; vertical-align: top;">Rows Returned</th>
<th style="border-top-width: 0px; padding-right: 12px; text-align: center; vertical-align: top;">With Optimization</th>
<th style="border-top-width: 0px; padding-right: 12px; text-align: center; vertical-align: top;">Without Optimization</th>
</tr>
<tr class="b">
<td style="border-top-color: #dddddd; border-top-style: solid; border-top-width: 1px; line-height: 21px; padding: 8px; vertical-align: top; width: 300px; background-color: #f9f9f9;">CREATED IN LAST 1 MINUTE</td>
<td style="border-top-color: #dddddd; border-top-style: solid; border-top-width: 1px; line-height: 21px; padding: 8px; text-align: center; vertical-align: top; background-color: #f9f9f9;">16K</td>
<td style="border-top-color: #dddddd; border-top-style: solid; border-top-width: 1px; line-height: 21px; padding: 8px; text-align: center; vertical-align: top; background-color: #f9f9f9;">200</td>
<td style="border-top-color: #dddddd; border-top-style: solid; border-top-width: 1px; line-height: 21px; padding: 8px; text-align: center; vertical-align: top; background-color: #f9f9f9;">4000</td>
</tr>
<tr class="a">
<td style="border-top-color: #dddddd; border-top-style: solid; border-top-width: 1px; line-height: 21px; padding: 8px; vertical-align: top;">CREATED IN LAST 15 MINUTES</td>
<td style="border-top-color: #dddddd; border-top-style: solid; border-top-width: 1px; line-height: 21px; padding: 8px; text-align: center; vertical-align: top;">125K</td>
<td style="border-top-color: #dddddd; border-top-style: solid; border-top-width: 1px; line-height: 21px; padding: 8px; text-align: center; vertical-align: top;">700</td>
<td style="border-top-color: #dddddd; border-top-style: solid; border-top-width: 1px; line-height: 21px; padding: 8px; text-align: center; vertical-align: top;">130000</td>
</tr>
<tr class="b">
<td style="border-top-color: #dddddd; border-top-style: solid; border-top-width: 1px; line-height: 21px; padding: 8px; vertical-align: top; background-color: #f9f9f9;">CREATED IN LAST 1 HOUR</td>
<td style="border-top-color: #dddddd; border-top-style: solid; border-top-width: 1px; line-height: 21px; padding: 8px; text-align: center; vertical-align: top; background-color: #f9f9f9;">500K</td>
<td style="border-top-color: #dddddd; border-top-style: solid; border-top-width: 1px; line-height: 21px; padding: 8px; text-align: center; vertical-align: top; background-color: #f9f9f9;">2100</td>
<td style="border-top-color: #dddddd; border-top-style: solid; border-top-width: 1px; line-height: 21px; padding: 8px; text-align: center; vertical-align: top; background-color: #f9f9f9;">500000</td>
</tr>
<tr class="a">
<td style="border-top-color: #dddddd; border-top-style: solid; border-top-width: 1px; line-height: 21px; padding: 8px; vertical-align: top;">CREATED BEFORE LAST 8 DAYS</td>
<td style="border-top-color: #dddddd; border-top-style: solid; border-top-width: 1px; line-height: 21px; padding: 8px; text-align: center; vertical-align: top;">0</td>
<td style="border-top-color: #dddddd; border-top-style: solid; border-top-width: 1px; line-height: 21px; padding: 8px; text-align: center; vertical-align: top;">100</td>
<td style="border-top-color: #dddddd; border-top-style: solid; border-top-width: 1px; line-height: 21px; padding: 8px; text-align: center; vertical-align: top;">340000</td>
</tr>
</tbody>
</table>
<p>As you can see from the results, using a <strong>ROW_TIMESTAMP</strong> gives a huge perf<br />
boost when querying over data that hasn’t been major compacted. For<br />
already major compacted data, the two tables show the same performance<br />
(i.e. there is no degradation). The query returning 0 records is a<br />
special case in which the date range falls out of the data that was<br />
loaded to the tables. Such a query returns almost instantaneously for<br />
EVENTS_RTS (0.1 seconds). The same query on EVENTS_WITHOUT_RTS takes<br />
more than 300 seconds. This is because with the time range information<br />
available on scans, HBase was quickly able to figure out that no store<br />
files have data within the range yielding a near instant response.</p>
<h4 id="Effect_of_HBase_major_compaction" style="color: inherit; font-family: inherit; font-size: 24px; font-weight: 500; margin: 10.5px 0px; text-rendering: optimizeLegibility;">
Effect of HBase major compaction</h4>
<p>
The HBase store file (HFile) stores time range (min and max row<br />
timestamps) in its metadata. When a scan comes in, HBase is able to look<br />
at this metadata and figure out whether it should be scanning the store<br />
file for returning the records the query has requested. When writes are<br />
happening to an HBase table, after crossing a threshold size, contents<br />
of the memstore are flushed to an HFile. Now if the queries are against<br />
the newly created (tail-end of data) HFiles, one would see a huge perf<br />
boost when using the ROW_TIMESTAMP column. This is because, the scans<br />
issued by Phoenix would need to read only these newly created store<br />
files. On the other hand, queries not utilizing the row_timestamp column<br />
will have to potentially scan the entire table.</p>
<p>
The perf benefits are negated however, when HBase runs a major<br />
compaction on the table. In the default compaction policy, when number<br />
of HFiles exceeds a certain threshold or when a pre-determined time<br />
period crosses, HBase performs a major compaction to consolidate the<br />
number of store files in a region to one. This effectively ends up<br />
setting the time range of the lone store file to all the data contained<br />
within that region. As a result, scans are no longer able to filter out<br />
what store files to skip since the lone store file happens to contain<br />
all the data. Do note that in such a condition, the performance of the<br />
query with the row_timestamp column is the same as the one without.</p>
<p>In conclusion, if your table has a date based primary key and your<br />
queries are geared towards the tail-end of the data, you should think<br />
about using a row_timestamp column as it could yield huge performance<br />
gains.</p>
<h4 id="Potential_Future_Work" style="color: inherit; font-family: inherit; font-size: 24px; font-weight: 500; margin: 10.5px 0px; text-rendering: optimizeLegibility;">Potential Future Work</h4>
<p>One question you may be asking yourself is <em>Why does performance drop after a major compaction occurs? I thought performance was supposed to improve after compaction</em>. Time series data is different than other data in that it's typically write-once, append only. There are ways that this property of the data can be exploited such that better performance is maintained. For some excellent ideas along these lines, see Vladimir Rodionov's presentation from a previous HBase Meetup <a href="http://files.meetup.com/1350427/TimeSeriesHBase.pptx" target="_blank" title="presentation">here</a>.</p>