_posts/2015-11-03-new_optimization_for_time_series.html (143 lines of code) (raw):

--- layout: post status: PUBLISHED published: true title: New optimization for time series data in Apache Phoenix 4.6 id: 351fbd98-e9a8-437a-b51e-e9bcc16c8727 date: '2015-11-03 20:11:10 -0500' categories: phoenix tags: - time - optimization - sql - phoenix - relational - source - series - hbase - open - row_timestamp - apache - database permalink: phoenix/entry/new_optimization_for_time_series --- <p><em>Today's blog is brought to you by&nbsp;<strong>Samarth Jain</strong>, PMC member of Apache Phoenix, and Lead Member of the Technical Staff at <a href="www.salesforce.com" target="_blank">Salesforce.com</a>.</em></p> <p><a href="https://phoenix.apache.org/" target="_blank" title="Apache Phoenix">Apache Phoenix</a> 4.6 now provides the capability of mapping a Phoenix primary key column to the native row timestamp of <a href="https://hbase.apache.org/" target="_blank" title="Apache HBase">Apache HBase</a>. The mapping is denoted by<br /> the keyword <strong>ROW_TIMESTAMP</strong> in the create table statement. Such a mapping<br /> provides the following two advantages:&nbsp;</p> <ul> <li>Allows Phoenix to set the min time range on scans&nbsp;since this column directly maps to the HBase cell timestamp.<br /> Presence of these time ranges lets HBase figure out which store files it<br /> should be scanning and which ones to skip. This comes in handy<br /> especially for temporal data when the queries are focused towards the<br /> tail end of the data.</li> <li>Enables Phoenix to leverage the existing optimizations in place when querying against primary key columns.</li> </ul> <p> Lets look at an example with some performance numbers to understand when a ROW_TIMESTAMP column could help.</p> <h4 id="Sample_schema:" style="font-family: inherit; color: inherit; font-size: 24px; font-weight: 500; margin: 10.5px 0px; text-rendering: optimizeLegibility;"> Sample schema:</h4> <p>For performance analysis, we created two identical tables, one with the new&nbsp;<strong>ROW_TIMESTAMP<span style="font-weight: normal;"> </span></strong>qualifier and one without.&nbsp;</p> <p><font face="courier new, courier, monospace">CREATE TABLE&nbsp;<strong>EVENTS_RTS</strong> (<br />&nbsp; &nbsp; EVENT_ID CHAR(15) NOT NULL,<br />&nbsp; &nbsp; EVENT_TYPE CHAR(3) NOT NULL,<br />&nbsp; &nbsp; EVENT_DATE DATE NOT NULL,<br />&nbsp; &nbsp; APPLICATION_TYPE VARCHAR,<br />&nbsp; &nbsp; SOURCE_IP VARCHAR<br />&nbsp; &nbsp; CONSTRAINT PK PRIMARY KEY (<br />&nbsp; &nbsp; &nbsp; &nbsp; EVENT_ID, <br />&nbsp; &nbsp; &nbsp; &nbsp; EVENT_TYPE, <br />&nbsp; &nbsp; &nbsp; &nbsp; EVENT_DATE <strong>ROW_TIMESTAMP</strong>))</font></p> <p>The initial data load of 500 million records created data with the<br /> event_date set to dates over the last seven days. During the load,<br /> tables went through region splits and major compactions. After the<br /> initial load, we ran a mixed read/write workload with writes (new<br /> records) happening @500K records per hour. Each new row was created<br /> with EVENT_DATE as the current date/time.</p> <p> Three sets of queries were executed that filtered on the EVENT_DATE column:</p></p> <ul> <li>Newer than last hour's event data</li> <li> Newer than last two day's event data</li> <li> Outside of the time range of event data</li> </ul> <p> For example, the following query would return the number of rows for the last hours worth of data:</p> <p><font face="courier new, courier, monospace">SELECT COUNT(*) FROM EVENTS_RTS<br />WHERE EVENT_DATE > CURRENT_DATE() - 1/24<br /> </font></p> <p>Below is the graph that shows variation of query times over the tail end of data (not major compacted) for the two tables</p> <p> <img src="http://3.bp.blogspot.com/-IcwLAFSSqSY/VjggNsFgHfI/AAAAAAAAAdU/v3Xd2uoEWS0/s1600/Rowtimestamp.png" height="400" width="700" /></p> <p>Below is a tabular summary of the various time ranges that were tested over the non-major compacted event data</p> <table class="bodyTable table table-striped table-hover" style="border-collapse: collapse;"> <thead> <tr class="a"> <th style="border-top-width: 0px; padding-right: 12px; text-align: center; vertical-align: bottom;">Time</th> <th style="border-top-width: 0px; padding-right: 12px; text-align: center; vertical-align: bottom;">#</th> <th colspan="2" style="border-top-width: 0px; padding-right: 12px; text-align: center; vertical-align: bottom;">Duration(ms)</th> </tr> </thead> <tbody> <tr class="a"> <th style="border-top-width: 0px; padding-right: 12px; text-align: center; vertical-align: top;">Range</th> <th style="border-top-width: 0px; padding-right: 12px; text-align: center; vertical-align: top;">Rows Returned</th> <th style="border-top-width: 0px; padding-right: 12px; text-align: center; vertical-align: top;">With Optimization</th> <th style="border-top-width: 0px; padding-right: 12px; text-align: center; vertical-align: top;">Without Optimization</th> </tr> <tr class="b"> <td style="border-top-color: #dddddd; border-top-style: solid; border-top-width: 1px; line-height: 21px; padding: 8px; vertical-align: top; width: 300px; background-color: #f9f9f9;">CREATED IN LAST 1 MINUTE</td> <td style="border-top-color: #dddddd; border-top-style: solid; border-top-width: 1px; line-height: 21px; padding: 8px; text-align: center; vertical-align: top; background-color: #f9f9f9;">16K</td> <td style="border-top-color: #dddddd; border-top-style: solid; border-top-width: 1px; line-height: 21px; padding: 8px; text-align: center; vertical-align: top; background-color: #f9f9f9;">200</td> <td style="border-top-color: #dddddd; border-top-style: solid; border-top-width: 1px; line-height: 21px; padding: 8px; text-align: center; vertical-align: top; background-color: #f9f9f9;">4000</td> </tr> <tr class="a"> <td style="border-top-color: #dddddd; border-top-style: solid; border-top-width: 1px; line-height: 21px; padding: 8px; vertical-align: top;">CREATED IN LAST 15 MINUTES</td> <td style="border-top-color: #dddddd; border-top-style: solid; border-top-width: 1px; line-height: 21px; padding: 8px; text-align: center; vertical-align: top;">125K</td> <td style="border-top-color: #dddddd; border-top-style: solid; border-top-width: 1px; line-height: 21px; padding: 8px; text-align: center; vertical-align: top;">700</td> <td style="border-top-color: #dddddd; border-top-style: solid; border-top-width: 1px; line-height: 21px; padding: 8px; text-align: center; vertical-align: top;">130000</td> </tr> <tr class="b"> <td style="border-top-color: #dddddd; border-top-style: solid; border-top-width: 1px; line-height: 21px; padding: 8px; vertical-align: top; background-color: #f9f9f9;">CREATED IN LAST 1 HOUR</td> <td style="border-top-color: #dddddd; border-top-style: solid; border-top-width: 1px; line-height: 21px; padding: 8px; text-align: center; vertical-align: top; background-color: #f9f9f9;">500K</td> <td style="border-top-color: #dddddd; border-top-style: solid; border-top-width: 1px; line-height: 21px; padding: 8px; text-align: center; vertical-align: top; background-color: #f9f9f9;">2100</td> <td style="border-top-color: #dddddd; border-top-style: solid; border-top-width: 1px; line-height: 21px; padding: 8px; text-align: center; vertical-align: top; background-color: #f9f9f9;">500000</td> </tr> <tr class="a"> <td style="border-top-color: #dddddd; border-top-style: solid; border-top-width: 1px; line-height: 21px; padding: 8px; vertical-align: top;">CREATED BEFORE LAST 8 DAYS</td> <td style="border-top-color: #dddddd; border-top-style: solid; border-top-width: 1px; line-height: 21px; padding: 8px; text-align: center; vertical-align: top;">0</td> <td style="border-top-color: #dddddd; border-top-style: solid; border-top-width: 1px; line-height: 21px; padding: 8px; text-align: center; vertical-align: top;">100</td> <td style="border-top-color: #dddddd; border-top-style: solid; border-top-width: 1px; line-height: 21px; padding: 8px; text-align: center; vertical-align: top;">340000</td> </tr> </tbody> </table> <p>As you can see from the results, using a <strong>ROW_TIMESTAMP</strong> gives a huge perf<br /> boost when querying over data that hasn&rsquo;t been major compacted. For<br /> already major compacted data, the two tables show the same performance<br /> (i.e. there is no degradation). The query returning 0 records is a<br /> special case in which the date range falls out of the data that was<br /> loaded to the tables. Such a query returns almost instantaneously for<br /> EVENTS_RTS (0.1 seconds). The same query on EVENTS_WITHOUT_RTS takes<br /> more than 300 seconds. This is because with the time range information<br /> available on scans, HBase was quickly able to figure out that no store<br /> files have data within the range yielding a near instant response.</p> <h4 id="Effect_of_HBase_major_compaction" style="color: inherit; font-family: inherit; font-size: 24px; font-weight: 500; margin: 10.5px 0px; text-rendering: optimizeLegibility;"> Effect of HBase major compaction</h4> <p> The HBase store file (HFile) stores time range (min and max row<br /> timestamps) in its metadata. When a scan comes in, HBase is able to look<br /> at this metadata and figure out whether it should be scanning the store<br /> file for returning the records the query has requested. When writes are<br /> happening to an HBase table, after crossing a threshold size, contents<br /> of the memstore are flushed to an HFile. Now if the queries are against<br /> the newly created (tail-end of data) HFiles, one would see a huge perf<br /> boost when using the ROW_TIMESTAMP column. This is because, the scans<br /> issued by Phoenix would need to read only these newly created store<br /> files. On the other hand, queries not utilizing the row_timestamp column<br /> will have to potentially scan the entire table.</p> <p> The perf benefits are negated however, when HBase runs a major<br /> compaction on the table. In the default compaction policy, when number<br /> of HFiles exceeds a certain threshold or when a pre-determined time<br /> period crosses, HBase performs a major compaction to consolidate the<br /> number of store files in a region to one. This effectively ends up<br /> setting the time range of the lone store file to all the data contained<br /> within that region. As a result, scans are no longer able to filter out<br /> what store files to skip since the lone store file happens to contain<br /> all the data. Do note that in such a condition, the performance of the<br /> query with the row_timestamp column is the same as the one without.</p> <p>In conclusion, if your table has a date based primary key and your<br /> queries are geared towards the tail-end of the data, you should think<br /> about using a row_timestamp column as it could yield huge performance<br /> gains.</p> <h4 id="Potential_Future_Work" style="color: inherit; font-family: inherit; font-size: 24px; font-weight: 500; margin: 10.5px 0px; text-rendering: optimizeLegibility;">Potential Future Work</h4> <p>One question you may be asking yourself is <em>Why does performance drop after a major compaction occurs? I thought performance was supposed to improve after compaction</em>. Time series data is different than other data in that it's typically write-once, append only. There are ways that this property of the data can be exploited such that better performance is maintained. For some excellent ideas along these lines, see&nbsp;Vladimir Rodionov's presentation from a previous HBase Meetup <a href="http://files.meetup.com/1350427/TimeSeriesHBase.pptx" target="_blank" title="presentation">here</a>.</p>