---
layout: post
status: PUBLISHED
published: true
title: The Effect of ColumnFamily, RowKey and KeyValue Design on HFile Size
id: 6ac717dd-f6ad-457d-a2b9-7dae88a5733e
date: '2014-04-11 18:29:54 -0400'
categories: hbase
tags:
- row
- keys
- hfile
- hbase
permalink: hbase/entry/the_effect_of_columnfamily_rowkey
---
<p>By Doug Meil, HBase Committer, and Thomas Murphy</p>
<p style="margin-bottom: 0in; line-height: 100%;"><i><b>Intro</b></i></p>
<p style="margin-bottom: 0in; line-height: 100%;"><span style="line-height: 100%;">One of the most<br />
common questions in the HBase user community is how to estimate the disk<br />
footprint of a table, which comes down to the size of its HFiles, the<br />
internal file format in HBase.</span></p>
<p style="margin-bottom: 0in; line-height: 100%;"><span style="line-height: 100%;">We designed an<br />
experiment at Explorys in which we ran combinations of design-time<br />
options (rowkey length, column name length, row storage approach) and<br />
runtime options (HBase ColumnFamily compression, HBase data block<br />
encoding) to determine their effect on the<br />
resulting HFile size in HDFS.</span></p>
<p style="margin-bottom: 0in; line-height: 100%;"><b>HBase Environment</b></p>
<p style="margin-bottom: 0in; line-height: 100%;">CDH4.3.0 (HBase<br />
0.94.6.1)</p>
<p style="margin-bottom: 0in; line-height: 100%;"><b>Design Time<br />
Choices</b></p>
<ol>
<li>
<p style="margin-bottom: 0in; line-height: 100%;"><u>Rowkey</u></p>
<ol type="a">
<li>
<p style="margin-bottom: 0in; line-height: 100%;">Thin</p>
<ol type="i">
<li>
<p style="margin-bottom: 0in; line-height: 100%;">16-byte MD5<br />
hash of an integer.
</p>
</li>
</ol>
</li>
<li>
<p style="margin-bottom: 0in; line-height: 100%;">Fat
</p>
<ol type="i">
<li>
<p style="margin-bottom: 0in; line-height: 100%;">64-byte<br />
SHA-256 hash of an integer.</p>
</li>
</ol>
</li>
</ol>
</li>
</ol>
<ol>
<ol type="a" start="3">
<li>
<p style="margin-bottom: 0in; line-height: 100%;">Note: neither<br />
of these is a realistic rowkey for a real application, but they<br />
were chosen because they are easy to generate and one is a lot bigger<br />
than the other.</p>
</li>
</ol>
</ol>
<ol start="2">
<li>
<p style="margin-bottom: 0in; line-height: 100%;"><u>Column Names</u></p>
<ol type="a">
<li>
<p style="margin-bottom: 0in; line-height: 100%;">Thin</p>
<ol type="i">
<li>
<p style="margin-bottom: 0in; line-height: 100%;">2-3 character<br />
column names (c1, c2).</p>
</li>
</ol>
</li>
<li>
<p style="margin-bottom: 0in; line-height: 100%;">Fat</p>
<ol type="i">
<li>
<p style="margin-bottom: 0in; line-height: 100%;">10<br />
characters, randomly chosen but consistent for all rows.</p>
</li>
</ol>
</li>
</ol>
</li>
</ol>
<ol>
<ol type="a" start="3">
<li>
<p style="margin-bottom: 0in; line-height: 100%;">Note: it is<br />
advisable to have small column names, but most people don’t start<br />
that way, so we included long names as an option.</p>
</li>
</ol>
</ol>
<ol start="3">
<li>
<p style="margin-bottom: 0in; line-height: 100%;"><u>Row Storage<br />
Approach</u></p>
<ol type="a">
<li>
<p style="margin-bottom: 0in; line-height: 100%;">Key Value Per<br />
Column</p>
<ol type="i">
<li>
<p style="margin-bottom: 0in; line-height: 100%;">This is the<br />
traditional way of storing data in HBase.</p>
</li>
</ol>
</li>
<li>
<p style="margin-bottom: 0in; line-height: 100%;">One Key Value<br />
per row.</p>
<ol type="i">
<li>
<p style="margin-bottom: 0in; line-height: 100%;">Actually,<br />
two.
</p>
</li>
<li>
<p style="margin-bottom: 0in; line-height: 100%;">One KV has an<br />
Avro serialized byte array containing all the data from the row.</p>
</li>
<li>
<p style="margin-bottom: 0in; line-height: 100%;">Another KV<br />
holds an MD5 hash of the version of the Avro schema.</p>
</li>
</ol>
</li>
</ol>
</li>
</ol>
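<p style="margin-bottom: 0in; line-height: 100%;">To see why the row storage approach matters, here is a rough Python sketch that counts raw KeyValue bytes for a single row under each approach. It assumes the classic KeyValue layout without tags (4-byte key length, 4-byte value length, 2-byte row length, row, 1-byte family length, family, qualifier, 8-byte timestamp, 1-byte type, value), a hypothetical 1-byte column family name, 4-byte integers, 8-byte longs, and a plain 320-byte blob standing in for the Avro record (real Avro output would differ in size). The 30-column row matches the test data described below. The sketch ignores HFile block, index and trailer overhead, so it is not expected to reproduce the measured sizes in this post.</p>

```python
def keyvalue_size(row_len, family_len, qualifier_len, value_len):
    """Bytes for one KeyValue in the classic layout (no tags)."""
    # Key part: row length (2) + row + family length (1) + family
    # + qualifier + timestamp (8) + key type (1).
    key_len = 2 + row_len + 1 + family_len + qualifier_len + 8 + 1
    # Two 4-byte length fields precede the key and the value.
    return 4 + 4 + key_len + value_len


def row_size_kv_per_column(row_len, family_len, qualifier_len, value_lens):
    """One KeyValue per column: the rowkey is repeated in every cell."""
    return sum(keyvalue_size(row_len, family_len, qualifier_len, v)
               for v in value_lens)


def row_size_single_blob(row_len, family_len, qualifier_len, blob_len,
                         hash_len=16):
    """Two KeyValues per row: a serialized blob plus a 16-byte schema hash."""
    return (keyvalue_size(row_len, family_len, qualifier_len, blob_len)
            + keyvalue_size(row_len, family_len, qualifier_len, hash_len))


# 10 strings of 20 bytes, 10 ints (assumed 4 bytes), 10 longs (assumed 8 bytes).
VALUE_LENS = [20] * 10 + [4] * 10 + [8] * 10

# Fat rowkey (64 bytes), 1-byte family (assumption), 10-byte column names.
per_column = row_size_kv_per_column(64, 1, 10, VALUE_LENS)
single_blob = row_size_single_blob(64, 1, 10, sum(VALUE_LENS))
print(per_column, single_blob)
```

<p style="margin-bottom: 0in; line-height: 100%;">Under these assumptions most of the per-column total is key overhead rather than values, which is the intuition behind packing a row into one or two KeyValues.</p>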
<p style="margin-bottom: 0in; line-height: 100%;"><b>Run Time</b></p>
<ol>
<li>
<p style="margin-bottom: 0in; line-height: 100%;"><u>Column<br />
Family Compression</u></p>
<ol type="a">
<li>
<p style="margin-bottom: 0in; line-height: 100%;">None</p>
</li>
<li>
<p style="margin-bottom: 0in; line-height: 100%;">GZ</p>
</li>
<li>
<p style="margin-bottom: 0in; line-height: 100%;">LZ4</p>
</li>
<li>
<p style="margin-bottom: 0in; line-height: 100%;">LZO</p>
</li>
<li>
<p style="margin-bottom: 0in; line-height: 100%;">Snappy</p>
</li>
</ol>
</li>
</ol>
<p style="margin-left: 1in; margin-bottom: 0in; line-height: 100%;"> </p>
<ol>
<ol type="a" start="6">
<li>
<p style="margin-bottom: 0in; line-height: 100%;">Note: it is<br />
generally advisable to use compression, but what if you didn’t?<br />
So we tested that too.</p>
</li>
</ol>
</ol>
<ol start="2">
<li>
<p style="margin-bottom: 0in; line-height: 100%;"><u>HBase Block<br />
Encoding</u></p>
<ol type="a">
<li>
<p style="margin-bottom: 0in; line-height: 100%;">None</p>
</li>
<li>
<p style="margin-bottom: 0in; line-height: 100%;">Prefix</p>
</li>
<li>
<p style="margin-bottom: 0in; line-height: 100%;">Diff</p>
</li>
<li>
<p style="margin-bottom: 0in; line-height: 100%;">Fast Diff</p>
</li>
</ol>
</li>
</ol>
<ol>
<ol type="a" start="5">
<li>
<p style="margin-bottom: 0in; line-height: 100%;">Note: most<br />
people aren’t familiar with HBase Data Block Encoding. Primarily<br />
intended for squeezing more data into the block cache, it has<br />
effects on HFile size too. See HBASE-4218 for more detail.</p>
</li>
</ol>
</ol>
<p style="margin-bottom: 0in; line-height: 100%;">1000 rows were<br />
generated for each combination of table parameters. Not a ton of<br />
data, but we don’t necessarily need a ton of data to see the<br />
varying size of the table. Each row had 30 columns:<br />
10 strings (each filled with 20 bytes of random characters), 10<br />
integers (random numbers) and 10 longs (also random numbers).</p>
<p style="margin-bottom: 0in; line-height: 100%;">HBase blocksize was<br />
128 KB.</p>
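<p style="margin-bottom: 0in; line-height: 100%;">The test data described above could be generated along the following lines. This is a minimal Python sketch, not the generator used in the experiment. Several details are assumptions: the integer is hashed as its decimal string, the 64-byte “fat” rowkey is taken to be the SHA-256 digest written as 64 hex characters (a raw SHA-256 digest is 32 bytes), and the helper names and random seed are made up for illustration.</p>

```python
import hashlib
import random
import string

NUM_ROWS = 1000
RNG = random.Random(42)  # fixed seed so the sketch is repeatable


def thin_rowkey(i):
    """16-byte raw MD5 digest of the integer's decimal string (assumed input)."""
    return hashlib.md5(str(i).encode("ascii")).digest()


def fat_rowkey(i):
    """64-byte key: SHA-256 digest as 64 hex characters (assumed encoding)."""
    return hashlib.sha256(str(i).encode("ascii")).hexdigest().encode("ascii")


def random_text(length):
    """Random ASCII letters of the requested length."""
    return "".join(RNG.choice(string.ascii_letters) for _ in range(length))


# Fat column names: 10 random characters, chosen once and reused for all rows.
FAT_COLUMNS = [random_text(10) for _ in range(30)]
# Thin column names: c1 .. c30 (2-3 characters each).
THIN_COLUMNS = ["c%d" % n for n in range(1, 31)]


def make_row(columns):
    """30 values: 10 strings of 20 characters, 10 ints, 10 longs."""
    values = ([random_text(20) for _ in range(10)]
              + [RNG.randint(0, 2**31 - 1) for _ in range(10)]
              + [RNG.randint(0, 2**63 - 1) for _ in range(10)])
    return dict(zip(columns, values))


# One of the design combinations: fat rowkeys with fat column names.
rows = {fat_rowkey(i): make_row(FAT_COLUMNS) for i in range(NUM_ROWS)}
print(len(rows), len(thin_rowkey(0)), len(fat_rowkey(0)))
```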
<p style="margin-bottom: 0in; line-height: 100%;"></p>
<p style="margin-bottom: 0in; line-height: 100%;"><i><b>Results</b></i></p>
<p style="margin-bottom: 0in; line-height: 100%;">The easiest way to<br />
navigate the results is to compare specific cases, progressing from<br />
an initial implementation of a table to options for production.</p>
<p style="margin-bottom: 0in; line-height: 100%;"><b>Case #1: Fat<br />
Rowkey and Fat Column Names, Now What?</b></p>
<p style="margin-bottom: 0in; line-height: 100%;">This is where most<br />
people start with HBase. Rowkeys are longer than they need to<br />
be (i.e., the Fat rowkey case) and column names are also inflated<br />
(Fat column names).</p>
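<p style="margin-bottom: 0in; line-height: 100%;">Each results table lists, in order, the HFile size in bytes,<br />
the number of rows, the CF compression and the data block encoding;<br />
some rows are preceded by the generated table name. Without CF<br />
compression or data block encoding, the baseline is:</p>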
<table width="712" cellpadding="8" cellspacing="0">
<colgroup>
<col width="299" />
<col width="83" />
<col width="83" />
<col width="83" />
<col width="83" /> </colgroup>
<tbody>
<tr valign="bottom">
<td width="299" height="7" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p><font face="Calibri, serif"><font style="font-size: 12pt;">psudorandom-table-R1000-i10-s10_20-l10-NONE-NONE-FATKEY-FATCOL</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p align="right"><font face="Calibri, serif"><font style="font-size: 12pt;">6,293,670</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p align="right"><font face="Calibri, serif"><font style="font-size: 12pt;">1000</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p><font face="Calibri, serif"><font style="font-size: 12pt;">NONE</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p><font face="Calibri, serif"><font style="font-size: 12pt;">NONE</font></font></p>
</td>
</tr>
</tbody>
</table>
<p style="margin-bottom: 0in; line-height: 100%;"><i>What if we just<br />
changed CF compression?</i></p>
<p style="margin-bottom: 0in; line-height: 100%;">This drastically<br />
changes the HFile footprint. Snappy compression reduces the HFile<br />
size from 6.3 MB to 1.8 MB, for example.</p>
<table width="397" cellpadding="8" cellspacing="0">
<colgroup>
<col width="83" />
<col width="83" />
<col width="83" />
<col width="83" /> </colgroup>
<tbody>
<tr valign="bottom">
<td width="83" height="7" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p align="right"><font face="Calibri, serif"><font style="font-size: 12pt;">1,362,033</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p align="right"><font face="Calibri, serif"><font style="font-size: 12pt;">1000</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p><font face="Calibri, serif"><font style="font-size: 12pt;">GZ</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p><font face="Calibri, serif"><font style="font-size: 12pt;">NONE</font></font></p>
</td>
</tr>
<tr valign="bottom">
<td width="83" height="7" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p align="right"><font face="Calibri, serif"><font style="font-size: 12pt;">1,803,240</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p align="right"><font face="Calibri, serif"><font style="font-size: 12pt;">1000</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p><font face="Calibri, serif"><font style="font-size: 12pt;">SNAPPY</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p><font face="Calibri, serif"><font style="font-size: 12pt;">NONE</font></font></p>
</td>
</tr>
<tr valign="bottom">
<td width="83" height="7" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p align="right"><font face="Calibri, serif"><font style="font-size: 12pt;">1,919,265</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p align="right"><font face="Calibri, serif"><font style="font-size: 12pt;">1000</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p><font face="Calibri, serif"><font style="font-size: 12pt;">LZ4</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p><font face="Calibri, serif"><font style="font-size: 12pt;">NONE</font></font></p>
</td>
</tr>
<tr valign="bottom">
<td width="83" height="7" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p align="right"><font face="Calibri, serif"><font style="font-size: 12pt;">1,950,306</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p align="right"><font face="Calibri, serif"><font style="font-size: 12pt;">1000</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p><font face="Calibri, serif"><font style="font-size: 12pt;">LZO</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p><font face="Calibri, serif"><font style="font-size: 12pt;">NONE</font></font></p>
</td>
</tr>
</tbody>
</table>
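<p style="margin-bottom: 0in; line-height: 100%;">To compare the codecs on equal footing, the byte counts in the table above can be expressed as a percentage of the uncompressed baseline. The short Python sketch below does only that arithmetic; the function and variable names are made up for illustration.</p>

```python
BASELINE_BYTES = 6293670  # fat rowkey, fat column names, no compression

# HFile sizes in bytes with CF compression only (no data block encoding).
COMPRESSED_BYTES = {
    "GZ": 1362033,
    "SNAPPY": 1803240,
    "LZ4": 1919265,
    "LZO": 1950306,
}


def percent_of_baseline(size_bytes, baseline_bytes=BASELINE_BYTES):
    """Size as a percentage of the baseline, rounded to one decimal place."""
    return round(100.0 * size_bytes / baseline_bytes, 1)


# Print the codecs from smallest to largest HFile.
for codec, size in sorted(COMPRESSED_BYTES.items(), key=lambda kv: kv[1]):
    print(codec, size, percent_of_baseline(size))
```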
<p style="margin-bottom: 0in; line-height: 100%;">However, we<br />
shouldn’t be <i>too</i> quick to celebrate. Remember that this is<br />
just the <i>disk</i> footprint. Over the wire the data is<br />
uncompressed, so 6.3 MB is still being transferred from RegionServer<br />
to Client when doing a Scan over the entire table.</p>
<p style="margin-bottom: 0in; line-height: 100%;"><i>What if we just<br />
changed data block encoding?</i></p>
<p style="margin-bottom: 0in; line-height: 100%;">Compression isn’t<br />
the only option though. Even without compression, we can change the<br />
data block encoding and still shrink the HFile. All options<br />
are an improvement over the 6.3 MB baseline.</p>
<p style="margin-bottom: 0in; line-height: 100%;"> </p>
<table width="397" cellpadding="8" cellspacing="0">
<colgroup>
<col width="83" />
<col width="83" />
<col width="83" />
<col width="83" /> </colgroup>
<tbody>
<tr valign="bottom">
<td width="83" height="7" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p align="right"><font face="Calibri, serif"><font style="font-size: 12pt;">1,491,000</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p align="right"><font face="Calibri, serif"><font style="font-size: 12pt;">1000</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p><font face="Calibri, serif"><font style="font-size: 12pt;">NONE</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p><font face="Calibri, serif"><font style="font-size: 12pt;">DIFF</font></font></p>
</td>
</tr>
<tr valign="bottom">
<td width="83" height="7" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p align="right"><font face="Calibri, serif"><font style="font-size: 12pt;">1,492,155</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p align="right"><font face="Calibri, serif"><font style="font-size: 12pt;">1000</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p><font face="Calibri, serif"><font style="font-size: 12pt;">NONE</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p><font face="Calibri, serif"><font style="font-size: 12pt;">FAST_DIFF</font></font></p>
</td>
</tr>
<tr valign="bottom">
<td width="83" height="7" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p align="right"><font face="Calibri, serif"><font style="font-size: 12pt;">2,244,963</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p align="right"><font face="Calibri, serif"><font style="font-size: 12pt;">1000</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p><font face="Calibri, serif"><font style="font-size: 12pt;">NONE</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p><font face="Calibri, serif"><font style="font-size: 12pt;">PREFIX</font></font></p>
</td>
</tr>
</tbody>
</table>
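<p style="margin-bottom: 0in; line-height: 100%;">The idea behind prefix-style data block encoding can be illustrated with a toy Python sketch: because cells are stored in sorted order, each key shares a long prefix with the key before it, so only the shared length and the differing suffix need to be stored. This is a simplified model, not HBase’s actual on-disk format, and the key layout (rowkey, a slash, a column name) and the one-byte length field are assumptions made for the example. DIFF and FAST_DIFF apply further tricks that are not shown here.</p>

```python
import hashlib


def common_prefix_len(a, b):
    """Number of leading bytes that a and b have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def prefix_encode(sorted_keys):
    """Store (shared_prefix_length, suffix) for each key in sorted order."""
    encoded, previous = [], b""
    for key in sorted_keys:
        shared = common_prefix_len(previous, key)
        encoded.append((shared, key[shared:]))
        previous = key
    return encoded


def prefix_decode(encoded):
    """Rebuild the original keys from (shared_prefix_length, suffix) pairs."""
    keys, previous = [], b""
    for shared, suffix in encoded:
        key = previous[:shared] + suffix
        keys.append(key)
        previous = key
    return keys


# Toy cell keys: 32-character rowkey + "/colNN", 3 rows of 4 columns each.
rowkeys = [hashlib.md5(str(i).encode("ascii")).hexdigest().encode("ascii")
           for i in range(3)]
keys = sorted(rk + b"/col%02d" % c for rk in rowkeys for c in range(4))

encoded = prefix_encode(keys)
raw_bytes = sum(len(k) for k in keys)
# Assume one extra byte per key to record the shared prefix length.
encoded_bytes = sum(1 + len(suffix) for _, suffix in encoded)
print(raw_bytes, encoded_bytes)
```

<p style="margin-bottom: 0in; line-height: 100%;">Within a row, consecutive cells repeat the whole rowkey, which is why long rowkeys benefit so much from this kind of encoding.</p>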
<p style="margin-bottom: 0in; line-height: 100%;"><i>Combination</i></p>
<p style="margin-bottom: 0in; line-height: 100%;">The following table<br />
shows the results of all remaining CF compression / data block<br />
encoding combinations.</p>
<table width="397" cellpadding="8" cellspacing="0">
<colgroup>
<col width="83" />
<col width="83" />
<col width="83" />
<col width="83" /> </colgroup>
<tbody>
<tr valign="bottom">
<td width="83" height="7" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p align="right"><font face="Calibri, serif"><font style="font-size: 12pt;">1,146,675</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p align="right"><font face="Calibri, serif"><font style="font-size: 12pt;">1000</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p><font face="Calibri, serif"><font style="font-size: 12pt;">GZ</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p><font face="Calibri, serif"><font style="font-size: 12pt;">DIFF</font></font></p>
</td>
</tr>
<tr valign="bottom">
<td width="83" height="7" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p align="right"><font face="Calibri, serif"><font style="font-size: 12pt;">1,200,471</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p align="right"><font face="Calibri, serif"><font style="font-size: 12pt;">1000</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p><font face="Calibri, serif"><font style="font-size: 12pt;">GZ</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p><font face="Calibri, serif"><font style="font-size: 12pt;">FAST_DIFF</font></font></p>
</td>
</tr>
<tr valign="bottom">
<td width="83" height="7" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p align="right"><font face="Calibri, serif"><font style="font-size: 12pt;">1,274,265</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p align="right"><font face="Calibri, serif"><font style="font-size: 12pt;">1000</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p><font face="Calibri, serif"><font style="font-size: 12pt;">GZ</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p><font face="Calibri, serif"><font style="font-size: 12pt;">PREFIX</font></font></p>
</td>
</tr>
<tr valign="bottom">
<td width="83" height="7" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p align="right"><font face="Calibri, serif"><font style="font-size: 12pt;">1,350,483</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p align="right"><font face="Calibri, serif"><font style="font-size: 12pt;">1000</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p><font face="Calibri, serif"><font style="font-size: 12pt;">SNAPPY</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p><font face="Calibri, serif"><font style="font-size: 12pt;">DIFF</font></font></p>
</td>
</tr>
<tr valign="bottom">
<td width="83" height="7" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p align="right"><font face="Calibri, serif"><font style="font-size: 12pt;">1,358,190</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p align="right"><font face="Calibri, serif"><font style="font-size: 12pt;">1000</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p><font face="Calibri, serif"><font style="font-size: 12pt;">LZ4</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p><font face="Calibri, serif"><font style="font-size: 12pt;">DIFF</font></font></p>
</td>
</tr>
<tr valign="bottom">
<td width="83" height="7" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p align="right"><font face="Calibri, serif"><font style="font-size: 12pt;">1,391,016</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p align="right"><font face="Calibri, serif"><font style="font-size: 12pt;">1000</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p><font face="Calibri, serif"><font style="font-size: 12pt;">SNAPPY</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p><font face="Calibri, serif"><font style="font-size: 12pt;">FAST_DIFF</font></font></p>
</td>
</tr>
<tr valign="bottom">
<td width="83" height="7" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p align="right"><font face="Calibri, serif"><font style="font-size: 12pt;">1,402,614</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p align="right"><font face="Calibri, serif"><font style="font-size: 12pt;">1000</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p><font face="Calibri, serif"><font style="font-size: 12pt;">LZ4</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p><font face="Calibri, serif"><font style="font-size: 12pt;">FAST_DIFF</font></font></p>
</td>
</tr>
<tr valign="bottom">
<td width="83" height="7" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p align="right"><font face="Calibri, serif"><font style="font-size: 12pt;">1,406,334</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p align="right"><font face="Calibri, serif"><font style="font-size: 12pt;">1000</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p><font face="Calibri, serif"><font style="font-size: 12pt;">LZO</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p><font face="Calibri, serif"><font style="font-size: 12pt;">FAST_DIFF</font></font></p>
</td>
</tr>
<tr valign="bottom">
<td width="83" height="7" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p align="right"><font face="Calibri, serif"><font style="font-size: 12pt;">1,541,151</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p align="right"><font face="Calibri, serif"><font style="font-size: 12pt;">1000</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p><font face="Calibri, serif"><font style="font-size: 12pt;">SNAPPY</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p><font face="Calibri, serif"><font style="font-size: 12pt;">PREFIX</font></font></p>
</td>
</tr>
<tr valign="bottom">
<td width="83" height="7" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p align="right"><font face="Calibri, serif"><font style="font-size: 12pt;">1,597,440</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p align="right"><font face="Calibri, serif"><font style="font-size: 12pt;">1000</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p><font face="Calibri, serif"><font style="font-size: 12pt;">LZO</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p><font face="Calibri, serif"><font style="font-size: 12pt;">PREFIX</font></font></p>
</td>
</tr>
<tr valign="bottom">
<td width="83" height="7" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p align="right"><font face="Calibri, serif"><font style="font-size: 12pt;">1,622,313</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p align="right"><font face="Calibri, serif"><font style="font-size: 12pt;">1000</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p><font face="Calibri, serif"><font style="font-size: 12pt;">LZ4</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p><font face="Calibri, serif"><font style="font-size: 12pt;">PREFIX</font></font></p>
</td>
</tr>
</tbody>
</table>
<p style="margin-bottom: 0in; line-height: 100%;"><b>Case #2: What if<br />
we re-designed the column names (and left the rowkey alone)?</b></p>
<p style="margin-bottom: 0in; line-height: 100%;">Let’s assume that<br />
we re-designed our column names but left the rowkey alone. Using<br />
the “thin” column names without CF compression or data<br />
block encoding results in an HFile 5.8 MB in size. This is an<br />
improvement on the original 6.3 MB baseline. It doesn’t seem<br />
like much, but it’s still an 8% reduction in the eventual<br />
wire-transfer footprint.</p>
<table width="712" cellpadding="8" cellspacing="0">
<colgroup>
<col width="299" />
<col width="83" />
<col width="83" />
<col width="83" />
<col width="83" /> </colgroup>
<tbody>
<tr valign="bottom">
<td width="299" height="7" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p><font face="Calibri, serif"><font style="font-size: 12pt;">psudorandom-table-R1000-i10-s10_20-l10-NONE-NONE-FATKEY</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p align="right"><font face="Calibri, serif"><font style="font-size: 12pt;">5,778,888</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p align="right"><font face="Calibri, serif"><font style="font-size: 12pt;">1000</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p><font face="Calibri, serif"><font style="font-size: 12pt;">NONE</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p><font face="Calibri, serif"><font style="font-size: 12pt;">NONE</font></font></p>
</td>
</tr>
</tbody>
</table>
<p style="margin-bottom: 0in; line-height: 100%;">Applying Snappy<br />
compression can reduce the HFile size further:</p>
<table width="397" cellpadding="8" cellspacing="0">
<colgroup>
<col width="83" />
<col width="83" />
<col width="83" />
<col width="83" /> </colgroup>
<tbody>
<tr valign="bottom">
<td width="83" height="7" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p align="right"><font face="Calibri, serif"><font style="font-size: 12pt;">1,349,451</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p align="right"><font face="Calibri, serif"><font style="font-size: 12pt;">1000</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p><font face="Calibri, serif"><font style="font-size: 12pt;">SNAPPY</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p><font face="Calibri, serif"><font style="font-size: 12pt;">DIFF</font></font></p>
</td>
</tr>
<tr valign="bottom">
<td width="83" height="7" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p align="right"><font face="Calibri, serif"><font style="font-size: 12pt;">1,390,422</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p align="right"><font face="Calibri, serif"><font style="font-size: 12pt;">1000</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p><font face="Calibri, serif"><font style="font-size: 12pt;">SNAPPY</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p><font face="Calibri, serif"><font style="font-size: 12pt;">FAST_DIFF</font></font></p>
</td>
</tr>
<tr valign="bottom">
<td width="83" height="7" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p align="right"><font face="Calibri, serif"><font style="font-size: 12pt;">1,536,540</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p align="right"><font face="Calibri, serif"><font style="font-size: 12pt;">1000</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p><font face="Calibri, serif"><font style="font-size: 12pt;">SNAPPY</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p><font face="Calibri, serif"><font style="font-size: 12pt;">PREFIX</font></font></p>
</td>
</tr>
<tr valign="bottom">
<td width="83" height="7" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p align="right"><font face="Calibri, serif"><font style="font-size: 12pt;">1,785,480</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p align="right"><font face="Calibri, serif"><font style="font-size: 12pt;">1000</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p><font face="Calibri, serif"><font style="font-size: 12pt;">SNAPPY</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p><font face="Calibri, serif"><font style="font-size: 12pt;">NONE</font></font></p>
</td>
</tr>
</tbody>
</table>
<p style="margin-bottom: 0in; line-height: 100%;"><b>Case #3: What if<br />
we re-designed the rowkey (and left the column names alone)?</b></p>
<p style="margin-bottom: 0in; line-height: 100%;">In this example,<br />
what if we only redesigned the rowkey? Using the “thin”<br />
rowkey results in an HFile of 4.9 MB, down from the 6.3<br />
MB baseline, a 22% reduction. Not a small savings!</p>
<table width="712" cellpadding="8" cellspacing="0">
<colgroup>
<col width="299" />
<col width="83" />
<col width="83" />
<col width="83" />
<col width="83" /> </colgroup>
<tbody>
<tr valign="bottom">
<td width="299" height="7" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p><font face="Calibri, serif"><font style="font-size: 12pt;">psudorandom-table-R1000-i10-s10_20-l10-NONE-NONE-FATCOL</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p align="right"><font face="Calibri, serif"><font style="font-size: 12pt;">4,920,984</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p align="right"><font face="Calibri, serif"><font style="font-size: 12pt;">1000</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p><font face="Calibri, serif"><font style="font-size: 12pt;">NONE</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p><font face="Calibri, serif"><font style="font-size: 12pt;">NONE</font></font></p>
</td>
</tr>
</tbody>
</table>
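<p style="margin-bottom: 0in; line-height: 100%;">A back-of-the-envelope check helps explain the size of this saving. The Python sketch below assumes these tables use one KeyValue per column (30 cells per row) and that the rowkey is stored once in every KeyValue, then compares the predicted saving from shortening the rowkey with the difference between the two uncompressed HFile sizes reported above. It is a rough estimate that ignores block and index overhead.</p>

```python
ROWS = 1000
CELLS_PER_ROW = 30        # assumes one KeyValue per column
FAT_KEY_BYTES = 64
THIN_KEY_BYTES = 16

FAT_KEY_HFILE = 6293670   # fat rowkey, fat column names, no compression
THIN_KEY_HFILE = 4920984  # thin rowkey, fat column names, no compression

# Every cell carries its own copy of the rowkey, so the saving per cell
# is the difference in rowkey length.
predicted_saving = (FAT_KEY_BYTES - THIN_KEY_BYTES) * CELLS_PER_ROW * ROWS
observed_saving = FAT_KEY_HFILE - THIN_KEY_HFILE

print(predicted_saving)  # 1440000
print(observed_saving)   # 1372686
```

<p style="margin-bottom: 0in; line-height: 100%;">The two figures are of the same magnitude, which is consistent with the rowkey being repeated for every cell in the row.</p>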
<p style="margin-bottom: 0in; line-height: 100%;">Applying Snappy<br />
compression can reduce the HFile size further:</p>
<table width="397" cellpadding="8" cellspacing="0">
<colgroup>
<col width="83" />
<col width="83" />
<col width="83" />
<col width="83" /> </colgroup>
<tbody>
<tr valign="bottom">
<td width="83" height="7" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p align="right"><font face="Calibri, serif"><font style="font-size: 12pt;">1,295,895</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p align="right"><font face="Calibri, serif"><font style="font-size: 12pt;">1000</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p><font face="Calibri, serif"><font style="font-size: 12pt;">SNAPPY</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p><font face="Calibri, serif"><font style="font-size: 12pt;">DIFF</font></font></p>
</td>
</tr>
<tr valign="bottom">
<td width="83" height="7" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p align="right"><font face="Calibri, serif"><font style="font-size: 12pt;">1,337,112</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p align="right"><font face="Calibri, serif"><font style="font-size: 12pt;">1000</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p><font face="Calibri, serif"><font style="font-size: 12pt;">SNAPPY</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p><font face="Calibri, serif"><font style="font-size: 12pt;">FAST_DIFF</font></font></p>
</td>
</tr>
<tr valign="bottom">
<td width="83" height="7" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p align="right"><font face="Calibri, serif"><font style="font-size: 12pt;">1,489,446</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p align="right"><font face="Calibri, serif"><font style="font-size: 12pt;">1000</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p><font face="Calibri, serif"><font style="font-size: 12pt;">SNAPPY</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p><font face="Calibri, serif"><font style="font-size: 12pt;">PREFIX</font></font></p>
</td>
</tr>
<tr valign="bottom">
<td width="83" height="7" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p align="right"><font face="Calibri, serif"><font style="font-size: 12pt;">1,739,871</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p align="right"><font face="Calibri, serif"><font style="font-size: 12pt;">1000</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p><font face="Calibri, serif"><font style="font-size: 12pt;">SNAPPY</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p><font face="Calibri, serif"><font style="font-size: 12pt;">NONE</font></font></p>
</td>
</tr>
</tbody>
</table>
<p style="margin-bottom: 0in; line-height: 100%;">However, note that<br />
the resulting HFile size with Snappy and no data block encoding (1.7<br />
MB) is very similar to that of the baseline approach (i.e., fat<br />
rowkeys, fat column-names) with Snappy and no data block encoding<br />
(1.8 MB). Why? ColumnFamily compression can compensate on disk for a lot<br />
of bloat in rowkeys and column names.</p>
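<p style="margin-bottom: 0in; line-height: 100%;">A rough way to see why this happens is to compress synthetic cells in which the rowkey repeats for every column of a row. The sketch below is not an HFile writer. It uses zlib from the Python standard library as a stand-in for Snappy, and the cell layout, the column names (c0 to c9), the 8-byte values and the hex form of the 64-byte SHA-256 rowkey are all assumptions made for illustration.</p>

```python
import hashlib
import zlib

ROWS = 1000      # same row count as the tables above
COLUMNS = 10     # hypothetical number of columns per row


def thin_rowkey(i):
    """16-byte MD5 digest of the row number."""
    return hashlib.md5(str(i).encode()).digest()


def fat_rowkey(i):
    """64-byte hex string of the SHA-256 digest of the row number."""
    return hashlib.sha256(str(i).encode()).hexdigest().encode()


def build_cells(rowkey_fn):
    """Concatenate rowkey + column name + value for every cell.

    The rowkey is repeated for each column, as it is when every column
    is stored as its own KeyValue.
    """
    out = bytearray()
    for i in range(ROWS):
        key = rowkey_fn(i)
        for c in range(COLUMNS):
            qualifier = f"c{c}".encode()
            # 8 pseudo-random but reproducible value bytes
            value = hashlib.sha1(f"{i}:{c}".encode()).digest()[:8]
            out += key + qualifier + value
    return bytes(out)


thin_raw = build_cells(thin_rowkey)
fat_raw = build_cells(fat_rowkey)
thin_packed = zlib.compress(thin_raw)
fat_packed = zlib.compress(fat_raw)

raw_ratio = len(fat_raw) / len(thin_raw)
packed_ratio = len(fat_packed) / len(thin_packed)
print(f"uncompressed fat/thin ratio: {raw_ratio:.2f}")
print(f"compressed fat/thin ratio:   {packed_ratio:.2f}")
```

<p style="margin-bottom: 0in; line-height: 100%;">The compressed ratio should come out well below the uncompressed one, since the copies of a rowkey within a row compress down to short back-references. Real HFiles and Snappy will give different absolute numbers.</p>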
<p style="margin-bottom: 0in; line-height: 100%;"><b>Case #4: What if<br />
we re-designed both the rowkey and the column names?</b></p>
<p style="margin-bottom: 0in; line-height: 100%;">By this time we’ve<br />
learned enough HBase to know that we need to have efficient rowkeys<br />
and column-names. This produces an HFile that is 4.4 MB, a 29%<br />
saving over the baseline of 6.2 MB.</p>
<table width="397" cellpadding="8" cellspacing="0">
<colgroup>
<col width="83" />
<col width="83" />
<col width="83" />
<col width="83" /> </colgroup>
<tbody>
<tr valign="bottom">
<td width="83" height="7" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p align="right"><font face="Calibri, serif"><font style="font-size: 12pt;">4,406,418</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p align="right"><font face="Calibri, serif"><font style="font-size: 12pt;">1000</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p><font face="Calibri, serif"><font style="font-size: 12pt;">NONE</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p><font face="Calibri, serif"><font style="font-size: 12pt;">NONE</font></font></p>
</td>
</tr>
</tbody>
</table>
<p style="margin-bottom: 0in; line-height: 100%;">Applying Snappy<br />
compression can reduce the HFile size further:</p>
<table width="397" cellpadding="8" cellspacing="0">
<colgroup>
<col width="83" />
<col width="83" />
<col width="83" />
<col width="83" /> </colgroup>
<tbody>
<tr valign="bottom">
<td width="83" height="7" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p align="right"><font face="Calibri, serif"><font style="font-size: 12pt;">1,296,402</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p align="right"><font face="Calibri, serif"><font style="font-size: 12pt;">1000</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p><font face="Calibri, serif"><font style="font-size: 12pt;">SNAPPY</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p><font face="Calibri, serif"><font style="font-size: 12pt;">DIFF</font></font></p>
</td>
</tr>
<tr valign="bottom">
<td width="83" height="7" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p align="right"><font face="Calibri, serif"><font style="font-size: 12pt;">1,338,135</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p align="right"><font face="Calibri, serif"><font style="font-size: 12pt;">1000</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p><font face="Calibri, serif"><font style="font-size: 12pt;">SNAPPY</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p><font face="Calibri, serif"><font style="font-size: 12pt;">FAST_DIFF</font></font></p>
</td>
</tr>
<tr valign="bottom">
<td width="83" height="7" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p align="right"><font face="Calibri, serif"><font style="font-size: 12pt;">1,485,192</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p align="right"><a name="_GoBack"></a><font face="Calibri, serif"><font style="font-size: 12pt;">1000</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p><font face="Calibri, serif"><font style="font-size: 12pt;">SNAPPY</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p><font face="Calibri, serif"><font style="font-size: 12pt;">PREFIX</font></font></p>
</td>
</tr>
<tr valign="bottom">
<td width="83" height="7" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p align="right"><font face="Calibri, serif"><font style="font-size: 12pt;">1,732,746</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p align="right"><font face="Calibri, serif"><font style="font-size: 12pt;">1000</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p><font face="Calibri, serif"><font style="font-size: 12pt;">SNAPPY</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p><font face="Calibri, serif"><font style="font-size: 12pt;">NONE</font></font></p>
</td>
</tr>
</tbody>
</table>
<p style="margin-bottom: 0in; line-height: 100%;">Again, the on-disk<br />
footprint with compression isn’t radically different from the<br />
others, as compression can compensate to a large degree for rowkey and<br />
column name bloat.</p>
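<p style="margin-bottom: 0in; line-height: 100%;">To put numbers on “isn’t radically different”, the sketch below compares the Snappy rows of the Case #4 table with the matching rows of the table that precedes Case #4, for three of the data block encodings. The byte counts are copied from those tables; the dictionary and function names are made up for illustration.</p>

```python
# HFile sizes in bytes with Snappy compression, copied from the tables above.
before_case4 = {"FAST_DIFF": 1_337_112, "PREFIX": 1_489_446, "NONE": 1_739_871}
case4 = {"FAST_DIFF": 1_338_135, "PREFIX": 1_485_192, "NONE": 1_732_746}


def relative_change(old, new):
    """Change from old to new as a fraction of old (negative = smaller)."""
    return (new - old) / old


changes = {
    encoding: relative_change(before_case4[encoding], case4[encoding])
    for encoding in before_case4
}

for encoding, change in changes.items():
    print(f"{encoding:10s} {change:+.2%}")
```

<p style="margin-bottom: 0in; line-height: 100%;">All three differences come out under half a percent, which is why the extra redesign is barely visible on disk once Snappy is applied.</p>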
<p style="margin-bottom: 0in; line-height: 100%;"><b>Case #5:<br />
KeyValue Storage Approach (e.g., 1 KV vs. KV-per-Column)</b></p>
<p style="margin-bottom: 0in; line-height: 100%;">What if we did<br />
something radical and changed how we stored the data in HBase? With<br />
this approach, we are using a single KeyValue per row holding <i>all</i><br />
of the columns of data for the row instead of a KeyValue per column<br />
(the traditional way).</p>
<p style="margin-bottom: 0in; line-height: 100%;">The resulting HFile,<br />
even uncompressed and without Data Block Encoding, is radically<br />
smaller at 1.4 MB compared to 6.2 MB.</p>
<table width="712" cellpadding="8" cellspacing="0">
<colgroup>
<col width="299" />
<col width="83" />
<col width="83" />
<col width="83" />
<col width="83" /> </colgroup>
<tbody>
<tr valign="bottom">
<td width="299" height="7" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p><font face="Calibri, serif"><font style="font-size: 12pt;">psudorandom-table-R1000-i10-s10_20-l10-NONE-NONE-AVRO</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p align="right"><font face="Calibri, serif"><font style="font-size: 12pt;">1,374,465</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p align="right"><font face="Calibri, serif"><font style="font-size: 12pt;">1000</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p><font face="Calibri, serif"><font style="font-size: 12pt;">NONE</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p><font face="Calibri, serif"><font style="font-size: 12pt;">NONE</font></font></p>
</td>
</tr>
</tbody>
</table>
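<p style="margin-bottom: 0in; line-height: 100%;">Much of the difference comes from how many times the key is written. In the KeyValue format every cell stores its own copy of the rowkey, column family, column name, timestamp and type alongside the value. The sketch below estimates raw cell bytes for both layouts. The 20 bytes of fixed framing per KeyValue (two 4-byte lengths, a 2-byte row length, a 1-byte family length, an 8-byte timestamp and a 1-byte type) follow the KeyValue layout; the column count, name lengths and value sizes are hypothetical, and HFile block and index overhead are ignored.</p>

```python
KEYVALUE_FRAMING = 4 + 4 + 2 + 1 + 8 + 1  # lengths, timestamp, type


def keyvalue_size(rowkey_len, family_len, qualifier_len, value_len):
    """Approximate serialized size in bytes of one KeyValue."""
    return KEYVALUE_FRAMING + rowkey_len + family_len + qualifier_len + value_len


def kv_per_column(rowkey_len, family_len, qualifier_len, value_lens):
    """One KeyValue per column: the key parts repeat for every column."""
    return sum(
        keyvalue_size(rowkey_len, family_len, qualifier_len, v)
        for v in value_lens
    )


def single_kv(rowkey_len, family_len, qualifier_len, value_lens):
    """One KeyValue per row: all values packed into a single value."""
    return keyvalue_size(rowkey_len, family_len, qualifier_len, sum(value_lens))


# Hypothetical row: 16-byte rowkey, 1-byte family, 2-byte column names,
# ten 8-byte values. Serialization overhead of the packed value is ignored.
values = [8] * 10
per_column = kv_per_column(16, 1, 2, values)
packed = single_kv(16, 1, 2, values)
print(per_column, packed)
```

<p style="margin-bottom: 0in; line-height: 100%;">With these made-up numbers the per-column layout needs 470 bytes per row and the single-KeyValue layout 119, before any compression or encoding is applied.</p>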
<p style="margin-bottom: 0in; line-height: 100%;">Adding Snappy<br />
compression and Data Block Encoding makes the resulting HFile size<br />
even smaller.</p>
<table width="397" cellpadding="8" cellspacing="0">
<colgroup>
<col width="83" />
<col width="83" />
<col width="83" />
<col width="83" /> </colgroup>
<tbody>
<tr valign="bottom">
<td width="83" height="7" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p align="right"><font face="Calibri, serif"><font style="font-size: 12pt;">1,119,330</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p align="right"><font face="Calibri, serif"><font style="font-size: 12pt;">1000</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p><font face="Calibri, serif"><font style="font-size: 12pt;">SNAPPY</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p><font face="Calibri, serif"><font style="font-size: 12pt;">DIFF</font></font></p>
</td>
</tr>
<tr valign="bottom">
<td width="83" height="7" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p align="right"><font face="Calibri, serif"><font style="font-size: 12pt;">1,129,209</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p align="right"><font face="Calibri, serif"><font style="font-size: 12pt;">1000</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p><font face="Calibri, serif"><font style="font-size: 12pt;">SNAPPY</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p><font face="Calibri, serif"><font style="font-size: 12pt;">FAST_DIFF</font></font></p>
</td>
</tr>
<tr valign="bottom">
<td width="83" height="7" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p align="right"><font face="Calibri, serif"><font style="font-size: 12pt;">1,133,613</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p align="right"><font face="Calibri, serif"><font style="font-size: 12pt;">1000</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p><font face="Calibri, serif"><font style="font-size: 12pt;">SNAPPY</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p><font face="Calibri, serif"><font style="font-size: 12pt;">PREFIX</font></font></p>
</td>
</tr>
<tr valign="bottom">
<td width="83" height="7" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p align="right"><font face="Calibri, serif"><font style="font-size: 12pt;">1,150,779</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p align="right"><font face="Calibri, serif"><font style="font-size: 12pt;">1000</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p><font face="Calibri, serif"><font style="font-size: 12pt;">SNAPPY</font></font></p>
</td>
<td width="83" bgcolor="#ffffff" style="border: none; padding: 0in;">
<p><font face="Calibri, serif"><font style="font-size: 12pt;">NONE</font></font></p>
</td>
</tr>
</tbody>
</table>
<p style="margin-bottom: 0in; line-height: 100%;">Compare the 1.1 MB<br />
Snappy-compressed HFile without data block encoding to the 1.7 MB<br />
Snappy-compressed HFile (also without data block encoding) for thin<br />
rowkeys and thin column names.</p>
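<p style="margin-bottom: 0in; line-height: 100%;">As a worked check of that comparison, the snippet below computes how much smaller the single-KeyValue HFiles are than the Case #4 HFiles, both uncompressed and with Snappy and no data block encoding. The byte counts are taken from the tables above; the variable and function names are illustrative.</p>

```python
# HFile sizes in bytes, copied from the Case #4 and Case #5 tables.
case4_uncompressed = 4_406_418
case5_uncompressed = 1_374_465
case4_snappy = 1_732_746   # Snappy, data block encoding NONE
case5_snappy = 1_150_779   # Snappy, data block encoding NONE


def percent_smaller(reference, candidate):
    """How much smaller candidate is than reference, in percent."""
    return 100.0 * (reference - candidate) / reference


uncompressed_gain = percent_smaller(case4_uncompressed, case5_uncompressed)
snappy_gain = percent_smaller(case4_snappy, case5_snappy)
print(f"uncompressed: {uncompressed_gain:.1f}% smaller")
print(f"snappy:       {snappy_gain:.1f}% smaller")
```

<p style="margin-bottom: 0in; line-height: 100%;">This gives roughly 69% uncompressed and roughly 34% with Snappy, so compression narrows the gap between the two storage approaches without closing it.</p>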
<p style="margin-bottom: 0in; line-height: 100%;"><b>Summary</b></p>
<p style="margin-bottom: 0in; line-height: 100%;">Although Compression<br />
and Data Block Encoding can paper over bad rowkey and column-name<br />
decisions in terms of HFile size, you will pay the price for this in<br />
terms of data transfer from RegionServer to Client. Also, concealing<br />
the size penalty brings with it a performance penalty each time the<br />
data is accessed or manipulated. So, the old advice about correctly<br />
designing rowkeys and column names still holds.</p>
<p style="margin-bottom: 0in; line-height: 100%;">In terms of KeyValue<br />
approach, having a single KeyValue per row presents significant<br />
savings in terms of both data transfer (RegionServer to Client) and<br />
HFile size. <i>However</i>, this approach has a consequence: each row<br />
must be updated <i>entirely</i>, and old versions of rows are <i>also</i><br />
stored in their entirety (i.e., as opposed to column-by-column<br />
changes). Furthermore, it is impossible to scan on select columns;<br />
the whole row must be retrieved and deserialized to access any<br />
information stored in the row. The importance of understanding this<br />
tradeoff cannot be overstated, and it is something that must be<br />
evaluated on an application-by-application basis.</p>
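<p style="margin-bottom: 0in; line-height: 100%;">The update cost can be sketched in a few lines. Here a row’s columns are packed into one value, so changing a single field means decoding and rewriting the whole blob. JSON from the Python standard library is used purely as a stand-in for Avro, and the in-memory dict standing in for a table, along with the function and field names, is hypothetical.</p>

```python
import json

# Hypothetical stand-in for a table: maps a rowkey to one packed value.
table = {}


def put_row(rowkey, columns):
    """Store all columns of a row as one serialized value."""
    table[rowkey] = json.dumps(columns, sort_keys=True).encode()


def get_column(rowkey, name):
    """Reading one column still means decoding the entire row."""
    return json.loads(table[rowkey])[name]


def update_column(rowkey, name, value):
    """Read-modify-write: decode the row, change one field, rewrite it all."""
    columns = json.loads(table[rowkey])
    columns[name] = value
    put_row(rowkey, columns)
    return len(table[rowkey])  # bytes rewritten for a one-field change


put_row("row-1", {"age": 41, "city": "Cleveland", "name": "Pat"})
rewritten = update_column("row-1", "age", 42)
print(get_column("row-1", "age"), rewritten)
```

<p style="margin-bottom: 0in; line-height: 100%;">With a KeyValue per column, the same change would write only the one cell that changed.</p>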
<p style="margin-bottom: 0in; line-height: 100%;">Software engineering<br />
is an art of managing tradeoffs, so there isn’t necessarily one<br />
“best” answer. Importantly, this experiment only measures the<br />
file size and not the time or processor load penalties imposed by the<br />
use of compression, encoding, or Avro. The results generated in this<br />
test are still based on certain assumptions and your mileage may<br />
vary.</p>
<p>The data is available here, if interested: <a href="http://people.apache.org/~dmeil/HBase_HFile_Size_2014_04.csv" target="_blank" style="color: #1155cc; font-family: Calibri, sans-serif; font-size: 14px;">http://people.apache.org/~<wbr />dmeil/HBase_HFile_Size_2014_<wbr />04.csv</a></p>