2016/09/16/predicate-pushdown.html (281 lines of code) (raw):

<!DOCTYPE html> <html lang="en"> <head> <meta charset="utf-8" /> <meta http-equiv="X-UA-Compatible" content="IE=edge" /> <meta name="viewport" content="width=device-width, initial-scale=1" /> <!-- The above 3 meta tags *must* come first in the head; any other head content must come *after* these tags --> <meta name="description" content="A new open source Apache Hadoop ecosystem project, Apache Kudu completes Hadoop's storage layer to enable fast analytics on fast data" /> <meta name="author" content="Cloudera" /> <title>Apache Kudu - Pushing Down Predicate Evaluation in Apache Kudu</title> <!-- Bootstrap core CSS --> <link rel="stylesheet" href="/css/bootstrap.min.css"/> <!-- Custom styles for this template --> <link href="/css/kudu.css" rel="stylesheet"/> <link href="/css/asciidoc.css" rel="stylesheet"/> <link rel="shortcut icon" href="/img/logo-favicon.ico" /> <link rel="stylesheet" href="/css/font-awesome.min.css" /> <link rel="alternate" type="application/atom+xml" title="RSS Feed for Apache Kudu blog" href="/feed.xml" /> </head> <body> <div class="kudu-site container-fluid"> <!-- Static navbar --> <nav class="navbar navbar-default"> <div class="container-fluid"> <div class="navbar-header"> <button type="button" class="navbar-toggle collapsed" data-toggle="collapse" data-target="#navbar" aria-expanded="false" aria-controls="navbar"> <span class="sr-only">Toggle navigation</span> <span class="icon-bar"></span> <span class="icon-bar"></span> <span class="icon-bar"></span> </button> <a class="logo" href="/"><img src="/img/apachekudu_logo_0716_80px.png" srcset="/img/apachekudu_logo_0716_80px.png 1x, /img/apachekudu_logo_0716_160px.png 2x" alt="Apache Kudu"/></a> </div> <div id="navbar" class="collapse navbar-collapse"> <ul class="nav navbar-nav navbar-right"> <li > <a href="/">Home</a> </li> <li > <a href="/overview.html">Overview</a> </li> <li > <a href="/docs/">Documentation</a> </li> <li > <a href="/releases/">Releases</a> </li> <li class="active"> <a href="/blog/">Blog</a> </li> <!-- NOTE: this dropdown menu does not appear on Mobile, so don't add anything here that doesn't also appear elsewhere on the site. --> <li class="dropdown"> <a href="/community.html" role="button" aria-haspopup="true" aria-expanded="false">Community <span class="caret"></span></a> <ul class="dropdown-menu"> <li class="dropdown-header">GET IN TOUCH</li> <li><a class="icon email" href="/community.html">Mailing Lists</a></li> <li><a class="icon slack" href="https://join.slack.com/t/getkudu/shared_invite/zt-244b4zvki-hB1q9IbAk6CqHNMZHvUALA">Slack Channel</a></li> <li role="separator" class="divider"></li> <li><a href="/community.html#meetups-user-groups-and-conference-presentations">Events and Meetups</a></li> <li><a href="/committers.html">Project Committers</a></li> <li><a href="/ecosystem.html">Ecosystem</a></li> <!--<li><a href="/roadmap.html">Roadmap</a></li>--> <li><a href="/community.html#contributions">How to Contribute</a></li> <li role="separator" class="divider"></li> <li class="dropdown-header">DEVELOPER RESOURCES</li> <li><a class="icon github" href="https://github.com/apache/incubator-kudu">GitHub</a></li> <li><a class="icon gerrit" href="http://gerrit.cloudera.org:8080/#/q/status:open+project:kudu">Gerrit Code Review</a></li> <li><a class="icon jira" href="https://issues.apache.org/jira/browse/KUDU">JIRA Issue Tracker</a></li> <li role="separator" class="divider"></li> <li class="dropdown-header">SOCIAL MEDIA</li> <li><a class="icon twitter" href="https://twitter.com/ApacheKudu">Twitter</a></li> <li><a href="https://www.reddit.com/r/kudu/">Reddit</a></li> <li role="separator" class="divider"></li> <li class="dropdown-header">APACHE SOFTWARE FOUNDATION</li> <li><a href="https://www.apache.org/security/" target="_blank">Security</a></li> <li><a href="https://www.apache.org/foundation/sponsorship.html" target="_blank">Sponsorship</a></li> <li><a href="https://www.apache.org/foundation/thanks.html" target="_blank">Thanks</a></li> <li><a href="https://www.apache.org/licenses/" target="_blank">License</a></li> </ul> </li> <li > <a href="/faq.html">FAQ</a> </li> </ul><!-- /.nav --> </div><!-- /#navbar --> </div><!-- /.container-fluid --> </nav> <div class="row header"> <div class="col-lg-12"> <h2><a href="/blog">Apache Kudu Blog</a></h2> </div> </div> <div class="row-fluid"> <div class="col-lg-9"> <article> <header> <h1 class="entry-title">Pushing Down Predicate Evaluation in Apache Kudu</h1> <p class="meta">Posted 16 Sep 2016 by Andrew Wong</p> </header> <div class="entry-content"> <p>I had the pleasure of interning with the Apache Kudu team at Cloudera this summer. This project was my summer contribution to Kudu: a restructuring of the scan path to speed up queries.</p> <!--more--> <h2 id="introduction">Introduction</h2> <p>In Kudu, <em>predicate pushdown</em> refers to the way in which predicates are handled. When a scan is requested, its predicates are passed through the different layers of Kudu’s storage hierarchy, allowing for pruning and other optimizations to happen at each level before reaching the underlying data.</p> <p>While predicates are pushed down, predicate evaluation itself occurs at a fairly high level, precluding the evaluation process from certain data-specific optimizations. These optimizations can make tablet scans an order of magnitude faster, if not more.</p> <h2 id="a-day-in-the-life-of-a-query">A Day in the Life of a Query</h2> <p>Because Kudu is a columnar storage engine, its scan path has a number of optimizations to avoid extraneous reads, copies, and computation. When a query is sent to a tablet server, the server prunes tablets based on the primary key, directing the request to only the tablets that contain the key range of interest. Once at a tablet, only the columns relevant to the query are scanned. Further pruning is done over the primary key, and if the query is predicated on non-key columns, the entire column is scanned. The columns in a tablet are stored as <em>cfiles</em>, which are split into encoded <em>blocks</em>. Once the relevant cfiles are determined, the data are materialized by the block decoders, i.e. their underlying data are decoded and copied into a buffer, which is passed back to the tablet layer. The tablet can then evaluate the predicate on the batch of data and mark which rows should be returned to the client.</p> <p>One of the encoding types I worked very closely with is <em>dictionary encoding</em>, an encoding type for strings that performs particularly well for cfiles that have repeating values. Rather than storing every row’s string, each unique string is assigned a numeric codeword, and the rows are stored numerically on disk. When materializing a dictionary block, all of the numeric data are scanned and all of the corresponding strings are copied and buffered for evaluation. When the vocabulary of a dictionary-encoded cfile gets too large, the blocks begin switching to <em>plain encoding mode</em> to act like <em>plain-encoded</em> blocks.</p> <p>In a plain-encoded block, strings are stored contiguously and the character offsets to the start of each string are stored as a list of integers. When materializing, all of the strings are copied to a buffer for evaluation.</p> <p>Therein lies room for improvement: this predicate evaluation path is the same for all data types and encoding types. Within the tablet, the correct cfiles are determined, the cfiles’ decoders are opened, all of the data are copied to a buffer, and the predicates are evaluated on this buffered data via type-specific comparators. This path is extremely flexible, but because it was designed to be encoding-independent, there is room for improvement.</p> <h2 id="trimming-the-fat">Trimming the Fat</h2> <p>The first step is to allow the decoders access to the predicate. In doing so, each encoding type can specialize its evaluation. Additionally, this puts the decoder in a position where it can determine whether a given row satisfies the query, which in turn, allows the decoders to determine what data gets copied instead of eagerly copying all of its data to get evaluated.</p> <p>Take the case of dictionary-encoded strings as an example. With the existing scan path, not only are all of the strings in a column copied into a buffer, but string comparisons are done on every row. By taking advantage of the fact that the data can be represented as integers, the cost of determining the query results can be greatly reduced. The string comparisons can be swapped out with evaluation based on the codewords, in which case the room for improvement boils down to how to most quickly determine whether or not a given codeword corresponds to a string that satisfies the predicate. Dictionary columns will now use a bitset to store the codewords that match the predicates. It will then scan through the integer-valued data and checks the bitset to determine whether it should copy the corresponding string over.</p> <p>This is great in the best case scenario where a cfile’s vocabulary is small, but when the vocabulary gets too large and the dictionary blocks switch to plain encoding mode, performance is hampered. In this mode, the blocks don’t utilize any dictionary metadata and end up wasting the codeword bitset. That isn’t to say all is lost: the decoders can still evaluate a predicate via string comparison, and the fact that evaluation can still occur at the decoder-level means the eager buffering can still be avoided.</p> <p>Dictionary encoding is a perfect storm in that the decoders can completely evaluate the predicates. This is not the case for most other encoding types, but having decoders support evaluation leaves the door open for other encoding types to extend this idea.</p> <h2 id="performance">Performance</h2> <p>Depending on the dataset and query, predicate pushdown can lead to significant improvements. Tablet scans were timed with datasets consisting of repeated string patterns of tunable length and tunable cardinality.</p> <p><img src="/img/predicate-pushdown/pushdown-10.png" alt="png" class="img-responsive" /> <img src="/img/predicate-pushdown/pushdown-10M.png" alt="png" class="img-responsive" /></p> <p>The above plots show the time taken to completely scan a single tablet, recorded using a dataset of ten million rows of strings with length ten. Predicates were designed to select values out of bounds (Empty), select a single value (Equal, i.e. for cardinality <em>k</em>, this would select 1/<em>k</em> of the dataset), select half of the full range (Half), and select the full range of values (All).</p> <p>With the original evaluation implementation, the tablet must copy and scan through the tablet to determine whether any values match. This means that even when the result set is small, the full column is still copied. This is avoided by pushing down predicates, which only copies as needed, and can be seen in the above queries: those with near-empty result sets (Empty and Equal) have shorter scan times than those with larger result sets (Half and All).</p> <p>Note that for dictionary encoding, given a low cardinality, Kudu can completely rely on the dictionary codewords to evaluate, making the query significantly faster. At higher cardinalities, the dictionaries completely fill up and the blocks fall back on plain encoding. The slower, albeit still improved, performance on the dataset containing 10M unique values reflects this.</p> <p><img src="/img/predicate-pushdown/pushdown-tpch.png" alt="png" class="img-responsive" /></p> <p>Similar predicates were run with the TPC-H dataset, querying on the shipdate column. The full path of a query includes not only the tablet scanning itself, but also RPCs and batched data transfer to the caller as the scan progresses. As such, the times plotted above refer to the average end-to-end time required to scan and return a batch of rows. Regardless of this additional overhead, significant improvements on the scan path still yield substantial improvements to the query performance as a whole.</p> <h2 id="conclusion">Conclusion</h2> <p>Pushing down predicate evaluation in Kudu yielded substantial improvements to the scan path. For dictionary encoding, pushdown can be particularly powerful, and other encoding types are either unaffected or also improved. This change has been pushed to the main branch of Kudu, and relevant commits can be found <a href="https://github.com/cloudera/kudu/commit/c0f37278cb09a7781d9073279ea54b08db6e2010">here</a> and <a href="https://github.com/cloudera/kudu/commit/ec80fdb37be44d380046a823b5e6d8e2241ec3da">here</a>.</p> <p>This summer has been a phenomenal learning experience for me, in terms of the tools, the workflow, the datasets, the thought-processes that go into building something at Kudu’s scale. I am extremely thankful for all of the mentoring and support I received, and that I got to be a part of Kudu’s journey from incubating to a Top Level Apache project. I can’t express enough how grateful I am for the amount of support I got from the Kudu team, from the intern coordinators, and from the Cloudera community as a whole.</p> </div> </article> </div> <div class="col-lg-3 recent-posts"> <h3>Recent posts</h3> <ul> <li> <a href="/2024/11/13/apache-kudu-1-17-1-release.html">Apache Kudu 1.17.1 Released</a> </li> <li> <a href="/2024/03/07/introducing-auto-incrementing-column.html">Introducing Auto-incrementing Column in Kudu</a> </li> <li> <a href="/2023/09/07/apache-kudu-1-17-0-released.html">Apache Kudu 1.17.0 Released</a> </li> <li> <a href="/2022/06/17/apache-kudu-1-16-0-released.html">Apache Kudu 1.16.0 Released</a> </li> <li> <a href="/2021/06/22/apache-kudu-1-15-0-released.html">Apache Kudu 1.15.0 Released</a> </li> <li> <a href="/2021/01/28/apache-kudu-1-14-0-release.html">Apache Kudu 1.14.0 Released</a> </li> <li> <a href="/2021/01/15/bloom-filter-predicate.html">Optimized joins & filtering with Bloom filter predicate in Kudu</a> </li> <li> <a href="/2020/09/21/apache-kudu-1-13-0-release.html">Apache Kudu 1.13.0 released</a> </li> <li> <a href="/2020/08/11/fine-grained-authz-ranger.html">Fine-Grained Authorization with Apache Kudu and Apache Ranger</a> </li> <li> <a href="/2020/07/30/building-near-real-time-big-data-lake.html">Building Near Real-time Big Data Lake</a> </li> <li> <a href="/2020/05/18/apache-kudu-1-12-0-release.html">Apache Kudu 1.12.0 released</a> </li> <li> <a href="/2019/11/20/apache-kudu-1-11-1-release.html">Apache Kudu 1.11.1 released</a> </li> <li> <a href="/2019/11/20/apache-kudu-1-10-1-release.html">Apache Kudu 1.10.1 released</a> </li> <li> <a href="/2019/07/09/apache-kudu-1-10-0-release.html">Apache Kudu 1.10.0 Released</a> </li> <li> <a href="/2019/04/30/location-awareness.html">Location Awareness in Kudu</a> </li> </ul> </div> </div> <footer class="footer"> <div class="row"> <div class="col-md-9"> <p class="small"> Copyright &copy; 2023 The Apache Software Foundation. </p> <p class="small"> Apache Kudu, Kudu, Apache, the Apache feather logo, and the Apache Kudu project logo are either registered trademarks or trademarks of The Apache Software Foundation in the United States and other countries. </p> </div> <div class="col-md-3"> <a class="pull-right" href="https://www.apache.org/events/current-event.html"> <img src="https://www.apache.org/events/current-event-234x60.png"/> </a> </div> </div> </footer> </div> <script src="/js/jquery.min.js"></script> <script> // Try to detect touch-screen devices. Note: Many laptops have touch screens. $(document).ready(function() { if ("ontouchstart" in document.documentElement) { $(document.documentElement).addClass("touch"); } else { $(document.documentElement).addClass("no-touch"); } }); </script> <script src="/js/bootstrap.min.js"></script> <script src="/js/anchor.js"></script> <script> anchors.options = { placement: 'right', visible: 'touch', }; anchors.add(); </script> </body> </html>