---
layout: post
status: PUBLISHED
published: true
title: Let's meet B-trees
id: 7dd18932-9098-4000-a8ca-a0fb4ef34cf5
date: '2017-02-27 17:19:55 -0500'
categories: directory
tags: []
permalink: directory/entry/let-s-meet-b-trees
---

Render unto Caesar

B-trees were invented in the early 70's by Rudolf Bayer and Edward McCreight. They were both working at Boeing™ and wanted to design a better alternative to binary trees for handling loads of data stored on disk. The authors never explained the name; one reading is that the B in B-tree comes from an aggregation of the three Bs in Bayer, Boeing and Binary-tree.

They wanted to improve browsing efficiency. The idea was to store values sequentially on the disk so as to limit the number of seek operations. Aggregating consecutive values in pages so that a single read returns a few of them is faster than fetching them one by one in different places on a disk.

Definition

What is a B-tree?

Let's first define some of the terminology in use:

And also a few properties/constraints of a B-tree:

Let's continue with the definition: it's a self-balancing tree data structure that keeps data sorted and allows searches, sequential access, insertions, and deletions in logarithmic time. Thanks Wikipedia :-)

That pretty much defines what a B-tree is.

Why have B-trees been invented?

So the whole idea was to speed up operations made on a binary-tree. In a binary-tree, values are stored so that each value has a left and right child. Searching in a binary tree is all about traversing nodes, down to the value you are looking for. The biggest issue is that binary-trees aren't necessarily balanced: for any set of N values, you may have any kind of distribution, from a perfectly balanced binary-tree, where the height of the tree is equal to log2(N), to a degenerate binary-tree where you have N levels.

Actually, balanced binary trees are faster when counting comparisons, because the number of comparisons you have to make to retrieve a value is equal to the number of levels. If the binary-tree is balanced, this will be log2(nb values). In a B-tree, as you hold up to N values in each Page, the number of pages you will fetch is logN(nb values), but within each fetched page you also have to do about log2(nb values in the page) comparisons to locate the value with a binary search. So if N is 16, with 1 000 000 values stored, and each page 2/3 full on average (about 10 values per page), you will do (nb levels) * log2(nb values per page) comparisons. That is 6 * 4 comparisons, ie 24, where 6 is log10(1 000 000) and 4 is log2(10) rounded up, instead of 20 (which is log2(1 000 000) rounded up).
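The arithmetic above can be checked with a few lines of Java. This is only a sketch: the class and method names are made up for illustration, and it assumes about 10 values per page, as in the text.

```java
final class ComparisonEstimate {
    // Smallest h such that base^h >= n, i.e. log_base(n) rounded up,
    // computed with integers to avoid floating-point surprises.
    static int ceilLog(long base, long n) {
        int h = 0;
        for (long reach = 1; reach < n; reach *= base) {
            h++;
        }
        return h;
    }

    public static void main(String[] args) {
        long values = 1_000_000L;
        long perPage = 10; // 16-slot pages, about 2/3 full (rounded as in the text)

        int levels = ceilLog(perPage, values);  // pages fetched from root to leaf
        int perLevel = ceilLog(2, perPage);     // binary search inside one page
        int binary = ceilLog(2, values);        // balanced binary tree

        System.out.println("B-tree: " + levels + " * " + perLevel + " = " + levels * perLevel);
        System.out.println("Binary tree: " + binary);
    }
}
```

With these inputs the B-tree needs 6 page fetches of 4 comparisons each, against 20 comparisons for the balanced binary tree.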

Ok, so B-trees are less efficient than balanced binary-trees when counting comparisons, that's a fact. Why then should we use a B-tree? Simply because we don't process data in memory only: we usually fetch them from disk. And that is the key: fetching something from a disk is a costly operation (even if you are using an SSD). Back when B-trees were invented, data were read from a sequential medium: a tape or a magnetic disk. The probability that we would fetch many consecutive values was extremely high, thus storing many values per page was a great idea.

Mavibot and ApacheDS

Mavibot was designed as a replacement for JDBM for the reason I explained in my previous post. Let's now see what its requirements are.

We needed a B-tree implementation in Java, under an AL 2.0 license, with support for cross-B-tree transactions. The solution was to implement a Multiversion concurrency control model, aka MVCC. What are the characteristics of such a system?

Those characteristics come at a cost:

The third point (disk space) is important. In a B-tree, for any update, we have to copy each Node we are traversing to go to the leaf containing the data, and also copy this leaf. For a B-tree containing 1 million elements with pages holding up to 256 elements, any update will require 3 levels (2 Nodes and a Leaf) to be copied. That means for 1000 updates, we will eat 3*(page size)*1000 bytes, ie about 12 MB for 4096-byte pages. Even worse, we also have to keep some side data which will double this size. With a standard B-tree, we will just eat what is needed to hold those updates in the B-tree, ie something like a few KB.
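The copy cost estimated above can be reproduced with a small sketch. The names are hypothetical, and the calculation assumes full pages and ignores the side data mentioned in the text.

```java
final class CopyOnWriteCost {
    // Number of levels needed so that fanout^levels >= elements.
    static int levels(long fanout, long elements) {
        int h = 0;
        for (long reach = 1; reach < elements; reach *= fanout) {
            h++;
        }
        return h;
    }

    // Every update copies one page per level, from the root down to the leaf.
    static long bytesCopied(long fanout, long elements, long pageSize, long updates) {
        return levels(fanout, elements) * pageSize * updates;
    }

    public static void main(String[] args) {
        // 1 million elements, 256 elements per page, 4096-byte pages, 1000 updates
        System.out.println(levels(256, 1_000_000L));
        System.out.println(bytesCopied(256, 1_000_000L, 4096, 1000));
    }
}
```

This gives 3 levels and 12 288 000 bytes copied, which is the 3*(page size)*1000 figure from the paragraph above.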

Now, if we keep all the versions, we might end up with a really huge database. By the way, this is what happens when you use git, except that git does not copy intermediate Nodes :-)

As I said, we needed Mavibot for Apache Directory Server, for which those drawbacks weren't critical: LDAP favors reads over writes, so having only one writer is a non-issue. Data relevance is not a big issue either: because a phone number does not change that frequently, it's not really a problem if the number a reader has grabbed is no longer current when the reader wants to use it. In the worst case scenario the reader will ask for it again, and get the new one back. Now, for passwords, it might sound a bit more critical, but actually, the time frame during which a password is read and used is really short, so when you change it, it's very unlikely to impact any reader. If it does, then it's not really a problem, because any logged-in application has been authenticated with the old password, and the administrator should have logged out everyone *before* allowing the password to be changed.

That being said, the most important feature for ApacheDS is consistency, which Mavibot provides. If you are to write a bank application, then this is not the right database for you...

How does it work?

The B-tree we use in such a system is a specific flavor of B-tree: we decided not to store data in the Nodes, so values live only in the leaves.

Typically, we also cannot chain leaves together. In some B-tree implementations, chaining leaves increases browsing speed, because there is no need to go up to the parent Nodes to fetch the next values, and it also limits the number of locks needed to update the B-tree.

Updates

Let's see what's going on when we inject some values in a B-tree. Here, we will insert {E, A, D, I, K} in that order. The very first version will be:

rev1.png

Here, the root page is a Leaf, and contains one value, E. Then we insert the value A, which is added to the root page, and we create a copy of this root page for that purpose:

rev2.png

The insertion of the third value, D, can't be done in the root page, because it's already full. We have to create a parent Node which refers to two leaves, containing the three values:

As you can see, we have created two leaves and copied the root-page. Let's insert a fourth value, I:

Each new addition will copy some of the pages from the previous versions, but may keep some references to previously created pages. Here, version 4 has split the right leaf into 2 new pages, but we have kept the left leaf intact. The parent node is also copied. Now, a reader using version 3 will still see only three values (A, D and E) while a new reader will see 4 values (A, D, E and I). In the process we have created 3 new pages.

Insertion is actually done bottom-up: we add the new value in the leaf, and if the leaf is already full, we create a new leaf, spread the data between the 2 leaves, and go back to the parent to add a new reference to the created leaves. Of course, this has to propagate up to the root page, and if the current root page is full, then we have to split it again. This is what happens when we insert a fifth value (K) in the previous B-tree:

rev5.png

This time, the B-tree has 3 levels. Version 5 would look like this to any new reader:

rev5-alone.png

It goes on and on like this. In real life, we won't store only 2 values in each leaf: that would be way too expensive. But it's easier when drawing what happens :-)
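The bottom-up, copy-on-write insertion described above can be sketched in Java as follows. This is not Mavibot's code: the `Page`, `Result`, `insert` and `put` names are hypothetical, pages hold at most 2 keys as in the drawings, a key equal to a separator is assumed to go to the right child, and duplicate keys are not handled.

```java
import java.util.ArrayList;
import java.util.List;

final class CowBTreeSketch {
    static final int MAX_KEYS = 2; // tiny pages, as in the drawings

    // A page is never modified once created: an update builds new pages.
    static final class Page {
        final List<String> keys;   // sorted keys
        final List<Page> children; // empty for a Leaf
        Page(List<String> keys, List<Page> children) {
            this.keys = List.copyOf(keys);
            this.children = List.copyOf(children);
        }
        boolean isLeaf() { return children.isEmpty(); }
    }

    // Outcome of an insertion in a subtree: one new page, or two pages
    // and the separator key to push into the parent when a split happened.
    record Result(Page left, String separator, Page right) {}

    static Result insert(Page page, String key) {
        List<String> keys = new ArrayList<>(page.keys);       // work on copies
        List<Page> children = new ArrayList<>(page.children);
        int pos = 0;
        while (pos < keys.size() && key.compareTo(keys.get(pos)) >= 0) pos++;
        if (page.isLeaf()) {
            keys.add(pos, key);
        } else {
            Result r = insert(children.get(pos), key); // go down first
            children.set(pos, r.left());               // untouched siblings are reused
            if (r.right() != null) {                   // the child was split
                keys.add(pos, r.separator());
                children.add(pos + 1, r.right());
            }
        }
        if (keys.size() <= MAX_KEYS) {
            return new Result(new Page(keys, children), null, null);
        }
        int mid = keys.size() / 2; // the page overflowed: split it
        if (page.isLeaf()) {       // a Leaf keeps the separator in its right half
            return new Result(new Page(keys.subList(0, mid), List.of()), keys.get(mid),
                    new Page(keys.subList(mid, keys.size()), List.of()));
        }
        return new Result(new Page(keys.subList(0, mid), children.subList(0, mid + 1)),
                keys.get(mid),
                new Page(keys.subList(mid + 1, keys.size()),
                        children.subList(mid + 1, children.size())));
    }

    // Returns the root page of the new version; the old root stays valid.
    static Page put(Page root, String key) {
        Result r = insert(root, key);
        return r.right() == null ? r.left()
                : new Page(List.of(r.separator()), List.of(r.left(), r.right()));
    }

    // Collects all values reachable from a root page, in order.
    static List<String> values(Page page) {
        if (page.isLeaf()) return page.keys;
        List<String> out = new ArrayList<>();
        for (Page child : page.children) out.addAll(values(child));
        return out;
    }

    public static void main(String[] args) {
        Page root = new Page(List.of(), List.of());
        List<Page> versions = new ArrayList<>();
        for (String key : List.of("E", "A", "D", "I", "K")) {
            root = put(root, key);
            versions.add(root);
        }
        System.out.println(values(versions.get(2))); // what a reader of version 3 sees
        System.out.println(values(versions.get(4))); // what a reader of version 5 sees
    }
}
```

Each version is just a root page reference, so a reader holding the root of version 3 keeps seeing A, D and E while version 5 already contains five values. Versions 3 and 4 also share the same left leaf object, which is the page reuse described earlier.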

At some point in time, when no reader is pointing to a version, we can reclaim the unused pages (to be explained in another blog post).

I haven't talked about deletion. Let's see what happens when we delete the value E, for instance:

rev6.png

This time, we have just created a new root page, pointing to existing previous pages. Note that you can still browse the tree and get all the data for this version: {A, D, I, K}.

One side note: you can see that if we were to link leaves to each other, creating a new version would lead us to copy the whole B-tree, not only new or modified pages.

Browsing

The most frequent operation is browsing. We may want to find a specific value or to read many values, starting from a specific part of the B-tree. Typically, if the values are dates, then reading all the values after a given date is a matter of finding the starting date, and moving on toward the end of the B-tree.

However it gets a bit more complicated in the absence of links between leaves.

Fetching a value

First let's see what happens when we try to fetch a value. We always start from the root Page. Here, let's say we want to retrieve the value I from version 5. I is not present in the root page, so we will go down the tree on the right side. The I key is present in the node at the second level, so we move again down to the right page, which is a leaf, and we can now get the data.

The rules are simple:

The following image shows how we walk the tree to get the value we are looking for:

fetch-i.png
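The walk from the root page down to a leaf can be sketched like this. The `Page` record and the helper names are hypothetical, the tree is built by hand to match version 5 as described in the text, and a key equal to a separator is assumed to go to the right child.

```java
import java.util.List;

final class FetchSketch {
    record Page(List<String> keys, List<Page> children) {
        boolean isLeaf() { return children.isEmpty(); }
    }

    static Page leaf(String... keys) {
        return new Page(List.of(keys), List.of());
    }

    static Page node(String key, Page left, Page right) {
        return new Page(List.of(key), List.of(left, right));
    }

    // Go down from the root: in each Node, pick the child whose key range
    // contains the key, until a Leaf is reached.
    static boolean contains(Page root, String key) {
        Page page = root;
        while (!page.isLeaf()) {
            int pos = 0;
            while (pos < page.keys().size() && key.compareTo(page.keys().get(pos)) >= 0) {
                pos++;
            }
            page = page.children().get(pos);
        }
        return page.keys().contains(key);
    }

    // Version 5: a root page, two Nodes and four leaves.
    static Page version5() {
        return node("E",
                node("D", leaf("A"), leaf("D")),
                node("I", leaf("E"), leaf("I", "K")));
    }

    public static void main(String[] args) {
        System.out.println(contains(version5(), "I"));
        System.out.println(contains(version5(), "B"));
    }
}
```

Looking for I goes right at the root, right again at the second level, and ends in the leaf holding I and K.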

Browsing many values

Now, we can decide to browse the B-tree values, either from the beginning or from a given value. This is slightly more complex, because we can't easily move from one leaf to another: we have to go up to the parents to do that. The following image shows such a browsing path:

browse-all.png

As we can see, we start from the bottom-left, get back to the parent, and down to the second leaf, back to the root page and down to the leaf containing E, up to the parent and down to the last leaf. Browsing is not a simple operation!
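One way to browse without links between leaves is to remember the path from the root page to the current leaf, so that we can go back up and then down into the next child. The sketch below does this with an explicit stack; the `Page` record, the `Frame` class and the hand-built tree are assumptions made for illustration, not Mavibot's actual cursor.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

final class BrowseSketch {
    record Page(List<String> keys, List<Page> children) {
        boolean isLeaf() { return children.isEmpty(); }
    }

    // One step of the path: a page and the next child to visit in it.
    static final class Frame {
        final Page page;
        int pos;
        Frame(Page page) { this.page = page; }
    }

    static List<String> browseAll(Page root) {
        List<String> out = new ArrayList<>();
        Deque<Frame> path = new ArrayDeque<>();
        path.push(new Frame(root));
        while (!path.isEmpty()) {
            Frame top = path.peek();
            if (top.page.isLeaf()) {
                out.addAll(top.page.keys()); // read the leaf...
                path.pop();                  // ...then go back up to the parent
            } else if (top.pos < top.page.children().size()) {
                path.push(new Frame(top.page.children().get(top.pos++))); // down to the next child
            } else {
                path.pop();                  // this Node is exhausted, go up again
            }
        }
        return out;
    }

    static Page leaf(String... keys) {
        return new Page(List.of(keys), List.of());
    }

    // Same shape as version 5: a root page, two Nodes and four leaves.
    static Page version5() {
        Page left = new Page(List.of("D"), List.of(leaf("A"), leaf("D")));
        Page right = new Page(List.of("I"), List.of(leaf("E"), leaf("I", "K")));
        return new Page(List.of("E"), List.of(left, right));
    }

    public static void main(String[] args) {
        System.out.println(browseAll(version5()));
    }
}
```

The stack holds at most one frame per level, and the root page and the intermediate Nodes are revisited every time we move from one leaf to the next.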

One important thing to note here though is that every reader will start from the root-page, and that the intermediary nodes are accessed many times. That gives us a clue about what pages need to stay in memory, but that will be discussed later on.

Conclusion

We are done with this blog post, which detailed the basic operations on an MVCC B-tree. In the next post, I will explore transactions.