_posts/2017-02-16-apache-mavibot-history.html (133 lines of code) (raw):
---
layout: post
status: PUBLISHED
published: true
title: Apache Mavibot history
id: 58152764-1378-4668-8820-a8d6e25ced73
date: '2017-02-16 23:23:41 -0500'
categories: directory
tags:
- diretcory
- mavibot
permalink: directory/entry/apache-mavibot-history
---
<h1>Introduction</h1>
<p>
First of all, let me introduce <b>Apache Mavibot</b>: it's a <a href="https://en.wikipedia.org/wiki/Multiversion_concurrency_control">MVCC</a> <a href="https://en.wikipedia.org/wiki/B%2B_tree">B+<br />
tree</a> library in Java under an <a href="https://www.apache.org/licenses/LICENSE-2.0">AL 2.0 license</a> (<b>MVCC</b> stands for<br />
<em>Multi-Version Concurrency Control</em>). The whole idea is to have a B-tree<br />
implementation that never crashes, and does not use locks to protect the<br />
data against concurrent access (well … while reading).</p>
<p>
The B+ tree is a variant of a B-tree, where values are only stored in<br />
the leaves, not in internal nodes.</p>
<p>
Ok. Good. You don't know much about Mavibot after this introduction, so<br />
I'll dig a bit deeper in this post. Let's start with the original idea.</p>
<h2>Apache Directory, CouchDB, and some other databases...</h2>
<p>
Back in 2009, I was attending the <a href="http://archive.apachecon.com/c/acus2009/">Apache Conference in Oakland</a>. I had been<br />
working on the <a href="http://directory.apache.org">Apache Directory project</a> for a bit more than a 4 years<br />
and a half. <em>Apache Directory</em> is a <b>LDAP</b> server written in Java, and we<br />
chose to store data in B-trees. There was a very limited choice back<br />
then, and the library we used - and still use as of today - was <a href="http://jdbm.sourceforge.net/">JDBM<br />
</a>, a java avatar of <a href="http://www.gnu.org.ua/software/gdbm/">GDBM</a>.</p>
<p>
<b>JDBM</b> is written in Java, implements <b>B+trees</b> and has transactions support<br />
(experimental), but it has one big drawback: it's not a cross-B-trees<br />
transaction system. And that does not fit our requirement in <b>LDAP</b>.</p>
<p>
An alternative could have been <b>Berkeley DB &tm;</b>, which released a Java<br />
edition of its database, but its license was incompatible with the AL<br />
2.0 license. Moreover Berkeley DB was bought by <b>Oracle</b> in 2006, so it was<br />
simply not an option.</p>
<h2>What’s wrong with using JDBM?</h2>
<p>
In <b>LDAP</b>, an update operation impacts one single entry but this entry<br />
uses many <em>AttributeTypes</em>, which can be indexed. In other words, an<br />
update will impact as many B-trees as we have indexes (and a bit more).<br />
In order to guarantee that an entry UPDATE is consistent, we must be<br />
sure that either all or none of the indexes have been flushed to disk:<br />
otherwise we might end with an inconsistent database, where some indexes<br />
are up to date when some other aren't.</p>
<p><a href="https://blogs.apache.org/directory/mediaresource/847a2aa6-dce8-4d6f-9a65-79c74219d785"><img src="https://blogs.apache.org/directory/mediaresource/847a2aa6-dce8-4d6f-9a65-79c74219d785" alt="indexeldap.png" height="50%" width="50%"></img></a></p>
<p>
Even worse, in the event of a crash, we might simply not be able<br />
to restart the server because the database gets corrupted (and<br />
sadly, we are experiencing this problem today...).</p>
<p>
So in <b>Oakland</b>, I went to the <a href="http://couchdb.apache.org/">Apache Couch-DB</a> presentation (sadly,<br />
the slides are not anymore available), and was struck by the idea behind<br />
their database: <b>MVCC</b>. Crucially when you start to use the<br />
database at a given revision, you always see everything associated with<br />
this revision, up to the point you are done. Sounds like <b>Git</b> or<br />
<b>Subversion</b> … actually, it's pretty much the same mechanism.</p>
<p>
Being able to process some read operations on a specific version of the<br />
database guarantees that no update will ever corrupt the data being<br />
processed. And every time we want to access the database, the very first<br />
thing it will do is to select the latest available version: this is<br />
all we will see during the operation processing. Perfect when you don't really care about having a fresh view of the stored data at any time, which is the case in <b>LDAP</b>.</p>
<p>
But <b>Apache CouchDB</b> was written in <b>Erlang</b> :/ Anyway, the discussion we<br />
had with the Directory team was really about moving to a <b>MVCC</b> database.</p>
<h2>Transactions</h2>
<p>
Transactions are another big missing feature in <b>LDAP</b>. This is not something that was<br />
in the air back then: it was specified only <a href="https://tools.ietf.org/html/rfc5805">one year later</a>. Of course, the original specifications said that every operation is atomic, but there is no requirement for multiple operations to be atomic (and we often need to update two entries in <b>LDAP</b>, and to guarantee that those two operations are either completed, or<br />
roll-backed). Think about user/group management...</p>
<p>
<b>Alex Karasulu</b> always had in mind that we needed a transactional database<br />
in Apache Directory, too. And his point was proved correct when years<br />
later, we faced the first database corruptions. It's a bit sad that we<br />
ignored this aspect for so long :/</p>
<p>
Anyway, we needed (a) transactions and (b) a rock solid database that could<br />
resist any type of crash.</p>
<h2>Locks</h2>
<p>
For some time, we tried to mitigate the consistency problems we had by<br />
adding tons of locks. As we weren't able to protect the database against<br />
concurrent reads and writes we made them exclusive (i.e.<br />
when some write is processed, no read can be processed). This was<br />
slightly better, but it came at a huge cost: a major slowdown when<br />
writes were done. Also it was not good enough: long-lasting searches<br />
were just killing us, as there were no solution to guarantee that an<br />
entry for which we had a reference would still be present in the database<br />
when we needed to fetch it. In such cases, we simply ended up by<br />
discarding the entry. Last, not least, a crash in the middle of an<br />
update operation would leave the database in a potential inconsistent<br />
state, which would make it impossible to start again (this was somehow<br />
mitigated by adding a 'repair' mode lately, but this is just an horrible<br />
hack).</p>
<h2>Mavibot first steps</h2>
<p>
So we needed something better, which turned out to be <b>Mavibot</b>. We started working<br />
on Mavibot in June 2012 (<a href="http://svn.apache.org/viewvc/labs/mavibot/README.txt?revision=1349589&view=markup&pathrev=1349589">Jun 13 00:04:10 2012, exactly</a>).</p>
<p>
The funny thing is that <b>OpenLDAP</b> started to work on the exact same kind<br />
of database 1 year before (<a href="http://www.openldap.org/devel/gitweb.cgi?p=openldap.git;a=commit;h=d620d4368a7ee17d60f2b381d4c80b22c68ba8e2">LMDB</a>) - even if the <a href="http://marc.info/?l=openldap-devel&m=124253456912512">discussion about the<br />
need for such a database started in 2009</a>. Parallel discussions,<br />
parallel developments, we have always shared a lot!</p>
<p>
The very first released version of <b>Mavibot</b> was out one year later, in<br />
June 2013, followed by 7 other versions (all of them milestones). At<br />
some point, we added a <b>MVBT</b> partition in <b>ApacheDS</b>, in 2.0.0-M13 (and it<br />
was using a SNAPSHOT!!! Mavibot 1.0.0-M1 was used in <b>ApacheDS</b><br />
2.0.0-M15). This was 'good enough' to have the <b>LDAP</b< server working (and<br />
it was 2 to 5 times faster than <b>JDBM</b>, too ;-), but it didn't offer all<br />
we wanted to add: typically, we didn't have transaction support.</p>
<p>
So why isn’t <b>Mavibot</b> the <b>Apache Directory Server</b> backend of choice today?</p>
<p>
Well, we don't have cross B-tree transactions, so we are pretty much in<br />
the same situation as with <b>JDBM</b> (except that it's faster, and we also<br />
have a bulk-loader for <b>Mavibot</b>). Adding cross-B-trees transaction is not<br />
a piece of cake, and it requires some thinking. Sadly, it arrived at a<br />
moment where the team had less time to work on it (new jobs, family,<br />
you name it).</p>
<p>
So in 2017, the effort has been rebooted, and we do expect to have a<br />
working version soon enough!</p>
<p>
I'll blog later on about various technical aspects on <b>Apache Mavibot</b>, so keep tuned !</p>