_posts/2017-02-16-apache-mavibot-history.html (133 lines of code) (raw):

--- layout: post status: PUBLISHED published: true title: Apache Mavibot history id: 58152764-1378-4668-8820-a8d6e25ced73 date: '2017-02-16 23:23:41 -0500' categories: directory tags: - diretcory - mavibot permalink: directory/entry/apache-mavibot-history --- <h1>Introduction</h1> <p> First of all, let me introduce <b>Apache Mavibot</b>: it's a <a href="https://en.wikipedia.org/wiki/Multiversion_concurrency_control">MVCC</a> <a href="https://en.wikipedia.org/wiki/B%2B_tree">B+<br /> tree</a> library in Java under an <a href="https://www.apache.org/licenses/LICENSE-2.0">AL 2.0 license</a> (<b>MVCC</b> stands for<br /> <em>Multi-Version Concurrency Control</em>). The whole idea is to have a B-tree<br /> implementation that never crashes, and does not use locks to protect the<br /> data against concurrent access (well &hellip; while reading).</p> <p> The B+ tree is a variant of a B-tree, where values are only stored in<br /> the leaves, not in internal nodes.</p> <p> Ok. Good. You don't know much about Mavibot after this introduction, so<br /> I'll dig a bit deeper in this post. Let's start with the original idea.</p> <h2>Apache Directory, CouchDB, and some other databases...</h2> <p> Back in 2009, I was attending the <a href="http://archive.apachecon.com/c/acus2009/">Apache Conference in Oakland</a>. I had been<br /> working on the <a href="http://directory.apache.org">Apache Directory project</a> for a bit more than a 4 years<br /> and a half. <em>Apache Directory</em> is a <b>LDAP</b> server written in Java, and we<br /> chose to store data in B-trees. There was a very limited choice back<br /> then, and the library we used - and still use as of today - was <a href="http://jdbm.sourceforge.net/">JDBM<br /> </a>, a java avatar of <a href="http://www.gnu.org.ua/software/gdbm/">GDBM</a>.</p> <p> <b>JDBM</b> is written in Java, implements <b>B+trees</b> and has transactions support<br /> (experimental), but it has one big drawback: it's not a cross-B-trees<br /> transaction system. And that does not fit our requirement in <b>LDAP</b>.</p> <p> An alternative could have been <b>Berkeley DB &tm;</b>, which released a Java<br /> edition of its database, but its license was incompatible with the AL<br /> 2.0 license. Moreover Berkeley DB was bought by <b>Oracle</b> in 2006, so it was<br /> simply not an option.</p> <h2>What&rsquo;s wrong with using JDBM?</h2> <p> In <b>LDAP</b>, an update operation impacts one single entry but this entry<br /> uses many <em>AttributeTypes</em>, which can be indexed. In other words, an<br /> update will impact as many B-trees as we have indexes (and a bit more).<br /> In order to guarantee that an entry UPDATE is consistent, we must be<br /> sure that either all or none of the indexes have been flushed to disk:<br /> otherwise we might end with an inconsistent database, where some indexes<br /> are up to date when some other aren't.</p> <p><a href="https://blogs.apache.org/directory/mediaresource/847a2aa6-dce8-4d6f-9a65-79c74219d785"><img src="https://blogs.apache.org/directory/mediaresource/847a2aa6-dce8-4d6f-9a65-79c74219d785" alt="indexeldap.png" height="50%" width="50%"></img></a></p> <p> Even worse, in the event of a crash, we might simply not be able<br /> to restart the server because the database gets corrupted (and<br /> sadly, we are experiencing this problem today...).</p> <p> So in <b>Oakland</b>, I went to the <a href="http://couchdb.apache.org/">Apache Couch-DB</a> presentation (sadly,<br /> the slides are not anymore available), and was struck by the idea behind<br /> their database: <b>MVCC</b>. Crucially when you start to use the<br /> database at a given revision, you always see everything associated with<br /> this revision, up to the point you are done. Sounds like <b>Git</b> or<br /> <b>Subversion</b> &hellip; actually, it's pretty much the same mechanism.</p> <p> Being able to process some read operations on a specific version of the<br /> database guarantees that no update will ever corrupt the data being<br /> processed. And every time we want to access the database, the very first<br /> thing it will do is to select the latest available version: this is<br /> all we will see during the operation processing. Perfect when you don't really care about having a fresh view of the stored data at any time, which is the case in <b>LDAP</b>.</p> <p> But <b>Apache CouchDB</b> was written in <b>Erlang</b> :/ Anyway, the discussion we<br /> had with the Directory team was really about moving to a <b>MVCC</b> database.</p> <h2>Transactions</h2> <p> Transactions are another big missing feature in <b>LDAP</b>. This is not something that was<br /> in the air back then: it was specified only <a href="https://tools.ietf.org/html/rfc5805">one year later</a>. Of course, the original specifications said that every operation is atomic, but there is no requirement for multiple operations to be atomic (and we often need to update two entries in <b>LDAP</b>, and to guarantee that those two operations are either completed, or<br /> roll-backed). Think about user/group management...</p> <p> <b>Alex Karasulu</b> always had in mind that we needed a transactional database<br /> in Apache Directory, too. And his point was proved correct when years<br /> later, we faced the first database corruptions. It's a bit sad that we<br /> ignored this aspect for so long :/</p> <p> Anyway, we needed (a) transactions and (b) a rock solid database that could<br /> resist any type of crash.</p> <h2>Locks</h2> <p> For some time, we tried to mitigate the consistency problems we had by<br /> adding tons of locks. As we weren't able to protect the database against<br /> concurrent reads and writes we made them exclusive (i.e.<br /> when some write is processed, no read can be processed). This was<br /> slightly better, but it came at a huge cost: a major slowdown when<br /> writes were done. Also it was not good enough: long-lasting searches<br /> were just killing us, as there were no solution to guarantee that an<br /> entry for which we had a reference would still be present in the database<br /> when we needed to fetch it. In such cases, we simply ended up by<br /> discarding the entry.&nbsp; Last, not least, a crash in the middle of an<br /> update operation would leave the database in a potential inconsistent<br /> state, which would make it impossible to start again (this was somehow<br /> mitigated by adding a 'repair' mode lately, but this is just an horrible<br /> hack).</p> <h2>Mavibot first steps</h2> <p> So we needed something better, which turned out to be <b>Mavibot</b>. We started working<br /> on Mavibot in June 2012 (<a href="http://svn.apache.org/viewvc/labs/mavibot/README.txt?revision=1349589&view=markup&pathrev=1349589">Jun 13 00:04:10 2012, exactly</a>).</p> <p> The funny thing is that <b>OpenLDAP</b> started to work on the exact same kind<br /> of database 1 year before (<a href="http://www.openldap.org/devel/gitweb.cgi?p=openldap.git;a=commit;h=d620d4368a7ee17d60f2b381d4c80b22c68ba8e2">LMDB</a>) - even if the <a href="http://marc.info/?l=openldap-devel&m=124253456912512">discussion about the<br /> need for such a database started in 2009</a>. Parallel discussions,<br /> parallel developments, we have always shared a lot!</p> <p> The very first released version of <b>Mavibot</b> was out one year later, in<br /> June 2013, followed by 7 other versions (all of them milestones). At<br /> some point, we added a <b>MVBT</b> partition in <b>ApacheDS</b>, in 2.0.0-M13 (and it<br /> was using a SNAPSHOT!!! Mavibot 1.0.0-M1 was used in <b>ApacheDS</b><br /> 2.0.0-M15). This was 'good enough' to have the <b>LDAP</b< server working (and<br /> it was 2 to 5 times faster than <b>JDBM</b>, too ;-), but it didn't offer all<br /> we wanted to add: typically, we didn't have transaction support.</p> <p> So why isn&rsquo;t <b>Mavibot</b> the <b>Apache Directory Server</b> backend of choice today?</p> <p> Well, we don't have cross B-tree transactions, so we are pretty much in<br /> the same situation as with <b>JDBM</b> (except that it's faster, and we also<br /> have a bulk-loader for <b>Mavibot</b>). Adding cross-B-trees transaction is not<br /> a piece of cake, and it requires some thinking. Sadly, it arrived at a<br /> moment where the team had less time to work on it (new jobs, family,<br /> you name it).</p> <p> So in 2017, the effort has been rebooted, and we do expect to have a<br /> working version soon enough!</p> <p> I'll blog later on about various technical aspects on <b>Apache Mavibot</b>, so keep tuned !</p>