--- layout: post status: PUBLISHED published: true title: Mavibot Transactions id: 3e0ee38c-32c3-4c5c-a845-85044f682d80 date: '2017-09-18 12:45:16 -0400' categories: directory tags: - directory - mavibot permalink: directory/entry/mavibot-transaction ---
So here is the blog post I promised 6 months ago... It took a bit of a time, due to a family expansion that sucked up pretty much all our free time :-)
Let's start stating the obvious !
It's simply 'a unit of work'. That means it's either fully completed, or it never happened. This is critical to database, as it makes certain the database is always in a consistent state. But it's more that that. A Transaction system must provide the following properties (aka ACID) :
You can't split any of the tasks in a transaction. The transaction is always seen as a whole.
The database must be valid after the transaction as it was before. It's mainly about guarantying that implied updates (like triggers, constraints, etc) are done within the transaction
Here, we are going to use a restricted version of 'isolation' : read committed. That means we guarantee a user will always be able to read consistent data at any time, and will not see 'dirty' data (ie, data being written during a transaction). However, when the transaction is written, then a user doing a second read may not see the same data he got with a previous read. This has some deep impact on the way the user should manage the results : there is no guarantee that a data one get with a read is still valid a few seconds later. We don't lock the database on reads.
This seems trivial : once committed, the transaction will let the database in a stable state, for ever. It's mainly about guaranteeing that the database survives a crash.
So far, we can assume that being a MVCC system, Mavibot already covers all those aspects :
The only part we have to take care of is the way we make a new version available to readers : at some point, every new reader should use the new version. In order to make that possible, we just have to switch one single reference to the current version.
At this point, I have to come back to how B-trees are managed in Mavibot.
A B-tree consist of three separate elements :
Basically, we always use the B-tree header as a reference when reading or updating a B-tree.
Now, we have special B-trees that are used to manage Mavibot :
I won't go any further here, I just wanted to expose those management elements for a better understanding of what will follow.
Well, we could live without them, but it would be very inefficient. There are two problems :
This is where adding a higher level of transaction is required and good to have. Let's see what that means more in detail.
The following two schema show what happens when we update one single tree 5 times (from the very beginning, with an empty B-tree, up to a point we have added 4 elements in it.
For the purpose of this demonstration, I'm using pages that can contain only 2 elements, but we would get the exact same result with bigger pages (except that it would be way more dreadful to draw :-)
In this schema, each page has a reference (its offset on disk), noted O
Pretty complicated...
What if we gather updates within a single transaction ?
Here is what we get if 3 of the previous updates are done within one single transaction :
As we can see, we have way less pages being created, mainly in the BoB and CPB B-trees. For the record, we avoid flushing 15 pages (out of 36), a 40% improvement.
Is it realistic to gather many operations on many B-trees into a single transaction ? It feels like we may lose some data if the system crashes in the middle of a transaction...
Actually, this is badly needed : it's quite rare that we update only one B-tree. Typically, when using Mavibot in ApacheDS, we want to commit a transaction only when the full LDAP update is completed, and that involves many B-trees :
We do want all these index to be updated all at once, otherwise if we have a crash when only a sub-set of them are cmmitted on disk, and teh rest aren't, then we end up with an inconsistent database.
So we want a transaction that covers all those B-trees, and the side benefit is that if any of those B-trees get updated more than once (like the RDN index), we save some disk access. We also save a lot of disk access as the BoB and CPB B-tress get updated at the very end, saving some more disk access.
We can move forward, and let the user decide that transaction should be committed only after a certain amount of updates being applied. This is typically something you want when pushing a lot of modifications in a batch.
The drawback is that you may lose everything from the point you started the transaction if your system crashes at some point before the commit, and it eats some memory as we have to keep the pending modification in memory until the moment the transaction is committed.
It's a balance : speed vs safety.
As soon as we start a transaction, we create a copy of the existing B-trees (well, not all of them, just the parts that are to be protected !) so that the existing readers aren't impacted by the on going modifications.
Now, for every modification done in a B-tree, we will copy the modified pages in memory, and update them. If a page has already been copied - and is in memory -, we simply reuse the copy. We have a map of all the in-memory pages for that purpose, and every new copy is put into this map.
When we are done with the modifications, we just have to flush all the pages in this map on the disk : we are done.
Well, not completely... We also have to update the BoB and the CPB B-trees, as usual. The point is that those updates can be done at the very end. Last, not least, we can move the unused pages into the free-pages list.
That is the theory, the implementation is a bit complex, and I won't expose it here...
At this point, a bit of code sample could help :
// Create the Mavibot database
RecordManager recordManager = new RecordManager( "MyData.bin" );
// Create a B-tree : we need to start a transaction...
try ( WriteTransaction transaction = recordManager.beginWriteTransaction() )
{
btree = recordManager.addBTree( transaction, "test", LongSerializer.INSTANCE, StringSerializer.INSTANCE, true );
}
// Add some data in the created B-tree.
try ( WriteTransaction writeTxn = recordManager.beginWriteTransaction() )
{
BTree
btree.insert( writeTxn, 1L, "1" );
btree.insert( writeTxn, 4L, "4" );
btree.insert( writeTxn, 2L, "2" );
}
// Now, read the B-tree
try ( Transaction readTxn = recordManager.beginReadTransaction() )
{
BTree
TupleCursor
while ( cursor.hasNext() )
{
System.out.println( cursor.next() );
}
}
...
This will print "<1,1>", "<2,2>" and "<4,4>".
A few remarks :
This is just a 10 feet high description of the transaction system we use in Mavibot, it does not go into details, nor it gives all the keys to be able to use Mavibot. The whole idea is to give a sense of what we are working on, and why we need transactions in Mavibot.
We may changes a thing here and there, typically some method names (so don't expect the code to be perfectly representative of what will be the released version). Anyway, it's going to be pretty close !
In the next blog post, I'll talk more in details about the management B-trees (BoB, CPB) and the free-pages management (ie, the Pages Reclaimer). There are quite interesting aspects to discuss when it comes to reclaiming pages :-)