---
layout: post
status: PUBLISHED
published: true
title: 'All you wanted to know about Hadoop, but were too afraid to ask: genealogy of elephants.'
excerpt: "<p>Lining up versions of Hadoop and making sense of all of them and their relations can be quite difficult.&nbsp;</p> \r\n <p>This article attempts to address the moot points and help you understand the &quot;bigger picture&quot; - literally. <br /></p>"
id: ddfa655e-b974-4ccf-85b8-0e5be1558bf3
date: '2012-02-09 21:23:57 -0500'
categories: bigtop
tags:
- versions
- hadoop
- lineage
permalink: bigtop/entry/all_you_wanted_to_know
---
<p>Hadoop is taking center stage in discussions about processing large amounts of unstructured data.</p>
<p>As the system's popularity rises, I have found that people are really puzzled by the multiplicity of Hadoop versions; by the small, yet annoying differences introduced by different vendors; by the frustration when vendors try to lock in their customers using readily available open source data analytics components on top of Hadoop; and on and on.</p>
<p>So, after explaining who was born from whom for the 3rd time - and I tell you, drawing neat pictures on a napkin in a coffee shop isn't my favorite activity - I put together the little diagram below. Click on it to inspect it in greater detail. A warning: the diagram only includes the more or less significant releases of Hadoop and Hadoop-derived systems available today. I don't want to waste any time on obscure releases or branches that were never accepted at any significant level. The only exception is 0.21, which was a natural continuation of 0.20 and the predecessor of the recently released 0.22.
</p>
<p style="clear: both; text-align: center;" class="separator"><img src="http://2.bp.blogspot.com/-pHJR7XSCTlM/TzS-3-oGS4I/AAAAAAAAAB4/9M7OUrDapro/s640/hadoop-vers.png" /></p>
<p>Some explanations for the diagram:</p>
<ul>
<li>Green rectangles designate official Apache Hadoop releases, openly available to anyone in the world for free.</li>
<li>Black ovals show Hadoop branches that are not yet officially released by Apache Hadoop (and might never be). However, they are usually available as source code or tarball artifacts.</li>
<li>Red ovals are commercial Hadoop derivatives, which might be based on Hadoop or use Hadoop as part of a custom system (as in the case of MapR). These derivatives may or may not be compatible with Hadoop and the Hadoop data processing stack.</li>
</ul>
<p>Once you're presented with a view like this, it becomes clear that there are two centers of gravity in today's universe of elephants: 0.20.2-based releases and derivatives; and 0.22-based branches, future releases, and derivatives. It also becomes quite clear which ones are likely to be sucked into a black hole.</p>
<p>The transition from 0.20+ to 0.2[1,2] was really critical because it introduced true HDFS append, fault injection, and code injection for system testing. Yet 0.21 went unreleased for a long time, leaving an empty space in a high-demand environment. Even after it did come out, it didn't get any traction in the community. Meanwhile, HDFS append was critical for HBase to move forward, so 0.20.2-append was created to support that effort.
A quite similar story happened with 0.22: two different release managers tried to get it out; the first gave up, but the second actually succeeded in rallying part of the community behind the effort.</p>
<p>As you can see, HDFS append wasn't available in an official Apache Hadoop release for some time (except for 0.21, with the earlier disclaimer). Eventually it was merged into 0.20.205 (recently dubbed Hadoop 1.0), which allows HBase to be nicely integrated with official Apache Hadoop without any custom patching.</p>
<p>The release of 0.20.203 was quite significant because it provided heavily tested Hadoop security, developed by the Yahoo! Hadoop development team (known as Hortonworks nowadays). Bits and pieces of 0.20.203 - even before the official release - were absorbed by at least one commercial vendor to add corporate-grade Kerberos security to its derivative of Hadoop (as in the case of Cloudera CDH3).</p>
<p>The diagram above clearly shows a few important gaps in the rest of the commercial offerings:</p>
<ol>
<li>None of them (EMC, IBM, and MapR) supports Kerberos security.</li>
<li>HBase is unavailable due to the lack of HDFS append in their systems (EMC, IBM). In the case of MapR, you end up using a custom HBase distributed by MapR. I don't want to speculate about the latter in this article.</li>
</ol>
<p>Apparently, the vacuum of significant releases between 0.20 and 0.22 became a major spur for the Hadoop PMC, and now - just days after the release of 1.0 - 0.22 is out, with 0.23 already going through the release process, championed by the Hortonworks team.
That release brings in some interesting innovations like HDFS Federation and MapReduce 2.0.</p>
<p>Once the current alpha 0.23 (which might become Hadoop 2.0 or even Hadoop 3.0) is ready for its final release, I would expect new versions of commercial distributions to spring to life, as was the case before. At that point I will update the diagram :)</p>
<p>If you can imagine the variety of other animals, such as Pig and Hive, piling on top of Hadoop, you will be astonished by the complexity of the inter-component relations and, more importantly, by the intricacies of building a stable data processing stack. This is why the Bigtop project has been so important and popular ever since it sprang to life last year. You can read about Bigtop's relation to the Hadoop stack <a href="http://is.gd/5kJ6Iv" target="_blank">here</a>.</p>
<p><a href="http://is.gd/H4Jfa7">Cross-posted from</a></p>