---
layout: post
status: PUBLISHED
published: true
title: Conception and validation of Hadoop BigData stack.
excerpt: "<p>What is the Bigtop project? What are its goals, and how does it go about achieving them? What are the roots and founding ideas of the project?</p>\r\n<p>I think you'll find the answers to these questions in what will hopefully become a series of posts helping IT professionals with Hadoop stack deployment and adoption.</p>"
id: 38099aac-f594-4d7e-a227-40a322a47e22
date: '2011-12-28 00:59:00 -0500'
categories: bigtop
tags:
- bigtop
- hadoop
permalink: bigtop/entry/conception_and_validation_of_hadoop
---

<p>With more and more people jumping on the big data bandwagon, it is very gratifying to see that Hadoop is gaining momentum every day.</p>

<p>Even more fascinating is to see how the idea of putting together a bunch of service components on top of Hadoop proper is picking up more and more steam. IT and software development professionals are developing a better understanding of the benefits that a flexible set of loosely coupled yet compatible components provides when one needs to customize a data processing solution at scale.</p>

<p>The biggest problem for most businesses trying to add Hadoop infrastructure to their existing IT is a lack of knowledge, professional support, and/or a clear understanding of what's out there on the market to help you. Essentially, Hadoop exists in one incarnation: the open-source project under the umbrella of the Apache Software Foundation (ASF). This is where all the innovations in Hadoop come from. And essentially this is the source of profit for a few commercial offerings.</p>

<p>What's wrong with this picture, you might ask? Well, the main issues with most of those offerings are twofold. They are either immature and based on sometimes unfinished or unreleased Hadoop code, or they provide no significant added value compared to Hadoop proper, available in source form from <a href="http://hadoop.apache.org/">hadoop.apache.org</a>. And whether either of the above (or both together) applies to a commercial solution based on Hadoop, you can be sure of one thing: these solutions will cost you literally tons of money - as much as $1k/node/year in some cases - for what is essentially available for free.</p>

<p>"What about the neat packages I can get from a commercial provider, and perhaps some training too?" one might ask. Well, yeah, if you are willing to pay top bucks per node to get <a href="http://is.gd/WKBkuI">packaging bugs</a> fixed or to learn how to install packages on a virtual machine - go ahead by all means.</p>

<p>However, keep in mind that you can always get a set of packages for Hadoop produced by another open-source project called <a href="https://incubator.apache.org/bigtop/">Bigtop</a>, hosted by Apache. What you essentially get from it are packages for your Linux distro, which can be easily installed on your cluster's nodes. A great benefit is that you can easily trim your Hadoop stack to include only what you need: Hadoop + Hive, or perhaps Hadoop + HBase (which will automatically pick up ZooKeeper for you).</p>
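<p>To give a flavor of what that looks like in practice, here is a minimal sketch of installing a trimmed-down stack on a Debian/Ubuntu node. The package names below are the typical Bigtop ones, but the exact names and the repository setup depend on your distro and the Bigtop release you pick:</p>

<pre>
# Once the Bigtop apt repository has been configured on the node:
sudo apt-get update

# A "Hadoop + Hive" stack:
sudo apt-get install hadoop hive

# Or a "Hadoop + HBase" stack; the package manager resolves HBase's
# dependency on ZooKeeper and pulls it in automatically:
sudo apt-get install hadoop hbase
</pre>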
<p>At any rate, the best part of the story isn't a set of packages that can be installed: after all, that is what packages are usually created for, right? The problem with packages or other forms of component distribution is compatibility: you don't know in advance whether package A will work nicely with package B v1.2 unless somebody has tested that assumption. Even then, the testing environment might be significantly different from your production environment, and then all bets are off. Unless - again - you're willing to pay through the nose to someone who is willing to sort it out for you. And that's where the true miracle of something like Bigtop comes to the rescue.</p>

<p>Before I explain more, I want to step back a bit and look at some recent history. A couple of years ago, Yahoo's Hadoop development team had to address the issue of putting together a working and well-validated Hadoop stack including a number of components developed by different engineering organizations, each with its own development schedule and integration criteria. The main integration point for all of the pieces was the operations team, which was in charge of a large number of cluster deployments, provisioning, and support. Without their own QA staff, they were oftentimes at the mercy of the assumed code or configuration quality coming from all corners of the company. Even if all of these components happened to be of high quality, there were no guarantees that they would work together as expected once put together on a cluster. And indeed, integration problems were many.</p>

<p>That's where a small team of engineers, including yours truly, put together a prototype of a system called FIT (Final Integration Testing). The system essentially allowed you to pick a packaged component you wanted to validate against your cluster environment and perform the deployment, configuration, and testing with integration scenarios provided either by the component's owner or by your own team.</p>

<p>The approach was so effective that the project was continued and funded further in the form of HIT (Hadoop Integration Testing). At that point, two of us left for what seemed like a greener pasture back then.</p>

<p>We thought the idea was the real deal, so we continued on the path of developing a less custom and more adoptable technology based on open standards such as Maven and Groovy. Here you can find the <a href="http://www.scribd.com/doc/63012489/Big-Data-Stacks-Validation" target="_blank">slides from the talk</a> we gave at eBay about a year ago. The presentation puts the concept of a Hadoop data stack in open writing for the first time, and it also defines the stack customization and validation technology. By the time this presentation was given, we already had a <a href="http://is.gd/XvhqFW" target="_blank">well-working mechanism</a> for creating, deploying, and validating both packaged and non-packaged Hadoop components.</p>

<p>Bigtop - <i>open-sourced for the second time</i> just a few months ago and based on our project above - has added a package creation layer on top of the stack validation product. This, of course, makes your life even easier. And even more so with a number of Puppet recipes allowing you to deploy and configure your cluster in a highly efficient and automatic manner.</p>
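<p>As a rough illustration of what the Puppet-driven deployment looks like, here is a sketch only: the module path and class name below are hypothetical placeholders, so consult the Bigtop sources for the actual recipes and their layout:</p>

<pre>
# Apply the Bigtop puppet recipes on a node to turn it into a Hadoop
# cluster member. Both the module path and the class name here are
# illustrative, not Bigtop's actual ones.
sudo puppet apply --modulepath=/path/to/bigtop/puppet/modules \
     -e 'include hadoop_cluster_node'
</pre>

<p>Run across your nodes, this is what turns a pile of packages into a configured cluster.</p>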
<p>I encourage you to check it out.</p>

<p>Bigtop has been successfully used for validating the release of Apache Hadoop 0.20.205, which became the foundation of Hadoop 1.0.0. Another release of Hadoop - 0.22 - used Bigtop for release candidate validation, and so on.</p>

<p>On top of "just packages," Bigtop now produces ready-to-go VMs pre-installed with the Hadoop stack for different Linux distributions: just download one and instantiate your very own cluster in minutes! We'll tell you more about it next time.</p>

<p>I encourage you to check out the Bigtop project and contribute your ideas, time, and knowledge to it!</p>

<p><a href="http://is.gd/06yvPM">Cross-posted</a></p>