Slides from my presentation at #ChefConf 2013
Big Data meets Configuration Management. Edmunds.com's first foray into Hadoop is a tale of challenges, discovery, and ultimately triumph. This is the story of how Edmunds.com leveraged Chef - and its community - to build a fully automated Hadoop cluster in the face of looming project deadlines.
Building Hadoop with Chef
1. Building & Managing Hadoop with Chef
John Martin
Sr Director, Production Engineering
2. Introduction
• Me, Me, Me
• 10+ years in .com & JEE space
• Project Crew
• Paul MacDougall
• Greg Rokita
• KC Braunschweig (former)
• Ryan Holmes (former)
• Edmunds.com
• Founded in 1966
• Gopher site in 1994
• HTTP site in 1995
3. Edmunds.com Environment
• Nearing 3000 hosts
• Heavily virtualized (Xen, CloudStack, AWS)
• Tomcat with some WebLogic
• Coherence
• Solr
• Mongo
• Publishing built on ActiveMQ
• Newly launched DWH built around Hadoop + Netezza
4. • Explosive infrastructure growth
• Quick to bootstrap
• Easy integration with our tooling
• knife
• The Chef Community
Why Chef?
5. • Open framework for data-intensive distributed applications
• Reigning King of “Big Data”
• Many services
• HDFS
• MapReduce
• HBase
• ZooKeeper
• Designed to run on commodity hardware
What’s Hadoop?
6. • Multiple Clusters
• Roughly 200TB in total
• 40+ nodes in production
• Maintained by Ops + Dev
• Dell R410
• Six-core 2.40GHz
• 24GB RAM
• 4x 1TB 7200RPM drives
Edmunds Hadoop Environment
7. • First cluster was a Frankenstein
• Part BMC
• Part manual effort
• Part Puppet
• Staff changes & knowledge loss
• Time for a clean slate!
How We Got Here
8. • True Dev + Ops effort
• Production built in 3 weeks
• Built with community cookbooks
• All services now administered with knife
• New nodes now cluster-ready within minutes
Building Hadoop with Chef
9. • First highly-visible Chef success story at Edmunds
• Cemented Chef as our CM solution
• Engaged us with the community
• Completely automated Hadoop infrastructure
• New suite of administrative scripts
• knife-[start|stop]-all.sh $cluster
• knife-[start|stop]-hbase.sh $cluster
• knife-[start|stop]-mapred.sh $cluster
• knife-[start|stop]-oozie.sh $cluster
What We Gained
10. • New cluster currently being built!
• Integration with Cloudera Manager
• Cluster replication
• Continue evangelism of Chef’s awesomeness
• Extend more of the toolchain around Chef
• See you around at the LA Chef UG!
Where Next?
We currently have close to 3000 hosts deployed in our environments. We are highly virtualized, relying on a mix of RHEL Xen and CloudStack. Historically, our server vendor of choice has been Dell, but we began using Cisco’s UCS chassis about a year ago to back our CloudStack pods. We’re a Java shop, with Tomcat 7 being our container of choice. There are still a few WebLogic applications floating around, but they are slowly being rebuilt on Tomcat. Our web apps rely on a mix of Oracle Coherence, Solr, and Mongo. All data and content on the website is kept up to date using our homegrown publishing solution built on ActiveMQ. And on the topic you’re here today to hear about… we are about to unveil our new data warehouse, built with Hadoop, HBase, and Netezza. I’ll be getting into that shortly.

There’s more to our environment, such as Oracle RACs and BPEL services, but everything listed here has been built and is supported with Chef. Since our adoption about 18 months ago, we have brought nearly all our services under Chef management. Our migration from WebLogic to Tomcat has been aided greatly by Chef adoption, shaving months off the original estimates.
How did Edmunds come to adopt Chef? Well, for a few years we had been trying to get our heads around configuration management. With explosive growth in the number of hosts we were building, it was an absolute necessity for us. We were customers of one of the big names in the space, but ultimately found the tool to be a challenge for us. So we began looking for something to replace it and started experimenting with Chef and Puppet. One of those experiments was with a Hadoop cluster.

As we began to put these different offerings through their paces, a few things about Chef stood out to us. The first was that it was easy for us to get up and running: with very minimal effort we had a Chef server, had set up a repo, and were off to the races. We then looked at how well it bolted together with the other tools in our toolchain. The good news was that it wasn’t going to be too difficult for us. The bad news was that we realized we didn’t like some of the other tools in the chain and were going to rewrite them now that we had a better configuration management tool. (That’s a whole other presentation.) knife was something that seemed easy for our admins to pick up. It was intuitive to wrap our heads around, and it was easy to see how powerful a weapon it could be. Other CMs have their equivalents, but it felt like we could get more done with knife with a shorter ramp-up.

Lastly, the Chef community ended up being a big factor in how Chef was adopted at Edmunds. There’s such a wealth of knowledge sharing – not just from Opscode, but from the daily users of Chef. Mailing lists, IRC, Twitter, blogs: the places we could go for help while we were learning our way were invaluable in those early days.
A really quick overview for the uninitiated as to what Hadoop is… In short, Hadoop is an open framework for running data-intensive distributed applications. When you hear anyone marketing “big data”, the first thing that comes to mind is Hadoop; it is the reigning king of “big data”. One of the project’s co-creators, Doug Cutting, named it after his son’s toy elephant, Hadoop.

The Hadoop framework is actually a collection of services. HDFS, at the foundation of the framework, is a distributed file system. Each data node in your Hadoop cluster runs an HDFS daemon and stores a portion of the file system’s blocks. It’s through this distribution of data that another of Hadoop’s services – MapReduce – gains its speed. By distributing copies of the data across multiple nodes, MapReduce and HBase are able to perform highly parallelized tasks at great speed. ZooKeeper is a distributed coordination and configuration registry service. As the name implies, ZooKeeper keeps track of all the animals running wild in your Hadoop cluster. It’s highly customizable, and several folks – ourselves included – have started using ZooKeeper for non-Hadoop projects.

While there are a lot of companies that sell “big data” platforms or appliances, Hadoop was specifically designed to run on commodity hardware. It was born out of Google’s MapReduce and Google File System white papers, in which Google described the massive scale at which they had run these services on cheap whitebox servers. In 2011, Facebook staked a claim to the largest Hadoop cluster in the world, clocking in at 30 petabytes across thousands of servers. Now, our Hadoop cluster isn’t anywhere near that size, but you can already see the need for some sort of configuration management solution to this problem, can’t you?

There is a lot more to Hadoop, HDFS, MapReduce, and HBase, so please don’t view this as a comprehensive look at any of them. I simply cannot do it justice in one slide, and I encourage anyone not familiar with their capabilities to do further research.
Okay – let’s get to the meat of the discussion: our Hadoop environment. We have two pre-production clusters and one production Hadoop cluster. There is approximately 200TB in total across the clusters, with the majority of that housed in the production cluster. The node counts you see to the side are for our production cluster. As I mentioned earlier, our hosts are built on Dell hardware. In these clusters we’re relying on a fleet of Dell R410s, a low-end PowerEdge model geared toward this type of work. There are just over 40 of these nodes in the production cluster. These clusters are managed daily by the combined efforts of Ops and Devs. This is where we’re at today. But we didn’t start with this setup. In fact, it’s a bit of a painful story as to how we got here.
Earlier, I mentioned that prior to our full adoption of Chef, we had been experimenting with other solutions. There was a five- or six-month period in which we experimented with both Chef and Puppet. Chef had certainly won several of us over, but others remained uncertain. As a result, we had a Hadoop cluster that was half-baked with Puppet. Because we were also learning our way around Hadoop at the time, that’s where most of our attention went, and so there were a large number of manual pieces in the Hadoop-Puppet cluster that just weren’t scaling for us. This is really no fault of, or knock against, Puppet. As I said, we simply were not prioritizing the config management aspect of this project, but rather the Hadoop services themselves.

When the Systems Engineer/Architect of that solution left the organization, many of us were scratching our heads as to how the thing functioned. The documentation wasn’t well fleshed out, and the Hadoop project owners were pushing for a large expansion in a short period of time. The team that had been evaluating config management tools wanted their shot at this challenge. We got the Hadoop project owners to allow a one-month freeze on their expansion requests and went to work. The path forward was clear: rather than try to figure out how to right the Hadoop-Puppet project, the best thing to do was scrap it and move forward with a cluster built with Chef.
That put into motion our first highly visible project using Chef. This was collaboration at its finest. In a single week, engineers across Dev + Ops came together from both the Hadoop and config management projects to scope out the effort. Then, within 3 weeks, all the cookbooks necessary to fully automate and manage our Hadoop infrastructure were put in place. This could not have been accomplished without cookbooks already available from the community. More specifically, we relied heavily on the ‘hadoop_cluster’ cookbook by InfoChimps to get us up and running. We weren’t out to reinvent the wheel, and we are far from being a unique snowflake. By leveraging the cookbooks already out there, we shaved precious time off our effort. So over the course of those 3 weeks, the InfoChimps cookbooks were tailored to our environment. We also wrote a few of our own for HBase and for the deployment of Oozie workflows. We may get to a point in the future where we push those changes and new cookbooks back out to the community, given how helpful they were to us. At the end of the effort, we had a brand new production Hadoop cluster that was fully automated. Adding a new node now takes as little as 15 minutes once the host is racked and available for bootstrapping. Last summer the production cluster was expanded yet again, with minimal effort.
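For the curious, the bootstrap step described above might look something like the sketch below. The function wrapper, host naming, SSH user, and role name are illustrative assumptions, not Edmunds’ actual values; only the `knife bootstrap` invocation itself reflects standard Chef usage of the era.

```shell
#!/bin/sh
# Hypothetical sketch of bringing a freshly racked host into the
# cluster with knife. User and role names are assumptions.

# Register the host with the Chef server and converge it straight
# into the data-node run list on its first chef-client run.
bootstrap_datanode() {
  host="${1:?usage: bootstrap_datanode <fqdn>}"
  knife bootstrap "$host" \
    --ssh-user deploy --sudo \
    --run-list 'role[hadoop_datanode]'
}
```

Invoked as `bootstrap_datanode <fqdn>` for each new node; once the first chef-client run completes, the role’s recipes lay down the Hadoop services, which is what makes a node cluster-ready within minutes.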
Our production Hadoop cluster being managed by Chef was a significant win and gained us a lot of internal traction in solidifying Chef as our configuration manager. It was our first highly visible success story with Chef. We had asked our project owner for a month’s reprieve on delivery, and in that time revitalized our Hadoop infrastructure. Not only had we demonstrated the power of Chef, we had shown the power of leveraging the community. I know I’ve said it a few times already, but I simply can’t stress enough how difficult a task this would have been if we had needed to write all the cookbooks ourselves. It encouraged us to begin engaging with our peers in other technical communities as well; the positive interactions with the Chef community gave us a real kick out of our shells. The “S” in the DevOps CALMS acronym is for “sharing”, and it’s something we have really embraced because of the initial experience we had with this project. Kudos to every one of you out there participating.

So now we had a fully automated Hadoop infrastructure. Gone were the days of half-automated/half-manual administrative tasks. Completely automated from top to bottom. What’s more, because of this automation we were able to provide some great scripts that leverage knife for starting and stopping either the entire cluster – dangerous, and not really suggested! – or specific services within the cluster. While these scripts don’t do anything magical and are really just nifty knife ssh invocations based on roles, they abstract away any required knowledge of how to use knife ssh. They were really great in the early days of Chef adoption at Edmunds because we could demonstrate Chef’s capabilities without requiring a huge amount of upfront education for new users.
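As a concrete illustration, one of those wrapper scripts might look like the sketch below: a role search fed to knife ssh, exactly as described above. The role and service names here are hypothetical, not Edmunds’ actual naming.

```shell
#!/bin/sh
# Hypothetical sketch of knife-stop-hbase.sh; role and service names
# are assumptions for illustration.

# Map a cluster name to the node search that knife ssh will use.
hbase_query() {
  printf 'role:hbase_%s' "$1"
}

# Fan the stop command out, in parallel, to every node holding the
# cluster's HBase role. Operators only ever supply a cluster name.
stop_hbase() {
  cluster="${1:?usage: knife-stop-hbase.sh <cluster>}"
  knife ssh "$(hbase_query "$cluster")" \
    'sudo service hadoop-hbase-regionserver stop'
}
```

An operator runs the script with just a cluster name, never needing to learn the underlying `knife ssh` search syntax, which is exactly why these wrappers worked so well for newcomers.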
So where are we going with all this? To start, we’ll be building a new production Hadoop cluster in the next couple of months. It will be a significant re-architecture for us, as we’ll be using larger boxes than we have in the past. What’s more, we’re going to take a stab at using Cloudera Manager to help us with the build-out. Cloudera Manager will provide some great insight into the performance and health of our cluster that we’ve been missing, and we expect it will help our development teams find performance bottlenecks within our Hadoop clusters. We’re also going to get into cluster replication, as well as exploring how to make it work across our data centers. That’s new territory for us, so it should be interesting to experiment with.

On the Chef front, we’ll continue our evangelism of just how awesome a tool we think it is. We’re really excited to be a part of the new LA Chef UG that’s been started. We’ll also continue to extend our tooling around Chef. Right now we’ve got several tools already making great use of Chef, and I don’t see that integration stopping anytime in the foreseeable future. That’s about all I had to present today. Thanks for your time.