Get better results from your Big Data. Learn how to build a Hadoop business service with BMC Control-M for improved service quality and increased business agility.
Every industry segment is already using Big Data. If yours is not in the above list, it’s only because there’s not enough room.
This is a single-slide explanation and justification for Hadoop:
Companies have lots of data, both conventional structured data and new types. But the fundamental problem is that there is more data than can be economically processed with traditional approaches.
The traditional approach takes the ENTIRE data set, loads or copies it into a relational database or some other file structure, and processes it from top to bottom sequentially. This takes a long time and the hardware is VERY expensive. Even if money were no object, and when is THAT ever the case, buying bigger and bigger hardware would still give only incremental increases in capacity and speed.
Hadoop, on the other hand, uses a cluster of cheap (NOT merely inexpensive but cheap) hardware that is expected to fail and so is disposable. The cluster technology provides full redundancy, so when (not if) a component fails, the switch to an alternate copy of the data being processed is completely transparent.
The data is broken up into as many pieces as you have servers in the cluster (the largest Hadoop clusters have tens of thousands of nodes) and is processed in parallel. It is common for processing time to drop from hours or even days to minutes.
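To make that split-and-process-in-parallel idea concrete, here is a minimal word-count sketch using the standard Hadoop MapReduce Java API. It is purely illustrative and not part of the project described here; the input and output paths are hypothetical command-line arguments.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Each mapper processes one split of the input data, in parallel across the cluster.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducers aggregate the partial counts produced by the mappers.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // hypothetical HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // hypothetical HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Because each mapper works on its own block of the distributed file, adding nodes to the cluster adds both storage and processing capacity at the same time.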
So let’s talk about a project to deliver the first business service to take advantage of Big Data.
You’ll see in a moment why (Hadoop) is in brackets.
A business requirement or goal is established and Application Developers start designing. One of the first steps, and a requirement we have heard from almost every single customer we've talked to, is identifying the data that will be processed. Almost without exception, that data includes a whole bunch of traditional data sources like relational databases, data processed by ETL tools, and data transferred from various sources, plus perhaps some new data types like social, web click, or sensor data.
Then all the other phases of Application Development are performed. This sounds very similar to other projects and in fact, much of what I’m about to discuss applies to just about every application development project.
So this sounds easy, right? Just like the stuff you do all the time. Perhaps not.
The pressure to deliver Big Data applications is huge. Many organizations identify Big Data initiatives as key competitive differentiators, and time is not your friend.
Hadoop and Big Data are relatively new technologies so there are few experienced, seasoned practitioners. According to Gartner, in the next few years, the market will provide only 25% of the required staffing to fill Big Data positions.
These factors conspire to make Big Data/Hadoop projects particularly challenging to staff and deliver, and the pitfalls and delays that plague traditional projects make Big Data projects even harder for organizations to pull off successfully.
This “piece of cake” can make you very sick.
Let’s examine some of the more challenging problems, especially those where, if you go down the common path, it becomes really difficult to change course later.
We’ve identified data sources. Let’s quickly look at how each one is handled within the context of our first Big Data project.
So somebody in AppDev says “I know how to fix this” and scripts a bunch of this stuff, but not all of these tools can be eliminated, so you end up with a bunch of scripts AND a bunch of tools. Testing is done and the application is delivered to Operations to “run this stuff”.
So somebody in AppDev says “I know how to fix this” and scripts all this stuff. Testing is done and the application is delivered to Operations to “run this stuff”.
There are five Control-M Hadoop job types:
Java MapReduce
Pig script
Hive
Sqoop
HDFS File Watcher
All five are sub-types of the Hadoop job type. Select “Hadoop” from the Job Palette. Then select the “execution” type.
CLICK
You can add program and environment parameters for this specific execution.
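As a rough illustration, assuming the job runs a standard Java MapReduce program, parameters added to the job definition can be passed as generic "-D name=value" options and read from the Hadoop Configuration. The driver class and the "report.date" parameter below are hypothetical, not Control-M or Hadoop built-ins.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Hypothetical driver showing how "-D name=value" program parameters reach the application.
public class ParamAwareDriver extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    Configuration conf = getConf();
    // "report.date" is an illustrative parameter name only.
    String reportDate = conf.get("report.date", "unset");
    System.out.println("Running with report.date=" + reportDate);
    // ... set up and submit the MapReduce job here ...
    return 0;
  }

  public static void main(String[] args) throws Exception {
    // ToolRunner parses generic options such as: -D report.date=2014-06-30
    System.exit(ToolRunner.run(new Configuration(), new ParamAwareDriver(), args));
  }
}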
CLICK
It’s common to have to manipulate files before and after program execution, so the Pre/Post Commands feature lets you perform HDFS operations via the “Pre Commands” and “Post Commands” sections in the job definition. You can also choose whether the success of these Pre/Post actions affects the overall job status.
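The Pre/Post commands themselves are entered in the job form. Purely as an illustration of the kind of HDFS housekeeping they typically cover (clearing an old output directory before a run, archiving results afterwards), here is a hedged sketch using the Hadoop FileSystem Java API; the paths are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Illustrative only: typical pre/post HDFS housekeeping, expressed with the Hadoop FileSystem API.
public class HdfsHousekeeping {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());

    // "Pre" step: remove a stale output directory so the job can recreate it.
    Path output = new Path("/data/daily/output"); // hypothetical path
    if (fs.exists(output)) {
      fs.delete(output, true); // recursive delete
    }

    // ... the Hadoop job itself would run between the pre and post steps ...

    // "Post" step: move the fresh results into an archive location.
    Path archive = new Path("/data/daily/archive/2014-06-30"); // hypothetical path
    fs.mkdirs(archive.getParent());
    fs.rename(output, archive);

    fs.close();
  }
}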
Once you have built your Hadoop jobs, building a flow is a simple drag-and-drop process, whether it contains only Hadoop jobs or connects them into an enterprise business process that includes ETL, RDBMS extracts, file transfers, and any other job types and applications that Control-M supports.
And when the workflow is finished, if you need to add an SLA, add a backup at the end, or start up a VM that may not always be powered on, that too is a simple drag-and-drop addition of a BIM, Backup, or VMware job.
The Connection Profile simplifies setup by collecting all environment info into a single object that is encrypted and managed by Control-M.
Monitoring Hadoop jobs is just like monitoring any other Control-M application: you can view job output, perform operational actions like Kill, and provide visibility via Self Service.
Control-M now provides huge value and great capabilities through the entire lifecycle of Hadoop applications. Developers can build Control-M jobs with Workload Change Manager (the simple, web-based, self-service job authoring component you will hear about very shortly) and submit requests to Production Control, which can then service the request quickly and get it into production. Once the application is in operation, Control-M Self Service provides access to all constituents in the business.
For Big Data, there seems to be a new project almost every day that will require integration. The challenges I’ve been discussing are destined to be encountered over and over again with these and other technologies.