This session will discuss a collection of guidelines and advice to help a technologist complete their first Hadoop project. Part 1 reviews tactics to "sell" Hadoop to stakeholders and senior management, including understanding what Hadoop is, alignment of goals, picking the right project, and level setting expectations. Part 2 covers running a successful Hadoop development project, including training, preparation & planning activities, development & test activities, and deployment & operations activities. Also included are talking points to help with educating stakeholders.
Hadoop World 2011: Practical Knowledge for Your First Hadoop Project - Mark Slusar, NAVTEQ, Boris Lublinsky, NAVTEQ, & Mike Segel, Segel & Associates
1. Boris Lublinsky / NAVTEQ
Mark Slusar / NAVTEQ
Mike Segel / Segel & Assoc.
2. Boris Lublinsky
• 25+ yrs experience as an enterprise architect with a focus on
end to end solutions, distributed systems, SOA, BPM, etc
• InfoQ SOA editor, OASIS member, writer, speaker
• FermiLab, SSA, Platinum, CNA , Navteq, et al
Mike Segel
• 20+ yrs experience in the IT industry with a focus on high-powered
computing, information management, and
philosophy
• Founder of Chicago Hadoop User Group (an excuse to drink
beer and eat pizza )
• Clients include NAVTEQ, Orc, IBM, Informix, Montgomery
Ward, CCC, and others…
Mark Slusar
• 15 yrs experience with a background of design, technology,
and leadership
• Sponsor of Chicago Hadoop User Group
• Federal Reserve, NEC, United Airlines, NAVTEQ, et al
3. This presentation is based on our 2+ years of Hadoop
Projects onboarding experiences
Part 1: Tactics to "sell" Hadoop to Stakeholders and Senior
Management
• Understanding what Hadoop is
• Alignment of goals
• Picking the right project
• Level setting expectations
Part 2: Running a Successful Development Project
• Training
• Preparation & Planning Activities
• Development & Test Activities
• Deployment & Operations Activities
4.
5. • Define the problem:
• Understand the company's pain(s)
• Find the right problem to solve
• Low-hanging fruit
• High value
• High visibility
• Don't bet the farm
• Create a problem statement
Sell the solution(s), not a technology
• Selling is an educational process
• Understand that Hadoop is a tool, not a panacea or "cure-all"
6. Hadoop vs. Not Hadoop

Hadoop:
• Large data storage
• Bringing execution to the data
• Structured and unstructured data
• Massively parallel processing
• Extensible ecosystem

Not Hadoop:
• Real-time data processing (difficult)
• Data set is not large enough
• Processing algorithm not compatible with M/R
• Existing processes are well suited to solve the problem
• ACID requirements (transaction-based)
• One person doing a million things vs. one million people doing one thing
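To make the "massively parallel processing" item concrete, here is a toy word count in plain Python that mimics the shape of a MapReduce job: map emits key/value pairs, the framework groups them by key (the shuffle), and reduce aggregates each group. This is a sketch of the programming model only, not actual Hadoop API code.

```python
from collections import defaultdict

# Toy word count in the MapReduce style: map emits (key, value)
# pairs, shuffle groups them by key, reduce aggregates each group.

def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

lines = ["Hadoop brings execution to the data",
         "the data stays put"]
counts = reduce_phase(shuffle(map_phase(lines)))
```

In a real cluster each map task runs where its input split lives, which is exactly the "bringing execution to the data" point above.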
7. Set realistic goals
Set boundaries
Avoid scope creep
Embrace what you don't know:
• Honest evaluation of your and your team's skills
• Hadoop is a paradigm shift; you need to alter your
approach to solving the problem
• Level set expectations
• The technology is new to the organization
• There is a learning curve
• TANSTAAFL (There ain't no such thing as a free lunch.)
• Think for yourself: take Hadoop urban legends with a grain
of salt
8. The sales process takes time.
Selling is an educational process
For you:
• Learn the Stakeholders Pain
• Determine the Scope of the problem
• Formulate your own estimates
For your Stakeholders:
• Must "buy in" to your solution
• Appreciate the underlying technology
• Understand the risks
Don't oversell or underestimate
9. Reaffirm the stated pains and any identified latent
pain(s).
Give your audience time to digest the presented
information.
Show how the solution solves their problem
Avoid "the bottom line"
Understand common objections and overcome
them:
• "…We can do this in an RDBMS…"
• "…This sounds risky…"
• "…Who else is doing this?…"
• "…Who's using it in production?…"
• "…Sounds expensive…"
Talking points included at the end of the slide presentation
10.
11. Executive Sponsorship – Identify the key players and understand
their "pains".
Project is Sufficiently Funded
Project Charter – The project is well defined with set goals and
expectations.
Level Set Expectations: The technology is new to your company,
and it should be expected that you will face setbacks during the
project. (Lower the expectations to a point where you know you
can exceed them.)
Outside Expertise. (Buy/Build/Blended Model)
12. Resources have been identified and have been dedicated to this
project.
Business Analyst Support – a good understanding of the data
and its access patterns is essential.
Architecture – Hadoop is a paradigm shift; it is essential to reflect
it in the solution architecture. Integration with existing enterprise
applications can present additional challenges.
Developers – Candidates need Java/Unix proficiency, a number of
data-driven projects under their belt, and the ability and desire to learn
new tricks.
Infrastructure Support – have Hadoop administrators who are
experienced and/or capable of learning.
Training – Not just APIs, but also Hadoop concepts and patterns.
13. Hadoop is an unregistered TM of Apache
There exist several companies that provide commercial
support for Hadoop and Hadoop derivatives.
• Cloudera
• MapRTech
• HortonWorks
• Others (HStreaming, DataStax, …)
And there is also Amazon…
14. Application – Walk through the business process and create a simple
plain-English outline of what you want to achieve in each step.
Hardware – Determine your initial data set(s) and design your
cluster accordingly.
Design & development are iterative processes.
Your first iteration is rarely your last iteration.
Don't be embarrassed by your code. Share it with others for feedback
and improvement.
KISS, KISS, KISS
Data storage – Which to use: HDFS or HBase?
15. HDFS vs. HBase

HDFS:
• Use HDFS when you are always going to access your data as an entire set or a very large subset.
• HDFS access is sequential read only.
• HDFS supports only create and append.
• HDFS is mainly used in MapReduce. Direct access from the client is possible, but typically requires indexing. It provides language (Java) APIs only.
• When using HDFS you always want large (GB) size files.
• Packaging smaller sized files into larger ones requires development effort.

HBase:
• Use HBase when you want random access to your data set: access individual records, partial records, and subsets of records. HBase provides more control over partitioning data.
• HBase supports get, put, update, and scan of sequential keys.
• HBase can be accessed from either a MapReduce program or directly from a client. It supports Java, REST, and Thrift APIs.
• HBase provides built-in versioning and purging of data.
• Many new enhancements are coming; coprocessors are the most significant.
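The HDFS-versus-HBase distinction is really about access patterns. A toy Python sketch of the contrast, using an append-only list for HDFS-style sequential access and a keyed map for HBase-style random access (illustrative only; these are not the real client APIs):

```python
# Illustrative contrast of the two access patterns (not real APIs).

class SequentialStore:
    """HDFS-style: create/append only; read the whole set sequentially."""
    def __init__(self):
        self._records = []
    def append(self, record):
        self._records.append(record)
    def scan_all(self):
        return iter(self._records)  # no random access, no update-in-place

class KeyedStore:
    """HBase-style: random get/put by row key, plus scans of key ranges."""
    def __init__(self):
        self._rows = {}
    def put(self, key, value):
        self._rows[key] = value     # an update is just another put
    def get(self, key):
        return self._rows.get(key)
    def scan(self, start, stop):
        return {k: v for k, v in sorted(self._rows.items())
                if start <= k < stop}

hdfs = SequentialStore()
for r in ("r1", "r2", "r3"):
    hdfs.append(r)

hbase = KeyedStore()
hbase.put("row-2", "b")
hbase.put("row-1", "a")
hbase.put("row-3", "c")
```

If your jobs always touch most of the data, the sequential model is simpler and faster; if they touch individual rows, you want the keyed model.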
16. Automate your Environment Setup
Use Puppet, Chef, Cloudera Enterprise Manager, etc…
Rely on Hadoop Ecosystem whenever possible.
Configuration
• See Mike Guenther's lecture (CHUG archive)
• Use the Cloudera docs
• Configuration is a continuous process
• Tune the cluster and the application independently
• Don't optimize your cluster for your application; optimize your
application for your cluster
Plan your Development Iterations
• Data storage Model
• ETL (loading data in/out of Hadoop)
• Automate Environment Setup
• Processing
• Integration (interacting with other enterprise applications)
• Reporting interface & diagnostics to show speed and utilization
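The "automate your environment setup" point boils down to generating every node's configuration from one source of truth instead of hand-editing files. A minimal hypothetical sketch (the `fs.default.name` property is a standard Hadoop-era key, but the template and cluster dictionary are illustrative; real deployments would use Puppet, Chef, or similar):

```python
from string import Template

# Hypothetical sketch: render a core-site.xml-style property for
# every node from one cluster definition, so that rebuilding a
# node is a repeatable, scripted step rather than manual tweaking.
TEMPLATE = Template(
    "<property><name>fs.default.name</name>"
    "<value>hdfs://$namenode:$port</value></property>"
)

# Single source of truth for the (hypothetical) cluster.
CLUSTER = {"namenode": "nn01.example.com", "port": 8020}

def render_config(cluster):
    return TEMPLATE.substitute(cluster)

config = render_config(CLUSTER)
```

The point is not the templating library; it is that the same inputs always produce the same node configuration.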
17. Understand the MapReduce model and patterns – read Jimmy Lin
and Chris Dyer's book Data-Intensive Text Processing with
MapReduce.
See if you really need reducers (they are expensive), and if you
do, try to use combiners.
Use a custom InputFormat if you need better control over the
execution of maps.
Programmatic writes to predetermined files may lead to
unpredictable results.
Use Oozie for orchestrating multiple MapReduce jobs.
Use Oozie to automatically start your jobs when data arrives.
Don't be afraid to ask for help.
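The combiner advice above can be made concrete: a combiner pre-aggregates each map task's output locally, so far fewer key/value pairs cross the network during the shuffle. A plain-Python simulation of the idea (not Hadoop API code):

```python
from collections import Counter

def map_output(lines):
    # One map task: emit (word, 1) for every word in its split.
    return [(w.lower(), 1) for line in lines for w in line.split()]

def combine(pairs):
    # Combiner: the same aggregation a reducer would do, but run
    # locally on one map task's output before the shuffle.
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return list(totals.items())

split = ["to be or not to be"] * 100   # one map task's input split
raw = map_output(split)                # 600 pairs would hit the shuffle
combined = combine(raw)                # only 4 pairs (one per word) do
```

A combiner is only valid when the reduce operation is associative and commutative (as summation is here).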
18. Be prepared to refactor your code many times. You often start
wrong, but your goal is to end right.
Tom White's Hadoop book
Lars George's HBase book
In addition to MapReduce, investigate additional Hadoop
technologies (Pig, Hive, Flume, et al.)
Be prescriptive; use only the technology you really need.
Don't forget about the community; they will be extremely
helpful. See http://www.meetup.com/Chicago-area-Hadoop-
User-Group-CHUG/ [Shameless plug.]
19. Unit test the application and the interface.
Test Hadoop – report issues to Cloudera.
Opening support tickets* – a life saver for new teams. (Cloudera
offers support contracts.)
Optimize your application, not the cluster.
End-to-end testing – it matters; it ensures confidence.
Performance testing – it's one of the drivers of the project.
Make sure you test on realistic data volumes – results can be
deceiving on smaller data sets.
Showcase the ability of the cluster compared to existing
systems.
Consulting – have consultants look over your application, but do not
outsource the implementation to them. Make sure you build internal
knowledge.
*Assumes that you have a corporate license…
20. SLAs – Not advisable for Hadoop Project #1
Involve Deployment & Operations personnel from the get-go;
they will be supporting it
Operations Team:
• Hadoop administration training
• Data analysts & users trained and involved in the
process as stakeholders
Data Maintenance – the role of the DBA begins to
change; existing DBAs should take an interest in Hadoop.
Playbooks – should help address many Hadoop-related issues
without involving developers & architects.
UATs – use as needed and depending on methodology
21. What worked well in the first project?
What did not work?
Ready to process Mission Critical Data?
Begin to establish SLAs?
Consider real-time data delivery?
Ready to support enterprise data?
http://hadoop.apache.org/ (Apache Hadoop)
http://www.cloudera.com/ (Cloudera)
http://www.meetup.com/Chicago-area-Hadoop-User-Group-CHUG/ (CHUG)
Or find Mike, Boris, or Mark on LinkedIn
24. • Scalability – A large data problem can be broken into many pieces
processed in parallel by 10, 100, or 1,000 machines, all
working toward a common goal. Adding more machines improves
scalability.
• Incredible performance – Hadoop holds the performance record
for data processing (terabyte sort in 209 seconds – Yahoo!)
• Data integrity – Data is stored multiple times across nodes.
• Separation of concerns – developers need to write only business
code – mappers and reducers. All infrastructure “heavy lifting” &
job management is done by the framework.
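The scalability and separation-of-concerns points above share one shape: split the data, run the same business logic on each piece, merge the results. A sketch using Python's standard thread pool standing in for cluster nodes (illustrative only; Hadoop schedules map tasks across machines, not threads, and handles the splitting for you):

```python
from concurrent.futures import ThreadPoolExecutor

# Scatter/gather: partition one large input into chunks, process
# the chunks in parallel, then merge the partial results.

def process_chunk(chunk):
    return sum(chunk)              # stand-in for real per-split work

def parallel_sum(data, workers=4):
    size = max(1, len(data) // workers)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(process_chunk, chunks))

total = parallel_sum(list(range(1000)))
```

Adding workers shrinks each chunk, which is exactly the "adding more machines improves scalability" claim; the developer only wrote `process_chunk`, the framework did the rest.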
25. • Yahoo – Content Optimization, Sorting, Ad Placement
• Facebook – Largest Hadoop Cluster, Terabytes of insights
processed per DAY. Social email.
• LinkedIn – Computationally Intensive operations for Enterprise
Data: “People You May Know”, “Viewers of this Profile Also
Viewed”, “Job Recommendations”
• Groupon – Analytics and Data mining on “Extreme Data”
• Nokia- See http://www.cloudera.com/videos/apache-hadoop-
nokia-josh-devins
• For more companies see:
http://wiki.apache.org/hadoop/PoweredBy
26. • Massive data storage – ability to correlate seemingly disparate
data. Ability to store lots of historical data.
• Computational Power – Ability to run reports and ask questions
that could previously not be asked – asking “golden questions”
• Throughput – time to complete jobs allows even more “golden
questions”
• “Golden questions” – change the game, drive profits, and
positively disrupt businesses
27. • Commodity Resources - Nodes cost as much as a workstation.
No specialized hardware.
• Expenditures – No software purchases, no negotiations with
vendors, no licensing headaches – free downloads. (For the initial
PoC installation.)
• Easily proved - Proof of Concept can be executed in a
virtualized environment or at a public cloud.
Editor's notes
This is our obligatory slide that tells you who we are and that some of us are really old and have been doing this for far too long. Everyone does their own intro, so that the audience knows who we are. Maybe have everyone introduce themselves, but I really don't want to pimp myself. [Mikey]
[Mikey] I want to preface this slide by stating that the ideal audience for this presentation is someone who's just starting to investigate Hadoop and wants to introduce it to their organization. If you've already started implementing a project, please pay attention to Part 2, where we discuss ways to increase your project's chance of success. Also, any feedback on your Hadoop selling experience will be valuable to the authors.
[Mikey] Step one: setting your goals. The first thing one needs to do is identify what problems you want to solve. Create a "short list" of the problems and determine which problem is the best candidate. Look for a problem that can be solved in an M/R environment. Look for a problem where you're not "betting the farm", one where, if you fail to deliver a solution on time and on budget, you're not going to condemn Hadoop as an option for future projects. Create a problem statement which in plain English identifies the problem you are attempting to solve and some "boxing" constraints which limit the scope of the problem. Once you have identified the problem you want to solve, you need to sell the problem and solution to your stakeholders. In selling the solution you want to focus on the solution itself and not the underlying technology; in this case, we are talking about Hadoop. Sure, Hadoop is sexy and everyone wants to learn it… to pad their resume. But your stakeholders don't care about the technology, just that you have a potential solution which solves their problem and is cost effective. While we are here because we like Hadoop and use Hadoop, remember that Hadoop is just a tool; it's not a "cure all" and not perfect for every problem. If you're at the stage where you know you want to use Hadoop, but you don't know what sorts of problems you need to solve, it's time to identify the potential stakeholders, those who
[Mikey] This leads us to our next point. When do we want to use Hadoop? What sorts of problems do we think will be a good fit for Hadoop, and what problems do we think would be better solved using a different tool? These are all questions that we have to think about before settling on a tool. [Boris will walk through the slide]
[Mikey] Part of the selling process is to first realize what you want to sell to "management". You first have to set your goals and know what you want to gain from the project. (Besides learning how to work with a really cool tool and padding your resume…) If you do not yet have a problem to solve, you may want to do some research and talk with your stakeholders. So… we set realistic goals, like processing X records per hour or Y incoming files: some metric that you know you can beat and should really be obtainable. Once you have the project and the goals, you need to set boundaries, like processing only a specific stream of data, or only handling CSV files and not XML input files. Once you've set your boundaries, avoid scope creep if at all possible, noting that you can always add to the project after you get it working. Lock down the requirements at the start of the project. [Talk through points on slide…]
[Mikey] There is a psychology to the selling process. At a high level: even if you've done your homework and know the answer before presenting the solution, if you provide the answer too quickly, your stakeholders will suspect you and your solution. You have to listen to your stakeholders, learn their "pain", and determine the scope of the problem and its constraints. (Proposing a million-dollar solution when there's only $100,000 in the budget doesn't help.) By listening to the stakeholders, you are showing them that you are crafting your solution to meet their needs, and when you present your solution you can address and re-affirm their pain. Once you have a rough idea, get estimates rather than relying on a SWAG. It's OK to say you don't know something; make an action item and take the time to get the right answer. The stakeholders need to buy in to your solution and take ownership of it. They need to appreciate the underlying technology, its challenges, and its risks.
[Mikey] When presenting the solution to the stakeholders, you need to have your ducks in a row. You need to re-affirm their stated pains, along with any latent pains you find while talking with the stakeholders' team(s). This not only shows that you are presenting a solution but that your solution addresses their needs, and it starts the process of their taking ownership of the solution. During the presentation you want to avoid "cutting to the chase", going straight to the bottom line of saying that it costs X dollars. By going straight to the bottom line, you don't give the stakeholders and project leads time to digest the solution and to take ownership of the problem. Stakeholders typically don't care about the underlying technology. They are more interested in finding a cost-effective solution that solves their problem and can be modified as the business or environment changes. This is not to say that explaining the technology in the solution isn't important, but your "sales" success is going to be based on how well you meet their criteria for success. Does the solution meet their needs? What's the time to value? Relative to other possible solutions, is this cost effective? There are some objections that you can't overcome. In these situations, Hadoop isn't a good fit and you should move on to the next potential problem to solve.
Mark
Boris + Mark
Hadoop is an unregistered trademark of Apache and is meant to refer to Apache's release only. Any release which is not the official release from Apache would be a derivative work. Cloudera offers a derivative work which is free and also has commercial support. MapRTech has a derivative work that replaces the underlying HDFS with their own proprietary use of C++, writing directly to the raw disk. [Mikey, Boris: Amazon]
The KISS principle has been around for ages. Regardless of your design methodology, you want to start off simple and build out. This allows one to learn the technology and work through the design challenges. When working through the software design, start by creating a simple English description of how you want to process the data and what you want to achieve in each step. This is useful when you need to go back to the SME/Business Analyst, who may not be familiar with UML or a class diagram. (It's also a document that you can use to verify the other diagrams.) [Boris, Mike: hardware]
Boris
[mike] I am not sure what you want to say with this slide; please add speaker notes! [mark] One of your first tasks is setting up your environment. Whether you go virtual or physical for your first project, you will need to refer to documentation; do not stray too far from the default configuration until you are comfortable or advised to do so. Use a tool like Puppet or Chef for configuration management. Additional configuration tips can be found in Mike Guenther's presentation and at docs.cloudera.com. As you develop features, you will need to address your data model. You will also be writing code to ingest data, process it, and display it. Keep in mind that these features will be part of your report on how you succeeded with Hadoop. This is a multi-level slide: 1. You need a reproducible environment; you can't afford to rely on manual tweaking every time you have to reinstall. 2. Without proper configuration your application will not work. Configuration is a two-level process: optimize your cluster to run any job well, then optimize your job for the cluster you are using. Give an example of separating HBase configuration from your table configuration. 3. Describe the design steps. Mark?