Hadoop intro

INTRODUCTION TO
HADOOP
Explaining a complex product in 20 minutes or less…

INTRODUCTION
Keith R. Davis
Data Architect – NEMSIS Project
University of Utah, School of Medicine
keith.davis@hsc.utah.edu

WHAT IS HADOOP?
Hadoop is an open source Apache software project that enables
the distributed processing of large data sets across clusters of
commodity servers.

A QUICK BIT OF HISTORY…
• (2004) Google publishes the GFS and MapReduce papers
• (2005) Apache Nutch search project rewritten to use MapReduce
• (2006) Hadoop was factored out of the Apache Nutch project
• (2006) Development was sponsored by Yahoo
• (2008) Becomes a top-level Apache project
• (Trivia) Why is it called Hadoop?
• It was named after the principal architect’s son's toy elephant!

WHO IS USING HADOOP?
And more…

HOW IS HADOOP DIFFERENT FROM A
TRADITIONAL RDBMS?
• Data is not stored in tables
• Haoop supports only forward parsing
• Hadoop doesn’t guarantee ACID properties
• Hadoop takes code to the data
• Scales horizontally vs. vertically

WHAT’S THE BIG DEAL?
Hadoop is:
• Easily Scalable– New cluster nodes can be added as needed
• Cost effective– Hadoop brings massively parallel computing to commodity servers
• Flexible– Hadoop is schema-less, and can absorb any type of data
• Fault tolerant– Share nothing architecture prevents data loss and process failure

WHEN SHOULD I USE HADOOP?
Use Hadoop when you need to:
• Process a terabytes of unstructured data
• Running batch jobs is acceptable
• You have access to a lot of cheap hardware
DO NOT use Hadoop when you need to:
• Perform calculations with little or no data (Pi to one million places)
• Process data in a transactional manner
• Have interactive ad-hoc results (this is changing)

BASIC ARCHITECTURE
Hadoop consists of two primary services:
1. Reliable storage though HDFS (Hadoop Distributed File System)
2. Parallel data processing using a technique known as MapReduce

HOW IT WORKS: HDFS WRITE STEP #1
(FILE SPLITS)
Input Data
(CSV)
Block #2
Block #1
Block #3

HOW IT WORKS: HDFS WRITE STEP #2
(REPLICATION)
Block
#1
Block
#2
Block
#1
Block
#3
Block
#3
Block
#2
Node #1 Node #2
Node #3

HOW IT WORKS: MAP/REDUCE
Client
Job
Scheduler
Data
Node
Data
Node
Data
Node
Data
Node
...
...
HDFSFileSystem(input)
HDFSFileSystem(output)
Mapper
Mapper
Mapper
Reducer
Reducer
Mapper
Mapper
Mapper
Reducer
Reducer

LOOKS COMPLICATED!
Not to worry, there are many ways to access the power of MapReduce:
• Hadoop Java API (If you like Java and low level stuff)
• Pig (If you are a script wiz and LINQ doesn’t scare you)
• Hive (You know some SQL and coding isn’t your thing)
• RHadoop (If R is your thing)
• SAS/ACCESS (If SAS is your thing)

HIVE: THE EASY WAY TO GET DATA OUT
• Supports the concepts of databases, tables, and partitions through the use of
metadata (think of views over delimited text files)
• Supports a restricted version of SQL (no updates or deletes)
• Supports joins between tables - INNER, OUTER (FULL, LEFT, and RIGHT)
• Supports UNION to combine multiple SELECT STATEMENTS
• Provides a rich set of data types and predefined functions
• Allows the user to create custom scalar and aggregate functions
• Executes queries via MapReduce
• Provides JDBC and ODBC drivers for integration with other applications
• Hive is NOT a replacement for a traditional RDBMS as it is not ACID compliant

HIVE: MATH AND STATS FUNCTIONS
If you use HIVE to create sample sets for your analysis, here are a few standard
functions you may find useful:
round(), floor(), ceil(), rand(), exp(), ln(), log10(), log2(), log(), pow(), sqrt(), bin()
, hex(), unhex(), conv(), abs(), pmod(), sin(), asin(), cos(), acos(), tan(), atan(),
degrees(), radians(), positive(), negative(), sign(), e(), pi(), count(), sum(), avg(),
min(), max(), variance(), var_samp(), stddev_pop(), stddev_samp(), covar_pop
(), covar_samp(), corr(), percentile(), percentile_approx(), histogram_numeric(),
collect_set()

RESOURCES
• Cloudera (Easy Setup) - http://www.cloudera.com/content/cloudera/en/home.html
• NoSQL - http://nosql-database.org/
• Emulab - http://www.emulab.net/
• Apache Hadoop - http://hadoop.apache.org/#Getting+Started
• RHadoop - https://github.com/RevolutionAnalytics/RHadoop/wiki
• SAS/ACCESS - http://www.sas.com/software/data-management/access/index.html

Hadoop intro

Recommandé

Recommandé

Contenu connexe

Tendances

Tendances (19)

En vedette

En vedette (13)

Similaire à Hadoop intro

Similaire à Hadoop intro (20)

Dernier

Dernier (20)

Hadoop intro

Notes de l'éditeur