Introduction to Hadoop
The Essentials
November 25, 2013

Fadi Yousuf
About Me
• Founder and Managing Director of Axeldata Systems
• 13+ years involved in designing data architectures
• Previous life at Sun, Cisco, Oracle, Google, F5 Networks
• Working with Hadoop since 2011
• Certified as Cloudera Hadoop Developer, Administrator and HBase Specialist
• Authorized Cloudera Hadoop trainer
• Perspective - Hadoop is the foundation of scalable big data platforms
© 2013. Axeldata Systems FZ-LLC

Why Hadoop?
• RDBMS technology has served us well for 30+ years
• Excellent for low-latency, real-time, transaction-oriented data processing
• In the age of big data, RDBMS has many limitations:
– Volume: shared-all architecture limits linear scalability and requires fork-lift upgrades of hardware infrastructure when limits are reached
– Variety: data has to fit nicely into rows and columns with a rigid schema, which suits structured data but fails to handle unstructured data
– Velocity: ingesting data at speed means you can’t afford the time to shape data into the clean structures of relational databases

A Brief History of Hadoop
2002: Nutch created
2003: Google publishes GFS paper
2004: Google publishes MapReduce paper
2005: Nutch rearchitecture
2006: Hadoop subproject
2007: 1000-node Yahoo! cluster
2008: Top-level Apache Project
2009: First commercial distribution
2010: Hive, Pig, HBase graduate
2011: Further commercial distributions
2012: Impala, the first real-time query engine
2013: Hadoop 2.0
The Birth of Hadoop
“The name my kid gave a stuffed yellow
elephant. Short, relatively easy to spell
and pronounce, meaningless, and not
used elsewhere: those are my naming
criteria. Kids are good at generating
such.”
- Doug Cutting, Creator of Hadoop

Hadoop: The Big Data Platform
It is a framework that allows for the distributed
processing of large data sets across clusters of
computers using simple programming models

Core Hadoop Concepts
• Applications are written in high-level code
– Developers don’t need to worry about network programming and
dependencies

• Minimal communication between the nodes
– Shared nothing architecture

• Move compute to storage, not the opposite
– Computation happens locally on each machine
– No need to move data around

• Failure is accepted and tolerated
– Data is replicated multiple times across different machines

Hadoop Then…
• Storage
– Hadoop Distributed File System (HDFS)
• Programming Framework
– MapReduce

[Diagram: Batch MR, Resource Management, Storage, Integration]
Hadoop Now…
• Storage
– Hadoop Distributed File System (HDFS)
• Programming Framework
– MapReduce

[Diagram (source: Cloudera): Batch MR, SQL, Search, Math & Stats, In-Memory, …, Security, Metadata, Resource Management, Storage, Integration]
What is HDFS?
• Distributed file system
• Breaks large files into
smaller blocks that are
stored on clusters of nodes
• Master-Slave architecture
• Processes:
– NameNode (Master)
– Standby NameNode (Master)
– DataNode (Slave)

[Diagram: NameNode, Standby NameNode and four DataNodes]
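
As a rough illustration of how a large file is broken into blocks, the sketch below splits a file size into fixed-size blocks and estimates the raw storage used once the blocks are replicated. The 64 MB block size is the figure shown on the HDFS Architecture slide; the replication factor of 3 is an assumption inferred from the three locations listed per block on that slide, and the function names are hypothetical rather than part of any Hadoop API.

```python
BLOCK_SIZE_MB = 64   # block size shown on the HDFS Architecture slide
REPLICATION = 3      # assumption: three replicas per block


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the size in MB of each block a file of this size would occupy."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block only holds what is left over
    return sizes


def raw_storage_mb(file_size_mb, replication=REPLICATION):
    """Total cluster storage consumed once every block is replicated."""
    return file_size_mb * replication


print(split_into_blocks(200))  # [64, 64, 64, 8]
print(raw_storage_mb(200))     # 600
```

A 200 MB file therefore occupies four blocks, and with three replicas each it consumes about 600 MB of raw disk across the cluster.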
HDFS Architecture

NameNode (with Standby NameNode) metadata:

File     Blocks
File1    1 2 3
File2    4 5

Block    Location
1        n1r1 n1r2 n2r2
2        n1r1 n1r2 n4r2
3        n2r1 n1r3 n3r3
4        n4r1 n2r3 n3r3
5        n3r1 n3r2 n4r2

DataNodes (block size 64MB), blocks stored on each node:

         Rack1    Rack2    Rack3
node1    1 2      1 2      3
node2    3        1        4
node3    5        5        3 4
node4    4        2 5      -
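
To make the replication idea concrete, here is a minimal sketch that takes the block-to-location table from this slide and checks which files stay readable if an entire rack is lost. It reads a label such as `n1r2` as "node 1 in rack 2", which is an interpretation of the diagram; the helper names are hypothetical and this is not how the NameNode itself is implemented.

```python
# Block locations copied from the NameNode table on the slide.
# Assumption: "n1r2" means node 1 in rack 2.
BLOCK_LOCATIONS = {
    1: ["n1r1", "n1r2", "n2r2"],
    2: ["n1r1", "n1r2", "n4r2"],
    3: ["n2r1", "n1r3", "n3r3"],
    4: ["n4r1", "n2r3", "n3r3"],
    5: ["n3r1", "n3r2", "n4r2"],
}
FILES = {"File1": [1, 2, 3], "File2": [4, 5]}


def rack_of(location):
    """Extract the rack part of a location label, e.g. 'n1r2' -> 'r2'."""
    return location[location.index("r"):]


def surviving_replicas(block, failed_rack):
    """Replicas of a block that are not in the failed rack."""
    return [loc for loc in BLOCK_LOCATIONS[block] if rack_of(loc) != failed_rack]


def readable_files(failed_rack):
    """Files for which every block still has at least one replica."""
    return [
        name
        for name, blocks in FILES.items()
        if all(surviving_replicas(block, failed_rack) for block in blocks)
    ]


for rack in ("r1", "r2", "r3"):
    print(rack, readable_files(rack))
```

With the layout on the slide, every block keeps at least one replica whichever single rack fails, so both files remain readable in all three cases.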
What is MapReduce (MRv1)?
• Programming Framework
• Breaks processing into 2 phases:
– Map phase
– Reduce phase
• Master-Slave architecture
• Processes:
– JobTracker (Master)
– TaskTracker (Slave)

[Diagram: one JobTracker and five TaskTrackers]
MapReduce

[Diagram: a Job arrives at the JobTracker, which hands out three Tasks to the TaskTrackers running on node1 to node4 in Rack1 to Rack3. The nodes hold the same blocks as on the HDFS Architecture slide.]
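
The "move compute to storage" principle can be sketched as a toy scheduler that gives each block's task to a machine already holding a replica of that block. The block locations are taken from the HDFS Architecture slide; the greedy least-loaded rule and the function name are assumptions made for illustration, not the JobTracker's actual scheduling algorithm.

```python
from collections import Counter

# Block locations from the HDFS Architecture slide
# (assumption: "n1r2" identifies the machine node 1 in rack 2).
BLOCK_LOCATIONS = {
    1: ["n1r1", "n1r2", "n2r2"],
    2: ["n1r1", "n1r2", "n4r2"],
    3: ["n2r1", "n1r3", "n3r3"],
    4: ["n4r1", "n2r3", "n3r3"],
    5: ["n3r1", "n3r2", "n4r2"],
}


def assign_map_tasks(block_locations):
    """Assign one task per block to a machine that stores the block locally."""
    load = Counter()
    assignment = {}
    for block in sorted(block_locations):
        # Pick the replica holder with the fewest tasks so far;
        # ties go to the first location listed.
        target = min(block_locations[block], key=lambda loc: load[loc])
        assignment[block] = target
        load[target] += 1
    return assignment


for block, machine in assign_map_tasks(BLOCK_LOCATIONS).items():
    print(f"block {block} -> task on {machine}")
```

Every task ends up on a machine that already stores its input block, so no block has to cross the network before processing starts.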
MapReduce: The Mapper
• Is a function that performs the map phase
• Each mapper usually operates on a single HDFS
block
• Takes a key and value as input and can generate multiple keys and values as output
• <k1,v1> → list(<k2,v2>)
• The output of all mappers is then sorted by key
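
The mapper signature above can be illustrated with a word-count mapper written as a plain Python generator. This is a sketch of the idea only: the function name and the choice of the line offset as the input key are assumptions, and the code does not use the Hadoop API.

```python
import re


def word_count_mapper(key, value):
    """<k1, v1> -> list(<k2, v2>).

    key:   position of the line in the input (unused here)
    value: the text of the line
    Emits one (word, 1) pair per word found in the line.
    """
    for word in re.findall(r"[a-z]+(?:['-][a-z]+)*", value.lower()):
        yield word, 1


pairs = list(word_count_mapper(0, "I will arise and go now, and go to Innisfree,"))
print(pairs)
```

One input pair produces ten output pairs here, with `and` and `go` each emitted twice; adding them up is left to the reduce phase.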

MapReduce: The Reducer
• Is a function that performs the reduce phase
• Each reducer operates on a portion of the output
of all mappers
• Takes a key with a list of all values as input and
generates an aggregate of the values for each
key
• <k2,list(v2)> → list(<k3,v3>)
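
A matching sketch for the reduce side: a word-count reducer receives one key together with all the values collected for it and emits a single aggregate. As before, the function name is hypothetical and the code only mirrors the signature on the slide, not the Hadoop API.

```python
def word_count_reducer(key, values):
    """<k2, list(v2)> -> list(<k3, v3>).

    key:    a word
    values: every count emitted for that word by the mappers
    Emits a single (word, total) pair.
    """
    yield key, sum(values)


print(list(word_count_reducer("go", [1, 1])))  # [('go', 2)]
```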

MapReduce Data Flow

[Diagram: three input splits (Split 0, Split 1, Split 2) are read from HDFS, each by its own Map task. Each map output is sorted, then copied to the Reduce tasks and merged. Two Reduce tasks write Part 0 and Part 1 to the output in HDFS.]
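
The data flow in this diagram (map, sort, copy, merge, reduce) can be imitated in a single process. The sketch below is only a simulation under stated assumptions: the `partition` function, the CRC-based hash and the sample input are invented for illustration, and a real cluster runs these steps on different machines.

```python
import zlib
from itertools import groupby
from operator import itemgetter


def partition(key, num_reducers):
    """Decide which reducer receives a key (stable across runs)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def run_job(splits, mapper, reducer, num_reducers=2):
    # Map + sort: one map task per split, each output sorted by key.
    map_outputs = []
    for split in splits:
        pairs = [kv for offset, line in enumerate(split) for kv in mapper(offset, line)]
        map_outputs.append(sorted(pairs, key=itemgetter(0)))

    parts = []
    for r in range(num_reducers):
        # Copy: the reducer fetches its partition from every map output.
        fetched = [kv for out in map_outputs for kv in out
                   if partition(kv[0], num_reducers) == r]
        # Merge: sorted() stands in for merging the fetched, already sorted runs.
        merged = sorted(fetched, key=itemgetter(0))
        # Reduce: one call per key, with all values for that key.
        part = []
        for key, group in groupby(merged, key=itemgetter(0)):
            part.extend(reducer(key, [value for _, value in group]))
        parts.append(part)
    return parts  # parts[0] plays the role of Part 0, parts[1] of Part 1


def mapper(offset, line):
    for word in line.lower().split():
        yield word, 1


def reducer(word, counts):
    yield word, sum(counts)


splits = [["and go now", "go to innisfree"], ["and live alone"], ["i hear it", "and go"]]
parts = run_job(splits, mapper, reducer)
totals = dict(kv for part in parts for kv in part)
print(totals["and"], totals["go"])  # 3 3
```

Because all pairs with the same key land in the same partition, each word is counted by exactly one reducer, which is what lets the output parts be written independently.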
HDFS & MapReduce Example: Word Count
Original File
I will arise and go now, and go to
Innisfree,
And a small cabin build there, of
clay and wattles made:
Nine bean-rows will I have there, a
hive for the honey-bee;
And live alone in the bee-loud
glade.
And I shall have some peace there,
for peace comes dropping slow,
Dropping from the veils of the
morning to where the cricket sings;
There midnight's all a glimmer, and
noon a purple glow,
And evening full of the linnet's
wings.
I will arise and go now, for always
night and day
I hear lake water lapping with low
sounds by the shore;
While I stand on the roadway, or on
the pavements grey,
I hear it in the deep heart's core.

File on HDFS (each part below is read by its own Mapper)

I will arise and go now, and go to
Innisfree,
And a small cabin build there, of
clay and wattles made:
Nine bean-rows will I have there, a
hive for the honey-bee;
And live alone in the bee-loud
glade.

Map

And I shall have some peace
there, for peace comes dropping
slow,
Dropping from the veils of the
morning to where the cricket sings;
There midnight's all a glimmer, and
noon a purple glow,
And evening full of the linnet's
wings.

Map

I will arise and go now, for always
night and day
I hear lake water lapping with low
sounds by the shore;
While I stand on the roadway, or on
the pavements grey,
I hear it in the deep heart's core.

Map

[Diagram: the three Map outputs feed three Reduce tasks, which produce the Output]
Demo: Word Count on Hadoop

Querying Data in Hadoop
Apache Hive
• Developed at Facebook
• Data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis
• Provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL

Apache Pig
• Developed at Yahoo!
• High-level platform for creating MapReduce programs used with Hadoop
• Has a language called Pig Latin
• Can be extended with UDFs written in Java, Python and other languages
Hadoop Ecosystem
• Avro: a data serialization system
• Flume: a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data
• HBase: a scalable, distributed database that supports structured data storage for large tables
• Mahout: a scalable machine learning and data mining library
• Oozie: a workflow scheduler system to manage Apache Hadoop jobs
• Sqoop: a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases
• ZooKeeper: a high-performance coordination service for distributed applications
Yet Another Resource Negotiator (YARN)
– Also known as MapReduce v2 (MRv2)
– New framework that facilitates writing arbitrary distributed processing frameworks and applications
– Splits up the two major functionalities of the JobTracker, resource management and job scheduling/monitoring, into separate daemons
– Can run applications that do not follow the MapReduce model
Learn Hadoop
• Download the Cloudera QuickStart VM
– http://bit.ly/1b00iZj
– To make it easy for you to get started with Hadoop
– Cloudera Distribution including Apache Hadoop (CDH)
– With Cloudera Manager, Cloudera Impala, and Cloudera Search, this virtual machine includes everything you need
• Formal training as Developer, Administrator, Analyst and others
• Free courseware on Udacity: Introduction to Hadoop and MapReduce
– https://www.udacity.com/course/ud617
Other Hadoop Resources
Apache Project Websites
Hadoop: http://hadoop.apache.org/
Hive: http://hive.apache.org/
Pig: http://pig.apache.org/
Sqoop: http://sqoop.apache.org/
Flume: http://flume.apache.org/

Original GFS and MapReduce Papers
GFS: http://bit.ly/VZk9VL
MapReduce: http://bit.ly/8VDMHO
Community
A community of Hadoop professionals
and users in the region

meetup.com/Hadoop-User-Group-UAE/

Q&A

fadi@axeldata.com
www.axeldata.com
Hadoop and the Hadoop elephant logo
are trademarks of the Apache Software
Foundation. All other trademarks are
the property of their respective owners.

Editor's Notes

  1. In a nutshell, Hadoop grew out of research at Google that was adopted by the open source community and supported by heavyweights such as Yahoo!, Facebook and others. It had 6 years to mature.
  2. No, it’s not Charles Darwin. Hadoop was named after the creator’s son’s toy elephant.
  3. So what is Hadoop?
  4. Brief description of the operation of HDFS. There are 3 main components (daemons) in HDFS: NameNode, DataNode, and Secondary NameNode.
  5. There are 2 main components (daemons) in MapReduce: JobTracker and TaskTracker.
  6. MapReduce is composed of Map tasks and Reduce tasks. Tasks within the same phase run in parallel and do not depend on each other’s output.
  7. The major resources to start learning more about Hadoop. Also recommended is reading the research papers from Google that spurred the whole Hadoop ecosystem (by Sanjay Ghemawat and Jeff Dean).
  8. Q&A with the famous Hadoop elephant mascot