INTRODUCTION TO HADOOP 
Brest, 29 October 2014
David Morin - @davAtBzh
Me 
David Morin 
@davAtBzh 
Solutions Engineer at
3 
What is Hadoop?
4
An elephant: this one?
5
No, this one!
6
The father
7
Let's go!
8
Let's go!
9 
Timeline
10 
Hadoop fundamentals
● Distributed file system for large volumes of data
● Runs on commodity servers (limits costs)
● Scalable and fault-tolerant
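The fundamentals above can be illustrated with a toy model: a file is cut into fixed-size blocks, and each block is copied to several servers so that losing one server does not lose data. This is only a sketch under stated assumptions. The function names, the tiny block size and the round-robin placement are invented for illustration; real HDFS uses much larger blocks and rack-aware placement.

```python
# Toy model of HDFS-style storage (illustrative names, not the HDFS API).

BLOCK_SIZE = 128   # bytes here, to keep the example small (assumption)
REPLICATION = 3    # copies kept of each block (assumption for this sketch)


def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Cut the file content into consecutive blocks of at most block_size bytes."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin.

    Round-robin is a simplification chosen for this sketch.
    """
    if replication > len(nodes):
        raise ValueError("not enough nodes for the requested replication")
    return {
        block_id: [nodes[(block_id + r) % len(nodes)] for r in range(replication)]
        for block_id in range(num_blocks)
    }


def readable_after_failure(placement, failed_node):
    """The file stays readable if every block keeps at least one live replica."""
    return all(
        any(node != failed_node for node in replicas)
        for replicas in placement.values()
    )


nodes = ["node1", "node2", "node3", "node4"]
blocks = split_into_blocks(b"x" * 300)
placement = place_replicas(len(blocks), nodes)
print(len(blocks), placement)
print(readable_after_failure(placement, "node2"))
```

Adding a server to `nodes` spreads the blocks further without changing the code, which is the scalability point in miniature.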
11 
Hadoop Distributed FileSystem
12 
Hadoop Distributed FileSystem
13 
MapReduce
14
MapReduce: word count
Map, then Reduce
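A minimal single-process sketch of the word count, assuming the usual three steps: map emits (word, 1) pairs, the framework groups them by key, and reduce sums each group. The function names are hypothetical, and nothing here is distributed; on a cluster the map and reduce calls would run on different machines.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word, 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, between the two phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(word, counts):
    """Reduce: sum the counts collected for one word."""
    return word, sum(counts)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over a list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(w, c) for w, c in shuffle(pairs).items())


result = word_count(["the quick brown fox", "the lazy dog", "the fox"])
print(result)
```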
15 
Data Locality Optimization
16 
MapReduce in action
17 
Hadoop v1: drawbacks
– One NameNode: single point of failure (SPOF)
– One JobTracker: SPOF and not scalable (limit on the number of nodes)
– MapReduce only: the platform needs to be opened to non-MapReduce applications
18 
Hadoop v2
Improvements:
– HDFS v2: NameNode high availability (standby NameNode)
– YARN (Yet Another Resource Negotiator)
● JobTracker => ResourceManager + ApplicationMaster (one per application)
● Can be used by non-MapReduce applications
– MapReduce v2: runs on YARN
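The split of the JobTracker into two roles can be sketched as a toy model: one cluster-wide component only hands out containers, while each application has its own master that asks for them. The class and method names below echo the YARN concepts but are invented for this sketch and are not the YARN API.

```python
class ResourceManager:
    """Toy cluster-wide scheduler: hands out containers, knows nothing about jobs."""

    def __init__(self, total_containers):
        self.free = total_containers

    def allocate(self, requested):
        """Grant as many containers as are free, up to the request."""
        granted = min(requested, self.free)
        self.free -= granted
        return granted

    def release(self, count):
        """Return containers to the free pool."""
        self.free += count


class ApplicationMaster:
    """One per application: negotiates containers and tracks what it holds."""

    def __init__(self, name, rm):
        self.name = name
        self.rm = rm
        self.containers = 0

    def request(self, count):
        """Ask the ResourceManager for containers; keep whatever is granted."""
        self.containers += self.rm.allocate(count)
        return self.containers

    def finish(self):
        """Give every held container back when the application ends."""
        self.rm.release(self.containers)
        self.containers = 0


rm = ResourceManager(total_containers=10)
mr_job = ApplicationMaster("mapreduce-wordcount", rm)
other = ApplicationMaster("non-mapreduce-app", rm)
mr_job.request(6)
other.request(6)
print(mr_job.containers, other.containers, rm.free)
```

Because the scheduler never looks at what runs inside a container, a non-MapReduce application can share the cluster in the same way as a MapReduce job.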
19 
Hadoop v2
20 
YARN
21 
YARN
22 
YARN
23 
YARN
24 
YARN
25 
YARN
26
27 
Pig
● With Pig, writing MapReduce jobs becomes easy
● Dataflow model: data is the key!
● Language: Pig Latin
● No limits: User Defined Functions (UDFs)
http://pig.apache.org/docs/r0.13.0/
28 
Pig
● Pig word count
lines = LOAD '/user/XXX/file.txt' AS (line:chararray); 
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word; 
grouped = GROUP words BY word; 
wordcount = FOREACH grouped GENERATE group, COUNT(words); 
DUMP wordcount;
29 
Pig
import …
public class WordCount2 {
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    static enum CountersEnum { INPUT_WORDS }
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    private boolean caseSensitive;
    private Set<String> patternsToSkip = new HashSet<String>();
    private Configuration conf;
    private BufferedReader fis;
    ...
=> 130 lines of code!
30 
Hive
● SQL-like: HQL
● UDFs
● Hive word count
CREATE TABLE docs (line STRING);
LOAD DATA INPATH 'text' OVERWRITE INTO TABLE docs;
CREATE TABLE word_counts AS
SELECT word, count(1) AS count FROM
(SELECT explode(split(line, '\\s')) AS word FROM docs) w
GROUP BY word
ORDER BY word;
31 
ZooKeeper
● Distributed coordination service 
● Dynamic configuration 
● Distributed locking
32 
Batch, but not only...
33 
??

Introduction to Hadoop - FinistJug
