CloverETL versus Hadoop
in light of transforming very large data sets in parallel
a deathmatch, or happy together?
= 
similarities 
• Both technologies use data parallelism: input data are split into
“partitions” which are then processed in parallel.
• Each partition is processed the same way (the same algorithm is used).
• At the end of the processing, the results of the individually processed
partitions need to be merged to produce the final result.
[Diagram: input data is split into Part 1, Part 2 and Part 3; each partition is processed in parallel; the partial results are merged into the final result.]
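The split-process-merge pattern above can be sketched in a few lines of plain Java (this is an illustration of the shared idea, not Clover or Hadoop API code; the partition count and the summing "algorithm" are arbitrary stand-ins):

```java
import java.util.*;
import java.util.stream.*;

// Minimal sketch of data parallelism: split input into partitions,
// process each partition with the same algorithm in parallel,
// then merge the partial results into the final result.
public class SplitProcessMerge {

    // Process one partition: here, sum its values (stands in for any algorithm).
    static long processPartition(List<Integer> partition) {
        return partition.stream().mapToLong(Integer::longValue).sum();
    }

    static long run(List<Integer> data, int partitions) {
        // split: deal records round-robin into partitions
        List<List<Integer>> parts = new ArrayList<>();
        for (int i = 0; i < partitions; i++) parts.add(new ArrayList<>());
        for (int i = 0; i < data.size(); i++) parts.get(i % partitions).add(data.get(i));

        // process each partition in parallel, then merge the partial sums
        return parts.parallelStream()
                    .mapToLong(SplitProcessMerge::processPartition)
                    .sum();
    }

    public static void main(String[] args) {
        List<Integer> data = IntStream.rangeClosed(1, 100).boxed().collect(Collectors.toList());
        System.out.println(run(data, 4)); // 5050
    }
}
```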
✕ 
differences 
• Hadoop uses the Map->Reduce pattern, originally developed by Google for
Web indexing and searching. Processing is divided into a Map phase
(filtering & sorting) and a Reduce phase (summary operation).
The Hadoop approach expects that the initially large volume of data is reduced to
a much smaller result -> e.g. a search for pages with a certain keyword.
• CloverETL is based on a pipeline-parallelism pattern where individual
specialized components perform various operations on a flow of data
records - parsing, filtering, joining, aggregating, de-duping...
Clover is optimized for large volumes of data flowing through it and
being transformed on-the-fly.
= 
similarities 
Both technologies use partitioned & distributed storage of data (filesystem).
• Hadoop uses HDFS (Hadoop Distributed Filesystem), with individual
DataNodes residing on the physical nodes of the Hadoop/HDFS cluster.
• CloverETL uses a Partitioned Sandbox, where data are spread over the
physical nodes of a CloverETL Cluster. Each node is also a data processing
node, typically (though not exclusively) processing locally stored data. One node
can be part of more than one Partitioned Sandbox.
✕ 
differences 
HDFS operates at the byte level (data are read & written as streams of
bytes). It includes data loss prevention through data redundancy.
HDFS is based on the “write-once, read-many-times” pattern.
CloverETL’s Partitioned Sandbox operates at the record level (data are
read & written as complete records). Data loss prevention is left to the
underlying file system storage. Clover’s Partitioned Sandbox supports the
very high I/O throughput needed for massive data transformations.
CloverETL ✕ Hadoop HDFS
HDFS stores, splits and distributes data at the byte level:
[Diagram: the byte stream 4 5 6 , N Y , J O H N \n is split at arbitrary byte positions, even inside a record.]
CloverETL stores, splits and distributes data at the record level:
[Diagram: the records 456,NY,JOHN\n 457,VA,BILL\n 458,MA,SUE\n are split only at record boundaries.]
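The contrast can be made concrete with a toy illustration (plain Java, not HDFS or Clover code): cutting the same data at a fixed byte boundary severs a record in half, while distributing whole records never does.

```java
import java.util.*;

// Toy contrast between byte-level and record-level splitting.
public class SplitLevels {

    // HDFS-style: cut the stream at a fixed byte offset, ignoring record boundaries.
    static String[] splitAtBytes(String data, int blockSize) {
        return new String[] { data.substring(0, blockSize), data.substring(blockSize) };
    }

    // Clover-style: distribute whole newline-terminated records round-robin.
    static List<List<String>> splitAtRecords(String data, int partitions) {
        List<List<String>> parts = new ArrayList<>();
        for (int i = 0; i < partitions; i++) parts.add(new ArrayList<>());
        String[] records = data.split("\n");
        for (int i = 0; i < records.length; i++) parts.get(i % partitions).add(records[i]);
        return parts;
    }

    public static void main(String[] args) {
        String data = "456,NY,JOHN\n457,VA,BILL\n458,MA,SUE\n";
        // A 16-byte "block" ends in the middle of the second record:
        System.out.println(Arrays.toString(splitAtBytes(data, 16)));
        // Record-level splitting keeps every record intact:
        System.out.println(splitAtRecords(data, 2));
    }
}
```

Here the first "block" ends as `456,NY,JOHN\n457,` - the record for BILL is cut in two, which is exactly the cross-node situation the next slides describe.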
Hadoop HDFS
organises files into large blocks of bytes (64 MB or more), which are then
physically stored on different nodes of the Hadoop cluster.
[Diagram: an HDFS data file consists of data records grouped into 64 MB data blocks; a record may be split across the boundary between Block 1 and Block 2. Node 1 stores blocks 1, 3, 5, 7, …; Node 2 stores blocks 2, 4, 6, 8, …]
Hadoop HDFS
partitions, distributes and stores data at the byte level.
[Diagram: the record 4 5 6 , N Y , J O H N \n is split mid-record; the 1st part is stored on Node 1, the 2nd part on Node 2.]
☛ One data record in the source data can end up being split between two different nodes.
☛ Writing or reading such a record requires accessing two different nodes via the network.
☛ HDFS presents files as a single continuous stream of data (similar to any local filesystem).
Hadoop HDFS
☛ Parallel writing to one HDFS file is impossible.
Two processes cannot write to one data block at the same time. Two processes
trying to write in parallel to one HDFS file (two different blocks) will face
the block-boundary issue - with a potential collision: when the 1st process
reaches the n-th record, there is not enough space left in Block 1, because
the 2nd process has already started filling Block 2 - so where should it write?
[Diagram: the output file grows as 64 MB data blocks are added; the 1st process, executed on Node 1, and the 2nd process, executed on Node 2, each write to Nodes 1 & 2; the 2nd process starts writing to Block 2 while the 1st is still filling Block 1.]
CloverETL Partitioned Sandbox
partitions, distributes and stores data at the record level.
[Diagram: the records 456,NY,JOHN\n 457,VA,BILL\n 458,MA,SUE\n are split at record boundaries and each gets stored whole on Node 1 or Node 2.]
☛ Nodes contain complete records.
☛ Writing or reading records means accessing locally stored data only.
☛ Partitioned data are located in multiple files on the individual nodes. Clover offers a
unified user view over those files. During processing, partition files are accessed individually.
CloverETL Partitioned Sandbox
☛ Parallel writing to a Partitioned Sandbox is easy.
Two processes write to two independent partitions of a Clover sandbox.
Each process writes to the partition which is local to the node where it runs - no
collisions.
[Diagram: the 1st process, executed on Node 1, writes to Node 1 only, appending 456,NY,JOHN\n 458,VA,WILLIAM\n 460,MA,MAG\n to Partition 1; the 2nd process, executed on Node 2, writes to Node 2 only, appending 457,NJ,ANN\n 459,IL,MEGAN\n 461,WA,RYAN\n to Partition 2.]
Fault resiliency
☛ HDFS implements fault tolerance.
HDFS replicates individual data blocks across cluster nodes, thus ensuring fault
tolerance.
☛ Clover delegates fault resiliency to the local file system.
Clover provides a unified view of data stored locally on the nodes. It is the nodes’
setup (OS, filesystem) that is responsible for fault resiliency.
The slide excerpts the canonical Hadoop WordCount job; reconstructed here in full from the visible fragments:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: tokenize each input line and emit (word, 1) pairs
  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sum the counts emitted for each word
  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "wordcount");
    job.setJarByClass(WordCount.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);
  }
}
How Hadoop processes data
[Diagram: the input data file is stored as Block 1, Block 2 and Block 3; processes 1-3 run map() to map data to key->value pairs; temporary data are sorted and (partially) merged; processes 4 and 5 run reduce(), producing output.part1 and output.part2.]
• Hadoop concentrates transformation logic into 2 stages - map & reduce.
• Complex logic must be split into multiple map & reduce phases, with temporary data being stored
in between.
• Intense network communication happens when the reducers (one or more) merge data from multiple
mappers (mappers and reducers may run on different nodes).
• If multiple reducers are used (to accelerate processing), the resulting data are located in multiple
output files (which need to be merged again to produce a single final result).
How CloverETL processes data
[Diagram: the input data file is split into Partition 1 (456,NY,JOHN\n 458,VA,WILLIAM\n …) and Partition 2 (457,NJ,ANN\n 459,IL,MEGAN\n …); each partition flows through transformation logic with pipeline-parallelism; the results are combined into output.full.]
• Clover processes data via a set of transformation components running in pipeline-parallelism mode.
• Even complex transformations can be performed without temporarily storing data.
• Individual processing nodes obey data locality - each cluster node processes only its locally stored
data partition.
• Clover allows partitioned output data to be automatically presented as one single result.
Wikipedia > Pipeline parallelism - when multiple components run on the same data set, i.e. a record is
processed in one component while a previous record is being processed in another component.
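Pipeline parallelism can be sketched in plain Java (this is a generic illustration, not the CloverETL API): two specialized stages run concurrently, connected by a queue, so record N is read in one stage while record N-1 is already being transformed in the next - nothing is stored in between.

```java
import java.util.concurrent.*;
import java.util.*;

// Minimal two-stage pipeline: a "reader" component and a "transformer"
// component run as concurrent threads connected by a blocking queue.
public class PipelineSketch {
    private static final String EOF = "__EOF__"; // poison pill ending the stream

    static List<String> run(List<String> records) throws Exception {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        List<String> out = Collections.synchronizedList(new ArrayList<>());

        // Stage 1: push records into the pipe, then the poison pill
        Thread reader = new Thread(() -> {
            try {
                for (String r : records) queue.put(r);
                queue.put(EOF);
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        // Stage 2: transform (uppercase) whatever arrives, until EOF
        Thread transformer = new Thread(() -> {
            try {
                for (String r = queue.take(); !r.equals(EOF); r = queue.take())
                    out.add(r.toUpperCase());
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        reader.start(); transformer.start();
        reader.join(); transformer.join();
        return out;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run(Arrays.asList("john", "bill", "sue")));
    }
}
```

A real Clover graph chains many such components (parse, filter, join, aggregate), but the concurrency structure is the same: stages overlap in time on different records.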
✕
differences
☛ HDFS optimizes for storage
HDFS optimizes for storing vast amounts of data across hundreds of cluster nodes.
It follows the “write-once, read-many-times” pattern.
☛ Clover optimizes for I/O throughput
Clover optimizes for very fast parallel writing and reading of data on dozens of
cluster nodes. This lends itself nicely to read-process-write (aka ETL) workloads.
Which approach is better?
It depends…
CloverETL is better for typical data transformation/integration tasks where
all or most input data records get transformed and written out.
The Clover Partitioned Sandbox expects short-term storage of data.
HDFS is better when storing vast amounts of data which are written by a
single process and potentially read by several processes.
HDFS expects long-term storage of data.
Which one?
Wouldn’t it be nice to have the best of both worlds?
It’s possible!
• Clover is able to read & write data from/to HDFS
• Clover can read and process HDFS-stored data in parallel
• Clover can write the results of processing to its Partitioned Sandbox in
parallel, or store them back to HDFS as a serial file
• Data processing tasks can be visually designed in CloverETL
…thus taking advantage of both worlds.
CloverETL parallel reading from HDFS
[Diagram: multiple instances of the Parallel Reader access HDFS to read the input data file (Blocks 1-3) in parallel; data processing is performed by standard CloverETL components, with standard CloverETL debugging available; the final result is written as a single serial file to the local filesystem.]
In this scenario:
• HDFS serves as the storage system for raw source data
• CloverETL is the data processing engine
+ 
Benchmarks
The (simple) scenario
• Apache log stored on HDFS
• ~274 million web log records
• Extract year, month and IP address
• Aggregate the data to get the number of unique visitors per month
• Running on a cluster of 4 hardware nodes, using:
• Hadoop only
• Hadoop + Hive
• CloverETL only
• CloverETL + Hadoop/HDFS
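The benchmark's transformation logic is simple enough to sketch in plain Java (the actual runs used Hadoop, Hive and CloverETL; the log format and month-key format here are illustrative assumptions):

```java
import java.util.*;
import java.util.regex.*;

// Sketch of the benchmark scenario: extract year, month and IP address
// from Apache common-log lines and count unique visitor IPs per month.
public class UniqueVisitors {
    // Matches the leading IP and the [dd/Mon/yyyy: timestamp of a common log line.
    private static final Pattern LOG =
        Pattern.compile("^(\\S+) \\S+ \\S+ \\[\\d+/(\\w+)/(\\d{4}):");

    static Map<String, Integer> uniqueVisitorsPerMonth(List<String> lines) {
        Map<String, Set<String>> ipsByMonth = new TreeMap<>();
        for (String line : lines) {
            Matcher m = LOG.matcher(line);
            if (!m.find()) continue; // skip malformed records
            String month = m.group(3) + "-" + m.group(2); // e.g. "2012-Oct"
            ipsByMonth.computeIfAbsent(month, k -> new HashSet<>()).add(m.group(1));
        }
        Map<String, Integer> counts = new TreeMap<>();
        ipsByMonth.forEach((month, ips) -> counts.put(month, ips.size()));
        return counts;
    }

    public static void main(String[] args) {
        List<String> log = Arrays.asList(
            "10.0.0.1 - - [01/Oct/2012:10:00:00 +0000] \"GET / HTTP/1.1\" 200 512",
            "10.0.0.2 - - [02/Oct/2012:11:00:00 +0000] \"GET /a HTTP/1.1\" 200 128",
            "10.0.0.1 - - [03/Oct/2012:12:00:00 +0000] \"GET /b HTTP/1.1\" 200 256",
            "10.0.0.3 - - [01/Nov/2012:09:00:00 +0000] \"GET / HTTP/1.1\" 200 512");
        System.out.println(uniqueVisitorsPerMonth(log));
    }
}
```

On the sample above, October has two unique visitors (10.0.0.1 appears twice) and November has one - the same extract-then-aggregate shape that all four benchmarked setups implement.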
The (simple) scenario results
Setup                                                            Time (sec)
Hadoop (8 reducers)                                              329
Hadoop Hive query                                                127
CloverETL only (Partitioned Sandbox)                             59
CloverETL + Hadoop/HDFS (segmented parallel reading from HDFS)   72
+ 
synergy 
CloverETL brings
• fast parallel processing
• visual design & debugging
• support for many formats and communication protocols
• process automation & monitoring
Hadoop/HDFS brings
• low-cost storage of big data
• fault resiliency through controllable data replication
“Happy Together” - a song by The Turtles
+ 
synergy 
For more information on 
• CloverETL Cluster architecture: 
http://www.cloveretl.com/products/server/cluster 
http://www.slideshare.net/cloveretl/cloveretl-cluster 
• CloverETL in general: 
http://www.cloveretl.com
Contenu connexe

Tendances

Distributed Computing with Apache Hadoop: Technology Overview
Distributed Computing with Apache Hadoop: Technology OverviewDistributed Computing with Apache Hadoop: Technology Overview
Distributed Computing with Apache Hadoop: Technology OverviewKonstantin V. Shvachko
 
Apache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce OverviewApache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce OverviewNisanth Simon
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File Systemelliando dias
 
Native erasure coding support inside hdfs presentation
Native erasure coding support inside hdfs presentationNative erasure coding support inside hdfs presentation
Native erasure coding support inside hdfs presentationlin bao
 
Less is More: 2X Storage Efficiency with HDFS Erasure Coding
Less is More: 2X Storage Efficiency with HDFS Erasure CodingLess is More: 2X Storage Efficiency with HDFS Erasure Coding
Less is More: 2X Storage Efficiency with HDFS Erasure CodingZhe Zhang
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File SystemRutvik Bapat
 
Introduction to HDFS and MapReduce
Introduction to HDFS and MapReduceIntroduction to HDFS and MapReduce
Introduction to HDFS and MapReduceUday Vakalapudi
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File SystemVaibhav Jain
 
HDFS for Geographically Distributed File System
HDFS for Geographically Distributed File SystemHDFS for Geographically Distributed File System
HDFS for Geographically Distributed File SystemKonstantin V. Shvachko
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File SystemAnand Kulkarni
 
Resilient Distributed Datasets
Resilient Distributed DatasetsResilient Distributed Datasets
Resilient Distributed DatasetsAlessandro Menabò
 
Coordinating Metadata Replication: Survival Strategy for Distributed Systems
Coordinating Metadata Replication: Survival Strategy for Distributed SystemsCoordinating Metadata Replication: Survival Strategy for Distributed Systems
Coordinating Metadata Replication: Survival Strategy for Distributed SystemsKonstantin V. Shvachko
 
Setting up a big data platform at kelkoo
Setting up a big data platform at kelkooSetting up a big data platform at kelkoo
Setting up a big data platform at kelkooFabrice dos Santos
 

Tendances (20)

Distributed Computing with Apache Hadoop: Technology Overview
Distributed Computing with Apache Hadoop: Technology OverviewDistributed Computing with Apache Hadoop: Technology Overview
Distributed Computing with Apache Hadoop: Technology Overview
 
Apache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce OverviewApache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce Overview
 
HDFS Design Principles
HDFS Design PrinciplesHDFS Design Principles
HDFS Design Principles
 
Hdfs architecture
Hdfs architectureHdfs architecture
Hdfs architecture
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File System
 
Native erasure coding support inside hdfs presentation
Native erasure coding support inside hdfs presentationNative erasure coding support inside hdfs presentation
Native erasure coding support inside hdfs presentation
 
Less is More: 2X Storage Efficiency with HDFS Erasure Coding
Less is More: 2X Storage Efficiency with HDFS Erasure CodingLess is More: 2X Storage Efficiency with HDFS Erasure Coding
Less is More: 2X Storage Efficiency with HDFS Erasure Coding
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File System
 
Introduction to HDFS and MapReduce
Introduction to HDFS and MapReduceIntroduction to HDFS and MapReduce
Introduction to HDFS and MapReduce
 
getFamiliarWithHadoop
getFamiliarWithHadoopgetFamiliarWithHadoop
getFamiliarWithHadoop
 
Hadoop HDFS Concepts
Hadoop HDFS ConceptsHadoop HDFS Concepts
Hadoop HDFS Concepts
 
Lecture 2 part 1
Lecture 2 part 1Lecture 2 part 1
Lecture 2 part 1
 
Understanding Hadoop
Understanding HadoopUnderstanding Hadoop
Understanding Hadoop
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File System
 
HDFS for Geographically Distributed File System
HDFS for Geographically Distributed File SystemHDFS for Geographically Distributed File System
HDFS for Geographically Distributed File System
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File System
 
Resilient Distributed Datasets
Resilient Distributed DatasetsResilient Distributed Datasets
Resilient Distributed Datasets
 
Hadoop - Introduction to mapreduce
Hadoop -  Introduction to mapreduceHadoop -  Introduction to mapreduce
Hadoop - Introduction to mapreduce
 
Coordinating Metadata Replication: Survival Strategy for Distributed Systems
Coordinating Metadata Replication: Survival Strategy for Distributed SystemsCoordinating Metadata Replication: Survival Strategy for Distributed Systems
Coordinating Metadata Replication: Survival Strategy for Distributed Systems
 
Setting up a big data platform at kelkoo
Setting up a big data platform at kelkooSetting up a big data platform at kelkoo
Setting up a big data platform at kelkoo
 

Similaire à CloverETL + Hadoop

Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...
Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...
Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...Simplilearn
 
Big Data Reverse Knowledge Transfer.pptx
Big Data Reverse Knowledge Transfer.pptxBig Data Reverse Knowledge Transfer.pptx
Big Data Reverse Knowledge Transfer.pptxssuser8c3ea7
 
Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Uwe Printz
 
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)Hari Shankar Sreekumar
 
Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Uwe Printz
 
Hadoop Interview Questions and Answers
Hadoop Interview Questions and AnswersHadoop Interview Questions and Answers
Hadoop Interview Questions and AnswersMindsMapped Consulting
 
Unit-1 Introduction to Big Data.pptx
Unit-1 Introduction to Big Data.pptxUnit-1 Introduction to Big Data.pptx
Unit-1 Introduction to Big Data.pptxAnkitChauhan817826
 
Bigdata and Hadoop
 Bigdata and Hadoop Bigdata and Hadoop
Bigdata and HadoopGirish L
 
Scalable Storage for Massive Volume Data Systems
Scalable Storage for Massive Volume Data SystemsScalable Storage for Massive Volume Data Systems
Scalable Storage for Massive Volume Data SystemsLars Nielsen
 
Apache Hadoop Big Data Technology
Apache Hadoop Big Data TechnologyApache Hadoop Big Data Technology
Apache Hadoop Big Data TechnologyJay Nagar
 

Similaire à CloverETL + Hadoop (20)

HADOOP.pptx
HADOOP.pptxHADOOP.pptx
HADOOP.pptx
 
Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...
Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...
Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...
 
Big Data Reverse Knowledge Transfer.pptx
Big Data Reverse Knowledge Transfer.pptxBig Data Reverse Knowledge Transfer.pptx
Big Data Reverse Knowledge Transfer.pptx
 
Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?
 
Hadoop Architecture
Hadoop ArchitectureHadoop Architecture
Hadoop Architecture
 
Hadoop data management
Hadoop data managementHadoop data management
Hadoop data management
 
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
 
Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?
 
Introduction to HDFS
Introduction to HDFSIntroduction to HDFS
Introduction to HDFS
 
Hdfs
HdfsHdfs
Hdfs
 
module 2.pptx
module 2.pptxmodule 2.pptx
module 2.pptx
 
HADOOP
HADOOPHADOOP
HADOOP
 
Hadoop Interview Questions and Answers
Hadoop Interview Questions and AnswersHadoop Interview Questions and Answers
Hadoop Interview Questions and Answers
 
Hadoop and HDFS
Hadoop and HDFSHadoop and HDFS
Hadoop and HDFS
 
Unit-1 Introduction to Big Data.pptx
Unit-1 Introduction to Big Data.pptxUnit-1 Introduction to Big Data.pptx
Unit-1 Introduction to Big Data.pptx
 
Bigdata and Hadoop
 Bigdata and Hadoop Bigdata and Hadoop
Bigdata and Hadoop
 
Hadoop overview.pdf
Hadoop overview.pdfHadoop overview.pdf
Hadoop overview.pdf
 
Unit 1
Unit 1Unit 1
Unit 1
 
Scalable Storage for Massive Volume Data Systems
Scalable Storage for Massive Volume Data SystemsScalable Storage for Massive Volume Data Systems
Scalable Storage for Massive Volume Data Systems
 
Apache Hadoop Big Data Technology
Apache Hadoop Big Data TechnologyApache Hadoop Big Data Technology
Apache Hadoop Big Data Technology
 

Dernier

Innovate and Collaborate- Harnessing the Power of Open Source Software.pdf
Innovate and Collaborate- Harnessing the Power of Open Source Software.pdfInnovate and Collaborate- Harnessing the Power of Open Source Software.pdf
Innovate and Collaborate- Harnessing the Power of Open Source Software.pdfYashikaSharma391629
 
How to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationHow to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationBradBedford3
 
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Matt Ray
 
Post Quantum Cryptography – The Impact on Identity
Post Quantum Cryptography – The Impact on IdentityPost Quantum Cryptography – The Impact on Identity
Post Quantum Cryptography – The Impact on Identityteam-WIBU
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesPhilip Schwarz
 
Sending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdfSending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdf31events.com
 
Machine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their EngineeringMachine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their EngineeringHironori Washizaki
 
20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...
20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...
20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...Akihiro Suda
 
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...OnePlan Solutions
 
英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作qr0udbr0
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureDinusha Kumarasiri
 
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company OdishaBalasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odishasmiwainfosol
 
Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)Hr365.us smith
 
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024StefanoLambiase
 
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfGOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfAlina Yurenko
 
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Cizo Technology Services
 
UI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptxUI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptxAndreas Kunz
 
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving CarsSensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving CarsChristian Birchler
 
What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...Technogeeks
 
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Angel Borroy López
 

Dernier (20)

Innovate and Collaborate- Harnessing the Power of Open Source Software.pdf
Innovate and Collaborate- Harnessing the Power of Open Source Software.pdfInnovate and Collaborate- Harnessing the Power of Open Source Software.pdf
Innovate and Collaborate- Harnessing the Power of Open Source Software.pdf
 
How to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationHow to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion Application
 
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
 
Post Quantum Cryptography – The Impact on Identity
Post Quantum Cryptography – The Impact on IdentityPost Quantum Cryptography – The Impact on Identity
Post Quantum Cryptography – The Impact on Identity
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a series
 
Sending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdfSending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdf
 
Machine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their EngineeringMachine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their Engineering
 
20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...
20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...
20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...
 
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
 
英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with Azure
 
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company OdishaBalasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
 
Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)
 
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
 
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfGOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
 
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
 
UI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptxUI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptx
 
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving CarsSensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
 
What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...
 
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
 

CloverETL + Hadoop

  • 1. ✕ CloverETL versus Hadoop in light of transforming very large data sets in parallel a deathmatch or happy together ?
  • 2. = similarities • Both technologies use data parallelism - input data are split into “partitions” which are then processed in parallel. • Each partition is processed the same way (same algorithm used). • At the end of the processing, results of individually processed partitions need to be merged to process final result. Part 1 Part 2 Part 3 Final result data split data merge data process
  • 3. ✕ differences • Hadoop uses Map->Reduce pattern originally developed by Google for Web indexing and searching. Processing is divided into Map phase (filtering&sorting) and Reduction phase (summary operation). Hadoop approach expects that initial large volume of data is reduced to much smaller result -> e.g. search for pages with certain keyword. • CloverETL is based on pipeline-parallelism pattern where individual specialized components perform various operations on flow of data records - parsing, filtering, joining, aggregating,de-duping... Clover is optimized for large volumes of data flowing through it and being transformed on-the-fly.
  • 4. = similarities Both technologies use partitioned&distributed storage of data (filesystem). • Hadoop uses HDFS (Hadoop Distributed Filesystem) with individual DataNodes residing on physical nodes of Hadoop/HDFS cluster. • CloverETL uses Partitioned Sandbox where data are spread over physical nodes of CloverETL Cluster. Each node is also a data processing node typically processing locally stored data (not exclusively). One node can be part of more than one Partitioned Sandbox.
  • 5. ✕ differences HDFS operates at the byte level (data are read & written as streams of bytes). It includes data-loss prevention through data redundancy. HDFS is based on a “write-once, read-many-times” pattern. CloverETL’s Partitioned Sandbox operates at the record level (data are read & written as complete records). Data-loss prevention is left to the underlying file system storage. Clover’s Partitioned Sandbox supports the very high I/O throughput needed for massive data transformations.
  • 6. CloverETL ✕ Hadoop HDFS (Diagram: HDFS stores, splits and distributes data at the byte level - the record “456,NY,JOHN\n” can be cut mid-record by a split. CloverETL stores, splits and distributes data at the record level - whole records such as “456,NY,JOHN\n”, “457,VA,BILL\n” and “458,MA,SUE\n” stay intact across a split.)
  • 7. Hadoop HDFS organises files into large blocks of bytes (64 MB or more) which are then physically stored on different nodes of the Hadoop cluster. (Diagram: an HDFS data file is cut into 64 MB blocks regardless of record boundaries; Node 1 holds blocks 1, 3, 5, 7, ... while Node 2 holds blocks 2, 4, 6, 8, ...)
  • 8. Hadoop HDFS partitions, distributes and stores data at the byte level. (Diagram: the record “456,NY,JOHN\n” is split, its first part stored on Node 1 and its second part on Node 2.) ☛ One data record in the source data can end up being split between two different nodes ☛ Writing or reading such a record requires accessing two different nodes via the network ☛ HDFS presents files as a single continuous stream of data (similar to any local filesystem)
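The block-boundary arithmetic behind this point is easy to show. The sketch below, in plain Java, computes which blocks a record occupies given its byte offset and length; the 64 MB block size matches the slide, while the record offsets are made up for illustration.

```java
public class ByteLevelSplit {

    // First and last block index covered by a record stored at [offset, offset + length).
    static int firstBlock(long offset, long blockSize) {
        return (int) (offset / blockSize);
    }

    static int lastBlock(long offset, long length, long blockSize) {
        return (int) ((offset + length - 1) / blockSize);
    }

    // True when the record spans a block boundary, i.e. potentially two nodes.
    static boolean straddlesBoundary(long offset, long length, long blockSize) {
        return firstBlock(offset, blockSize) != lastBlock(offset, length, blockSize);
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024; // 64 MB, as on the slide

        // A 40-byte record starting 10 bytes before a block boundary has its
        // first 10 bytes in block 0 and the remaining 30 in block 1 - two
        // blocks that may live on two different nodes.
        long offset = blockSize - 10;
        System.out.println(straddlesBoundary(offset, 40, blockSize)); // true
        System.out.println(straddlesBoundary(0, 40, blockSize));      // false
    }
}
```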
  • 9. Hadoop HDFS ☛ Parallel writing to one HDFS file is impossible. Two processes cannot write to one data block at the same time, and two processes trying to write in parallel to one HDFS file (two different blocks) will face the block-boundary issue - with a potential collision. (Diagram: a 1st process executed on Node 1 and a 2nd process executed on Node 2 both write to a growing output file of 64 MB blocks; when the n-th record arrives at a block boundary - not enough space left in Block 1, space in Block 2 already filled by the 2nd process - where should it be written?)
  • 10. CloverETL Partitioned Sandbox partitions, distributes and stores data at the record level. (Diagram: the records “456,NY,JOHN\n”, “457,VA,BILL\n” and “458,MA,SUE\n” are split so that each complete record gets stored on either Node 1 or Node 2.) ☛ Nodes contain complete records. ☛ Writing or reading records means accessing locally stored data only ☛ Partitioned data reside in multiple files on the individual nodes. Clover offers a unified user view over those files. When processing, partition files are accessed individually.
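Record-level partitioning can be sketched the same way. In this plain-Java illustration, whole records are routed round-robin to per-node partitions, so no record is ever split across nodes; the round-robin strategy and the sample records (taken from the slide) are illustrative, not CloverETL's actual API.

```java
import java.util.ArrayList;
import java.util.List;

public class RecordLevelSplit {

    // Route each complete record to exactly one partition (one per node).
    static List<List<String>> partition(List<String> records, int nodes) {
        List<List<String>> partitions = new ArrayList<>();
        for (int i = 0; i < nodes; i++) partitions.add(new ArrayList<>());
        for (int i = 0; i < records.size(); i++) {
            // The record goes to its target partition whole - never byte-split.
            partitions.get(i % nodes).add(records.get(i));
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<String> records = List.of("456,NY,JOHN", "457,VA,BILL", "458,MA,SUE");
        List<List<String>> parts = partition(records, 2);
        System.out.println(parts.get(0)); // [456,NY,JOHN, 458,MA,SUE]
        System.out.println(parts.get(1)); // [457,VA,BILL]
    }
}
```

Because every node holds only complete records, a reader or writer running on that node never needs to contact another node to reassemble a record.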
  • 11. CloverETL Partitioned Sandbox ☛ Parallel writing to a Partitioned Sandbox is easy. Two processes write to two independent partitions of a Clover sandbox. Each process writes to the partition that is local to the node where it runs - no collisions. (Diagram: a 1st process executed on Node 1 writes only to Partition 1 - 456,NY,JOHN\n 458,VA,WILLIAM\n 460,MA,MAG\n - while a 2nd process executed on Node 2 writes only to Partition 2 - 457,NJ,ANN\n 459,IL,MEGAN\n 461,WA,RYAN\n.)
  • 12. Fault resiliency ☛ HDFS implements fault tolerance. HDFS replicates individual data blocks across cluster nodes, thus ensuring fault tolerance. ☛ Clover delegates fault resiliency to the local file system. Clover provides a unified view of data stored locally on the nodes; it is the nodes’ setup (OS, filesystem) that is responsible for fault resiliency.
  • 13. How Hadoop processes data (Diagram: blocks 1-3 of the input data file feed three map() processes that map the data to key->value pairs; the temp data is sorted and partially merged; two reduce() processes then produce output.part1 and output.part2. The accompanying Java listing is the classic Hadoop WordCount job.) • Hadoop concentrates transformation logic into 2 stages - map & reduce. • Complex logic must be split into multiple map & reduce phases, with temporary data being stored in between • Intense network communication happens when reducers (one or more) merge data from multiple mappers (mappers and reducers may run on different nodes) • If multiple reducers are used (to accelerate processing), the resulting data end up in multiple output files (which need to be merged again to produce a single final result)
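The code on the original slide is the well-known Hadoop WordCount job. The two-stage logic it demonstrates can be sketched without any Hadoop dependencies: map() emits (word, 1) pairs, the framework groups the pairs by key (the sort/shuffle step that happens between the phases), and reduce() sums the values for each key. This is a plain-Java simulation of that flow, not Hadoop's actual API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WordCountSketch {

    // Map phase: turn one input line into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) pairs.add(Map.entry(word, 1));
        }
        return pairs;
    }

    // Reduce phase: sum the counts grouped under one key.
    static int reduce(List<Integer> counts) {
        int sum = 0;
        for (int c : counts) sum += c;
        return sum;
    }

    static Map<String, Integer> run(List<String> lines) {
        // Shuffle step: group mapper output by key (Hadoop does this between phases,
        // which is where the heavy network traffic between nodes occurs).
        Map<String, List<Integer>> grouped = new HashMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }
        Map<String, Integer> result = new HashMap<>();
        grouped.forEach((word, counts) -> result.put(word, reduce(counts)));
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("happy together", "happy days")));
        // e.g. {days=1, together=1, happy=2} (map iteration order may vary)
    }
}
```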
  • 14. How CloverETL processes data (Diagram: the input data file is split into Partition 1 - 456,NY,JOHN\n 458,VA,WILLIAM\n - and Partition 2 - 457,NJ,ANN\n 459,IL,MEGAN\n - each flowing through transformation logic with pipeline-parallelism into output.full.) • Clover processes data via a set of transformation components running in pipeline-parallelism mode • Even complex transformations can be performed without temporarily storing data • Individual processing nodes obey data locality - each cluster node processes only its locally stored data partition • Clover allows partitioned output data to be automatically presented as one single result Wikipedia > Pipeline parallelism - when multiple components run on the same data set, i.e. a record is processed in one component while a previous record is being processed in another component.
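Pipeline parallelism itself is easy to demonstrate with two threads connected by a queue, so the second stage works on record N while the first stage is already producing record N+1, with no data staged to disk in between. The stage logic below (upper-casing, then filtering on a key prefix) is made up for the sketch and is not CloverETL's API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class PipelineSketch {

    static final String EOF = "\u0000EOF"; // in-band end-of-stream marker

    static List<String> run(List<String> records) {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(4);
        List<String> out = new ArrayList<>();

        // Stage 1: "parse" records and push them downstream one by one.
        Thread stage1 = new Thread(() -> {
            try {
                for (String r : records) queue.put(r.toUpperCase());
                queue.put(EOF);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // Stage 2: filter records as they arrive, concurrently with stage 1.
        Thread stage2 = new Thread(() -> {
            try {
                for (String r = queue.take(); !r.equals(EOF); r = queue.take()) {
                    if (r.startsWith("45")) out.add(r); // keep matching records only
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        stage1.start();
        stage2.start();
        try {
            stage1.join();
            stage2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("456,ny,john", "999,xx,zoe", "457,nj,ann")));
        // [456,NY,JOHN, 457,NJ,ANN]
    }
}
```

The bounded queue also illustrates backpressure: a fast producer blocks on `put` until the slower consumer catches up, which is how a pipeline runs on large volumes without buffering everything.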
  • 15. ✕ differences ☛ HDFS optimizes for storage. HDFS optimizes for storing vast amounts of data across hundreds of cluster nodes. It follows the “write-once, read-many-times” pattern. ☛ Clover optimizes for I/O throughput. Clover optimizes for very fast parallel writing and reading of data on dozens of cluster nodes. This lends itself nicely to read & process & write (aka ETL).
  • 16. Which approach is better? It depends... Clover is better for typical data transformation/integration tasks where all or most input data records get transformed and written out; the Clover Partitioned Sandbox expects short-term storage of data. HDFS is better for storing vast amounts of data which are written by a single process and potentially read by several processes; HDFS expects long-term storage of data.
  • 17. which one? Wouldn’t it be nice to have the best of both worlds? It’s possible! • Clover is able to read & write data from HDFS • Clover can read and process HDFS-stored data in parallel • Clover can write the results of processing to its Partitioned Sandbox in parallel, or store them back to HDFS as a serial file • Data processing tasks can be visually designed in CloverETL ...thus taking advantage of both worlds.
  • 18. CloverETL parallel reading from HDFS (Diagram: multiple instances of the Parallel Reader access blocks 1-3 of the input data file on HDFS to read data in parallel; data processing is performed by standard CloverETL components, with standard CloverETL debugging available; the final result is written as a single serial file to the local filesystem.) In this scenario: • HDFS serves as the storage system for the raw source data • CloverETL is the data processing engine
  • 20. The (simple) scenario • An Apache log stored on HDFS • ~274 million web log records • Extract the year, month and IP address • Aggregate the data to get the number of unique visitors per month • Running on a cluster of 4 HW nodes, using: • Hadoop only • Hadoop + Hive • CloverETL only • CloverETL + Hadoop/HDFS
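The core of the benchmarked transformation - pull the IP address and the year/month out of each Apache log line, then count distinct IPs per month - can be sketched in a few lines of plain Java. The parsing below assumes the standard Apache common log format and is a simplified illustration, not the actual benchmark code.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class UniqueVisitors {

    // e.g. "10.0.0.1 - - [12/Mar/2013:10:15:32 +0100] \"GET / HTTP/1.1\" 200 512"
    static Map<String, Set<String>> aggregate(List<String> logLines) {
        Map<String, Set<String>> visitorsPerMonth = new HashMap<>();
        for (String line : logLines) {
            // The client IP is the first whitespace-delimited field.
            String ip = line.substring(0, line.indexOf(' '));
            // The timestamp sits between '[' and ']': "12/Mar/2013:..." -> "Mar/2013".
            int open = line.indexOf('[');
            String timestamp = line.substring(open + 1, line.indexOf(']', open));
            String[] dateParts = timestamp.split("/");
            String month = dateParts[1] + "/" + dateParts[2].substring(0, 4);
            // A set per month keeps each IP only once -> unique visitors.
            visitorsPerMonth.computeIfAbsent(month, k -> new HashSet<>()).add(ip);
        }
        return visitorsPerMonth;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
            "10.0.0.1 - - [12/Mar/2013:10:15:32 +0100] \"GET / HTTP/1.1\" 200 512",
            "10.0.0.2 - - [13/Mar/2013:11:00:01 +0100] \"GET / HTTP/1.1\" 200 512",
            "10.0.0.1 - - [14/Mar/2013:09:30:00 +0100] \"GET / HTTP/1.1\" 200 512");
        aggregate(lines).forEach((month, ips) ->
            System.out.println(month + " -> " + ips.size() + " unique visitors"));
        // Mar/2013 -> 2 unique visitors
    }
}
```

At ~274 million records this per-month set of IPs is exactly the kind of aggregation both platforms parallelize: partition the lines, aggregate each partition, then merge the per-partition sets.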
  • 21. The (simple) scenario results - time in seconds: • Hadoop (8 reducers): 329 • Hadoop Hive query: 127 • CloverETL only (Partitioned Sandbox): 59 • CloverETL + Hadoop/HDFS (segmented parallel reading from HDFS): 72
  • 22. + synergy CloverETL brings • fast parallel processing • visual design & debugging • support for formats and communication protocols • process automation & monitoring Hadoop/HDFS brings • low-cost storage of big data • fault resiliency through controllable data replication (“Happy Together” - song by The Turtles)
  • 23. + synergy For more information on • CloverETL Cluster architecture: http://www.cloveretl.com/products/server/cluster http://www.slideshare.net/cloveretl/cloveretl-cluster • CloverETL in general: http://www.cloveretl.com