By combining two best-of-breed solutions, one can create a very powerful big-data-crunching solution.
Hadoop is a very popular big data platform, but its agility and data transformation capabilities are weaker than many Hadoop hype-riding companies would have you believe.
By combining Hadoop's strengths with another very powerful open source technology, CloverETL, we get a nice synergy of the two.
CloverETL + Hadoop
1. ✕
CloverETL versus Hadoop
in light of transforming very large data sets in parallel
a deathmatch or happy together?
2. =
similarities
• Both technologies use data parallelism: input data are split into
"partitions", which are then processed in parallel.
• Each partition is processed the same way (the same algorithm is used).
• At the end of the processing, the results of the individually processed
partitions are merged to produce the final result (see the sketch after the diagram below).
[Diagram: input data are split into Part 1, Part 2 and Part 3, each part is processed in parallel, and the processed data are merged into the final result]
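To make the shared pattern concrete, here is a minimal, self-contained Java sketch of split -> process -> merge, using plain threads on one machine; both Hadoop and CloverETL apply the same idea across cluster nodes. The data and partition count are made up for illustration.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// split -> process -> merge: each partition is processed by the same
// algorithm in parallel, and the partial results are merged at the end
public class SplitProcessMerge {
    public static void main(String[] args) throws Exception {
        List<Integer> data = List.of(1, 2, 3, 4, 5, 6, 7, 8);
        int parts = 4;  // assumes data.size() divides evenly
        ExecutorService pool = Executors.newFixedThreadPool(parts);
        List<Future<Integer>> partials = new ArrayList<>();

        // split: hand each partition to a worker running the same algorithm
        int chunk = data.size() / parts;
        for (int p = 0; p < parts; p++) {
            List<Integer> partition = data.subList(p * chunk, (p + 1) * chunk);
            partials.add(pool.submit(() -> partition.stream().mapToInt(i -> i).sum()));
        }

        // merge: combine the partial results into the final result
        int total = 0;
        for (Future<Integer> f : partials) total += f.get();
        pool.shutdown();
        System.out.println("final result: " + total);
    }
}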
3. ✕
differences
• Hadoop uses the Map->Reduce pattern, originally developed by Google for
web indexing and searching. Processing is divided into a Map phase
(filtering & sorting) and a Reduce phase (summary operation).
The Hadoop approach expects the initially large volume of data to be reduced to a
much smaller result, e.g. a search for pages containing a certain keyword.
• CloverETL is based on the pipeline-parallelism pattern, where individual
specialized components perform various operations on a flow of data
records: parsing, filtering, joining, aggregating, de-duping...
Clover is optimized for large volumes of data flowing through it and
being transformed on the fly.
4. =
similarities
Both technologies use partitioned & distributed storage of data (a filesystem).
• Hadoop uses HDFS (Hadoop Distributed File System), with individual
DataNodes residing on the physical nodes of the Hadoop/HDFS cluster.
• CloverETL uses a Partitioned Sandbox, where data are spread over the
physical nodes of a CloverETL Cluster. Each node is also a data processing
node, typically (though not exclusively) processing locally stored data. One node
can be part of more than one Partitioned Sandbox.
5. ✕
differences
HDFS operates at the byte level (data are read & written as streams of
bytes). It includes data loss prevention through data redundancy.
HDFS is based on the "write-once, read-many-times" pattern.
CloverETL's Partitioned Sandbox operates at the record level (data are
read & written as complete records). Data loss prevention is left to the
underlying file system storage. Clover's Partitioned Sandbox supports the
very high I/O throughput needed for massive data transformations.
A short read sketch follows below.
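For illustration, here is a minimal Java sketch of HDFS's byte-level view using the standard Hadoop FileSystem API (the file path is a made-up example): the client simply receives a stream of bytes, and record boundaries are its own problem.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsByteRead {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // HDFS hands back raw bytes; it knows nothing about records
        try (FSDataInputStream in = fs.open(new Path("/logs/access.log"))) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) > 0) {
                // a record may well be split across two reads (or two blocks)
                System.out.write(buf, 0, n);
            }
        }
    }
}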
6. CloverETL ✕ Hadoop HDFS
[Diagram: HDFS stores, splits and distributes data at the byte level, so a single record such as "456,NY,JOHN\n" can be cut mid-record by a split. CloverETL stores, splits and distributes data at the record level, so records such as "456,NY,JOHN\n", "457,VA,BILL\n" and "458,MA,SUE\n" are only ever split on record boundaries.]
7. Hadoop HDFS
organises files into large blocks of bytes (64 MB or more), which are then
physically stored on different nodes of the Hadoop cluster.
[Diagram: an HDFS data file consists of data records grouped into 64 MB data blocks; blocks 1, 3, 5, 7, ... are stored on Node 1 and blocks 2, 4, 6, 8, ... on Node 2, so a block boundary can split a record.]
8. Hadoop HDFS
partitions, distributes and stores data at the byte level.
[Diagram: the record "456,NY,JOHN\n" is split mid-record, with the 1st part stored on Node 1 and the 2nd part on Node 2.]
☛ One data record in the source data can end up split between two different nodes
☛ Writing or reading such a record requires accessing two different nodes via the network
☛ HDFS presents files as a single continuous stream of data (similar to any local filesystem)
9. Hadoop HDFS
☛ Parallel writing to one HDFS file is impossible
Two processes cannot write to one data block at the same time, and two
processes trying to write in parallel to one HDFS file (to two different
blocks) will face the block boundary issue, with potential collisions: a
process reaching its n-th record may find there is not enough space left in
its block, because that space has already been filled by the second process,
and it has nowhere unambiguous to write.
[Diagram: the 1st process, executed on Node 1, and the 2nd process, executed on Node 2, both write to Nodes 1 & 2; the output file grows as 64 MB blocks (Block 1, Block 2) are added, and the 2nd process starts writing to Block 2 while the 1st still needs space there.]
10. CloverETL Partitioned Sandbox
partitions, distributes and stores data at the record level.
[Diagram: the records "456,NY,JOHN\n", "457,VA,BILL\n" and "458,MA,SUE\n" are split on record boundaries and each gets stored whole on Node 1 or Node 2.]
☛ Nodes contain complete records
☛ Writing or reading records means accessing locally stored data only
☛ Partitioned data are located in multiple files on the individual nodes. Clover offers a
unified user view over those files; when processing, the partition files are accessed individually.
11. CloverETL Partitioned Sandbox
☛ Parallel writing to a Partitioned Sandbox is easy
Two processes write to two independent partitions of a Clover sandbox.
Each process writes to the partition that is local to the node where it
runs, so there are no collisions.
[Diagram: the 1st process, executed on Node 1, writes to Node 1 only, appending records to Partition 1 (456,NY,JOHN; 458,VA,WILLIAM; 460,MA,MAG; ...); the 2nd process, executed on Node 2, writes to Node 2 only, appending records to Partition 2 (457,NJ,ANN; 459,IL,MEGAN; 461,WA,RYAN; ...).]
12. Fault resiliency
☛ HDFS implements fault tolerance
HDFS replicates individual data blocks across cluster nodes, thus ensuring fault
tolerance. A short sketch of the replication knob follows below.
☛ Clover delegates fault resiliency to the local file system
Clover provides a unified view of data stored locally on the nodes. It is the nodes'
setup (OS, filesystem) that is responsible for fault resiliency.
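A short Java sketch of that replication knob: dfs.replication is a standard HDFS property, and FileSystem.setReplication is the standard API call; the file path here is hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("dfs.replication", 3);  // keep 3 copies of each block of new files
        FileSystem fs = FileSystem.get(conf);
        // raise the replication of an existing (hypothetical) file to 5 copies
        fs.setReplication(new Path("/logs/access.log"), (short) 5);
    }
}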
13. The classic Hadoop WordCount example:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every token in the input line
    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reduce phase: sum the counts collected for each word
    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "wordcount");
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
    }
}
How Hadoop processes data
[Diagram: the input data file is stored as Block 1, Block 2 and Block 3; processes 1-3 run map(), mapping data to key->value pairs; the temp data are sorted and (partially) merged; processes 4 and 5 run reduce(), writing output.part1 and output.part2.]
• Hadoop concentrates transformation logic into two stages: map & reduce.
• Complex logic must be split into multiple map & reduce phases, with temporary data being stored
in between (see the chaining sketch below).
• Intense network communication happens when the reducers (one or more) merge data from multiple
mappers (mappers and reducers may run on different nodes).
• If multiple reducers are used (to accelerate processing), the resulting data end up in multiple
output files, which need to be merged again to produce a single final result.
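As a sketch of that chaining in Java (the /data and /tmp paths are hypothetical, and the per-stage mappers/reducers are elided): each extra phase is a separate Job, and the intermediate results must be written to and re-read from HDFS.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TwoPhaseJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path tmp = new Path("/tmp/stage1-output");

        Job stage1 = Job.getInstance(conf, "stage1");
        // ... setMapperClass / setReducerClass for the first phase ...
        FileInputFormat.addInputPath(stage1, new Path("/data/input"));
        FileOutputFormat.setOutputPath(stage1, tmp);
        stage1.waitForCompletion(true);      // temporary data land on HDFS here

        Job stage2 = Job.getInstance(conf, "stage2");
        // ... setMapperClass / setReducerClass for the second phase ...
        FileInputFormat.addInputPath(stage2, tmp);  // re-read the temp data
        FileOutputFormat.setOutputPath(stage2, new Path("/data/output"));
        stage2.waitForCompletion(true);
    }
}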
14. How CloverETL processes data
[Diagram: the input data file is split into Partition 1 (456,NY,JOHN; 458,VA,WILLIAM; ...) and Partition 2 (457,NJ,ANN; 459,IL,MEGAN; ...); each partition flows through the transformation logic with pipeline-parallelism, and the results are combined into a single output.full.]
• Clover processes data via a set of transformation components running in pipeline-parallelism mode
(a minimal sketch follows below)
• Even complex transformations can be performed without temporarily storing data
• Individual processing nodes obey data locality: each cluster node processes only its locally stored
data partition
• Clover allows partitioned output data to be automatically presented as one single result
Wikipedia > Pipeline parallelism: multiple components run on the same data set at once, i.e. while a record is
processed in one component, the previous record is being processed in another component.
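Here is a minimal Java sketch of pipeline parallelism itself (plain threads and bounded queues, not CloverETL's actual API; the records and the filter are made up): while the filter stage works on one record, the parse stage is already feeding it the next, so nothing is staged to disk.

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Three stages connected by bounded queues; all three run at once
public class PipelineSketch {
    private static final String EOF = "<EOF>";  // sentinel ending the stream

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> parsed = new ArrayBlockingQueue<>(100);
        BlockingQueue<String> filtered = new ArrayBlockingQueue<>(100);

        // Stage 1: "parser" emits records into the pipeline
        Thread parse = new Thread(() -> {
            try {
                for (String rec : new String[] {"456,NY,JOHN", "457,VA,BILL", "458,MA,SUE"}) {
                    parsed.put(rec);
                }
                parsed.put(EOF);
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        // Stage 2: "filter" keeps only NY records, running concurrently with stage 1
        Thread filter = new Thread(() -> {
            try {
                String rec;
                while (!(rec = parsed.take()).equals(EOF)) {
                    if (rec.contains(",NY,")) filtered.put(rec);
                }
                filtered.put(EOF);
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        // Stage 3: "writer" consumes the filtered flow
        Thread write = new Thread(() -> {
            try {
                String rec;
                while (!(rec = filtered.take()).equals(EOF)) {
                    System.out.println(rec);
                }
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        parse.start(); filter.start(); write.start();
        parse.join(); filter.join(); write.join();
    }
}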
15. ✕
differences
☛ HDFS optimizes for storage
HDFS optimizes for storing vast amounts of data across hundreds of cluster
nodes. It follows the "write-once, read-many-times" pattern.
☛ Clover optimizes for I/O throughput
Clover optimizes for very fast parallel writing and reading of data on dozens
of cluster nodes. This lends itself nicely to read & process & write (aka ETL)
workloads.
16. Which approach is better?
It depends...
CloverETL is better for typical data transformation/integration tasks,
where all or most input data records get transformed and written out.
Clover's Partitioned Sandbox expects short-term storage of data.
HDFS is better when storing vast amounts of data that are written by a
single process and potentially read by several processes.
HDFS expects long-term storage of data.
17. Which one?
Wouldn't it be nice to have the best of both worlds?
It's possible!
• Clover is able to read & write data from HDFS
• Clover can read and process HDFS-stored data in parallel
• Clover can write the results of processing to its Partitioned Sandbox in
parallel, or store them back to HDFS as a serial file
• Data processing tasks can be visually designed in CloverETL
...thus taking advantage of both worlds.
18. CloverETL parallel reading from HDFS
[Diagram: the input data file on HDFS consists of Block 1, Block 2 and Block 3; multiple instances of the Parallel Reader access HDFS to read the data in parallel; data processing is performed by standard CloverETL components, with standard CloverETL debugging available; the final result is written as a single serial file to the local filesystem.]
In this scenario:
• HDFS serves as the storage system for the raw source data
• CloverETL is the data processing engine
20. The (simple) scenario
• Apache log stored on HDFS
• ~274 million web log records
• Extract year, month and IP address
• Aggregate the data to get the number of unique visitors per month
(an illustrative sketch of this logic follows below)
• Running on a cluster of 4 hardware nodes, using:
• Hadoop only
• Hadoop + Hive
• CloverETL only
• CloverETL + Hadoop/HDFS
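The core transformation logic of this scenario is simple enough to sketch in Java (an illustrative single-machine version with made-up sample lines, not the actual benchmark code): pull the IP address and the month out of each Apache log record, then count distinct IPs per month.

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UniqueVisitors {
    // matches the start of an Apache common-log line:
    // e.g. 10.0.0.1 - - [12/Mar/2013:06:25:14 +0100] "GET / HTTP/1.1" 200 512
    private static final Pattern LOG =
        Pattern.compile("^(\\S+) \\S+ \\S+ \\[\\d+/(\\w+)/(\\d{4}):");

    public static void main(String[] args) {
        String[] lines = {
            "10.0.0.1 - - [12/Mar/2013:06:25:14 +0100] \"GET / HTTP/1.1\" 200 512",
            "10.0.0.2 - - [12/Mar/2013:06:26:02 +0100] \"GET / HTTP/1.1\" 200 512",
            "10.0.0.1 - - [03/Apr/2013:10:01:33 +0100] \"GET / HTTP/1.1\" 200 512"
        };
        Map<String, Set<String>> visitors = new HashMap<>();
        for (String line : lines) {
            Matcher m = LOG.matcher(line);
            if (m.find()) {
                String month = m.group(3) + "-" + m.group(2);  // e.g. "2013-Mar"
                visitors.computeIfAbsent(month, k -> new HashSet<>()).add(m.group(1));
            }
        }
        visitors.forEach((month, ips) ->
            System.out.println(month + ": " + ips.size() + " unique visitors"));
    }
}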
21. The (simple) scenario results

Solution (notes)                                                  Time (sec)
Hadoop (8 reducers)                                               329
Hadoop Hive query                                                 127
CloverETL only (Partitioned Sandbox)                              59
CloverETL + Hadoop/HDFS (segmented parallel reading from HDFS)    72
22. +
synergy
CloverETL brings
• fast parallel processing
• visual design & debugging
• support for formats and communication protocols
• process automation & monitoring
Hadoop/HDFS brings
• low-cost storage of big data
• fault resiliency through controllable data replication
"Happy Together" - a song by The Turtles
23. +
synergy
For more information on
• CloverETL Cluster architecture:
http://www.cloveretl.com/products/server/cluster
http://www.slideshare.net/cloveretl/cloveretl-cluster
• CloverETL in general:
http://www.cloveretl.com