5. Flume
• Apache Flume is a distributed system for
collecting streaming data.
• Developed by Cloudera, now Apache project
• Popular & supported
• Features:
• Centralized config
• Failover
• Reliability
1/14/2013 5
6. Flume - Responsibilities
• Node – path from source to sink
• Agent – collects data from the local host and forwards
it to a Collector
• Collector – aggregates data from Agents and writes it
into HDFS
• Master – manages configuration and supports
data flow
7. Data in / Data out - other solutions
• Scribe https://github.com/facebook/scribe –
similar to Flume
• Chukwa http://incubator.apache.org/chukwa/
– similar to Flume
• Oozie http://oozie.apache.org/ - workflow
scheduler
8. Sqoop
• Apache project, originally from Cloudera
http://sqoop.apache.org/
• Uses metadata to describe structure in HDFS
• Transports bulk data into & out of relational
databases
• Directly reading & writing from Map/Reduce
as an alternative
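The export idea can be sketched in a few lines. This is only an illustration, assuming an in-memory SQLite database as a stand-in for the relational source; it does not use Sqoop itself, and the `orders` table and the `export_table` helper are hypothetical. It reads the column metadata from the cursor and turns each row into a delimited text line, the kind of record that would then land in HDFS.

```python
import sqlite3

# Hypothetical source table in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT, total REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "alice", 9.5), (2, "bob", 20.0)],
)


def export_table(conn, table, delimiter=","):
    """Return (column names, delimited lines) for a whole table.

    The table name is interpolated directly, so it must come from
    trusted code, never from user input.
    """
    cur = conn.execute(f"SELECT * FROM {table} ORDER BY rowid")
    columns = [d[0] for d in cur.description]  # metadata describing the structure
    lines = [delimiter.join(str(v) for v in row) for row in cur]
    return columns, lines


columns, lines = export_table(conn, "orders")
print(columns)
for line in lines:
    print(line)
```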
10. Formats
• Input and Output matter
• Data in files is split
• XML and JSON are supported
• Store one document per line or suffer the
consequences ;)
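Why one document per line matters can be shown with a small sketch. It assumes a simplified line-oriented reader: each split only emits the lines that start inside its byte range, so every record is read exactly once even when a split boundary falls in the middle of a document. The `read_split` helper and the tiny 16-byte split size are hypothetical and are not Hadoop's API.

```python
import json


def read_split(data, start, end):
    """Yield the lines that start inside the byte range [start, end)."""
    pos = start
    if start > 0:
        # A line that began before `start` belongs to the previous split,
        # so skip ahead to the first line that starts at or after `start`.
        newline = data.find(b"\n", start - 1)
        if newline == -1:
            return
        pos = newline + 1
    while pos < end and pos < len(data):
        newline = data.find(b"\n", pos)
        stop = len(data) if newline == -1 else newline
        yield data[pos:stop]  # may run past `end` to finish the line
        pos = stop + 1


# One JSON document per line.
records = [{"id": i, "msg": "event %d" % i} for i in range(5)]
data = "".join(json.dumps(r) + "\n" for r in records).encode("utf-8")

split_size = 16  # hypothetical and tiny on purpose, so boundaries cut documents
splits = [(s, min(s + split_size, len(data))) for s in range(0, len(data), split_size)]

recovered = [
    json.loads(line)
    for start, end in splits
    for line in read_split(data, start, end)
]
print(len(splits), "splits,", len(recovered), "records recovered")
```

A document that spans several lines has no such cheap boundary marker, so a reader starting mid-file cannot tell where the next record begins.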
11. Serialization frameworks
• Binary in nature, makes things a bit more
complicated
• Thrift & Protobuf vs SequenceFile & Avro
• Native formats support splittability and
compression
• Avro supports code generation and
versioning, just like Thrift & Protobuf
• Out-of-the-box support in Hadoop
12. Compression
• Deflate (zlib)
• Gzip
• Bzip2 – splittable with additional work, slow
• LZO – block based
• LZOP – splittable with additional work
• Snappy – from Google, fast, but no splittability
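A rough way to get a feel for these codecs is to try the ones that ship with Python's standard library (zlib, gzip, bz2). LZO and Snappy need third-party packages and are left out here. The sample payload and the `decodes_from` helper are made up for illustration; the second half of the sketch shows why a plain gzip stream is not splittable: a reader cannot start decoding from an arbitrary offset.

```python
import bz2
import gzip
import zlib

# Hypothetical log data, repetitive on purpose so every codec shrinks it.
payload = b"2013-01-14 12:00:00 INFO request served in 12 ms\n" * 2000

codecs = {
    "deflate (zlib)": (zlib.compress, zlib.decompress),
    "gzip": (gzip.compress, gzip.decompress),
    "bzip2": (bz2.compress, bz2.decompress),
}

sizes = {}
for name, (compress, decompress) in codecs.items():
    blob = compress(payload)
    assert decompress(blob) == payload  # round trip must be lossless
    sizes[name] = len(blob)
    print(f"{name:15s} {len(payload)} -> {len(blob)} bytes")


def decodes_from(blob, offset):
    """Return True if a gzip stream can be decoded starting at `offset`."""
    try:
        gzip.decompress(blob[offset:])
        return True
    except (OSError, EOFError, zlib.error):
        return False


gz = gzip.compress(payload)
print("gzip readable from the start: ", decodes_from(gz, 0))
print("gzip readable from mid-stream:", decodes_from(gz, len(gz) // 2))
```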
13. Testing
• MRUnit – unit testing for Map/Reduce jobs
http://mrunit.apache.org/
• Data sampling for testing
• Data spikes detection
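The idea behind MRUnit, feeding a mapper and reducer known input and checking the output without a cluster, can be sketched in plain Python. This is not the MRUnit API (MRUnit is a Java library); the word-count `mapper`, `reducer`, and the in-memory `run_job` harness below are hypothetical stand-ins.

```python
from collections import defaultdict


def mapper(line):
    """Emit (word, 1) for every word in a line."""
    for word in line.split():
        yield word.lower(), 1


def reducer(word, counts):
    """Sum the counts collected for one word."""
    yield word, sum(counts)


def run_job(lines, mapper, reducer):
    """Run map, shuffle (group by key), and reduce in memory."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output


# A tiny hand-written sample stands in for sampled production data.
result = run_job(["Hadoop Pig Hive", "hive hadoop"], mapper, reducer)
print(result)
```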
14. Small files
• Small files are problematic: each is far below the
(large) HDFS block size, yet still costs NameNode
metadata and its own map task
• Can pack them into bigger Avro files
• Can move to HBase
• Hadoop Archives (HAR) files
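Packing can be sketched as follows. To stay self-contained, this example uses a JSON-lines container with base64 payloads as a stand-in for an Avro file; a real job would use an Avro library and schema instead. The `pack` and `unpack` helpers and the file names are hypothetical.

```python
import base64
import json
import os
import tempfile


def pack(directory, container_path):
    """Write every file in `directory` as one record in a single container file."""
    count = 0
    with open(container_path, "w", encoding="utf-8") as out:
        for name in sorted(os.listdir(directory)):
            path = os.path.join(directory, name)
            if not os.path.isfile(path):
                continue
            with open(path, "rb") as f:
                payload = base64.b64encode(f.read()).decode("ascii")
            # One record per line: original file name plus its content.
            out.write(json.dumps({"name": name, "data": payload}) + "\n")
            count += 1
    return count


def unpack(container_path):
    """Yield (name, content) pairs back out of the container."""
    with open(container_path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            yield record["name"], base64.b64decode(record["data"])


with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as dst:
    for i in range(3):
        with open(os.path.join(src, f"part-{i}.log"), "wb") as f:
            f.write(f"small file {i}\n".encode("utf-8"))
    container = os.path.join(dst, "packed.jsonl")
    packed_count = pack(src, container)
    restored = dict(unpack(container))

print(packed_count, "files packed into one container")
```

Keeping the original file name inside each record means the small files can still be told apart after packing.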
16. Pig
• High level language for data analysis
• Uses PigLatin to describe data flows
(translates into MapReduce)
• Filters, Joins, Projections, Groupings, Counts,
etc.
• Example:
A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);
B = FOREACH A GENERATE name;
DUMP B;
(John)
(Mary)
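For readers new to Pig Latin, the same data flow can be mirrored step by step in plain Python. This is only an analogy that runs locally, with no MapReduce involved. The tab-delimited sample rows are hypothetical (only the names John and Mary come from the output above); tab is the default field delimiter of PigStorage.

```python
import csv
import io

# Hypothetical contents of the 'student' file, tab-delimited.
student = "John\t18\t4.0\nMary\t19\t3.8\n"

# A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);
A = [
    (name, int(age), float(gpa))
    for name, age, gpa in csv.reader(io.StringIO(student), delimiter="\t")
]

# B = FOREACH A GENERATE name;
B = [(name,) for name, _age, _gpa in A]

# DUMP B;
for row in B:
    print("(" + ",".join(row) + ")")
```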
17. Hive
• SQL-like interface - HiveQL
• Has its own structure
• Not a pipeline like Pig
• Basically a distributed data warehouse
• Has execution optimization
18. HBase
• Distributed, column-oriented store
• Independent of Hadoop
• No translation into Map/Reduce
• Stores data in MapFiles (indexed SequenceFiles)
20. Apache
• Umbrella for Hadoop projects
• No commercial support
• Active community
• Most recent builds
21. Cloudera
• Has its own tuned build – CDH
• Commercial support
• Certification & Training
• Has products on top of Hadoop (like Cloudera
Manager etc.)
• Very high visibility
22. Amazon Elastic MapReduce (EMR)
• Custom build tailored for AWS environment
• Very easy
• Uses S3 for storage
• Uses SimpleDB for job flow state information
• Supports HBase
23. HortonWorks
• Own platform on top of Hadoop
• Big backers like Microsoft and Yahoo
• Offers training & certification
25. Future
• Percolator for incremental indexing and
analysis of frequently changing datasets
• Dremel for ad hoc analytics
• Pregel for analyzing graph data
• ZooKeeper & Hadoop de-coupling with new
execution engines to the rescue!