Hadoop Solutions

By Zenyk Matchyshyn
Staff Engineer @ Lohika

Agenda
• Why?
• Data in / Data out
• Data Formats
• Tools
• Providers
• Future
• Q/A

1/14/2013 2

Why?
• Smart meter analysis
• Genome processing
• Sentiment & social media analysis
• Network capacity trending & management
• Ad targeting
• Fraud detection

DATA IN / DATA OUT

Flume

• Apache Flume is a distributed system for
collecting streaming data.
• Developed by Cloudera, now an Apache project
• Popular & supported
• Features:
  • Centralized config
  • Failover
  • Reliability

Flume - Responsibilities
• Node – a path from source to sink
• Agent – collects data from the local host and forwards it
to a Collector
• Collector – collects data from Agents and writes it into
HDFS
• Master – manages configuration and supports the
data flow
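
The Agent and Collector roles above can be sketched in a few lines. This is a toy model, not Flume's API: the class names, the `batch_size` knob and the plain list standing in for HDFS are all assumptions made for illustration.

```python
class Collector:
    """Buffers events from agents and writes them to a sink in batches."""

    def __init__(self, sink, batch_size=2):
        self.sink = sink              # a plain list standing in for HDFS
        self.batch_size = batch_size  # hypothetical tuning knob
        self.buffer = []

    def receive(self, event):
        self.buffer.append(event)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.sink.append(list(self.buffer))
            self.buffer.clear()


class DownCollector:
    """Simulates an unreachable collector."""

    def receive(self, event):
        raise ConnectionError("collector unavailable")


class Agent:
    """Tags local lines with the host name and forwards them to a collector."""

    def __init__(self, host, collectors):
        self.host = host
        self.collectors = collectors  # tried in order: primary, then failover

    def forward(self, line):
        event = {"host": self.host, "body": line}
        for collector in self.collectors:
            try:
                collector.receive(event)
                return
            except ConnectionError:
                continue  # fail over to the next collector
        raise RuntimeError("no collector reachable")


hdfs = []
backup = Collector(hdfs)
agent = Agent("web01", [DownCollector(), backup])
for line in ["GET /a", "GET /b", "GET /c"]:
    agent.forward(line)
backup.flush()  # write the last partial batch
bodies = [event["body"] for batch in hdfs for event in batch]
```

The primary collector is down here, so every event reaches the sink through the backup. That is the failover feature in miniature.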

Data in / Data out - other solutions

• Scribe https://github.com/facebook/scribe –
similar to Flume
• Chukwa http://incubator.apache.org/chukwa/
– similar to Flume
• Oozie http://oozie.apache.org/ - workflow
scheduler

Sqoop

• Apache project, originally from Cloudera
http://sqoop.apache.org/
• Uses database metadata to describe the structure of
the data in HDFS
• Transfers bulk data into and out of relational
databases
• Alternative: read from and write to the database
directly from Map/Reduce
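
A rough sketch of what a bulk import does conceptually: read the table's column metadata, then dump the rows as delimited text lines. It uses an in-memory SQLite database rather than Sqoop itself. The `import_table` helper, its `delimiter` parameter and the sample rows are made up for illustration.

```python
import sqlite3


def import_table(conn, table, delimiter=","):
    """Return (column names, delimited text lines) for one table."""
    # The table name is interpolated directly, so pass trusted names only.
    cur = conn.execute(f"SELECT * FROM {table} ORDER BY rowid")
    columns = [description[0] for description in cur.description]
    lines = [delimiter.join(str(value) for value in row) for row in cur]
    return columns, lines


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (name TEXT, age INTEGER, gpa REAL)")
conn.executemany(
    "INSERT INTO student VALUES (?, ?, ?)",
    [("John", 18, 4.0), ("Mary", 19, 3.8)],
)
columns, lines = import_table(conn, "student")
```

The column names come from the database, not from the file, which is why the metadata has to travel with the data.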

DATA FORMATS

Formats

• Input and Output matter
• Data in files is split
• XML and JSON are supported
• Keep one document per line or suffer the
consequences ;)
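
Why one document per line matters: a reader that lands at an arbitrary byte offset can resynchronize at the next newline. The sketch below follows the convention line-oriented readers commonly use (skip the partial first line, read past the end of the split to finish the last line). `read_split` is a hypothetical helper, not a Hadoop class.

```python
import json


def read_split(data, start, end):
    """Return the JSON records owned by the byte range [start, end)."""
    pos = start
    if start > 0:
        # Skip to the next line start; the previous split reads this line.
        newline = data.find(b"\n", start)
        if newline == -1:
            return []
        pos = newline + 1
    records = []
    # A line that starts at or before `end` belongs to this split.
    while pos <= end and pos < len(data):
        newline = data.find(b"\n", pos)
        if newline == -1:
            newline = len(data)
        line = data[pos:newline]
        if line.strip():
            records.append(json.loads(line))
        pos = newline + 1
    return records


data = b'{"id": 1}\n{"id": 2}\n{"id": 3}\n'
split_size = 12  # deliberately cuts through the middle of a record
ids = []
for start in range(0, len(data), split_size):
    end = min(start + split_size, len(data))
    ids.extend(record["id"] for record in read_split(data, start, end))
```

Every record is read exactly once even though the split boundaries fall inside records. A pretty-printed multi-line JSON document offers no such resynchronization point.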

Serialization frameworks
• Binary in nature, makes things a bit more
complicated
• Thrift & Protobuf vs SequenceFile & Avro
• Native formats (SequenceFile & Avro) support
splittability and compression
• Avro supports code generation and
versioning, just like Thrift & Protobuf
• Out-of-the-box support in Hadoop

Compression

• Deflate (zlib)
• Gzip
• Bzip2 – splittable, but slow
• LZO – block based
• LZOP – splittable with additional work
• Snappy – from Google, fast, but no splittability
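
A small sketch of why block-based compression helps with splitting: when every block is compressed on its own, a task can decompress block N without touching the blocks before it. It uses Python's `zlib` and a made-up block size; the real codecs and container formats differ in detail.

```python
import zlib


def compress_blocks(data, block_size):
    """Compress each fixed-size block independently."""
    return [
        zlib.compress(data[offset:offset + block_size])
        for offset in range(0, len(data), block_size)
    ]


def read_block(blocks, index):
    """Decompress one block without reading any other block."""
    return zlib.decompress(blocks[index])


data = b"2013-01-14 GET /index.html 200\n" * 100
block_size = 1024  # hypothetical; real block sizes are much larger
blocks = compress_blocks(data, block_size)
third_block = read_block(blocks, 2)
```

A single compressed stream over the whole file would usually be smaller, but a reader would have to start from byte zero.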

Testing
• MRUnit – unit testing for Map/Reduce jobs
http://mrunit.apache.org/
• Data sampling for testing
• Data spikes detection
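
The idea behind unit testing Map/Reduce jobs, sketched in plain Python rather than MRUnit: keep the mapper and reducer as pure functions, feed them a handful of records and compare the output. `run_job` is a hypothetical in-memory driver that stands in for the shuffle and sort phase.

```python
from itertools import groupby
from operator import itemgetter


def map_words(_offset, line):
    """Mapper: emit (word, 1) for every word in the line."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(word, counts):
    """Reducer: sum the counts collected for one word."""
    yield word, sum(counts)


def run_job(records, mapper, reducer):
    """Run map, sort by key, group, then reduce, all in memory."""
    mapped = [pair for key, value in records for pair in mapper(key, value)]
    mapped.sort(key=itemgetter(0))
    output = []
    for key, group in groupby(mapped, key=itemgetter(0)):
        output.extend(reducer(key, (value for _, value in group)))
    return output


sample = [(0, "Hadoop hadoop pig"), (18, "pig hive")]
result = run_job(sample, map_words, reduce_counts)
```

The mapper and reducer can also be tested separately, which is closer to how a per-component test driver works.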

Small files

• Small files are problematic because the HDFS block
size is big: each tiny file still needs its own block
entry and its own map task
• Can pack them into bigger Avro files
• Can move them to HBase
• Hadoop Archives (HAR) files
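
Packing boils down to one large blob plus an index of where each small file lives. The sketch below shows only that idea. The `pack` and `read_file` helpers and the file names are hypothetical; the real Avro and HAR layouts are more involved.

```python
def pack(files):
    """Concatenate small files into one blob and build a name -> (offset, length) index."""
    index = {}
    blob = bytearray()
    for name in sorted(files):
        index[name] = (len(blob), len(files[name]))
        blob += files[name]
    return bytes(blob), index


def read_file(blob, index, name):
    """Fetch one original file back out of the packed blob."""
    offset, length = index[name]
    return blob[offset:offset + length]


small_files = {
    "logs/a.txt": b"first\n",
    "logs/b.txt": b"second\n",
    "logs/c.txt": b"third\n",
}
blob, index = pack(small_files)
restored = read_file(blob, index, "logs/b.txt")
```

Three files become one stored object, and the index keeps each of them individually addressable.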

Pig
• High level language for data analysis
• Uses Pig Latin to describe data flows
(translates into MapReduce)
• Filters, Joins, Projections, Groupings, Counts,
etc.
• Example:
A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);
B = FOREACH A GENERATE name;
DUMP B;
(John)
(Mary)
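
For readers new to Pig, the same data flow spelled out in plain Python: load tab-delimited rows against a schema, project one field, dump the tuples. The tab delimiter matches the PigStorage default. The `load` helper and the ages and GPAs in the sample data are made up.

```python
import io


def load(stream, schema, delimiter="\t"):
    """LOAD: parse delimited lines into rows typed according to the schema."""
    for line in stream:
        fields = line.rstrip("\n").split(delimiter)
        yield {name: cast(value) for (name, cast), value in zip(schema, fields)}


student = io.StringIO("John\t18\t4.0\nMary\t19\t3.8\n")  # made-up sample data
A = load(student, [("name", str), ("age", int), ("gpa", float)])
B = ((row["name"],) for row in A)  # FOREACH A GENERATE name
dumped = list(B)                   # DUMP B
for record in dumped:
    print("(" + ",".join(record) + ")")
```

Like the Pig script, each step is lazy until the final dump pulls the data through.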

Hive

• SQL-like interface - HiveQL
• Has its own table structure (schemas kept in a metastore)
• Not a pipeline like Pig
• Basically a distributed data warehouse
• Has execution optimization

HBase

• Distributed, column-oriented store
• Independent of Hadoop
• No translation into Map/Reduce
• Stores data in HFiles (an indexed format that replaced MapFiles, i.e. indexed SequenceFiles)
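
A toy picture of the column-oriented data model: a sorted map from row key to column ("family:qualifier") to timestamped versions. This is an in-memory illustration only. The `Table` class and its methods are hypothetical, not the HBase client API.

```python
class Table:
    """Row key -> column -> list of (timestamp, value) versions."""

    def __init__(self):
        self.rows = {}

    def put(self, row, column, value, ts):
        self.rows.setdefault(row, {}).setdefault(column, []).append((ts, value))

    def get(self, row, column):
        """Return the newest version of a cell, or None if it is absent."""
        versions = self.rows.get(row, {}).get(column, [])
        if not versions:
            return None
        return max(versions, key=lambda version: version[0])[1]

    def scan(self, start, stop):
        """Return row keys in sorted order within [start, stop)."""
        return [key for key in sorted(self.rows) if start <= key < stop]


table = Table()
table.put("user1", "info:name", "John", ts=1)
table.put("user1", "info:name", "Johnny", ts=2)
table.put("user2", "info:name", "Mary", ts=1)
latest = table.get("user1", "info:name")
keys = table.scan("user1", "user2")
```

Reads go straight to a row key or a key range, which is why no Map/Reduce translation step is needed.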

Apache

• Umbrella for Hadoop projects
• No commercial support
• Active community
• Most recent builds

Cloudera

• Has its own tuned build – CDH
• Commercial support
• Certification & Training
• Has products on top of Hadoop (like Cloudera
Manager etc.)
• Very high visibility

Amazon Elastic MapReduce (EMR)
• Custom build tailored for AWS environment
• Very easy
• Uses S3 for storage
• Uses SimpleDB for job flow state information
• Supports HBase

Hortonworks

• Own platform on top of Hadoop
• Big backers like Microsoft and Yahoo
• Offers training & certification

Future

• Percolator for incremental indexing and
analysis of frequently changing datasets
• Dremel for ad hoc analytics
• Pregel for analyzing graph data
• ZooKeeper & Hadoop de-coupling with new
execution engines to the rescue!
