There is a lot more to Hadoop than Map-Reduce. An increasing number of engineers and researchers involved in processing and analyzing large amounts of data regard Hadoop as an ever-expanding ecosystem of open-source libraries, including NoSQL, scripting, and analytics tools.
Hadoop Ecosystem
1. Hadoop Ecosystem
ACM Bay Area Data Mining Camp 2011
Patrick Nicolas
September 19, 2011
http://patricknicolas.blogspot.com
http://www.slideshare.net/pnicolas
https://github.com/prnicolas
Copyright 2011 Patrick Nicolas - All rights reserved
2. Overview
Besides providing developers and analysts with an open-source
implementation of the map-reduce functional model, the Hadoop
ecosystem incorporates analytical algorithms, task/workflow
managers, and NoSQL stores.
The layered stack, bottom to top:
- Java Virtual Machine
- HDFS
- Map/Reduce framework
- Workflow (Hive, Pig, Cascading) and configuration (Zookeeper)
- NoSQL (key-value stores, document stores, multi-column stores, graph databases) and analytics (Mahout)
- Client code, scripts
3. Key Components
The Hadoop ecosystem can be described as a data-centric
taxonomy of tools to analyze, aggregate, store, and report data.
- File system: GFS, HDFS
- MapReduce: Hadoop
- Admin./configuration: Zookeeper
- NoSQL
  - Key-value stores: Redis, Memcache, Kyoto Cabinet
  - Document stores: MongoDB, CouchDB
  - Multi-column stores: HBase, Hypertable, BigTable, Cassandra, BerkeleyDB
  - Graph databases: Neo4j, GraphDB, InfiniteGraph
- Script/workflow: Pig, Cascading
- SQL: Hive
- Analytics/API: Mahout, Chukwa
4. NoSQL: Overview
Non-relational data stores allow large amounts of data to be
collected very efficiently. In contrast to an RDBMS, NoSQL
schemas are optimized for sequential writes and are therefore
not appropriate for querying and reporting.
NoSQL stores share the same basic key-value schema but provide
different methods to describe values: a key maps to a value that may
be a plain string, a column family, or a nested structure.
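The shared schema can be sketched in plain Java (a hypothetical illustration, not any store's actual API; the key and field names are invented): the same key lookup returns an opaque string, a nested document, or a sorted column family, depending on the store.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: one <key, value> lookup, three value shapes.
public class ValueShapes {
    // Key-value store: the value is an opaque string.
    public static String fromKv() {
        Map<String, String> kv = new TreeMap<>();
        kv.put("user:42", "Patrick");
        return kv.get("user:42");
    }

    // Document store: the value is a nested, JSON-like map.
    public static Object fromDoc() {
        Map<String, Map<String, Object>> docs = new TreeMap<>();
        docs.put("user:42", Map.of("name", "Patrick", "talks", 3));
        return docs.get("user:42").get("name");
    }

    // Multi-column store: the value is a sorted column family.
    public static String firstColumn() {
        TreeMap<String, String> family = new TreeMap<>();
        family.put("name", "Patrick");
        family.put("city", "Bay Area");
        Map<String, TreeMap<String, String>> cols = new TreeMap<>();
        cols.put("user:42", family);
        return cols.get("user:42").firstKey(); // columns kept in sorted order
    }
}
```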
5. NoSQL: Document Stores
Key-value files (HDFS)
<key, value>
Distributed, replicable blocks of sequential key-value string pairs.
Key-value stores (Redis, Memcache)
<key*, value>
Language-independent, distributed, sorted key-value pairs (values
are lists, sets, or hashes) with in-memory caching and support for
atomic operations.
Document stores (MongoDB, CouchDB)
{ "k1": val1, "k2": val2 }
Fault-tolerant, document-centric stores using a dynamic schema of
sorted JavaScript (JSON) objects, with support for a limited SQL-like syntax.
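The "limited SQL-like syntax" idea can be sketched with plain Java collections (a hypothetical illustration; `DocQuery` and its field names are invented, not MongoDB's or CouchDB's API): documents need not share a schema, and a filter plays the role of a WHERE clause.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch: a SQL-like WHERE over schema-less documents.
public class DocQuery {
    public static List<Map<String, Object>> whereCityIs(
            List<Map<String, Object>> docs, String city) {
        // Documents without a "city" field simply do not match.
        return docs.stream()
                   .filter(d -> city.equals(d.get("city")))
                   .collect(Collectors.toList());
    }
}
```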
6. NoSQL: Tuples & Graphs
Sorted, ordered tuples (Cassandra, HBase, ...)
{ name: x, value: { key1: { name: key1, value: v1, tstamp: x }, key2: x } }
Fault-tolerant, distributed, sorted, ordered, grouped (by family)
'super-columns': maps of an unbounded number of columns.
Graph databases (Neo4j, GraphDB, InfiniteGraph, ...)
Efficient transactional traversal and storage of entities (vertices),
attributes, and relationships (edges).
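The super-column layout can be sketched as nested sorted maps (a plain-Java illustration, not Cassandra's or HBase's API; the row and family names are invented): a sorted row key maps to a family of columns, each holding a value and a timestamp.

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical sketch of a 'super-column' store as nested sorted maps.
public class SuperColumn {
    record Column(String value, long tstamp) {}

    // row key -> family name -> column name -> (value, timestamp)
    static SortedMap<String, SortedMap<String, SortedMap<String, Column>>> rows =
        new TreeMap<>();

    public static void put(String row, String family, String col,
                           String value, long ts) {
        rows.computeIfAbsent(row, r -> new TreeMap<>())
            .computeIfAbsent(family, f -> new TreeMap<>())
            .put(col, new Column(value, ts));
    }

    public static String get(String row, String family, String col) {
        return rows.get(row).get(family).get(col).value();
    }
}
```

Every level is sorted, which mirrors how such stores keep rows and columns ordered on disk for range scans.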
7. Data Flow Managers
Map and reduce tasks can be abstracted away by task or workflow
managers using high-level languages such as scripts, SQL, or a
UNIX-pipe-like API. These data-flow tools hide the functional
complexity of map-reduce from domain experts.
- Scripting: Pig
- SQL: Hive
- API (pipes & flows): Cascading
(Diagram: each tool compiles into a pipeline of map, combine, and reduce tasks.)
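The pipeline these tools generate can be sketched in plain Java, without Hadoop (an illustrative word count; all names are invented): each input split is mapped to (word, 1) pairs, pre-aggregated by a combiner, then merged in the reduce phase.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch (plain Java, no Hadoop) of map -> combine -> reduce.
public class MiniMapReduce {
    // Map phase: emit one (word, 1) pair per token of one input split,
    // pre-aggregated locally (the combine step).
    static Map<String, Integer> map(String split) {
        Map<String, Integer> pairs = new HashMap<>();
        for (String word : split.split("\\s+"))
            pairs.merge(word, 1, Integer::sum);
        return pairs;
    }

    // Reduce phase: merge the partial counts from every mapper.
    static Map<String, Integer> reduce(List<Map<String, Integer>> partials) {
        Map<String, Integer> totals = new HashMap<>();
        for (Map<String, Integer> p : partials)
            p.forEach((w, c) -> totals.merge(w, c, Integer::sum));
        return totals;
    }

    public static Map<String, Integer> wordCount(List<String> splits) {
        return reduce(splits.stream().map(MiniMapReduce::map).toList());
    }
}
```

In a real cluster the map and reduce calls run on different machines and the framework shuffles the pairs between them; the data flow itself is the same.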
8. Data Flow Code Samples
Pig Latin
A = LOAD 'mydata' USING PigStorage() AS (f1:int, name:chararray);
B = GROUP A BY f1;
C = FOREACH B GENERATE group, COUNT(A);
Hive
LOAD DATA LOCAL INPATH 'xxx' OVERWRITE INTO TABLE z;
INSERT OVERWRITE TABLE z SELECT f1, count(*) FROM y GROUP BY f1;
Cascading
Scheme srcScheme = new TextLine( new Fields("line"));
Tap src = new Hfs(srcScheme, inpath);
Tap sink = new Hfs(srcScheme, outpath);   // outpath: the output location
Pipe counter = new Pipe("count");
counter = new GroupBy(counter, new Fields("f1"));
counter = new Every(counter, new Count()); // count the rows in each group
FlowConnector connector = new FlowConnector(props);
Flow flow = connector.connect("count", src, sink, counter);
flow.complete();