hadoopsphere.com
Components that constitute the open source Apache Hadoop ecosystem
Summary and categorization of components available as Apache (ASF) projects/sub-projects and serving the Hadoop ecosystem. The document does not include other open source or commercial projects/products.
Contributed by: Sachin Ghai | @sachinghai
CORE LAYERS which constitute the Apache Hadoop ecosystem
PERSIST: File System & Data Store
• HDFS - Distributed file system that provides high-throughput access to application data. Comprises NameNode, Secondary NameNode and DataNodes
• HBase - Distributed, scalable, big data store
• Cassandra - Highly scalable, eventually consistent, distributed, structured key-value store
• Accumulo - Sorted, distributed key/value data storage and retrieval system
PERSIST: Serialization
• Avro - Data serialization system
• Trevni - Column file format designed to permit compatible, independent implementations that read and/or write files in this format
• Thrift - Framework for scalable cross-language services development
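To make the idea of a column file format more concrete, the sketch below contrasts a row-oriented and a column-oriented layout of the same records in plain Python. It does not use the Trevni or Avro APIs; the record fields and the helpers `to_columns` and `read_column` are hypothetical and exist only for illustration.

```python
# Hypothetical records; the field names are made up for illustration.
rows = [
    {"id": 1, "name": "alice", "clicks": 10},
    {"id": 2, "name": "bob", "clicks": 7},
    {"id": 3, "name": "carol", "clicks": 12},
]

def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for record in records:
        for field, value in record.items():
            columns.setdefault(field, []).append(value)
    return columns

def read_column(columns, field):
    """Return one column without touching the values of the others."""
    return columns[field]

columns = to_columns(rows)
total_clicks = sum(read_column(columns, "clicks"))
print(total_clicks)
```

A query that needs only `clicks` reads a single list here, which is the main reason analytical workloads tend to favour columnar layouts.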
RUN: Job Execution
• MapReduce - Framework for performing distributed data processing. Comprises JobTracker, TaskTracker and JobHistoryServer
• YARN - Framework that facilitates writing arbitrary distributed processing frameworks and applications
• Hama - Pure BSP (Bulk Synchronous Parallel) computing framework for massive scientific computations such as matrix, graph and network algorithms
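As a rough illustration of the MapReduce programming model named above, the following single-process Python sketch runs a word count through map, shuffle and reduce phases. It is not Hadoop code and does not involve JobTracker or TaskTracker; the function names `map_phase`, `shuffle_phase`, `reduce_phase` and `word_count` are assumptions chosen for this example.

```python
from collections import defaultdict

def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle_phase(pairs):
    """Group all emitted values by key, the step that sits between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)

def word_count(lines):
    """Run the three phases in sequence on an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

counts = word_count(["Hadoop stores data", "Hadoop processes data"])
print(counts)
```

In a real cluster the map and reduce calls would run in parallel on different nodes; the sketch only shows how the data flows between phases.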
MANAGE: Work
• Oozie - Workflow/coordination system to manage Hadoop jobs
• ZooKeeper - Centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services
MANAGE: Dev
• Crunch - Framework for writing, testing, and running MapReduce pipelines
• MRUnit - Java library that helps developers unit test Apache Hadoop MapReduce jobs
• HDT - Hadoop Development Tools (HDT) comprise Eclipse-based tools for developing applications on the Hadoop platform
MANAGE: Ops
• Ambari - Web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters
• Vaidya - Performance diagnostic tool for MapReduce jobs
• Bigtop - Project for the development of packaging and tests, ensuring interoperability among Apache Hadoop-related projects
• Whirr - Set of libraries for running cloud services, such as Hadoop clusters on EC2
SECURE:
• Knox - System that provides a single point of secure access for Apache Hadoop clusters
TRANSFER:
• Flume - Distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data
• Sqoop - Tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases
• Chukwa - Open source data collection system for monitoring large distributed systems
• Kafka - Distributed publish-subscribe messaging system
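The publish-subscribe pattern mentioned for Kafka can be sketched with a toy in-memory broker. This is not the Kafka API and it is neither distributed nor durable; the `Broker` class and its `publish` and `poll` methods are hypothetical names. It only shows the core idea that publishers append to a per-topic log while each subscriber tracks its own read position.

```python
from collections import defaultdict

class Broker:
    """Toy in-memory broker: one append-only log per topic."""

    def __init__(self):
        self.logs = defaultdict(list)    # topic -> list of messages
        self.offsets = defaultdict(int)  # (subscriber, topic) -> next index to read

    def publish(self, topic, message):
        """Append a message to the end of the topic's log."""
        self.logs[topic].append(message)

    def poll(self, subscriber, topic):
        """Return messages this subscriber has not seen yet and advance its offset."""
        start = self.offsets[(subscriber, topic)]
        messages = self.logs[topic][start:]
        self.offsets[(subscriber, topic)] = len(self.logs[topic])
        return messages

broker = Broker()
broker.publish("logs", "node1 started")
broker.publish("logs", "node2 started")
first = broker.poll("monitor", "logs")    # both messages published so far
broker.publish("logs", "node1 stopped")
second = broker.poll("monitor", "logs")   # only the new message
audit = broker.poll("audit", "logs")      # a new subscriber reads from the start
```

Because offsets are kept per subscriber, a slow or late consumer does not affect what other consumers see.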
ATMOSPHERIC LAYERS which build up the capabilities beyond the core of the Apache Hadoop ecosystem
HARDWARE:
• Commodity Hardware - Low-cost, easily available hardware working in parallel
Note: No appliances are known to run on a pure Apache Hadoop distribution; SSDs and other cheap hardware options are not mentioned separately here.
DATA INTERACTIONS:
• Pig - Platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs
• Hive - Data warehouse system that facilitates easy data summarization, ad-hoc queries and analysis of large datasets stored in Hadoop-compatible file systems
DATA INTERACTIONS (continued):
• HCatalog - Table and storage management service for data created using Apache Hadoop
• Tez - Generic application framework which can be used to process complex data-processing task DAGs and runs natively on Apache Hadoop YARN
• Gora - Framework for in-memory data model and persistence with MapReduce support
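The notion of a DAG of data-processing tasks, as described for Tez above, can be illustrated by ordering tasks so that each one runs only after its prerequisites. The sketch below uses Kahn's algorithm in plain Python; it is not the Tez API, and the task names and the `topological_order` helper are hypothetical.

```python
from collections import deque

def topological_order(dependencies):
    """Kahn's algorithm: return tasks so each one follows all of its prerequisites."""
    remaining = {task: set(prereqs) for task, prereqs in dependencies.items()}
    ready = deque(sorted(task for task, prereqs in remaining.items() if not prereqs))
    order = []
    while ready:
        task = ready.popleft()
        order.append(task)
        # Mark this task as done for every task that was waiting on it.
        for other, prereqs in remaining.items():
            if task in prereqs:
                prereqs.remove(task)
                if not prereqs:
                    ready.append(other)
    if len(order) != len(remaining):
        raise ValueError("dependency cycle detected")
    return order

# Hypothetical pipeline: each key lists the tasks that must finish before it.
dag = {
    "load_users": set(),
    "load_orders": set(),
    "join": {"load_users", "load_orders"},
    "aggregate": {"join"},
    "report": {"aggregate"},
}
executed = topological_order(dag)
print(executed)
```

A real execution engine would also run independent tasks (here the two load steps) in parallel; the sketch only derives a valid sequential order.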
ANALYTICS & INTELLIGENCE:
• Mahout - Scalable machine learning and data mining algorithm library. Supports recommendation mining, clustering, classification and frequent itemset mining
• Drill - Distributed system for interactive analysis of large-scale datasets. Comprises user interface (CLI, REST), pluggable query language and pluggable data source
DISCOVERY & VISUALIZATION:
• Lucene - Open-source search software including the Java-based indexing and search component Lucene Core and the high-performance search server component Solr
• Blur - Search engine capable of querying massive amounts of structured data at incredible speeds in a cloud computing environment
DISCOVERY & VISUALIZATION (continued):
• Giraph - Graph-processing framework leveraging existing Hadoop infrastructure. Follows the bulk synchronous parallel model to run large-scale algorithms. Supports directed, undirected, weighted and unweighted graphs as well as multigraphs
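The bulk synchronous parallel model that Giraph follows proceeds in supersteps: each vertex processes the messages it received, optionally sends new ones, and then all vertices wait at a barrier. The single-process sketch below propagates the maximum vertex value through a small graph in that style. It is not Giraph code; the function `bsp_max_value` and the example graph are assumptions made for illustration.

```python
def bsp_max_value(graph, values):
    """Propagate the maximum vertex value through an undirected graph in supersteps.

    graph maps each vertex to a list of neighbours; values maps each vertex
    to its initial number. Returns the final values and the superstep count.
    """
    values = dict(values)
    # Superstep 0: every vertex sends its value to its neighbours.
    inbox = {vertex: [] for vertex in graph}
    for vertex, neighbours in graph.items():
        for neighbour in neighbours:
            inbox[neighbour].append(values[vertex])
    supersteps = 1
    while any(inbox.values()):
        outbox = {vertex: [] for vertex in graph}
        for vertex, messages in inbox.items():
            # Compute phase: a vertex only acts if it received a larger value.
            if messages and max(messages) > values[vertex]:
                values[vertex] = max(messages)
                for neighbour in graph[vertex]:
                    outbox[neighbour].append(values[vertex])
        # Barrier: messages sent in this superstep become visible only in the next one.
        inbox = outbox
        supersteps += 1
    return values, supersteps

# Hypothetical chain graph a - b - c - d.
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
final_values, steps = bsp_max_value(graph, {"a": 3, "b": 6, "c": 2, "d": 1})
print(final_values, steps)
```

The computation stops once a superstep produces no messages, which mirrors the way vertices vote to halt in BSP-style graph frameworks.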
Note: No pure visualization projects are currently part of the ASF.
APPLICATION DOMAINS:
• Distribution - Includes applications in travel, transport, FMCG and supply chain, e.g. Expedia
• Financial - Includes applications in finance, banking and insurance, e.g. Visa
• Government - Includes applications in government and public sector, e.g. Aadhaar (India ID card)
• Heavy Industry - Includes applications in heavy industrial business including electronics, auto and aircraft, e.g. Hitachi
APPLICATION DOMAINS (continued):
• Internet - Includes new-age internet applications including social media and content distribution, e.g. Facebook
• Oil & Energy - Includes applications in upstream/downstream oil and gas business along with those in the energy sector, e.g. Chevron
• Research - Includes applications in new research, e.g. network analysis & security
• Telecom - Includes applications in telecom business, e.g. Korea Telecom
References:
• www.apache.org
• http://blogs.gartner.com/merv-adrian/2013/02/21/hadoop
Image courtesy:
• Slide 1: Getty Images #84480368 Dorling Kindersley (free thumbnail copy)
• Other images: Original source could not be established
About the document:
• Voluntarily contributed by: Sachin Ghai (@sachinghai)
• Publisher: hadoopsphere.com
• Version: 1.0
• Date: 11 March 2013
• Copyright: 2013, All Rights Reserved
• Note: Parts of the document do not use official terminology
• Contact: Use the ‘Contact’ menu option on www.hadoopsphere.com
• Disclaimer: The project names mentioned in this document are either registered trademarks or trademarks of the Apache Software Foundation in the United States. The Apache Software Foundation has no affiliation with and does not endorse or review the materials provided in this document.