The document provides an overview of the Apache Hadoop ecosystem. It describes Hadoop as a distributed, scalable storage and computation system based on Google's architecture. The ecosystem includes many related projects that interact, such as YARN, HDFS, Impala, Avro, Crunch, and HBase. These projects innovate independently but work together, with Hadoop serving as a flexible data platform at the core.
The Apache Hadoop Ecosystem and Emerging Technologies
1. The Apache Hadoop Ecosystem
Eli Collins
Software Engineer, Cloudera
Hadoop Committer/PMC, Apache Software Foundation
@elicollins
2. This talk
My perspective on Hadoop & its ecosystem
A look at some new, interesting parts of the ecosystem,
including slides I cribbed from Doug
3. What is Hadoop?
Hadoop is a distributed, reliable, scalable, flexible storage and
computation system.
It’s based on the architecture and designs of systems
developed at Google (thanks, Google!)
4. Another perspective
Also a generalization of more specialized systems...
• Parallel databases and data warehouses
• Parallel programming (HPC, Beowulf clusters)
• Distributed storage & parallel file systems
• High performance analytics
• Log, stream, event & ETL processing systems
5. Yet another perspective
Plat·form (-noun): a hardware
architecture and software framework
for building applications
Also, a place to launch software, so Hadoop is really the
kernel of a “data OS” or “data platform”
6. Last perspective
Like a data warehouse, but…
• More data
• More kinds of data
• More flexible analysis
• Open Source
• Industry standard hardware
• More economical
7. Why now?
Data Growth
[Chart: data growth, 1980 to 2013. Unstructured data: 80%. Structured data: 20%.]
8. Digression …what’s it for?
Data processing – Search index building, log
processing, click stream sessionization, Telco/POS
processing, trade reconciliation, genetics, ETL
processing, image processing, etc.
Analytics – Ad-hoc queries, reporting, fraud analysis,
ML, forecasting, infra management, etc.
Real-time serving, if you’re brave.
10. The Hadoop ecosystem
ec·o·sys·tem (-noun): a system of
interconnecting and interacting parts
• Not centrally planned - interaction and feedback loop
• Components leverage each other, deps & conventions
• Components co-designed in parallel, over time
• Components innovate individually & quickly
• Boundaries are not fixed
11. An example interaction
[Diagram: the component stack behind a query]
• Query execution: Impala (reads table metadata from Hive)
• File formats: Avro; K/V storage: HBase; Coordination: ZooKeeper
• File storage: HDFS
• Common: Hadoop auth, codecs, RPC, etc., plus 3rd-party libraries like Google Protocol Buffers, Snappy, etc.
12. What are the implications?
Highly adaptable (itself & co-located systems)
Hadoop grows incrementally
Highly parallel development, e.g. “rule of three”
Complex system
Integration is key
Manage change over time
Open source > open standards
14. Hadoop YARN (Yet Another Resource Negotiator)
• Generic scheduler for distributed applications
• Not just MapReduce applications
• Consists of:
• Resource Manager (per cluster)
• Node Manager (per machine)
• Runs Application Masters (per job)
• Runs Application Containers (per task)
• In Hadoop 2.0
• Replaces the JobTracker and TaskTracker (aka MR1)
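To make the client side concrete, here is a minimal sketch, assuming the Hadoop 2.x YarnClient API and a yarn-site.xml on the classpath; the class name ListYarnApps is made up for illustration and is not from the talk.

import java.util.List;

import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

// Hypothetical example: list the applications the Resource Manager knows about.
public class ListYarnApps {
  public static void main(String[] args) throws Exception {
    YarnClient client = YarnClient.createYarnClient();
    client.init(new YarnConfiguration()); // picks up yarn-site.xml from the classpath
    client.start();
    // Ask the Resource Manager for all applications, MapReduce or otherwise.
    List<ApplicationReport> apps = client.getApplications();
    for (ApplicationReport app : apps) {
      System.out.println(app.getApplicationId() + " " + app.getName()
          + " " + app.getYarnApplicationState());
    }
    client.stop();
  }
}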
15. HDFS HA: Automatic failover and QJM
[Diagram: an active NameNode and a standby NameNode each use a QuorumJournalManager to write edits to three JournalNodes, each backed by its own local disk]
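A sketch of how this is wired up in hdfs-site.xml, assuming a nameservice called mycluster and placeholder host names; none of these values come from the talk.

<!-- Hypothetical hdfs-site.xml fragment for HA with QJM -->
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <!-- Two NameNodes: one active, one standby -->
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>
<property>
  <!-- Both NameNodes write edits to a quorum of three JournalNodes -->
  <name>dfs.namenode.shared.edits.dir</name>
  <value>qjournal://jn1:8485;jn2:8485;jn3:8485/mycluster</value>
</property>
<property>
  <!-- Fail over automatically rather than by hand -->
  <name>dfs.ha.automatic-failover.enabled</name>
  <value>true</value>
</property>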
16. Impala: a modern SQL engine for Hadoop
• General purpose SQL engine
• Supports queries that take from milliseconds to hours
• Runs directly within Hadoop
• Reads widely used Hadoop formats
• Talks to widely used Hadoop storage managers
• Runs on the same Hadoop nodes
• High Performance
• Completely new engine (no MR)
• Runtime code generation
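A hedged sketch of what querying Impala looks like from Java: Impala speaks the HiveServer2 protocol, so the standard Hive JDBC driver can connect to it. The host name, table name, and default impalad HiveServer2 port (21050) below are assumptions for illustration.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Hypothetical example: run one aggregate query against an impalad.
public class ImpalaQuery {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    try (Connection conn = DriverManager.getConnection(
             "jdbc:hive2://impalad-host:21050/;auth=noSasl"); // unsecured cluster assumed
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM logs")) {
      while (rs.next()) {
        System.out.println("rows: " + rs.getLong(1));
      }
    }
  }
}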
17. Avro: a format for big data
• Expressive
• Records, arrays, unions, enums
• Efficient
• Compact binary, compressed, splittable
• Interoperable
• Langs: C, C++, C#, Java, Perl, Python, Ruby, PHP
• Tools: MR, Pig, Hive, Crunch, Flume, Sqoop, etc.
• Dynamic
• Can read & write w/o generating code first
• Evolvable
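To show the dynamic point in code, here is a small sketch using Avro's generic API, which writes and reads records without any generated classes; the User schema and field values are invented for the example.

import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroGenericExample {
  public static void main(String[] args) throws Exception {
    // Hypothetical schema, defined inline as JSON: no code generation step.
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
      + "{\"name\":\"name\",\"type\":\"string\"},"
      + "{\"name\":\"clicks\",\"type\":\"long\"}]}");

    GenericRecord user = new GenericData.Record(schema);
    user.put("name", "eli");
    user.put("clicks", 42L);

    // Write a compact, splittable Avro data file.
    File file = new File("users.avro");
    DataFileWriter<GenericRecord> writer =
        new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>(schema));
    writer.create(schema, file);
    writer.append(user);
    writer.close();

    // Read it back; the schema is embedded in the file itself.
    DataFileReader<GenericRecord> reader =
        new DataFileReader<GenericRecord>(file, new GenericDatumReader<GenericRecord>());
    for (GenericRecord rec : reader) {
      System.out.println(rec.get("name") + ": " + rec.get("clicks"));
    }
    reader.close();
  }
}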
18. Crunch
• An API for MapReduce
• Alternative to Pig and Hive
• Inspired by Google’s FlumeJava paper
• In Java (& Scala)
• Easier to integrate application logic
• With a full programming language
• Concepts
• PCollection: set of values w/ parallelDo operator
• PTable: key/value mapping w/ groupBy operator
• Pipeline: executor that runs MapReduce jobs
19. Crunch Word Count
import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;
import org.apache.crunch.PCollection;
import org.apache.crunch.PTable;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.lib.Aggregate;
import org.apache.crunch.types.writable.Writables;

public class WordCount {
  public static void main(String[] args) throws Exception {
    Pipeline pipeline = new MRPipeline(WordCount.class);
    PCollection<String> lines = pipeline.readTextFile(args[0]);
    // Split each line into words, in parallel.
    PCollection<String> words = lines.parallelDo("my splitter", new DoFn<String, String>() {
      public void process(String line, Emitter<String> emitter) {
        for (String word : line.split("\\s+")) {
          emitter.emit(word);
        }
      }
    }, Writables.strings());
    // Group identical words and count the occurrences of each.
    PTable<String, Long> counts = Aggregate.count(words);
    pipeline.writeTextFile(counts, args[1]);
    pipeline.run();
  }
}
20. Scrunch Word Count
class WordCountExample {
  val pipeline = new Pipeline[WordCountExample]

  def wordCount(fileName: String) = {
    pipeline.read(from.textFile(fileName))
      .flatMap(_.toLowerCase.split("\\W+"))
      .filter(!_.isEmpty())
      .count
  }
}