The document provides an overview of the Apache Hadoop ecosystem. It describes Hadoop as a distributed, scalable storage and computation system based on Google's architecture. The ecosystem includes many related projects that interact, such as YARN, HDFS, Impala, Avro, Crunch, and HBase. These projects innovate independently but work together, with Hadoop serving as a flexible data platform at the core.
The Apache Hadoop Ecosystem and Emerging Technologies
1. The Apache Hadoop Ecosystem
Eli Collins
Software Engineer, Cloudera
Hadoop Committer/PMC, Apache Software Foundation
@elicollins
2. This talk
My perspective on Hadoop & its ecosystem
A look at some new, interesting parts of the ecosystem,
including slides I cribbed from Doug
3. What is Hadoop?
Hadoop is a distributed, reliable, scalable, flexible storage and
computation system.
It’s based on the architecture and designs of systems
developed at Google (thanks, Google!)
4. Another perspective
Also a generalization of more specialized systems...
• Parallel databases and data warehouses
• Parallel programming (HPC, Beowulf clusters)
• Distributed storage & parallel file systems
• High performance analytics
• Log, stream, event & ETL processing systems
5. Yet another perspective
Plat·form (-noun): a hardware
architecture and software framework
for building applications
Also, a place to launch software, so Hadoop is really the
kernel of a “data OS” or “data platform”
6. Last perspective
Like a data warehouse, but…
• More data
• More kinds of data
• More flexible analysis
• Open Source
• Industry standard hardware
• More economical
7. Why now?
Data Growth
[Chart: data growth, 1980 to 2013. Unstructured data: 80%. Structured data: 20%.]
8. Digression …what’s it for?
Data processing – Search index building, log
processing, click stream sessionization, Telco/POS
processing, trade reconciliation, genetics, ETL
processing, image processing, etc.
Analytics – Ad-hoc queries, reporting, fraud analysis,
ML, forecasting, infra management, etc.
Real-time serving, if you’re brave.
10. The Hadoop ecosystem
ec·o·sys·tem (-noun): a system of
interconnecting and interacting parts
• Not centrally planned - interaction and feedback loop
• Components leverage each other, deps & conventions
• Components co-designed in parallel, over time
• Components innovate individually & quickly
• Boundaries are not fixed
11. An example interaction
[Diagram: the component stack behind a query]
• Query execution: Impala (reads table metadata from Hive)
• File formats: Avro; K/V storage: HBase; Coordination: ZooKeeper
• File storage: HDFS
• Common: Hadoop auth, codecs, RPC, etc., plus 3rd-party libraries like Google Protocol Buffers, Snappy, etc.
12. What are the implications?
Highly adaptable (itself & co-located systems)
Hadoop grows incrementally
Highly parallel development, e.g. “rule of three”
Complex system
Integration is key
Manage change over time
Open source > open standards
14. Hadoop YARN (Yet Another Resource Negotiator)
• Generic scheduler for distributed applications
• Not just MapReduce applications
• Consists of:
• Resource Manager (per cluster)
• Node Manager (per machine)
• Runs Application Masters (per job)
• Runs Application Containers (per task)
• In Hadoop 2.0
• Replaces the JobTracker and TaskTracker (aka MR1)
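To make the client side concrete, here is a minimal sketch, assuming the Hadoop 2.x YarnClient API and a yarn-site.xml on the classpath; the class name ListYarnApps is made up for illustration and is not from the talk.

import java.util.List;

import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

// Hypothetical example: list the applications the Resource Manager knows about.
public class ListYarnApps {
  public static void main(String[] args) throws Exception {
    YarnClient client = YarnClient.createYarnClient();
    client.init(new YarnConfiguration()); // picks up yarn-site.xml from the classpath
    client.start();
    // Ask the Resource Manager for all applications, MapReduce or otherwise.
    List<ApplicationReport> apps = client.getApplications();
    for (ApplicationReport app : apps) {
      System.out.println(app.getApplicationId() + " " + app.getName()
          + " " + app.getYarnApplicationState());
    }
    client.stop();
  }
}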
15. HDFS HA: Automatic failover and QJM
[Diagram: an active NameNode and a standby NameNode each use a QuorumJournalManager to write edits to three JournalNodes, each backed by its own local disk]
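A sketch of how this is wired up in hdfs-site.xml, assuming a nameservice called mycluster and placeholder host names; none of these values come from the talk.

<!-- Hypothetical hdfs-site.xml fragment for HA with QJM -->
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <!-- Two NameNodes: one active, one standby -->
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>
<property>
  <!-- Both NameNodes write edits to a quorum of three JournalNodes -->
  <name>dfs.namenode.shared.edits.dir</name>
  <value>qjournal://jn1:8485;jn2:8485;jn3:8485/mycluster</value>
</property>
<property>
  <!-- Fail over automatically rather than by hand -->
  <name>dfs.ha.automatic-failover.enabled</name>
  <value>true</value>
</property>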
16. Impala: a modern SQL engine for Hadoop
• General purpose SQL engine
• Supports queries that take from milliseconds to hours
• Runs directly within Hadoop
• Reads widely used Hadoop formats
• Talks to widely used Hadoop storage managers
• Runs on the same Hadoop nodes
• High Performance
• Completely new engine (no MR)
• Runtime code generation
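A hedged sketch of what querying Impala looks like from Java: Impala speaks the HiveServer2 protocol, so the standard Hive JDBC driver can connect to it. The host name, table name, and default impalad HiveServer2 port (21050) below are assumptions for illustration.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Hypothetical example: run one aggregate query against an impalad.
public class ImpalaQuery {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    try (Connection conn = DriverManager.getConnection(
             "jdbc:hive2://impalad-host:21050/;auth=noSasl"); // unsecured cluster assumed
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM logs")) {
      while (rs.next()) {
        System.out.println("rows: " + rs.getLong(1));
      }
    }
  }
}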
17. Avro: a format for big data
• Expressive
• Records, arrays, unions, enums
• Efficient
• Compact binary, compressed, splittable
• Interoperable
• Langs: C, C++, C#, Java, Perl, Python, Ruby, PHP
• Tools: MR, Pig, Hive, Crunch, Flume, Sqoop, etc.
• Dynamic
• Can read & write w/o generating code first
• Evolvable
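To show the dynamic point in code, here is a small sketch using Avro's generic API, which writes and reads records without any generated classes; the User schema and field values are invented for the example.

import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroGenericExample {
  public static void main(String[] args) throws Exception {
    // Hypothetical schema, defined inline as JSON: no code generation step.
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
      + "{\"name\":\"name\",\"type\":\"string\"},"
      + "{\"name\":\"clicks\",\"type\":\"long\"}]}");

    GenericRecord user = new GenericData.Record(schema);
    user.put("name", "eli");
    user.put("clicks", 42L);

    // Write a compact, splittable Avro data file.
    File file = new File("users.avro");
    DataFileWriter<GenericRecord> writer =
        new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>(schema));
    writer.create(schema, file);
    writer.append(user);
    writer.close();

    // Read it back; the schema is embedded in the file itself.
    DataFileReader<GenericRecord> reader =
        new DataFileReader<GenericRecord>(file, new GenericDatumReader<GenericRecord>());
    for (GenericRecord rec : reader) {
      System.out.println(rec.get("name") + ": " + rec.get("clicks"));
    }
    reader.close();
  }
}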
18. Crunch
• An API for MapReduce
• Alternative to Pig and Hive
• Inspired by Google’s FlumeJava paper
• In Java (& Scala)
• Easier to integrate application logic
• With a full programming language
• Concepts
• PCollection: set of values w/ parallelDo operator
• PTable: key/value mapping w/ groupBy operator
• Pipeline: executor that runs MapReduce jobs
19. Crunch Word Count
import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;
import org.apache.crunch.PCollection;
import org.apache.crunch.PTable;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.lib.Aggregate;
import org.apache.crunch.types.writable.Writables;

public class WordCount {
  public static void main(String[] args) throws Exception {
    Pipeline pipeline = new MRPipeline(WordCount.class);
    PCollection<String> lines = pipeline.readTextFile(args[0]);
    // Split each line into words, in parallel.
    PCollection<String> words = lines.parallelDo("my splitter", new DoFn<String, String>() {
      public void process(String line, Emitter<String> emitter) {
        for (String word : line.split("\\s+")) {
          emitter.emit(word);
        }
      }
    }, Writables.strings());
    // Group identical words and count the occurrences of each.
    PTable<String, Long> counts = Aggregate.count(words);
    pipeline.writeTextFile(counts, args[1]);
    pipeline.run();
  }
}
20. Scrunch Word Count
class WordCountExample {
  val pipeline = new Pipeline[WordCountExample]

  def wordCount(fileName: String) = {
    pipeline.read(from.textFile(fileName))
      .flatMap(_.toLowerCase.split("\\W+"))
      .filter(!_.isEmpty())
      .count
  }
}