(More) Apache Hadoop
Philip Zeyliger (Math, Dunster ‘04)
philip@cloudera.com
@philz42 @cloudera
October 19, 2009
CS 264
Who am I?
Software Engineer
Zak’s classmate
Worked at
(Interns)
Outline
Review of last Wednesday
Your Homework
Data Warehousing
Some Hadoop Internals
Research & Hadoop
Short Break
Last Wednesday
The Basics
Clusters, not
individual machines
Scale Linearly
Separate App Code
from Fault-Tolerant
Distributed Systems
Code
Systems Programmers vs. Statisticians
Some Big Numbers
Yahoo! Hadoop Clusters: > 82PB, >25k machines
(Eric14, HadoopWorld NYC ’09)
Google: 40 GB/s GFS read/write load (Jeff Dean,
LADIS ’09) [~3,500 TB/day]
Facebook: 4TB new data per day; DW: 4800 cores, 5.5
PB (Dhruba Borthakur, HadoopWorld)
Physical Flow
M-R Model
Logical
Physical
Important APIs
M/R Flow:
Input Format: data→K₁,V₁
Mapper: K₁,V₁→K₂,V₂
Combiner: K₂,iter(V₂)→K₂,V₂
Partitioner: K₂,V₂→int
Reducer: K₂,iter(V₂)→K₃,V₃
Out. Format: K₃,V₃→data
(→ is 1:many)
Other: Writable, JobClient, *Context, Filesystem
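To make those signatures concrete, here is a minimal word-count-style Mapper and Reducer against the old org.apache.hadoop.mapred API that the grep example on the next slides also uses. This is a hedged sketch, not lecture code; the class names and whitespace tokenization are illustrative.

// Mapper: K₁,V₁ → K₂,V₂  (here LongWritable,Text → Text,LongWritable; one line in, many pairs out)
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class TokenCountMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, LongWritable> {
  private static final LongWritable ONE = new LongWritable(1);
  public void map(LongWritable offset, Text line,
                  OutputCollector<Text, LongWritable> out, Reporter reporter)
      throws IOException {
    for (String token : line.toString().split("\\s+")) {   // “→ is 1:many”
      if (!token.isEmpty()) out.collect(new Text(token), ONE);
    }
  }
}

// Reducer: K₂,iter(V₂) → K₃,V₃ — and since K₃,V₃ equal K₂,V₂ here, it can double as the Combiner
class TokenCountReducer extends MapReduceBase
    implements Reducer<Text, LongWritable, Text, LongWritable> {
  public void reduce(Text token, Iterator<LongWritable> counts,
                     OutputCollector<Text, LongWritable> out, Reporter reporter)
      throws IOException {
    long sum = 0;
    while (counts.hasNext()) sum += counts.next().get();
    out.collect(token, new LongWritable(sum));
  }
}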
public int run(String[] args) throws Exception {
  if (args.length < 3) {
    System.out.println("Grep <inDir> <outDir> <regex> [<group>]");
    ToolRunner.printGenericCommandUsage(System.out);
    return -1;
  }

  Path tempDir = new Path("grep-temp-" +
      Integer.toString(new Random().nextInt(Integer.MAX_VALUE)));

  JobConf grepJob = new JobConf(getConf(), Grep.class);
  try {
    grepJob.setJobName("grep-search");
    FileInputFormat.setInputPaths(grepJob, args[0]);
    grepJob.setMapperClass(RegexMapper.class);
    grepJob.set("mapred.mapper.regex", args[2]);
    if (args.length == 4)
      grepJob.set("mapred.mapper.regex.group", args[3]);
    grepJob.setCombinerClass(LongSumReducer.class);
    grepJob.setReducerClass(LongSumReducer.class);
    FileOutputFormat.setOutputPath(grepJob, tempDir);
    grepJob.setOutputFormat(SequenceFileOutputFormat.class);
    grepJob.setOutputKeyClass(Text.class);
    grepJob.setOutputValueClass(LongWritable.class);
    JobClient.runJob(grepJob);

    JobConf sortJob = new JobConf(Grep.class);
    sortJob.setJobName("grep-sort");
    FileInputFormat.setInputPaths(sortJob, tempDir);
    sortJob.setInputFormat(SequenceFileInputFormat.class);
    sortJob.setMapperClass(InverseMapper.class);
    // write a single file
    sortJob.setNumReduceTasks(1);
    FileOutputFormat.setOutputPath(sortJob, new Path(args[1]));
    // sort by decreasing freq
    sortJob.setOutputKeyComparatorClass(
        LongWritable.DecreasingComparator.class);
    JobClient.runJob(sortJob);
  } finally {
    FileSystem.get(grepJob).delete(tempDir, true);
  }
  return 0;
}
the “grep” example
$ cat input.txt
adams dunster kirkland dunster
kirland dudley dunster
adams dunster winthrop
$ bin/hadoop jar hadoop-0.18.3-
examples.jar grep input.txt output1
'dunster|adams'
$ cat output1/part-00000
4 dunster
2 adams
JobConf grepJob = new JobConf(getConf(), Grep.class);
try {
grepJob.setJobName("grep-search");
FileInputFormat.setInputPaths(grepJob, args[0]);
grepJob.setMapperClass(RegexMapper.class);
grepJob.set("mapred.mapper.regex", args[2]);
if (args.length == 4)
grepJob.set("mapred.mapper.regex.group", args[3]);
grepJob.setCombinerClass(LongSumReducer.class);
grepJob.setReducerClass(LongSumReducer.class);
FileOutputFormat.setOutputPath(grepJob, tempDir);
grepJob.setOutputFormat(SequenceFileOutputFormat.class);
grepJob.setOutputKeyClass(Text.class);
grepJob.setOutputValueClass(LongWritable.class);
JobClient.runJob(grepJob);
} ...
Job
1 of 2
JobConf sortJob = new JobConf(Grep.class);
sortJob.setJobName("grep-sort");
FileInputFormat.setInputPaths(sortJob, tempDir);
sortJob.setInputFormat(SequenceFileInputFormat.class);
sortJob.setMapperClass(InverseMapper.class);
// write a single file
sortJob.setNumReduceTasks(1);
FileOutputFormat.setOutputPath(sortJob, new Path(args[1]));
// sort by decreasing freq
sortJob.setOutputKeyComparatorClass(
LongWritable.DecreasingComparator.class);
JobClient.runJob(sortJob);
} finally {
FileSystem.get(grepJob).delete(tempDir, true);
}
return 0;
}
Job
2 of 2
(implicit identity reducer)
The types there...
?,Text
Text, Long
Long,Text
Text, list(Long)
Text, Long
A Simple Join

People (key: Id):
Id  Last        First
1   Washington  George
2   Lincoln     Abraham

Entry Log (key: Id):
Location  Id  Time
Dunster   1   11:00am
Dunster   2   11:02am
Kirkland  2   11:08am

You want to track individuals throughout the day.
How would you do this in M/R, if you had to?
(white-board)
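One plausible whiteboard answer, sketched here (not from the lecture) as a reduce-side join in the same old API: the map phase tags each record with its source table and keys it by Id; the reduce phase then sees all records for one Id together and pairs the person with each log entry. The tab-separated field parsing below is purely illustrative.

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

// Map: emit (Id, tagged record) for rows from either table.
public class JoinMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {
  public void map(LongWritable off, Text line,
                  OutputCollector<Text, Text> out, Reporter r) throws IOException {
    String[] f = line.toString().split("\t");
    if (f[0].matches("\\d+"))                      // People row: Id, Last, First
      out.collect(new Text(f[0]), new Text("P\t" + f[1] + " " + f[2]));
    else                                           // Entry Log row: Location, Id, Time
      out.collect(new Text(f[1]), new Text("L\t" + f[0] + "\t" + f[2]));
  }
}

// Reduce: everything for one Id arrives together; remember the person, emit one row per sighting.
class JoinReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {
  public void reduce(Text id, Iterator<Text> vals,
                     OutputCollector<Text, Text> out, Reporter r) throws IOException {
    String person = "?";
    List<String> sightings = new ArrayList<String>();
    while (vals.hasNext()) {
      String[] v = vals.next().toString().split("\t", 2);
      if (v[0].equals("P")) person = v[1]; else sightings.add(v[1]);
    }
    for (String s : sightings) out.collect(id, new Text(person + "\t" + s));
  }
}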
Your Homework
(this is the only lolcat in this lecture)
Mental Challenges
Learn an
algorithm
Adapt it to M/R
Model
Practical Challenges
Learn Finicky
Software
Debug an unfamiliar
environment
Implement PageRank over Wikipedia Pages
Tackle Parts Separately
Algorithm
Implementing in M/R
(What are the type signatures?)
Starting a cluster on EC2
Small dataset
Large dataset
Advice
More Advice
Wealth of “Getting Started”
materials online
Feel free to work together
Don’t be a perfectionist about
it; data is dirty!
if (____ ≫ Java), use “streaming”
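If your favorite language really is ≫ Java, Hadoop Streaming lets the mapper and reducer be any executables that read lines on stdin and write tab-separated key/value pairs on stdout. A hedged sketch of the invocation against the 0.18.x layout used earlier; the jar path and script names are illustrative:

$ bin/hadoop jar contrib/streaming/hadoop-0.18.3-streaming.jar \
    -input input.txt -output output1 \
    -mapper my_mapper.py -reducer my_reducer.py \
    -file my_mapper.py -file my_reducer.py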
Good Luck!
Data Warehousing
101
What is DW?
a.k.a. BI “Business Intelligence”
Provides data to support decisions
Not the operational/transactional
database
e.g., answers “what has our inventory
been over time?”, not “what is our
inventory now?”
Why DW?
Learn from data
Reporting
Ad-hoc analysis
e.g.: which trail mix should TJ’s
discontinue? (and other important
business questions)
[Slide background: Trader Joe’s Fearless Flyer excerpt (chocolate chip granola bars, Cherry Cider) — decorative, not lecture content]
Traditionally...
Big databases
Schemas
Dimensional Modelling (Ralph Kimball)
Magnetic
Agile
Deep
“MAD Skills”
MAD Skills: New Analysis Practices for Big Data
Jeffrey Cohen (Greenplum), Brian Dolan (Fox Interactive Media), Mark Dunlap (Evergreen Technologies), Joseph M. Hellerstein (U.C. Berkeley), Caleb Welton (Greenplum)
ABSTRACT
As massive data acquisition and storage becomes increasingly affordable, a wide variety of enterprises are employing statisticians to engage in sophisticated data analysis. In this paper we highlight the emerging practice of Magnetic, Agile, Deep (MAD) data analysis as a radical departure from traditional Enterprise Data Warehouses and Business Intelligence. We present our design philosophy, techniques and experience providing MAD analytics for one of the world’s largest advertising networks at Fox Interactive Media, using the Greenplum parallel database system. We describe database design methodologies that support the agile working style of analysts in these settings. We present data-parallel algorithms for sophisticated statistical techniques, with a focus on density methods. Finally, we reflect on database system features that enable agile design and flexible algorithm development using both SQL and MapReduce interfaces over a variety of storage mechanisms.
1. INTRODUCTION
If you are looking for a career where your services will be in high demand, you should find something where you provide a scarce, complementary service to something that is getting ubiquitous and cheap. So what’s getting ubiquitous and cheap? Data. And what is complementary to data? Analysis.
– Prof. Hal Varian, UC Berkeley, Chief Economist at Google [5]

mad (adj.): an adjective used to enhance a noun. 1- dude, you got skills. 2- dude, you got mad skills.
– UrbanDictionary.com [22]

Standard business practices for large-scale data analysis center on the notion of an “Enterprise Data Warehouse” (EDW) that is queried by “Business Intelligence” (BI) software. BI tools produce reports and interactive interfaces that summarize [...] into groups. This was the topic of significant academic research and industrial development throughout the 1990’s.

Traditionally, a carefully designed EDW is considered to have a central role in good IT practice. The design and evolution of a comprehensive EDW schema serves as the rallying point for disciplined data integration within a large enterprise, rationalizing the outputs and representations of all business processes. The resulting database serves as the repository of record for critical business functions. In addition, the database server storing the EDW has traditionally been a major computational asset, serving as the central, scalable engine for key enterprise analytics. The conceptual and computational centrality of the EDW makes it a mission-critical, expensive resource, used for serving data-intensive reports targeted at executive decision-makers. It is traditionally controlled by a dedicated IT staff that not only maintains the system, but jealously controls access to ensure that executives can rely on a high quality of service. [12]

While this orthodox EDW approach continues today in many settings, a number of factors are pushing towards a very different philosophy for large-scale data management in the enterprise. First, storage is now so cheap that small sub-groups within an enterprise can develop an isolated database of astonishing scale within their discretionary budget. The world’s largest data warehouse from just over a decade ago can be stored on less than 20 commodity disks priced at under $100 today. A department can pay for 1-2 orders of magnitude more storage than that without coordinating with management. Meanwhile, the number of massive-scale data sources in an enterprise has grown remarkably: massive databases arise today even from single sources like clickstreams, software logs, email and discussion forum archives, etc. Finally, the value of data analysis has entered common culture, with numerous companies showing how sophisticated data analysis leads to cost savings and even direct revenue. The end result of these opportunities is a grassroots move to collect and leverage data in multiple organizational [...]
MADness is Enabling
Instrumentation
Collection
Storage (Raw Data)
ETL (Extraction, Transform, Load)
RDBMS (Aggregates)
BI / Reporting
Traditional DW
} Ad-hoc
Queries?
Data Mining?
Data Mining
Instrumentation
Collection
Storage (Raw Data)
ETL (Extraction, Transform, Load)
RDBMS (Aggregates)
BI / Reporting
Traditional DW
} Ad-hoc
Queries
Facebook’s DW (phase N)
Oracle Database Server
Data Collection Server
MySQL Tier
Scribe Tier
Facebook’s DW (phase M)
M > N
Facebook Data Infrastructure
2008
MySQL Tier
Scribe Tier
Hadoop Tier
Oracle RAC Servers
Short Break
Hadoop Internals
HDFS
Namenode
Datanodes
One Rack A Different Rack
3x64MB file, 3 rep
4x64MB file, 3 rep
Small file, 7 rep
HDFS Write Path
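The write-path diagram did not survive the transcript, but the client-side view is just the Filesystem API from the “Important APIs” slide: the client asks the namenode where each block should go, then streams the block to a pipeline of datanodes, which replicate it. A minimal hedged sketch; the path and printed detail are illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();        // picks up fs.default.name etc.
    FileSystem fs = FileSystem.get(conf);            // HDFS if so configured, local FS otherwise
    Path p = new Path("/user/cs264/hello.txt");      // illustrative path
    FSDataOutputStream out = fs.create(p);           // namenode allocates blocks;
    out.writeBytes("hello, hdfs\n");                 // bytes stream to a datanode pipeline
    out.close();
    System.out.println("replication: " + fs.getFileStatus(p).getReplication());
  }
}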
HDFS Failures?
Datanode crash?
Clients read another copy
Background rebalance
Namenode crash?
uh-oh
M/R
Tasktrackers on the same
machines as datanodes
One Rack A Different Rack
Job on stars
Different job
Idle
M/R
Task fails
Try again?
Try again somewhere else?
Report failure
Retries possible because of idempotence
M/R Failures
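Retries and speculative re-execution are configured per job. These are the standard old-API JobConf properties for doing so; the values and the job class are illustrative, and getConf() assumes a Tool subclass as in the grep example:

JobConf job = new JobConf(getConf(), MyJob.class);   // MyJob is hypothetical
// how many times one task may be re-attempted before the whole job is failed
job.setInt("mapred.map.max.attempts", 4);
job.setInt("mapred.reduce.max.attempts", 4);
// also run backup copies of slow (“straggler”) tasks elsewhere; first to finish wins
job.setBoolean("mapred.map.tasks.speculative.execution", true);
job.setBoolean("mapred.reduce.tasks.speculative.execution", true);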
Programming these
systems...
Everything can fail
Inherently multi-threaded
Toolset still young
Mental models are different...
Research &
Hadoop
Scheduling & Sharing
Mixed use
Batch
Interactive
Real-time
Isolation
Metrics: Latency, Throughput, Utilization (per resource)
Scheduling
Fair and LATE Scheduling (Berkeley)
Nexus (Berkeley)
Quincy (MSR)
Implementation
BOOM Project
(Berkeley)
Overlog (Berkeley)
APPENDIX
A. NARADA IN OverLog
Here we provide an executable OverLog implementation of Narada’s mesh maintenance algorithms. Current limitations of the P2 parser and planner require slightly wordier syntax for some of our constructs. Specifically, handling of negation is still incomplete, requiring that we rewrite some rules to eliminate negation. Furthermore, our planner currently handles rules with collocated terms only. The OverLog specification below is directly parsed and executed by our current codebase.
/** Base tables */
materialize(member, infinity, infinity, keys(2)).
materialize(sequence, infinity, 1, keys(2)).
materialize(neighbor, infinity, infinity, keys(2)).
/* Environment table containing configuration
values */
materialize(env, infinity, infinity, keys(2,3)).
/* Setup of configuration values */
E0 neighbor@X(X,Y) :- periodic@X(X,E,0,1), env@X(X,
H, Y), H == "neighbor".
/** Start with sequence number 0 */
S0 sequence@X(X, Sequence) :- periodic@X(X, E, 0,
1), Sequence := 0.
/** Periodically start a refresh */
R1 refreshEvent@X(X) :- periodic@X(X, E, 3).
/** Increment my own sequence number */
R2 refreshSequence@X(X, NewSequence) :-
refreshEvent@X(X), sequence@X(X, Sequence),
NewSequence := Sequence + 1.
/** Save my incremented sequence */
R3 sequence@X(X, NewSequence) :-
refreshSequence@X(X, NewSequence).
/** Send a refresh to all neighbors with my current
membership */
R4 refresh@Y(Y, X, NewSequence, Address, ASequence,
ALive) :- refreshSequence@X(X, NewSequence),
member@X(X, Address, ASequence, Time, ALive),
neighbor@X(X, Y).
/** How many member entries that match the member
in a refresh message (but not myself) do I have? */
R5 membersFound@X(X, Address, ASeq, ALive,
count<*>) :- refresh@X(X, Y, YSeq, Address, ASeq,
ALive), member@X(X, Address, MySeq, MyTime,
MyLive), X != Address.
/** If I have none, just store what I got */
R6 member@X(X, Address, ASequence, T, ALive) :-
membersFound@X(X, Address, ASequence, ALive, C),
C == 0, T := f_now().
/** If I have some, just update with the
information I received if it has a higher
sequence number. */
R7 member@X(X, Address, ASequence, T, ALive) :-
membersFound@X(X, Address, ASequence, ALive, C),
C > 0, T := f_now(), member@X(X, Address,
MySequence, MyT, MyLive), MySequence < ASequence.
/** Update my neighbor’s member entry */
R8 member@X(X, Y, YSeq, T, YLive) :- refresh@X(X,
Y, YSeq, A, AS, AL), T := f_now(), YLive := 1.
/** Add anyone from whom I receive a refresh
message to my neighbors */
N1 neighbor@X(X, Y) :- refresh@X(X, Y,
YS, A, AS, L).
/** Probing of neighbor liveness */
L1 neighborProbe@X(X) :- periodic@X(X, E, 1).
L2 deadNeighbor@X(X, Y) :- neighborProbe@X(X), T :=
f_now(), neighbor@X(X, Y), member@X(X, Y, YS, YT,
L), T - YT > 20.
L3 delete neighbor@X(X, Y) :- deadNeighbor@X(X, Y).
L4 member@X(X, Neighbor, DeadSequence, T, Live) :-
deadNeighbor@X(X, Neighbor), member@X(X,
Neighbor, S, T1, L), Live := 0, DeadSequence := S
+ 1, T:= f_now().
B. CHORD IN OverLog
Here we provide the full OverLog specification for Chord. This specification deals with lookups, ring maintenance with a fixed number of successors, finger-table maintenance and opportunistic finger table population, joins, stabilization, and node failure detection.
/* The base tuples */
materialize(node, infinity, 1, keys(1)).
materialize(finger, 180, 160, keys(2)).
materialize(bestSucc, infinity, 1, keys(1)).
materialize(succDist, 10, 100, keys(2)).
materialize(succ, 10, 100, keys(2)).
materialize(pred, infinity, 100, keys(1)).
materialize(succCount, infinity, 1, keys(1)).
materialize(join, 10, 5, keys(1)).
materialize(landmark, infinity, 1, keys(1)).
materialize(fFix, infinity, 160, keys(2)).
materialize(nextFingerFix, infinity, 1, keys(1)).
materialize(pingNode, 10, infinity, keys(2)).
materialize(pendingPing, 10, infinity, keys(2)).
/** Lookups */
L1 lookupResults@R(R,K,S,SI,E) :- node@NI(NI,N),
lookup@NI(NI,K,R,E), bestSucc@NI(NI,S,SI), K in
Debugging and
Visualization
[Figure 5: Summarized Swimlanes plot for RandomWriter (top: 100GB written, 4 hosts) and Sort (bottom: 20GB input, 4 hosts) — per-task durations over time in seconds, JT_Map/JT_Reduce, all nodes]
[Figure 6: Matrix-vector Multiplication before optimization (above) and after optimization (below) — per-node task durations with an inefficient vs. efficient number of reducers]
4 Examples of Mochi’s Value
We demonstrate the use of Mochi’s visualizations (using mainly Swimlanes due to space constraints). All of the data is derived from log traces from the Yahoo! M45 [11] production cluster. The examples in § 4.1, § 4.2 involve 5-node clusters (4-slave, 1-master), and the example in § 4.3 is from a 25-node cluster. Mochi’s analysis and visualizations have run on real-world data from 300-node Hadoop production clusters, but we omit these results for lack of space; furthermore, at that scale, Mochi’s interactive visualization (zooming in/out and targeted inspection) is of more benefit, rather than a static one.
4.1 Understanding Hadoop Job Structure
Figure 5 shows the Swimlanes plots from the Sort and RandomWriter benchmark workloads (part of the [...]
Mochi (CMU)
Parallax (UW)
Usability
Performance
Need for
benchmarks
(besides GraySort)
Low-hanging fruit!
Higher-Level Languages
Hive (a lot like SQL) (Facebook/Apache)
Pig Latin (Yahoo!/Apache)
DryadLINQ (Microsoft)
Sawzall (Google)
SCOPE (Microsoft)
JAQL (IBM)
Optimizations
For a single query....
For a single workflow...
Across workflows...
Bring out last century’s DB
research! (joins)
And file system research
too! (RAID)
HadoopDB (Yale)
Data Formats (yes, in ’09)
New Datastore Models
File System
Bigtable, Dynamo,
Cassandra, ...
Database
New Computation
Models
MPI
M/R
Online M/R
Dryad
Pregel for Graphs
Iterative ML Algorithms
Hardware
Data Center Design
(Hamilton, Barroso,
Hölzle)
Energy-Efficiency
Network Topology and
Hardware
What does flash mean in
this context?
What about multi-core?
Larger-Scale Computing
Synchronization,
Coordination, and
Consistency
Chubby, ZooKeeper, Paxos, ...
Eventual Consistency
Applied Research
(research using M/R)
“Unreasonable
Effectiveness of Data”
WebTables (Cafarella)
Translation
ML...
Conferences...
(some in exotic locales)
SIGMOD
VLDB
ICDE
CIDR
HPTS
SOSP
LADIS
OSDI
SIGCOMM
HotCloud
NSDI
SC/ISC
SoCC
Others
(ask a prof!)
Parting Thoughts
The Wheel
Don’t Re-invent
Focus on your
data/problem
What about...
Reliability,
Durability,
Stability, Tooling
[Slide background: another Trader Joe’s Fearless Flyer excerpt (Baby Swiss cheese, Honey Roasted Peanuts, Baker Josef’s flour) — decorative, ending with the line quoted below]
Uh-oh. Looks like Joe’s been reinventing the wheel again.
“Look, there are lots of different types of wheels!” –Todd Lipcon
Re-invent!
Lots of new
possibilities!
New Models!
New implementations!
Better optimizations!
Conclusion
It’s a great time to be in Distributed Systems.
Participate!
Build!
Collaborate!
Questions?
philip@cloudera.com
(we’re hiring) (interns)
