Big Data looks tiny from Stratosphere
Kostas Tzoumas
kostas.tzoumas@tu-berlin.de
Data is an important asset
video & audio streams, sensor data, RFID, GPS, user online
behavior, scientific simulations, web archives, ...

Volume
Handle petabytes of data

Velocity
Handle high data arrival rates

Variety
Handle many heterogeneous data sources

Veracity
Handle inherent uncertainty of data
Data
Analysis
Four “I”s for Big Analysis
text mining, interactive and ad hoc analysis, machine
learning, graph analysis, statistical algorithms

Iterative
Model the data, do not just describe it

Incremental
Maintain the model under high arrival rates

Interactive
Step-by-step data exploration on very large data

Integrative
Fluent unified interfaces for different data models
Hadoop
Hadoop’s selling point is its
low effective storage cost.
Hadoop clusters are becoming a data vortex, attracting
cross-departmental data and changing the data usage
culture in companies.
Hadoop MapReduce was the wrong abstraction and
implementation to begin with and will be superseded
by better systems.
Advanced
Analytics
Analytics that model the data to reveal hidden
relationships, not just describe the data.
E.g., machine learning, predictive stats, graph analysis

Increasingly important from a market perspective.
Very different from SQL analytics: different languages and
access patterns (iterative vs. one-pass programs).
The Hadoop toolchain is poor; R, Matlab, etc. are not parallel.
[Figure: landscape diagram relating SQL, MapReduce, BigSQL, NoMapReduce, and BigAnalytics. Labels: scripting, XQuery?, SQL--, column store++, a query plan, wrong platform, scalable parallel sort]
Data Scientist: The Sexiest Job of the 21st Century
Meet the people who can coax treasure out of messy, unstructured data.
by Thomas H. Davenport and D.J. Patil

When Jonathan Goldman arrived for work in June 2006 at LinkedIn, the business networking site, the place still felt like a start-up. The company had just under 8 million accounts, and the number was growing quickly as existing members invited their friends and colleagues to join. But users weren’t seeking out connections with the people who were already on the site at the rate executives had expected. Something was apparently missing in the social experience. As one LinkedIn manager put it, “It was like arriving at a conference reception and realizing you don’t know anyone. So you just stand in the corner sipping your drink [...]

≠

    FROM (
      FROM pv_users
      MAP pv_users.userid, pv_users.date
      USING 'map_script'
      AS dt, uid
      CLUSTER BY dt) map_output
    INSERT OVERWRITE TABLE pv_users_reduced
      REDUCE map_output.dt, map_output.uid
      USING 'reduce_script'
      AS date, count;

    A = load 'WordcountInput.txt';
    B = MAPREDUCE wordcount.jar store A into 'inputDir' load
        'outputDir' as (word:chararray, count: int)
        'org.myorg.WordCount inputDir outputDir';
    C = sort B by count;
Taken from http://www.oracle.com/technetwork/java/jvmls2013vitek-2013524.pdf
Hadoop is...
1. A programming model called MapReduce
2. An implementation of said programming
model, called Hadoop MapReduce
3. A file system, called HDFS
4. A resource manager, called Yarn
5. Interfaces to Hadoop MapReduce
(Pig, Hive, Cascading, ...)
6. An ML library called Mahout.
7. Recently, a collection of runtime
systems (Tez, Impala, Spark,
Stratosphere, ...)

* Inspired by
Jens Dittrich
1. A programming model called
MapReduce
val input = TextFile(textInput)
val words = input.flatMap { line => line.split(" ") }
val counts = words.groupBy { word => word }.count()
val output = counts.write(wordsOutput, CsvOutputFormat())

map ( “Romeo, Romeo, wherefore art thou Romeo?” ) = [ (Romeo,1), (Romeo,1), (wherefore,1), (art,1), (thou,1), (Romeo,1) ]

reduce ( (Romeo,(1,1,1)), (wherefore,1), (art,1), (thou,1) ) = [ (Romeo,3), (wherefore,1), (art,1), (thou,1) ]
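The map and reduce signatures above can be sketched in plain Java. This is a single-process illustration only, not Hadoop or Stratosphere code; the class name WordCountSketch, the tokenizing regular expression, and the in-memory grouping step that stands in for the shuffle are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-process sketch of the MapReduce programming model.
class WordCountSketch {

    // map: one line of text -> a list of (word, 1) pairs
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : line.split("[^A-Za-z]+")) {
            if (!token.isEmpty()) {
                pairs.add(Map.entry(token, 1));
            }
        }
        return pairs;
    }

    // reduce: (word, [1, 1, 1]) -> the sum of the ones
    static int reduce(String word, List<Integer> ones) {
        int sum = 0;
        for (int one : ones) {
            sum += one;
        }
        return sum;
    }

    // Driver: apply map to every line, group the pairs by key
    // (the stand-in for the shuffle), then apply reduce per group.
    static Map<String, Integer> wordCount(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            counts.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("Romeo, Romeo, wherefore art thou Romeo?")));
    }
}
```

The user supplies only map and reduce; grouping by key is the system's job, which is why the model fits grouping and counting well.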
2. An implementation of said programming model, called Hadoop MapReduce

Map 1 input: “Romeo, Romeo, wherefore art thou Romeo?”
Map 1 output: (Romeo, 1), (Romeo, 1), (wherefore, 1), (art, 1), (thou, 1), (Romeo, 1)

Map 2 input: “What, art thou hurt?”
Map 2 output: (What, 1), (art, 1), (thou, 1), (hurt, 1)

Data written to disk
Data shuffled over network

Reduce 1 input: (Romeo, (1,1,1)), (art, (1,1)), (thou, (1,1))
Reduce 1 output: (Romeo, 3), (art, 2), (thou, 2)

Reduce 2 input: (wherefore, 1), (What, 1), (hurt, 1)
Reduce 2 output: (wherefore, 1), (What, 1), (hurt, 1)
Hand-coded join in Hadoop
MapReduce

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.io.Serializable;
import java.util.Map;
import java.util.regex.Pattern;

import com.google.common.base.Preconditions;
import com.google.common.collect.ComparisonChain;
import com.google.common.primitives.Longs;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

// HadoopJob (which provides parseArgs) is a helper base class not shown on the slide.
public class ReduceSideBookAndAuthorJoin extends HadoopJob {

  // Input records are tab-separated.
  private static final Pattern SEPARATOR = Pattern.compile("\t");

  @Override
  public int run(String[] args) throws Exception {
    Map<String,String> parsedArgs = parseArgs(args);
    Path authors = new Path(parsedArgs.get("--authors"));
    Path books = new Path(parsedArgs.get("--books"));
    Path outputPath = new Path(parsedArgs.get("--output"));

    Job join = new Job(new Configuration(getConf()));
    Configuration jobConf = join.getConfiguration();

    // One mapper per input; both emit the same key and value types.
    MultipleInputs.addInputPath(join, authors, TextInputFormat.class, ConvertAuthorsMapper.class);
    MultipleInputs.addInputPath(join, books, TextInputFormat.class, ConvertBooksMapper.class);
    join.setMapOutputKeyClass(SecondarySortedAuthorID.class);
    join.setMapOutputValueClass(AuthorOrTitleAndYearOfPublication.class);
    jobConf.setBoolean("mapred.compress.map.output", true);

    join.setReducerClass(JoinReducer.class);
    join.setOutputKeyClass(Text.class);
    join.setOutputValueClass(NullWritable.class);
    join.setJarByClass(JoinReducer.class);
    join.setJobName("reduceSideBookAuthorJoin");
    join.setOutputFormatClass(TextOutputFormat.class);
    jobConf.set("mapred.output.dir", outputPath.toString());
    // Group by author ID only, so the author record and its books meet in one reduce call.
    join.setGroupingComparatorClass(SecondarySortedAuthorID.GroupingComparator.class);

    join.waitForCompletion(true);
    return 0;
  }

  static class ConvertAuthorsMapper
      extends Mapper<Object,Text,SecondarySortedAuthorID,AuthorOrTitleAndYearOfPublication> {
    @Override
    protected void map(Object key, Text value, Context ctx) throws IOException, InterruptedException {
      String line = value.toString();
      if (line.length() > 0) {
        String[] tokens = SEPARATOR.split(line.toString());
        long authorID = Long.parseLong(tokens[0]);
        String author = tokens[1];
        ctx.write(new SecondarySortedAuthorID(authorID, true),
            new AuthorOrTitleAndYearOfPublication(author));
      }
    }
  }

  static class ConvertBooksMapper
      extends Mapper<Object,Text,SecondarySortedAuthorID,AuthorOrTitleAndYearOfPublication> {
    @Override
    protected void map(Object key, Text line, Context ctx) throws IOException, InterruptedException {
      String[] tokens = SEPARATOR.split(line.toString());
      long authorID = Long.parseLong(tokens[0]);
      short yearOfPublication = Short.parseShort(tokens[1]);
      String title = tokens[2];
      ctx.write(new SecondarySortedAuthorID(authorID, false),
          new AuthorOrTitleAndYearOfPublication(title, yearOfPublication));
    }
  }

  static class JoinReducer
      extends Reducer<SecondarySortedAuthorID,AuthorOrTitleAndYearOfPublication,Text,NullWritable> {
    @Override
    protected void reduce(SecondarySortedAuthorID key, Iterable<AuthorOrTitleAndYearOfPublication> values,
        Context ctx) throws IOException, InterruptedException {
      // The secondary sort delivers the author record first, followed by the author's books.
      String author = null;
      for (AuthorOrTitleAndYearOfPublication value : values) {
        if (author == null && !value.containsAuthor()) {
          throw new IllegalStateException("No author found for book: " + value.getTitle());
        } else if (author == null && value.containsAuthor()) {
          author = value.getAuthor();
        } else {
          ctx.write(new Text(author + '\t' + value.getTitle() + '\t' + value.getYearOfPublication()),
              NullWritable.get());
        }
      }
    }
  }

  static class SecondarySortedAuthorID implements WritableComparable<SecondarySortedAuthorID> {
    private boolean containsAuthor;
    private long id;

    static {
      WritableComparator.define(SecondarySortedAuthorID.class, new SecondarySortComparator());
    }

    SecondarySortedAuthorID() {}

    SecondarySortedAuthorID(long id, boolean containsAuthor) {
      this.id = id;
      this.containsAuthor = containsAuthor;
    }

    @Override
    public int compareTo(SecondarySortedAuthorID other) {
      return ComparisonChain.start()
          .compare(id, other.id)
          .result();
    }

    @Override
    public void write(DataOutput out) throws IOException {
      out.writeBoolean(containsAuthor);
      out.writeLong(id);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
      containsAuthor = in.readBoolean();
      id = in.readLong();
    }

    @Override
    public boolean equals(Object o) {
      if (o instanceof SecondarySortedAuthorID) {
        return id == ((SecondarySortedAuthorID) o).id;
      }
      return false;
    }

    @Override
    public int hashCode() {
      return Longs.hashCode(id);
    }

    // Sorts by author ID, then places the author record before the book records.
    static class SecondarySortComparator extends WritableComparator implements Serializable {
      protected SecondarySortComparator() {
        super(SecondarySortedAuthorID.class, true);
      }

      @Override
      public int compare(WritableComparable a, WritableComparable b) {
        SecondarySortedAuthorID keyA = (SecondarySortedAuthorID) a;
        SecondarySortedAuthorID keyB = (SecondarySortedAuthorID) b;
        return ComparisonChain.start()
            .compare(keyA.id, keyB.id)
            .compare(!keyA.containsAuthor, !keyB.containsAuthor)
            .result();
      }
    }

    static class GroupingComparator extends WritableComparator implements Serializable {
      protected GroupingComparator() {
        super(SecondarySortedAuthorID.class, true);
      }
    }
  }

  // Tagged union: holds either an author name or a book title with its year of publication.
  static class AuthorOrTitleAndYearOfPublication implements Writable {
    private boolean containsAuthor;
    private String author;
    private String title;
    private Short yearOfPublication;

    AuthorOrTitleAndYearOfPublication() {}

    AuthorOrTitleAndYearOfPublication(String author) {
      this.containsAuthor = true;
      this.author = Preconditions.checkNotNull(author);
    }

    AuthorOrTitleAndYearOfPublication(String title, short yearOfPublication) {
      this.containsAuthor = false;
      this.title = Preconditions.checkNotNull(title);
      this.yearOfPublication = yearOfPublication;
    }

    public boolean containsAuthor() {
      return containsAuthor;
    }

    public String getAuthor() {
      return author;
    }

    public String getTitle() {
      return title;
    }

    public Short getYearOfPublication() {
      return yearOfPublication;
    }

    @Override
    public void write(DataOutput out) throws IOException {
      out.writeBoolean(containsAuthor);
      if (containsAuthor) {
        out.writeUTF(author);
      } else {
        out.writeUTF(title);
        out.writeShort(yearOfPublication);
      }
    }

    @Override
    public void readFields(DataInput in) throws IOException {
      author = null;
      title = null;
      yearOfPublication = null;
      containsAuthor = in.readBoolean();
      if (containsAuthor) {
        author = in.readUTF();
      } else {
        title = in.readUTF();
        yearOfPublication = in.readShort();
      }
    }
  }
}
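For contrast, the logical operation behind the job above is an equi-join of authors and books on the author ID. A minimal in-memory sketch in plain Java follows; it is not Hadoop or Stratosphere code, and the class BookAuthorJoinSketch, its Author and Book types, and the sample rows are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory hash join: build a table on authors, probe it with books.
class BookAuthorJoinSketch {

    static final class Author {
        final long id;
        final String name;
        Author(long id, String name) { this.id = id; this.name = name; }
    }

    static final class Book {
        final long authorId;
        final short year;
        final String title;
        Book(long authorId, short year, String title) {
            this.authorId = authorId;
            this.year = year;
            this.title = title;
        }
    }

    // Emits one tab-separated line (author, title, year) per book, in book order.
    static List<String> join(List<Author> authors, List<Book> books) {
        Map<Long, String> nameById = new HashMap<>();
        for (Author a : authors) {
            nameById.put(a.id, a.name);          // build side
        }
        List<String> out = new ArrayList<>();
        for (Book b : books) {                   // probe side
            String name = nameById.get(b.authorId);
            if (name == null) {
                throw new IllegalStateException("No author found for book: " + b.title);
            }
            out.add(name + "\t" + b.title + "\t" + b.year);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Author> authors = List.of(new Author(1L, "Author A"), new Author(2L, "Author B"));
        List<Book> books = List.of(
            new Book(1L, (short) 2001, "Title X"),
            new Book(2L, (short) 2005, "Title Y"),
            new Book(1L, (short) 2010, "Title Z"));
        join(authors, books).forEach(System.out::println);
    }
}
```

The sketch assumes the author table fits in memory; the Hadoop version spends most of its code on composite keys, secondary sorting, and serialization, which are plumbing rather than join logic.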
5. Interfaces to Hadoop MapReduce
(Pig, Hive, Cascading, ...)

[Figure: a chain of three MapReduce jobs, each a Map followed by a Reduce]

• Lacking in declarativity
• Operators exchange data via HDFS
• Sort the only grouping operator
• Need many MapReduce rounds
6. An ML library called Mahout.
Iterative programs in Hadoop
[Figure: a client drives three successive MapReduce jobs (Iteration 1, Iteration 2, Iteration 3), each a Map followed by a Reduce]

Iterations in MapReduce too slow. Design a new runtime system and use the Hadoop scheduler to exploit sparse computational dependencies.

Incremental Iterations matter
■ Changes to the iteration's result for Connected Components in each superstep

[Figure: line chart of # Vertices (thousands), 0 to 1400, per Superstep, 0 to 34, for two series: Naïve (Bulk) and Incremental]
Observations
1. MapReduce programming model good for grouping & counting.
2. MapReduce programming model not good for much else.
3. Hadoop implementation of MapReduce trades performance
for fault-tolerance (disk-based data shuffling).
4. MapReduce programming model not suited for SQL. Need to
hack around it with multiple MapReduce rounds.
5. Hadoop’s implementation of MapReduce not suited for SQL.
6. MapReduce programming model and its Hadoop
implementation not suited for iterations. Need to hack around it
with implementing iterations in client or embedding a new
runtime in a Map function.
Stratosphere

Big Data

Stratosphere: a brief history
2009: DFG-funded research group from
TUB, HUB, HPI starts research on
“Information Management in the Cloud.”
2010-2012: Stratosphere released as open
source (v0.1, v0.2) and becomes known in
academic community. Companies and
Universities in Europe become part of
Stratosphere.
2013 and beyond: Transition from a
research project to a stable and usable
open source system, developer
community, and real-world use cases.
Stratosphere status
Next stable release (v0.4) coming up around end of November. Snapshot available to download; maturity equivalent to Apache incubations.

Community picking up: external developers from Universities (KTH, SICS, Inria, and others), hackathons in Berlin, Paris, Budapest, companies are starting to use Stratosphere (Deutsche Telekom, Internet Memory, Mediaplus).
Desiderata for next-gen big
data platforms: Usability

10 million Excel users
3 million R users
70,000 Hadoop users

“the market faces certain challenges such as unavailability of qualified and experienced work professionals, who can effectively handle the Hadoop architecture.”
Desiderata for next-gen big
data platforms: Performance
[Figure: bar chart comparing Stratosphere and Hadoop; axis from 0 to 700]

Performance difference from days to minutes enables real time decision making and widespread use of data within the organization.
Query optimizers: the enabling technology for SQL data warehousing and BI. Successful industrial application of artificial intelligence.

[Figure 2: Complex Plan and Reduced Plan Diagram (Query 8, OptA): (a) Complex Plan Diagram, (b) Reduced Plan Diagram. Axis label: Data characteristics change]

Each color is a differently written program that produces the same result but has very different performance depending on small changes in the data set and the analysis requirements.

Currently, only Stratosphere can optimize non-relational data analysis programs.
Use a combination of compiler and database technology to lift optimization beyond relational algebra. Derive properties of user-defined functions via code analysis and use these to mimic a relational database optimizer.

[Figure: optimizer pipeline for an example data flow over the inputs lineitem and supplier, with MAP (filter, project), MATCH (join), and REDUCE (aggregate) operators and an output sink. Operators are annotated with read and write sets over record fields 0 to 8 and with output cardinality bounds such as [0,1]. Physical plans are annotated with Local Forward, Partition, Pipeline, Hybrid-Hash, Sort, Part-Sort, and COMBINE.]

UDF Code Analysis
• Static Code Analysis Framework provides Control-Flow, Def-Use, Use-Def lists
Prerequisites:
• Fixed API to access records
Extracted Information:
• Field sets track read and write accesses on records
• Upper and lower output cardinality bounds
Safety:
• All record access instructions are detected
• Supersets of actual Read/Write sets are returned
• Supersets allow fewer but always safe transformations
Details in [HPS+12]

Three-address code of the example UDFs shown in the figure:

filter (MAP):
    r0 = this
    r1 = @parameter0 // Record
    r2 = @parameter1 // Collector
    $r5 = r1.getField(8)
    $r6 = r0.date_lb
    $i0 = $r5.compareTo($r6)
    if $i0 < 0 goto 1
    $r9 = r1.getField(8)
    $r10 = r0.date_ub
    $i1 = $r9.compareTo($r10)
    if $i1 >= 0 goto 1
    r2.collect(r1)
    1: return

project (MAP):
    r0 = this
    r1 = @parameter0 // Record
    r2 = @parameter1 // Collector
    $d0 = r1.getField(6)
    r0.extendedprice = $d0
    $d1 = r1.getField(7)
    r0.discount = $d1
    $r7 = r0.revenue // PactRecord
    $d2 = r0.extendedprice
    $d3 = r0.discount
    $d4 = 0 - $d3
    $d5 = $d2 * $d4
    $r7.setValue($d5)
    r1.setNull(6)
    r1.setNull(7)
    r1.setNull(8)
    $r8 = r0.revenue
    r1.setField(4, $r8)
    r2.collect(r1)

aggregate (REDUCE):
    r0 = this
    r1 = @parameter0 // Iterator
    r2 = @parameter1 // Collector
    r3 = r1.next()
    d0 = r3.getField(4)
    goto 2
    1: r3 = r1.next()
    $d1 = r3.getField(4)
    d0 = d0 + $d1
    2: $z0 = r1.hasNext()
    if $z0 != 0 goto 1
    r3.setField(4, d0)
    r2.collect(r3)

Data Flow Transformations
Reorder Conditions:
1. No Write-Read / Write-Write conflicts on record fields
• Similar to conflict detection in optimistic concurrency control
2. Preservation of groups for grouping operators
• Groups must remain unchanged or be completely removed
Enumeration Algorithm:
• Descends data flow recursively top-down
• Checks reorder conditions and switches successive operators
Supported Transformations:
• Filter push-down
• Join reordering
• Invariant group transformations
• Non-relational operators are integrated
Details in [HPS+12] and [HKT12]

Physical Optimization
Interesting Properties:
• Sorting, Grouping, Partitioning
• Property preservation reasoning with write sets
Execution Plan Selection:
• Chooses execution strategies for 2nd-order functions
• Chooses shipping strategies to distribute data
• Strategies known from parallel databases
Cost-based Plan Selection:
• Exploits UDF annotations for size estimates
• Cost model combines network, disk I/O and CPU costs
Details in [BEH+10]

Parallel Execution
Execution Engine:
• Massively parallel execution of DAG-structured data flows
• Sequential processing tasks
[WK09] Warneke, Kao,
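The first reorder condition (no Write-Read or Write-Write conflicts on record fields) can be sketched as a disjointness test on read and write sets. This is an illustration in plain Java, not the Stratosphere optimizer's code: the class ReorderConditionSketch and its Udf type are hypothetical, and the field sets in main are one reading of the filter and project UDFs (filter reads field 8; project reads fields 6 and 7 and writes fields 4, 6, 7, and 8). The group-preservation condition is not covered.

```java
import java.util.Collections;
import java.util.Set;

// Hypothetical sketch of reorder condition 1: two successive operators may be
// switched only if neither writes a field that the other reads or writes.
class ReorderConditionSketch {

    static final class Udf {
        final String name;
        final Set<Integer> reads;   // record fields the UDF may read
        final Set<Integer> writes;  // record fields the UDF may modify
        Udf(String name, Set<Integer> reads, Set<Integer> writes) {
            this.name = name;
            this.reads = reads;
            this.writes = writes;
        }
    }

    static boolean canReorder(Udf a, Udf b) {
        boolean noWriteRead = Collections.disjoint(a.writes, b.reads)
                           && Collections.disjoint(b.writes, a.reads);
        boolean noWriteWrite = Collections.disjoint(a.writes, b.writes);
        return noWriteRead && noWriteWrite;
    }

    public static void main(String[] args) {
        Udf filter = new Udf("filter", Set.of(8), Set.of());
        Udf project = new Udf("project", Set.of(6, 7), Set.of(4, 6, 7, 8));
        // A hypothetical variant of project that leaves field 8 untouched.
        Udf pricing = new Udf("pricing", Set.of(6, 7), Set.of(4));

        System.out.println("filter/project reorderable: " + canReorder(filter, project));
        System.out.println("filter/pricing reorderable: " + canReorder(filter, pricing));
    }
}
```

Because code analysis returns supersets of the true read and write sets, a test like this can only err toward reporting a conflict, which matches the safety argument above: fewer transformations, but always safe ones.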
Column group labels: one pass dataflow, many pass dataflow

                     MapReduce                      Impala, ...           Stratosphere
Text                 ✔                              ✔                     ✔
Aggregation          ✔                              ✔                     ✔
ETL                  ✔                              ✔                     ✔
SQL                  Hive is too slow               ✔                     ✔
Advanced analytics   Mahout is slow and low level   Madlib is too slow    ✔

A fast, massively parallel database-inspired backend.

Truly scales to disk-resident large data sets using database technology (e.g., hybrid hashing and external sort-merge for implementing key matching).

Built-in support for iterative programs via “iterate” operator: predictive and advanced analytics (machine learning, graph processing, stats) are all iterative.
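Giraph is a Stratosphere program
Incremental Iterations: Doing Pregel

Working Set has messages sent by the vertices
Delta set has state of changed vertices

[Figure: data flow of one superstep. The working set Wi and the solution set Si enter a CoGroup (left outer) that aggregates messages and derives the new state, producing the delta set Di+1, which is merged into the solution set. A Match with the graph topology creates messages from the new state, producing the next working set Wi+1.]

Stratosphere – Parallel Analytics Beyond MapReduce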
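The working-set and delta-set loop can be sketched in plain Java with Connected Components as the example. This is a single-process illustration, not the Stratosphere iterate operator: the class DeltaIterationSketch and its method names are hypothetical, the adjacency map is assumed to list every vertex as a key with edges in both directions, and messages are pre-aggregated to the minimum per target vertex.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of an incremental (workset) iteration for Connected Components.
class DeltaIterationSketch {

    // Returns vertex -> component id and appends the number of changed
    // vertices in each superstep to changedPerSuperstep.
    static Map<Integer, Integer> connectedComponents(
            Map<Integer, List<Integer>> adjacency, List<Integer> changedPerSuperstep) {
        // Solution set S: every vertex starts in its own component.
        Map<Integer, Integer> solution = new HashMap<>();
        for (Integer v : adjacency.keySet()) {
            solution.put(v, v);
        }
        // Initial working set W: every vertex messages its neighbours.
        Map<Integer, Integer> workset = createMessages(adjacency, solution, adjacency.keySet());

        while (!workset.isEmpty()) {
            // Aggregate messages and derive new state: the delta set D
            // holds only vertices whose component id got smaller.
            Map<Integer, Integer> delta = new HashMap<>();
            for (Map.Entry<Integer, Integer> message : workset.entrySet()) {
                if (message.getValue() < solution.get(message.getKey())) {
                    delta.put(message.getKey(), message.getValue());
                }
            }
            solution.putAll(delta);               // merge D into S
            changedPerSuperstep.add(delta.size());
            // Create messages from new state: only changed vertices send.
            workset = createMessages(adjacency, solution, delta.keySet());
        }
        return solution;
    }

    // Joins the changed vertices with the graph topology; keeps the minimum id per target.
    static Map<Integer, Integer> createMessages(Map<Integer, List<Integer>> adjacency,
            Map<Integer, Integer> solution, Set<Integer> changed) {
        Map<Integer, Integer> messages = new HashMap<>();
        for (Integer v : changed) {
            int componentId = solution.get(v);
            for (Integer neighbour : adjacency.get(v)) {
                messages.merge(neighbour, componentId, Math::min);
            }
        }
        return messages;
    }

    public static void main(String[] args) {
        Map<Integer, List<Integer>> graph = Map.of(
            1, List.of(2), 2, List.of(1, 3), 3, List.of(2),
            4, List.of(5), 5, List.of(4));
        List<Integer> changed = new ArrayList<>();
        System.out.println(connectedComponents(graph, changed));
        System.out.println("changed vertices per superstep: " + changed);
    }
}
```

Each superstep touches only vertices that received a message, so the work shrinks as the computation converges, whereas a bulk iteration would recompute every vertex in every superstep.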
To recap:
Stratosphere is an open-source system that runs
on top of Hadoop YARN and HDFS, but replaces
Hadoop MapReduce with a new runtime engine
designed for iterative and DAG-shaped programs,
offers a program optimizer that frees the programmer
from low-level decisions, is scalable to large clusters
and disk-resident data sets, and is programmable in
Java and Scala (and more to come).

A next-generation Big Data
platform is being developed
in Berlin.

Help us shape
the future of
Stratosphere!


http://www.flickr.com/photos/andiearbeit/4354455624/lightbox/

Dr. Kostas Tzoumas: Big Data Looks Tiny From Stratosphere at Big Data Beers (Nov. 20, 2013)

 
Big dataarchitecturesandecosystem+nosql
Big dataarchitecturesandecosystem+nosqlBig dataarchitecturesandecosystem+nosql
Big dataarchitecturesandecosystem+nosqlKhanderao Kand
 
Hadoop World 2011: Extending Enterprise Data Warehouse with Hadoop - Jonathan...
Hadoop World 2011: Extending Enterprise Data Warehouse with Hadoop - Jonathan...Hadoop World 2011: Extending Enterprise Data Warehouse with Hadoop - Jonathan...
Hadoop World 2011: Extending Enterprise Data Warehouse with Hadoop - Jonathan...Cloudera, Inc.
 
Rapid Data Exploration With Hadoop
Rapid Data Exploration With HadoopRapid Data Exploration With Hadoop
Rapid Data Exploration With HadoopPeter Skomoroch
 
Big data data lake and beyond
Big data data lake and beyond Big data data lake and beyond
Big data data lake and beyond Rajesh Kumar
 
Big Data World Singapore 2017 - Moving Towards Digitization & Artificial Inte...
Big Data World Singapore 2017 - Moving Towards Digitization & Artificial Inte...Big Data World Singapore 2017 - Moving Towards Digitization & Artificial Inte...
Big Data World Singapore 2017 - Moving Towards Digitization & Artificial Inte...Garrett Teoh Hor Keong
 

Similaire à Dr. Kostas Tzoumas: Big Data Looks Tiny From Stratosphere at Big Data Beers (Nov. 20, 2013) (20)

How Startups can leverage big data?
How Startups can leverage big data?How Startups can leverage big data?
How Startups can leverage big data?
 
RDBMS to Graph Webinar
RDBMS to Graph WebinarRDBMS to Graph Webinar
RDBMS to Graph Webinar
 
The Future of Data Science
The Future of Data ScienceThe Future of Data Science
The Future of Data Science
 
A modern data platform meets the needs of each type of data in your business
A modern data platform meets the needs of each type of data in your businessA modern data platform meets the needs of each type of data in your business
A modern data platform meets the needs of each type of data in your business
 
GraphLab Conference 2014 Keynote - Carlos Guestrin
GraphLab Conference 2014 Keynote - Carlos GuestrinGraphLab Conference 2014 Keynote - Carlos Guestrin
GraphLab Conference 2014 Keynote - Carlos Guestrin
 
Data APIs as a Foundation for Systems of Engagement
Data APIs as a Foundation for Systems of EngagementData APIs as a Foundation for Systems of Engagement
Data APIs as a Foundation for Systems of Engagement
 
Big data and data mining
Big data and data miningBig data and data mining
Big data and data mining
 
Mastering Your Customer Data on Apache Spark by Elliott Cordo
Mastering Your Customer Data on Apache Spark by Elliott CordoMastering Your Customer Data on Apache Spark by Elliott Cordo
Mastering Your Customer Data on Apache Spark by Elliott Cordo
 
Big Data
Big DataBig Data
Big Data
 
Azure HDInsight
Azure HDInsightAzure HDInsight
Azure HDInsight
 
Job Data Analysis Reveals Key Skills Required for Data Scientists
Job Data Analysis Reveals Key Skills Required for Data ScientistsJob Data Analysis Reveals Key Skills Required for Data Scientists
Job Data Analysis Reveals Key Skills Required for Data Scientists
 
Architecting a Data Platform For Enterprise Use (Strata NY 2018)
Architecting a Data Platform For Enterprise Use (Strata NY 2018)Architecting a Data Platform For Enterprise Use (Strata NY 2018)
Architecting a Data Platform For Enterprise Use (Strata NY 2018)
 
Distilling Hadoop Patterns of Use and How You Can Use Them for Your Big Data ...
Distilling Hadoop Patterns of Use and How You Can Use Them for Your Big Data ...Distilling Hadoop Patterns of Use and How You Can Use Them for Your Big Data ...
Distilling Hadoop Patterns of Use and How You Can Use Them for Your Big Data ...
 
Big dataarchitecturesandecosystem+nosql
Big dataarchitecturesandecosystem+nosqlBig dataarchitecturesandecosystem+nosql
Big dataarchitecturesandecosystem+nosql
 
Hadoop World 2011: Extending Enterprise Data Warehouse with Hadoop - Jonathan...
Hadoop World 2011: Extending Enterprise Data Warehouse with Hadoop - Jonathan...Hadoop World 2011: Extending Enterprise Data Warehouse with Hadoop - Jonathan...
Hadoop World 2011: Extending Enterprise Data Warehouse with Hadoop - Jonathan...
 
Rapid Data Exploration With Hadoop
Rapid Data Exploration With HadoopRapid Data Exploration With Hadoop
Rapid Data Exploration With Hadoop
 
Data engineering design patterns
Data engineering design patternsData engineering design patterns
Data engineering design patterns
 
Hadoop
Hadoop Hadoop
Hadoop
 
Big data data lake and beyond
Big data data lake and beyond Big data data lake and beyond
Big data data lake and beyond
 
Big Data World Singapore 2017 - Moving Towards Digitization & Artificial Inte...
Big Data World Singapore 2017 - Moving Towards Digitization & Artificial Inte...Big Data World Singapore 2017 - Moving Towards Digitization & Artificial Inte...
Big Data World Singapore 2017 - Moving Towards Digitization & Artificial Inte...
 

Dr. Kostas Tzoumas: Big Data Looks Tiny From Stratosphere at Big Data Beers (Nov. 20, 2013)

  • 1. Big Data looks tiny from Stratosphere Kostas Tzoumas kostas.tzoumas@tu-berlin.de
  • 2. Data is an important asset: video & audio streams, sensor data, RFID, GPS, user online behavior, scientific simulations, web archives, ... Volume: handle petabytes of data. Velocity: handle high data arrival rates. Variety: handle many heterogeneous data sources. Veracity: handle inherent uncertainty of data.
  • 4. Four “I”s for Big Analysis: text mining, interactive and ad hoc analysis, machine learning, graph analysis, statistical algorithms. Iterative: model the data, do not just describe it. Incremental: maintain the model under high arrival rates. Interactive: step-by-step data exploration on very large data. Integrative: fluent unified interfaces for different data models.
  • 5. Hadoop: Hadoop’s selling point is its low effective storage cost. Hadoop clusters are becoming a data vortex, attracting cross-departmental data and changing the data usage culture in companies. Hadoop MapReduce was the wrong abstraction and implementation to begin with and will be superseded by better systems.
  • 6. Advanced Analytics: analytics that model the data to reveal hidden relationships, not just describe the data, e.g., machine learning, predictive stats, graph analysis. Increasingly important from a market perspective. Very different from SQL analytics: different languages and access patterns (iterative vs. one-pass programs). The Hadoop toolchain is poor; R, Matlab, etc. are not parallel.
  • 9. Data Scientist: The Sexiest Job of the 21st Century. Meet the people who can coax treasure out of messy, unstructured data. By Thomas H. Davenport and D.J. Patil.
      Article excerpt: When Jonathan Goldman arrived for work in June 2006 at LinkedIn, the business networking site, the place still felt like a start-up. The company had just under 8 million accounts, and the number was growing quickly as existing members invited their friends and colleagues to join. But users weren’t seeking out connections with the people who were already on the site at the rate executives had expected. Something was apparently missing in the social experience. As one LinkedIn manager put it, “It was like arriving at a conference reception and realizing you don’t know anyone. So you just stand in the corner sipping your drink, and you
      ≠
      FROM (
        FROM pv_users
        MAP pv_users.userid, pv_users.date
        USING 'map_script'
        AS dt, uid
        CLUSTER BY dt) map_output
      INSERT OVERWRITE TABLE pv_users_reduced
        REDUCE map_output.dt, map_output.uid
        USING 'reduce_script'
        AS date, count;

      A = load 'WordcountInput.txt';
      B = MAPREDUCE wordcount.jar store A into 'inputDir' load
          'outputDir' as (word:chararray, count: int)
          'org.myorg.WordCount inputDir outputDir';
      C = sort B by count;
  • 11. Hadoop is...
      1. A programming model called MapReduce
      2. An implementation of said programming model, called Hadoop MapReduce
      3. A file system, called HDFS
      4. A resource manager, called YARN
      5. Interfaces to Hadoop MapReduce (Pig, Hive, Cascading, ...)
      6. An ML library called Mahout
      7. Recently, a collection of runtime systems (Tez, Impala, Spark, Stratosphere, ...)
      * Inspired by Jens Dittrich
  • 12. 1. A programming model called MapReduce
      val input = TextFile(textInput)
      val words = input.flatMap { line => line.split(" ") }
      val counts = words.groupBy { word => word }.count()
      val output = counts.write(wordsOutput, CsvOutputFormat())

      map(“Romeo, Romeo, wherefore art thou Romeo?”) = [(Romeo,1), (Romeo,1), (wherefore,1), (art,1), (thou,1), (Romeo,1)]
      reduce((Romeo,(1,1,1)), (wherefore,1), (art,1), (thou,1)) = [(Romeo,3), (wherefore,1), (art,1), (thou,1)]
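The map and reduce steps on this slide can be sketched in plain Java. This is a minimal single-machine illustration, not Hadoop or Stratosphere code: the class `WordCountSketch` and its method names are hypothetical, and punctuation is stripped during tokenization, which is an assumption the slide does not spell out.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory illustration of the MapReduce programming model.
class WordCountSketch {

    // "map": turn one line into (word, 1) pairs; non-word characters are dropped.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : line.split("\\W+")) {
            if (!token.isEmpty()) {
                pairs.add(new SimpleEntry<>(token, 1));
            }
        }
        return pairs;
    }

    // Grouping step between map and reduce: collect all values that share a key.
    static Map<String, List<Integer>> groupByKey(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            groups.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return groups;
    }

    // "reduce": collapse the value list of one key into a single count.
    static int reduce(List<Integer> ones) {
        int sum = 0;
        for (int one : ones) {
            sum += one;
        }
        return sum;
    }

    // Full pipeline: map every line, group by word, reduce each group.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> counts = new TreeMap<>();
        groupByKey(pairs).forEach((word, ones) -> counts.put(word, reduce(ones)));
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(Arrays.asList("Romeo, Romeo, wherefore art thou Romeo?")));
    }
}
```

In a real MapReduce system the grouping step is the distributed shuffle; here it is an ordinary map lookup, which is why the sketch says nothing about performance.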
  • 13. 2. An implementation of said programming model, called Hadoop MapReduce
      [Diagram] Map over “Romeo, Romeo, wherefore art thou Romeo?” emits (Romeo, 1), (Romeo, 1), (wherefore, 1), (art, 1), (thou, 1), (Romeo, 1). Map over “What, art thou hurt?” emits (What, 1), (art, 1), (thou, 1), (hurt, 1). Data is written to disk and shuffled over the network. One Reduce receives (Romeo, (1,1,1)), (art, (1,1)), (thou, (1,1)) and emits (Romeo, 3), (art, 2), (thou, 2). The other Reduce receives (wherefore, 1), (What, 1), (hurt, 1) and emits (wherefore, 1), (What, 1), (hurt, 1).
  • 14. Hand-coded join in Hadoop MapReduce
      public class ReduceSideBookAndAuthorJoin extends HadoopJob {

        private static final Pattern SEPARATOR = Pattern.compile("\t");

        @Override
        public int run(String[] args) throws Exception {
          Map<String,String> parsedArgs = parseArgs(args);

          Path authors = new Path(parsedArgs.get("--authors"));
          Path books = new Path(parsedArgs.get("--books"));
          Path outputPath = new Path(parsedArgs.get("--output"));

          Job join = new Job(new Configuration(getConf()));
          Configuration jobConf = join.getConfiguration();

          MultipleInputs.addInputPath(join, authors, TextInputFormat.class, ConvertAuthorsMapper.class);
          MultipleInputs.addInputPath(join, books, TextInputFormat.class, ConvertBooksMapper.class);

          join.setMapOutputKeyClass(SecondarySortedAuthorID.class);
          join.setMapOutputValueClass(AuthorOrTitleAndYearOfPublication.class);
          jobConf.setBoolean("mapred.compress.map.output", true);

          join.setReducerClass(JoinReducer.class);
          join.setOutputKeyClass(Text.class);
          join.setOutputValueClass(NullWritable.class);
          join.setJarByClass(JoinReducer.class);
          join.setJobName("reduceSideBookAuthorJoin");
          join.setOutputFormatClass(TextOutputFormat.class);
          jobConf.set("mapred.output.dir", outputPath.toString());
          join.setGroupingComparatorClass(SecondarySortedAuthorID.GroupingComparator.class);

          join.waitForCompletion(true);
          return 0;
        }

        static class ConvertAuthorsMapper
            extends Mapper<Object,Text,SecondarySortedAuthorID,AuthorOrTitleAndYearOfPublication> {
          @Override
          protected void map(Object key, Text value, Context ctx) throws IOException, InterruptedException {
            String line = value.toString();
            if (line.length() > 0) {
              String[] tokens = SEPARATOR.split(line.toString());
              long authorID = Long.parseLong(tokens[0]);
              String author = tokens[1];
              ctx.write(new SecondarySortedAuthorID(authorID, true), new AuthorOrTitleAndYearOfPublication(author));
            }
          }
        }

        static class ConvertBooksMapper
            extends Mapper<Object,Text,SecondarySortedAuthorID,AuthorOrTitleAndYearOfPublication> {
          @Override
          protected void map(Object key, Text line, Context ctx) throws IOException, InterruptedException {
            String[] tokens = SEPARATOR.split(line.toString());
            long authorID = Long.parseLong(tokens[0]);
            short yearOfPublication = Short.parseShort(tokens[1]);
            String title = tokens[2];
            ctx.write(new SecondarySortedAuthorID(authorID, false), new AuthorOrTitleAndYearOfPublication(title,
                yearOfPublication));
          }
        }

        static class JoinReducer
            extends Reducer<SecondarySortedAuthorID,AuthorOrTitleAndYearOfPublication,Text,NullWritable> {
          @Override
          protected void reduce(SecondarySortedAuthorID key, Iterable<AuthorOrTitleAndYearOfPublication> values, Context ctx)
              throws IOException, InterruptedException {
            String author = null;
            for (AuthorOrTitleAndYearOfPublication value : values) {
              if (author == null && !value.containsAuthor()) {
                throw new IllegalStateException("No author found for book: " + value.getTitle());
              } else if (author == null && value.containsAuthor()) {
                author = value.getAuthor();
              } else {
                ctx.write(new Text(author + '\t' + value.getTitle() + '\t' + value.getYearOfPublication()),
                    NullWritable.get());
              }
            }
          }
        }

        static class SecondarySortedAuthorID implements WritableComparable<SecondarySortedAuthorID> {

          private boolean containsAuthor;
          private long id;

          static {
            WritableComparator.define(SecondarySortedAuthorID.class, new SecondarySortComparator());
          }

          SecondarySortedAuthorID() {}

          SecondarySortedAuthorID(long id, boolean containsAuthor) {
            this.id = id;
            this.containsAuthor = containsAuthor;
          }

          @Override
          public int compareTo(SecondarySortedAuthorID other) {
            return ComparisonChain.start()
                .compare(id, other.id)
                .result();
          }

          @Override
          public void write(DataOutput out) throws IOException {
            out.writeBoolean(containsAuthor);
            out.writeLong(id);
          }

          @Override
          public void readFields(DataInput in) throws IOException {
            containsAuthor = in.readBoolean();
            id = in.readLong();
          }

          @Override
          public boolean equals(Object o) {
            if (o instanceof SecondarySortedAuthorID) {
              return id == ((SecondarySortedAuthorID) o).id;
            }
            return false;
          }

          @Override
          public int hashCode() {
            return Longs.hashCode(id);
          }

          static class SecondarySortComparator extends WritableComparator implements Serializable {

            protected SecondarySortComparator() {
              super(SecondarySortedAuthorID.class, true);
            }

            @Override
            public int compare(WritableComparable a, WritableComparable b) {
              SecondarySortedAuthorID keyA = (SecondarySortedAuthorID) a;
              SecondarySortedAuthorID keyB = (SecondarySortedAuthorID) b;
              return ComparisonChain.start()
                  .compare(keyA.id, keyB.id)
                  .compare(!keyA.containsAuthor, !keyB.containsAuthor)
                  .result();
            }
          }

          static class GroupingComparator extends WritableComparator implements Serializable {
            protected GroupingComparator() {
              super(SecondarySortedAuthorID.class, true);
            }
          }
        }

        static class AuthorOrTitleAndYearOfPublication implements Writable {

          private boolean containsAuthor;
          private String author;
          private String title;
          private Short yearOfPublication;

          AuthorOrTitleAndYearOfPublication() {}

          AuthorOrTitleAndYearOfPublication(String author) {
            this.containsAuthor = true;
            this.author = Preconditions.checkNotNull(author);
          }

          AuthorOrTitleAndYearOfPublication(String title, short yearOfPublication) {
            this.containsAuthor = false;
            this.title = Preconditions.checkNotNull(title);
            this.yearOfPublication = yearOfPublication;
          }

          public boolean containsAuthor() {
            return containsAuthor;
          }

          public String getAuthor() {
            return author;
          }

          public String getTitle() {
            return title;
          }

          public Short getYearOfPublication() {
            return yearOfPublication;
          }

          @Override
          public void write(DataOutput out) throws IOException {
            out.writeBoolean(containsAuthor);
            if (containsAuthor) {
              out.writeUTF(author);
            } else {
              out.writeUTF(title);
              out.writeShort(yearOfPublication);
            }
          }

          @Override
          public void readFields(DataInput in) throws IOException {
            author = null;
            title = null;
            yearOfPublication = null;
            containsAuthor = in.readBoolean();
            if (containsAuthor) {
              author = in.readUTF();
            } else {
              title = in.readUTF();
              yearOfPublication = in.readShort();
            }
          }
        }
      }
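The core idea behind the hand-coded join, stripped of the Hadoop plumbing, can be sketched in plain Java. This is only an in-memory illustration under stated assumptions: `ReduceSideJoinSketch` and `Tagged` are hypothetical names, an ordinary list sort stands in for the shuffle with secondary sort, and a book without a matching author is skipped here, whereas the slide's reducer throws an `IllegalStateException`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical in-memory illustration of a reduce-side join with secondary sort.
class ReduceSideJoinSketch {

    // One tagged record: either an author row or a book row, keyed by author ID.
    static final class Tagged {
        final long authorId;
        final boolean isAuthor;
        final String payload; // author name, or "title\tyear" for a book

        Tagged(long authorId, boolean isAuthor, String payload) {
            this.authorId = authorId;
            this.isAuthor = isAuthor;
            this.payload = payload;
        }
    }

    static List<String> join(List<Tagged> authors, List<Tagged> books) {
        List<Tagged> all = new ArrayList<>(authors);
        all.addAll(books);

        // Secondary sort: order by author ID, and within one ID put the author row first.
        all.sort((a, b) -> {
            int byId = Long.compare(a.authorId, b.authorId);
            return byId != 0 ? byId : Boolean.compare(!a.isAuthor, !b.isAuthor);
        });

        List<String> joined = new ArrayList<>();
        boolean first = true;
        long groupId = 0;
        String author = null;
        for (Tagged record : all) {
            if (first || record.authorId != groupId) {
                // A new key group starts: forget the author of the previous group.
                groupId = record.authorId;
                author = null;
                first = false;
            }
            if (record.isAuthor) {
                author = record.payload;
            } else if (author != null) {
                joined.add(author + "\t" + record.payload);
            }
            // Books without an author row are skipped in this sketch.
        }
        return joined;
    }

    public static void main(String[] args) {
        List<Tagged> authors = Arrays.asList(
                new Tagged(1, true, "Shakespeare"), new Tagged(2, true, "Goethe"));
        List<Tagged> books = Arrays.asList(
                new Tagged(1, false, "Hamlet\t1603"), new Tagged(2, false, "Faust\t1808"));
        join(authors, books).forEach(System.out::println);
    }
}
```

Because the author row sorts ahead of its books, the loop only needs to remember one author at a time, which is the same reason the slide's job defines a secondary-sort comparator and a grouping comparator.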
  • 15. 5. Interfaces to Hadoop MapReduce (Pig, Hive, Cascading, ...). Lacking in declarativity. Operators exchange data via HDFS. Sort is the only grouping operator. Need many MapReduce rounds. [Diagram: three chained Map and Reduce stages.]
  • 16. 6. An ML library called Mahout. Iterative programs in Hadoop. [Diagram: a client drives Iteration 1, Iteration 2, and Iteration 3, each a separate Map and Reduce job.]
  • 17. Iterations in MapReduce too slow. Design a new runtime system and use the Hadoop scheduler. Incremental iterations matter to exploit sparse computational dependencies. [Figure: changes to the iteration's result for Connected Components in each superstep. x-axis: Superstep (0 to 34); y-axis: # Vertices (thousands, 0 to 1400); series: Naïve (Bulk), Incremental.]
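The difference between bulk and incremental iterations can be sketched on a small graph in plain Java. This is a hedged illustration, not the Stratosphere runtime: `ConnectedComponentsSketch`, the adjacency-array graph representation, and the "touched vertices per superstep" counter are assumptions made for the example. Both variants propagate the minimum vertex ID as the component label.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical illustration of bulk vs. incremental iteration for connected components.
class ConnectedComponentsSketch {

    // Bulk iteration: every vertex is recomputed in every superstep.
    // Returns the number of vertices recomputed per superstep; label is updated in place.
    static List<Integer> bulk(int[][] adj, int[] label) {
        List<Integer> touched = new ArrayList<>();
        boolean changed = true;
        while (changed) {
            changed = false;
            int[] next = label.clone();
            for (int v = 0; v < adj.length; v++) {
                for (int u : adj[v]) {
                    if (label[u] < next[v]) {
                        next[v] = label[u];
                        changed = true;
                    }
                }
            }
            System.arraycopy(next, 0, label, 0, label.length);
            touched.add(adj.length);
        }
        return touched;
    }

    // Incremental iteration: only neighbors of vertices whose label changed
    // in the previous superstep are recomputed.
    static List<Integer> incremental(int[][] adj, int[] label) {
        List<Integer> touched = new ArrayList<>();
        TreeSet<Integer> workset = new TreeSet<>();
        for (int v = 0; v < adj.length; v++) {
            workset.add(v);
        }
        while (!workset.isEmpty()) {
            touched.add(workset.size());
            int[] next = label.clone();
            TreeSet<Integer> changedVertices = new TreeSet<>();
            for (int v : workset) {
                for (int u : adj[v]) {
                    if (label[u] < next[v]) {
                        next[v] = label[u];
                        changedVertices.add(v);
                    }
                }
            }
            System.arraycopy(next, 0, label, 0, label.length);
            // Only the neighbors of changed vertices can change in the next superstep.
            TreeSet<Integer> nextWorkset = new TreeSet<>();
            for (int v : changedVertices) {
                for (int u : adj[v]) {
                    nextWorkset.add(u);
                }
            }
            workset = nextWorkset;
        }
        return touched;
    }

    public static void main(String[] args) {
        int[][] path = {{1}, {0, 2}, {1, 3}, {2, 4}, {3}};
        System.out.println("bulk:        " + bulk(path, new int[]{0, 1, 2, 3, 4}));
        System.out.println("incremental: " + incremental(path, new int[]{0, 1, 2, 3, 4}));
    }
}
```

On graphs where most labels settle early, the incremental workset shrinks from superstep to superstep while the bulk variant keeps recomputing every vertex, which is the effect the slide's plot attributes to sparse computational dependencies.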
  • 18. Observations: 1. The MapReduce programming model is good for grouping & counting. 2. The MapReduce programming model is not good for much else. 3. The Hadoop implementation of MapReduce trades performance for fault-tolerance (disk-based data shuffling). 4. The MapReduce programming model is not suited for SQL; one needs to hack around it with multiple MapReduce rounds. 5. Hadoop’s implementation of MapReduce is not suited for SQL. 6. The MapReduce programming model and its Hadoop implementation are not suited for iterations; one needs to hack around this by implementing iterations in the client or by embedding a new runtime in a Map function.
  • 20. Stratosphere: a brief history. 2009: DFG-funded research group from TUB, HUB, HPI starts research on “Information Management in the Cloud.” 2010-2012: Stratosphere released as open source (v0.1, v0.2) and becomes known in the academic community. Companies and universities in Europe become part of Stratosphere. 2013 and beyond: Transition from a research project to a stable and usable open source system, developer community, and real-world use cases.
  • 21. Stratosphere status: Next stable release (v0.4) coming up around end of November. Snapshot available to download; maturity equivalent to Apache incubations. Community picking up: external developers from universities (KTH, SICS, Inria, and others); hackathons in Berlin, Paris, Budapest; companies are starting to use Stratosphere (Deutsche Telekom, Internet Memory, Mediaplus).
  • 23. Desiderata for next-gen big data platforms: Usability. 10 million Excel users; 3 million R users; 70,000 Hadoop users. “the market faces certain challenges such as unavailability of qualified and experienced work professionals, who can effectively handle the Hadoop architecture.”
  • 24. Desiderata for next-gen big data platforms: Performance. [Bar chart: Stratosphere vs. Hadoop, axis from 0 to 700.] Performance difference from days to minutes enables real-time decision making and widespread use of data within the organization.
  • 25. Data characteristics change. Query optimizers: the enabling technology for SQL data warehousing and BI; a successful industrial application of artificial intelligence. Currently, only Stratosphere can optimize non-relational data analysis programs. [Figure 2: Complex Plan and Reduced Plan Diagram (Query 8, OptA); (a) Complex Plan Diagram, (b) Reduced Plan Diagram. Each color is a differently written program that produces the same result but has very different performance depending on small changes in the data set and the analysis requirements.]
  • 26. Use a combination of compiler and database technology to lift optimization beyond relational algebra. Derive properties of user-defined functions (UDFs) via code analysis and use these to mimic a relational database optimizer.
    UDF Code Analysis. Prerequisites: a static code analysis framework provides control-flow, def-use, and use-def lists; a fixed API to access records. Extracted information: field sets track read and write accesses on records; upper and lower output cardinality bounds. Safety: all record access instructions are detected; supersets of the actual read/write sets are returned; supersets allow fewer but always safe transformations. Details in [HPS+12].
    Data Flow Transformations. Reorder conditions: (1) no write-read or write-write conflicts on record fields, similar to conflict detection in optimistic concurrency control; (2) preservation of groups for grouping operators: groups must remain unchanged or be completely removed. Enumeration algorithm: descends the data flow recursively top-down; checks reorder conditions and switches successive operators. Supported transformations: filter push-down; join reordering; invariant group transformations; non-relational operators are integrated. Details in [HPS+12] and [HKT12].
    Physical Optimization. Interesting properties: sorting, grouping, partitioning; property preservation reasoning with write sets. Execution plan selection: chooses execution strategies for second-order functions; chooses shipping strategies to distribute data; strategies known from parallel databases. Cost-based plan selection: exploits UDF annotations for size estimates; cost model combines network, disk I/O, and CPU costs. Details in [BEH+10].
    Parallel Execution. Execution engine: massively parallel execution of DAG-structured data flows; sequential processing tasks. [WK09] Warneke, Kao.
    [Figure: three-address code of the filter, project, and aggregate UDFs (record access via getField, setField, setNull, collect), annotated with read sets, write sets, and output cardinality bounds such as [0,1]; alternative data flows and execution plans over lineitem and supplier built from MAP, MATCH, REDUCE, and COMBINE operators with Local Forward, Partition, Pipeline, Hybrid-Hash, Sort, and Part-Sort strategies.]
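The two reorder conditions on this slide can be sketched as a small check over read and write sets. This is an illustrative simplification, not Stratosphere's optimizer code: the `Operator` class, its fields, and the rule used in `preserves_groups` are assumptions made for the example, and the field numbers only loosely follow the filter, project, and aggregate UDFs shown on the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Operator:
    """Hypothetical operator summary, as code analysis might derive it."""
    name: str
    read_set: frozenset
    write_set: frozenset
    out_card: tuple = (1, 1)             # (lower, upper) records per input
    group_keys: frozenset = frozenset()  # non-empty only for grouping operators


def has_field_conflict(a, b):
    """Condition 1: any write-read, read-write, or write-write overlap."""
    return bool(
        a.write_set & b.read_set
        or a.read_set & b.write_set
        or a.write_set & b.write_set
    )


def preserves_groups(op, grouping):
    """Condition 2 (assumed rule): keep groups intact or drop them whole."""
    if op.write_set & grouping.group_keys:
        return False
    emits_exactly_one = op.out_card == (1, 1)
    decides_on_keys_only = op.read_set <= grouping.group_keys
    return emits_exactly_one or decides_on_keys_only


def can_reorder(a, b):
    """True if two successive operators may be switched."""
    if has_field_conflict(a, b):
        return False
    for op, other in ((a, b), (b, a)):
        if other.group_keys and not preserves_groups(op, other):
            return False
    return True


# Illustrative operators; the field sets are assumptions for this sketch.
filter_date = Operator("filter", frozenset({8}), frozenset(), (0, 1))
filter_key = Operator("filter_on_key", frozenset({5}), frozenset(), (0, 1))
project = Operator("project", frozenset({6, 7}), frozenset({4, 6, 7, 8}))
aggregate = Operator("aggregate", frozenset({4, 5}), frozenset({4}),
                     group_keys=frozenset({5}))
```

With these definitions, `can_reorder(filter_date, project)` is false because the projection overwrites field 8, which the filter reads. `can_reorder(filter_key, aggregate)` is true because a filter that looks only at the grouping key removes whole groups. Because the analysis returns supersets of the real read and write sets, a check like this can only reject a safe reordering, never accept an unsafe one.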
  • 27. One-pass dataflow vs. many-pass dataflow (columns: MapReduce; Impala, ...; Stratosphere). Text: ✔; ✔; ✔. Aggregation: ✔; ✔; ✔. ETL: ✔; ✔; ✔. SQL: Hive is too slow; ✔; ✔. Advanced analytics: Mahout is slow and low level; Madlib is too slow; ✔. A fast, massively parallel, database-inspired backend. Truly scales to disk-resident large data sets using database technology (e.g., hybrid hashing and external sort-merge for implementing key matching). Built-in support for iterative programs via “iterate” operator: predictive and advanced analytics (machine learning, graph processing, stats) are all iterative.
  • 28. Iterations: Doing Pregel. Giraph is a Stratosphere incremental program. [Diagram: the working set W_i holds the messages sent by the vertices; a CoGroup (left outer) with the solution set S_i aggregates messages and derives the new state; the delta set D_i+1 holds the state of changed vertices and is merged into the solution set; a Match with the graph topology creates messages from the new state, producing W_i+1.] Stratosphere – Parallel Analytics Beyond MapReduce
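The Pregel-style incremental iteration on this slide can be sketched in plain Python. This illustrates the working-set and delta-set idea only and does not use Stratosphere's API; `delta_iterate`, `make_min_label_step`, and `connected_components` are hypothetical names, and connected components by minimum-label propagation is an assumed example workload.

```python
def delta_iterate(initial_solution, initial_workset, step):
    """Run supersteps until the working set (pending messages) is empty."""
    solution = dict(initial_solution)   # S_i: current state per vertex
    workset = list(initial_workset)     # W_i: messages sent by the vertices
    supersteps = 0
    while workset:
        delta, workset = step(solution, workset)  # D_{i+1}, W_{i+1}
        solution.update(delta)          # only changed vertices are touched
        supersteps += 1
    return solution, supersteps


def make_min_label_step(neighbours):
    """Build a step function for minimum-label propagation."""
    def step(solution, messages):
        # Aggregate messages per vertex (the CoGroup side of the diagram).
        best = {}
        for vertex, label in messages:
            best[vertex] = min(label, best.get(vertex, label))
        # Derive the new state; keep only the vertices that changed.
        delta = {v: label for v, label in best.items() if label < solution[v]}
        # Create messages from the new state (the Match with the topology).
        next_messages = [(n, label)
                         for v, label in delta.items()
                         for n in neighbours[v]]
        return delta, next_messages
    return step


def connected_components(neighbours):
    """Label every vertex with the smallest vertex id in its component."""
    solution = {v: v for v in neighbours}
    workset = [(n, v) for v in neighbours for n in neighbours[v]]
    return delta_iterate(solution, workset, make_min_label_step(neighbours))
```

Calling `connected_components({1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4]})` labels vertices 1 to 3 with 1 and vertices 4 and 5 with 4. The work per superstep shrinks with the number of changed vertices, in contrast to a loop that reprocesses the whole graph each round.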
  • 29. To recap: Stratosphere is an open-source system that runs on top of Hadoop YARN and HDFS but replaces Hadoop MapReduce with a new runtime engine designed for iterative and DAG-shaped programs, offers a program optimizer that frees the programmer from low-level decisions, is scalable to large clusters and disk-resident data sets, and is programmable in Java and Scala (and more to come).
  • 30. A next-generation Big Data platform is being developed in Berlin. Help us shape the future of Stratosphere! http://www.flickr.com/photos/andiearbeit/4354455624/lightbox/