Presentation by Dr. Kostas Tzoumas at the Big Data Beers Meetup [1] (Nov. 20, 2013) introducing the Stratosphere Platform for Big Data Analytics.
Check out http://stratosphere.eu for more information.
[1] http://www.meetup.com/Big-Data-Beers/events/147397982/
2. Data is an important asset
video & audio streams, sensor data, RFID, GPS, user online
behavior, scientific simulations, web archives, ...
Volume
Handle petabytes of data
Velocity
Handle high data arrival rates
Variety
Handle many heterogeneous data sources
Veracity
Handle inherent uncertainty of data
4. Four “I”s for Big Analysis
text mining, interactive and ad hoc analysis, machine
learning, graph analysis, statistical algorithms
Iterative
Model the data, do not just describe it
Incremental
Maintain the model under high arrival rates
Interactive
Step-by-step data exploration on very large data
Integrative
Fluent unified interfaces for different data models
5. Hadoop
Hadoop’s selling point is its
low effective storage cost.
Hadoop clusters are becoming a data vortex, attracting
cross-departmental data and changing the data usage
culture in companies.
Hadoop MapReduce was the wrong abstraction and
implementation to begin with and will be superseded
by better systems.
6. Advanced Analytics
Analytics that model the data to reveal hidden
relationships, not just describe the data.
E.g., machine learning, predictive stats, graph analysis
Increasingly important from a market perspective.
Very different from SQL analytics: different languages and
access patterns (iterative vs. one-pass programs).
Hadoop toolchain is poor; R, Matlab, etc. are not parallel.
9. Data Scientist: The Sexiest Job of the 21st Century
Meet the people who can coax treasure out of messy, unstructured data.
by Thomas H. Davenport and D.J. Patil
≠
FROM (
  FROM pv_users
  MAP pv_users.userid, pv_users.date
  USING 'map_script'
  AS dt, uid
  CLUSTER BY dt) map_output
INSERT OVERWRITE TABLE pv_users_reduced
  REDUCE map_output.dt, map_output.uid
  USING 'reduce_script'
  AS date, count;
A = load 'WordcountInput.txt';
B = MAPREDUCE wordcount.jar store A into 'inputDir' load
    'outputDir' as (word:chararray, count: int)
    'org.myorg.WordCount inputDir outputDir';
C = sort B by count;
When Jonathan Goldman arrived for work in June 2006 at LinkedIn, the business networking site, the place still felt like a start-up. The company had just under 8 million accounts, and the number was growing quickly as existing members invited their friends and colleagues to join. But users weren’t seeking out connections with the people who were already on the site
at the rate executives had expected. Something was apparently missing in the social experience. As one LinkedIn manager put it, “It was
like arriving at a conference reception and realizing you don’t know
anyone. So you just stand in the corner sipping your drink—and you
11. Hadoop is...
1. A programming model called MapReduce
2. An implementation of said programming
model, called Hadoop MapReduce
3. A file system, called HDFS
4. A resource manager, called YARN
5. Interfaces to Hadoop MapReduce
(Pig, Hive, Cascading, ...)
6. An ML library called Mahout.
7. Recently, a collection of runtime
systems (Tez, Impala, Spark,
Stratosphere, ...)
* Inspired by Jens Dittrich
15. 5. Interfaces to Hadoop MapReduce
(Pig, Hive, Cascading, ...)
[Figure: a chain of three Map and Reduce rounds]
Lacking in declarativity
Operators exchange data via HDFS
Sort the only grouping operator
Need many MapReduce rounds
16. 6. An ML library called Mahout.
Iterative programs in Hadoop
[Figure: a Client drives Iteration 1, Iteration 2, and Iteration 3, each a separate Map and Reduce round]
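The client-driven loop on this slide can be sketched as follows. This is a minimal in-memory sketch, not Hadoop code: `run_mapreduce`, `pagerank_map`, `make_pagerank_reduce`, `client_driver`, and the PageRank example itself are hypothetical names and choices made for illustration. The point is that every iteration is a complete Map and Reduce round, and the client decides whether to launch another one.

```python
from collections import defaultdict

DAMPING = 0.85  # assumed damping factor for the illustrative PageRank


def run_mapreduce(records, map_fn, reduce_fn):
    """Simulate one MapReduce round in memory: map, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    output = []
    for key in sorted(groups):
        output.extend(reduce_fn(key, groups[key]))
    return output


def pagerank_map(record):
    vertex, rank, neighbors = record
    # The graph topology has to be re-emitted in every round.
    yield vertex, ("links", neighbors)
    for target in neighbors:
        yield target, ("rank", rank / len(neighbors))


def make_pagerank_reduce(num_vertices):
    def pagerank_reduce(vertex, values):
        neighbors, incoming = [], 0.0
        for kind, payload in values:
            if kind == "links":
                neighbors = payload
            else:
                incoming += payload
        rank = (1 - DAMPING) / num_vertices + DAMPING * incoming
        yield (vertex, rank, neighbors)
    return pagerank_reduce


def client_driver(graph, max_iterations=200, tolerance=1e-6):
    """The client starts one full MapReduce round per iteration."""
    n = len(graph)
    state = [(v, 1.0 / n, nbrs) for v, nbrs in sorted(graph.items())]
    for iteration in range(1, max_iterations + 1):
        new_state = run_mapreduce(state, pagerank_map, make_pagerank_reduce(n))
        change = sum(abs(old[1] - new[1]) for old, new in zip(state, new_state))
        state = new_state
        if change < tolerance:  # convergence is checked on the client
            break
    return {v: rank for v, rank, _ in state}, iteration


graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks, rounds = client_driver(graph)
```

In a real cluster each call to `run_mapreduce` would be a separately scheduled job that reads its input from and writes its output to HDFS, which is where the per-iteration overhead comes from.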
17. Incremental Iterations matter
Iterations in MapReduce too slow. Design a new runtime system and use the Hadoop scheduler to exploit sparse computational dependencies.
[Figure: Changes to the iteration's result for Connected Components in each superstep. x-axis: Superstep (0 to 34); y-axis: # Vertices (thousands), 0 to 1400; series: Naïve (Bulk), Incremental]
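The difference between the bulk and the incremental series can be illustrated with a small sketch of Connected Components by label propagation. It is a toy under stated assumptions, not Stratosphere code: the function names, the tiny graph, and the choice to count "vertices processed per superstep" are all made up for illustration.

```python
def build_adjacency(edges, vertices):
    """Undirected adjacency lists."""
    neighbors = {v: set() for v in vertices}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    return neighbors


def components_bulk(edges, vertices):
    """Naive (bulk) iteration: every vertex is recomputed in every superstep."""
    neighbors = build_adjacency(edges, vertices)
    label = {v: v for v in vertices}
    processed = []
    while True:
        new_label = {
            v: min([label[v]] + [label[n] for n in neighbors[v]])
            for v in vertices
        }
        processed.append(len(vertices))
        if new_label == label:
            return label, processed
        label = new_label


def components_incremental(edges, vertices):
    """Incremental iteration: only vertices that changed in the previous
    superstep send their label to their neighbors."""
    neighbors = build_adjacency(edges, vertices)
    label = {v: v for v in vertices}
    workset = set(vertices)
    processed = []
    while workset:
        processed.append(len(workset))
        updates = {}
        for v in workset:
            for n in neighbors[v]:
                if label[v] < updates.get(n, label[n]):
                    updates[n] = label[v]
        label.update(updates)    # apply the changes after the superstep
        workset = set(updates)   # only changed vertices stay active
    return label, processed


vertices = [1, 2, 3, 4, 5, 6, 7]
edges = [(1, 2), (2, 3), (3, 4), (4, 5), (6, 7)]
bulk_labels, bulk_work = components_bulk(edges, vertices)
incr_labels, incr_work = components_incremental(edges, vertices)
```

Both variants reach the same component labels, but the incremental one touches fewer vertices in each later superstep, which is the effect the chart shows at a much larger scale.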
18. Observations
1. MapReduce programming model good for grouping & counting.
2. MapReduce programming model not good for much else.
3. Hadoop implementation of MapReduce trades performance
for fault-tolerance (disk-based data shuffling).
4. MapReduce programming model not suited for SQL. Need to
hack around it with multiple MapReduce rounds.
5. Hadoop’s implementation of MapReduce not suited for SQL.
6. MapReduce programming model and its Hadoop
implementation not suited for iterations. Need to hack around it
with implementing iterations in client or embedding a new
runtime in a Map function.
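Observations 1 and 4 can be made concrete with a small in-memory sketch. It is not Hadoop code: `map_reduce` and the four map and reduce functions are hypothetical helpers written for this example. Counting words fits one round; ordering the result by count already needs a second round that re-keys the data.

```python
from collections import defaultdict


def map_reduce(records, map_fn, reduce_fn):
    """Simulate one MapReduce round: map, shuffle (sort by key), reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    result = []
    for key in sorted(groups):
        result.extend(reduce_fn(key, groups[key]))
    return result


def count_map(line):
    for word in line.lower().split():
        yield word, 1


def count_reduce(word, ones):
    yield word, sum(ones)


def order_map(pair):
    word, count = pair
    # Re-key by negative count so that the shuffle sorts in descending order.
    yield -count, word


def order_reduce(negative_count, words):
    for word in sorted(words):
        yield word, -negative_count


def word_count_sorted(lines):
    counts = map_reduce(lines, count_map, count_reduce)   # round 1: group and count
    return map_reduce(counts, order_map, order_reduce)    # round 2: order by count


result = word_count_sorted(["big data big analytics", "data big"])
```

A SQL statement with a GROUP BY followed by an ORDER BY is a single declarative query, but in this model it turns into two chained rounds with an intermediate result between them.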
20. Stratosphere: a brief history
2009: DFG-funded research group from
TUB, HUB, HPI starts research on
“Information Management in the Cloud.”
2010-2012: Stratosphere released as open
source (v0.1, v0.2) and becomes known in
academic community. Companies and
Universities in Europe become part of
Stratosphere.
2013 and beyond: Transition from a
research project to a stable and usable
open source system, developer
community, and real-world use cases.
21. Stratosphere status
Next stable release (v0.4) coming up
around end of November. Snapshot
available to download; maturity
equivalent to Apache incubations.
Community picking up: external developers from universities (KTH, SICS, Inria, and others); hackathons in Berlin, Paris, and Budapest; companies are starting to use Stratosphere (Deutsche Telekom, Internet Memory, Mediaplus).
23. Desiderata for next-gen big data platforms: Usability
10 million Excel users
3 million R users
70,000 Hadoop users
“the market faces certain challenges such as unavailability of qualified and experienced work professionals, who can effectively handle the Hadoop architecture.”
24. Desiderata for next-gen big data platforms: Performance
[Figure: bar chart comparing Stratosphere and Hadoop; axis ticks from 0 to 700, units not given]
Performance difference from days to minutes enables real time decision making and widespread use of data within the organization.
25. Data characteristics change
Query optimizers: the enabling technology for SQL data warehousing and BI. Successful industrial application of artificial intelligence.
Each color is a differently written program that produces the same result but has very different performance depending on small changes in the data set and the analysis requirements.
Currently, only Stratosphere can optimize non-relational data analysis programs.
[Figure 2: Complex Plan and Reduced Plan Diagram (Query 8, OptA): (a) Complex Plan Diagram, (b) Reduced Plan Diagram]
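The idea that the cheapest plan depends on data characteristics can be shown with a toy cost-based choice between join strategies. Everything here is an assumption made for illustration: the `InputStats` type, the strategy names, and the cost model (bytes shipped over the network) are hypothetical and are not taken from the Stratosphere optimizer.

```python
from dataclasses import dataclass


@dataclass
class InputStats:
    name: str
    size_bytes: int


def candidate_plans(left, right, parallelism):
    """Estimated network cost, in bytes shipped, of three ways to run one join."""
    return {
        "repartition both inputs": left.size_bytes + right.size_bytes,
        f"broadcast {left.name}": left.size_bytes * parallelism,
        f"broadcast {right.name}": right.size_bytes * parallelism,
    }


def choose_plan(left, right, parallelism):
    """Pick the plan with the lowest estimated cost."""
    plans = candidate_plans(left, right, parallelism)
    return min(plans, key=plans.get)


orders = InputStats("orders", 10**9)
small_plan = choose_plan(orders, InputStats("countries", 10**4), parallelism=32)
large_plan = choose_plan(orders, InputStats("countries", 10**9), parallelism=32)
```

All three plans produce the same join result. Only the size of one input changes between the two calls, yet the cheapest plan flips from broadcasting the small input to repartitioning both, which is the kind of decision an optimizer takes off the programmer's hands.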
27. one pass dataflow, many pass dataflow
                    MapReduce                      Impala, ...           Stratosphere
Text                ✔                              ✔                     ✔
Aggregation         ✔                              ✔                     ✔
ETL                 ✔                              ✔                     ✔
SQL                 Hive is too slow               ✔                     ✔
Advanced analytics  Mahout is slow and low level   Madlib is too slow    ✔
A fast, massively parallel database-inspired backend.
[Figure labels: map, reduce]
Truly scales to disk-resident large data sets using database technology (e.g., hybrid hashing and external sort-merge for implementing key matching).
Built-in support for iterative programs via “iterate” operator: predictive and advanced analytics (machine learning, graph processing, stats) are all iterative.
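External sort-merge for key matching, as named above, can be sketched in a few lines. This is a simplified single-machine illustration, not the Stratosphere implementation: `external_sort`, `sort_merge_join`, the pickle-based run files, and the `max_in_memory` limit are assumptions of this sketch, and it further assumes that all records sharing one key fit in memory.

```python
import heapq
import os
import pickle
import tempfile
from itertools import groupby


def _write_run(sorted_records, directory):
    """Spill one sorted run to a temporary file."""
    fd, path = tempfile.mkstemp(dir=directory, suffix=".run")
    with os.fdopen(fd, "wb") as f:
        for record in sorted_records:
            pickle.dump(record, f)
    return path


def _read_run(path):
    """Stream a spilled run back, one record at a time."""
    with open(path, "rb") as f:
        while True:
            try:
                yield pickle.load(f)
            except EOFError:
                return


def external_sort(records, key, max_in_memory=1000):
    """Sort in bounded memory: write sorted runs to disk, then merge them."""
    with tempfile.TemporaryDirectory() as directory:
        runs, buffer = [], []
        for record in records:
            buffer.append(record)
            if len(buffer) >= max_in_memory:
                runs.append(_write_run(sorted(buffer, key=key), directory))
                buffer = []
        if buffer:
            runs.append(_write_run(sorted(buffer, key=key), directory))
        yield from heapq.merge(*(_read_run(p) for p in runs), key=key)


def _key_groups(sorted_records, key):
    for k, group in groupby(sorted_records, key=key):
        yield k, list(group)


def sort_merge_join(left, right, key, max_in_memory=1000):
    """Inner join: sort both inputs externally, then merge matching keys."""
    left_groups = _key_groups(external_sort(left, key, max_in_memory), key)
    right_groups = _key_groups(external_sort(right, key, max_in_memory), key)
    l, r = next(left_groups, None), next(right_groups, None)
    while l is not None and r is not None:
        if l[0] < r[0]:
            l = next(left_groups, None)
        elif l[0] > r[0]:
            r = next(right_groups, None)
        else:
            for a in l[1]:
                for b in r[1]:
                    yield a, b
            l, r = next(left_groups, None), next(right_groups, None)


left = [(3, "c"), (1, "a"), (2, "b"), (2, "bb")]
right = [(2, "x"), (3, "y"), (4, "z")]
joined = list(sort_merge_join(left, right, key=lambda rec: rec[0], max_in_memory=2))
```

Setting `max_in_memory=2` forces both inputs to spill several runs, so the example exercises the disk path even on tiny data.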
28. Incremental Iterations: Doing Pregel
Giraph is a Stratosphere program
[Figure: Pregel expressed as an incremental iteration dataflow]
Working Set W_{i+1} has messages sent by the vertices
Delta set D_{i+1} has state of changed vertices
CoGroup (left outer) of W_i and S_i: aggregate messages and derive new state
Match with Graph Topology: create messages from new state
Stratosphere – Parallel Analytics Beyond MapReduce
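The dataflow on this slide can be mimicked in plain Python to show how the working set, the solution set, and the delta set interact. This is a sketch under assumptions, not the Stratosphere API: `pregel_sssp`, the dictionary-based sets, and the single-source shortest-path example are all chosen for illustration.

```python
def pregel_sssp(topology, source):
    """Single-source shortest paths in Pregel style.

    topology: dict mapping each vertex to a list of (target, weight) edges.
    """
    solution = {}                 # S_i: vertex -> best known distance
    workset = [(source, 0.0)]     # W_i: messages as (target vertex, distance)
    infinity = float("inf")
    while workset:
        # CoGroup (left outer) of W_i and S_i on the vertex id:
        # aggregate the messages per vertex and derive the new state.
        inbox = {}
        for target, distance in workset:
            inbox[target] = min(distance, inbox.get(target, infinity))
        delta = {}                # D_{i+1}: state of changed vertices only
        for vertex, best in inbox.items():
            # Left outer: a vertex may not have any state in S_i yet.
            if best < solution.get(vertex, infinity):
                delta[vertex] = best
        solution.update(delta)    # merge D_{i+1} into the solution set
        # Match D_{i+1} with the graph topology:
        # create messages from the new state.
        workset = [
            (target, distance + weight)
            for vertex, distance in delta.items()
            for target, weight in topology.get(vertex, [])
        ]
    return solution


topology = {
    "a": [("b", 1.0), ("c", 4.0)],
    "b": [("c", 2.0), ("d", 5.0)],
    "c": [("d", 1.0)],
    "d": [],
}
distances = pregel_sssp(topology, "a")
```

Only vertices whose state changed produce messages for the next superstep, so the working set shrinks as the computation converges and the loop ends when it is empty.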
29. To recap:
Stratosphere is an open-source system that runs
on top of Hadoop YARN and HDFS, but replaces
Hadoop MapReduce with a new runtime engine
designed for iterative and DAG-shaped programs,
offers a program optimizer that frees the programmer
from low-level decisions, is scalable to large clusters
and disk-resident data sets, and is programmable in
Java and Scala (and more to come).
30. A next-generation Big Data
platform is being developed
in Berlin.
Help us shape
the future of
Stratosphere!
http://www.flickr.com/photos/andiearbeit/4354455624/lightbox/