6. 01/30/15 Bill Howe, UW 6
Science is reducing to a database problem
Old model: "Query the world" (data acquisition coupled to a specific hypothesis)
New model: "Download the world" (data acquired en masse, in support of many hypotheses)
• Astronomy: high-resolution, high-frequency sky surveys (SDSS, LSST, PanSTARRS)
• Oceanography: high-resolution models, cheap sensors, satellites
• Biology: lab automation, high-throughput sequencing
7. 01/30/15 Bill Howe, UW 7
Two Dimensions
[Figure: science data projects from Biology, Oceanography, and Astronomy plotted along two dimensions, # of bytes vs. # of apps: SDSS, LSST, PanSTARRS, Galaxy, BioMart, GEO, IOOS, OOI, LANL HIV, Pathway Commons]
8. 01/30/15 Bill Howe, UW 8
Roadmap
• Introduction
• Context: RDBMS, MapReduce, etc.
• New Extensions for Science
  • Spatial Clustering
  • Recursive MapReduce
  • Skew Handling
9. 01/30/15 Bill Howe, UW 9
What Does Scalable Mean?
• In the past: out-of-core
  "Works even if data doesn't fit in main memory"
• Now: parallel
  "Can make use of 1000s of independent computers"
10. 01/30/15 Bill Howe, UW 10
Taxonomy of Parallel Architectures
[Figure: taxonomy of parallel architectures. Shared memory: easiest to program, but $$$$. Shared nothing: scales to 1000s of computers]
11. 01/30/15 Bill Howe, UW 11
Design Space
[Figure: design space along two axes, Throughput vs. Latency, spanning Internet, private data center, data-parallel, and shared-memory systems; DISC sits in the data-parallel, private-data-center region: this talk]
slide src: Michael Isard, MSR
12. 01/30/15 Bill Howe, UW 12
Some distributed algorithm…
Map
(Shuffle)
Reduce
13. 01/30/15 Bill Howe, UW 13
MapReduce Programming Model
• Input & output: each a set of key/value pairs
• Programmer specifies two functions:
  • map: processes an input key/value pair and produces a set of intermediate pairs
  • reduce: combines all intermediate values for a particular key and produces a set of merged output values (usually just one)

map (in_key, in_value) -> list(out_key, intermediate_value)
reduce (out_key, list(intermediate_value)) -> list(out_value)

Inspired by primitives from functional programming languages such as Lisp, Scheme, and Haskell
slide source: Google, Inc.
14. 01/30/15 Bill Howe, UW 14
Example: What does this do?
map(String input_key, String input_value):
  // input_key: document name
  // input_value: document contents
  for each word w in input_value:
    EmitIntermediate(w, 1);

reduce(String output_key, Iterator intermediate_values):
  // output_key: word
  // output_values: ????
  int result = 0;
  for each v in intermediate_values:
    result += v;
  Emit(result);
slide source: Google, Inc.
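The job above is the classic word count. As a minimal single-process sketch of what the framework does with those two functions (the shuffle is simulated by grouping; `map_fn`/`reduce_fn`/`run_mapreduce` are illustrative names, not a real framework API):

```python
# Single-process sketch of the word-count MapReduce job.
# Instead of EmitIntermediate/Emit, the functions return lists.
from collections import defaultdict

def map_fn(doc_name, doc_contents):
    # emit (word, 1) for every word occurrence
    return [(w, 1) for w in doc_contents.split()]

def reduce_fn(word, values):
    # sum the per-occurrence counts for one word
    return sum(values)

def run_mapreduce(inputs, map_fn, reduce_fn):
    groups = defaultdict(list)          # the "shuffle" phase: group by key
    for key, value in inputs:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    return {k: reduce_fn(k, vs) for k, vs in groups.items()}

counts = run_mapreduce([("d1", "a rose is a rose")], map_fn, reduce_fn)
print(counts)  # {'a': 2, 'rose': 2, 'is': 1}
```

So the mystery `output_values` on the slide is the list of 1s shuffled to each word, and the reducer just sums them.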
15. 01/30/15 Bill Howe, UW 15
Example: Rendering
Bronson et al. Vis 2010 (submitted)
16. 01/30/15 Bill Howe, UW 16
Example: Isosurface Extraction
Bronson et al. Vis 2010 (submitted)
17. 01/30/15 Bill Howe, UW 17
Large-Scale Data Processing
• Many tasks process big data, produce big data
• Want to use hundreds or thousands of CPUs
  • ... but this needs to be easy
• Parallel databases exist, but they are expensive, difficult to set up, and do not necessarily scale to hundreds of nodes
• MapReduce is a lightweight framework, providing:
  • Automatic parallelization and distribution
  • Fault tolerance
  • I/O scheduling
  • Status and monitoring
18. 01/30/15 Bill Howe, UW 18
What's wrong with MapReduce?
• Literally Map then Reduce and that's it…
  • Reducers write to replicated storage
• Complex jobs pipeline multiple stages
  • No fault tolerance between stages
• Map assumes its data is always available: simple!
• What else?
20. 01/30/15 Bill Howe, UW 20
Relational Database History
Pre-relational: if your data changed, your application broke.
Early RDBMSs were buggy and slow (and often reviled), but required only 5% of the application code.
"Activities of users at terminals and most application programs should remain unaffected when the internal representation of data is changed and even when some aspects of the external representation are changed." -- Codd 1979
Key idea: programs that manipulate tabular data exhibit an algebraic structure allowing reasoning and manipulation independently of physical data representation.
21. 01/30/15 Bill Howe, UW 21
Key Idea: Data Independence
physical data independence: files and pointers vs. relations
logical data independence: relations vs. views

SELECT *
FROM my_sequences

SELECT seq
FROM ncbi_sequences
WHERE seq = 'GATTACGATATTA';

f = fopen("table_file", "rb");
fseek(f, 10030440, SEEK_SET);
while (true) {
  fread(&buf, 1, 8192, f);
  if (strcmp(buf, "GATTACGATATTA") == 0) {
    . . .
22. 01/30/15 Bill Howe, UW 22
Key Idea: Indexes
• Databases are especially, but not exclusively, effective at "needle in haystack" problems:
  • Extracting small results from big datasets
  • Transparently provide "old-style" scalability
  • Your query will always* finish, regardless of dataset size
• Indexes are easily built and automatically used when appropriate

CREATE INDEX seq_idx ON sequence(seq);

SELECT seq
FROM sequence
WHERE seq = 'GATTACGATATTA';

*almost
23. 01/30/15 Bill Howe, UW 23
Key Idea: An Algebra of Tables
[Figure: query plan combining select, project, and two joins]
Other operators: aggregate, union, difference, cross product
24. 01/30/15 Bill Howe, UW 24
Key Idea: Algebraic Optimization
N = ((z*2)+((z*3)+0))/1
Algebraic Laws:
1. (+) identity: x+0 = x
2. (/) identity: x/1 = x
3. (*) distributes: (n*x+n*y) = n*(x+y)
4. (*) commutes: x*y = y*x
Apply rules 1, 3, 4, 2:
N = (2+3)*z
two operations instead of five, no division operator
Same idea works with the Relational Algebra!
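The rewrite above can be checked mechanically. A quick sketch (function names are illustrative), confirming the two forms agree on any input while the optimized form uses two operations instead of five:

```python
# The expression from the slide, before and after algebraic rewriting.
def original(z):
    # five operations: *, *, +, +, /
    return ((z * 2) + ((z * 3) + 0)) / 1

def optimized(z):
    # two operations after constant folding: +, *
    return (2 + 3) * z

# the rewrite is an equivalence, not an approximation
for z in [0, 1, -7, 3.5]:
    assert original(z) == optimized(z)
```

A relational optimizer plays the same game with laws like selection pushdown and join commutativity instead of arithmetic identities.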
25. 01/30/15 Bill Howe, UW 25
Key Idea: Declarative Languages
SELECT *
FROM Order o, Item i
WHERE o.item = i.item
AND o.date = today()
[Query plan: scan(Order o) → select(o.date = today()); scan(Item i); join(o.item = i.item)]
Find all orders from today, along with the items ordered
26. 01/30/15 Bill Howe, UW 26
Shared Nothing Parallel Databases
• Teradata
• Greenplum
• Netezza
• Aster Data Systems
• Datallegro (acquired by Microsoft)
• Vertica
• MonetDB (engine recently commercialized as "Vectorwise")
27. 01/30/15 Bill Howe, UW 27
Example System: Teradata
AMP = unit of parallelism
28. 01/30/15 Bill Howe, UW 28
Example System: Teradata
SELECT *
FROM Orders o, Lines i
WHERE o.item = i.item
AND o.date = today()
[Query plan: scan(Order o) → select(o.date = today()); scan(Item i); join(o.item = i.item)]
Find all orders from today, along with the items ordered
29. 01/30/15 Bill Howe, UW 29
Example System: Teradata
[Figure, phase 1: AMPs 1-3 each scan their fragment of Order o, apply select date = today(), and hash each surviving row on h(item), routing it to one of AMPs 4-6]
30. 01/30/15 Bill Howe, UW 30
Example System: Teradata
[Figure, phase 2: AMPs 1-3 each scan their fragment of Item i and hash each row on h(item), routing it to one of AMPs 4-6]
31. 01/30/15 Bill Howe, UW 31
Example System: Teradata
[Figure, phase 3: AMPs 4-6 each run join o.item = i.item locally; AMP k contains all orders and all lines where hash(item) = k]
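The three phases above amount to a hash-partitioned join. A single-process sketch (table contents, `TODAY`, and the number of join AMPs are made-up stand-ins):

```python
# Sketch of a hash-partitioned join: "scan AMPs" filter and route rows
# by hashing the join key; "join AMPs" then join matching rows locally.
from collections import defaultdict

N_JOIN_AMPS = 3
TODAY = "2015-01-30"

orders = [("o1", "widget", "2015-01-30"), ("o2", "gadget", "2015-01-29")]
items  = [("widget", 9.99), ("gadget", 4.99)]

order_parts = defaultdict(list)   # rows routed to the join AMPs
item_parts  = defaultdict(list)

for oid, item, date in orders:            # phase 1: scan + select + hash
    if date == TODAY:
        order_parts[hash(item) % N_JOIN_AMPS].append((oid, item, date))

for item, price in items:                 # phase 2: scan + hash
    item_parts[hash(item) % N_JOIN_AMPS].append((item, price))

result = []                               # phase 3: local joins per AMP
for amp in range(N_JOIN_AMPS):
    for oid, oitem, date in order_parts[amp]:
        for iitem, price in item_parts[amp]:
            if oitem == iitem:
                result.append((oid, oitem, date, price))

print(result)  # [('o1', 'widget', '2015-01-30', 9.99)]
```

Because both tables are hashed on the same key, every matching pair lands on the same "AMP", so no join partition ever needs to see another partition's data.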
32. 01/30/15 Bill Howe, UW 32
MapReduce Contemporaries
• Dryad (Microsoft): relational algebra
• Pig (Yahoo): near-relational algebra over MapReduce
• HIVE (Facebook): SQL over MapReduce
• Cascading: relational algebra
• Clustera (U of Wisconsin)
• HBase: indexing on HDFS
39. 01/30/15 Bill Howe, UW 39
What's the problem?
• L is loaded and shuffled in each iteration
• L never changes
[Figure: in every iteration, mappers m01-m03 reload L-split0 and L-split1 along with Ri and shuffle them to reducers r01-r02]
[Bu et al. VLDB 2010 (submitted)]
40. 01/30/15 Bill Howe, UW 40
HaLoop: Loop-aware Hadoop
• Hadoop: loop control in the client program
• HaLoop: loop control in the master node
[Figure: Hadoop vs. HaLoop architectures]
[Bu et al. VLDB 2010 (submitted)]
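The payoff of loop awareness is caching the loop-invariant input. A single-process sketch of the idea on a PageRank-style iteration (the toy three-page graph and damping constants are made-up stand-ins; in HaLoop the cached L lives on mapper-local storage across iterations rather than in a dict):

```python
# Sketch of loop-invariant caching: the link table L never changes,
# so it is cached once; only the small rank table R is updated per step.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}   # L: loop-invariant
ranks = {p: 1.0 / len(links) for p in links}        # R_i: changes each step

cached_L = dict(links)   # loaded and partitioned once, reused every iteration

for _ in range(20):
    contribs = {p: 0.0 for p in cached_L}
    for page, outs in cached_L.items():             # join R_i with cached L
        for dest in outs:
            contribs[dest] += ranks[page] / len(outs)
    ranks = {p: 0.15 / len(cached_L) + 0.85 * c for p, c in contribs.items()}

print(ranks)
```

In plain Hadoop, the equivalent of `cached_L` would be re-read from HDFS and re-shuffled in every one of those 20 iterations, even though it never changed.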
42. 01/30/15 Bill Howe, UW 42
HaLoop Architecture
[Bu et al. VLDB 2010 (submitted)]
43. 01/30/15 Bill Howe, UW 43
Experiments
• Amazon EC2
  • 20, 50, 90 default small instances
• Datasets
  • Billions of Triples (120 GB)
  • Freebase (12 GB)
  • LiveJournal social network (18 GB)
• Queries
  • Transitive closure
  • PageRank
  • k-means
[Bu et al. VLDB 2010 (submitted)]
44. 01/30/15 Bill Howe, UW 44
Application Run Time
• Transitive Closure
• PageRank
[Bu et al. VLDB 2010 (submitted)]
45. 01/30/15 Bill Howe, UW 45
Join Time
• Transitive Closure
• PageRank
[Bu et al. VLDB 2010 (submitted)]
46. 01/30/15 Bill Howe, UW 46
Run Time Distribution
• Transitive Closure
• PageRank
[Bu et al. VLDB 2010 (submitted)]
47. 01/30/15 Bill Howe, UW 47
Fixpoint Evaluation
• PageRank
[Bu et al. VLDB 2010 (submitted)]
48. 01/30/15 Bill Howe, UW 48
Roadmap
• Introduction
• Context: RDBMS, MapReduce, etc.
• New Extensions for Science
  • Recursive MapReduce
  • Skew Handling
49. 01/30/15 Bill Howe, UW 49
N-body Astrophysics Simulation
• 15 years in development
• 10^9 particles
• Months to run
• 7.5 million CPU hours
• 500 timesteps
• Big Bang to now
Simulations from Tom Quinn's lab; work by Sarah Loebman, YongChul Kwon, Bill Howe, Jeff Gardner, Magda Balazinska
50. 01/30/15 Bill Howe, UW 50
Q1: Find Hot Gas
SELECT id
FROM gas
WHERE temp > 150000
51. 01/30/15 Bill Howe, UW 51
Single Node: Query 1
[Chart: Query 1 runtimes on datasets of 169 MB, 1.4 GB, and 36 GB]
[IASDS 09]
53. 01/30/15 Bill Howe, UW 53
Q4: Gas Deletion
SELECT gas1.id
FROM gas1
FULL OUTER JOIN gas2
  ON gas1.id = gas2.id
WHERE gas2.id IS NULL

Particles removed between two timesteps
[IASDS 09]
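Semantically, the anti-join above is a set difference: particles whose ids appear in timestep 1 but not in timestep 2 (the `= NULL` spelling would never match, since NULL compares unknown to everything; `IS NULL` is the test that finds unmatched rows). A sketch with made-up particle ids:

```python
# The gas-deletion anti-join as a set difference: ids present in
# timestep 1 (gas1) but absent from timestep 2 (gas2).
gas1_ids = {1, 2, 3, 4}
gas2_ids = {2, 3}

deleted = sorted(gas1_ids - gas2_ids)
print(deleted)  # [1, 4]
```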
56. 01/30/15 Bill Howe, UW 56
New Task: Scalable Clustering
• Group particles into spatial clusters
[Kwon SSDBM 2010]
57. 01/30/15 Bill Howe, UW 57
Scalable Clustering
[Kwon SSDBM 2010]
58. 01/30/15 Bill Howe, UW 58
Scalable Clustering in Dryad
[Kwon SSDBM 2010]
59. 01/30/15 Bill Howe, UW 59
Scalable Clustering in Dryad
YongChul Kwon, Dylan Nunlee, Jeff Gardner, Sarah Loebman, Magda Balazinska, Bill Howe
[Figure: clustering results, non-skewed vs. skewed dataset]
60. 01/30/15 Bill Howe, UW 60
Roadmap
• Introduction
• Context: RDBMS, MapReduce, etc.
• New Extensions for Science
  • Recursive MapReduce
  • Skew Handling
61. 01/30/15 Bill Howe, UW 61
Example: Friends of Friends
[Figure: particles in partitions P1-P4, grouped into local clusters C1-C6]
62. 01/30/15 Bill Howe, UW 62
Example: Friends of Friends
[Figure: local clustering in partitions P1-P4 yields clusters C1-C6; the merge step relabels C5 → C3 and C6 → C4, merging P1 with P3 and P2 with P4]
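The merge step above is naturally expressed with a union-find structure: each partition labels its clusters locally, and boundary particles seen by two partitions tell us which local labels denote the same global cluster (e.g. C5 → C3, C6 → C4, taken from the figure). A sketch:

```python
# Union-find sketch of merging per-partition cluster labels into
# global clusters. Cluster names follow the figure; the boundary
# observations are illustrative.
parent = {}

def find(c):
    parent.setdefault(c, c)
    while parent[c] != c:
        parent[c] = parent[parent[c]]   # path halving
        c = parent[c]
    return c

def union(c1, c2):
    parent[find(c1)] = find(c2)

# each pair: a local cluster observed to touch a neighbor's cluster
for local, neighbor in [("C5", "C3"), ("C6", "C4")]:
    union(local, neighbor)

# after merging, C5 and C3 share one global id; C6 and C4 share another
assert find("C5") == find("C3")
assert find("C6") == find("C4")
assert find("C5") != find("C6")
```

The merges themselves are cheap; as the next slide shows, the skew is in the local clustering work that precedes them.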
64. 01/30/15 Bill Howe, UW 64
What's going on?!
Example: Unbalanced Computation
[Gantt chart: local FoF tasks followed by a merge phase; most tasks finish in about 5 minutes, but the top red line runs for 1.5 hours]
65. 01/30/15 Bill Howe, UW 65
Which one is better?
• How to decompose space?
• How to schedule?
• How to avoid memory overrun?
66. 01/30/15 Bill Howe, UW 66
Optimal Partitioning Plan Non-Trivial
• Fine-grained partitions
  • Less data = less skew
  • But framework overhead dominates
• Finding the optimal point is time-consuming
• No guarantee of a successful merge phase
Can we find a good partitioning plan without trial and error?
67. 01/30/15 Bill Howe, UW 67
SkewReduce Framework
• User provides three functions
• Plus (optionally) two cost functions
S = sample of the input block; α and B are metadata about the block
68. 01/30/15 Bill Howe, UW 68
SkewReduce Framework
[Figure: pipeline Input → Partition → Process → Merge → Finalize → Output.
• Partition: builds a static plan from a sample, using a user-supplied cost function; could run offline.
• Process: produces a local result per partition, plus data at the boundary and reconciliation state.
• Merge: hierarchically reconciles local results.
• Finalize: updates local results and produces the final result.]
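The static-planning idea above can be sketched as recursive bisection driven by a cost estimate over the sample: keep splitting any block whose estimated cost exceeds a threshold. Everything here is a stand-in for the user-supplied pieces; in particular the n log n cost model and the 1-D space are hypothetical simplifications:

```python
# Sketch of sample-driven static partitioning: split a block while its
# estimated processing cost (from sampled points) exceeds a threshold.
import math

def est_cost(sample_points):
    # hypothetical cost model: roughly n log n in the sampled count;
    # SkewReduce would use the user-supplied cost function here
    n = max(len(sample_points), 1)
    return n * math.log2(n + 1)

def plan(points, threshold, lo, hi):
    """Recursively bisect the 1-D range [lo, hi) until cost fits."""
    block = [p for p in points if lo <= p < hi]
    if est_cost(block) <= threshold or hi - lo < 1e-6:
        return [(lo, hi)]
    mid = (lo + hi) / 2
    return plan(points, threshold, lo, mid) + plan(points, threshold, mid, hi)

# skewed sample: most points crowd into [0, 1)
sample = [0.01 * i for i in range(100)] + [5.0, 9.0]
partitions = plan(sample, threshold=100, lo=0.0, hi=10.0)
print(partitions)   # finer blocks where the sample is dense
```

The plan comes out finer exactly where the sampled data is dense, which is the point: skew is handled before the job runs, not discovered by trial and error.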
69. 01/30/15 Bill Howe, UW 69
Contribution: SkewReduce
• Two algorithms: a serial algorithm and a merge algorithm
• Cost functions for each algorithm
• Together they find a good partition plan and schedule
70. 01/30/15 Bill Howe, UW 70
Does SkewReduce work?
• Static plan yields 2 to 8 times faster running time

          Coarse   Fine   Finer   Finest   Manual   Opt
Hours      14.1     8.8    4.1     5.7      2.0     1.6
Minutes    87.2    63.1   77.7    98.7       -     14.1
73. 01/30/15 Bill Howe, UW 73
Visualization + Data Management
"Transferring the whole data generated … to a storage device or a visualization machine could become a serious bottleneck, because I/O would take most of the … time. A more feasible approach is to reduce and prepare the data in situ for subsequent visualization and data analysis tasks."
-- SciDAC Review
We can no longer afford two separate systems
74. 01/30/15 Bill Howe, UW 74
Converging Requirements
Core vis techniques (isosurfaces, volume rendering, …)
Emphasis on interactive performance
Mesh data as a first-class citizen
Vis DB
75. 01/30/15 Bill Howe, UW 75
Converging Requirements
Declarative languages
Automatic data-parallelism
Algebraic optimization
Vis DB
76. 01/30/15 Bill Howe, UW 76
Converging Requirements
Vis: "Query-driven Visualization"
Vis: "In Situ Visualization"
Vis: "Remote Visualization"
DB: "Push the computation to the data"
Vis DB
77. 01/30/15 Bill Howe, UW 77
Desiderata for a "VisDB"
• New data model
  • Structured and unstructured grids
• New query language
  • Scalable grid-aware operators
  • Native visualization primitives
• New indexing and optimization techniques
• "Smart" query results
• Interactive apps/dashboards
Editor's Notes
Drowning in data; starving for information
We're at war with these engineering companies. FlowCAM is bragging about the amount of data they can spray out of their device. How to use this enormous data stream to answer scientific questions is someone else's problem.
"Typical large pharmas today are generating 20 terabytes of data daily. That's probably going up to 100 terabytes per day in the next year or so."
"tens of terabytes of data per day" -- genome center at Washington University
Increase data collection exponentially with flowcam
The goal here is to make shared-nothing architectures easier to program.
Dryad is optimized for: throughput, data-parallel computation, in a private data-center.
A dryad application is composed of a collection of processing vertices (processes).
The vertices communicate with each other through channels.
The vertices and channels should always compose into a directed acyclic graph.
It provides a means of describing data with its natural structure only--that is, without superimposing any additional structure for machine representation purposes. Accordingly, it provides a basis for a high level data language which will yield maximal independence between programs on the one hand and machine representation on the other.
It turns out that you can express a wide variety of computations using only a handful of operators.
Pig compiles and optimizes the PigLatin query into map reduce jobs. It submits these jobs to the Hadoop cluster and receives the result.
Analytics and Visualization are mutually dependent
Scalability
Fault-tolerance
Exploit shared-nothing, commodity clusters
In general: Move computation to the data
Data is ending up in the cloud; we need to figure out how to use it.