Query-Driven Visualization in the Cloud with MapReduce
1. Query-Driven Visualization in the Cloud with MapReduce
Bill Howe, UW
Huy Vo, Utah
Claudio Silva, Utah
Juliana Freire, Utah
YingYi Bu, UW
2. 3/12/09 Bill Howe, UW VisTrails + GridFields
All Science is reducing to a database problem
Old model: “Query the world” (Data acquisition coupled to a specific hypothesis)
New model: “Download the world” (Data acquired en masse, independent of hypotheses)
Astronomy: High-resolution, high-frequency sky surveys (SDSS, LSST, PanSTARRS)
Medicine: ubiquitous digital records, MRI, ultrasound
Oceanography: high-resolution models, cheap sensors, satellites
Biology: lab automation, high-throughput sequencing
“Increase Data Collection Exponentially in Less Time, with FlowCAM”
Empirical X → Analytical X → Computational X → X-informatics
3. Why “Query-driven”?
Vis perspective:
query = subsetting
DB perspective:
query = manipulation, preparation, restructuring, index-building,
aggregation, regridding, downsampling, simplification,
reformatting, etc.
Database Maxims:
1. Push the computation to the data.
2. Declarative programming is a good thing.
4. Why Visualization?
Super-charged aggregation
High bandwidth of the human visual cortex
Query-writing presupposes a known goal
“What does the salt wedge look like?”
5. Why Cloud?
“Cloud”?
Software as a Service (SaaS)
Infrastructure as a Service (IaaS)
Platform as a Service (PaaS)
Working definition:
General, elastic, data-intensive, scalable computing
This work: Vis techniques + DB techniques in the Cloud
6. Visualization + Data Management
“Transferring the whole data generated … to a storage device or a visualization
machine could become a serious bottleneck, because I/O would take most of the …
time. A more feasible approach is to reduce and prepare the data in situ for
subsequent visualization and data analysis tasks.”
-- SciDAC Review
We can no longer afford two separate systems
7. Converging Requirements
Core vis techniques (isosurfaces, volume rendering, …)
Emphasis on interactive performance
Mesh data as a first-class citizen
Vis DB
8. Converging Requirements
Declarative languages
Automatic data-parallelism
Algebraic optimization
Vis DB
9. Converging Requirements
Vis: “Query-driven Visualization”
Vis: “In Situ Visualization”
Vis: “Remote Visualization”
DB: “Push the computation to the data”
Vis DB
10. Thesis
We can no longer afford to build separate
visualization and data management systems
Data is increasingly destined for the cloud
At least two approaches:
1. Build “cloud” Vis platform with DM capabilities
2. Extend “cloud” DM platforms with Vis capabilities
We are assessing option 2.
11. This Talk
Brief Technology Review
Relational Databases
MapReduce: Data-Intensive Scalable Programming
GridFields: Mesh Algebra
VisTrails: Workflow and Provenance
Preliminary Results with Hadoop/MapReduce
Climatology queries on a shared cloud
Core vis algorithms on a private cluster
12. Relational Database History
Pre-Relational: if your data changed, your application broke.
Early RDBMS were buggy and slow (and often reviled), but
required only 5% of the application code.
“Activities of users at terminals and most application programs
should remain unaffected when the internal representation of data
is changed and even when some aspects of the external
representation are changed.”
-- Codd 1970
Key Ideas: Programs that manipulate tabular data exhibit an
algebraic structure allowing reasoning and manipulation
independently of physical data representation
13. Key Idea: Algebraic Optimization
N = ((z*2)+((z*3)+0))/1
Algebraic Laws:
1. (+) identity: x+0 = x
2. (/) identity: x/1 = x
3. (*) distributes: (n*x+n*y) = n*(x+y)
4. (*) commutes: x*y = y*x
Apply rules 1, 3, 4, 2:
N = (2+3)*z
two operations instead of five, no division operator
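The rewrite above can be sketched in a few lines. This is a minimal illustration, assuming a tuple-based expression tree; the representation and the function names `rewrite` and `count_ops` are hypothetical and not taken from any system in this talk.

```python
# Expressions are nested tuples (op, left, right); leaves are numbers or variable names.

def count_ops(e):
    """Number of operator nodes in the expression tree."""
    if not isinstance(e, tuple):
        return 0
    return 1 + count_ops(e[1]) + count_ops(e[2])

def rewrite(e):
    """Apply the four algebraic laws in one bottom-up pass."""
    if not isinstance(e, tuple):
        return e
    op, a, b = e[0], rewrite(e[1]), rewrite(e[2])
    if op == "+" and b == 0:          # law 1: x + 0 = x
        return a
    if op == "/" and b == 1:          # law 2: x / 1 = x
        return a
    if (op == "+" and isinstance(a, tuple) and isinstance(b, tuple)
            and a[0] == "*" and b[0] == "*" and a[1] == b[1]):
        # law 3: n*x + n*y = n*(x + y), then law 4 to write the sum first
        return ("*", ("+", a[2], b[2]), a[1])
    return (op, a, b)

# N = ((z*2) + ((z*3) + 0)) / 1
n = ("/", ("+", ("*", "z", 2), ("+", ("*", "z", 3), 0)), 1)
optimized = rewrite(n)
print(count_ops(n), "->", count_ops(optimized), optimized)
```

The optimizer never evaluates the expression; it only inspects its structure, which is the property a query optimizer relies on.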
14. Key Idea: An Algebra of Tables
select
project
join join
Other operators: aggregate, union, difference, cross product
15. GridFields: An Algebra for Unstructured Grids
unstructured grids model
complex domains at multiple
scales simultaneously…
…but complicate processing
Columbia River Estuary
red = high salinity (~34 psu)
blue = fresh water (~0 psu)
16. GridFields: Data Model
x     y     salt  temp
13.8  10.6  29.4  12.1
13.9  9.4   29.8  12.5
14.3  9.0   28.0  12.0
13.4  9.0   30.1  13.2

flux  area
11.5  3.3
13.9  5.5
13.1  4.5
17. GridFields: Operators
Lifted set operations: Union, Intersection, Cross Product
Scan/Bind: Read a grid/attribute from disk
Restrict: Remove cells that do not satisfy a predicate
Accrete: “Grow” a grid by including neighbors of cells
Regrid: Map the data of one grid onto another
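To make two of these operators concrete, here is a toy sketch of Restrict and Regrid over a dictionary of cells. It is not the GridFields API: the function signatures, the `coarse` attribute, and the data values are all made up for illustration.

```python
from collections import defaultdict

def restrict(predicate, cells):
    """Restrict: keep only cells whose attributes satisfy the predicate."""
    return {cid: attrs for cid, attrs in cells.items() if predicate(attrs)}

def regrid(source, target_ids, assign, attr, aggregate):
    """Regrid: map data from source cells onto target cells.

    assign(cell_id, attrs) names the target cell each source cell maps to;
    aggregate combines the values that land on the same target cell.
    """
    buckets = defaultdict(list)
    for cid, attrs in source.items():
        buckets[assign(cid, attrs)].append(attrs[attr])
    return {tid: aggregate(buckets[tid]) for tid in target_ids if tid in buckets}

def mean(values):
    return sum(values) / len(values)

# Toy data (hypothetical): four fine cells, each tagged with its coarse cell.
fine = {
    0: {"coarse": "a", "salt": 30.0},
    1: {"coarse": "a", "salt": 28.0},
    2: {"coarse": "b", "salt": 10.0},
    3: {"coarse": "b", "salt": 2.0},
}
salty = restrict(lambda c: c["salt"] > 5.0, fine)
coarse_salt = regrid(salty, ["a", "b"], lambda cid, c: c["coarse"], "salt", mean)
print(coarse_salt)
```

Regrid is parameterized by an assignment function and an aggregation function, which is what lets the same operator express averaging, downsampling, and similar mappings.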
18. GridFields: Query Algebra
⊗
H0 : (x,y,b) V0 : (z)
A
restrict(0, z > b)
B
color is depth
Algebraic Manipulation of Scientific Datasets,
B. Howe, D. Maier, VLDBJ 2005
⊗
H0 : (x,y,b) V0 : (σ)
apply(0, z = (surf − b) * σ)
bind(0, surf)
C
color is salinity
22. UW + Utah CluE Program
Goals
10+-year “climatologies” at interactive speeds
…with provenance, reproducibility, collaboration
…on a shared-nothing, commodity platform
In general: Explore the intersection of scientific
databases and scientific visualization, at scale
Methods
“Cloud-Enable” two projects
GridFields: Query algebra for mesh data
VisTrails: Scientific workflow and provenance
23. Why MapReduce?
Need to scale to hundreds or thousands of CPUs
Parallel databases expensive, proprietary, difficult
Not shown to scale to thousands of computers
MapReduce is a lightweight framework providing automatic
Data parallelism
Fault-tolerance
I/O scheduling
25. Some distributed algorithm…
Map
(Shuffle)
Reduce
26. MapReduce Programming Model
Input & Output: each a set of key/value pairs
Programmer specifies two functions:
map (in_key, in_value) -> list(out_key, intermediate_value)
Processes input key/value pair
Produces set of intermediate pairs
reduce (out_key, list(intermediate_value)) -> list(out_value)
Combines all intermediate values for a particular key
Produces a set of merged output values (usually just one)
slide source: Google, Inc.
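The model can be simulated in a single process. This is a minimal sketch, assuming an in-memory list of input pairs; `run_mapreduce` and the word-count functions are illustrative names, and a real framework would distribute each phase across machines.

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Run map, shuffle, and reduce in a single process."""
    # Map: each input pair yields zero or more intermediate pairs.
    intermediate = []
    for in_key, in_value in records:
        intermediate.extend(map_fn(in_key, in_value))
    # Shuffle: group intermediate values by key.
    groups = defaultdict(list)
    for out_key, value in intermediate:
        groups[out_key].append(value)
    # Reduce: merge the values for each key.
    return {key: reduce_fn(key, values) for key, values in sorted(groups.items())}

def count_words_map(doc_id, text):
    return [(word, 1) for word in text.split()]

def count_words_reduce(word, counts):
    return [sum(counts)]

docs = [("d1", "salt flux salt"), ("d2", "flux")]
result = run_mapreduce(docs, count_words_map, count_words_reduce)
print(result)
```

The programmer writes only the two functions; grouping by key, parallelism, and recovery belong to the framework.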
27. This Talk
Brief Technology Review
Relational Databases
MapReduce: Data-Intensive Scalable Programming
GridFields: Mesh Algebra
VisTrails: Workflow and Provenance
Preliminary Results with Hadoop/MapReduce
Climatology queries on a shared cloud
Core vis algorithms on a private cluster
28. Application Domain: Oceanography
[Vis movie]
Key idea: Zooplankton correlated with temperature
29. Example Query: Climatology
[Figure (animation): Average Surface Salinity by Month, Columbia River Plume 1999-2006. Panels: Feb, May. Color scale: psu. Map labels: Columbia River, Washington, Oregon.]
36. Conclusions
Converging requirements in DB and Vis communities
Motivation exists for a “VisDB”
declarative query + high-performance vis, at full scale
We are evaluating Hadoop as a “substrate” for a VisDB
Scalability and reduced development time are promising
Interactive performance requires some changes
37. Acknowledgments
http://escience.washington.edu
39. Shared Nothing Parallel Databases
Teradata
Greenplum
Netezza
Aster Data Systems
DATAllegro (Microsoft)
Vertica
MonetDB (recently commercialized as “Vectorwise”)
40. Taxonomy of Parallel Architectures
Easiest to program, but
$$$$
Scales to 1000s of nodes
41. VisTrails
screenshot: VisTrails, Claudio Silva, Juliana Freire, et al., University of Utah
42. Version Tree
screenshot: VisTrails, Claudio Silva, Juliana Freire, et al., University of Utah
43. Collaboration
Bill Howe @ UW computes salt flux using GridFields
Erik Anderson @ Utah adds vector streamlines and adjusts opacity
Bill Howe @ UW adds an isosurface of salinity
Peter Lawson adds discussion of the scientific interpretation
Howe et al., eScience 2008
44. Preliminary results
Managing Hadoop jobs with VisTrails
GridField queries in Hadoop
Core Visualization algorithms in Hadoop
46. Hadoop in VisTrails
Wrap Hadoop Streaming/HDFS Operations
Plug “PreProcess” to actual Vis Pipeline
47. Hadoop in VisTrails
Provenance and Monitoring
48. Preliminary results
Managing Hadoop jobs with VisTrails
GridField queries in Hadoop
Core Visualization algorithms in Hadoop
50. Key Idea: Declarative Languages
SELECT *
FROM Order o, Item i
WHERE o.item = i.item
AND o.date = today()
join
select
scan scan
date = today()
o.item = i.item
Order o    Item i
Find all orders from today, along with the items ordered
51. Example System: Teradata
AMP = unit of parallelism
52. Example System: Teradata
AMP 1, AMP 2, AMP 3 (each): scan Order o → select date=today() → hash h(item)
Hashed rows are routed to AMP 4, AMP 5, AMP 6
53. Example System: Teradata
AMP 1, AMP 2, AMP 3 (each): scan Item i → hash h(item)
Hashed rows are routed to AMP 4, AMP 5, AMP 6
54. Example System: Teradata
AMP 4, AMP 5, AMP 6 (each): join o.item = i.item
AMP 4 contains all orders and all lines where hash(item) = 1
AMP 5 contains all orders and all lines where hash(item) = 2
AMP 6 contains all orders and all lines where hash(item) = 3
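The three Teradata slides describe a hash-partitioned parallel join. Below is a single-process sketch of the same idea, assuming rows are plain dictionaries; the function names, column names, and sample rows are hypothetical, and workers are simulated by list partitions rather than real AMPs.

```python
from collections import defaultdict

def partition(rows, key, n_workers):
    """Route each row to a worker by hashing its join key."""
    parts = defaultdict(list)
    for row in rows:
        parts[hash(row[key]) % n_workers].append(row)
    return parts

def local_join(orders, items):
    """Hash join on one worker: build on items, probe with orders."""
    by_item = defaultdict(list)
    for i in items:
        by_item[i["item"]].append(i)
    return [{**o, **i} for o in orders for i in by_item[o["item"]]]

def parallel_join(orders, items, today, n_workers=3):
    # Phase 1 (AMPs 1-3 on the slides): select today's orders, then hash on item.
    todays = [o for o in orders if o["date"] == today]
    order_parts = partition(todays, "item", n_workers)
    # Phase 2: hash the Item table the same way.
    item_parts = partition(items, "item", n_workers)
    # Phase 3 (AMPs 4-6): each worker joins only its own partition.
    out = []
    for w in range(n_workers):
        out.extend(local_join(order_parts[w], item_parts[w]))
    return out

orders = [
    {"order_id": 1, "item": 10, "date": "2009-03-12"},
    {"order_id": 2, "item": 11, "date": "2009-03-11"},
    {"order_id": 3, "item": 10, "date": "2009-03-12"},
]
items = [{"item": 10, "name": "sensor"}, {"item": 11, "name": "buoy"}]
joined = parallel_join(orders, items, "2009-03-12")
print(joined)
```

Because both tables are hashed on the same key, matching rows always land on the same worker, so no worker needs to see another worker's data.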
55. Workflow Execution Plans
Need execution plans spanning client/server/cloud
56. Example: Isosurface Browsing
57. Example: Isosurface Browsing
Plan A
Subset Subset Subset Subset
tstep 0 tstep 1 tstep 2 tstep 3
58. Example: Isosurface Browsing
Plan B: Build an index
Build Index, e.g., an Interval Tree (Cignoni 97)
Subset Subset Subset
tstep 0 tstep 1 tstep 2 tstep 3
Subset
Render
Isosurface Isosurface Isosurface Isosurface
Render Render Render
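Plan B pays an up-front cost so that each later isovalue touches only the cells the surface crosses. The sketch below illustrates that trade-off with a simpler stand-in, a list sorted by each cell's minimum value, rather than the interval tree the slide cites; the class name and the cell ranges are hypothetical.

```python
import bisect

class SortedIntervalIndex:
    """Index cells by their (min, max) scalar range; built once, queried often."""

    def __init__(self, cell_ranges):
        # cell_ranges: {cell_id: (min_value, max_value)}
        self._entries = sorted((lo, hi, cid) for cid, (lo, hi) in cell_ranges.items())
        self._mins = [lo for lo, _, _ in self._entries]

    def active_cells(self, isovalue):
        """Cells whose value range contains the isovalue."""
        # Binary search skips every cell whose minimum is above the isovalue.
        end = bisect.bisect_right(self._mins, isovalue)
        return [cid for lo, hi, cid in self._entries[:end] if hi >= isovalue]

ranges = {"c0": (0.0, 5.0), "c1": (4.0, 12.0), "c2": (20.0, 34.0)}
index = SortedIntervalIndex(ranges)
print(index.active_cells(4.5), index.active_cells(25.0), index.active_cells(15.0))
```

A true interval tree also bounds the work spent on cells whose maximum is too low; the sorted list is used here only to keep the example short.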
59. Example: Isosurface Browsing
Plan C: Build a spatial index to support panning
Plan D: Build a multi-resolution index to support zoom
…and so on
Why not precompute all appropriate indexes?
Some will (partially) reside on client
Storage is not as cheap as we pretend
Need a flexible system where
a “query result” can be explored interactively, and
we prepare for similar queries
similarity defined by natural “browsing patterns” in visualization systems
61. Why MapReduce/Hadoop?
Popular
AWS Elastic MapReduce
100s of startups
# of downloads
# of blog posts
Free as in Speech
Free as in Beer
Flexible, Lightweight
Scalable
Fault-tolerant
69. As a GridField Expression
⊗
H0 : (x,y,b) V0 : (σ)
apply(0, z = (surf − b) * σ)
bind(0, surf)
C
H = Scan(contxt, "H")
rH = Restrict("(326<x) & (x<345) & (287<y) & (y<302)", 0, H)
T = Scan(contxt, "T")
V = Scan(contxt, "V")
HxV = Cross(H, V)
HxVxT = Cross(HxV, T)
salt = Bind(contxt, HxVxT, "salt")
onemonth = Regrid(salt, HxV, equijoin("hpos,vpos"), avg())
70. As a SQL Query
Select hpos, vpos, avg(salt)
from ocean
group by hpos, vpos
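The same group-by average can be written as a mapper and a reducer in the line-oriented style that Hadoop Streaming uses. This is a sketch under assumptions: the whitespace-separated record layout (hpos, vpos, time, salt) and the sample values are hypothetical, and the shuffle is simulated here with `sorted`.

```python
from itertools import groupby

def mapper(lines):
    """Emit 'hpos,vpos<TAB>salt' for each input record."""
    for line in lines:
        hpos, vpos, _time, salt = line.split()
        yield f"{hpos},{vpos}\t{salt}"

def reducer(sorted_lines):
    """Average salt per key; input must be sorted by key, as after the shuffle."""
    pairs = (line.split("\t") for line in sorted_lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        values = [float(v) for _, v in group]
        yield f"{key}\t{sum(values) / len(values)}"

records = ["1 0 t0 30.0", "1 0 t1 32.0", "2 0 t0 10.0"]
output = list(reducer(sorted(mapper(records))))
print(output)
```

The grouping key plays the role of the `group by` columns, and the reducer plays the role of `avg()`.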
71. Scientific Workflow Systems
Value proposition: More time on science, less time on code
How: By providing language features emphasizing sharing,
reuse, reproducibility, rapid prototyping, efficiency
Provenance
Visual programming
Caching
Integration with domain-specific tools
Scheduling
72. Related Vis Work
Parallel visualization systems
ParaView, VisIt
Query-Driven Visualization
[Bethel et al 2006,2008,2009]
FastBit Index
[Shoshani et al 2007]
DB Vis systems
Tableau
73. Feeding the Pipeline
source: Ken Moreland
missing step?
75. Role 2: Move Computation to the Data
“Transferring the whole data generated … to a storage device or a
visualization machine could become a serious bottleneck, because I/O
would take most of the … time. A more feasible approach is to reduce
and prepare the data in situ for subsequent visualization and data
analysis tasks.”
-- SciDAC Review
76. Remote Visualization
Reduce and render remotely, transfer images
++ transfers less data
-- specialized hardware, high load
Reduce remotely, transfer data/geometry, render locally
++ uses local graphics pipeline
-- transfers more data
78. Scientific Vis System Roundup
General
ParaView [Kitware, Los Alamos, Sandia]
VisIt [LLNL]
Specialized
SALSA, particles, Quinn, UW
VISUS, streaming/progressive, Jones, LLNL
SAGE,
Hyperwall, tiled display, NASA
Editor's notes
Drowning in data; starving for information
We’re at war with these engineering companies. FlowCAM is bragging about the amount of data they can spray out of their device. How to use this enormous data stream to answer scientific questions is someone else’s problem.
Visualization is a more efficient way to query data -- you can browse and explore.
But you need to be able to switch back and forth between interactive browsing and symbolic querying
Need to consider private clouds
Not just renting hardware: general-purpose data processing
Analytics and Visualization are mutually dependent
Scalability
Fault-tolerance
Exploit shared-nothing, commodity clusters
In general: Move computation to the data
Data is ending up in the cloud; we need to figure out how to use it.
It provides a means of describing data with its natural structure only--that is, without superimposing any additional structure for machine representation purposes. Accordingly, it provides a basis for a high level data language which will yield maximal independence between programs on the one hand and machine representation on the other.
It turns out that you can express a wide variety of computations using only a handful of operators.
Bringing the computation to the data may soon mean implementing your algorithm in Hadoop
The storage layer may soon be equipped with Hadoop instead of just a parallel filesystem
Need to learn how to push your computation down into the Hadoop layer.
Growing popularity
Cluster Exploratory program: Google, IBM, NSF
Amazon Web Services: Elastic MapReduce
Hive, Pig, Cascading, Mahout
lots of startups (e.g., Cloudera)
This movie was rendered offline, but it’s increasingly important to be able to create visualizations on the fly to allow interactive exploration of large datasets.
Data-intensive science
The goal here is to make shared-nothing architectures easier to program.
We wrap only the Hadoop Streaming interface in VisTrails, with additional support for HDFS operations to upload and download the data and libraries a job needs.
The Hadoop Streaming module is plugged into a local VTK rendering pipeline that grabs data from the cloud and generates an animation on the VisTrails Spreadsheet.
Users can specify their own Python source as the mapper or reducer. In this case, a VTK script is specified in the mapper. The VTK libraries are shipped along with the code to the compute nodes, using the -cacheArchive option of Hadoop Streaming.
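As a rough illustration of what such a wrapper has to assemble, the sketch below builds a Hadoop Streaming command line. It is not the VisTrails module: the function name, jar location, script names, and HDFS paths are all hypothetical; only the option names (-input, -output, -mapper, -reducer, -file, -cacheArchive) are standard Hadoop Streaming options of that era.

```python
def streaming_command(streaming_jar, input_path, output_path,
                      mapper, reducer, files=(), cache_archives=()):
    """Build the argv list for a Hadoop Streaming job."""
    cmd = ["hadoop", "jar", streaming_jar,
           "-input", input_path, "-output", output_path,
           "-mapper", mapper, "-reducer", reducer]
    for path in files:               # scripts shipped with the job
        cmd += ["-file", path]
    for uri in cache_archives:       # archives unpacked on each compute node
        cmd += ["-cacheArchive", uri]
    return cmd

# Hypothetical paths, for illustration only.
cmd = streaming_command(
    "hadoop-streaming.jar", "/data/salt", "/out/isosurface",
    "vtk_mapper.py", "reducer.py",
    files=["vtk_mapper.py", "reducer.py"],
    cache_archives=["hdfs:///libs/vtk.tgz#vtk"],
)
print(" ".join(cmd))
```

The resulting list could be handed to `subprocess.run` on a machine with a Hadoop client installed.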
By default, Hadoop logs are written to the standard output of the VisTrails application, and a job is killed by terminating the program and running an extra command returned by Hadoop. However, one can plug a HadoopTrackerCell into the end of the pipeline to monitor log messages on the VisTrails Spreadsheet. There are also buttons to kill the job or to show the Job Tracker, which automatically connects through the CLuE-specific proxy to show additional logs and error messages for jobs.
Need to assign workflows to resources for execution in a heterogeneous compute environment. Parts of this workflow can be compiled into Hadoop jobs, parts should be run locally so that they exploit hardware acceleration.
But this is not just computation placement -- there are different execution plans, similar to relational execution plans.
GridFields expressions can be algebraically optimized, for example.
We can’t just precompute the indexes, since they may reside on
Upper left: Average
Sweeping through the velocity fields quickly exposed the location of the “upstream” salt flux -- where salty water made its way back upstream.