Query-Driven Visualization in
the Cloud with MapReduce
Bill Howe, UW
Huy Vo, Utah
Claudio Silva, Utah
Juliana Freire, Utah
YingYi Bu, UW
3/12/09 Bill Howe, UW 2VisTrails + GridFields
All Science is reducing to a database problem
Old model: “Query the world” (Data acquisition coupled to a specific hypothesis)
New model: “Download the world” (Data acquired en masse, independent of hypotheses)
 Astronomy: High-resolution, high-frequency sky surveys (SDSS, LSST, PanSTARRS)
 Medicine: ubiquitous digital records, MRI, ultrasound
 Oceanography: high-resolution models, cheap sensors, satellites
 Biology: lab automation, high-throughput sequencing
“Increase Data Collection Exponentially in Less Time, with FlowCAM”
Empirical X → Analytical X → Computational X → X-informatics
Why “Query-driven”?
 Vis perspective:
 query = subsetting
 DB perspective:
 query = manipulation, preparation, restructuring, index-building,
aggregation, regridding, downsampling, simplification,
reformatting, etc.
Database Maxims:
1. Push the computation to the data.
2. Declarative programming is a good thing.
Why Visualization?
 Super-charged aggregation
 High bandwidth of the human visual cortex
 Query-writing presupposes a known goal
“What does the salt wedge look like?”
Why Cloud?
 “Cloud”?
 Software as a Service (SaaS)
 Infrastructure as a Service (IaaS)
 Platform as a Service (PaaS)
 Working definition:
General, elastic, data-intensive, scalable computing
This work: Vis techniques + DB techniques in the Cloud
Visualization + Data Management
“Transferring the whole data generated … to a storage device or a visualization
machine could become a serious bottleneck, because I/O would take most of the …
time. A more feasible approach is to reduce and prepare the data in situ for
subsequent visualization and data analysis tasks.”
-- SciDAC Review
We can no longer afford two separate systems
Converging Requirements
Core vis techniques (isosurfaces, volume rendering, …)
Emphasis on interactive performance
Mesh data as a first-class citizen
Vis DB
Converging Requirements
Declarative languages
Automatic data-parallelism
Algebraic optimization
Vis DB
Converging Requirements
Vis: “Query-driven Visualization”
Vis: “In Situ Visualization”
Vis: “Remote Visualization”
DB: “Push the computation to the data”
Vis DB
Thesis
 We can no longer afford to build separate
visualization and data management systems
 Data is increasingly destined for the cloud
 At least two approaches:
1. Build “cloud” Vis platform with DM capabilities
2. Extend “cloud” DM platforms with Vis capabilities
 We are assessing option 2.
This Talk
 Brief Technology Review
 Relational Databases
 MapReduce: Data-Intensive Scalable Programming
 GridFields: Mesh Algebra
 VisTrails: Workflow and Provenance
 Preliminary Results with Hadoop/MapReduce
 Climatology queries on a shared cloud
 Core vis algorithms on a private cluster
Relational Database History
Pre-Relational: if your data changed, your application broke.
Early RDBMS were buggy and slow (and often reviled), but
required only 5% of the application code.
“Activities of users at terminals and most application programs
should remain unaffected when the internal representation of data
is changed and even when some aspects of the external
representation are changed.”
-- Codd 1970
Key Ideas: Programs that manipulate tabular data exhibit an
algebraic structure allowing reasoning and manipulation
independently of physical data representation
Key Idea: Algebraic Optimization
N = ((z*2)+((z*3)+0))/1
Algebraic Laws:
1. (+) identity: x+0 = x
2. (/) identity: x/1 = x
3. (*) distributes: (n*x+n*y) = n*(x+y)
4. (*) commutes: x*y = y*x
Apply rules 1, 3, 4, 2:
N = (2+3)*z
two operations instead of five, no division operator
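The rewrite above can be sketched as a tiny rule-based optimizer. This is only an illustration under assumptions: the tuple encoding of expressions and the function names (rewrite, count_ops, evaluate) are hypothetical and not taken from any database system.

```python
# Expressions are nested tuples (op, left, right); leaves are numbers or variable names.

def rewrite(expr):
    """Simplify bottom-up using the four laws listed above."""
    if not isinstance(expr, tuple):
        return expr
    op, left, right = expr
    left, right = rewrite(left), rewrite(right)
    if op == "+" and right == 0:          # law 1: x + 0 = x
        return left
    if op == "/" and right == 1:          # law 2: x / 1 = x
        return left
    if (op == "+" and isinstance(left, tuple) and isinstance(right, tuple)
            and left[0] == "*" and right[0] == "*" and left[1] == right[1]):
        # law 3: n*x + n*y = n*(x + y); law 4 then moves the sum to the front
        return ("*", ("+", left[2], right[2]), left[1])
    return (op, left, right)

def count_ops(expr):
    """Count the operators in an expression tree."""
    if not isinstance(expr, tuple):
        return 0
    return 1 + count_ops(expr[1]) + count_ops(expr[2])

def evaluate(expr, env):
    """Evaluate an expression, looking variable names up in env."""
    if isinstance(expr, tuple):
        op, left, right = expr
        a, b = evaluate(left, env), evaluate(right, env)
        if op == "+":
            return a + b
        if op == "*":
            return a * b
        return a / b
    return env.get(expr, expr)

# N = ((z*2)+((z*3)+0))/1
original = ("/", ("+", ("*", "z", 2), ("+", ("*", "z", 3), 0)), 1)
optimized = rewrite(original)
print(optimized, count_ops(original), count_ops(optimized))
```

The optimized tree has the same value as the original for any z, but the optimizer never needed to know z: it reasoned only from the algebraic laws.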
Key Idea: An Algebra of Tables
select
project
join join
Other operators: aggregate, union, difference, cross product
GridFields: An Algebra for Unstructured Grids
Unstructured grids model complex domains at multiple
scales simultaneously...
...but complicate processing
[Figure: Columbia River Estuary; red = high salinity (~34 psu), blue = fresh water (~0 psu)]
GridFields: Data Model
x y salt temp
13.8 10.6 29.4 12.1
13.9 9.4 29.8 12.5
14.3 9.0 28.0 12.0
13.4 9.0 30.1 13.2
flux area
11.5 3.3
13.9 5.5
13.1 4.5
GridFields: Operators
 Lifted set operations
 Union, Intersection, Cross Product
 Scan/Bind
 Read a grid/attribute from disk
 Restrict
 Remove cells that do not satisfy a predicate
 Accrete
 “Grow” a grid by including neighbors of cells
 Regrid
 Map the data of one grid onto another
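A toy illustration of two of the operators listed above, Restrict and Accrete. This is not the GridFields API: the dictionary layout and the names cells, neighbors, restrict, and accrete are hypothetical, and the four-cell chain is invented for the example.

```python
# Toy stand-in for a gridfield: cell id -> attributes, plus a neighbor map.
# NOT the GridFields API; every name here is a hypothetical illustration.

cells = {
    0: {"salt": 29.4},
    1: {"salt": 29.8},
    2: {"salt": 28.0},
    3: {"salt": 30.1},
}
neighbors = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}   # a chain of four cells

def restrict(cell_ids, predicate):
    """Keep only the cells whose attributes satisfy the predicate."""
    return {c for c in cell_ids if predicate(cells[c])}

def accrete(cell_ids):
    """Grow a selection by adding the neighbors of every selected cell."""
    grown = set(cell_ids)
    for c in cell_ids:
        grown |= neighbors[c]
    return grown

salty = restrict(cells, lambda attrs: attrs["salt"] > 30.0)
halo = accrete(salty)
print(salty, halo)
```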
GridFields: Query Algebra
[Diagram: two GridField expression trees]
Tree 1 (color is depth): H0 : (x,y,b) ⊗ V0 : (z) gives A; restrict(0, z > b) gives B
Tree 2 (color is salinity): H0 : (x,y,b) ⊗ V0 : (σ); bind(0, surf); apply(0, z = (surf − b) * σ) gives C
Algebraic Manipulation of Scientific Datasets,
B. Howe, D. Maier, VLDBJ 2005
GridFields: Optimization
[Chart: running time in secs (axis 0 to 45) for vtk(3D), interpolate, simple, interp_o, simple_o]
But: only an 800 MB dataset
GridFields + VisTrails
UW + Utah CluE Program
 Goals
 10+-year “climatologies” at interactive speeds
 …with provenance, reproducibility, collaboration
…on a shared-nothing, commodity platform
 In general: Explore the intersection of scientific
databases and scientific visualization, at scale
 Methods
 “Cloud-Enable” two projects
  GridFields: Query algebra for mesh data
  VisTrails: Scientific workflow and provenance
Why MapReduce?
 Need to scale to hundreds or thousands of CPUs
 Parallel databases expensive, proprietary, difficult
 Not shown to scale to thousands of computers
 MapReduce is a lightweight framework providing automatic
 Data parallelism
 Fault-tolerance
 I/O scheduling
Why Hadoop?
paraview
hadoop
Some distributed algorithm…
Map
(Shuffle)
Reduce
MapReduce Programming Model
 Input & Output: each a set of key/value pairs
 Programmer specifies two functions:
map (in_key, in_value) -> list(out_key, intermediate_value)
 Processes input key/value pair
 Produces set of intermediate pairs
reduce (out_key, list(intermediate_value)) -> list(out_value)
 Combines all intermediate values for a particular key
 Produces a set of merged output values (usually just one)
slide source: Google, Inc.
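A minimal single-process sketch of the model, assuming nothing about Hadoop itself: map_reduce, word_map, and word_reduce are hypothetical names, and the reduce function here returns one merged value per key (the "usually just one" case) rather than a list.

```python
from collections import defaultdict

def map_reduce(records, map_fn, reduce_fn):
    """Run map, shuffle (group by key), and reduce in a single process."""
    groups = defaultdict(list)
    for in_key, in_value in records:
        for out_key, intermediate in map_fn(in_key, in_value):
            groups[out_key].append(intermediate)       # the shuffle step
    return {key: reduce_fn(key, values) for key, values in groups.items()}

def word_map(doc_id, text):
    """Emit (word, 1) for every word in the input value."""
    return [(word, 1) for word in text.split()]

def word_reduce(word, counts):
    """Merge all intermediate values for one key into a single output."""
    return sum(counts)

counts = map_reduce(
    [("d1", "salt wedge salt"), ("d2", "wedge")], word_map, word_reduce
)
print(counts)
```

A real framework runs the map calls and the reduce calls on different machines and handles the shuffle, scheduling, and failures; the programmer still writes only the two functions.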
This Talk
 Brief Technology Review
 Relational Databases
 MapReduce: Data-Intensive Scalable Programming
 GridFields: Mesh Algebra
 VisTrails: Workflow and Provenance
 Preliminary Results with Hadoop/MapReduce
 Climatology queries on a shared cloud
 Core vis algorithms on a private cluster
Application Domain: Oceanography
[Vis movie]
Key idea: Zooplankton correlated with temperature
Example Query: Climatology
Average Surface Salinity by Month
Columbia River Plume 1999-2006
[Figure (animation): monthly maps, e.g., Feb and May, of the Columbia River plume off Washington and Oregon; color scale in psu]
CluE Query Results
CluE: 400 node shared Hadoop platform provided by Google, IBM, NSF
4-year-old commodity hardware, suspect IO performance
Preliminary results
 Managing Hadoop jobs with VisTrails
 GridField queries in Hadoop
 Core Visualization algorithms in Hadoop
Core Vis Algorithms in MapReduce
 Scalar/Volume Rendering
 Map: Rasterization
 Reduce: Compositing, blending
 Isosurface Extraction
 Map: Isosurface Extraction
 Reduce: Combine like isovalues
 Mesh Simplification
 Map: Bin vertices
 Reduce: Collapse binned triangles
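A rough sketch of the first decomposition above (Map: rasterization, Reduce: compositing). It makes simplifying assumptions that are not from the talk: the input is a list of point samples rather than triangles, compositing keeps only the nearest opaque fragment instead of blending, and the names rasterize, composite, and render are hypothetical.

```python
from collections import defaultdict

WIDTH, HEIGHT = 4, 4   # tiny image, chosen only for illustration

def rasterize(point):
    """Map: project a sample onto the pixel grid and key it by pixel."""
    x, y, depth, color = point
    pixel = (int(x * WIDTH), int(y * HEIGHT))   # x, y assumed in [0, 1)
    return [(pixel, (depth, color))]

def composite(pixel, fragments):
    """Reduce: keep the color of the fragment nearest the viewer."""
    depth, color = min(fragments)
    return color

def render(points):
    """Group fragments by pixel (the shuffle), then composite each pixel."""
    by_pixel = defaultdict(list)
    for p in points:
        for pixel, fragment in rasterize(p):
            by_pixel[pixel].append(fragment)
    return {pixel: composite(pixel, frags) for pixel, frags in by_pixel.items()}

points = [
    (0.1, 0.1, 5.0, "red"),
    (0.2, 0.2, 2.0, "blue"),    # same pixel as red, but closer
    (0.9, 0.9, 1.0, "green"),
]
image = render(points)
print(image)
```

Keying the map output by pixel is what lets the framework's shuffle deliver every fragment for one pixel to the same reducer.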
ATLAS dataset
Rendering (Preliminary)
[Chart: # of mappers; 57-node Nehalem]
Isosurface Extraction (Preliminary)
[Chart: labels 32, 48, 64, 96, 128]
Conclusions
 Converging requirements in DB and Vis communities
 Motivation exists for a “VisDB”
 declarative query + high-performance vis, at full scale
 We are evaluating Hadoop as a “substrate” for a VisDB
 Scalability and reduced development time are promising
 Interactive performance requires some changes
Acknowledgments
http://escience.washington.edu
BACKUP SLIDES
Shared Nothing Parallel Databases
 Teradata
 Greenplum
 Netezza
 Aster Data Systems
 Datallegro (Microsoft)
 Vertica
 MonetDB (recently commercialized as “Vectorwise”)
Taxonomy of Parallel Architectures
Easiest to program, but
$$$$
Scales to 1000s of nodes
screenshot: VisTrails, Claudio Silva, Juliana Freire, et al., University of Utah
VisTrails
screenshot: VisTrails, Claudio Silva, Juliana Freire, et al., University of Utah
Version Tree
Collaboration
Bill Howe @ UW computes salt flux using GridFields
Erik Anderson @ Utah adds vector streamlines and adjusts opacity
Bill Howe @ UW adds an isosurface of salinity
Peter Lawson adds discussion of the scientific interpretation
Howe et al., eScience 2008
Preliminary results
 Managing Hadoop jobs with VisTrails
 GridField queries in Hadoop
 Core Visualization algorithms in Hadoop
Hadoop in VisTrails
 Wrap Hadoop Streaming/HDFS Operations
 Plug “PreProcess” to actual Vis Pipeline
Hadoop in VisTrails
 Provenance and Monitoring
Preliminary results
 Managing Hadoop jobs with VisTrails
 GridField queries in Hadoop
 Core Visualization algorithms in Hadoop
Key Idea: Declarative Languages
SELECT *
FROM Order o, Item i
WHERE o.item = i.item
AND o.date = today()
join (o.item = i.item)
  select (date = today())
    scan (Order o)
  scan (Item i)
Find all orders from today, along with the items ordered
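One way to read the plan is as composed operators over rows. The sketch below is an assumption-laden illustration, not how any particular RDBMS executes the query: scan, select, and join are hypothetical helper names, the join is a simple in-memory hash join, and the columns order_id and price are invented.

```python
from datetime import date

def scan(table):
    """Produce every row of a table."""
    yield from table

def select(rows, predicate):
    """Keep only the rows that satisfy the predicate."""
    return (r for r in rows if predicate(r))

def join(left, right, key):
    """Hash join: index the right input on key, then probe with the left."""
    index = {}
    for rrow in right:
        index.setdefault(rrow[key], []).append(rrow)
    for lrow in left:
        for rrow in index.get(lrow[key], []):
            yield {**rrow, **lrow}

today = date(2009, 3, 12)   # fixed date so the example is repeatable
orders = [
    {"order_id": 1, "item": "buoy", "date": today},
    {"order_id": 2, "item": "sensor", "date": date(2009, 3, 11)},
]
items = [{"item": "buoy", "price": 900}, {"item": "sensor", "price": 40}]

# join( select( scan(Order) ), scan(Item) ), mirroring the plan above
plan = join(select(scan(orders), lambda o: o["date"] == today), scan(items), "item")
result = list(plan)
print(result)
```

The SQL text says only what is wanted; the choice to filter orders before joining is one the system is free to make.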
Example System: Teradata
AMP = unit of parallelism
Example System: Teradata
AMP 1, AMP 2, AMP 3 each run: scan (Order o) -> select (date = today()) -> hash h(item)
Hashed rows are sent to AMP 4, AMP 5, AMP 6
Example System: Teradata
AMP 1, AMP 2, AMP 3 each run: scan (Item i) -> hash h(item)
Hashed rows are sent to AMP 4, AMP 5, AMP 6
Example System: Teradata
AMP 4, AMP 5, AMP 6 each run: join (o.item = i.item)
AMP 4 contains all orders and all lines where hash(item) = 1
AMP 5 contains all orders and all lines where hash(item) = 2
AMP 6 contains all orders and all lines where hash(item) = 3
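The three Teradata slides describe a repartitioned hash join. A minimal sketch of the idea follows, with assumptions stated plainly: this is not Teradata code, the hash function h is a toy, partitions are numbered from 0, and the names repartition and local_join are hypothetical.

```python
from collections import defaultdict

NUM_JOIN_AMPS = 3   # stands in for AMP 4, AMP 5, AMP 6

def h(item):
    """Deterministic toy hash; a real system would use its own function."""
    return sum(item.encode()) % NUM_JOIN_AMPS

def repartition(rows):
    """Send each row to the partition chosen by the hash of its item."""
    parts = defaultdict(list)
    for row in rows:
        parts[h(row["item"])].append(row)
    return parts

def local_join(orders, items):
    """Join two inputs that live on the same node."""
    return [{**i, **o} for o in orders for i in items if o["item"] == i["item"]]

orders = [
    {"order_id": 1, "item": "buoy"},
    {"order_id": 2, "item": "sensor"},
    {"order_id": 3, "item": "buoy"},
]
items = [{"item": "buoy", "price": 900}, {"item": "sensor", "price": 40}]

order_parts, item_parts = repartition(orders), repartition(items)
joined = []
for amp in range(NUM_JOIN_AMPS):
    # each join node sees only the rows whose item hashed to its number
    joined.extend(local_join(order_parts[amp], item_parts[amp]))
print(joined)
```

Because matching rows always hash to the same partition, the union of the local joins equals the join over the whole tables, and no join node needs to talk to another.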
Workflow Execution Plans
Need execution plans spanning client/server/cloud
Example: Isosurface Browsing
Example: Isosurface Browsing
 Plan A
Subset Subset Subset Subset
tstep 0 tstep 1 tstep 2 tstep 3
Example: Isosurface Browsing
 Plan B: Build an index
Build Index, e.g., an Interval Tree (Cignoni 97)
For each of tstep 0, tstep 1, tstep 2, tstep 3: Subset -> Isosurface -> Render
Example: Isosurface Browsing
 Plan C: Build a spatial index to support panning
 Plan D: Build a multi-resolution index to support zoom
 …and so on
 Why not precompute all appropriate indexes?
 Some will (partially) reside on client
 Storage is not as cheap as we pretend
 Need a flexible system where
 a “query result” can be explored interactively, and
 we prepare for similar queries
 similarity defined by natural “browsing patterns” in visualization
systems
Why MapReduce/Hadoop?
 Popular
  AWS Elastic MapReduce
  100s of startups
  # of downloads
  # of blog posts
 Free as in Speech
 Free as in Beer
 Flexible, Lightweight
 Scalable
 Fault-tolerant
Reducing Latency
 Online processing/progressive refinement
 Deliver approximate/partial results
 Standing Queries/Prepared plans
 Exploit indexes
Changes to Hadoop and/or other
tools required (e.g., HBase)
Masking Latency
 Caching/materialized views
 Reuse old results
 Pre-fetching
 Stage and prepare new results
 Speculative processing
 Anticipate future results
No change to Hadoop required
source: Antonio Baptista, NSF CMOP STC
Why Visualization? (2)
[Figure labels: north channel, south channel]
MapReduce?
 Hadoop simplifies parallel data processing
 ++ scalability
 ++ fault tolerance
 ++ less programming
 -- latency is an issue
Climatology Queries
[Figure: panels numbered 1 to 31, with a panel (b) marked; color scale in psu]
As a GridField Expression
[Diagram: H0 : (x,y,b) ⊗ V0 : (σ); bind(0, surf); apply(0, z = (surf − b) * σ); result C]
H = Scan(contxt, "H")
rH = Restrict("(326<x) & (x<345) & (287<y) & (y<302)", 0, H)
T = Scan(contxt, "T")
V = Scan(contxt, "V")
HxV = Cross(H, V)
HxVxT = Cross(HxV, T)
salt = Bind(contxt, HxVxT, "salt")
onemonth = Regrid(salt, HxV, equijoin("hpos,vpos"), avg())
As a SQL Query
SELECT hpos, vpos, AVG(salt)
FROM ocean
GROUP BY hpos, vpos
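The same group-by-and-average can be phrased as a map step and a reduce step. A small sketch, assuming an in-memory list in place of the ocean table: the row layout (hpos, vpos, timestep, salt) and the salinity values are made up for illustration, and map_row and reduce_avg are hypothetical names.

```python
from collections import defaultdict

# (hpos, vpos, timestep, salt) rows; the values are made up for illustration.
ocean = [
    (0, 0, 0, 30.0), (0, 0, 1, 32.0),
    (0, 1, 0, 28.0), (0, 1, 1, 29.0),
]

def map_row(row):
    """Map: key each salinity value by its grid position."""
    hpos, vpos, _tstep, salt = row
    return (hpos, vpos), salt

def reduce_avg(values):
    """Reduce: average all salinity values seen at one position."""
    return sum(values) / len(values)

groups = defaultdict(list)
for key, salt in map(map_row, ocean):
    groups[key].append(salt)            # the shuffle: group values by key
climatology = {key: reduce_avg(vals) for key, vals in groups.items()}
print(climatology)
```

The GROUP BY columns become the map output key and the aggregate becomes the reduce function.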
Scientific Workflow Systems
 Value proposition: More time on science, less time on code
 How: By providing language features emphasizing sharing,
reuse, reproducibility, rapid prototyping, efficiency
 Provenance
 Visual programming
 Caching
 Integration with domain-specific tools
 Scheduling
Related Vis Work
 Parallel visualization systems
 ParaView, VisIt
 Query-Driven Visualization
 [Bethel et al 2006,2008,2009]
 FastBit Index
 [Shoshani et al 2007]
 DB Vis systems
 Tableau
Feeding the Pipeline
source: Ken Moreland
missing step?
Cannot Ignore “Preprocessing”
Hadoop
Role 2: Move Computation to the Data
“Transferring the whole data generated … to a storage device or a
visualization machine could become a serious bottleneck, because I/O
would take most of the … time. A more feasible approach is to reduce
and prepare the data in situ for subsequent visualization and data
analysis tasks.”
-- SciDAC Review
Remote Visualization
 Reduce and render remotely, transfer images
 ++ transfers less data
 -- specialized hardware, high load
 Reduce remotely, transfer data/geometry, render locally
 ++ uses local graphics pipeline
 -- transfers more data
Scientific Vis System Roundup
 General
 ParaView [Kitware, Los Alamos, Sandia]
 VisIt [LLNL]
 Specialized
 SALSA, particles, Quinn, UW
 VISUS, streaming/progressive, Jones, LLNL
 SAGE,
 Hyperwall, tiled display, NASA

Contenu connexe

En vedette

The 129 Greatest Sales Strategists, Entrepreneurs, and Podcasters to Follow T...
The 129 Greatest Sales Strategists, Entrepreneurs, and Podcasters to Follow T...The 129 Greatest Sales Strategists, Entrepreneurs, and Podcasters to Follow T...
The 129 Greatest Sales Strategists, Entrepreneurs, and Podcasters to Follow T...Carolyn Rose Kick
 
iWelcome case study: ThiemeMeulenhoff - Digital transformation for publisher...
iWelcome case study: ThiemeMeulenhoff -  Digital transformation for publisher...iWelcome case study: ThiemeMeulenhoff -  Digital transformation for publisher...
iWelcome case study: ThiemeMeulenhoff - Digital transformation for publisher...Maarten Stultjens
 
ESS project – technical and conceptual challenges
ESS project – technical and conceptual challengesESS project – technical and conceptual challenges
ESS project – technical and conceptual challengesGlobal Risk Forum GRFDavos
 
Health inequalities related to the gender division of working-time in Europe
Health inequalities related to the gender division of working-time in EuropeHealth inequalities related to the gender division of working-time in Europe
Health inequalities related to the gender division of working-time in Europesophieproject
 
Cognitive & Language Development
Cognitive & Language DevelopmentCognitive & Language Development
Cognitive & Language DevelopmentANiS ADiBaH
 
ThreadFix and SD Elements Unifying Security Requirements and Vulnerability Ma...
ThreadFix and SD Elements Unifying Security Requirements and Vulnerability Ma...ThreadFix and SD Elements Unifying Security Requirements and Vulnerability Ma...
ThreadFix and SD Elements Unifying Security Requirements and Vulnerability Ma...Denim Group
 
HTML5 Top 10 Threats - Silent Attacks and Stealth Exploits
HTML5 Top 10 Threats - Silent Attacks and Stealth ExploitsHTML5 Top 10 Threats - Silent Attacks and Stealth Exploits
HTML5 Top 10 Threats - Silent Attacks and Stealth ExploitsShreeraj Shah
 
Application Security Guide for Beginners
Application Security Guide for Beginners Application Security Guide for Beginners
Application Security Guide for Beginners Checkmarx
 
Running a High-Efficiency, High-Visibility Application Security Program with...
Running a High-Efficiency, High-Visibility Application Security Program with...Running a High-Efficiency, High-Visibility Application Security Program with...
Running a High-Efficiency, High-Visibility Application Security Program with...Denim Group
 
European fund market mid year review 2015
European fund market mid year review 2015European fund market mid year review 2015
European fund market mid year review 2015Jerome Couteur
 

En vedette (14)

The 129 Greatest Sales Strategists, Entrepreneurs, and Podcasters to Follow T...
The 129 Greatest Sales Strategists, Entrepreneurs, and Podcasters to Follow T...The 129 Greatest Sales Strategists, Entrepreneurs, and Podcasters to Follow T...
The 129 Greatest Sales Strategists, Entrepreneurs, and Podcasters to Follow T...
 
Visibility
VisibilityVisibility
Visibility
 
iWelcome case study: ThiemeMeulenhoff - Digital transformation for publisher...
iWelcome case study: ThiemeMeulenhoff -  Digital transformation for publisher...iWelcome case study: ThiemeMeulenhoff -  Digital transformation for publisher...
iWelcome case study: ThiemeMeulenhoff - Digital transformation for publisher...
 
中文3
中文3中文3
中文3
 
NewResume
NewResumeNewResume
NewResume
 
ESS project – technical and conceptual challenges
ESS project – technical and conceptual challengesESS project – technical and conceptual challenges
ESS project – technical and conceptual challenges
 
Health inequalities related to the gender division of working-time in Europe
Health inequalities related to the gender division of working-time in EuropeHealth inequalities related to the gender division of working-time in Europe
Health inequalities related to the gender division of working-time in Europe
 
Cognitive & Language Development
Cognitive & Language DevelopmentCognitive & Language Development
Cognitive & Language Development
 
ThreadFix and SD Elements Unifying Security Requirements and Vulnerability Ma...
ThreadFix and SD Elements Unifying Security Requirements and Vulnerability Ma...ThreadFix and SD Elements Unifying Security Requirements and Vulnerability Ma...
ThreadFix and SD Elements Unifying Security Requirements and Vulnerability Ma...
 
HTML5 Top 10 Threats - Silent Attacks and Stealth Exploits
HTML5 Top 10 Threats - Silent Attacks and Stealth ExploitsHTML5 Top 10 Threats - Silent Attacks and Stealth Exploits
HTML5 Top 10 Threats - Silent Attacks and Stealth Exploits
 
Application Security Guide for Beginners
Application Security Guide for Beginners Application Security Guide for Beginners
Application Security Guide for Beginners
 
Terrorism in fata,Pakistan
Terrorism in fata,PakistanTerrorism in fata,Pakistan
Terrorism in fata,Pakistan
 
Running a High-Efficiency, High-Visibility Application Security Program with...
Running a High-Efficiency, High-Visibility Application Security Program with...Running a High-Efficiency, High-Visibility Application Security Program with...
Running a High-Efficiency, High-Visibility Application Security Program with...
 
European fund market mid year review 2015
European fund market mid year review 2015European fund market mid year review 2015
European fund market mid year review 2015
 

Similaire à Query-Driven Visualization in the Cloud with MapReduce

Virtual Appliances, Cloud Computing, and Reproducible Research
Virtual Appliances, Cloud Computing, and Reproducible ResearchVirtual Appliances, Cloud Computing, and Reproducible Research
Virtual Appliances, Cloud Computing, and Reproducible ResearchUniversity of Washington
 
Research Dataspaces: Pay-as-you-go Integration and Analysis
Research Dataspaces: Pay-as-you-go Integration and AnalysisResearch Dataspaces: Pay-as-you-go Integration and Analysis
Research Dataspaces: Pay-as-you-go Integration and AnalysisUniversity of Washington
 
XLDB South America Keynote: eScience Institute and Myria
XLDB South America Keynote: eScience Institute and MyriaXLDB South America Keynote: eScience Institute and Myria
XLDB South America Keynote: eScience Institute and MyriaUniversity of Washington
 
What is a Data Commons and Why Should You Care?
What is a Data Commons and Why Should You Care? What is a Data Commons and Why Should You Care?
What is a Data Commons and Why Should You Care? Robert Grossman
 
Web Science Synergies: Exploring Web Knowledge through the Semantic Web
Web Science Synergies: Exploring Web Knowledge through the Semantic WebWeb Science Synergies: Exploring Web Knowledge through the Semantic Web
Web Science Synergies: Exploring Web Knowledge through the Semantic WebStefan Dietze
 
Towards preservation of semantically enriched architectural knowledge
Towards preservation of semantically enriched architectural knowledgeTowards preservation of semantically enriched architectural knowledge
Towards preservation of semantically enriched architectural knowledgeStefan Dietze
 
HILDA 2023 Keynote Bill Howe
HILDA 2023 Keynote Bill HoweHILDA 2023 Keynote Bill Howe
HILDA 2023 Keynote Bill Howedomoritz
 
What's all the data about? - Linking and Profiling of Linked Datasets
What's all the data about? - Linking and Profiling of Linked DatasetsWhat's all the data about? - Linking and Profiling of Linked Datasets
What's all the data about? - Linking and Profiling of Linked DatasetsStefan Dietze
 
Big Data + Big Sim: Query Processing over Unstructured CFD Models
Big Data + Big Sim: Query Processing over Unstructured CFD ModelsBig Data + Big Sim: Query Processing over Unstructured CFD Models
Big Data + Big Sim: Query Processing over Unstructured CFD ModelsUniversity of Washington
 
remotesensing-12-01253.pdf
remotesensing-12-01253.pdfremotesensing-12-01253.pdf
remotesensing-12-01253.pdfNguyenVanTuan29
 
EDBT 2015: Summer School Overview
EDBT 2015: Summer School OverviewEDBT 2015: Summer School Overview
EDBT 2015: Summer School Overviewdgarijo
 
Cyberinfrastructure and Applications Overview: Howard University June22
Cyberinfrastructure and Applications Overview: Howard University June22Cyberinfrastructure and Applications Overview: Howard University June22
Cyberinfrastructure and Applications Overview: Howard University June22marpierc
 
HGrid A Data Model for Large Geospatial Data Sets in HBase
HGrid A Data Model for Large Geospatial Data Sets in HBaseHGrid A Data Model for Large Geospatial Data Sets in HBase
HGrid A Data Model for Large Geospatial Data Sets in HBaseDan Han
 
KnowEscape workshop, OKCon 2013
KnowEscape workshop, OKCon 2013KnowEscape workshop, OKCon 2013
KnowEscape workshop, OKCon 2013Stefan Dietze
 

Similaire à Query-Driven Visualization in the Cloud with MapReduce (20)

Virtual Appliances, Cloud Computing, and Reproducible Research
Virtual Appliances, Cloud Computing, and Reproducible ResearchVirtual Appliances, Cloud Computing, and Reproducible Research
Virtual Appliances, Cloud Computing, and Reproducible Research
 
Data-Intensive Scalable Science
Data-Intensive Scalable ScienceData-Intensive Scalable Science
Data-Intensive Scalable Science
 
Research Dataspaces: Pay-as-you-go Integration and Analysis
Research Dataspaces: Pay-as-you-go Integration and AnalysisResearch Dataspaces: Pay-as-you-go Integration and Analysis
Research Dataspaces: Pay-as-you-go Integration and Analysis
 
Democratizing Data Science in the Cloud
Democratizing Data Science in the CloudDemocratizing Data Science in the Cloud
Democratizing Data Science in the Cloud
 
XLDB South America Keynote: eScience Institute and Myria
XLDB South America Keynote: eScience Institute and MyriaXLDB South America Keynote: eScience Institute and Myria
XLDB South America Keynote: eScience Institute and Myria
 
End-to-End eScience
End-to-End eScienceEnd-to-End eScience
End-to-End eScience
 
What is a Data Commons and Why Should You Care?
What is a Data Commons and Why Should You Care? What is a Data Commons and Why Should You Care?
What is a Data Commons and Why Should You Care?
 
Web Science Synergies: Exploring Web Knowledge through the Semantic Web
Web Science Synergies: Exploring Web Knowledge through the Semantic WebWeb Science Synergies: Exploring Web Knowledge through the Semantic Web
Web Science Synergies: Exploring Web Knowledge through the Semantic Web
 
Towards preservation of semantically enriched architectural knowledge
Towards preservation of semantically enriched architectural knowledgeTowards preservation of semantically enriched architectural knowledge
Towards preservation of semantically enriched architectural knowledge
 
HILDA 2023 Keynote Bill Howe
HILDA 2023 Keynote Bill HoweHILDA 2023 Keynote Bill Howe
HILDA 2023 Keynote Bill Howe
 
Introduction to D3.js
Introduction to D3.jsIntroduction to D3.js
Introduction to D3.js
 
What's all the data about? - Linking and Profiling of Linked Datasets
What's all the data about? - Linking and Profiling of Linked DatasetsWhat's all the data about? - Linking and Profiling of Linked Datasets
What's all the data about? - Linking and Profiling of Linked Datasets
 
Big Data + Big Sim: Query Processing over Unstructured CFD Models
Big Data + Big Sim: Query Processing over Unstructured CFD ModelsBig Data + Big Sim: Query Processing over Unstructured CFD Models
Big Data + Big Sim: Query Processing over Unstructured CFD Models
 
remotesensing-12-01253.pdf
remotesensing-12-01253.pdfremotesensing-12-01253.pdf
remotesensing-12-01253.pdf
 
EDBT 2015: Summer School Overview
EDBT 2015: Summer School OverviewEDBT 2015: Summer School Overview
EDBT 2015: Summer School Overview
 
Cyberinfrastructure and Applications Overview: Howard University June22
Cyberinfrastructure and Applications Overview: Howard University June22Cyberinfrastructure and Applications Overview: Howard University June22
Cyberinfrastructure and Applications Overview: Howard University June22
 
HGrid A Data Model for Large Geospatial Data Sets in HBase
HGrid A Data Model for Large Geospatial Data Sets in HBaseHGrid A Data Model for Large Geospatial Data Sets in HBase
HGrid A Data Model for Large Geospatial Data Sets in HBase
 
ECCS 2010
ECCS 2010ECCS 2010
ECCS 2010
 
unit 1 DATA MINING.ppt
unit 1 DATA MINING.pptunit 1 DATA MINING.ppt
unit 1 DATA MINING.ppt
 
KnowEscape workshop, OKCon 2013
KnowEscape workshop, OKCon 2013KnowEscape workshop, OKCon 2013
KnowEscape workshop, OKCon 2013
 

Query-Driven Visualization in the Cloud with MapReduce

  • 1. Query-Driven Visualization in the Cloud with MapReduce. Bill Howe, UW; Huy Vo, Utah; Claudio Silva, Utah; Juliana Freire, Utah; YingYi Bu, UW.
  • 2. 3/12/09 Bill Howe, UW. VisTrails + GridFields. All Science is reducing to a database problem. Old model: “Query the world” (data acquisition coupled to a specific hypothesis). New model: “Download the world” (data acquired en masse, independent of hypotheses). Astronomy: high-resolution, high-frequency sky surveys (SDSS, LSST, PanSTARRS). Medicine: ubiquitous digital records, MRI, ultrasound. Oceanography: high-resolution models, cheap sensors, satellites. Biology: lab automation, high-throughput sequencing. “Increase Data Collection Exponentially in Less Time, with FlowCAM”. Empirical X → Analytical X → Computational X → X-informatics.
  • 3. Why “Query-driven”? Vis perspective: query = subsetting. DB perspective: query = manipulation, preparation, restructuring, index-building, aggregation, regridding, downsampling, simplification, reformatting, etc. Database Maxims: 1. Push the computation to the data. 2. Declarative programming is a good thing.
  • 4. Why Visualization? Super-charged aggregation. High bandwidth of the human visual cortex. Query-writing presupposes a known goal. “What does the salt wedge look like?”
  • 5. Why Cloud? “Cloud”? Software as a Service (SaaS); Infrastructure as a Service (IaaS); Platform as a Service (PaaS). Working definition: general, elastic, data-intensive, scalable computing. This work: Vis techniques + DB techniques in the Cloud.
  • 6. Visualization + Data Management. “Transferring the whole data generated … to a storage device or a visualization machine could become a serious bottleneck, because I/O would take most of the … time. A more feasible approach is to reduce and prepare the data in situ for subsequent visualization and data analysis tasks.” -- SciDAC Review. We can no longer afford two separate systems.
  • 7. Converging Requirements. Core vis techniques (isosurfaces, volume rendering, …); emphasis on interactive performance; mesh data as a first-class citizen. [Diagram labels: Vis, DB]
  • 8. Converging Requirements. Declarative languages; automatic data-parallelism; algebraic optimization. [Diagram labels: Vis, DB]
  • 9. Converging Requirements. Vis: “Query-driven Visualization”; Vis: “In Situ Visualization”; Vis: “Remote Visualization”; DB: “Push the computation to the data”. [Diagram labels: Vis, DB]
  • 10. Thesis. We can no longer afford to build separate visualization and data management systems. Data is increasingly destined for the cloud. At least two approaches: 1. Build “cloud” Vis platform with DM capabilities. 2. Extend “cloud” DM platforms with Vis capabilities. We are assessing option 2.
  • 11. This Talk. Brief Technology Review: Relational Databases; MapReduce: Data-Intensive Scalable Programming; GridFields: Mesh Algebra; VisTrails: Workflow and Provenance. Preliminary Results with Hadoop/MapReduce: climatology queries on a shared cloud; core vis algorithms on a private cluster.
  • 12. Relational Database History. Pre-Relational: if your data changed, your application broke. Early RDBMS were buggy and slow (and often reviled), but required only 5% of the application code. “Activities of users at terminals and most application programs should remain unaffected when the internal representation of data is changed and even when some aspects of the external representation are changed.” -- Codd 1979. Key Ideas: programs that manipulate tabular data exhibit an algebraic structure allowing reasoning and manipulation independently of physical data representation.
  • 13. Key Idea: Algebraic Optimization. N = ((z*2)+((z*3)+0))/1. Algebraic Laws: 1. (+) identity: x+0 = x. 2. (/) identity: x/1 = x. 3. (*) distributes: (n*x+n*y) = n*(x+y). 4. (*) commutes: x*y = y*x. Apply rules 1, 3, 4, 2: N = (2+3)*z. Two operations instead of five, no division operator.
  • 14. Key Idea: An Algebra of Tables. [Diagram: select, project, join, join.] Other operators: aggregate, union, difference, cross product.
  • 15. GridFields: An Algebra for Unstructured Grids. Unstructured grids model complex domains at multiple scales simultaneously … but complicate processing. [Figure: Columbia River Estuary; red = high salinity (~34 psu), blue = fresh water (~0 psu).]
  • 16. GridFields: Data Model. Columns x, y, salt, temp: (13.8, 10.6, 29.4, 12.1); (13.9, 9.4, 29.8, 12.5); (14.3, 9.0, 28.0, 12.0); (13.4, 9.0, 30.1, 13.2). Columns flux, area: (11.5, 3.3); (13.9, 5.5); (13.1, 4.5).
  • 17. GridFields: Operators. Lifted set operations: Union, Intersection, Cross Product. Scan/Bind: read a grid/attribute from disk. Restrict: remove cells that do not satisfy a predicate. Accrete: “grow” a grid by including neighbors of cells. Regrid: map the data of one grid onto another.
  • 18. GridFields: Query Algebra. [Diagram, labels A and B, color is depth: ⊗ of H0 : (x,y,b) and V0 : (z); restrict(0, z > b).] [Diagram, label C, color is salinity: ⊗ of H0 : (x,y,b) and V0 : (σ); apply(0, z = (surf − b) * σ); bind(0, surf).] Algebraic Manipulation of Scientific Datasets, B. Howe, D. Maier, VLDBJ 2005.
  • 19. GridFields: Optimization
  • 20. GridFields: Optimization. [Bar chart: running time in secs (axis 0 to 45) for vtk(3D), interpolate, simple, interp_o, simple_o.] But: only an 800 MB dataset.
  • 21. GridFields + VisTrails
  • 22. UW + Utah CluE Program. Goals: 10+-year “climatologies” at interactive speeds, …with provenance, reproducibility, collaboration, …on a shared-nothing, commodity platform. In general: explore the intersection of scientific databases and scientific visualization, at scale. Methods: “Cloud-Enable” two projects. GridFields: query algebra for mesh data. VisTrails: scientific workflow and provenance.
  • 23. Why MapReduce? Need to scale to hundreds or thousands of CPUs. Parallel databases: expensive, proprietary, difficult; not shown to scale to thousands of computers. MapReduce is a lightweight framework providing automatic data parallelism, fault-tolerance, and I/O scheduling.
  • 24. Why Hadoop? [Chart labels: paraview, hadoop]
  • 25. Some distributed algorithm… Map, (Shuffle), Reduce.
  • 26. MapReduce Programming Model. Input & Output: each a set of key/value pairs. Programmer specifies two functions. map (in_key, in_value) -> list(out_key, intermediate_value): processes input key/value pair; produces set of intermediate pairs. reduce (out_key, list(intermediate_value)) -> list(out_value): combines all intermediate values for a particular key; produces a set of merged output values (usually just one). Slide source: Google, Inc.
  • 27. This Talk. Brief Technology Review: Relational Databases; MapReduce: Data-Intensive Scalable Programming; GridFields: Mesh Algebra; VisTrails: Workflow and Provenance. Preliminary Results with Hadoop/MapReduce: climatology queries on a shared cloud; core vis algorithms on a private cluster.
  • 28. Application Domain: Oceanography. [Vis movie] Key idea: zooplankton correlated with temperature.
  • 29. Example Query: Climatology. Average Surface Salinity by Month, Columbia River Plume, 1999-2006. [Animation; map labels: Feb, May, Columbia River, Washington, Oregon; color scale in psu.]
  • 30. CluE Query Results. CluE: 400 node shared Hadoop platform provided by Google, IBM, NSF. 4-year-old commodity hardware, suspect IO performance.
  • 31. Preliminary results. Managing Hadoop jobs with VisTrails. GridField queries in Hadoop. Core Visualization algorithms in Hadoop.
  • 32. Core Vis Algorithms in MapReduce. Scalar/Volume Rendering: Map: rasterization; Reduce: compositing, blending. Isosurface Extraction: Map: isosurface extraction; Reduce: combine like isovalues. Mesh Simplification: Map: bin vertices; Reduce: collapse binned triangles.
  • 33. ATLAS dataset
  • 34. Rendering (Preliminary). [Plot: x-axis # of mappers; 57-node Nehalem.]
  • 35. Isosurface Extraction (Preliminary). [Plot: axis ticks 32, 48, 64, 96, 128.]
  • 36. Conclusions. Converging requirements in DB and Vis communities. Motivation exists for a “VisDB”: declarative query + high-performance vis, at full scale. We are evaluating Hadoop as a “substrate” for a VisDB. Scalability and reduced development time are promising. Interactive performance requires some changes.
  • 37. Acknowledgments. http://escience.washington.edu
  • 38. BACKUP SLIDES
  • 39. Shared Nothing Parallel Databases. Teradata; Greenplum; Netezza; Aster Data Systems; Datallegro (Microsoft); Vertica; MonetDB (recently commercialized as “Vectorwise”).
  • 40. Taxonomy of Parallel Architectures. [Diagram callouts: “Easiest to program, but $$$$”; “Scales to 1000s of nodes”.]
  • 41. VisTrails. Screenshot: VisTrails, Claudio Silva, Juliana Freire, et al., University of Utah.
  • 42. Version Tree. Screenshot: VisTrails, Claudio Silva, Juliana Freire, et al., University of Utah.
  • 43. Collaboration. Bill Howe @ UW computes salt flux using GridFields. Erik Anderson @ Utah adds vector streamlines and adjusts opacity. Bill Howe @ UW adds an isosurface of salinity. Peter Lawson adds discussion of the scientific interpretation. Howe et al., eScience 2008.
  • 44. Preliminary results. Managing Hadoop jobs with VisTrails. GridField queries in Hadoop. Core Visualization algorithms in Hadoop.
  • 45. Preliminary results. Managing Hadoop jobs with VisTrails. GridField queries in Hadoop. Core Visualization algorithms in Hadoop.
  • 46. Hadoop in VisTrails. Wrap Hadoop Streaming/HDFS operations. Plug “PreProcess” to actual Vis pipeline.
  • 47. Hadoop in VisTrails. Provenance and monitoring.
  • 48. Preliminary results. Managing Hadoop jobs with VisTrails. GridField queries in Hadoop. Core Visualization algorithms in Hadoop.
  • 49. All Science is reducing to a database problem. Old model: “Query the world” (data acquisition coupled to a specific hypothesis). New model: “Download the world” (data acquired en masse, independent of hypotheses). Astronomy: high-resolution, high-frequency sky surveys (SDSS, LSST, PanSTARRS). Medicine: ubiquitous digital records, MRI, ultrasound. Oceanography: high-resolution models, cheap sensors, satellites. Biology: lab automation, high-throughput sequencing. “Increase Data Collection Exponentially in Less Time, with FlowCAM”. Empirical X → Analytical X → Computational X → X-informatics.
  • 50. Key Idea: Declarative Languages. Find all orders from today, along with the items ordered: SELECT * FROM Order o, Item i WHERE o.item = i.item AND o.date = today(). [Plan diagram: scan Order o, then select date = today(); scan Item i; join on o.item = i.item.]
  • 51. Example System: Teradata. AMP = unit of parallelism.
  • 52. Example System: Teradata. [Diagram: AMP 1, AMP 2, AMP 3 each run scan Order o, select date=today(), hash h(item); results flow to AMP 4, AMP 5, AMP 6.]
  • 53. Example System: Teradata. [Diagram: AMP 1, AMP 2, AMP 3 each run scan Item i, hash h(item); results flow to AMP 4, AMP 5, AMP 6.]
  • 54. Example System: Teradata. [Diagram: AMP 4, AMP 5, AMP 6 each run join on o.item = i.item.] AMP 4 contains all orders and all lines where hash(item) = 1; AMP 5 contains all orders and all lines where hash(item) = 2; AMP 6 contains all orders and all lines where hash(item) = 3.
  • 55. Workflow Execution Plans. Need execution plans spanning client/server/cloud.
  • 56. Example: Isosurface Browsing
  • 57. Example: Isosurface Browsing. Plan A. [Diagram: one Subset per time step: tstep 0, tstep 1, tstep 2, tstep 3.]
  • 58. Example: Isosurface Browsing. Plan B: Build an index, e.g., an Interval Tree (Cignoni 97). [Diagram: for each of tstep 0, tstep 1, tstep 2, tstep 3: Subset, Isosurface, Render.]
  • 59. Example: Isosurface Browsing. Plan C: Build a spatial index to support panning. Plan D: Build a multi-resolution index to support zoom. …and so on. Why not precompute all appropriate indexes? Some will (partially) reside on client. Storage is not as cheap as we pretend. Need a flexible system where a “query result” can be explored interactively, and we prepare for similar queries; similarity is defined by natural “browsing patterns” in visualization systems.
  • 61. Why MapReduce/Hadoop? Popular: AWS Elastic MapReduce; 100s of startups; # of downloads; # of blog posts. Free as in Speech. Free as in Beer. Flexible, Lightweight. Scalable. Fault-tolerant.
  • 62. Reducing Latency. Online processing/progressive refinement: deliver approximate/partial results. Standing queries/prepared plans. Exploit indexes. Changes to Hadoop and/or other tools required (e.g., HBase).
  • 63. Masking Latency. Caching/materialized views: reuse old results. Pre-fetching: stage and prepare new results. Speculative processing: anticipate future results. No change to Hadoop required.
  • 64. Source: Antonio Baptista, NSF CMOP STC
  • 65. Why Visualization? (2) [Figure labels: north channel, south channel]
  • 66. MapReduce? Hadoop simplifies parallel data processing. ++ scalability; ++ fault tolerance; ++ less programming; -- latency is an issue.
  • 67. Climatology Queries. [Figure: numbered panels 1 to 31; panel label (b); color scale in psu.]
  • 69. As a GridField Expression. [Diagram, label C: ⊗ of H0 : (x,y,b) and V0 : (σ); apply(0, z = (surf − b) * σ); bind(0, surf).] H = Scan(contxt, "H"); rH = Restrict("(326<x) & (x<345) & (287<y) & (y<302)", 0, H); T = Scan(contxt, "T"); V = Scan(contxt, "V"); HxV = Cross(H, V); HxVxT = Cross(HxV, T); salt = Bind(contxt, HxVxT, "salt"); onemonth = Regrid(salt, HxV, equijoin("hpos,vpos"), avg())
  • 70. As a SQL Query. SELECT hpos, vpos, avg(salt) FROM ocean GROUP BY hpos, vpos
  • 71. Scientific Workflow Systems. Value proposition: more time on science, less time on code. How: by providing language features emphasizing sharing, reuse, reproducibility, rapid prototyping, efficiency. Provenance; visual programming; caching; integration with domain-specific tools; scheduling.
  • 72. Related Vis Work. Parallel visualization systems: ParaView, VisIt. Query-Driven Visualization [Bethel et al 2006, 2008, 2009]. FastBit Index [Shoshani et al 2007]. DB Vis systems: Tableau.
  • 73. Feeding the Pipeline. [Figure, source: Ken Moreland; annotation: missing step?]
  • 74. Cannot Ignore “Preprocessing”. [Figure label: Hadoop]
  • 75. Role 2: Move Computation to the Data. “Transferring the whole data generated … to a storage device or a visualization machine could become a serious bottleneck, because I/O would take most of the … time. A more feasible approach is to reduce and prepare the data in situ for subsequent visualization and data analysis tasks.” -- SciDAC Review
  • 76. Remote Visualization. Option 1: reduce and render remotely, transfer images. ++ transfers less data; -- specialized hardware, high load. Option 2: reduce remotely, transfer data/geometry, render locally. ++ uses local graphics pipeline; -- transfers more data.
  • 78. Scientific Vis System Roundup. General: ParaView [Kitware, Los Alamos, Sandia]; VisIt [LLNL]. Specialized: SALSA, particles, Quinn, UW; VISUS, streaming/progressive, Jones, LLNL; SAGE; Hyperwall, tiled display, NASA.
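
Slides 26 and 70 together suggest how the climatology query could map onto MapReduce: the SQL `GROUP BY hpos, vpos` becomes the intermediate key, and `avg(salt)` becomes the reduce step. The following is a minimal single-process sketch in Python, not the Hadoop job used in the talk. The record layout `(hpos, vpos, timestep, salt)`, the sample rows, and the names `map_fn`, `reduce_fn`, and `run_mapreduce` are assumptions made for illustration.

```python
from collections import defaultdict


def map_fn(record):
    """Map: emit ((hpos, vpos), salt) for one observation."""
    hpos, vpos, _timestep, salt = record
    yield (hpos, vpos), salt


def reduce_fn(key, values):
    """Reduce: average all salinity values that share a grid position."""
    values = list(values)
    yield key, sum(values) / len(values)


def run_mapreduce(records, map_fn, reduce_fn):
    """Toy driver: map every record, group by key (shuffle), then reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    output = {}
    for key in sorted(groups):
        for out_key, out_value in reduce_fn(key, groups[key]):
            output[out_key] = out_value
    return output


# Hypothetical sample rows: (hpos, vpos, timestep, salt in psu)
ocean = [
    (0, 0, 0, 30.0),
    (0, 0, 1, 32.0),
    (1, 0, 0, 10.0),
    (1, 0, 1, 14.0),
]
climatology = run_mapreduce(ocean, map_fn, reduce_fn)
```

In a real cluster the shuffle is performed by the framework and the two functions run on different machines; the dictionary here only stands in for that grouping step.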

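The algebraic optimization on slide 13 can be sketched as a small expression rewriter. This is an illustrative toy, not how a relational or GridFields optimizer is implemented: expressions are nested tuples `(op, left, right)`, and the names `simplify`, `count_ops`, and `evaluate` are assumptions. Rule 3 (distribution) and rule 4 (commutation) are applied in one step so the result matches the slide's `(2+3)*z`.

```python
import operator

OPS = {"+": operator.add, "*": operator.mul, "/": operator.truediv}


def simplify(expr):
    """Rewrite an expression tree bottom-up using the slide's four laws."""
    if not isinstance(expr, tuple):
        return expr  # a variable name or a numeric literal
    op, left, right = expr
    left, right = simplify(left), simplify(right)
    if op == "+" and right == 0:  # law 1: x + 0 = x
        return left
    if op == "/" and right == 1:  # law 2: x / 1 = x
        return left
    if (op == "+" and isinstance(left, tuple) and isinstance(right, tuple)
            and left[0] == "*" and right[0] == "*" and left[1] == right[1]):
        # law 3: n*x + n*y = n*(x+y), then law 4: commute to (x+y)*n
        return ("*", ("+", left[2], right[2]), left[1])
    return (op, left, right)


def count_ops(expr):
    """Count the operator nodes in an expression tree."""
    if not isinstance(expr, tuple):
        return 0
    _op, left, right = expr
    return 1 + count_ops(left) + count_ops(right)


def evaluate(expr, env):
    """Evaluate an expression tree, looking up variables in env."""
    if isinstance(expr, tuple):
        op, left, right = expr
        return OPS[op](evaluate(left, env), evaluate(right, env))
    return env.get(expr, expr)


# N = ((z*2) + ((z*3) + 0)) / 1
original = ("/", ("+", ("*", "z", 2), ("+", ("*", "z", 3), 0)), 1)
optimized = simplify(original)
```

The rewritten tree has two operators instead of five and no division, and both trees evaluate to the same value for any `z`.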
Editor's notes

  1. Drowning in data; starving for information. We’re at war with these engineering companies. FlowCAM is bragging about the amount of data they can spray out of their device. How to use this enormous data stream to answer scientific questions is someone else’s problem.
  2. Visualization is a more efficient way to query data -- you can browse and explore. But you need to be able to switch back and forth between interactive browsing and symbolic querying.
  3. Need to consider private clouds. Not just renting hardware: general-purpose data processing.
  4. Analytics and visualization are mutually dependent. Scalability. Fault-tolerance. Exploit shared-nothing, commodity clusters. In general: move computation to the data. Data is ending up in the cloud; we need to figure out how to use it.
  5. It provides a means of describing data with its natural structure only--that is, without superimposing any additional structure for machine representation purposes. Accordingly, it provides a basis for a high level data language which will yield maximal independence between programs on the one hand and machine representation on the other.
  6. It turns out that you can express a wide variety of computations using only a handful of operators.
  7. Bringing the computation to the data may soon mean implementing your algorithm in Hadoop. The storage layer may soon be equipped with Hadoop instead of just a parallel filesystem. Need to learn how to push your computation down into the Hadoop layer. Growing popularity: Cluster Exploratory program (Google, IBM, NSF); Amazon Web Services Elastic MapReduce; Hive, Pig, Cascading, Mahout; lots of startups (e.g., Cloudera).
  8. Bringing the computation to the data may soon mean implementing your algorithm in Hadoop. The storage layer may soon be equipped with Hadoop instead of just a parallel filesystem. Need to learn how to push your computation down into the Hadoop layer. Growing popularity: Cluster Exploratory program (Google, IBM, NSF); Amazon Web Services Elastic MapReduce; Hive, Pig, Cascading, Mahout; lots of startups (e.g., Cloudera).
  9. This movie was rendered offline, but it’s increasingly important to be able to create visualizations on the fly to allow interactive exploration of large datasets.
  10. Data-intensive science
  11. The goal here is to make Shared Nothing Architectures easier to program.
  12. We only wrap the interface for Hadoop Streaming in VisTrails, with additional support for HDFS operations to upload and download data and libraries for the job. Hadoop Streaming is plugged into a local VTK rendering pipeline that grabs data from the cloud and generates an animation on the VisTrails Spreadsheet. Users can specify their own Python source as mapper/reducer. In this case, a VTK script is specified in the mapper. VTK libraries are also shipped along with the code to the computing node, using the underlying -cacheArchive option of Hadoop Streaming.
  13. By default, Hadoop logs are written to the standard output of the VisTrails application. Jobs are killed by terminating the program and running an extra command returned by Hadoop. However, one can plug a HadoopTrackerCell into the end of the pipeline to have log messages monitored on the VisTrails Spreadsheet. There are also buttons to kill the job or show the Job Tracker, which automatically connects through the CLuE-specific proxy to show additional logs and error messages of jobs.
  14. Drowning in data; starving for information. We’re at war with these engineering companies. FlowCAM is bragging about the amount of data they can spray out of their device. How to use this enormous data stream to answer scientific questions is someone else’s problem.
  15. Need to assign workflows to resources for execution in a heterogeneous compute environment. Parts of this workflow can be compiled into Hadoop jobs; parts should be run locally so that they exploit hardware acceleration. But this is not just computation placement -- there are different execution plans, similar to relational execution plans. GridFields expressions can be algebraically optimized, for example.
  16. Plan C: Build a spatial index to support panning. Plan D: Build a multi-resolution index to support zoom. …and so on. Why not precompute all appropriate indexes? Some will (partially) reside on client. Storage is not as cheap as we pretend. Need a flexible system where a “query result” can be explored interactively, and we prepare for similar queries; similarity is defined by natural “browsing patterns” in visualization systems.
  17. We can’t just precompute the indexes, since they may reside on
  18. Analytics and visualization are mutually dependent. Scalability. Fault-tolerance. Exploit shared-nothing, commodity clusters. In general: move computation to the data.
  19. Upper left: Average
  20. Sweeping through the velocity fields quickly exposed the location of the “upstream” salt flux -- where salty water made its way back upstream.
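
Note 12 mentions that users supply their own Python source as mapper and reducer through Hadoop Streaming. In Streaming, a mapper and a reducer are ordinary programs that read lines on standard input and write tab-separated key/value lines on standard output, with the framework sorting mapper output by key in between. The sketch below imitates that protocol locally for a monthly average salinity, in the spirit of slide 29. It is not the VTK-based mapper from the talk; the input layout `year,month,salt`, the sample values, and the function names are assumptions for illustration.

```python
from itertools import groupby


def mapper(lines):
    """Turn 'year,month,salt' input lines into 'month<TAB>salt' lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank input lines
        _year, month, salt = line.split(",")
        yield f"{int(month):02d}\t{float(salt)}"


def reducer(sorted_lines):
    """Average salt per month; input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t") for line in sorted_lines)
    for month, group in groupby(pairs, key=lambda kv: kv[0]):
        salts = [float(value) for _key, value in group]
        yield f"{month}\t{sum(salts) / len(salts):.2f}"


# Local stand-in for a Streaming run; sorted() plays the role of the shuffle.
raw = ["1999,2,28.5", "2000,2,30.5", "1999,5,20.0"]
shuffled = sorted(mapper(raw))
result = list(reducer(shuffled))
```

To use these under Hadoop Streaming, each function would be wrapped in its own script that iterates over `sys.stdin` and prints each yielded line.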