Myria: Analytics-as-a-Service
for (Data) Scientists
Bill Howe
University of Washington

10/13/2013

Bill Howe, UW

1
“It’s a great time to be a data geek.”
-- Roger Barga, Microsoft Research

“The greatest minds of my generation are trying
to figure out how to make people click on ads”
-- Jeff Hammerbacher, co-founder, Cloudera

How can we deliver 1000 little SDSSs
to anyone who wants one?

R/V Wecoma, April 2007
Armbrust Lab Retreat, 2009 (Biology, Oceanography)

Astronomy Visualization
Workshop, 2011

Big Data in the Long Tail Workshop, 2012 (Social Sciences)

Maier’s 2nd Maxim

Working with scientists is like
working with 7 year olds:
They think they know everything
and they don’t have any money

My Goal: Expose all the world’s science data
through declarative query interfaces

Problem
How much time do you spend “handling
data” as opposed to “doing science”?

Mode answer: “90%”

Simple Example

ANNOTATIONSUMMARY-COMBINEDORFANNOTATION16_Phaeo_genome

query                       length
chr_4[480001-580000].287    4500
chr_4[560001-660000].1      3556
chr_9[400001-500000].503    4211
chr_9[320001-420000].548    2833
chr_27[320001-404298].20    3991
chr_26[320001-420000].378   3963
chr_26[400001-441226].196   2949
chr_24[160001-260000].65    3542
chr_5[720001-820000].339    3141
chr_9[160001-260000].243    3002
chr_12[720001-820000].86    2895
chr_12[800001-900000].109   1463
chr_11[1-100000].70         2886
chr_11[80001-180000].100    1523

COG hit #1   e-value #1   identity #1   score #1   hit length #1   description #1
COG4547      2.00E-04     19            44.6       620             Cobalamin biosynthesis protein C
COG5406      2.00E-04     38            43.9       1001            Nucleosome binding factor SPN,
COG4547      5.00E-05     18            46.2       620             Cobalamin biosynthesis protein C
COG5099      5.00E-05     17            46.2       777             RNA-binding protein of the Puf fa
COG5099      2.00E-04     17            43.9       777             RNA-binding protein of the Puf fa

COG5099      4.00E-09     20            59.3       777             RNA-binding protein of the Puf fa
COG5077      1.00E-25     26            114        1089            Ubiquitin carboxyl-terminal hydr
COG5032      2.00E-09     30            60.5       2105            Phosphatidylinositol kinase and p
COG5032      1.00E-09     30            60.1       2105            Phosphatidylinositol kinase and p

COGAnnotation_coastal_sample.txt

id     query              hit       e_value    identity_   score   query_start   query_end   hit_start   hit_end   hit_length
1      FHJ7DRN01A0TND.1   COG0414   1.00E-08   28          51      1             74          180         257       285
2      FHJ7DRN01A1AD2.2   COG0092   3.00E-20   47          89.9    6             85          41          120       233
3      FHJ7DRN01A2HWZ.4   COG3889   0.0006     26          35.8    9             94          758         845       872
…
2853   FHJ7DRN02HXTBY.5   COG5077   7.00E-09   37          52.3    3             77          313         388       1089
2854   FHJ7DRN02HZO4J.2   COG0444   2.00E-31   67          127     1             73          135         207       316
…
3566   FHJ7DRN02FUJW3.1   COG5032   1.00E-09   32          54.7    1             75          1965        2038      2105
…

SELECT * FROM Phaeo_genome p, coastal_sample c WHERE p.COG_hit = c.hit
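
The join can be tried locally on a handful of rows. The following is a minimal sketch using Python's built-in sqlite3 module; the in-memory database and the trimmed column set are assumptions made for illustration, and the rows are a few values taken from the two tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Trimmed schemas (assumption): only the columns needed for the join.
conn.execute("CREATE TABLE Phaeo_genome (COG_hit TEXT, e_value REAL, hit_length INTEGER)")
conn.execute("CREATE TABLE coastal_sample (id INTEGER, query TEXT, hit TEXT, hit_length INTEGER)")
conn.executemany("INSERT INTO Phaeo_genome VALUES (?, ?, ?)", [
    ("COG4547", 2.0e-4, 620),
    ("COG5077", 1.0e-25, 1089),
    ("COG5032", 2.0e-9, 2105),
])
conn.executemany("INSERT INTO coastal_sample VALUES (?, ?, ?, ?)", [
    (1, "FHJ7DRN01A0TND.1", "COG0414", 285),
    (2853, "FHJ7DRN02HXTBY.5", "COG5077", 1089),
    (3566, "FHJ7DRN02FUJW3.1", "COG5032", 2105),
])

# Same join condition as the query on the slide.
rows = conn.execute(
    "SELECT p.COG_hit, c.query FROM Phaeo_genome p, coastal_sample c "
    "WHERE p.COG_hit = c.hit ORDER BY c.id"
).fetchall()
print(rows)
```

Only the COG ids present in both tables survive the join, which is the point of the example: the two files share a key even though they were produced separately.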
Maslow’s Needs Hierarchy
“As each need is satisfied, the
next higher level in the hierarchy
dominates conscious functioning.”
-- Maslow 43

A “Needs Hierarchy” of Science Data Management
“As each need is satisfied, the
next higher level in the hierarchy
dominates conscious functioning.”
-- Maslow 43

analytics
query
curation
sharing
storage
A “Needs Hierarchy” of Science Data Management
“As each need is satisfied, the
next higher level in the hierarchy
dominates conscious functioning.”
-- Maslow 43

analytics
query
semantic integration
sharing
storage
Why should you care?

Science == Data Science

Version 1

QUERY-AS-A-SERVICE
2010 - present

1) Upload data “as is”
Cloud-hosted; no need to install or design a database; no pre-defined schema

2) Write SQL
Right in your browser, writing queries on top of queries on top of queries ...

3) Share the results
Make them public, tag them, share with specific colleagues – anyone with access can query

SELECT hit, COUNT(*) AS cnt
FROM tigrfam_surface
GROUP BY hit
ORDER BY cnt DESC
Find all TIGRFam ids (proteins) that are missing from at least one of three samples (relations)
(SELECT col0 FROM [refseq_hma_fasta_TGIRfam_refs]
 UNION
 SELECT col0 FROM [est_hma_fasta_TGIRfam_refs]
 UNION
 SELECT col0 FROM [combo_hma_fasta_TGIRfam_refs])
EXCEPT
(SELECT col0 FROM [refseq_hma_fasta_TGIRfam_refs]
 INTERSECT
 SELECT col0 FROM [est_hma_fasta_TGIRfam_refs]
 INTERSECT
 SELECT col0 FROM [combo_hma_fasta_TGIRfam_refs])
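
The query reads as "everything that appears in any sample, minus everything that appears in all three". In standard SQL, INTERSECT binds more tightly than UNION and EXCEPT, so the statement groups that way. A small sketch of the same set logic in Python; the ids are made-up placeholders, not real TIGRFam entries.

```python
# Hypothetical ids standing in for the col0 values of the three relations.
refseq = {"TIGR00001", "TIGR00002", "TIGR00003"}
est = {"TIGR00002", "TIGR00003", "TIGR00004"}
combo = {"TIGR00003", "TIGR00004", "TIGR00005"}

in_any = refseq | est | combo      # UNION of the three samples
in_all = refseq & est & combo      # INTERSECT of the three samples
missing_somewhere = in_any - in_all  # EXCEPT

print(sorted(missing_somewhere))
```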
Non-programmers can write very complex queries
(rather than relying on staff programmers)
Example: Computing the overlaps of two sets of blast results
SELECT x.strain, x.chr, x.region as snp_region, x.start_bp as snp_start_bp
, x.end_bp as snp_end_bp, w.start_bp as nc_start_bp, w.end_bp as nc_end_bp
, w.category as nc_category
, CASE WHEN (x.start_bp >= w.start_bp AND x.end_bp <= w.end_bp)
       THEN x.end_bp - x.start_bp + 1
       WHEN (x.start_bp <= w.start_bp AND w.start_bp <= x.end_bp)
       THEN x.end_bp - w.start_bp + 1
       WHEN (x.start_bp <= w.end_bp AND w.end_bp <= x.end_bp)
       THEN w.end_bp - x.start_bp + 1
  END AS len_overlap
FROM [koesterj@washington.edu].[hotspots_deserts.tab] x
INNER JOIN [koesterj@washington.edu].[table_noncoding_positions.tab] w
ON x.chr = w.chr
WHERE (x.start_bp >= w.start_bp AND x.end_bp <= w.end_bp)
OR (x.start_bp <= w.start_bp AND w.start_bp <= x.end_bp)
OR (x.start_bp <= w.end_bp AND w.end_bp <= x.end_bp)
ORDER BY x.strain, x.chr ASC, x.start_bp ASC

We see thousands of queries written by non-programmers
Howe, et al., CISE 2012
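
The CASE expression in the query computes how many base pairs two inclusive intervals share. As a compact illustration of that idea (a sketch, not the query author's method), the overlap of two inclusive intervals can be written with one min and one max; the function name `overlap_length` is hypothetical.

```python
def overlap_length(a_start, a_end, b_start, b_end):
    """Number of positions shared by two inclusive intervals; 0 if they are disjoint."""
    return max(0, min(a_end, b_end) - max(a_start, b_start) + 1)


# First interval inside the second: the whole first interval overlaps.
inside = overlap_length(10, 20, 5, 30)
# Second interval starts inside the first.
right = overlap_length(10, 20, 15, 30)
# Second interval ends inside the first.
left = overlap_length(10, 20, 5, 12)
# No shared positions.
disjoint = overlap_length(10, 20, 21, 30)
print(inside, right, left, disjoint)
```

This formulation also covers the case where one interval lies strictly inside the other, in either direction.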
SQL as a lab notebook:
http://bit.ly/16Xj2JP
Steven Roberts

Popular service for Bioinformatics Workflows

[Figure: workflow diagram, shown twice. Inputs: GFF of methylated CG locations, GFF of all genes, GFF of all CG locations, Gene descriptions. Steps: Excel, Trim, Reorder columns, Join, Count, Compute, Calculate # methylated CGs, Calculate # all CGs, Calculate methylation ratio, Link methylation with gene description. Annotation: misstep: join w/ wrong fill.]
Halperin, Howe, et al. SSDBM 2013
Andrew White,
UW Chemistry

“An undergraduate student and I are working with gigabytes of tabular data
derived from analysis of protein surfaces.
Previously, we were using huge directory trees and plain text files.
Now we can accomplish a 10 minute 100 line script in 1 line of SQL.”
-- Andrew D White
Decoding nonspecific interactions from nature. A. White, A. Nowinski, W.
Huang, A. Keefe, F. Sun, S. Jiang. (2012) Chemical Science. Accepted
SSDBM 2011

Scientific data management reduces to sharing views
• Integrate data from multiple sources?
– joins and unions with views

• Standardize on units, apply naming conventions?
– rename columns, apply functions with views

• Attach metadata?
– add new tables with descriptive names, add new columns with views

• Data cleaning, quality control?
– hide bad values with views

• Maintain provenance?
– inspect view dependencies

• Propagate updates?
– view maintenance

• Protect sensitive data?
– expose subsets with views (assuming views carry permissions)
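
Several of these bullets can be seen in one small view. The sketch below uses Python's built-in sqlite3 module; the table, the column names, the Fahrenheit readings, and the -9999 sentinel for bad values are all assumptions invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A table uploaded "as is": generic column names, mixed-quality values (hypothetical data).
conn.execute("CREATE TABLE raw_upload (col0 TEXT, col1 REAL)")
conn.executemany("INSERT INTO raw_upload VALUES (?, ?)", [
    ("station_a", 59.0),
    ("station_b", -9999.0),   # assumed sentinel for a failed reading
    ("station_c", 68.0),
])

# One view renames columns, standardizes units, and hides bad values.
conn.execute("""
    CREATE VIEW temperature_clean AS
    SELECT col0 AS station,
           (col1 - 32.0) * 5.0 / 9.0 AS temp_celsius
    FROM raw_upload
    WHERE col1 > -9999.0
""")

clean = conn.execute(
    "SELECT station, temp_celsius FROM temperature_clean ORDER BY station"
).fetchall()
print(clean)
```

The raw table is never modified, so the view definition itself documents what was cleaned and how.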
Two Problems with SQLShare
• No help for really big datasets
• No iteration

Myria is…
• A compiler framework for multiple
iterative RA-based languages
• A parallel, shared-nothing, iterative
execution engine
• A RESTful Query-as-a-Service platform

• prefix meaning “ten thousand” in Greek
Myria Team
Dan Suciu
Magda Balazinska
Bill Howe

Dan Halperin (postdoc, technical lead)
Victor Almeida (postdoc)
Andrew Whitaker (research scientist)

Students
Paris Koutris
Emad Soroush
Jingjing Wang
ShengLiang Xu
Jennifer Ortiz
Jeremy Hyrkas
Shumo Chu
Myria Architecture

[Figure: architecture diagram. Front end: Web UI and Language Parser, on Google App Engine. Myria Compiler: MyriaL, Logical Optimizer for RA+While, with a C Compiler path to Grappa and a json query plan path to MyriaDB. MyriaDB: REST Server and Coordinator with a Catalog, connected by netty protocols to Workers, each with its own Catalog; each Worker connects over jdbc to an RDBMS, with HDFS beneath.]

A(y) :- R(‘a’, y)
A(y) :- A(x), R(x,y)
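
These two rules define the set of nodes reachable from ‘a’. One way to evaluate them is to join only the newly discovered facts against R on each round and then eliminate duplicates, stopping when a round discovers nothing new (often called semi-naive evaluation). A minimal single-machine sketch; the function name and the in-memory dictionary index are assumptions, not Myria's implementation.

```python
def reachable_from(source, edges):
    """Evaluate A(y) :- R(source, y); A(y) :- A(x), R(x, y) round by round."""
    neighbors = {}
    for x, y in edges:
        neighbors.setdefault(x, set()).add(y)

    reached = set(neighbors.get(source, ()))   # base rule
    delta = set(reached)                       # facts discovered in the last round
    iterations = 0
    while delta:
        # Join: extend only the facts discovered in the previous round.
        candidates = {y for x in delta for y in neighbors.get(x, ())}
        # Dupe-elim: keep only facts that were not already known.
        delta = candidates - reached
        reached |= delta
        iterations += 1
    return reached, iterations


edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "b"), ("x", "y")]
reached, iterations = reachable_from("a", edges)
print(sorted(reached), iterations)
```

The final round produces no new facts; it exists only to detect that the computation has converged.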

A = LOAD('points.txt', id:int, x:float, y:float);
E = LIMIT(A, 4);
F = SEQUENCE();
Centroids = [FROM E EMIT (cluster_id=F.next, x=E.x, y=E.y)];
Kmeans = [FROM A EMIT (id=id, x=x, y=y, cluster_id=0)];
DO
  I = CROSS(Kmeans, Centroids);
  J = [FROM I EMIT (Kmeans.id, Kmeans.x, Kmeans.y, Centroids.cluster_id,
       distance=$distance(Kmeans.x, Kmeans.y, Centroids.x, Centroids.y))];
  K = [FROM J EMIT (id, distance=$min(distance))];
  L = JOIN(J, id, K, id);
  M = [FROM L WHERE J.distance <= K.distance EMIT
       (id=J.id, x=J.x, y=J.y, cluster_id=J.cluster_id)];
  Kmeans' = [FROM M EMIT (id, x, y, cluster_id=$min(cluster_id))];
  Delta = DIFF(Kmeans', Kmeans);
  Kmeans = Kmeans';
  Centroids = [FROM Kmeans' EMIT (cluster_id, x=avg(x), y=avg(y))];
WHILE Delta != {}
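
The program above is k-means written as a relational loop: assign each point to its nearest centroid, recompute the centroids, and repeat until the assignment stops changing. A minimal single-machine sketch of the same loop in Python; the function name, the use of Euclidean distance for `$distance`, and the sample points are assumptions.

```python
import math


def kmeans(points, k=4):
    """points: list of (id, x, y). Returns {id: cluster_id} once assignments stop changing."""
    # The first k points seed the centroids, mirroring the LIMIT step.
    centroids = {cid: (x, y) for cid, (_, x, y) in enumerate(points[:k])}
    assignment = {pid: 0 for pid, _, _ in points}
    while True:
        new_assignment = {}
        for pid, x, y in points:
            # Nearest centroid; ties go to the smallest cluster id.
            new_assignment[pid] = min(
                centroids,
                key=lambda cid: (math.dist((x, y), centroids[cid]), cid),
            )
        if new_assignment == assignment:   # nothing changed, so the loop ends
            return assignment
        assignment = new_assignment
        # Recompute each centroid as the mean of its current members.
        members = {}
        for pid, x, y in points:
            members.setdefault(assignment[pid], []).append((x, y))
        centroids = {
            cid: (sum(px for px, _ in pts) / len(pts), sum(py for _, py in pts) / len(pts))
            for cid, pts in members.items()
        }


sample = [(0, 0.0, 0.0), (1, 10.0, 10.0), (2, 0.0, 1.0), (3, 10.0, 11.0)]
clusters = kmeans(sample, k=2)
print(clusters)
```

The convergence test plays the role of the difference between successive assignments: the loop stops when that difference is empty.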

Why Iteration Matters
Datalog: reachability in a graph with 1.4B unique edges: almost 200 iterations

Vast majority
of reachable tuples
discovered by
iteration 25

The datalog program
continues for almost
200 iterations, each
almost as expensive
as the early steps

Fewer Iterations: Endgame Problem [Afrati 10]

[Plot: # of tuples discovered (log scale, 1 to 100,000,000) vs. iteration # (0 to 180). Series: frontier tuples; previously discovered tuples removed.]

Reachability from ‘a’ in datalog

A(y) :- R(‘a’, y)
A(y) :- A(x), R(x,y)

Basic Semi-Naïve Evaluation
[Figure: evaluation loop alternating Join and Dupe-elim.]

MAYBE JUST USE HADOOP?

VLDB 2010, VLDBJ 2011
Bu, Howe, Balazinska, Ernst
VLDB10, VLDBJ12, Datalog12

[Figure: one loop iteration as two MapReduce jobs, Join followed by Difference. Map and reduce tasks read ΔAi-1 and the partitions R(0), R(1), marked (a), and the partitions Ai(0), Ai(1), marked (b).]

(a) R is loop invariant, but gets loaded and shuffled on each iteration
(b) Ai grows slowly and monotonically, but is loaded and shuffled on each
iteration. HaLoop’s Reducer Input Cache addressed (a), but did not
support the append semantics needed for (b).
VLDB 2010, VLDBJ 2011

Inter-loop caching
Iteration i = 0: Load a distributed cache
Iteration i > 0:
[Figure: the Join and Difference jobs over ΔAi-1, with cached copies of R(0), R(1) and A(0), A(1) shown alongside the map and reduce tasks.]

Bu, Howe, Balazinska, Ernst VLDB10, VLDBJ12, Datalog12

Caching Loop-Invariant Data

[Figure: Join and Difference MapReduce jobs over ΔAi-1, R(0), R(1), Ai(0), Ai(1).]

[Plot: time (s), 0 to 1200, vs. iteration #, 0 to 40. Series: no cache; cache. Annotations: 23X; failure.]

First iteration is slow, as the invariant graph is shuffled and cached

Specialize Cache for Query Semantics

MapReduce semantics require that all keys from the cache be extracted and passed to reducers.
But we only care about keys that join.

[Figure: Reducer for Join, receiving join keys arriving from mappers and all tuples from cache.]

Second optimization: Specialization for Equijoin

Index the cache, and only extract keys that join

[Figure: Reducer for Join. Join keys arriving from mappers drive an indexed cache lookup, which returns only the keys that join.]

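The idea of indexing the cache can be sketched in a few lines. This is an illustration under stated assumptions, not HaLoop's API: the class and function names are hypothetical, and an in-memory dictionary stands in for the reducer-local cache of the loop-invariant relation R.

```python
from collections import defaultdict


class IndexedJoinCache:
    """Reducer-local cache of the loop-invariant relation R, indexed by join key."""

    def __init__(self, r_tuples):
        self._by_key = defaultdict(list)
        for x, y in r_tuples:          # built once, on the first iteration
            self._by_key[x].append(y)

    def probe(self, key):
        # Only tuples whose key joins are touched; the rest of the cache is never read.
        return self._by_key.get(key, [])


def join_reducer(cache, delta_keys):
    """Join newly discovered nodes (delta_keys) with cached R on R's first column."""
    return [(x, y) for x in delta_keys for y in cache.probe(x)]


cache = IndexedJoinCache([("a", "b"), ("a", "c"), ("b", "d"), ("z", "q")])
joined = join_reducer(cache, ["a", "not_in_r"])
print(joined)
```

Without the index, every cached tuple would have to be handed to the reducer on every iteration, even though only the tuples matching an incoming key can contribute to the join.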
Effect of equijoin specialization

[Plot: total time for loop body (s), 0 to 160, vs. iteration #, 0 to 80. Series: MapReduce semantics; Equijoin semantics. Annotations: ~20%; Failure occurred.]

Third Optimization: Extend Cache to Support Duplicate Elimination

The accumulated result is not loop-invariant, but it changes relatively slowly, and is needed on every iteration to check for duplicates.
Extend the cache to support append, and we can use it for Dupe-Elim as well.

[Figure: Reducer for Dupe-elim. Tuples arriving from mappers go through an indexed cache lookup, with new tuples inserted; unique keys are emitted.]

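A cache that supports both lookup and append is enough to do duplicate elimination incrementally. The sketch below is an assumption-laden illustration, not HaLoop code: the class name and method are hypothetical, and a Python set stands in for the indexed cache.

```python
class DedupCache:
    """Cache of the accumulated result that supports lookup and append."""

    def __init__(self):
        self._seen = set()

    def filter_new(self, tuples):
        """Return the tuples not seen in any earlier call, and remember them."""
        fresh = []
        for t in tuples:
            if t not in self._seen:   # indexed lookup
                self._seen.add(t)     # append the new tuple to the cache
                fresh.append(t)
        return fresh


dedup = DedupCache()
round_one = dedup.filter_new(["b", "c", "b"])
round_two = dedup.filter_new(["c", "d"])
print(round_one, round_two)
```

Because the accumulated result stays in the cache between iterations, each round only ships the candidate tuples, not the whole result so far.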
Effect of Diff Cache

[Plot: total time for loop body (s), 0 to 100, vs. iteration #, 0 to 50. Series: no diff cache; with diff cache. Annotations: ~20% overall improvement; Failures may be more likely due to extra network traffic.]

Overall

[Plot: time (s), 0 to 35000, vs. iteration #, 0 to 250. Series: (a) no optimizations; (b) HaLoop; (c) all optimizations; (d) raw Hadoop overhead.]

Fewer Iterations: Loop unrolling

Run two joins for every dupe-elim

half the iterations, but
each is more expensive

change
strategies

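Loop unrolling can be sketched on the reachability example: each round performs two joins before a single duplicate-elimination step. This is a minimal single-machine illustration; the function name and the in-memory index are assumptions.

```python
def reachable_unrolled(source, edges):
    """Reachability with two joins per duplicate-elimination step."""
    neighbors = {}
    for x, y in edges:
        neighbors.setdefault(x, set()).add(y)

    def join(frontier):
        return {y for x in frontier for y in neighbors.get(x, ())}

    reached = set(neighbors.get(source, ()))
    delta = set(reached)
    rounds = 0
    while delta:
        one_hop = join(delta)
        two_hop = join(one_hop)                 # second join, before any dupe-elim
        delta = (one_hop | two_hop) - reached   # a single dupe-elim per round
        reached |= delta
        rounds += 1
    return reached, rounds


chain = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e"), ("e", "f")]
reached, rounds = reachable_unrolled("a", chain)
print(sorted(reached), rounds)
```

On this six-node chain the unrolled loop needs 3 rounds, where a one-join-per-round loop needs 5. The second join runs on candidates that have not been deduplicated, so each round does more work.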
reachable(Y) :- edge(5,Y)
reachable(Y) :- edge(X,Y), reachable(X)

[Plot: # of Newly Discovered Facts (log scale, 1 to 10000000) vs. Iteration (1 to 23). Series: Greenplum; Myria. Annotation: not much useful work.]

[Plot: Total Time (second), 0 to 700, vs. Iteration (1 to 23). Series: Greenplum; Myria; Greenplum, incremental; Greenplum, incremental+index. Annotation: Low per-iteration cost.]

Summary
• Goal: Expose all the world’s science data through
declarative query interfaces!
• Motivated by real science
• Data and query model is iterative relational algebra
• Industrial-strength Query-as-a-Service
http://db.cs.washington.edu/myria/
http://myria-web.appspot.com/

[Figure: compiler pipeline. Components: Datalog Parser, Logical Optimizer, Myria Compiler, C Compiler, Grappa, Google App Engine.]

• Hypothesis: The performance difference
between hand-coded graph algorithms and
relational query plans amounts to
implementation details
• Can we generate “hand-coded” plans?

Path-Counting Queries

Ex: Count the number of unique 2-hops
Assume a collection edges of (source, target) pairs
answers = set()
for (x, y1) in edges:
    for (y2, z) in edges:
        if y1 == y2:
            answers.add((x, z))
count = len(answers)

In an RDBMS: “Nested Loops Join”

Assume a collection edges, but also an index
neighbors: vertex -> [vertex]
answers = set()
for (x, y) in edges:
    for z in neighbors.get(y, []):
        answers.add((x, z))
count = len(answers)

In an RDBMS: “Hash Join”

Just drop the edges collection entirely, leaving only the index
neighbors: vertex -> [vertex]
answers = set()
for x in neighbors:
    for y in neighbors[x]:
        for z in neighbors.get(y, []):
            answers.add((x, z))
count = len(answers)
In an RDBMS: Still a Hash Join

Just drop the edges collection entirely, leaving only the index
neighbors: vertex -> [vertex]
count = 0
answers = set()
for x in neighbors:
    for y in neighbors[x]:
        for z in neighbors.get(y, []):
            answers.add(z)      # only one value, so the set stays small
    count += len(answers)
    answers.clear()

RDBMS don’t express this, but there’s
no reason they couldn’t
Or if you prefer…assume a collection of vertices, where each
vertex points directly to its neighbors
count = 0
answers = set()
for x in vertices:
    for y in x.neighbors():
        for z in y.neighbors():
            answers.add(z)      # only one value, so the set stays small
    count += len(answers)
    answers.clear()

Boils down to dereferencing a pointer vs. probing a hash table
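
The variants on the preceding slides can be checked against each other on a toy graph. A minimal sketch; the edge list is made up for illustration, and a plain dictionary stands in for the neighbors index.

```python
# Hypothetical toy graph as (source, target) pairs.
edges = [(1, 2), (1, 3), (2, 4), (3, 4), (4, 1)]

# Index: vertex -> list of out-neighbors.
neighbors = {}
for x, y in edges:
    neighbors.setdefault(x, []).append(y)

# Nested loops join over the edge list.
nested = {(x, z) for (x, y1) in edges for (y2, z) in edges if y1 == y2}

# Hash join: probe the neighbors index instead of rescanning edges.
hashed = {(x, z) for (x, y) in edges for z in neighbors.get(y, [])}

# Index only, one source vertex at a time, so the working set stays small.
count = 0
for x in neighbors:
    answers = set()
    for y in neighbors[x]:
        answers.update(neighbors.get(y, []))
    count += len(answers)

print(len(nested), len(hashed), count)
```

All three strategies produce the same number of distinct 2-hop pairs; they differ only in how much intermediate state they keep and how they look up the second hop.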

Experiments
• Data sets:

Dataset            # Vertices   # Edges      # Distinct 2-hop Paths   # Triangles
BSN*               685,230      7,600,595    78,350,597               6,935,709
Twitter 4MEⱡ       166,317      4,532,185    1,056,317,985            14,912,950
comlivejournal*    3,997,962    34,681,189   735,398,579
soclivejournal*    4,874,571    68,993,773

112,319,229

ⱡ H. Kwak et al. 2010.
*http://snap.stanford.edu/
Experiments
[Plots: BSN data set and Twitter 4ME data set, single-threaded. Series: no dupe elim; dupe elim.]

Experiments
• Parallel system performance
Data Science, Data Curation, and Human-Data InteractionUniversity of Washington
 
The Other HPC: High Productivity Computing
The Other HPC: High Productivity ComputingThe Other HPC: High Productivity Computing
The Other HPC: High Productivity ComputingUniversity of Washington
 
Big Data Middleware: CIDR 2015 Gong Show Talk, David Maier, Bill Howe
Big Data Middleware: CIDR 2015 Gong Show Talk, David Maier, Bill Howe Big Data Middleware: CIDR 2015 Gong Show Talk, David Maier, Bill Howe
Big Data Middleware: CIDR 2015 Gong Show Talk, David Maier, Bill Howe University of Washington
 
Big Data Curricula at the UW eScience Institute, JSM 2013
Big Data Curricula at the UW eScience Institute, JSM 2013Big Data Curricula at the UW eScience Institute, JSM 2013
Big Data Curricula at the UW eScience Institute, JSM 2013University of Washington
 
Enabling Collaborative Research Data Management with SQLShare
Enabling Collaborative Research Data Management with SQLShareEnabling Collaborative Research Data Management with SQLShare
Enabling Collaborative Research Data Management with SQLShareUniversity of Washington
 
Virtual Appliances, Cloud Computing, and Reproducible Research
Virtual Appliances, Cloud Computing, and Reproducible ResearchVirtual Appliances, Cloud Computing, and Reproducible Research
Virtual Appliances, Cloud Computing, and Reproducible ResearchUniversity of Washington
 
HaLoop: Efficient Iterative Processing on Large-Scale Clusters
HaLoop: Efficient Iterative Processing on Large-Scale ClustersHaLoop: Efficient Iterative Processing on Large-Scale Clusters
HaLoop: Efficient Iterative Processing on Large-Scale ClustersUniversity of Washington
 
Query-Driven Visualization in the Cloud with MapReduce
Query-Driven Visualization in the Cloud with MapReduce Query-Driven Visualization in the Cloud with MapReduce
Query-Driven Visualization in the Cloud with MapReduce University of Washington
 
Visual Data Analytics in the Cloud for Exploratory Science
Visual Data Analytics in the Cloud for Exploratory ScienceVisual Data Analytics in the Cloud for Exploratory Science
Visual Data Analytics in the Cloud for Exploratory ScienceUniversity of Washington
 

More from University of Washington (20)

Database Agnostic Workload Management (CIDR 2019)
Database Agnostic Workload Management (CIDR 2019)Database Agnostic Workload Management (CIDR 2019)
Database Agnostic Workload Management (CIDR 2019)
 
Data Responsibly: The next decade of data science
Data Responsibly: The next decade of data scienceData Responsibly: The next decade of data science
Data Responsibly: The next decade of data science
 
Thoughts on Big Data and more for the WA State Legislature
Thoughts on Big Data and more for the WA State LegislatureThoughts on Big Data and more for the WA State Legislature
Thoughts on Big Data and more for the WA State Legislature
 
The Other HPC: High Productivity Computing in Polystore Environments
The Other HPC: High Productivity Computing in Polystore EnvironmentsThe Other HPC: High Productivity Computing in Polystore Environments
The Other HPC: High Productivity Computing in Polystore Environments
 
Big Data + Big Sim: Query Processing over Unstructured CFD Models
Big Data + Big Sim: Query Processing over Unstructured CFD ModelsBig Data + Big Sim: Query Processing over Unstructured CFD Models
Big Data + Big Sim: Query Processing over Unstructured CFD Models
 
Data, Responsibly: The Next Decade of Data Science
Data, Responsibly: The Next Decade of Data ScienceData, Responsibly: The Next Decade of Data Science
Data, Responsibly: The Next Decade of Data Science
 
Science Data, Responsibly
Science Data, ResponsiblyScience Data, Responsibly
Science Data, Responsibly
 
Data Science, Data Curation, and Human-Data Interaction
Data Science, Data Curation, and Human-Data InteractionData Science, Data Curation, and Human-Data Interaction
Data Science, Data Curation, and Human-Data Interaction
 
The Other HPC: High Productivity Computing
The Other HPC: High Productivity ComputingThe Other HPC: High Productivity Computing
The Other HPC: High Productivity Computing
 
Urban Data Science at UW
Urban Data Science at UWUrban Data Science at UW
Urban Data Science at UW
 
Big Data Middleware: CIDR 2015 Gong Show Talk, David Maier, Bill Howe
Big Data Middleware: CIDR 2015 Gong Show Talk, David Maier, Bill Howe Big Data Middleware: CIDR 2015 Gong Show Talk, David Maier, Bill Howe
Big Data Middleware: CIDR 2015 Gong Show Talk, David Maier, Bill Howe
 
Data Science and Urban Science @ UW
Data Science and Urban Science @ UWData Science and Urban Science @ UW
Data Science and Urban Science @ UW
 
Big Data Curricula at the UW eScience Institute, JSM 2013
Big Data Curricula at the UW eScience Institute, JSM 2013Big Data Curricula at the UW eScience Institute, JSM 2013
Big Data Curricula at the UW eScience Institute, JSM 2013
 
Data science curricula at UW
Data science curricula at UWData science curricula at UW
Data science curricula at UW
 
Enabling Collaborative Research Data Management with SQLShare
Enabling Collaborative Research Data Management with SQLShareEnabling Collaborative Research Data Management with SQLShare
Enabling Collaborative Research Data Management with SQLShare
 
Virtual Appliances, Cloud Computing, and Reproducible Research
Virtual Appliances, Cloud Computing, and Reproducible ResearchVirtual Appliances, Cloud Computing, and Reproducible Research
Virtual Appliances, Cloud Computing, and Reproducible Research
 
End-to-End eScience
End-to-End eScienceEnd-to-End eScience
End-to-End eScience
 
HaLoop: Efficient Iterative Processing on Large-Scale Clusters
HaLoop: Efficient Iterative Processing on Large-Scale ClustersHaLoop: Efficient Iterative Processing on Large-Scale Clusters
HaLoop: Efficient Iterative Processing on Large-Scale Clusters
 
Query-Driven Visualization in the Cloud with MapReduce
Query-Driven Visualization in the Cloud with MapReduce Query-Driven Visualization in the Cloud with MapReduce
Query-Driven Visualization in the Cloud with MapReduce
 
Visual Data Analytics in the Cloud for Exploratory Science
Visual Data Analytics in the Cloud for Exploratory ScienceVisual Data Analytics in the Cloud for Exploratory Science
Visual Data Analytics in the Cloud for Exploratory Science
 

Recently uploaded

Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 

Recently uploaded (20)

Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 

Myria: Analytics-as-a-Service for (Data) Scientists

  • 1. Myria: Analytics-as-a-Service for (Data) Scientists Bill Howe University of Washington 10/13/2013 Bill Howe, UW 1
  • 2. “It’s a great time to be a data geek.” -- Roger Barga, Microsoft Research. “The greatest minds of my generation are trying to figure out how to make people click on ads.” -- Jeff Hammerbacher, co-founder, Cloudera
  • 3.
  • 4. How can we deliver 1000 little SDSSs to anyone who wants one?
  • 6. Armbrust Lab Retreat, 2009 (Biology, Oceanography)
  • 8. Big Data in the Long Tail Workshop, 2012 (Social Sciences)
  • 9. Maier’s 2nd Maxim: Working with scientists is like working with 7-year-olds: they think they know everything and they don’t have any money.
  • 10. My Goal: Expose all the world’s science data through declarative query interfaces
  • 11. Problem: How much time do you spend “handling data” as opposed to “doing science”? Mode answer: “90%”
  • 12. Simple Example: two annotation files, joined with one query.
    File 1: ANNOTATIONSUMMARY-COMBINEDORFANNOTATION16_Phaeo_genome
    Columns: ###query, length, COG hit #1, e-value #1, identity #1, score #1, hit length #1, description #1
    ###query: chr_4[480001-580000].287, chr_4[560001-660000].1, chr_9[400001-500000].503, chr_9[320001-420000].548, chr_27[320001-404298].20, chr_26[320001-420000].378, chr_26[400001-441226].196, chr_24[160001-260000].65, chr_5[720001-820000].339, chr_9[160001-260000].243, chr_12[720001-820000].86, chr_12[800001-900000].109, chr_11[1-100000].70, chr_11[80001-180000].100
    length: 4500, 3556, 4211, 2833, 3991, 3963, 2949, 3542, 3141, 3002, 2895, 1463, 2886, 1523
    Rows with a COG hit (COG hit #1, e-value #1, identity #1, score #1, hit length #1, description #1):
      COG4547, 2.00E-04, 19, 44.6, 620, Cobalamin biosynthesis protein C
      COG5406, 2.00E-04, 38, 43.9, 1001, Nucleosome binding factor SPN,
      COG4547, 5.00E-05, 18, 46.2, 620, Cobalamin biosynthesis protein C
      COG5099, 5.00E-05, 17, 46.2, 777, RNA-binding protein of the Puf fa
      COG5099, 2.00E-04, 17, 43.9, 777, RNA-binding protein of the Puf fa
      COG5099, 4.00E-09, 20, 59.3, 777, RNA-binding protein of the Puf fa
      COG5077, 1.00E-25, 26, 114, 1089, Ubiquitin carboxyl-terminal hydr
      COG5032, 2.00E-09, 30, 60.5, 2105, Phosphatidylinositol kinase and p
      COG5032, 1.00E-09, 30, 60.1, 2105, Phosphatidylinositol kinase and p
    File 2: COGAnnotation_coastal_sample.txt
    Columns: id, query, hit, e_value, identity_, score, query_start, query_end, hit_start, hit_end, hit_length
      1, FHJ7DRN01A0TND.1, COG0414, 1.00E-08, 28, 51, 1, 74, 180, 257, 285
      2, FHJ7DRN01A1AD2.2, COG0092, 3.00E-20, 47, 89.9, 6, 85, 41, 120, 233
      3, FHJ7DRN01A2HWZ.4, COG3889, 0.0006, 26, 35.8, 9, 94, 758, 845, 872
      …
      2853, FHJ7DRN02HXTBY.5, COG5077, 7.00E-09, 37, 52.3, 3, 77, 313, 388, 1089
      2854, FHJ7DRN02HZO4J.2, COG0444, 2.00E-31, 67, 127, 1, 73, 135, 207, 316
      …
      3566, FHJ7DRN02FUJW3.1, COG5032, 1.00E-09, 32, 54.7, 1, 75, 1965, 2038, 2105
      …
    Query: SELECT * FROM Phaeo_genome p, coastal_sample c WHERE p.COG_hit = c.hit
  • 13. Maslow’s Needs Hierarchy: “As each need is satisfied, the next higher level in the hierarchy dominates conscious functioning.” -- Maslow 43
  • 14. A “Needs Hierarchy” of Science Data Management: “As each need is satisfied, the next higher level in the hierarchy dominates conscious functioning.” -- Maslow 43. Levels: analytics, query, curation, sharing, storage
  • 15. A “Needs Hierarchy” of Science Data Management: “As each need is satisfied, the next higher level in the hierarchy dominates conscious functioning.” -- Maslow 43. Levels: analytics, query, semantic integration, sharing, storage
  • 16. Why should you care? Science == Data Science
  • 17. Version 1: QUERY-AS-A-SERVICE, 2010 - present
  • 18. 1) Upload data “as is”: cloud-hosted; no need to install or design a database; no pre-defined schema. 2) Write SQL: right in your browser, writing queries on top of queries on top of queries ... 3) Share the results: make them public, tag them, share with specific colleagues; anyone with access can query. Example: SELECT hit, COUNT(*) AS cnt FROM tigrfam_surface GROUP BY hit ORDER BY cnt DESC
  • 19. Find all TIGRFam ids (proteins) that are missing from at least one of three samples (relations):
    SELECT col0 FROM [refseq_hma_fasta_TGIRfam_refs]
    UNION SELECT col0 FROM [est_hma_fasta_TGIRfam_refs]
    UNION SELECT col0 FROM [combo_hma_fasta_TGIRfam_refs]
    EXCEPT SELECT col0 FROM [refseq_hma_fasta_TGIRfam_refs]
    INTERSECT SELECT col0 FROM [est_hma_fasta_TGIRfam_refs]
    INTERSECT SELECT col0 FROM [combo_hma_fasta_TGIRfam_refs]
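The query on slide 19 is a set expression. In standard SQL, INTERSECT binds more tightly than UNION and EXCEPT, so it should read as "the union of the three samples, minus their intersection". A minimal Python sketch of the same logic follows; the ids and the function name are made up for illustration and do not come from the original tables.

```python
# Hypothetical TIGRFam ids per sample; real values would come from the three tables.
refseq = {"TIGR00001", "TIGR00002", "TIGR00003"}
est = {"TIGR00001", "TIGR00003", "TIGR00004"}
combo = {"TIGR00001", "TIGR00002", "TIGR00004"}


def missing_from_at_least_one(*samples):
    """Return ids that appear in some sample but not in every sample."""
    seen_anywhere = set().union(*samples)          # UNION of all samples
    seen_everywhere = set.intersection(*samples)   # INTERSECT of all samples
    return seen_anywhere - seen_everywhere         # EXCEPT


result = missing_from_at_least_one(refseq, est, combo)
print(sorted(result))
```

Only the id present in all three samples is dropped; everything else is missing from at least one sample and is kept.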
  • 20. Non-programmers can write very complex queries (rather than relying on staff programmers). We see thousands of queries written by non-programmers. Example: computing the overlaps of two sets of blast results:
    SELECT x.strain, x.chr, x.region AS snp_region, x.start_bp AS snp_start_bp, x.end_bp AS snp_end_bp, w.start_bp AS nc_start_bp, w.end_bp AS nc_end_bp, w.category AS nc_category,
      CASE WHEN (x.start_bp >= w.start_bp AND x.end_bp <= w.end_bp) THEN x.end_bp - x.start_bp + 1
           WHEN (x.start_bp <= w.start_bp AND w.start_bp <= x.end_bp) THEN x.end_bp - w.start_bp + 1
           WHEN (x.start_bp <= w.end_bp AND w.end_bp <= x.end_bp) THEN w.end_bp - x.start_bp + 1
      END AS len_overlap
    FROM [koesterj@washington.edu].[hotspots_deserts.tab] x
    INNER JOIN [koesterj@washington.edu].[table_noncoding_positions.tab] w ON x.chr = w.chr
    WHERE (x.start_bp >= w.start_bp AND x.end_bp <= w.end_bp)
       OR (x.start_bp <= w.start_bp AND w.start_bp <= x.end_bp)
       OR (x.start_bp <= w.end_bp AND w.end_bp <= x.end_bp)
    ORDER BY x.strain, x.chr ASC, x.start_bp ASC
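The overlap length computed by the CASE expression on slide 20 can also be written with min and max. The sketch below is an alternative formulation, not the query from the slide; the function name is hypothetical, and intervals are assumed to be closed (both endpoints included), as the "+ 1" in the SQL suggests.

```python
def overlap_length(x_start, x_end, w_start, w_end):
    """Number of base pairs shared by two closed intervals, or 0 if they are disjoint."""
    start = max(x_start, w_start)   # overlap begins at the later of the two starts
    end = min(x_end, w_end)         # overlap ends at the earlier of the two ends
    return max(0, end - start + 1)


print(overlap_length(10, 20, 5, 30))   # x lies inside w
print(overlap_length(10, 20, 15, 30))  # w starts inside x
print(overlap_length(10, 20, 5, 12))   # w ends inside x
print(overlap_length(10, 30, 15, 20))  # w lies strictly inside x
```

The last call covers the case where the non-coding region lies strictly inside the SNP region; the CASE expression as transcribed would take its second branch there and measure from w.start_bp to x.end_bp instead.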
  • 21. Howe, et al., CISE 2012
  • 22. SQL as a lab notebook (Steven Roberts): http://bit.ly/16Xj2JP. Popular service for Bioinformatics Workflows. [Figure: two workflow diagrams, each built from the same inputs (GFF of methylated CG locations, GFF of all genes, GFF of all CG locations, Gene descriptions). Step labels: Calculate # methylated CGs; Calculate # all CGs; Join; Count; Reorder columns; Trim; Calculate methylation ratio; Calculate methylation ratio and link with gene description; Link methylation with gene description; Excel; Compute. Annotation: misstep: join w/ wrong fill]
  • 23. Halperin, Howe, et al. SSDBM 2013
  • 24. Andrew White, UW Chemistry: “An undergraduate student and I are working with gigabytes of tabular data derived from analysis of protein surfaces. Previously, we were using huge directory trees and plain text files. Now we can accomplish a 10 minute 100 line script in 1 line of SQL.” -- Andrew D White. Reference: Decoding nonspecific interactions from nature. A. White, A. Nowinski, W. Huang, A. Keefe, F. Sun, S. Jiang. (2012) Chemical Science. Accepted
  • 25. Scientific data management reduces to sharing views (SSDBM 2011)
    • Integrate data from multiple sources? – joins and unions with views
    • Standardize on units, apply naming conventions? – rename columns, apply functions with views
    • Attach metadata? – add new tables with descriptive names, add new columns with views
    • Data cleaning, quality control? – hide bad values with views
    • Maintain provenance? – inspect view dependencies
    • Propagate updates? – view maintenance
    • Protect sensitive data? – expose subsets with views (assuming views carry permissions)
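A few of the items on slide 25 can be sketched with layered views. The example below uses Python's built-in sqlite3 module rather than SQLShare itself; the table, column names, sentinel value, and data are all hypothetical and chosen only to illustrate renaming, unit conversion, hiding bad values, exposing a subset, and inspecting view definitions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Raw upload stored "as is", with generic column names (made-up data).
    CREATE TABLE raw_upload (col0 TEXT, col1 REAL, col2 INTEGER);
    INSERT INTO raw_upload VALUES
        ('station_a', 50.0, 1),
        ('station_b', -999.0, 1),
        ('station_c', 68.0, 0);

    -- Naming conventions, unit conversion (Fahrenheit to Celsius),
    -- and quality control (hide the -999 sentinel) in one view.
    CREATE VIEW readings AS
        SELECT col0 AS station,
               (col1 - 32.0) * 5.0 / 9.0 AS temp_c,
               col2 AS is_public
        FROM raw_upload
        WHERE col1 <> -999.0;

    -- Expose only a subset of rows and columns.
    CREATE VIEW public_readings AS
        SELECT station, temp_c FROM readings WHERE is_public = 1;
""")

rows = conn.execute(
    "SELECT station, temp_c FROM public_readings ORDER BY station"
).fetchall()
print(rows)

# Provenance: the stored view definitions show what each view depends on.
view_sql = dict(conn.execute("SELECT name, sql FROM sqlite_master WHERE type = 'view'"))
print(view_sql["public_readings"])
```

SQLite does not attach permissions to views, so the last bullet's assumption (views carry permissions) would need a system that supports it.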
  • 26. Two Problems with SQLShare: no help for really big datasets; no iteration
  • 27. Myria is… a compiler framework for multiple iterative RA-based languages; a parallel, shared-nothing, iterative execution engine; a RESTful Query-as-a-Service platform; a prefix meaning “ten thousand” in Greek
  • 28. Myria Team: Dan Suciu, Magda Balazinska, Bill Howe, Dan Halperin (postdoc, technical lead), Victor Almeida (postdoc), Andrew Whitaker (research scientist). Students: Paris Koutris, Emad Soroush, Jingjing Wang, ShengLiang Xu, Jennifer Ortiz, Jeremy Hyrkas, Shumo Chu
  • 29. Myria Architecture. [Diagram labels: Web UI; Language Parser; Google App Engine; Logical Optimizer for RA+While; Myria Compiler; MyriaL; C Compiler; Grappa; json query plan; MyriaDB; REST Server; Coordinator; Catalog; netty protocols; Workers, each with a Catalog; jdbc; RDBMS; HDFS]
  • 30. A(y) :- R(‘a’, y)
    A(y) :- A(x), R(x,y)
  • 31. A = LOAD('points.txt', id:int, x:float, y:float);
    E = LIMIT(A, 4);
    F = SEQUENCE();
    Centroids = [FROM E EMIT (id=F.next, x=E.x, y=E.y)];
    Kmeans = [FROM A EMIT (id=id, x=x, y=y, cluster_id=0)];
    DO
      I = CROSS(Kmeans, Centroids);
      J = [FROM I EMIT (Kmeans.id, Kmeans.x, Kmeans.y, Centroids.cluster_id, $distance(Kmeans.x, Kmeans.y, Centroids.x, Centroids.y))];
      K = [FROM J EMIT id, distance=$min(distance)];
      L = JOIN(J, id, K, id);
      M = [FROM L WHERE J.distance <= K.distance EMIT (id=J.id, x=J.x, y=J.y, cluster_id=J.cluster_id)];
      Kmeans' = [FROM M EMIT (id, x, y, $min(cluster_id))];
      Delta = DIFF(Kmeans', Kmeans);
      Kmeans = Kmeans';
      Centroids = [FROM Kmeans' EMIT (cluster_id, x=avg(x), y=avg(y))];
    WHILE Delta != {}
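The loop structure of the program on slide 31 (assign each point to its nearest centroid, recompute centroids, stop when no assignment changes) can be sketched in plain Python. This is an illustration of the same iteration pattern under simplifying assumptions, not a translation of the MyriaL semantics: the function name, the in-memory list of points, and the max_iters safety bound are all additions.

```python
import math


def kmeans(points, k, max_iters=100):
    """Cluster 2-D points; returns (assignment, centroids)."""
    centroids = list(points[:k])        # seed centroids from the first k points, like LIMIT
    assignment = [0] * len(points)      # every point starts in cluster 0
    for _ in range(max_iters):
        new_assignment = []
        for (px, py) in points:
            distances = [math.hypot(px - cx, py - cy) for (cx, cy) in centroids]
            # index(min(...)) picks the first minimum, so ties go to the lowest cluster id
            new_assignment.append(distances.index(min(distances)))
        changed = new_assignment != assignment   # plays the role of Delta != {}
        assignment = new_assignment
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:                          # keep the old centroid if a cluster is empty
                centroids[c] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
        if not changed:
            break
    return assignment, centroids


points = [(0, 0), (0, 1), (10, 10), (10, 11)]
assignment, centroids = kmeans(points, k=2)
print(assignment, centroids)
```

The termination test compares whole assignments between iterations, which mirrors the DO ... WHILE loop ending once the difference between successive results is empty.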
  • 32. Why Iteration Matters. Datalog: reachability in a graph with 1.4B unique edges: almost 200 iterations
  • 33. Why Iteration Matters. Datalog: reachability in a graph with 1.4B unique edges: almost 200 iterations. Vast majority of reachable tuples discovered by iteration 25
  • 34. Why Iteration Matters. Datalog: reachability in a graph with 1.4B unique edges: almost 200 iterations. Vast majority of reachable tuples discovered by iteration 25. The datalog program continues for almost 200 iterations, each almost as expensive as the early steps
  • 35. Fewer Iterations: Endgame Problem [Afrati 10]. [Plot: # of tuples discovered (log scale, 1 to 100,000,000) vs. iteration # (0 to 180); legend: frontier tuples, previously discovered tuples removed]
  • 36. Reachability from ‘a’ in datalog: Basic Semi-Naïve Evaluation. Rules: A(y) :- R(‘a’, y); A(y) :- A(x), R(x,y). [Diagram labels: Join, Dupe-elim]
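The two rules on slide 36 can be evaluated semi-naively: each iteration joins only the facts discovered in the previous iteration with R, then removes duplicates against everything found so far. The sketch below is a single-machine illustration under that reading; the function name and the example edges are made up.

```python
def reachable_from(edges, source):
    """Semi-naive evaluation of A(y) :- R(source, y); A(y) :- A(x), R(x, y)."""
    neighbors = {}
    for x, y in edges:
        neighbors.setdefault(x, set()).add(y)

    reached = set(neighbors.get(source, ()))   # base rule
    delta = set(reached)                       # facts discovered in the previous iteration
    iterations = 0
    while delta:
        iterations += 1
        # Join: only newly discovered facts are joined with R.
        candidates = {y for x in delta for y in neighbors.get(x, ())}
        # Dupe-elim: drop facts that are already known.
        delta = candidates - reached
        reached |= delta
    return reached, iterations


edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "b"), ("e", "f")]
reached, iterations = reachable_from(edges, "a")
print(sorted(reached), iterations)
```

The loop ends when an iteration discovers nothing new, which is why a long tail of small deltas still costs one full iteration each.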
  • 37. MAYBE JUST USE HADOOP?
  • 38. VLDB 2010, VLDBJ 2011; Bu, Howe, Balazinska, Ernst VLDB10, VLDBJ12, Datalog12. [Diagram: Join and Difference steps drawn as map/reduce stages over ΔAi-1, R(0), R(1), Ai(0), Ai(1)] (a) R is loop invariant, but gets loaded and shuffled on each iteration. (b) Ai grows slowly and monotonically, but is loaded and shuffled on each iteration. HaLoop’s Reducer Input Cache addressed (a), but did not support the append semantics needed for (b).
  • 39. Inter-loop caching (VLDB 2010, VLDBJ 2011; Bu, Howe, Balazinska, Ernst VLDB10, VLDBJ12, Datalog12). Iteration i = 0: Load a distributed cache. Iteration i > 0: [Diagram: Join and Difference steps drawn as map/reduce stages over ΔAi-1, R(0), R(1), Ai(0), Ai(1), A(0), A(1)]
  • 40. Caching Loop-Invariant Data. First iteration is slow, as the invariant graph is shuffled and cached. [Diagram: Join and Difference map/reduce stages as on slide 38] [Plot: time (s) vs. iteration #; series: no cache, cache; annotations: failure, 23X]
  • 41. Specialize Cache for Query Semantics. MapReduce semantics require that all keys from the cache be extracted and passed to reducers. But we only care about keys that join. [Diagram labels: Reducer for Join; join keys arriving from mappers; all tuples from cache]
  • 42. Second optimization: Specialization for Equijoin. Index the cache, and only extract keys that join. [Diagram labels: Reducer for Join; join keys arriving from mappers; keys that join; indexed cache lookup]
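The difference between slides 41 and 42 can be illustrated on one machine: with a plain cache, every cached tuple is examined on each iteration, whereas an indexed cache is probed only for the join keys that actually arrive. The sketch below shows the idea only; it is not HaLoop's implementation, and all function names are hypothetical.

```python
from collections import defaultdict


def build_cache_index(invariant_tuples):
    """Index the loop-invariant relation R(x, y) by its join key x, once."""
    index = defaultdict(list)
    for x, y in invariant_tuples:
        index[x].append(y)
    return dict(index)


def join_with_scan(delta_keys, invariant_tuples):
    """Unspecialized join: every cached tuple is examined on every iteration."""
    keys = set(delta_keys)
    return [(x, y) for x, y in invariant_tuples if x in keys]


def join_with_index(delta_keys, index):
    """Equijoin specialization: probe the index only for keys that arrive."""
    return [(x, y) for x in set(delta_keys) for y in index.get(x, [])]


R = [(1, 2), (1, 3), (2, 4), (5, 6)]
delta = [1, 5]
scan_result = sorted(join_with_scan(delta, R))
index_result = sorted(join_with_index(delta, build_cache_index(R)))
print(scan_result, index_result)
```

Both functions return the same pairs; the indexed version does work proportional to the incoming keys and their matches rather than to the size of R.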
  • 43. Effect of equijoin specialization. [Plot: total time for loop body (s) vs. iteration #; series: Equijoin semantics, MapReduce semantics; annotations: Failure occurred, ~20%]
  • 44. Third Optimization: Extend Cache to Support Duplicate Elimination. The accumulated result is not loop-invariant, but it changes relatively slowly, and is needed on every iteration to check for duplicates. Extend the cache to support append, and we can use it for Dupe-Elim as well. [Diagram labels: Reducer for Dupe-elim; tuples arriving from mappers; unique keys; indexed cache lookup, with new tuples inserted]
  • 45. Effect of Diff Cache. Failures may be more likely due to extra network traffic. ~20% overall improvement. [Plot: total time for loop body (s) vs. iteration #; series: no diff cache, with diff cache]
  • 46. Overall. [Plot: time (s) vs. iteration #; series: (a) no optimizations, (b) HaLoop, (c) all optimizations, (d) raw Hadoop overhead]
  • 47. Fewer Iterations: Loop unrolling. Run two joins for every dupe-elim
  • 48. Half the iterations, but each is more expensive; change strategies
  • 49. reachable(Y) :- edge(5,Y) reachable(Y) :- edge(X,Y), reachable(X) # of Newly Discovered Facts 10000000 1000000 100000 10000 1000 Greenplum Myria 100 10 not much useful work 1 1 3 5 7 9 11 13 15 17 19 21 23 Iteration
  • 50. 700 Total Time (second) 600 500 400 Greenplum Low per-iteration cost 300 Myria 200 Greenplum, incremental 100 Greenplum, incremental+index 0 1 3 5 7 9 11 13 15 17 19 21 23 Iteration 10/13/2013 Bill Howe, UW 50
  • 51. Summary • Goal: Expose all the world’s science data through declarative query interfaces! • Motivated by real science • Data and query model is iterative relational algebra • Industrial-strength Query-as-a-Service http://db.cs.washington.edu/myria/ http://myria-web.appspot.com/ 10/13/2013 Bill Howe, UW 51
  • 53. [Diagram labels: Datalog, Parser, Logical Optimizer, Myria Compiler, C Compiler, Grappa, Google App Engine.]
      • Hypothesis: The performance difference between hand-coded graph algorithms and relational query plans amounts to implementation details
      • Can we generate “hand-coded” plans?
  • 54. Path-Counting Queries Ex: Count the number of unique 2-hops
  • 55. Assume a collection edges:
        answers = set()
        for all (x, y1) in edges:
            for all (y2, z) in edges:
                if y1 == y2:
                    answers.insert((x,z))
        count = answers.size()
    In an RDBMS: “Nested Loops Join”
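A runnable Python version of the nested-loops pseudocode, as a sketch. It assumes `edges` is a list of `(source, target)` pairs; the function name and the toy data are illustrative.

```python
def count_two_hops_nested_loops(edges):
    """Count distinct (x, z) pairs with x -> y -> z, comparing every pair of edges."""
    answers = set()
    for x, y1 in edges:
        for y2, z in edges:
            if y1 == y2:          # join condition: the first edge ends where the second starts
                answers.add((x, z))
    return len(answers)


edges = [(1, 2), (2, 3), (2, 4), (3, 4), (1, 3)]
two_hops = count_two_hops_nested_loops(edges)
```

Every edge is compared with every other edge, so the work grows quadratically with the number of edges regardless of how many pairs actually join.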
  • 56. Assume a collection edges, but also an index neighbors: vertex -> [vertex]
        answers = set()
        for all (x, y) in edges:
            for all z in neighbors[y]:
                answers.insert((x,z))
        count = answers.size()
    In an RDBMS: “Hash Join”
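The indexed variant can be sketched in Python as follows, assuming the `neighbors` index is a dictionary from a vertex to the list of its out-neighbors. The helper `build_neighbors` is a hypothetical name for the build phase of the hash join.

```python
from collections import defaultdict


def build_neighbors(edges):
    """Build phase: index edges by source vertex."""
    neighbors = defaultdict(list)
    for x, y in edges:
        neighbors[x].append(y)
    return neighbors


def count_two_hops_hash_join(edges):
    """Probe phase: for each edge (x, y), look up the neighbors of y in the index."""
    neighbors = build_neighbors(edges)
    answers = set()
    for x, y in edges:
        # .get avoids creating empty entries for vertices with no outgoing edges
        for z in neighbors.get(y, ()):
            answers.add((x, z))
    return len(answers)


edges = [(1, 2), (2, 3), (2, 4), (3, 4), (1, 3)]
two_hops = count_two_hops_hash_join(edges)
```

The inner loop now touches only edges that actually join, instead of scanning the whole edge collection for each outer edge.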
  • 57. Just drop the edges collection entirely, leaving only the index neighbors: vertex -> [vertex]
        answers = set()
        for all x in neighbors:
            for all y in neighbors[x]:
                for all z in neighbors[y]:
                    answers.insert((x,z))
        count = answers.size()
    In an RDBMS: Still a Hash Join
  • 58. Just drop the edges collection entirely, leaving only the index neighbors: vertex -> [vertex]
        count = 0
        answers = set()
        for all x in neighbors:
            for all y in neighbors[x]:
                for all z in neighbors[y]:
                    answers.insert(z)    (only one value, so stays small)
            count += answers.size()
            answers.clear()
    RDBMSs don’t express this, but there’s no reason they couldn’t
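A Python sketch of the per-source counting loop, assuming `neighbors` is a plain dictionary from vertex to a list of out-neighbors (the function name and toy data are illustrative). The set is cleared after each source vertex, so it never holds more than the 2-hop neighborhood of one vertex.

```python
def count_two_hops_per_source(neighbors):
    """Count distinct 2-hop pairs, keeping only one source's targets in memory."""
    count = 0
    answers = set()
    for x in neighbors:
        for y in neighbors[x]:
            # .get keeps the dict unchanged while we iterate over it
            for z in neighbors.get(y, ()):
                answers.add(z)        # store only z; x is fixed by the outer loop
        count += len(answers)
        answers.clear()
    return count


neighbors = {1: [2, 3], 2: [3, 4], 3: [4]}
two_hops = count_two_hops_per_source(neighbors)
```

This produces the same count as collecting all `(x, z)` pairs in one set, because pairs with different `x` can never be duplicates of each other.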
  • 59. Or if you prefer… assume a collection of vertices, where each vertex points directly to its neighbors
        count = 0
        answers = set()
        for all x in vertices:
            for all y in x.neighbors():
                for all z in y.neighbors():
                    answers.insert(z)    (only one value, so stays small)
            count += answers.size()
            answers.clear()
    Boils down to dereferencing a pointer vs. probing a hash table
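The pointer-based variant can be sketched with a small `Vertex` class whose neighbor list holds direct object references. The class, the `build_graph` helper, and the toy data are assumptions made for illustration; they are not the code generated by the Myria compiler.

```python
class Vertex:
    """A vertex that points directly at its neighbor objects (hypothetical class)."""

    def __init__(self, name):
        self.name = name
        self.out = []          # direct references to neighbor Vertex objects

    def neighbors(self):
        return self.out


def build_graph(edges):
    """Turn (source, target) pairs into linked Vertex objects."""
    vertices = {}
    for x, y in edges:
        vx = vertices.setdefault(x, Vertex(x))
        vy = vertices.setdefault(y, Vertex(y))
        vx.out.append(vy)
    return list(vertices.values())


def count_two_hops_pointers(vertices):
    count = 0
    answers = set()
    for x in vertices:
        for y in x.neighbors():
            for z in y.neighbors():   # follow a reference instead of probing an index
                answers.add(z)        # Vertex objects hash by identity
        count += len(answers)
        answers.clear()
    return count


vertices = build_graph([(1, 2), (2, 3), (2, 4), (3, 4), (1, 3)])
two_hops = count_two_hops_pointers(vertices)
```

The loop structure is identical to the dictionary version; only the way a neighbor list is reached differs.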
  • 60. Experiments • Data sets:
        Dataset            # Vertices   # Edges      # Distinct 2-hop Paths   # Triangles
        BSN*               685,230      7,600,595    78,350,597               6,935,709
        Twitter 4MEⱡ       166,317      4,532,185    1,056,317,985            14,912,950
        com-livejournal*   3,997,962    34,681,189   735,398,579
        soc-livejournal*   4,874,571    68,993,773   112,319,229
    (For the two livejournal rows, the extracted text gives only one of the last two columns; which column the third value belongs to is not recoverable.)
    ⱡ Kwak, H. et al. 2010.   * http://snap.stanford.edu/

Editor's Notes

  1. So, in part motivated by this, there’s a group of great database researchers who work deeply with scientists: Dave Maier, my advisor. Jignesh, who left. Natassa, who left. Yannis, Alex, others. And we recently attracted some new blood to the science data arena. But this community of science databases has something in common with the HPTS community. Jim was a luminary of HPTS, and no less so a luminary of science databases. The Sloan Digital Sk
  2. To understand the problem, it’s useful to consider past successes. The Sloan Digital Sky Survey used a relational database with a carefully engineered schema, and then served the database online using a carefully engineered infrastructure. This approach requires a lot of people, expertise, money, and time – things that small and medium-sized projects don’t typically have. So the question we explore is: How can we support 1000 little “SDSSs” for small- and medium-sized projects? --- We started thinking about a new tool. Schema designed in part by a Turing Award-winning database expert. We can’t afford to build a database + applications from scratch for every project, and nobody wants to maintain such a system anyway. Most importantly, the data comes from all over the place instead of a single source like SDSS; we can’t pretend the data will arrive clean and coherent.
  3. …where I had to disguise myself as an oceanographer in order to do data science work. This is me on a research cruise in 2007.
  4. But since joining the eScience Institute, I can mingle freely with the scientists in their natural habitat, and I sometimes get invited to their events.
  5. In every discipline, you can play Where’s Waldo in these group photos and find me.
  6. The problem is not only scale, and not even usually scale – it’s what Stratos called DB exploration. Grubbing around in messy data with unknown quality, properties, etc. And, working
  7. But the fundamental error made by computer scientists, and it’s probably the fault of the database community, is to assume that strong semantic integration is a prerequisite for query and analytics. It isn’t. It’s the final goal, not some insignificant preamble to analysis. Domain scientists know this – they take a very pragmatic approach. They write code to do data handling, they write code to do analytics, and they do data integration on the fly in a task-specific way. So one of my goals is to convince you that you can decouple declarative query from semantic integration, and doing so gives scientists a very powerful tool.
  8. But the fundamental error made by computer scientists, and it’s probably the fault of the database community, is to assume that semantic integration is a prerequisite for query and analytics. It isn’t. It’s the final goal, not some insignificant preamble to analysis. Domain scientists know this – they take a very pragmatic approach. They write code to do data handling, they write code to do analytics, and they do data integration on the fly in a task-specific way. So one of my goals is to convince you that you can decouple declarative query from semantic integration, and doing so gives scientists a very powerful tool.
  9. So we developed SQLShare to support a very simple workflow: you can upload data “as is” from spreadsheets or anything. It’s in the cloud, so no need to install or design a database. You can immediately begin writing queries, right in your browser, and put queries on top of queries on top of queries. Then you can share the results online: your colleagues can browse the science questions and see the SQL that answers them. ---- Key ideas to get data in: a) Use the cloud to avoid having to install and run a database. b) Give up on the schema: just throw your data in “as is” and do “lazy integration.” c) Use some magic to automate parsing, integration, recommendations, and more. Key ideas to get data out: a) Associate science questions (in English) with each SQL query, which makes them easy to understand and easy to find. b) Saving and reusing queries is a first-class requirement. Given an example, it’s easy to modify it into an “adjacent” query. c) Expose the whole system through a REST API to make it easy to bring new client applications online.
  10. Multiple input languages, multiple output languages, all RA based. Database on every node for local processing. Everything in memory, but can push down into the database. Push-based processing with back pressure to keep queues filled (a bit of streaming influence). Column-oriented tuple-batches between workers. Row-oriented on disk, typically, but depends on the database. Support
  11. Four points to make: 0) This is the time for *join only*, not the overall iteration time. 1) The first iteration is slower, as the cache is filled. 2) Each iteration is about 23X faster by joining against cached results. 3) The gaps are failures, which are a reality at this scale. Recovery proceeded as usual. HaLoop showed similar results, but did not evaluate complete Datalog queries.
  12. Two points to make: 1) 20% speedup on the overall iteration time from this specialization. This optimization violates MapReduce semantics, but is safe given our target language of Datalog. 2) The outliers represent failures, which are a reality of dealing with large-scale data, and are a key reason why HaLoop is popular.
  13. The accumulated result is not loop-invariant, but it changes relatively slowly, and is needed on every iteration to check for duplicates. Extend the cache to support append, and we can use it for Dupe-Elim as well.
  14. The diff cache works. Maybe ignore the failure comment, but just in case a question arises about why failures appear to be more common without the cache, the answer is: we’re not sure, but we know more data is being transferred over the network without the cache.
  15. But if we can’t express important analysis tasks, they’ll export their data and use some parallel cloudy R monstrosity