Relational databases remain underused in the long tail of science, despite a number of significant success stories and a natural correspondence between scientific inquiry and ad hoc database query. Barriers to adoption have been articulated in the past, but spreadsheets and other file-oriented approaches still dominate. At the University of Washington eScience Institute, we are exploring a new "delivery vector" for selected database features targeting researchers in the long tail: a web-based query-as-a-service system called SQLShare that eschews conventional database design, instead emphasizing a simple Upload-Query-Share workflow and exposing a direct, full-SQL query interface over "raw" tabular data. We augment the basic query interface with services for cleaning and integrating data, recommending and authoring queries, and automatically generating visualizations. We find that even non-programmers are able to create and share SQL views for a variety of tasks, including quality control, integration, basic analysis, and access control. Researchers in oceanography, molecular biology, and ecology report migrating data to our system from spreadsheets, from conventional databases, and from ASCII files. In this paper, we will provide some examples of how the platform has enabled science in other domains, describe our SQLShare system, and propose some emerging research directions in this space for the database community.
Enabling Collaborative Research Data Management with SQLShare
1. Enabling Collaborative Research Data Management with SQLShare
Bill Howe, PhD, Director of Research, Scalable Data Analytics, University of Washington eScience Institute
Tom Lewis, Director, Academic & Collaborative Applications, University of Washington Information Technology
10/3/2012 Bill Howe, UW 1
2. Assessing UW Researchers' Data Management Needs
1. Conversations with Research Leaders (2008)
• First large-scale assessment of researchers' needs
• 124 Interviews with top researchers
2. Faculty Technology Survey (2011)
• Use of teaching and research technologies
• Paired with student and TA surveys
• Reached all disciplines, levels of research
• 689 instructors responded
3. Data Management Needs
• Expertise—designing new systems, enhancing
current practices, analysis
• Storage—Large amounts of data for current
projects and data archives
• Backup—inconsistent systems among
researchers, some unreliable practices
• Security—secure access needed for inter-institutional partners
4. Types of Data Stored
• Quantities/statistics: 71.6%
• Text (literature, transcriptions, field notes): 61.2%
• Images (photographs, maps): 48.0%
• Video recordings: 22.8%
• Audio recordings: 20.5%
• Multimedia digital objects: 15.2%
• Geo-tagged objects / spatial data: 11.7%
• (unlabeled category): 5.2%
5. Data Storage Location
• My computer: 87%
• External device (hard drive, thumb drive): 66%
• Department-managed server: 41%
• Server managed by research team: 27%
• External (Non-UW) data center: 12%
• Department-managed data center: 6%
• Other: 5%
• Data managed by research team: 5%
8. The University of Washington
eScience Institute
• Rationale
– The exponential increase in sensors is transitioning all fields of science and
engineering from data-poor to data-rich
– Techniques and technologies include
• Sensors and sensor networks, databases, data mining, machine learning,
visualization, cluster/cloud computing
– If these techniques and technologies are not widely available and widely
practiced, UW will cease to be competitive
• Mission
– Help position the University of Washington at the forefront of research both in
modern eScience techniques and technologies, and in the fields that depend
upon them
• Strategy
– Bootstrap a cadre of Research Scientists
– Add faculty in key fields
– Build out a "consultancy" of students and non-research staff
9. eScience Big Data Group
Bill Howe, PhD (databases, cloud, data-intensive scalable computing, visualization)
Staff
– Seung-Hee Bae, PhD (postdoc, scalable machine learning algorithms)
– Dan Halperin, PhD (postdoc, scalable systems)
– Sagar Chitnis, Research Engineer (Azure, databases, web services)
– (alumna) Marianne Shaw, PhD (Hadoop, semantic graph databases)
– (alumna) Alicia Key, Research Engineer (visualization, web applications)
Students
– Scott Moe (2nd yr PhD, Applied Math)
– Daniel Perry (2nd yr PhD, HCDE)
Partners
– CSE DB faculty: Magda Balazinska, Dan Suciu
– CSE students: Paris Koutris, Prasang Upadhyaya
– UW-IT (web applications, QA/support)
– Cecilia Aragon, PhD, Associate Professor, HCDE (visualization, scientific applications)
10. [src: Carol Goble]
• Power distribution
• 80:20 rule
[Figure: popularity/sales vs. products/results, with the "head" on the left and the long "tail" on the right]
First published May 2007; Wired Magazine article 2004
11. [src: Carol Goble]
Long Tail of Research Data
• Head: high-throughput experimental methods, industrial scale, commons-based production, public data sets, cherry-picked results, preserved. Examples: GenBank, PDB, UniProt, ChemSpider, Pfam, CATH, SCOP (protein structure classification)
• Tail: spreadsheets and notebooks; local, lost
12. Problem
How much time do you spend “handling
data” as opposed to “doing science”?
Mode answer: “90%”
13. Simple Example
ANNOTATIONSUMMARY-COMBINEDORFANNOTATION16_Phaeo_genome
query | length | COG hit #1 | e-value #1 | identity #1 | score #1 | hit length #1 | description #1
chr_4[480001-580000].287 | 4500 | | | | | |
chr_4[560001-660000].1 | 3556 | | | | | |
chr_9[400001-500000].503 | 4211 | COG4547 | 2.00E-04 | 19 | 44.6 | 620 | Cobalamin biosynthesis protein C
chr_9[320001-420000].548 | 2833 | COG5406 | 2.00E-04 | 38 | 43.9 | 1001 | Nucleosome binding factor SPN
chr_27[320001-404298].20 | 3991 | COG4547 | 5.00E-05 | 18 | 46.2 | 620 | Cobalamin biosynthesis protein C
chr_26[320001-420000].378 | 3963 | COG5099 | 5.00E-05 | 17 | 46.2 | 777 | RNA-binding protein of the Puf fa
chr_26[400001-441226].196 | 2949 | COG5099 | 2.00E-04 | 17 | 43.9 | 777 | RNA-binding protein of the Puf fa
chr_24[160001-260000].65 | 3542 | | | | | |
chr_5[720001-820000].339 | 3141 | COG5099 | 4.00E-09 | 20 | 59.3 | 777 | RNA-binding protein of the Puf fa
chr_9[160001-260000].243 | 3002 | COG5077 | 1.00E-25 | 26 | 114 | 1089 | Ubiquitin carboxyl-terminal hydr
chr_12[720001-820000].86 | 2895 | COG5032 | 2.00E-09 | 30 | 60.5 | 2105 | Phosphatidylinositol kinase and p
chr_12[800001-900000].109 | 1463 | COG5032 | 1.00E-09 | 30 | 60.1 | 2105 | Phosphatidylinositol kinase and p
chr_11[1-100000].70 | 2886 | | | | | |
chr_11[80001-180000].100 | 1523 | | | | | |
COGAnnotation_coastal_sample.txt
id | query | hit | e_value | identity | score | query_start | query_end | hit_start | hit_end | hit_length
1 | FHJ7DRN01A0TND.1 | COG0414 | 1.00E-08 | 28 | 51 | 1 | 74 | 180 | 257 | 285
2 | FHJ7DRN01A1AD2.2 | COG0092 | 3.00E-20 | 47 | 89.9 | 6 | 85 | 41 | 120 | 233
3 | FHJ7DRN01A2HWZ.4 | COG3889 | 0.0006 | 26 | 35.8 | 9 | 94 | 758 | 845 | 872
…
2853 | FHJ7DRN02HXTBY.5 | COG5077 | 7.00E-09 | 37 | 52.3 | 3 | 77 | 313 | 388 | 1089
2854 | FHJ7DRN02HZO4J.2 | COG0444 | 2.00E-31 | 67 | 127 | 1 | 73 | 135 | 207 | 316
…
3566 | FHJ7DRN02FUJW3.1 | COG5032 | 1.00E-09 | 32 | 54.7 | 1 | 75 | 1965 | 2038 | 2105
…
SELECT * FROM Phaeo_genome p, coastal_sample c WHERE p.COG_hit = c.hit
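The one-line join on this slide can be tried against toy stand-ins for the two uploads. This is a sketch only: the table names, columns, and rows below are invented for illustration, and SQLite stands in for SQLShare's backend.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Toy stand-ins for the two uploaded annotation files.
cur.execute("CREATE TABLE phaeo_genome (query TEXT, cog_hit TEXT)")
cur.execute("CREATE TABLE coastal_sample (query TEXT, hit TEXT)")
cur.executemany("INSERT INTO phaeo_genome VALUES (?, ?)",
                [("chr_9[...].503", "COG4547"), ("chr_26[...].378", "COG5099")])
cur.executemany("INSERT INTO coastal_sample VALUES (?, ?)",
                [("FHJ7DRN02HXTBY.5", "COG5077"), ("FHJ7DRN01A0TND.1", "COG4547")])

# Which genome annotations share a COG family with the coastal sample?
rows = cur.execute("""
    SELECT p.query, c.query, p.cog_hit
    FROM phaeo_genome p, coastal_sample c
    WHERE p.cog_hit = c.hit
""").fetchall()
print(rows)  # [('chr_9[...].503', 'FHJ7DRN01A0TND.1', 'COG4547')]
```

The point of the slide survives the toy data: once both files live as tables, "which annotations appear in both samples?" is a one-line join rather than a manual cross-reference across spreadsheets.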
14. Data management becoming the bottleneck
• Data is captured and manipulated in spreadsheets
• This approach made sense five years ago; the data
volumes were manageable
• But now:
– 50k rows in each of 50-100 spreadsheets, per scientist
– Significantly more sharing required: "mega-collabs"
– Continuously producing new results
– The management overhead is beginning to dominate
– Difficult to share data publicly (emailing files is a mess)
– Lots of work to get an answer to any new question
15. Why not build a database?
• Assumes a pre-defined schema
– more on this later
• Requires specialized technical skills and a huge amount of up-front effort
• Not a perfect fit – it's hard to design a "permanent" database for a fast-moving research target
• Researchers have little interest in operating and maintaining a data system – they just want to organize, manipulate, and share data
16. Database-as-a-Service for the 99%
Approach: Strip down databases to bare essentials
– Upload -> Query -> Share
– Try to eliminate:
• installation, configuration, schema design, data loading, tuning (easy, thanks to the cloud)
• app-building (harder)
17. Getting Data In
• Upload data through a browser to a cloud
database (so no need to manage a local system)
• No need to "design a database" before you use your data – just upload and get to work
• Write SQL queries, with some automated help and
some guided help from experts
• Build on your own results
– Output of one query can be the input to another
– Encourages sharing and reuse
– Avoids everyone on the team running the same data
processing steps over and over
– Provides provenance – "how did you get this result?"
18. Select from a list of English descriptions
19. Edit a Query
Save the results, share them with others
27. Robin Kodner
2010 Pilot: Outreach and Education-based sampling, Schooner Adventuress
[Figure: map of numbered sampling stations]
28. NatureMapping Program (Karen Dvornich)
Wildlife Observations (1902- )
Data collection and submission options:
1. Download/upload spreadsheet
2. Online data entry
3. NatureTracker on handheld/GPS
4. Android ODK (Open Data Kit)
Water Quality Monitoring Sites (2003 - )
29. Andrew White, UW Chemistry
“An undergraduate student and I are working with gigabytes of tabular
data derived from analysis of protein surfaces.
Previously, we were using huge directory trees and plain text files.
Now we can accomplish a 10 minute 100 line script in 1 line of SQL.”
-- Andrew D White
31. A "Needs Hierarchy" of Science Data Management
"As each need is satisfied, the next higher level in the hierarchy dominates conscious functioning." -- Maslow 43
(top to bottom) full semantic integration, analytics, query, sharing, storage
32. NoSchema (not NoSQL)
• A schema* is a shared consensus about some universe of discourse
• At the frontier of research, this shared consensus does not exist, by definition
• Any schema that does emerge will change frequently, by definition
• Data found "in the wild" will typically not conform to any schema, by definition
• But this doesn't mean we have to punt on databases and go back to ad hoc scripts and files
* ontology/metadata standard/controlled vocabulary/etc.
33. A "Needs Hierarchy" of Science Data Management
• Silos: full semantic integration, analytics, query
• Cloud: sharing, storage
34. What's the point?
• Conventional wisdom says "Science data isn't relational"
– This is utter nonsense
• Conventional wisdom says "Scientists won't write SQL"
– This is utter nonsense
• So why aren't databases being used more often?
– They're a PITA
• We implicate difficulty in
– installation, configuration
– schema design, data loading
– performance tuning
– app-building (NoGUI?)
We ask instead, “What kind of platform can
support ad hoc scientific Q&A with SQL?”
5/18/10 Garret Cole, eScience Institute
35. Biologists are beginning to write very complex queries (rather than relying on staff programmers)
Example: Computing the overlaps of two sets of blast results

SELECT x.strain, x.chr, x.region AS snp_region, x.start_bp AS snp_start_bp
     , x.end_bp AS snp_end_bp, w.start_bp AS nc_start_bp, w.end_bp AS nc_end_bp
     , w.category AS nc_category
     , CASE WHEN (x.start_bp >= w.start_bp AND x.end_bp <= w.end_bp)
              THEN x.end_bp - x.start_bp + 1
            WHEN (x.start_bp <= w.start_bp AND w.start_bp <= x.end_bp)
              THEN x.end_bp - w.start_bp + 1
            WHEN (x.start_bp <= w.end_bp AND w.end_bp <= x.end_bp)
              THEN w.end_bp - x.start_bp + 1
       END AS len_overlap
FROM [koesterj@washington.edu].[hotspots_deserts.tab] x
INNER JOIN [koesterj@washington.edu].[table_noncoding_positions.tab] w
  ON x.chr = w.chr
WHERE (x.start_bp >= w.start_bp AND x.end_bp <= w.end_bp)
   OR (x.start_bp <= w.start_bp AND w.start_bp <= x.end_bp)
   OR (x.start_bp <= w.end_bp AND w.end_bp <= x.end_bp)
ORDER BY x.strain, x.chr ASC, x.start_bp ASC

We see thousands of queries written by non-programmers.
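The CASE arithmetic in this query reduces to the standard interval-overlap formula. As a sanity check, here is a minimal sketch in Python; `interval_overlap` is a hypothetical helper name, not part of SQLShare or the query above.

```python
def interval_overlap(a_start, a_end, b_start, b_end):
    """Length of the overlap between two closed base-pair intervals, 0 if disjoint."""
    return max(0, min(a_end, b_end) - max(a_start, b_start) + 1)

# a SNP region entirely inside a noncoding region
print(interval_overlap(100, 200, 50, 500))   # 101
# partially overlapping regions
print(interval_overlap(100, 200, 150, 500))  # 51
# disjoint regions
print(interval_overlap(1, 5, 10, 20))        # 0
```

The three WHEN branches in the SQL enumerate the containment and partial-overlap cases that this single min/max expression collapses into.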
36. So why SQL?
• Covers 80% of what we need
– Ex: Sloan Digital Sky Survey
– Ex: Hybrid hash join algorithm published in BMC Bioinformatics
• Empower a new class of data-savvy scientist who isn't forced to be trained as an IT professional
• Algebraic optimization
– Ask me about this if you're interested
• Views: logical and physical data independence
– Reason about the problem independently of the data representation
– No re-execution of "workflows"
– No file format incompatibilities
– No version mismatches
– Data and code tightly coupled and (logically) centralized
37. SQLShare as a CS Research Platform
• Automatic "Starter" Queries (Bill Howe, Garret Cole, Nodira Khoussainova, Leilani Battle) – SSDBM 2011, SIGMOD 2011 (demo)
• VizDeck: Automatic Mashups and Visualization (Bill Howe, Alicia Key, Daniel Perry, Cecilia Aragon) – SSDBM 2011, CHI 2012
• Info Extraction from Spreadsheets (Mike Cafarella, Dave Maier, Bill Howe, Sagar Chitnis, Abdu Alwani) – SIGMOD 2012 (demo)
• Scalable Analytics-as-a-Service (Dan Suciu, Magda Balazinska, Bill Howe)
• Optimizing Iterative Queries for Machine Learning (Dan Suciu, Magda Balazinska, Bill Howe) – VLDB 2010, Datalog2.0 2012, CIDR 2013
38. Where we're headed:
• Local or cloud-hosted deployments (done!)
• Multi-institution sharing
• Global users and permissions
• Distributed data and distributed query
We are looking for partners!
41. Scientific data management reduces to sharing views
• Integrate data from multiple sources?
– joins and unions with views
• Standardize on units, apply naming conventions?
– rename columns, apply functions with views
• Attach metadata?
– add new tables with descriptive names, add new columns with views
• Data cleaning, quality control?
– hide bad values with views
• Maintain provenance?
– inspect view dependencies
• Propagate updates?
– view maintenance
• Protect sensitive data?
– expose subsets with views (assuming views carry permissions)
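Several of these bullets can be sketched in a few lines of ordinary SQLite: renaming columns, applying a unit-conversion function, and hiding bad values, all with a single view. The table, column names, and sentinel value here are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# "Raw" upload: Fahrenheit readings with a -999 bad-value sentinel.
cur.execute("CREATE TABLE raw_casts (station TEXT, temp_f REAL)")
cur.executemany("INSERT INTO raw_casts VALUES (?, ?)",
                [("P1", 50.0), ("P2", -999.0), ("P3", 59.0)])

# Cleaning + unit standardization expressed as a view, not a script:
# rename the column, convert to Celsius, and hide bad values.
cur.execute("""
    CREATE VIEW casts_celsius AS
    SELECT station, (temp_f - 32.0) * 5.0 / 9.0 AS temp_c
    FROM raw_casts
    WHERE temp_f <> -999.0
""")

result = cur.execute(
    "SELECT station, temp_c FROM casts_celsius ORDER BY station").fetchall()
print(result)  # [('P1', 10.0), ('P3', 15.0)]
```

Because the cleanup lives in the view definition rather than in a one-off script, anyone who queries `casts_celsius` gets the corrected data, and the view text itself documents the provenance.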
42. An observation about "handling data"
• How many plasmids were bombarded in July and
have a rescue and expression?
SELECT count(*)
FROM [bombardment_log]
WHERE bomb_date BETWEEN '7/1/2010' AND '7/31/2010'
AND [rescue clone] IS NOT NULL
AND [expression?] = 'yes'
43. An observation about "handling data"
• Which samples have not been cloned?
SELECT *
FROM plasmiddb
WHERE NOT (ISDATE(cloned) OR cloned = 'yes')
44. An observation about "handling data"
• How often does each RNA hit appear inside
the annotated surface group?
SELECT hit, COUNT(*) as cnt
FROM tigrfamannotation_surface
GROUP BY hit
ORDER BY cnt DESC
45. An observation about "handling data"
For a given promoter (or protein fusion), how many expressing lines have been generated? (They would all have different strain designations.)
SELECT strain, count(distinct line)
FROM glycerol_stocks
GROUP BY strain
46. Find all TIGRFam ids (proteins) that are missing from at least one of three samples (relations)
SELECT col0 FROM [refseq_hma_fasta_TGIRfam_refs]
UNION
SELECT col0 FROM [est_hma_fasta_TGIRfam_refs]
UNION
SELECT col0 FROM [combo_hma_fasta_TGIRfam_refs]
EXCEPT
SELECT col0 FROM [refseq_hma_fasta_TGIRfam_refs]
INTERSECT
SELECT col0 FROM [est_hma_fasta_TGIRfam_refs]
INTERSECT
SELECT col0 FROM [combo_hma_fasta_TGIRfam_refs]
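This set-expression pattern ("in the union of all samples but not in the intersection") can be tried in SQLite with invented stand-in tables. Note the explicit subqueries: SQL Server gives INTERSECT higher precedence than UNION/EXCEPT, while SQLite evaluates compound operators strictly left to right, so making the grouping explicit keeps the meaning the same in both engines.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Three hypothetical samples, each a set of protein-family ids.
for name, ids in [("refseq", ["PF1", "PF2", "PF3"]),
                  ("est",    ["PF1", "PF2"]),
                  ("combo",  ["PF2", "PF3", "PF4"])]:
    cur.execute(f"CREATE TABLE {name} (col0 TEXT)")
    cur.executemany(f"INSERT INTO {name} VALUES (?)", [(i,) for i in ids])

# Ids present in some sample but missing from at least one sample:
# (union of all three) EXCEPT (intersection of all three).
rows = cur.execute("""
    SELECT col0 FROM (
        SELECT col0 FROM refseq
        UNION SELECT col0 FROM est
        UNION SELECT col0 FROM combo
    )
    EXCEPT
    SELECT col0 FROM (
        SELECT col0 FROM refseq
        INTERSECT SELECT col0 FROM est
        INTERSECT SELECT col0 FROM combo
    )
""").fetchall()
print(sorted(r[0] for r in rows))  # ['PF1', 'PF3', 'PF4']
```

Here PF2 appears in all three samples, so it is the only id excluded from the answer.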
48. An Observation on NoSQL
• 2004 Dean et al. MapReduce
• 2008 Hadoop 0.17 release
• 2008 Olston et al. Pig: Relational Algebra on Hadoop
• 2008 DryadLINQ: Relational Algebra in a Hadoop-like system
• 2009 Thusoo et al. HIVE: SQL on Hadoop
NoSQL is a misnomer
– NoMySQL?
– NoSchema?
– NoLoading?
– NoLicenseFees!
49. Digression: Relational Database History
Pre-Relational: if your data changed, your application broke.
Early RDBMS were buggy and slow (and often reviled), but
required only 5% of the application code.
“Activities of users at terminals and most application programs should
remain unaffected when the internal representation of data is changed and
even when some aspects of the external representation are changed.”
-- Codd 1979
Key Ideas: Programs that manipulate tabular data exhibit an algebraic
structure allowing reasoning and manipulation independently of physical
data representation
50. Key Idea: An Algebra of Tables
[Figure: an operator tree composing select, project, and two joins]
Other operators: aggregate, union, difference, cross product
51. Algebraic Optimization
N = ((4*2)+((4*3)+0))/1
Algebraic Laws:
1. (+) identity: x+0 = x
2. (/) identity: x/1 = x
3. (*) distributes: (n*x+n*y) = n*(x+y)
4. (*) commutes: x*y = y*x
Apply rules 1, 3, 4, 2: N = (2+3)*4
two operations instead of five, no division operator
Same idea works with very large tables / graphs, but the payoff is much higher
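The rewrite on this slide is easy to check directly; the same "rewrite, then evaluate the cheaper form" pattern is what a query optimizer does over tables instead of numbers.

```python
# Original expression: five operations (*, +, *, +, /)
n_original = ((4 * 2) + ((4 * 3) + 0)) / 1

# After applying the identity, distributivity, and commutativity laws:
# two operations (+, *), and no division at all.
n_rewritten = (2 + 3) * 4

assert n_original == n_rewritten == 20
```
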
52. Architecture
• Clients: Excel Add-in, Python client, R client, Spreadsheet Ingest (Python), "Flagship" SQLShare App (ASP.NET), VizDeck (on EC2)
• SQLShare REST API (IIS, on Windows Azure)
• Storage: SQL Azure / SQL Server
53. Science is becoming a database query problem
Old model: "Query the world" (data acquisition coupled to a specific hypothesis)
New model: "Download the world" (data acquisition supports many hypotheses)
– Astronomy: high-resolution, high-frequency sky surveys (SDSS, LSST, PanSTARRS) – 40TB / 2 nights from 1 device
– Biology: lab automation, high-throughput sequencing – ~1TB / day from 100s of devices
– Oceanography: high-resolution models, cheap sensors, satellites
54. Problem
– Research data is captured and manipulated in spreadsheets
– This perhaps made sense five years ago; the data volumes were manageable
– But now: 50k rows, 100s of files, "mega-collabs"
Why not put everything into a database?
– A huge amount of up-front effort
– Hard to design for a moving target
– Running a database system is a huge drain
Approach: SQLShare
– Upload data through your browser: no setup, no install
– Log in to browse science questions in English
– Click a question, see the SQL that answers it
– Edit the SQL to answer an "adjacent" question, even if you wouldn't know how to write it from scratch
https://sqlshare.escience.washington.edu/
55. [Figure: repositories plotted by # of bytes vs. # of sources. Astronomy: LSST, PanSTARRS, SDSS. Oceanography: OOI, IOOS, NDBC. Biology: SRA, MG-RAST, GenBank, PDB, UniProt, Pfam, LANL HIV, Galaxy, Pathway Commons, BioMart, GEO. SQLShare sits at the many-sources end of the spectrum.]
56. [Figure: research portfolio spanning scale. Large scale: parallel iterative analytics, GridFields. "Long tail" (scale, training): SQLShare (Query-as-a-Service), VizDeck. Also: citizen science and public outreach. Domains: astrophysics, life sciences, earth sciences, commercial.]
58. A "Needs Hierarchy" of Science Data Management
analytics
query
full integration
sharing
storage
60. Four Conjectures about Declarative Query for Science
• Most science data manipulation tasks can be expressed in relational algebra
• Most science analytics tasks can be expressed in relational algebra + recursion (Hellerstein 09, Re 12)
• These expressions can be efficiently and scalably executed in the cloud
• Researchers are willing and able to program using relational algebra languages (c.f. SDSS)
66. The Long Tail
"The future is already here; it's just not very evenly distributed." -- William Gibson
[Figure: data volume vs. rank: LSST (~100PB), CERN (~15PB/year), PanSTARRS (~40PB), SDSS (~100TB), CARMEN (~50TB), ocean modelers, seismologists, microbiologists, <spreadsheet users>]
67. Experimental Engagement Algorithm for the Long Tail
A stripped-down version of Jim Gray's "20 questions" methodology
1. Get the data
2. Load the data "as is" – no schema design
3. Get ~20 questions (in English)
4. Translate the questions into SQL (when possible)
5. Provide these "starter queries" to the researchers
Q: Can researchers' questions be expressed in SQL?
Q: Are a few examples sufficient for novices to self-train with SQL?
Q: Can we scale this process up?
Q: If so, will the use of SQL reduce their data handling overhead?
68. • Which samples have not been cloned?
SELECT *
FROM plasmiddb
WHERE NOT (ISDATE(cloned) OR cloned = 'yes')
69. Details
• Backend is Microsoft's SQL Azure
• About a year old, essentially zero advertising
• 50+ active users (UW and external)
• 2500+ tables/views (~20% are public)
• Largest table: 1.1M rows; smallest table: 1 row
• Transitioning to support by UW-IT
70. View Semantics and Features
• "Saved query" = view with attached metadata
• Unify views and tables as "datasets"
– table = "select * from [raw_table]"
• Replacement semantics for name conflicts
– old versions materialized and archived
• Materialize downstream views
– when dependencies are deleted
– when dependencies become incompatible
• Permissions
– public, private, ACLs (may work on groups in the future)
• Sharing, social querying, CQMS*
– search, recent queries, friends' queries, favorites, ratings
– facilitate sharing and recommendations of not just whole queries, but common predicates, join patterns, etc.
* [Khoussainova, CIDR 2009]
We want to give a little background on our project before we launch into it, so we will discuss the problem we are trying to solve. Essentially, we want to remove the speed bump of data handling from the scientists.
List of English descriptions
Data files can be uploaded as is, and we make an attempt to automatically infer the appropriate schema based on a type analysis of the values and a variety of heuristics.
Our goal is to be inclusive rather than exclusive: accept everything, then clean it up with queries and views. So we can tolerate an irregular number of columns, mixed-type columns, and missing column names. There is a lot more to do here. Kathleen Fisher's work on LearnPADS provides a great mechanism for automatically inferring the structure of ad hoc data formats; very promising for us. Jeff Heer's and Joe Hellerstein's work on Data Wrangler is relevant, as is Google Refine.
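The kind of type analysis described in this note can be sketched in a few lines. This is a toy: `infer_column_type` is an invented name, and the real SQLShare heuristics handle far more cases (dates, mixed-type columns, headers, ragged rows).

```python
def infer_column_type(values):
    """Guess a column type from string cell values, tolerating blank cells."""
    def castable(cast):
        # Blank cells are ignored; every non-blank cell must parse.
        return all(not v.strip() or parses(cast, v) for v in values)

    def parses(cast, v):
        try:
            cast(v)
            return True
        except ValueError:
            return False

    if castable(int):
        return "INTEGER"
    if castable(float):
        return "FLOAT"
    return "TEXT"

print(infer_column_type(["1", "2", ""]))     # INTEGER
print(infer_column_type(["1.5", "2", ""]))   # FLOAT
print(infer_column_type(["1.5", "n/a"]))     # TEXT
```

Note the inclusive fallback: anything that fails the numeric checks still loads as TEXT, matching the accept-everything-then-clean-with-views philosophy.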
Every uploaded table is associated with a trivial view. You can modify this view to do some basic cleanup, and in fact that is typically the first step we are seeing people perform.
Lots of features you can imagine here – anything you can do with a YouTube video, you should be able to do with a query: share it, rate it, "more like this", recommendations. We are exploring some of these.
I'll give you a couple of examples of use cases we are seeing. Robin Kodner at Friday Harbor Labs in the San Juan Islands works on a program called Sound Experience that gives K-12 students through undergrads experience sailing, but reuses this ship time to do some real science and collect samples from areas that would otherwise be very expensive.
But the fundamental error made by computer scientists, and it's probably the fault of the database community, is to assume that semantic integration is a prerequisite for query and analytics. It isn't. It's the final goal, not some insignificant preamble to analysis. Domain scientists know this: they take a very pragmatic approach. They write code to do data handling, they write code to do analytics, and they do data integration on the fly in a task-specific way. So one of my goals is to convince you that you can decouple declarative query from semantic integration, and that doing so gives scientists a very powerful tool.
So how do we do this? Back to the needs hierarchy. It's already happening. To a first approximation, I consider storage and sharing solved thanks to cloud computing. There are plenty of issues to resolve, especially in sustainability and uptake, but these systems exist. And they're getting cheaper. But semantic integration and analytics are still handled in a largely application-specific, ad hoc, siloed manner. Query services for science are underexplored, with a few mind-blowingly successful examples such as the Sloan Digital Sky Survey. I think many of you are familiar with SDSS, but briefly: it is no exaggeration to say that the liberal application of database technology has transformed the field of astronomy. Undergrads and grads use SQL alongside IDL and Fortran. 5000 papers were written about the SDSS data, and only about 100 of those were co-authored by the PIs. The rest were written by astronomers exercising the public query interfaces. So the roadmap here is to FIRST expose data through query services, THEN use these services to generalize, scale, and democratize analytics and semantic integration.
To begin, we ask: what kind of questions would you ask your data once you have it ready to be worked on? For just about every question we have heard a scientist ask, we have found an equivalent SQL statement. If we could just turn their questions into SQL our job would be done, but there are many other problems to solve before that becomes a reality. For example, their data may not reside in a relational database. This brings us to our next problem: how can we bring the power of SQL to scientists to solve their questions, without the overhead of everything a database administrator would need to do?
Why has MapReduce been so successful? One reason: it turned "mere mortal" Java programmers into distributed systems programmers. MapReduce raised the level of abstraction for big data, but not high enough, evidently. Two of the earliest and most successful projects in the Hadoop ecosystem were Pig and HIVE: declarative query languages on top of Hadoop. NoSQL is a misnomer. NoSchema? NoLoading? NoLicenseFees? NoOverpricedDBA? NoMySQL. It taught database people to think about fault tolerance differently and to not ignore dirty data.
The DNA material in the samples taken from the water is then sequenced in a machine to produce millions of short strings.
These DNA reads can then be cross-referenced in public databases to determine what organisms were present in the water, and what genes were being expressed
Each step generates a bunch of "residual" data, usually in the form of spreadsheets or text files. This process is repeated many times, leading to 100s of spreadsheets. At this point, the actual science questions are answered using these spreadsheets by computing "manual joins", creating plots, searching and filtering, copying and pasting, etc. It's a mess. When asked how much time is spent "handling data" as opposed to "doing science", we've heard that 90% of their work is manipulating the data before they can actually answer a question!