SlideShare une entreprise Scribd logo
1  sur  72
Enabling Collaborative Research
 Data Management with SQLShare
               Bill Howe, PhD                                 Tom Lewis
                   Director of                           Director, Academic &
            Research, Scalable Data                    Collaborative Applications
                    Analytics                          University of Washington
            University of Washington                    Information Technology
               eScience Institute



10/3/2012                              Bill Howe, UW                                1
Assessing UW Researchers‘
   Data Management Needs
1. Conversations with Research Leaders (2008)
   • First large-scale assessment of researchers‘
     needs
   • 124 Interviews with top researchers

2. Faculty Technology Survey (2011)
   • Use of teaching and research technologies
   • Paired with student and TA surveys
   • Reached all disciplines, levels of research
   • 689 instructors responded
Data Management Needs
• Expertise—designing new systems, enhancing
  current practices, analysis

• Storage—Large amounts of data for current
  projects and data archives

• Backup—inconsistent systems among
  researchers, some unreliable practices

• Security—secure access needed for inter-
  institutional partners
Types of Data Stored
                        Quantities/statistics                                     71.60%

                                                                               61.20%
Text (literature, transcriptions, field notes)
                                                                       48.00%
               Images (photographs, maps)
                                                              22.80%
                           Video recordings
                                                          20.50%
                           Audio recordings
                                                         15.20%
                 Multimedia digital objects
                                                       11.70%
          Geo-tagged objects/ spatial data            5.20%
                                                 0%     20%      40%     60%       80%     100%
Data Storage Location
                          My computer                                                 87%

External device (hard drive, thumb drive)                                     66%
                                                                  41%
         Department-managed server
                                                            27%
    Server managed by research team
                                                      12%
       External (Non-UW) data center
                                                 6%
    Department-managed data center               6%
                                  Other          5%

      Data managed by research team              5%

                                            0%        20%    40%        60%     80%    100%
http://escience.washington.edu
The University of Washington
    eScience Institute
• Rationale
   – The exponential increase in sensors is transitioning all fields of science and
     engineering from data-poor to data-rich
   – Techniques and technologies include
         • Sensors and sensor networks, databases, data mining, machine learning,
           visualization, cluster/cloud computing
   – If these techniques and technologies are not widely available and widely
     practiced, UW will cease to be competitive
• Mission
   – Help position the University of Washington at the forefront of research both in
     modern eScience techniques and technologies, and in the fields that depend
     upon them
• Strategy
   – Bootstrap a cadre of Research Scientists
   – Add faculty in key fields
   – Build out a ―consultancy‖ of students and non-research staff


    10/3/2012                      Bill Howe, eScience Institute                    8
eScience Big Data Group
Bill Howe, Phd (databases, cloud, data-intensive scalable computing, visualization)

Staff
    –   Seung-Hee Bae, Phd (postdoc, scalable machine learning algorithms)
    –   Dan Halperin, Phd (postdoc; scalable systems)
    –   Sagar Chitnis, Research Engineer (Azure, databases, web services)
    –   (alumna) Marianne Shaw, Phd (hadoop, semantic graph databases)
    –   (alumna) Alicia Key, Research Engineer (visualization, web applications)
Students
    – Scott Moe (2nd yr Phd, Applied Math)
    – Daniel Perry (2nd yr Phd, HCDE)
Partners
    –   CSE DB Faculty: Magda Balazinska, Dan Suciu
    –   CSE students: Paris Koutris, Prasang Upadhyaya,
    –   UW-IT (web applications, QA/support)
    –   Cecilia Aragon, Phd, Associate Professor, HCDE (visualization, scientific applications)


               10/3/2012                Bill Howe, UW                                   9
[src: Carol Goble]


                     • Power distribution
                     • 80:20 rule
Popularity / Sales




                        Head




                                             Tail


                                         Products / Results
          First published May 2007, Wired Magazine article 2004
[src: Carol Goble]

Long Tail of Research Data
                 High throughput experimental methods
                 Industrial scale
                 Commons based production
                 Publicly data sets
                 Cherry picked results
                 Preserved

GenBank

PDB
  UniProt    ChemSpider

      Pfam   CATH, SCOP
             (Protein
             Structure
             Classification)      Spreadsheets, Notebooks
                                  Local, Lost
Problem
How much time do you spend “handling
data” as opposed to “doing science”?



      Mode answer: “90%”



10/3/2012          Bill Howe, UW       12
ANNOTATIONSUMMARY-COMBINEDORFANNOTATION16_Phaeo_genome
###query                      length   COG hit #1   e-value #1      identity #1   score #1   hit length #1 description #1
chr_4[480001-580000].287       4500
chr_4[560001-660000].1         3556
chr_9[400001-500000].503       4211    COG4547      2.00E-04            19         44.6         620        Cobalamin biosynthesis protein C

          Simple Example
chr_9[320001-420000].548
chr_27[320001-404298].20
                               2833
                               3991
                                       COG5406
                                       COG4547
                                                    2.00E-04
                                                    5.00E-05
                                                                        38
                                                                        18
                                                                                   43.9
                                                                                   46.2
                                                                                                1001
                                                                                                620
                                                                                                           Nucleosome binding factor SPN,
                                                                                                           Cobalamin biosynthesis protein C
chr_26[320001-420000].378      3963    COG5099      5.00E-05            17         46.2         777        RNA-binding protein of the Puf fa
chr_26[400001-441226].196      2949    COG5099      2.00E-04            17         43.9         777        RNA-binding protein of the Puf fa
chr_24[160001-260000].65       3542
chr_5[720001-820000].339       3141    COG5099      4.00E-09            20         59.3         777        RNA-binding protein of the Puf fa
chr_9[160001-260000].243       3002    COG5077      1.00E-25            26         114          1089       Ubiquitin carboxyl-terminal hydr
chr_12[720001-820000].86       2895    COG5032      2.00E-09            30         60.5         2105       Phosphatidylinositol kinase and p
chr_12[800001-900000].109      1463    COG5032      1.00E-09            30         60.1         2105       Phosphatidylinositol kinase and p
chr_11[1-100000].70            2886
chr_11[80001-180000].100       1523



     COGAnnotation_coastal_sample.txt
     id    query                hit        e_value   identity_ score query_start query_end hit_start hit_end hit_length
         1 FHJ7DRN01A0TND.1     COG0414     1.00E-08         28    51           1        74      180      257        285
         2 FHJ7DRN01A1AD2.2     COG0092     3.00E-20         47 89.9            6        85       41      120        233
         3 FHJ7DRN01A2HWZ.4     COG3889       0.0006         26 35.8            9        94      758      845        872
           …
      2853 FHJ7DRN02HXTBY.5     COG5077     7.00E-09           37     52.3             3           77      313       388       1089
      2854 FHJ7DRN02HZO4J.2     COG0444     2.00E-31           67      127             1           73      135       207        316
           …
      3566 FHJ7DRN02FUJW3.1     COG5032     1.00E-09           32     54.7             1           75     1965     2038        2105
           …




   SELECT * FROM Phaeo_genome p, coastal_sample c WHERE p.COG_hit = c.hit
                        10/3/2012                        Bill Howe, UW                                                       13
Data management becoming the bottleneck
• Data is captured and manipulated in spreadsheets
• This approach made sense five years ago; the data
  volumes were manageable
• But now:
    –   50k rows in each of 50-100 spreadsheets, per scientist
    –   Significantly more sharing required: ―mega-collabs‖
    –   Continuously producing new results
    –   The management overhead is beginning to dominate
    –   Difficult to share data publicly (emailing files is a mess)
    –   Lots of work to get an answers to any new question




   10/3/2012                                                          14
Why not build a database?
• Assumes an pre-defined schema
     – more on this later
• Requires specialized technical skills and a huge amount
  of up-front effort
• Not a perfect fit – it‘s hard to design a ―permanent‖
  database for a fast-moving research target
• Researchers have little interest in operating and
  maintaining a data system – they just want to organize,
  manipulate, and share data




10/3/2012                                             15
Database-as-a-Service for the 99%
Approach: Strip down databases to bare essentials
   – Upload -> Query -> Share
   – Try to eliminate
     installation, configuration, schema design, data
     loading, tuning, easy, thanks to the cloud
                       app-building
       harder




  10/3/2012                       Bill Howe, UW         16
Getting Data In
• Upload data through a browser to a cloud
  database (so no need to manage a local system)
• No need to ―design a database‖ before you use
  your data – just upload and get to work
• Write SQL queries, with some automated help and
  some guided help from experts
• Build on your own results
    – Output of one query can be the input to another
    – Encourages sharing and reuse
    – Avoids everyone on the team running the same data
      processing steps over and over
    – Provides provenance – ―how did you get this result?‖

 10/3/2012                     Bill Howe, UW                 17
Select from a list of English descriptions
10/3/2012                                                18
Edit a Query




               Save the results, share them with others




  10/3/2012                                               19
http://sqlshare.escience.washington.edu
10/3/2012                  Bill Howe, UW              20
http://sqlshare.escience.washington.edu
10/3/2012                  Bill Howe, UW              21
http://sqlshare.escience.washington.edu

10/3/2012                       Bill Howe, UW         22
http://sqlshare.escience.washington.edu

10/3/2012                       Bill Howe, UW         23
http://sqlshare.escience.washington.edu
10/3/2012                 Bill Howe, UW               24
Spreadsheet   Excel    Python     R
                               Crawler     Addin    Client   Client
             ―Flagship‖
VizDeck    SQLShare App         ASP.NET
          (Python) on EC2




                                           OAuth2
                    SQLShare REST API      WCF
Deeply nested hierarchies of views

                      Provenance
                    Controlled Sharing
10/3/2012                Bill Howe, UW           26
Robin
                                                   Kodner
2010 Pilot- Outreach and Education-
based sampling: Schooner Adventuress
                              9




   13
                                                            11



               14
                                              12

                         15
                        16



                    4   510




                                  8

                                          7
   10/3/2012                  Bill Howe, UW
                                    6                       27
NatureMapping Program                                                                      Karen
                                                                                           Dvornich
       Wildlife Observations (1902- )




                                                           Data collection and submission options:
                                                         1. Download/upload spreadsheet
                                                         2. Online data entry
                                                         3. NatureTracker on handheld/GPS
                                                         4. Android ODK (Open Data Kit)


     Water Quality Monitoring Sites (2003 - )




 10/3/2012                                      Bill Howe, UW                                    28
Andrew
                                                     White, UW
                                                     Chemistry




“An undergraduate student and I are working with gigabytes of tabular
data derived from analysis of protein surfaces.

Previously, we were using huge directory trees and plain text files.

Now we can accomplish a 10 minute 100 line script in 1 line of SQL.”
                                                   -- Andrew D White
  10/3/2012                      Bill Howe, UW                         29
WHY SQL?


10/3/2012     Bill Howe, UW   30
A ―Needs Hierarchy‖ of Science Data Management

“As each need is satisfied, the
next higher level in the hierarchy
dominates conscious functioning.”
                         -- Maslow 43




            full semantic integration

             analytics

            query
            sharing

            storage
10/3/2012                         Bill Howe, UW   31
NoSchema (not NoSQL)
• A schema* is a shared consensus about some
  universe of discourse
• At the frontier of research, this shared consensus
  does not exist, by definition
• Any schema that does emerge will change
  frequently, by definition
• Data found ―in the wild‖ will typically not conform
  to any schema, by definition
• But this doesn‘t mean we have to punt on
  databases and go back to ad hoc scripts and files




     * ontology/metadata standard/controlled vocabulary/etc.
10/3/2012                        Bill Howe, UW                 32
A ―Needs Hierarchy‖ of Science Data Management

 Silos           full semantic
                 integration

                 analytics


                 query


                 sharing
                                          Cloud
                 storage

10/3/2012             Bill Howe, UW               33
What‘s the point?
• Conventional wisdom says ―Science data isn‘t relational‖
    – This is utter nonsense
• Conventional wisdom says ―Scientists won‘t write SQL‖
    – This is utter nonsense
• So why aren‘t databases being used more often?
  – They‘re a PITA
• We implicate difficulty in
    –   installation, configuration
    –   schema design, data loading
    –   performance tuning
    –   app-building (NoGUI?)

We ask instead, “What kind of platform can
support ad hoc scientific Q&A with SQL?”

              5/18/10          Garret Cole, eScience Institute
Biologists are beginning to write very complex
queries (rather than relying on staff programmers)
Example: Computing the overlaps of two sets of blast results

 SELECT x.strain, x.chr, x.region as snp_region, x.start_bp as snp_start_bp
   , x.end_bp as snp_end_bp, w.start_bp as nc_start_bp, w.end_bp as nc_end_bp
   , w.category as nc_category
   , CASE WHEN (x.start_bp >= w.start_bp AND x.end_bp <= w.end_bp)
  THEN x.end_bp - x.start_bp + 1
  WHEN (x.start_bp <= w.start_bp AND w.start_bp <= x.end_bp)
  THEN x.end_bp - w.start_bp + 1                                            We see thousands     of
  WHEN (x.start_bp <= w.end_bp AND w.end_bp <= x.end_bp)                    queries written by
  THEN w.end_bp - x.start_bp + 1
 END AS len_overlap
                                                                            non-programmers

  FROM [koesterj@washington.edu].[hotspots_deserts.tab] x
  INNER JOIN [koesterj@washington.edu].[table_noncoding_positions.tab] w
  ON x.chr = w.chr
  WHERE (x.start_bp >= w.start_bp AND x.end_bp <= w.end_bp)
  OR (x.start_bp <= w.start_bp AND w.start_bp <= x.end_bp)
  OR (x.start_bp <= w.end_bp AND w.end_bp <= x.end_bp)
  ORDER BY x.strain, x.chr ASC, x.start_bp ASC
So why SQL?
• Covers 80% of what we need
    – Ex: Sloan Digital Sky Survey
    – Ex: Hybrid Hash Join algorithm published in BMC bioinformatics
• Empower a new class of data-savvy scientist who isn‘t
  forced to be trained as an IT professional
• Algebraic optimization
    – Ask me about this if you‘re interested
• Views: Logical and physical data Independence
    –   Reason about the problem independently of the data representation
    –   No re-execution of ―workflows‖
    –   No file format incompatibilities
    –   No version mismatches
    –   Data and code tightly coupled and (logically) centralize
 10/3/2012                      Bill Howe, UW                      36
SQLShare as a CS Research Platform
                                                                         SSDBM 2011
• Automatic ―Starter‖ Queries
   –   (Bill Howe, Garret Cole, Nodira Khoussainova, Leilani Battle)
                                                                         SIGMOD 2011 (demo)
• VizDeck: Automatic Mashups and Visualization                           SSDBM 2011
   –   (Bill Howe, Alicia Key, Daniel Perry, Cecilia Aragon)             CHI 2012
• Info Extraction from Spreadsheets                                      SIGMOD 2012 (demo)
   –   (Mike Cafarella, Dave Maier, Bill Howe, Sagar Chitnis, Abdu Alwani)
• Scalable Analytics-as-a-Service
   –   (Dan Suciu, Magda Balazinska, Bill Howe)
• Optimizing Iterative Queries for Machine Learning                          VLDB 2010
   –   (Dan Suciu, Magda Balazinska, Bill Howe)                              Datalog2.0 2012
                                                                             CIDR 2013




  10/3/2012                                  Bill Howe, UW                                     37
Where we‘re headed:
              •   Local or cloud-hosted deployments        done!
              •   Multi-institution sharing
              •   Global users and permissions
              •   Distributed data and distributed query

            We are looking for partners!




10/3/2012               Bill Howe, UW                              38
http://sqlshare.escience.washington.edu


     billhowe@cs.washington.edu


       http://escience.washington.edu




10/3/2012                  Bill Howe, UW      39
10/3/2012   Bill Howe, UW   40
Scientific data management reduces to sharing views
• Integrate data from multiple sources?
      – joins and unions with views
• Standardize on units, apply naming conventions?
      – rename columns, apply functions with views
• Attach metadata?
      – add new tables with descriptive names, add new columns with views
• Data cleaning, quality control?
      – hide bad values with views
• Maintain provenance?
      – inspect view dependencies
• Propagate updates?
      – view maintenance
• Protect sensitive data?
      – expose subsets with views (assuming views carry permissions)
10/3/2012                        Bill Howe, UW                         41
An observation about ―handling data‖

• How many plasmids were bombarded in July and
  have a rescue and expression?

  SELECT count(*)
  FROM [bombardment_log]
  WHERE bomb_date BETWEEN ‘7/1/2010' AND ‘7/31/2010'
  AND rescue clone IS NOT NULL
  AND [expression?] = 'yes'




          5/18/10             Garret Cole, eScience Institute
An observation about ―handling data‖

• Which samples have not been cloned?



  SELECT *
  FROM plasmiddb
  WHERE NOT (ISDATE(cloned) OR cloned = ‗yes‘)




          5/18/10          Garret Cole, eScience Institute
An observation about ―handling data‖


• How often does each RNA hit appear inside
  the annotated surface group?

          SELECT hit, COUNT(*) as cnt
           FROM tigrfamannotation_surface
          GROUP BY hit
          ORDER BY cnt DESC




5/18/10                   Garret Cole, eScience Institute
An observation about ―handling data‖
For a given promoter (or protein fusion), how many
expressing line have been generated (they would all
have different strain designations)


 SELECT strain, count(distinct line)
 FROM glycerol_stocks
 GROUP BY strain




         5/18/10              Garret Cole, eScience Institute
Find all TIGRFam ids (proteins) that are missing from at least
one of three samples (relations)
              SELECT col0 FROM [refseq_hma_fasta_TGIRfam_refs]
              UNION
              SELECT col0 FROM [est_hma_fasta_TGIRfam_refs]
              UNION
              SELECT col0 FROM [combo_hma_fasta_TGIRfam_refs]

              EXCEPT

              SELECT col0 FROM [refseq_hma_fasta_TGIRfam_refs]
              INTERSECT
              SELECT col0 FROM [est_hma_fasta_TGIRfam_refs]
              INTERSECT
              SELECT col0 FROM [combo_hma_fasta_TGIRfam_refs]

  10/3/2012                         Bill Howe, UW                46
On NoSQL




10/3/2012      Bill Howe, UW   47
An Observation on NoSQL
•   2004 Dean et al. MapReduce
•   2008 Hadoop 0.17 release
•   2008 Olston et al. Pig: Relational Algebra on Hadoop
•   2008 DryadLINQ: Relational Algebra in a Hadoop-like system
•   2009 Thusoo et al. HIVE: SQL on Hadoop



NoSQL is a misnomer
      –     NoMySQL?
      –     NoSchema?
      –     NoLoading?
      –     NoLicenseFees!


10/3/2012                       Bill Howe, UW                    48
Digression: Relational Database History
    Pre-Relational: if your data changed, your application broke.
    Early RDBMS were buggy and slow (and often reviled), but
    required only 5% of the application code.


“Activities of users at terminals and most application programs should
remain unaffected when the internal representation of data is changed and
even when some aspects of the external representation are changed.”
                                 -- Codd 1979

Key Ideas: Programs that manipulate tabular data exhibit an algebraic
structure allowing reasoning and manipulation independently of physical
data representation
Key Idea: An Algebra of Tables

                                                   select


                                                   project


                                join               join

    Other operators: aggregate, union, difference, cross product

10/3/2012                        Bill Howe, UW                     50
Algebraic Optimization
     N = ((4*2)+((4*3)+0))/1

     Algebraic Laws:
     1. (+) identity:      x+0 = x
     2. (/) identity:      x/1 = x
     3. (*) distributes:   (n*x+n*y) = n*(x+y)
     4. (*) commutes:      x*y = y*x

     Apply rules 1, 3, 4, 2: N = (2+3)*4
     two operations instead of five, no division operator

Same idea works with very large tables / graphs, but the payoff is much higher




                                                                             51
Architecture
                                          Excel     Python     R
                                          Addin     Client   Client


Spreadsheet      ―Flagship‖     VizDeck
   Ingest      SQLShare App
              (Python) on EC2
   ASP.NET




                      SQLShare REST API
                                                    SQLShare REST API
                  Windows Azure
                                                  IIS


                    SQL Azure                       SQL Server
Science is becoming a database query problem
Old model: “Query the world” (Data acquisition coupled to a specific hypothesis)
New model: “Download the world” (Data acquisition supports many hypotheses)
  – Astronomy: High-resolution, high-frequency sky surveys (SDSS, LSST, PanSTARRS)
  – Biology: lab automation, high-throughput sequencing,
  – Oceanography: high-resolution models, cheap sensors, satellites




40TB / 2 nights
1 device




                          ~1TB / day
                          100s of devices


    10/3/2012                  Bill Howe, eScience Institute               53
Problem
 – Research data is captured and manipulated in
   spreadsheets
 – This perhaps made sense five years ago; the data volumes
   were manageable
 – But now: 50k rows, 100s of files, ―mega-collabs‖
Why not put everything into a database?
 – A huge amount of up-front effort
 – Hard to design for a moving target
 – Running a database system is huge drain
Approach: SQLShare
 –    Upload data through your browser: no setup, no install
 –    Login to browse science questions in English
 –    Click a question, see the SQL to answer it question
 –    Edit the SQL to answer an "adjacent" question, even if you
      wouldn‘t know how to write it from scratch
             https://sqlshare.escience.washington.edu/
     10/3/2012                                                     54
Astronomy
                    LSST
                                       Oceanography
                     PanSTARRS                           SRA       MG-RAST
                                  OOI                          GenBank
# of bytes




                                                         PDB
                    SDSS                                                     Biology
                                            IOOS         UniProt   Pfam
                                  NDBC
                                                           LANL     Galaxy
                                                  Pathway HIV
                                                  Commons          BioMart
                                            GEO

                                                               SQLShare


                                   # of sources

        10/3/2012                       Bill Howe, UW                             55
Parallel Iterative Analytics


Large scale
                                                                     GridFields
―Long tail‖
              Scale, training




                                                    SQLShare: Query-as-a-Service




                                  VizDeck

citizen science;
public outreach
                                  astrophysics       life sciences     earth sciences   commercial
Parallel Iterative Analytics


Large
scale                                                                  GridFields
―Long tail‖
              Scale, training




                                                    SQLShare: Query-as-a-Service




                                  VizDeck

citizen science;
public outreach
                                    astrophysics       life sciences     earth sciences   commercial
A ―Needs Hierarchy‖ of Science Data Management




            analytics

            query

            full integration

            sharing

            storage
10/3/2012                      Bill Howe, UW     58
A ―Needs Hierarchy‖ of Science Data Management




            analytics

            query

            full integration

            sharing

            storage
10/3/2012                      Bill Howe, UW     59
Four Conjectures about
Declarative Query for Science
 • Most science data manipulation tasks can be
   expressed in relational algebra
 • Most science analytics task can be expressed
   in relational algebra + recursion
       Hellerstein 09, Re 12

 • These expressions can be efficiently and
   scalably executed in the cloud
 • Researchers are willing and able to program
   using relational algebra languages
       c.f. SDSS
10/3/2012                      Bill Howe, UW     60
5/18/10   Garret Cole, eScience Institute
5/18/10   Garret Cole, eScience Institute
5/18/10   Garret Cole, eScience Institute
metadata



    sequence
    data


           search results
SQL




      5/18/10   Garret Cole, eScience Institute
The Long Tail
                  LSST
                  (~100PB)                       “The future is already here; it’s just
                   CERN                          not very evenly distributed.”
                   (~15PB/year)
                                                                -- William Gibson
data volume




                     PanSTARRS
                     (~40PB)

                          SDSS
                          (~100TB)


                               Ocean
               CARMEN          Modelers      Seis-                Microbiologists   <Spreadsheet
               (~50TB)                       mologists                              users>


                                          rank
              10/3/2012                      Bill Howe, eScience Institute                     66
Experimental Engagement Algorithm for the Long Tail

     A stripped-down version of Jim Gray’s
        “20 questions” methodology

1.    Get the data
2.    Load the data ―as is‖ – no schema design
3.    Get ~20 questions (in English)
4.    Translate the questions into SQL (when possible)
5.    Provide these ―starter queries‖ to the researchers

 Q: Can researchers questions be expressed in SQL?
 Q: Are a few examples sufficient for novices to self-train with SQL?
 Q: Can we scale this process up?
 Q: If so, will the use of SQL reduce their data handling overhead?
• Which samples have not been cloned?



  SELECT *
  FROM plasmiddb
  WHERE NOT (ISDATE(cloned) OR cloned = ‗yes‘)




 10/3/2012                 Bill Howe, UW         68
Details
•   Backend is Microsoft‘s SQL Azure
•   About a year old, essentially zero advertising
•   50+ active users (UW and external)
•   2500+ tables/views (~20% are public
•   largest table: 1.1M rows, smallest table: 1 row
•   transitioning to support by UW-IT




10/3/2012                 Bill Howe, UW               69
View Semantics and Features
•      ―Saved query‖ = View with attached metadata
•      Unify views and tables as ―datasets‖
        – table = “select * from [raw_table]”
•      Replacement semantics for name conflicts
        – old versions materialized and archived
•      Materialize downstream views
        – when dependencies deleted
        – when dependencies become incompatible
•      Permissions
        – public, private, ACLs (may work on groups in the future)
•      Sharing, social querying, CQMS*
        – search, recent queries, friends’ queries, favorites, ratings
        – facilitate sharing and recommendations of not just whole queries, but
          common predicates, join patterns, etc.




    10/3/2012                          Bill Howe, UW   * [Khoussainova, CIDR 2009]
                                                                             70
Karen
                            Dvornich




10/3/2012   Bill Howe, UW        71
10/3/2012   Bill Howe, UW   72

Contenu connexe

Similaire à Enabling Collaborative Research Data Management with SQLShare

Tim Malthus_Towards standards for the exchange of field spectral datasets
Tim Malthus_Towards standards for the exchange of field spectral datasetsTim Malthus_Towards standards for the exchange of field spectral datasets
Tim Malthus_Towards standards for the exchange of field spectral datasetsTERN Australia
 
Data-intensive profile for the VAMDC
Data-intensive profile for the VAMDCData-intensive profile for the VAMDC
Data-intensive profile for the VAMDCAstroAtom
 
On the diversity and availability of temporal information in linked open data
On the diversity and availability of temporal information in linked open dataOn the diversity and availability of temporal information in linked open data
On the diversity and availability of temporal information in linked open dataAnisa Rula
 
Workflows to access and massage VOData
Workflows to access and massage VODataWorkflows to access and massage VOData
Workflows to access and massage VODataJose Enrique Ruiz
 
Linked Data for Federation of OER Data &amp; Repositories
Linked Data for Federation of OER Data &amp; RepositoriesLinked Data for Federation of OER Data &amp; Repositories
Linked Data for Federation of OER Data &amp; RepositoriesStefan Dietze
 
XLDB South America Keynote: eScience Institute and Myria
XLDB South America Keynote: eScience Institute and MyriaXLDB South America Keynote: eScience Institute and Myria
XLDB South America Keynote: eScience Institute and MyriaUniversity of Washington
 
2+3D Photography 2017 – SP 1 Multispectral Imaging Workflow Integration - Meg...
2+3D Photography 2017 – SP 1 Multispectral Imaging Workflow Integration - Meg...2+3D Photography 2017 – SP 1 Multispectral Imaging Workflow Integration - Meg...
2+3D Photography 2017 – SP 1 Multispectral Imaging Workflow Integration - Meg...rijksmuseum
 
Aspects of Reproducibility in Earth Science
Aspects of Reproducibility in Earth ScienceAspects of Reproducibility in Earth Science
Aspects of Reproducibility in Earth ScienceRaul Palma
 
Preserving the Inputs and Outputs of Scholarship
Preserving the Inputs and Outputs of ScholarshipPreserving the Inputs and Outputs of Scholarship
Preserving the Inputs and Outputs of Scholarshiptsbbbu
 
Open Data: Ready Set Go
Open Data: Ready Set GoOpen Data: Ready Set Go
Open Data: Ready Set GoPaul Groth
 
Workflows, provenance and reporting: a lifecycle perspective at BIH 2013, Rome
Workflows, provenance and reporting: a lifecycle perspective at BIH 2013, RomeWorkflows, provenance and reporting: a lifecycle perspective at BIH 2013, Rome
Workflows, provenance and reporting: a lifecycle perspective at BIH 2013, RomeCarole Goble
 
Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing
Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data PublishingScott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing
Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data PublishingGigaScience, BGI Hong Kong
 
Collaboration and Sharing
Collaboration and SharingCollaboration and Sharing
Collaboration and SharingJisc
 
Crushing, Blending, and Stretching Data
Crushing, Blending, and Stretching DataCrushing, Blending, and Stretching Data
Crushing, Blending, and Stretching DataRay Schwartz
 
Metabolomic Data Analysis Workshop and Tutorials (2014)
Metabolomic Data Analysis Workshop and Tutorials (2014)Metabolomic Data Analysis Workshop and Tutorials (2014)
Metabolomic Data Analysis Workshop and Tutorials (2014)Dmitry Grapov
 
Discovery Engines for Big Data: Accelerating Discovery in Basic Energy Sciences
Discovery Engines for Big Data: Accelerating Discovery in Basic Energy SciencesDiscovery Engines for Big Data: Accelerating Discovery in Basic Energy Sciences
Discovery Engines for Big Data: Accelerating Discovery in Basic Energy SciencesIan Foster
 
Tracking Trends in Korean Information Science Research, 2000-2011
Tracking Trends in Korean Information Science Research, 2000-2011Tracking Trends in Korean Information Science Research, 2000-2011
Tracking Trends in Korean Information Science Research, 2000-2011SoYoung YU
 

Similaire à Enabling Collaborative Research Data Management with SQLShare (20)

Tim Malthus_Towards standards for the exchange of field spectral datasets
Tim Malthus_Towards standards for the exchange of field spectral datasetsTim Malthus_Towards standards for the exchange of field spectral datasets
Tim Malthus_Towards standards for the exchange of field spectral datasets
 
Data-intensive profile for the VAMDC
Data-intensive profile for the VAMDCData-intensive profile for the VAMDC
Data-intensive profile for the VAMDC
 
On the diversity and availability of temporal information in linked open data
On the diversity and availability of temporal information in linked open dataOn the diversity and availability of temporal information in linked open data
On the diversity and availability of temporal information in linked open data
 
Workflows to access and massage VOData
Workflows to access and massage VODataWorkflows to access and massage VOData
Workflows to access and massage VOData
 
Linked Data for Federation of OER Data &amp; Repositories
Linked Data for Federation of OER Data &amp; RepositoriesLinked Data for Federation of OER Data &amp; Repositories
Linked Data for Federation of OER Data &amp; Repositories
 
XLDB South America Keynote: eScience Institute and Myria
XLDB South America Keynote: eScience Institute and MyriaXLDB South America Keynote: eScience Institute and Myria
XLDB South America Keynote: eScience Institute and Myria
 
COPO - Collaborative Open Plant Omics, by Rob Davey
COPO - Collaborative Open Plant Omics, by Rob DaveyCOPO - Collaborative Open Plant Omics, by Rob Davey
COPO - Collaborative Open Plant Omics, by Rob Davey
 
Digital Curation for Excel (DCXL)
Digital Curation for Excel (DCXL)Digital Curation for Excel (DCXL)
Digital Curation for Excel (DCXL)
 
2+3D Photography 2017 – SP 1 Multispectral Imaging Workflow Integration - Meg...
2+3D Photography 2017 – SP 1 Multispectral Imaging Workflow Integration - Meg...2+3D Photography 2017 – SP 1 Multispectral Imaging Workflow Integration - Meg...
2+3D Photography 2017 – SP 1 Multispectral Imaging Workflow Integration - Meg...
 
Aspects of Reproducibility in Earth Science
Aspects of Reproducibility in Earth ScienceAspects of Reproducibility in Earth Science
Aspects of Reproducibility in Earth Science
 
Preserving the Inputs and Outputs of Scholarship
Preserving the Inputs and Outputs of ScholarshipPreserving the Inputs and Outputs of Scholarship
Preserving the Inputs and Outputs of Scholarship
 
Open Data: Ready Set Go
Open Data: Ready Set GoOpen Data: Ready Set Go
Open Data: Ready Set Go
 
Workflows, provenance and reporting: a lifecycle perspective at BIH 2013, Rome
Workflows, provenance and reporting: a lifecycle perspective at BIH 2013, RomeWorkflows, provenance and reporting: a lifecycle perspective at BIH 2013, Rome
Workflows, provenance and reporting: a lifecycle perspective at BIH 2013, Rome
 
Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing
Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data PublishingScott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing
Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing
 
Collaboration and Sharing
Collaboration and SharingCollaboration and Sharing
Collaboration and Sharing
 
Inter Lab Quigg 2
Inter Lab Quigg 2Inter Lab Quigg 2
Inter Lab Quigg 2
 
Crushing, Blending, and Stretching Data
Crushing, Blending, and Stretching DataCrushing, Blending, and Stretching Data
Crushing, Blending, and Stretching Data
 
Metabolomic Data Analysis Workshop and Tutorials (2014)
Metabolomic Data Analysis Workshop and Tutorials (2014)Metabolomic Data Analysis Workshop and Tutorials (2014)
Metabolomic Data Analysis Workshop and Tutorials (2014)
 
Discovery Engines for Big Data: Accelerating Discovery in Basic Energy Sciences
Discovery Engines for Big Data: Accelerating Discovery in Basic Energy SciencesDiscovery Engines for Big Data: Accelerating Discovery in Basic Energy Sciences
Discovery Engines for Big Data: Accelerating Discovery in Basic Energy Sciences
 
Tracking Trends in Korean Information Science Research, 2000-2011
Tracking Trends in Korean Information Science Research, 2000-2011Tracking Trends in Korean Information Science Research, 2000-2011
Tracking Trends in Korean Information Science Research, 2000-2011
 

Plus de University of Washington

Database Agnostic Workload Management (CIDR 2019)
Database Agnostic Workload Management (CIDR 2019)Database Agnostic Workload Management (CIDR 2019)
Database Agnostic Workload Management (CIDR 2019)University of Washington
 
Data Responsibly: The next decade of data science
Data Responsibly: The next decade of data scienceData Responsibly: The next decade of data science
Data Responsibly: The next decade of data scienceUniversity of Washington
 
Thoughts on Big Data and more for the WA State Legislature
Thoughts on Big Data and more for the WA State LegislatureThoughts on Big Data and more for the WA State Legislature
Thoughts on Big Data and more for the WA State LegislatureUniversity of Washington
 
The Other HPC: High Productivity Computing in Polystore Environments
The Other HPC: High Productivity Computing in Polystore EnvironmentsThe Other HPC: High Productivity Computing in Polystore Environments
The Other HPC: High Productivity Computing in Polystore EnvironmentsUniversity of Washington
 
Big Data + Big Sim: Query Processing over Unstructured CFD Models
Big Data + Big Sim: Query Processing over Unstructured CFD ModelsBig Data + Big Sim: Query Processing over Unstructured CFD Models
Big Data + Big Sim: Query Processing over Unstructured CFD ModelsUniversity of Washington
 
Data, Responsibly: The Next Decade of Data Science
Data, Responsibly: The Next Decade of Data ScienceData, Responsibly: The Next Decade of Data Science
Data, Responsibly: The Next Decade of Data ScienceUniversity of Washington
 
Data Science, Data Curation, and Human-Data Interaction
Data Science, Data Curation, and Human-Data InteractionData Science, Data Curation, and Human-Data Interaction
Data Science, Data Curation, and Human-Data InteractionUniversity of Washington
 
The Other HPC: High Productivity Computing
The Other HPC: High Productivity ComputingThe Other HPC: High Productivity Computing
The Other HPC: High Productivity ComputingUniversity of Washington
 
Big Data Talent in Academic and Industry R&D
Big Data Talent in Academic and Industry R&DBig Data Talent in Academic and Industry R&D
Big Data Talent in Academic and Industry R&DUniversity of Washington
 
Big Data Middleware: CIDR 2015 Gong Show Talk, David Maier, Bill Howe
Big Data Middleware: CIDR 2015 Gong Show Talk, David Maier, Bill Howe Big Data Middleware: CIDR 2015 Gong Show Talk, David Maier, Bill Howe
Big Data Middleware: CIDR 2015 Gong Show Talk, David Maier, Bill Howe University of Washington
 
MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)
MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)
MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)University of Washington
 
Myria: Analytics-as-a-Service for (Data) Scientists
Myria: Analytics-as-a-Service for (Data) ScientistsMyria: Analytics-as-a-Service for (Data) Scientists
Myria: Analytics-as-a-Service for (Data) ScientistsUniversity of Washington
 
Big Data Curricula at the UW eScience Institute, JSM 2013
Big Data Curricula at the UW eScience Institute, JSM 2013Big Data Curricula at the UW eScience Institute, JSM 2013
Big Data Curricula at the UW eScience Institute, JSM 2013University of Washington
 

Plus de University of Washington (20)

Database Agnostic Workload Management (CIDR 2019)
Database Agnostic Workload Management (CIDR 2019)Database Agnostic Workload Management (CIDR 2019)
Database Agnostic Workload Management (CIDR 2019)
 
Data Responsibly: The next decade of data science
Data Responsibly: The next decade of data scienceData Responsibly: The next decade of data science
Data Responsibly: The next decade of data science
 
Thoughts on Big Data and more for the WA State Legislature
Thoughts on Big Data and more for the WA State LegislatureThoughts on Big Data and more for the WA State Legislature
Thoughts on Big Data and more for the WA State Legislature
 
The Other HPC: High Productivity Computing in Polystore Environments
The Other HPC: High Productivity Computing in Polystore EnvironmentsThe Other HPC: High Productivity Computing in Polystore Environments
The Other HPC: High Productivity Computing in Polystore Environments
 
Big Data + Big Sim: Query Processing over Unstructured CFD Models
Big Data + Big Sim: Query Processing over Unstructured CFD ModelsBig Data + Big Sim: Query Processing over Unstructured CFD Models
Big Data + Big Sim: Query Processing over Unstructured CFD Models
 
Data, Responsibly: The Next Decade of Data Science
Data, Responsibly: The Next Decade of Data ScienceData, Responsibly: The Next Decade of Data Science
Data, Responsibly: The Next Decade of Data Science
 
Democratizing Data Science in the Cloud
Democratizing Data Science in the CloudDemocratizing Data Science in the Cloud
Democratizing Data Science in the Cloud
 
Science Data, Responsibly
Science Data, ResponsiblyScience Data, Responsibly
Science Data, Responsibly
 
Data Science, Data Curation, and Human-Data Interaction
Data Science, Data Curation, and Human-Data InteractionData Science, Data Curation, and Human-Data Interaction
Data Science, Data Curation, and Human-Data Interaction
 
The Other HPC: High Productivity Computing
The Other HPC: High Productivity ComputingThe Other HPC: High Productivity Computing
The Other HPC: High Productivity Computing
 
Urban Data Science at UW
Urban Data Science at UWUrban Data Science at UW
Urban Data Science at UW
 
Intro to Data Science Concepts
Intro to Data Science ConceptsIntro to Data Science Concepts
Intro to Data Science Concepts
 
Big Data Talent in Academic and Industry R&D
Big Data Talent in Academic and Industry R&DBig Data Talent in Academic and Industry R&D
Big Data Talent in Academic and Industry R&D
 
Big Data Middleware: CIDR 2015 Gong Show Talk, David Maier, Bill Howe
Big Data Middleware: CIDR 2015 Gong Show Talk, David Maier, Bill Howe Big Data Middleware: CIDR 2015 Gong Show Talk, David Maier, Bill Howe
Big Data Middleware: CIDR 2015 Gong Show Talk, David Maier, Bill Howe
 
Data Science and Urban Science @ UW
Data Science and Urban Science @ UWData Science and Urban Science @ UW
Data Science and Urban Science @ UW
 
MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)
MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)
MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)
 
Myria: Analytics-as-a-Service for (Data) Scientists
Myria: Analytics-as-a-Service for (Data) ScientistsMyria: Analytics-as-a-Service for (Data) Scientists
Myria: Analytics-as-a-Service for (Data) Scientists
 
Big Data Curricula at the UW eScience Institute, JSM 2013
Big Data Curricula at the UW eScience Institute, JSM 2013Big Data Curricula at the UW eScience Institute, JSM 2013
Big Data Curricula at the UW eScience Institute, JSM 2013
 
eResearch New Zealand Keynote
eResearch New Zealand KeynoteeResearch New Zealand Keynote
eResearch New Zealand Keynote
 
Data science curricula at UW
Data science curricula at UWData science curricula at UW
Data science curricula at UW
 

Dernier

Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGSujit Pal
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 

Dernier (20)

Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAG
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 

Enabling Collaborative Research Data Management with SQLShare

  • 1. Enabling Collaborative Research Data Management with SQLShare Bill Howe, PhD Tom Lewis Director of Director, Academic & Research, Scalable Data Collaborative Applications Analytics University of Washington University of Washington Information Technology eScience Institute 10/3/2012 Bill Howe, UW 1
  • 2. Assessing UW Researchers‘ Data Management Needs 1. Conversations with Research Leaders (2008) • First large-scale assessment of researchers‘ needs • 124 Interviews with top researchers 2. Faculty Technology Survey (2011) • Use of teaching and research technologies • Paired with student and TA surveys • Reached all disciplines, levels of research • 689 instructors responded
  • 3. Data Management Needs • Expertise—designing new systems, enhancing current practices, analysis • Storage—Large amounts of data for current projects and data archives • Backup—inconsistent systems among researchers, some unreliable practices • Security—secure access needed for inter- institutional partners
  • 4. Types of Data Stored Quantities/statistics 71.60% 61.20% Text (literature, transcriptions, field notes) 48.00% Images (photographs, maps) 22.80% Video recordings 20.50% Audio recordings 15.20% Multimedia digital objects 11.70% Geo-tagged objects/ spatial data 5.20% 0% 20% 40% 60% 80% 100%
  • 5. Data Storage Location My computer 87% External device (hard drive, thumb drive) 66% 41% Department-managed server 27% Server managed by research team 12% External (Non-UW) data center 6% Department-managed data center 6% Other 5% Data managed by research team 5% 0% 20% 40% 60% 80% 100%
  • 6.
  • 8. The University of Washington eScience Institute • Rationale – The exponential increase in sensors is transitioning all fields of science and engineering from data-poor to data-rich – Techniques and technologies include • Sensors and sensor networks, databases, data mining, machine learning, visualization, cluster/cloud computing – If these techniques and technologies are not widely available and widely practiced, UW will cease to be competitive • Mission – Help position the University of Washington at the forefront of research both in modern eScience techniques and technologies, and in the fields that depend upon them • Strategy – Bootstrap a cadre of Research Scientists – Add faculty in key fields – Build out a ―consultancy‖ of students and non-research staff 10/3/2012 Bill Howe, eScience Institute 8
  • 9. eScience Big Data Group Bill Howe, Phd (databases, cloud, data-intensive scalable computing, visualization) Staff – Seung-Hee Bae, Phd (postdoc, scalable machine learning algorithms) – Dan Halperin, Phd (postdoc; scalable systems) – Sagar Chitnis, Research Engineer (Azure, databases, web services) – (alumna) Marianne Shaw, Phd (hadoop, semantic graph databases) – (alumna) Alicia Key, Research Engineer (visualization, web applications) Students – Scott Moe (2nd yr Phd, Applied Math) – Daniel Perry (2nd yr Phd, HCDE) Partners – CSE DB Faculty: Magda Balazinska, Dan Suciu – CSE students: Paris Koutris, Prasang Upadhyaya, – UW-IT (web applications, QA/support) – Cecilia Aragon, Phd, Associate Professor, HCDE (visualization, scientific applications) 10/3/2012 Bill Howe, UW 9
  • 10. [src: Carol Goble] • Power distribution • 80:20 rule Popularity / Sales Head Tail Products / Results First published May 2007, Wired Magazine article 2004
  • 11. [src: Carol Goble] Long Tail of Research Data High throughput experimental methods Industrial scale Commons based production Publicly data sets Cherry picked results Preserved GenBank PDB UniProt ChemSpider Pfam CATH, SCOP (Protein Structure Classification) Spreadsheets, Notebooks Local, Lost
  • 12. Problem How much time do you spend “handling data” as opposed to “doing science”? Mode answer: “90%” 10/3/2012 Bill Howe, UW 12
  • 13. ANNOTATIONSUMMARY-COMBINEDORFANNOTATION16_Phaeo_genome ###query length COG hit #1 e-value #1 identity #1 score #1 hit length #1 description #1 chr_4[480001-580000].287 4500 chr_4[560001-660000].1 3556 chr_9[400001-500000].503 4211 COG4547 2.00E-04 19 44.6 620 Cobalamin biosynthesis protein C Simple Example chr_9[320001-420000].548 chr_27[320001-404298].20 2833 3991 COG5406 COG4547 2.00E-04 5.00E-05 38 18 43.9 46.2 1001 620 Nucleosome binding factor SPN, Cobalamin biosynthesis protein C chr_26[320001-420000].378 3963 COG5099 5.00E-05 17 46.2 777 RNA-binding protein of the Puf fa chr_26[400001-441226].196 2949 COG5099 2.00E-04 17 43.9 777 RNA-binding protein of the Puf fa chr_24[160001-260000].65 3542 chr_5[720001-820000].339 3141 COG5099 4.00E-09 20 59.3 777 RNA-binding protein of the Puf fa chr_9[160001-260000].243 3002 COG5077 1.00E-25 26 114 1089 Ubiquitin carboxyl-terminal hydr chr_12[720001-820000].86 2895 COG5032 2.00E-09 30 60.5 2105 Phosphatidylinositol kinase and p chr_12[800001-900000].109 1463 COG5032 1.00E-09 30 60.1 2105 Phosphatidylinositol kinase and p chr_11[1-100000].70 2886 chr_11[80001-180000].100 1523 COGAnnotation_coastal_sample.txt id query hit e_value identity_ score query_start query_end hit_start hit_end hit_length 1 FHJ7DRN01A0TND.1 COG0414 1.00E-08 28 51 1 74 180 257 285 2 FHJ7DRN01A1AD2.2 COG0092 3.00E-20 47 89.9 6 85 41 120 233 3 FHJ7DRN01A2HWZ.4 COG3889 0.0006 26 35.8 9 94 758 845 872 … 2853 FHJ7DRN02HXTBY.5 COG5077 7.00E-09 37 52.3 3 77 313 388 1089 2854 FHJ7DRN02HZO4J.2 COG0444 2.00E-31 67 127 1 73 135 207 316 … 3566 FHJ7DRN02FUJW3.1 COG5032 1.00E-09 32 54.7 1 75 1965 2038 2105 … SELECT * FROM Phaeo_genome p, coastal_sample c WHERE p.COG_hit = c.hit 10/3/2012 Bill Howe, UW 13
  • 14. Data management becoming the bottleneck • Data is captured and manipulated in spreadsheets • This approach made sense five years ago; the data volumes were manageable • But now: – 50k rows in each of 50-100 spreadsheets, per scientist – Significantly more sharing required: ―mega-collabs‖ – Continuously producing new results – The management overhead is beginning to dominate – Difficult to share data publicly (emailing files is a mess) – Lots of work to get an answers to any new question 10/3/2012 14
  • 15. Why not build a database? • Assumes an pre-defined schema – more on this later • Requires specialized technical skills and a huge amount of up-front effort • Not a perfect fit – it‘s hard to design a ―permanent‖ database for a fast-moving research target • Researchers have little interest in operating and maintaining a data system – they just want to organize, manipulate, and share data 10/3/2012 15
  • 16. Database-as-a-Service for the 99% Approach: Strip down databases to bare essentials – Upload -> Query -> Share – Try to eliminate installation, configuration, schema design, data loading, tuning, easy, thanks to the cloud app-building harder 10/3/2012 Bill Howe, UW 16
  • 17. Getting Data In • Upload data through a browser to a cloud database (so no need to manage a local system) • No need to ―design a database‖ before you use your data – just upload and get to work • Write SQL queries, with some automated help and some guided help from experts • Build on your own results – Output of one query can be the input to another – Encourages sharing and reuse – Avoids everyone on the team running the same data processing steps over and over – Provides provenance – ―how did you get this result?‖ 10/3/2012 Bill Howe, UW 17
  • 18. Select from a list of English descriptions 10/3/2012 18
  • 19. Edit a Query Save the results, share them with others 10/3/2012 19
  • 25. Spreadsheet Excel Python R Crawler Addin Client Client ―Flagship‖ VizDeck SQLShare App ASP.NET (Python) on EC2 OAuth2 SQLShare REST API WCF
  • 26. Deeply nested hierarchies of views Provenance Controlled Sharing 10/3/2012 Bill Howe, UW 26
  • 27. Robin Kodner 2010 Pilot- Outreach and Education- based sampling: Schooner Adventuress 9 13 11 14 12 15 16 4 510 8 7 10/3/2012 Bill Howe, UW 6 27
  • 28. NatureMapping Program Karen Dvornich Wildlife Observations (1902- ) Data collection and submission options: 1. Download/upload spreadsheet 2. Online data entry 3. NatureTracker on handheld/GPS 4. Android ODK (Open Data Kit) Water Quality Monitoring Sites (2003 - ) 10/3/2012 Bill Howe, UW 28
  • 29. Andrew White, UW Chemistry “An undergraduate student and I are working with gigabytes of tabular data derived from analysis of protein surfaces. Previously, we were using huge directory trees and plain text files. Now we can accomplish a 10 minute 100 line script in 1 line of SQL.” -- Andrew D White 10/3/2012 Bill Howe, UW 29
  • 30. WHY SQL? 10/3/2012 Bill Howe, UW 30
  • 31. A ―Needs Hierarchy‖ of Science Data Management “As each need is satisfied, the next higher level in the hierarchy dominates conscious functioning.” -- Maslow 43 full semantic integration analytics query sharing storage 10/3/2012 Bill Howe, UW 31
  • 32. NoSchema (not NoSQL) • A schema* is a shared consensus about some universe of discourse • At the frontier of research, this shared consensus does not exist, by definition • Any schema that does emerge will change frequently, by definition • Data found ―in the wild‖ will typically not conform to any schema, by definition • But this doesn‘t mean we have to punt on databases and go back to ad hoc scripts and files * ontology/metadata standard/controlled vocabulary/etc. 10/3/2012 Bill Howe, UW 32
  • 33. A ―Needs Hierarchy‖ of Science Data Management Silos full semantic integration analytics query sharing Cloud storage 10/3/2012 Bill Howe, UW 33
  • 34. What‘s the point? • Conventional wisdom says ―Science data isn‘t relational‖ – This is utter nonsense • Conventional wisdom says ―Scientists won‘t write SQL‖ – This is utter nonsense • So why aren‘t databases being used more often? – They‘re a PITA • We implicate difficulty in – installation, configuration – schema design, data loading – performance tuning – app-building (NoGUI?) We ask instead, “What kind of platform can support ad hoc scientific Q&A with SQL?” 5/18/10 Garret Cole, eScience Institute
  • 35. Biologists are beginning to write very complex queries (rather than relying on staff programmers) Example: Computing the overlaps of two sets of blast results SELECT x.strain, x.chr, x.region as snp_region, x.start_bp as snp_start_bp , x.end_bp as snp_end_bp, w.start_bp as nc_start_bp, w.end_bp as nc_end_bp , w.category as nc_category , CASE WHEN (x.start_bp >= w.start_bp AND x.end_bp <= w.end_bp) THEN x.end_bp - x.start_bp + 1 WHEN (x.start_bp <= w.start_bp AND w.start_bp <= x.end_bp) THEN x.end_bp - w.start_bp + 1 We see thousands of WHEN (x.start_bp <= w.end_bp AND w.end_bp <= x.end_bp) queries written by THEN w.end_bp - x.start_bp + 1 END AS len_overlap non-programmers FROM [koesterj@washington.edu].[hotspots_deserts.tab] x INNER JOIN [koesterj@washington.edu].[table_noncoding_positions.tab] w ON x.chr = w.chr WHERE (x.start_bp >= w.start_bp AND x.end_bp <= w.end_bp) OR (x.start_bp <= w.start_bp AND w.start_bp <= x.end_bp) OR (x.start_bp <= w.end_bp AND w.end_bp <= x.end_bp) ORDER BY x.strain, x.chr ASC, x.start_bp ASC
  • 36. So why SQL? • Covers 80% of what we need – Ex: Sloan Digital Sky Survey – Ex: Hybrid Hash Join algorithm published in BMC bioinformatics • Empower a new class of data-savvy scientist who isn‘t forced to be trained as an IT professional • Algebraic optimization – Ask me about this if you‘re interested • Views: Logical and physical data Independence – Reason about the problem independently of the data representation – No re-execution of ―workflows‖ – No file format incompatibilities – No version mismatches – Data and code tightly coupled and (logically) centralize 10/3/2012 Bill Howe, UW 36
  • 37. SQLShare as a CS Research Platform SSDBM 2011 • Automatic ―Starter‖ Queries – (Bill Howe, Garret Cole, Nodira Khoussainova, Leilani Battle) SIGMOD 2011 (demo) • VizDeck: Automatic Mashups and Visualization SSDBM 2011 – (Bill Howe, Alicia Key, Daniel Perry, Cecilia Aragon) CHI 2012 • Info Extraction from Spreadsheets SIGMOD 2012 (demo) – (Mike Cafarella, Dave Maier, Bill Howe, Sagar Chitnis, Abdu Alwani) • Scalable Analytics-as-a-Service – (Dan Suciu, Magda Balazinska, Bill Howe) • Optimizing Iterative Queries for Machine Learning VLDB 2010 – (Dan Suciu, Magda Balazinska, Bill Howe) Datalog2.0 2012 CIDR 2013 10/3/2012 Bill Howe, UW 37
  • 38. Where we‘re headed: • Local or cloud-hosted deployments done! • Multi-institution sharing • Global users and permissions • Distributed data and distributed query We are looking for partners! 10/3/2012 Bill Howe, UW 38
  • 39. http://sqlshare.escience.washington.edu billhowe@cs.washington.edu http://escience.washington.edu 10/3/2012 Bill Howe, UW 39
  • 40. 10/3/2012 Bill Howe, UW 40
  • 41. Scientific data management reduces to sharing views • Integrate data from multiple sources? – joins and unions with views • Standardize on units, apply naming conventions? – rename columns, apply functions with views • Attach metadata? – add new tables with descriptive names, add new columns with views • Data cleaning, quality control? – hide bad values with views • Maintain provenance? – inspect view dependencies • Propagate updates? – view maintenance • Protect sensitive data? – expose subsets with views (assuming views carry permissions) 10/3/2012 Bill Howe, UW 41
  • 42. An observation about ―handling data‖ • How many plasmids were bombarded in July and have a rescue and expression? SELECT count(*) FROM [bombardment_log] WHERE bomb_date BETWEEN ‘7/1/2010' AND ‘7/31/2010' AND rescue clone IS NOT NULL AND [expression?] = 'yes' 5/18/10 Garret Cole, eScience Institute
  • 43. An observation about ―handling data‖ • Which samples have not been cloned? SELECT * FROM plasmiddb WHERE NOT (ISDATE(cloned) OR cloned = ‗yes‘) 5/18/10 Garret Cole, eScience Institute
  • 44. An observation about ―handling data‖ • How often does each RNA hit appear inside the annotated surface group? SELECT hit, COUNT(*) as cnt FROM tigrfamannotation_surface GROUP BY hit ORDER BY cnt DESC 5/18/10 Garret Cole, eScience Institute
  • 45. An observation about ―handling data‖ For a given promoter (or protein fusion), how many expressing line have been generated (they would all have different strain designations) SELECT strain, count(distinct line) FROM glycerol_stocks GROUP BY strain 5/18/10 Garret Cole, eScience Institute
  • 46. Find all TIGRFam ids (proteins) that are missing from at least one of three samples (relations) SELECT col0 FROM [refseq_hma_fasta_TGIRfam_refs] UNION SELECT col0 FROM [est_hma_fasta_TGIRfam_refs] UNION SELECT col0 FROM [combo_hma_fasta_TGIRfam_refs] EXCEPT SELECT col0 FROM [refseq_hma_fasta_TGIRfam_refs] INTERSECT SELECT col0 FROM [est_hma_fasta_TGIRfam_refs] INTERSECT SELECT col0 FROM [combo_hma_fasta_TGIRfam_refs] 10/3/2012 Bill Howe, UW 46
  • 47. On NoSQL 10/3/2012 Bill Howe, UW 47
  • 48. An Observation on NoSQL • 2004 Dean et al. MapReduce • 2008 Hadoop 0.17 release • 2008 Olston et al. Pig: Relational Algebra on Hadoop • 2008 DryadLINQ: Relational Algebra in a Hadoop-like system • 2009 Thusoo et al. HIVE: SQL on Hadoop NoSQL is a misnomer – NoMySQL? – NoSchema? – NoLoading? – NoLicenseFees! 10/3/2012 Bill Howe, UW 48
  • 49. Digression: Relational Database History Pre-Relational: if your data changed, your application broke. Early RDBMS were buggy and slow (and often reviled), but required only 5% of the application code. “Activities of users at terminals and most application programs should remain unaffected when the internal representation of data is changed and even when some aspects of the external representation are changed.” -- Codd 1979 Key Ideas: Programs that manipulate tabular data exhibit an algebraic structure allowing reasoning and manipulation independently of physical data representation
  • 50. Key Idea: An Algebra of Tables select project join join Other operators: aggregate, union, difference, cross product 10/3/2012 Bill Howe, UW 50
  • 51. Algebraic Optimization N = ((4*2)+((4*3)+0))/1 Algebraic Laws: 1. (+) identity: x+0 = x 2. (/) identity: x/1 = x 3. (*) distributes: (n*x+n*y) = n*(x+y) 4. (*) commutes: x*y = y*x Apply rules 1, 3, 4, 2: N = (2+3)*4 two operations instead of five, no division operator Same idea works with very large tables / graphs, but the payoff is much higher 51
  • 52. Architecture Excel Python R Addin Client Client Spreadsheet ―Flagship‖ VizDeck Ingest SQLShare App (Python) on EC2 ASP.NET SQLShare REST API SQLShare REST API Windows Azure IIS SQL Azure SQL Server
  • 53. Science is becoming a database query problem Old model: “Query the world” (Data acquisition coupled to a specific hypothesis) New model: “Download the world” (Data acquisition supports many hypotheses) – Astronomy: High-resolution, high-frequency sky surveys (SDSS, LSST, PanSTARRS) – Biology: lab automation, high-throughput sequencing, – Oceanography: high-resolution models, cheap sensors, satellites 40TB / 2 nights 1 device ~1TB / day 100s of devices 10/3/2012 Bill Howe, eScience Institute 53
  • 54. Problem – Research data is captured and manipulated in spreadsheets – This perhaps made sense five years ago; the data volumes were manageable – But now: 50k rows, 100s of files, ―mega-collabs‖ Why not put everything into a database? – A huge amount of up-front effort – Hard to design for a moving target – Running a database system is huge drain Approach: SQLShare – Upload data through your browser: no setup, no install – Login to browse science questions in English – Click a question, see the SQL to answer it question – Edit the SQL to answer an "adjacent" question, even if you wouldn‘t know how to write it from scratch https://sqlshare.escience.washington.edu/ 10/3/2012 54
  • 55. Astronomy LSST Oceanography PanSTARRS SRA MG-RAST OOI GenBank # of bytes PDB SDSS Biology IOOS UniProt Pfam NDBC LANL Galaxy Pathway HIV Commons BioMart GEO SQLShare # of sources 10/3/2012 Bill Howe, UW 55
  • 56. Parallel Iterative Analytics Large scale GridFields ―Long tail‖ Scale, training SQLShare: Query-as-a-Service VizDeck citizen science; public outreach astrophysics life sciences earth sciences commercial
  • 57. Parallel Iterative Analytics Large scale GridFields ―Long tail‖ Scale, training SQLShare: Query-as-a-Service VizDeck citizen science; public outreach astrophysics life sciences earth sciences commercial
  • 58. A ―Needs Hierarchy‖ of Science Data Management analytics query full integration sharing storage 10/3/2012 Bill Howe, UW 58
  • 59. A ―Needs Hierarchy‖ of Science Data Management analytics query full integration sharing storage 10/3/2012 Bill Howe, UW 59
  • 60. Four Conjectures about Declarative Query for Science • Most science data manipulation tasks can be expressed in relational algebra • Most science analytics task can be expressed in relational algebra + recursion Hellerstein 09, Re 12 • These expressions can be efficiently and scalably executed in the cloud • Researchers are willing and able to program using relational algebra languages c.f. SDSS 10/3/2012 Bill Howe, UW 60
  • 61. 5/18/10 Garret Cole, eScience Institute
  • 62. 5/18/10 Garret Cole, eScience Institute
  • 63. 5/18/10 Garret Cole, eScience Institute
  • 64. metadata sequence data search results
  • 65. SQL 5/18/10 Garret Cole, eScience Institute
  • 66. The Long Tail LSST (~100PB) “The future is already here; it’s just CERN not very evenly distributed.” (~15PB/year) -- William Gibson data volume PanSTARRS (~40PB) SDSS (~100TB) Ocean CARMEN Modelers Seis- Microbiologists <Spreadsheet (~50TB) mologists users> rank 10/3/2012 Bill Howe, eScience Institute 66
  • 67. Experimental Engagement Algorithm for the Long Tail A stripped-down version of Jim Gray’s “20 questions” methodology 1. Get the data 2. Load the data ―as is‖ – no schema design 3. Get ~20 questions (in English) 4. Translate the questions into SQL (when possible) 5. Provide these ―starter queries‖ to the researchers Q: Can researchers questions be expressed in SQL? Q: Are a few examples sufficient for novices to self-train with SQL? Q: Can we scale this process up? Q: If so, will the use of SQL reduce their data handling overhead?
  • 68. • Which samples have not been cloned? SELECT * FROM plasmiddb WHERE NOT (ISDATE(cloned) OR cloned = ‗yes‘) 10/3/2012 Bill Howe, UW 68
  • 69. Details • Backend is Microsoft‘s SQL Azure • About a year old, essentially zero advertising • 50+ active users (UW and external) • 2500+ tables/views (~20% are public • largest table: 1.1M rows, smallest table: 1 row • transitioning to support by UW-IT 10/3/2012 Bill Howe, UW 69
  • 70. View Semantics and Features • ―Saved query‖ = View with attached metadata • Unify views and tables as ―datasets‖ – table = “select * from [raw_table]” • Replacement semantics for name conflicts – old versions materialized and archived • Materialize downstream views – when dependencies deleted – when dependencies become incompatible • Permissions – public, private, ACLs (may work on groups in the future) • Sharing, social querying, CQMS* – search, recent queries, friends’ queries, favorites, ratings – facilitate sharing and recommendations of not just whole queries, but common predicates, join patterns, etc. 10/3/2012 Bill Howe, UW * [Khoussainova, CIDR 2009] 70
  • 71. Karen Dvornich 10/3/2012 Bill Howe, UW 71
  • 72. 10/3/2012 Bill Howe, UW 72

Notes de l'éditeur

  1. D
  2. We want to give a little background of our project before we launch into it, so we will discuss the problem we are trying to solve.Essentially, we want to remove the speed-bump of data handling from the scientists.
  3. List of English descriptions
  4. Data files can be uploaded as is, and we make an attempt to automatically infer the appropriate schema based on a type analysis of the values and a variety of heuristics.
  5. goal is to be inclusive rather than exclusive: accept everything, then clean it up with queries and views.so we can tolerate irregular number of columns, mixed-type columns, missing column names.There is a lot more to do here. Kathleen Fisher’s work on LearnPADS provides a great mechanism for automatically inferring the structure of ad hoc data formats – Very promising for us.Jeff Heer’s and Joe Hellerstein’s work on Data Wrangler is relevant, as is Google Refine.
  6. Every uploaded table is associated with a trivial view. You can modify this view to do some basic cleanup, and in fact that is typically the first step we are seeing people perform.
  7. Lots of features you can imagine here – anything you can do with a youtube video, you should be able to do with a query: share it, rate it, “more like this”, recommendations, We are exploring some of these.
  8. I’ll give you a couple of examples of use cases we are seeing.Robin Kodner at Friday Harbor Labs in the San Juan islands works on a program called Sound Experience that gives K12 – undergrads experience sailing, but reuses this ship time to do some real science and collect samples from areas that wouldotherwise be very expensive.
  9. But the fundamental error made by computer scientists, and it’s probably the fault of the database community, is to assume that semantic integration is a prerequisite for query and analytics.It isn’t. It’s the final goal, not some insignificant preamble to analysis.Domain scientists know this – they take a very pragmatic approach. They write code to do data handling, they write code to do analytics, and they do data integration on the fly in a task-specific way.So one of my goals is to convince you of is that you can decouple declarative query from semantic integration, and doing so gives scientists a very powerful tool.
  10. So how do we do this? Back to the needs hierarchy.It’s already happening.To a first approximation, I consider storage and sharing solved thanks to cloud computing. There are plenty of issues to resolve, especially in sustainability and uptake, but these systems exist. And they’re getting cheaper.But semantic integration and analytics are still largely application-specific, ad hoc, siloed manner.Query services for science are underexplored, with a few mind-blowingly successful examples such as the Sloan Digital Sky Survey.I think many of you are familiar with SDSS, but briefly: It is not an understatement to say that the liberal application of database technology has transformed the field of Astronomy. Undergrads and grads use SQL alongside IDL and Fortran. 5000 papers were written about the SDSS data, and only about 100 of those were co-authored by the Pis. The rest were written by astronomers exercising the public query interfaces.So the roadmap here is to FIRST expose data through query services, THEN use these services to generalize, scale, and democratize analytics and semantic integration.
  11. To begin, we ask, what kind of questions would you ask your data once you have it ready to be worked on?Just about EVERY question that we have heard a scientist would ask, we have found an equivalent SQL statement counterpart.If we could just turn their questions in SQL our job would be done, but there are many other problems to solve before that becomes a reality. For example, their data may not reside in a relational database.This brings us to part of our next problem: how can we bring the power of SQL to the scientists to solve their questions without the overhead of everything that a database administrator would need to do.
  12. To begin, we ask, what kind of questions would you ask your data once you have it ready to be worked on?Just about EVERY question that we have heard a scientist would ask, we have found an equivalent SQL statement counterpart.If we could just turn their questions in SQL our job would be done, but there are many other problems to solve before that becomes a reality. For example, their data may not reside in a relational database.This brings us to part of our next problem: how can we bring the power of SQL to the scientists to solve their questions without the overhead of everything that a database administrator would need to do.
  13. To begin, we ask, what kind of questions would you ask your data once you have it ready to be worked on?Just about EVERY question that we have heard a scientist would ask, we have found an equivalent SQL statement counterpart.If we could just turn their questions in SQL our job would be done, but there are many other problems to solve before that becomes a reality. For example, their data may not reside in a relational database.This brings us to part of our next problem: how can we bring the power of SQL to the scientists to solve their questions without the overhead of everything that a database administrator would need to do.
  14. To begin, we ask, what kind of questions would you ask your data once you have it ready to be worked on?Just about EVERY question that we have heard a scientist would ask, we have found an equivalent SQL statement counterpart.If we could just turn their questions in SQL our job would be done, but there are many other problems to solve before that becomes a reality. For example, their data may not reside in a relational database.This brings us to part of our next problem: how can we bring the power of SQL to the scientists to solve their questions without the overhead of everything that a database administrator would need to do.
  15. Why has MapReduce been so successful?One reason: it turned “mere mortal” java programmers into distributed systems programmersMapReduce raised the level of abstraction for big dataBut not high enough, evidentlyTwo of the earliest and most successful projects in the Hadoop ecosystem were Pig and HIVE: Declarative query languages on top of HadoopNoSQL is a misnomerNoSchema? NoLoading? NoLicenseFees? NoOverpricedDBA?NoMySQLTaught database people to think about fault tolerance differently and to not ignore dirty data.
  16. But the fundamental error made by computer scientists, and it’s probably the fault of the database community, is to assume that semantic integration is a prerequisite for query and analytics.It isn’t. It’s the final goal, not some insignificant preamble to analysis.Domain scientists know this – they take a very pragmatic approach. They write code to do data handling, they write code to do analytics, and they do data integration on the fly in a task-specific way.So one of my goals is to convince you of is that you can decouple declarative query from semantic integration, and doing so gives scientists a very powerful tool.
  17. But the fundamental error made by computer scientists, and it’s probably the fault of the database community, is to assume that semantic integration is a prerequisite for query and analytics.It isn’t. It’s the final goal, not some insignificant preamble to analysis.Domain scientists know this – they take a very pragmatic approach. They write code to do data handling, they write code to do analytics, and they do data integration on the fly in a task-specific way.So one of my goals is to convince you of is that you can decouple declarative query from semantic integration, and doing so gives scientists a very powerful tool.
  18. The DNA material in the samples taken from the water are then sequenced in a machine to produce millions of short strings
  19. These DNA reads can then be cross-referenced in public databases to determine what organisms were present in the water, and what genes were being expressed
  20. Each step generates a bunch of “residual” data, usually in the form of spreadsheets or text files.This process is repeated many times, leading to 100s of spreadsheetsAt this point, the actual science questions are answered using these spreadsheets by computing “manual joins”, creating plots, searching and filtering, copying and pasting, etc. It’s a mess -- when asked how much time is spent “handling data” as opposed to “doing science”!We’ve heard that 90% of their work is manipulating the data before they can actually answer a question!