A case for teaching SQL to scientists
Daniel Halperin
#w2tbac @SESYNC 2013-07-09
SQL: think like data
• SQL is a Language for expressing Queries over Structured data.
• vs Python/R, SQL is
  • strictly less powerful
  • better for concisely, clearly, and efficiently expressing data manipulation
• ... and anecdotally, “many” scripts written by scientists just manipulate data
Claim 1: SQL is Concise & Clear
• English questions often translate directly into SQL
• Scripting languages have a lot of language overhead -- boilerplate for opening files, looping, and parsing
• Let’s see some (admittedly biased) examples
with open('file.txt') as input_file:
    cnt = 0
    for line in input_file:
        cnt += 1
print(cnt)
What does this code do?
SELECT COUNT(*) AS cnt
FROM file
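A minimal sketch of running the query above from Python, assuming the lines of file.txt have already been loaded into a table named file. The sample rows and the single column name (line) are made up for illustration; SQLite is used only because it ships with Python.

```python
import sqlite3

# Hypothetical sample rows standing in for the lines of file.txt.
rows = [("first line",), ("second line",), ("third line",)]

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE file (line TEXT)")
conn.executemany("INSERT INTO file VALUES (?)", rows)

# The query from the slide, unchanged.
(cnt,) = conn.execute("SELECT COUNT(*) AS cnt FROM file").fetchone()
print(cnt)
```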
with open('file.txt') as input_file:
    for line in input_file:
        if int(line.split()[3]) > 5:
            print(line, end='')
What does this code do?
SELECT *
FROM file
WHERE value > 5
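The SQL version presumes the text file has been imported as a table. A rough sketch of that import step plus the filter, assuming whitespace-separated fields where the fourth field is value and the fifth is counts (as in the Python snippets). The sample lines and the other column names (id, f1, f2) are hypothetical.

```python
import sqlite3

# Hypothetical contents of file.txt: five whitespace-separated fields per line.
SAMPLE_LINES = [
    "s1 x y 3 10",
    "s2 x y 7 20",
    "s3 x y 9 30",
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE file (id TEXT, f1 TEXT, f2 TEXT, value INTEGER, counts INTEGER)"
)
# Split each line into fields; the INTEGER columns convert numeric text on insert.
conn.executemany(
    "INSERT INTO file VALUES (?, ?, ?, ?, ?)",
    (line.split() for line in SAMPLE_LINES),
)

# The query from the slide, unchanged.
matching = conn.execute("SELECT * FROM file WHERE value > 5").fetchall()
for row in matching:
    print(row)
```

Once the table exists, every later question about the data is a one-line query rather than another parsing loop.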
What does this code do?
SELECT value, SUM(counts) AS tot_count
FROM file
GROUP BY value
from collections import defaultdict

with open('file.txt') as input_file:
    tot_counts = defaultdict(int)
    for line in input_file:
        fields = line.split()
        tot_counts[fields[3]] += int(fields[4])
for value in tot_counts:
    print(value, tot_counts[value])
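To check that the GROUP BY query and the dictionary loop compute the same thing, here is a small sketch that runs both on the same made-up rows. The sample data and the two-column table layout are assumptions for illustration only.

```python
import sqlite3
from collections import defaultdict

# Hypothetical (value, counts) pairs.
rows = [("a", 1), ("b", 2), ("a", 3)]

# Dictionary version, as in the Python snippet.
tot_counts = defaultdict(int)
for value, counts in rows:
    tot_counts[value] += counts

# SQL version, using the query from the slide.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE file (value TEXT, counts INTEGER)")
conn.executemany("INSERT INTO file VALUES (?, ?)", rows)
totals = dict(
    conn.execute("SELECT value, SUM(counts) AS tot_count FROM file GROUP BY value")
)

print(totals == dict(tot_counts))
```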
What does this code do?
SELECT census.county,
electoral.votes / census.population AS voting_rate
FROM electoral, census
WHERE electoral.county = census.county
<Complicated stuff with dictionaries>
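One possible shape for the dictionary version, sketched next to the SQL join so the two can be compared. The county names and numbers are invented. The 1.0 * factor is an addition not in the slide: in SQLite (and several other databases) dividing two integers yields an integer, so the rate would otherwise come out as 0.

```python
import sqlite3

# Hypothetical data: (county, votes) and (county, population).
electoral = [("Adams", 500), ("Baker", 150)]
census = [("Adams", 1000), ("Baker", 600), ("Clark", 50)]

# Dictionary version: index one table by county, then look up the other.
population = dict(census)
rates_py = {
    county: votes / population[county]
    for county, votes in electoral
    if county in population  # mimic the join: skip counties missing from census
}

# SQL version.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE electoral (county TEXT, votes INTEGER)")
conn.execute("CREATE TABLE census (county TEXT, population INTEGER)")
conn.executemany("INSERT INTO electoral VALUES (?, ?)", electoral)
conn.executemany("INSERT INTO census VALUES (?, ?)", census)
rates_sql = dict(
    conn.execute(
        """
        SELECT census.county,
               1.0 * electoral.votes / census.population AS voting_rate
        FROM electoral, census
        WHERE electoral.county = census.county
        """
    )
)

print(rates_sql == rates_py)
```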
Claim 2: SQL is Efficient
Scaling up your data
• What happens when Python/R data doesn’t fit in memory? Either the script crashes, or you rewrite it as much more complicated code
• Most databases automatically, transparently spill to disk, and are heavily optimized for performance
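A small sketch of the idea using a file-backed SQLite database: rows are streamed in from a generator and the aggregate is computed by the database, so the Python process never builds a list of all rows. The row generator, table layout, and file name are hypothetical, and how a given database manages memory and disk is engine-specific.

```python
import os
import sqlite3
import tempfile


def generate_rows(n):
    """Yield (value, counts) pairs one at a time; no list of rows is built."""
    for i in range(n):
        yield (i % 10, 1)


# File-backed database: the table lives on disk rather than in a Python object.
db_path = os.path.join(tempfile.mkdtemp(), "example.db")
conn = sqlite3.connect(db_path)
conn.execute("CREATE TABLE file (value INTEGER, counts INTEGER)")
conn.executemany("INSERT INTO file VALUES (?, ?)", generate_rows(100_000))
conn.commit()

# The aggregation runs inside the database; only one number comes back.
(total,) = conn.execute("SELECT SUM(counts) FROM file").fetchone()
print(total)
conn.close()
```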
Say you inherit a really well-engineered Python script
./highly_optimized_code.py < TB.dataset > GB.result
But you are only interested in a small fraction of the result:
./simple_data_filter.py < GB.result > MB.answer
Your options:
1) Dive into the complex code and modify its internals to filter inside
2) Suffer the long running time of the first program
CREATE VIEW their_query AS
SELECT <... their code ...>
FROM terabyte_dataset
Gives their query a name, but doesn’t execute it!
SELECT *
FROM their_query
WHERE <... your filter ...>
Combine both queries and optimize together!
Fast!
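A runnable sketch of the view pattern in SQLite. The table contents, the column names (sample, value), and the body of "their" query are stand-ins, since the slides elide both the query and the filter. Whether a filter is actually pushed down into a view depends on the database engine and on the query; the plan printed at the end is only a way to inspect what SQLite chose.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE terabyte_dataset (sample TEXT, value INTEGER)")
conn.executemany(
    "INSERT INTO terabyte_dataset VALUES (?, ?)",
    [("a", 1), ("a", 2), ("b", 10), ("c", 20)],  # hypothetical rows
)

# "Their" query (hypothetical): a per-sample total. Creating the view only
# stores the definition; no aggregation runs at this point.
conn.execute(
    """
    CREATE VIEW their_query AS
    SELECT sample, SUM(value) AS total
    FROM terabyte_dataset
    GROUP BY sample
    """
)

# "Your" filter (hypothetical), written against the view as if it were a table.
filtered_sql = "SELECT * FROM their_query WHERE sample = 'b'"
answer = conn.execute(filtered_sql).fetchall()
print(answer)

# Ask SQLite how it plans to run the combined query.
plan = conn.execute("EXPLAIN QUERY PLAN " + filtered_sql).fetchall()
for step in plan:
    print(step)
```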
SQL for Science
• UW’s SQLShare - open, view-oriented, web database service
• Easy data import, public & private sharing, permalinks (DOI support coming)
• Use a series of views instead of scripts for:
  • data cleaning, transformation, integration
  • simple stats, analytics, format conversion
  • provenance and publishing
  • mashups: integrated with R, Sage, etc.
escience.washington.edu/sqlshare
“An undergraduate student and I are working with gigabytes of tabular
data derived from analysis of protein surfaces. Previously, we were using
huge directory trees and plain text files. Now we can accomplish a
10 minute 100 line script in 1 line of SQL.”
- Andrew D White, grad student in UW Chem Eng
“I have had two students who are struggling with R come up and tell me
how much more they like working in SQLShare.”
- Robin Kodner, as asst professor at Western Washington U
"That [SQL query that finished in 1 second] took
me a week [manually in Excel]!"
- Robin Kodner, as postdoc at UW Oceanography
* yes, we need (and are interested in) more than anecdotes!!
SQL can do more than you think (here vs R)
