Data Science Provenance:
From Drug Discovery
to Fake Fans
Dr Jameel Syed
@tilapia
Overview
 Knowledge work adds value to raw data
 How that work is done determines whether results can be reliably reproduced and scrutinized
 Solving parts of the problem
- Inforsense (life sciences workflow analytics platform)
- Musicmetric (social media analytics for music)
 What provenance is & why it's important
 Representations of provenance
 Considerations to allow analysis computation to be recreated
 Reliable collection of noisy data from the Internet
 Archiving of data and accommodating retrospective changes
 Using linked data to direct Big Data analytics
What is Data (Science) Provenance?
 Scientific research is generally held to be of good provenance when it is documented in
detail sufficient to allow reproducibility. Scientific workflows assist scientists and
programmers with tracking their data through all transformations, analyses, and
interpretations. Data sets are reliable when the process used to create them is
reproducible and analyzable for defects. Current initiatives to effectively manage,
share, and reuse ecological data are indicative of the increasing importance of data
provenance.
 Reproducibility of data & research process
- Explanation - Why were the end conclusions reached?
- Debugging and verification – Sharing, auditing
- Re-application
The Economist, October 19th 2013
 Last year researchers at one biotech firm,
Amgen, found they could reproduce just six of 53
“landmark” studies in cancer research. Earlier, a
group at Bayer, a drug company, managed to
repeat just a quarter of 67 similarly important
papers.
 Ideally, research protocols should be registered
in advance and monitored in virtual notebooks.
This would curb the temptation to fiddle with the
experiment’s design midstream so as to make
the results look more substantial than they
are. ... Where possible, trial data also should be
open for other researchers to inspect and test.
 http://econ.st/H3qU5a
 Nature, Vol 500, 1st August 2013; http://go.nature.com/zqtrnp
Reinhart and Rogoff's spreadsheet error
 "Growth in a Time of Debt" paper shaping decisions affecting
national economies
 BBC; 20 April 2013 http://www.bbc.co.uk/news/magazine-22223190
- After some correspondence, Reinhart and Rogoff provided
Thomas with the actual working spreadsheet they'd used to obtain
their results. "Everyone says seeing is believing, but I almost
didn't believe my eyes," he says.
- The Harvard professors had accidentally only included 15 of the
20 countries under analysis in their key calculation (of average
GDP growth in countries with high public debt). Australia, Austria,
Belgium, Canada and Denmark were missing.
- Businessweek FAQ http://buswk.co/YZgwSA
 "Spreadsheets: The Ununderstood Dark Matter Of IT"
- Y2K bug was not just COBOL!
Open Data Science
 Open Source Software is the foundation
 Open Access to data and methodology - errors happen, but are they found?
 Many efforts...
- Open Access publication (PubMed, arXiv.org)
- Mozilla Science Lab @MozillaScience
- Open Knowledge Foundation http://okfn.org
- Open Data Institute http://theodi.org/
 Licensing
- Panton Principles
- Creative Commons licensed data
- Non-commercial API access
Inforsense
 Workflow analytics platform for Life Sciences
- “in silico” research / e-Science
- Process representation and re-use
- Which data sets were used, where are they from, how were they computed?
 Spin out from research at Imperial College London
- Discovery Net e-Science project
 Used by pharmaceutical and biotech companies
“Big Data”
 Gene chips (DNA microarrays) – rather than a PhD on a few genes, tens of thousands at a
time (& culmination of Human Genome Project)
 High-throughput screening (HTS) – drug discovery; thousands of automated
experiments per day
 What to do with the data?
- Paper published
- Data set sometimes published
- Reproduce and expand methodology manually

Image: Affymetrix microarray, http://upload.wikimedia.org/wikipedia/commons/thumb/2/22/Affymetrix-microarray.jpg/150px-Affymetrix-microarray.jpg
Representations
 How to represent or codify ideas? (beyond writing a traditional paper)
 Statistician - Business Intelligence Analyst - Data Scientist - Software engineer
- Some coding?
- How much?
- Scientists have been using Fortran for decades (& S+, R, Matlab...)
- GRAIL (1969, RAND corporation) flow charts and light pens
- http://www.rand.org/content/dam/rand/pubs/research_memoranda/2006/RM6001.pdf

- Bioinformaticians (back in the day): Perl hackers, open source, sharing data
Declarative Workflows

 Academic paper & data set → encoded as workflow → computed results
 What should the set of operations be?
- Deterministic, no side effects
- Common functions between workflows
 Functional composition
Functional Programming

 "Functional programming combines
the flexibility and power of abstract
mathematics with the intuitive clarity
of abstract mathematics."
 http://xkcd.com/1270/
Declarative vs Imperative
 Maths proof scrutiny
- Axioms and deductive steps; describe assumptions
 Functional composition
- No side effects!
- The code documents itself!
 Combination - no silver bullet (in-memory speed, out-of-core scale)
- “e-Lab notebook” http://ipython.org/notebook.html
- Inline visualisations (see also Mathematica)
- Hadoop does the heavy lifting (ETL)
- Pig, Hive, Cascading (Scalding, Cascalog), Crunch/Scrunch, Java MR
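The "functional composition, no side effects" idea can be sketched in a few lines of Python. This is an illustration only, not how Inforsense or any of the Hadoop tools above represent workflows; the step names (`drop_missing`, `scale`, `mean`) and the `run_workflow` helper are hypothetical.

```python
from functools import reduce
from typing import Callable, Iterable, List, Tuple

Step = Tuple[str, Callable]  # (name, pure function)

def drop_missing(values: Iterable) -> List[float]:
    """Return a new list without None entries; the input is not modified."""
    return [v for v in values if v is not None]

def scale(factor: float) -> Callable[[List[float]], List[float]]:
    """Build a pure step that multiplies every value by `factor`."""
    return lambda values: [v * factor for v in values]

def mean(values: List[float]) -> float:
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)

def run_workflow(steps: List[Step], data):
    """Apply the steps left to right and record which steps produced the result."""
    result = reduce(lambda acc, step: step[1](acc), steps, data)
    provenance = [name for name, _ in steps]
    return result, provenance

# The workflow is plain data: it can be stored, shared and re-run.
WORKFLOW = [
    ("drop_missing", drop_missing),
    ("scale(x10)", scale(10.0)),
    ("mean", mean),
]

raw = [1.0, None, 2.0, 3.0]
result, provenance = run_workflow(WORKFLOW, raw)
print(result, provenance)
```

Because every step returns a new value instead of changing its input, the same workflow run on the same data gives the same result, and the list of step names travels with that result as a simple provenance record.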
Live vs static
 A static representation of knowledge does not allow for
discourse with the data and process
- In Phaedrus, Socrates says:
- "Writing shares a strange feature with painting. The offspring
of painting stand there as if they were alive, but if anyone
asks them anything they are solemnly silent"...
- "alone, it cannot defend itself or come to its own support"
 Writing programs or solving problems?
 Encapsulate and generalize specific instance of a process
- To run again
- To run on similar data (making a tool to solve problems)
 Russell Jurney – Agile Data Science book
Metadata of datasets
 What is this?
- 5.1,3.5,1.4,0.2,setosa
- 4.9,3.0,1.4,0.2,setosa
- 4.7,3.2,1.3,0.2,setosa
- 4.6,3.1,1.5,0.2,setosa
- 5.0,3.6,1.4,0.2,setosa
- 5.4,3.9,1.7,0.4,setosa
- 4.6,3.4,1.4,0.3,setosa
- 5.0,3.4,1.5,0.2,setosa
- 4.4,2.9,1.4,0.2,setosa
 Modified version of http://archive.ics.uci.edu/ml/datasets/Iris
 1. Title: Iris Plants Database
 Updated Sept 21 by C.Blake - Added discrepency information
 2. Sources:


(a) Creator: R.A. Fisher
(b) Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
(c) Date: July, 1988
 3. Past Usage:


- Publications: too many to mention!!! Here are a few.
1. Fisher,R.A. "The use of multiple measurements in taxonomic problems"
Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions
to Mathematical Statistics" (John Wiley, NY, 1950).
2. Duda,R.O., & Hart,P.E. (1973) Pattern Classification and Scene Analysis.
(Q327.D83) John Wiley & Sons. ISBN 0-471-22361-1. See page 218. ...
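The rows at the top of this slide only become interpretable once the documentation is attached to them. Below is a minimal sketch of keeping the two together. It assumes the column order given in the UCI documentation for this data set (sepal length, sepal width, petal length, petal width in cm, then class), which is not shown on the slide; the `DatasetMetadata` structure and its field names are hypothetical.

```python
import csv
import io
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass(frozen=True)
class DatasetMetadata:
    title: str
    source_url: str
    columns: Tuple[str, ...]

IRIS = DatasetMetadata(
    title="Iris Plants Database",
    source_url="http://archive.ics.uci.edu/ml/datasets/Iris",
    # Column order taken from the UCI documentation, not from the rows themselves.
    columns=("sepal_length_cm", "sepal_width_cm",
             "petal_length_cm", "petal_width_cm", "class"),
)

RAW = """5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
"""

def load(raw_text: str, meta: DatasetMetadata) -> List[Dict[str, object]]:
    """Parse raw CSV rows and label every value using the metadata."""
    records = []
    for row in csv.reader(io.StringIO(raw_text)):
        if not row:
            continue  # skip blank lines
        *measurements, label = row
        values = [float(x) for x in measurements] + [label]
        records.append(dict(zip(meta.columns, values)))
    return records

records = load(RAW, IRIS)
print(records[0])
```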
Process Methodology

 Mathematical method / Scientific method
- Understanding / Characterize from experience &
observation
- Analysis / Hypothesis: a proposed explanation
- Synthesis / Deduction: prediction from the hypothesis
- Review/Extend / Test and experiment
- http://en.wikipedia.org/wiki/Scientific_method#Relationship_with_mathematics

 CRISP-DM
- http://en.wikipedia.org/wiki/Cross_Industry_Standard_Process_for_Data_Mining

 OSEMN ('awesome') (Hilary Mason)
- Obtaining, Scrubbing, Exploring, Modeling, and
iNterpreting data
- http://www.dataists.com/2010/09/a-taxonomy-of-data-science/
Musicmetric (Semetric Ltd)
 Analytics for musical artists (and beyond)
- Collecting data from the Internet/APIs - Provenance of Data
- Linked entities
 Hadoop-based Big Data processing → NoSQL → RESTful API
- Nathan Marz/”Lambda Architecture”
- http://www.ymc.ch/en/lambda-architecture-part-1
 Used by record labels, artist managers, brand owners, festivals, publishers,
broadcasters
Lots of data about lots of entities
I read it on the Internet, it must be true?
 Collection and archiving of web data is not straightforward
 Dealing with noisy or incorrect data
- Issues with data from APIs
- Filter between raw data and data used in analysis (preprocessing/data cleaning)
- Data and metadata retrospectively changing
- Present processed data, with access to raw data
 Sampling frequency
- Collect hourly, present daily
- Interpolation to accommodate irregularities in update frequency
 Anomalies...
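One way to read "collect hourly, present daily" is to interpolate the irregular raw samples onto a regular daily grid. The sketch below assumes linear interpolation between neighbouring samples and timestamps expressed in hours since the start of collection. The deck does not say which interpolation Musicmetric used, and `value_at` and `daily_series` are hypothetical helper names.

```python
from bisect import bisect_left
from typing import List, Optional, Tuple

Sample = Tuple[float, float]  # (hours since start, fan count)

def value_at(samples: List[Sample], t: float) -> Optional[float]:
    """Linearly interpolate the series at time t; None outside the sampled range."""
    times = [s[0] for s in samples]
    if not samples or t < times[0] or t > times[-1]:
        return None
    i = bisect_left(times, t)
    if times[i] == t:
        return samples[i][1]
    (t0, v0), (t1, v1) = samples[i - 1], samples[i]
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)

def daily_series(samples: List[Sample], days: int) -> List[Optional[float]]:
    """Value at each midnight (t = 0, 24, 48, ...) for the given number of days."""
    ordered = sorted(samples)
    return [value_at(ordered, 24.0 * d) for d in range(days)]

# Irregular collection: the sample due at hour 24 arrived late, at hour 26.
raw = [(0.0, 1000.0), (20.0, 1100.0), (26.0, 1160.0), (48.0, 1400.0)]
print(daily_series(raw, 3))
```

The raw samples are left untouched, so the presented daily values can always be traced back to, and recomputed from, what was actually collected.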
Fake fans
 “Fake Fans” or “Fake Followers”
- Social media activity caused by artificially created and controlled social media user profiles → fraud
- “Buying fans” to get noticed
 Fan count goes up
- Collect more data, detect and remove anomalous data
- “daily diff” time series – how many fans did I gain today? (compared to yesterday)
 Fan count goes down
- Twitter et al try to fix the problem → Massive removal of fans → This is also a problem!
 Data Science for pre-processing
- Predict what is normal using all historical data (for artist, for data source)
- Death event detector :-/
 http://www.musicmetric.com/2013/04/fake-fans-and-anomaly-detection-at-musicmetric/
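The "daily diff" series and a simple anomaly check can be sketched as follows. The deck says only that what is normal is predicted from historical data; the robust rule used here (flag a day whose gain lies more than `threshold` median absolute deviations from the median of the preceding days) is a stand-in assumption, as are the function names and the fan totals.

```python
from statistics import median
from typing import List

def daily_diff(totals: List[int]) -> List[int]:
    """Fans gained (or lost) each day compared with the previous day."""
    return [today - yesterday for yesterday, today in zip(totals, totals[1:])]

def anomalous_days(diffs: List[int], history: int = 7, threshold: float = 5.0) -> List[int]:
    """Indices of diffs that deviate strongly from the preceding `history` days."""
    flagged = []
    for i in range(history, len(diffs)):
        window = diffs[i - history:i]
        centre = median(window)
        mad = median(abs(d - centre) for d in window)
        scale = max(mad, 1.0)  # avoid a zero scale on perfectly flat history
        if abs(diffs[i] - centre) / scale > threshold:
            flagged.append(i)
    return flagged

# Hypothetical fan totals: steady growth, a bought-fans spike, then a purge.
totals = [1000, 1010, 1022, 1031, 1043, 1052, 1063, 1074, 6074, 6085, 3085]
diffs = daily_diff(totals)
print(diffs)
print(anomalous_days(diffs))
```

Using the median rather than the mean keeps one spike in the history window from hiding the next anomaly, so both the sudden gain and the later mass removal are flagged.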
Versioning (raw) Data
 Git - https://github.com/blog/1601-see-your-csvs
 Dat - version control for data (git alternative)
- http://dat-data.com/
- @maxogden http://strataconf.com/strataeu2013/public/schedule/detail/32390
 Figshare
 S3 (e.g. Datasift Twitter firehose)
 Dropbox? (“consumerisation” of enterprise tools)
 TSV not CSV (consider bz2 rather than gzip; don't forget to md5sum)
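The last bullet can be illustrated with the standard library alone. This sketch serialises a snapshot as TSV, compresses it with bz2 and records an MD5 checksum of the compressed bytes so that a later reader can verify the archive. The rows and the `archive_snapshot` and `verify` helpers are hypothetical.

```python
import bz2
import csv
import hashlib
import io
from typing import Iterable, Sequence, Tuple

def archive_snapshot(rows: Iterable[Sequence[object]]) -> Tuple[bytes, str]:
    """Return (bz2-compressed TSV bytes, hex MD5 of those bytes)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)
    compressed = bz2.compress(buffer.getvalue().encode("utf-8"))
    return compressed, hashlib.md5(compressed).hexdigest()

def verify(compressed: bytes, expected_md5: str) -> bool:
    """Check an archived snapshot against its recorded checksum."""
    return hashlib.md5(compressed).hexdigest() == expected_md5

rows = [("artist", "date", "fans"), ("The Beatles", "2013-11-11", 1000)]
blob, checksum = archive_snapshot(rows)
print(checksum, verify(blob, checksum))
print(bz2.decompress(blob).decode("utf-8"))
```

The checksum here guards against accidental corruption or silent changes to an archived file; it is not a security measure.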
Tate collection on Github
Using linked data to direct Big Data analytics
 Linked data platform
- Profiles for The Beatles
- Puff Daddy/P Diddy, Prince/TAFKAP
- Macklemore & Ryan Lewis, Simon & Garfunkel
- Canonicalise URLs
- Temporal logic? IDs change; not good, but it happens (MusicBrainz NGS)
- RESTful API / UUIDs / external IDs
 Manual curation separated from data processing
- Resist all temptation for any manual manipulation of data!
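A minimal sketch of how aliases and URLs might be resolved to one canonical entity before any analytics run. The alias table stands in for the manually curated layer and is kept as data, separate from the processing code. The IDs are made-up placeholders, not real MusicBrainz identifiers, and the normalisation rules in `canonical_url` are assumptions.

```python
from typing import Dict, Optional
from urllib.parse import urlsplit, urlunsplit

# Curated by hand and stored as data; processing code only reads it.
ALIASES: Dict[str, str] = {
    "puff daddy": "artist-0001",
    "p diddy": "artist-0001",
    "prince": "artist-0002",
    "tafkap": "artist-0002",
}

def canonical_url(url: str) -> str:
    """Lower-case scheme and host, drop the fragment and any trailing slash."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))

def resolve(name: str) -> Optional[str]:
    """Map an artist name or alias to its canonical ID, if curated."""
    return ALIASES.get(" ".join(name.lower().split()))

print(resolve("P  Diddy"), resolve("Puff Daddy"))
print(canonical_url("HTTP://Example.com/Artist/Prince/#bio"))
```

A correction to an alias then becomes a change to the curated table, which can be versioned and reviewed, rather than a manual edit to processed data.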
Future
 Data to knowledge - Value chain of data
- Provenance is key to this
 Epistemology / Justified True Belief
- Semantic Web
- Big Metadata: internet of things (the archetypes, not the physical objects)
Summary
 How you made your discovery is as important as the discovery
- Reproduce, debug, verify, share, re-apply
 Open Data Science
 How to represent (declarative vs imperative, maths vs software engineering)
- Separate Data Science from Software Engineering in a well defined way
- Don't orphan data from how it was computed
 Don't rely on your input data/metadata
- a) never changing b) being available
- Version control and share your (meta)data
 Resist all temptation for any manual manipulation of data!
 Consider the entities you are analysing
Thank You!
 Any questions?
 Jameel Syed
 @tilapia

Data Science Provenance: From Drug Discovery to Fake Fans

  • 1. Data Science Provenance: From Drug Discovery to Fake Fans Dr Jameel Syed @tilapia
  • 2. Overview  Knowledge work adds value to raw data  How determines whether results can be reliably reproduced and scrutinized  Solving parts of the problem - Inforsense (life sciences workflow analytics platform) - Musicmetric (social media analytics for music)  What's Provenance & why its important  Representations of provenance  Considerations to allow analysis computation to be recreated  Reliable collection of noisy data from the Internet  Archiving of data and accommodating retrospective changes  Using linked data to direct Big Data analytics
  • 3. What is Data (Science) Provenance?  Scientific research is generally held to be of good provenance when it is documented in detail sufficient to allow reproducibility. Scientific workflows assist scientists and programmers with tracking their data through all transformations, analyses, and interpretations. Data sets are reliable when the process used to create them are reproducible and analyzable for defects. Current initiatives to effectively manage, share, and reuse ecological data are indicative of the increasing importance of data provenance.  Reproducibility of data & research process - Explanation - Why were the end conclusions reached? - Debugging and verification – Sharing, auditing - Re-application
  • 4. The Economist, October 19th 2013  Last year researchers at one biotech firm, Amgen, found they could reproduce just six of 53 “landmark” studies in cancer research. Earlier, a group at Bayer, a drug company, managed to repeat just a quarter of 67 similarly important papers.  Ideally, research protocols should be registered in advance and monitored in virtual notebooks. This would curb the temptation to fiddle with the experiment’s design midstream so as to make the results look more substantial than they are. ... Where possible, trial data also should be open for other researchers to inspect and test.  http://econ.st/H3qU5a  Nature, Vol 500, 1st August 2013; http://go.nature.com/zqtrnp
  • 5. Reinhart and Rogoff's spreadsheet error  "Growth in a Time of Debt" paper shaping decisions affecting national economies  BBC; 20 April 2013 http://www.bbc.co.uk/news/magazine-22223190 - After some correspondence, Reinhart and Rogoff provided Thomas with the actual working spreadsheet they'd used to obtain their results. "Everyone says seeing is believing, but I almost didn't believe my eyes," he says. - The Harvard professors had accidentally only included 15 of the 20 countries under analysis in their key calculation (of average GDP growth in countries with high public debt). Australia, Austria, Belgium, Canada and Denmark were missing. - Businessweek FAQ http://buswk.co/YZgwSA  "Spreadsheets: The Ununderstood Dark Matter Of IT" - Y2K bug was not just COBOL!
  • 6. Open Data Science  Open Source Software is the foundation  Open Access to data and methodology - errors happen, but are they found?  Many efforts... - Open Access publication (PubMed, arXiv.org) - Mozilla Science Lab @MozillaScience - Open Knowledge Foundation http://okfn.org - Open Data Institute http://theodi.org/  Licensing - Panton Principles - Creative commons license data - Non-commercial API access
  • 7. Inforsense  Workflow analytics platform for Life Sciences - “in silico” research / e-Science - Process representation and re-use - Which data sets were used, where are they from, how were they computed?  Spin out from research at Imperial College London - Discovery Net e-Science project  Used by pharmaceutical and biotech companies
  • 8. “Big Data”  Gene chips (DNA microarray) – rather than a PhD on a few genes, tens of thousands at a time (& culmination of Human Genome Project)  High-throughput screening (HTS) – drug discovery; thousands of automated experiments per day  What to do with the data? - Paper published - Data set sometimes published - Reproduce and expand methodology manually  Image: http://upload.wikimedia.org/wikipedia/commons/thumb/2/22/Affymetrix-microarray.jpg/150px-Affymetrix-microarray.jpg
  • 9. Representations  How to represent or codify ideas? (beyond writing a traditional paper)  Statistician - Business Intelligence Analyst - Data Scientist - Software engineer - Some coding? - How much? - Scientists have been using Fortran for decades (& S+, R, Matlab...) - GRAIL (1969, RAND Corporation): flow charts and light pens - http://www.rand.org/content/dam/rand/pubs/research_memoranda/2006/RM6001.pdf - Bioinformaticians (back in the day): Perl hackers, open source, sharing data
  • 10. Declarative Workflows  Academic paper & data set → encoded as workflow → computed results  What should the set of operations be? - Deterministic, no side effects - Common functions between workflows  Functional composition
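The idea of a workflow built from deterministic, side-effect-free operations can be sketched in a few lines. This is only an illustration, not how Inforsense represented workflows: the step names (`drop_missing`, `scale`, `total`) and the `compose` helper are hypothetical.

```python
from functools import reduce
from typing import Callable

Step = Callable[[list], list]


def compose(*steps: Step) -> Step:
    """Chain pure steps left to right into a single workflow function."""
    return lambda data: reduce(lambda acc, step: step(acc), steps, data)


def drop_missing(rows: list) -> list:
    # Returns a new list; the input is never mutated (no side effects).
    return [r for r in rows if r is not None]


def scale(factor: float) -> Step:
    # A parameterised step: the factor is part of the workflow description.
    return lambda rows: [r * factor for r in rows]


def total(rows: list) -> list:
    return [sum(rows)]


# The workflow is itself a value that can be stored, shared and re-run.
workflow = compose(drop_missing, scale(2.0), total)

raw = [1.0, None, 2.5]
result = workflow(raw)
```

Because every step is pure, running `workflow` again on the same input gives the same result, and the list of steps doubles as a record of how the result was computed.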
  • 11. Functional Programming  "Functional programming combines the flexibility and power of abstract mathematics with the intuitive clarity of abstract mathematics."  http://xkcd.com/1270/
  • 12. Declarative vs Imperative  Maths proof scrutiny - Axioms and deductive steps; describe assumptions  Functional composition - No side effects! - The code documents itself!  Combination - no silver bullet (in memory speed, out of core scale) - “e-Lab notebook” http://ipython.org/notebook.html - Inline visualisations (see also Mathematica) - Hadoop does the heavy lifting (ETL) - Pig, Hive, Cascading (Scalding, Cascalog), Crunch/Scrunch, Java MR
  • 13. Live vs static  A static representation of knowledge does not allow for discourse with the data and process - In Phaedrus, Socrates says: - "Writing shares a strange feature with painting. The offspring of painting stand there as if they were alive, but if anyone asks them anything they are solemnly silent"... - "alone, it cannot defend itself or come to its own support"  Writing programs or solving problems?  Encapsulate and generalize a specific instance of a process - To run again - To run on similar data (making a tool to solve problems)  Russell Jurney – Agile Data Science (book)
  • 14. Metadata of datasets  What is this? - 5.1,3.5,1.4,0.2,setosa - 4.9,3.0,1.4,0.2,setosa - 4.7,3.2,1.3,0.2,setosa - 4.6,3.1,1.5,0.2,setosa - 5.0,3.6,1.4,0.2,setosa - 5.4,3.9,1.7,0.4,setosa - 4.6,3.4,1.4,0.3,setosa - 5.0,3.4,1.5,0.2,setosa - 4.4,2.9,1.4,0.2,setosa
  • 15.  Modified version of http://archive.ics.uci.edu/ml/datasets/Iris  1. Title: Iris Plants Database  Updated Sept 21 by C.Blake - Added discrepency information  2. Sources:  (a) Creator: R.A. Fisher  (b) Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)  (c) Date: July, 1988  3. Past Usage:  - Publications: too many to mention!!! Here are a few.  1. Fisher,R.A. "The use of multiple measurements in taxonomic problems"  Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions  to Mathematical Statistics" (John Wiley, NY, 1950).  2. Duda,R.O., & Hart,P.E. (1973) Pattern Classification and Scene Analysis.  (Q327.D83) John Wiley & Sons. ISBN 0-471-22361-1. See page 218. ...
  • 16. Process Methodology  Mathematical method / Scientific method - Understanding / Characterize from experience & observation - Analysis / Hypothesis: a proposed explanation - Synthesis / Deduction: prediction from the hypothesis - Review/Extend / Test and experiment - http://en.wikipedia.org/wiki/Scientific_method#Relationship_with_mathematics  CRISP-DM - http://en.wikipedia.org/wiki/Cross_Industry_Standard_Process_for_Data_Mining  OSEMN ('awesome') (Hilary Mason) - Obtaining, Scrubbing, Exploring, Modeling, and iNterpreting data - http://www.dataists.com/2010/09/a-taxonomy-of-data-science/
  • 17. Musicmetric (Semetric Ltd)  Analytics for musical artists (and beyond) - Collecting data from the Internet/APIs - Provenance of Data - Linked entities  Hadoop-based Big Data processing → NoSQL → RESTful API - Nathan Marz/“Lambda Architecture” - http://www.ymc.ch/en/lambda-architecture-part-1  Used by record labels, artist managers, brand owners, festivals, publishers, broadcasters
  • 18. Lots of data about lots of entities
  • 19. I read it on the Internet, it must be true?  Collection and archiving of web data is not straightforward  Dealing with noisy or incorrect data - Issues with data from APIs - Filter between raw data and data used in analysis (preprocessing/data cleaning) - Data and metadata retrospectively changing - Present processed data, with access to raw data  Sampling frequency - Collect hourly, present daily - Interpolation to accommodate irregularities in update frequency  Anomalies...
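"Collect hourly, present daily" with interpolation over irregular updates might look like the sketch below. It assumes timestamps are whole hours since an arbitrary start and uses plain linear interpolation; the function names and sample values are made up for illustration and are not taken from Musicmetric.

```python
from bisect import bisect_left


def interpolate_at(samples, t):
    """Linearly interpolate a value at time t from sorted (hour, value) pairs."""
    times = [s[0] for s in samples]
    i = bisect_left(times, t)
    if i < len(times) and times[i] == t:
        return samples[i][1]  # an observation exists exactly at t
    if i == 0 or i == len(times):
        return None  # outside the observed range: do not extrapolate
    (t0, v0), (t1, v1) = samples[i - 1], samples[i]
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)


def daily_series(samples, first_day, last_day):
    """One value per day, taken at midnight (hour = day * 24)."""
    return {
        day: interpolate_at(samples, day * 24)
        for day in range(first_day, last_day + 1)
    }


# Irregular collection times: the hour-24 and hour-48 samples are missing.
samples = [(0, 100), (23, 123), (26, 126), (50, 150)]
daily = daily_series(samples, 0, 3)
```

Day 3 comes back as `None` because no sample brackets it; keeping the raw samples alongside the daily series preserves access to what was actually observed.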
  • 20. Fake fans  “Fake Fans” or “Fake Followers” - Social media activity caused by artificially created and controlled social media user profiles → fraud - “Buying fans” to get noticed  Fan count goes up - Collect more data, detect and remove anomalous data - “daily diff” time series – how many fans did I gain today? (compared to yesterday)  Fan count goes down - Twitter et al try to fix the problem → Massive removal of fans → This is also a problem!  Data Science for pre-processing - Predict what is normal using all historical data (for artist, for data source) - Death event detector :-/  http://www.musicmetric.com/2013/04/fake-fans-and-anomaly-detection-at-musicmetric/
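The "daily diff" series and a simple anomaly flag can be sketched as follows. The detector here (distance from the median, measured in units of the median absolute deviation) is a swapped-in textbook technique chosen for brevity, not the model described in the Musicmetric blog post; the threshold and the fan counts are invented.

```python
from statistics import median


def daily_diff(counts):
    """How many fans were gained (or lost) each day compared to the day before."""
    return [today - yesterday for yesterday, today in zip(counts, counts[1:])]


def flag_anomalies(diffs, threshold=5.0):
    """Flag diffs that sit far from the median, in units of the MAD."""
    med = median(diffs)
    mad = median([abs(d - med) for d in diffs])
    if mad == 0:
        # No spread at all: anything different from the median stands out.
        return [d != med for d in diffs]
    return [abs(d - med) / mad > threshold for d in diffs]


# Invented cumulative fan counts: a bought burst, then a mass removal.
counts = [1000, 1010, 1022, 1031, 6031, 6040, 4040, 4052]
diffs = daily_diff(counts)
flags = flag_anomalies(diffs)
```

Both the jump of 5000 and the drop of 2000 are flagged, which matches the point on the slide that a platform's clean-up is as much of a problem for the time series as the fake fans themselves.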
  • 21. Versioning (raw) Data  Git - https://github.com/blog/1601-see-your-csvs  Dat: version control for data (Git alternative) - http://dat-data.com/ - @maxogden http://strataconf.com/strataeu2013/public/schedule/detail/32390  Figshare  S3 (e.g. Datasift Twitter firehose)  Dropbox? (“consumerisation” of enterprise tools)  TSV not CSV (consider bz2 rather than gzip; don't forget to md5sum)
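The last bullet (TSV, bz2, md5sum) needs nothing beyond the Python standard library. A minimal sketch, in which the file name, column names and the single data row are placeholders:

```python
import bz2
import csv
import hashlib
import os
import tempfile


def write_tsv_bz2(path, header, rows):
    """Write rows as tab-separated values into a bz2-compressed file."""
    with bz2.open(path, "wt", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh, delimiter="\t", lineterminator="\n")
        writer.writerow(header)
        writer.writerows(rows)


def md5sum(path, chunk_size=1 << 20):
    """Checksum of the archived file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Placeholder data, written to a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "fans.tsv.bz2")
write_tsv_bz2(path, ["artist", "date", "fans"], [["The Beatles", "2013-10-01", 1000]])
checksum = md5sum(path)

with bz2.open(path, "rt", encoding="utf-8") as fh:
    restored = fh.read()
```

Storing `checksum` next to the archive lets anyone who later retrieves the file confirm that it is the same raw data the analysis was run on.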
  • 23. Using linked data to direct Big Data analytics  Linked data platform - Profiles for The Beatles - Puff Daddy/P Diddy, Prince/TAFKAP - Macklemore & Ryan Lewis, Simon & Garfunkel - Canonicalise URLs - Temporal logic? IDs change; not good, but it happens (MusicBrainz NGS) - RESTful API / UUIDs / external IDs  Manual curation separated from data processing - Resist all temptation for any manual manipulation of data!
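"Canonicalise URLs" means mapping the many spellings of a profile link onto one key so they resolve to the same entity. The rules below (lower-case scheme and host, drop a leading `www.`, drop default ports, trailing slashes, query strings and fragments) are assumptions picked for illustration; a real platform would need per-site rules, since some sites carry the profile ID in the query string.

```python
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def canonicalise(url):
    """Reduce a profile URL to one canonical spelling (illustrative rules only)."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    port = parts.port
    # Keep the port only when it differs from the scheme's default.
    if port is None or port == DEFAULT_PORTS.get(scheme):
        netloc = host
    else:
        netloc = f"{host}:{port}"
    path = parts.path.rstrip("/") or "/"
    # Query string and fragment are dropped entirely in this sketch.
    return urlunsplit((scheme, netloc, path, "", ""))


a = canonicalise("HTTP://WWW.Example.com:80/TheBeatles/?ref=x#top")
b = canonicalise("http://example.com/TheBeatles")
```

Both spellings reduce to the same string, so it can serve as an external ID next to the platform's own UUID for the entity.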
  • 24. Future  Data to knowledge - Value chain of data - Provenance is key to this  Epistemology / Justified True Belief - Semantic Web - Big Metadata: internet of things (the archetypes, not the physical objects)
  • 25. Summary  How you made your discovery is as important as the discovery - Reproduce, debug, verify, share, re-apply  Open Data Science  How to represent (declarative vs imperative, maths vs software engineering) - Separate Data Science from Software Engineering in a well-defined way - Don't orphan data from how it was computed  Don't rely on your input data/metadata - a) never changing b) being available - Version control and share your (meta)data  Resist all temptation for any manual manipulation of data!  Consider the entities you are analysing
  • 26. Thank You!  Any questions?  Jameel Syed  @tilapia