Data Science Provenance: From Drug Discovery to Fake Fans

Dr Jameel Syed
@tilapia

Overview
 Knowledge work adds value to raw data
 How that value is added determines whether results can be reliably reproduced and scrutinized
 Solving parts of the problem
- Inforsense (life sciences workflow analytics platform)
- Musicmetric (social media analytics for music)
 What provenance is & why it's important
 Representations of provenance
 Considerations to allow analysis computation to be recreated
 Reliable collection of noisy data from the Internet
 Archiving of data and accommodating retrospective changes
 Using linked data to direct Big Data analytics

What is Data (Science) Provenance?
 Scientific research is generally held to be of good provenance when it is documented in
detail sufficient to allow reproducibility. Scientific workflows assist scientists and
programmers with tracking their data through all transformations, analyses, and
interpretations. Data sets are reliable when the processes used to create them are
reproducible and analyzable for defects. Current initiatives to effectively manage,
share, and reuse ecological data are indicative of the increasing importance of data
provenance.
 Reproducibility of data & research process
- Explanation - Why were the end conclusions reached?
- Debugging and verification – Sharing, auditing
- Re-application
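A minimal sketch of what recording provenance for one computation step could look like. The names `fingerprint`, `run_with_provenance` and `mean_above` are hypothetical and do not come from any tool mentioned in this talk; the idea is only to store what ran, with which parameters, on which input, producing which output.

```python
import hashlib
import json


def fingerprint(obj):
    """Return a stable SHA-256 hash of any JSON-serialisable object."""
    payload = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def run_with_provenance(step, data, **params):
    """Run one step and return (result, provenance record)."""
    result = step(data, **params)
    record = {
        "step": step.__name__,
        "params": params,
        "input_sha256": fingerprint(data),
        "output_sha256": fingerprint(result),
    }
    return result, record


def mean_above(values, threshold):
    """Hypothetical analysis step: mean of the values above a threshold."""
    kept = [v for v in values if v > threshold]
    if not kept:
        raise ValueError("no values above threshold")
    return sum(kept) / len(kept)


result, record = run_with_provenance(mean_above, [1.0, 2.0, 3.0, 4.0], threshold=2.0)
```

Re-running the same step on the same input yields an identical record, which is what makes the result explainable and auditable later.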

The Economist, October 19th 2013
 Last year researchers at one biotech firm,
Amgen, found they could reproduce just six of 53
“landmark” studies in cancer research. Earlier, a
group at Bayer, a drug company, managed to
repeat just a quarter of 67 similarly important
papers.
 Ideally, research protocols should be registered
in advance and monitored in virtual notebooks.
This would curb the temptation to fiddle with the
experiment’s design midstream so as to make
the results look more substantial than they
are. ... Where possible, trial data also should be
open for other researchers to inspect and test.
 http://econ.st/H3qU5a
 Nature, Vol 500, 1st August 2013; http://go.nature.com/zqtrnp

Reinhart and Rogoff's spreadsheet error
 "Growth in a Time of Debt" paper shaping decisions affecting
national economies
 BBC; 20 April 2013 http://www.bbc.co.uk/news/magazine-22223190
- After some correspondence, Reinhart and Rogoff provided
Thomas with the actual working spreadsheet they'd used to obtain
their results. "Everyone says seeing is believing, but I almost
didn't believe my eyes," he says.
- The Harvard professors had accidentally only included 15 of the
20 countries under analysis in their key calculation (of average
GDP growth in countries with high public debt). Australia, Austria,
Belgium, Canada and Denmark were missing.
- Businessweek FAQ http://buswk.co/YZgwSA
 "Spreadsheets: The Ununderstood Dark Matter Of IT"
- Y2K bug was not just COBOL!

Open Data Science
 Open Source Software is the foundation
 Open Access to data and methodology - errors happen, but are they found?
 Many efforts...
- Open Access publication (PubMed, arXiv.org)
- Mozilla Science Lab @MozillaScience
- Open Knowledge Foundation http://okfn.org
- Open Data Institute http://theodi.org/
 Licensing
- Panton Principles
- Creative Commons licensed data
- Non-commercial API access

Inforsense
 Workflow analytics platform for Life Sciences
- “in silico” research / e-Science
- Process representation and re-use
- Which data sets were used, where are they from, how were they computed?
 Spin out from research at Imperial College London
- Discovery Net e-Science project
 Used by pharmaceutical and biotech companies

“Big Data”
 Gene chips (DNA microarrays) – rather than a PhD on a few genes, tens of thousands at a
time (& culmination of the Human Genome Project)
 High-throughput screening (HTS) – drug discovery; thousands of automated
experiments per day
 What to do with the data?
- Paper published
- Data set sometimes published
- Reproduce and expand methodology manually

http://upload.wikimedia.org/wikipedia/commons/thumb/2/22/Affymetrix-microarray.jpg/150px-Affymetrix-microarray.jpg

Representations
 How to represent or codify ideas? (beyond writing a traditional paper)
 Statistician - Business Intelligence Analyst - Data Scientist - Software engineer
- Some coding?
- How much?
- Scientists have been using Fortran for decades (& S+, R, Matlab...)
- GRAIL (1969, RAND Corporation) flow charts and light pens
- http://www.rand.org/content/dam/rand/pubs/research_memoranda/2006/RM6001.pdf
- Bioinformaticians (back in the day) Perl hackers, open source, sharing data

Declarative Workflows

 Academic paper & data set → encoded as workflow → computed results
 What should the set of operations be?
- Deterministic, no side effects
- Common functions between workflows
 Functional composition
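A small sketch of a workflow built by functional composition, assuming each operation is a pure function from a list of numbers to a new list. The helper `compose` and the two steps are hypothetical illustrations, not Inforsense operations.

```python
from functools import reduce


def compose(*steps):
    """Chain pure steps left to right into a single workflow function."""
    return lambda data: reduce(lambda acc, step: step(acc), steps, data)


def drop_negative(values):
    """Return a new list without negative readings (input is not modified)."""
    return [v for v in values if v >= 0]


def scale_to_max(values):
    """Return a new list scaled so that the largest value becomes 1.0."""
    if not values:
        return []
    peak = max(values)
    return [v / peak for v in values]


# The workflow is itself just a value: it can be stored, shared and re-run.
workflow = compose(drop_negative, scale_to_max)

raw = [2.0, -1.0, 4.0]
result = workflow(raw)
```

Because every step is deterministic and leaves its input untouched, running the workflow twice on the same data gives the same result, and the same steps can be reused in other workflows.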

Functional Programming

 "Functional programming combines
the flexibility and power of abstract
mathematics with the intuitive clarity
of abstract mathematics."
 http://xkcd.com/1270/

Declarative vs Imperative
 Maths proof scrutiny
- Axioms and deductive steps; describe assumptions
 Functional composition
- No side effects!
- The code documents itself!
 Combination - no silver bullet (in-memory speed, out-of-core scale)
- “e-Lab notebook” http://ipython.org/notebook.html
- Inline visualisations (see also Mathematica)
- Hadoop does the heavy lifting (ETL)
- Pig, Hive, Cascading (Scalding, Cascalog), Crunch/Scrunch, Java MR

Live vs static
 A static representation of knowledge does not allow for
discourse with the data and process
- In Phaedrus, Socrates says:
- "Writing shares a strange feature with painting. The offspring
of painting stand there as if they were alive, but if anyone
asks them anything they are solemnly silent"...
- "alone, it cannot defend itself or come to its own support"
 Writing programs or solving problems?
 Encapsulate and generalize a specific instance of a process
- To run again
- To run on similar data (making a tool to solve problems)
 Russell Jurney – Agile Data Science book

Metadata of datasets
 What is this?
- 5.1,3.5,1.4,0.2,setosa
- 4.9,3.0,1.4,0.2,setosa
- 4.7,3.2,1.3,0.2,setosa
- 4.6,3.1,1.5,0.2,setosa
- 5.0,3.6,1.4,0.2,setosa
- 5.4,3.9,1.7,0.4,setosa
- 4.6,3.4,1.4,0.3,setosa
- 5.0,3.4,1.5,0.2,setosa
- 4.4,2.9,1.4,0.2,setosa

 Modified version of http://archive.ics.uci.edu/ml/datasets/Iris
 1. Title: Iris Plants Database
 Updated Sept 21 by C.Blake - Added discrepency information
 2. Sources:
   (a) Creator: R.A. Fisher
   (b) Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
   (c) Date: July, 1988
 3. Past Usage:
- Publications: too many to mention!!! Here are a few.
   1. Fisher,R.A. "The use of multiple measurements in taxonomic problems"
      Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions
      to Mathematical Statistics" (John Wiley, NY, 1950).
   2. Duda,R.O., & Hart,P.E. (1973) Pattern Classification and Scene Analysis.
      (Q327.D83) John Wiley & Sons. ISBN 0-471-22361-1. See page 218. ...
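The bare rows above only become interpretable once a schema travels with them. A sketch, assuming the usual Iris attribute order (sepal length, sepal width, petal length, petal width in cm, then class); the column names and the helper `read_with_schema` are chosen here for illustration.

```python
import csv
import io

# Assumed column order for the Iris data; the names are illustrative.
COLUMNS = [
    "sepal_length_cm",
    "sepal_width_cm",
    "petal_length_cm",
    "petal_width_cm",
    "species",
]

RAW = "5.1,3.5,1.4,0.2,setosa\n4.9,3.0,1.4,0.2,setosa\n"


def read_with_schema(text, columns):
    """Parse CSV text into dicts; all columns but the last are read as floats."""
    rows = []
    for record in csv.reader(io.StringIO(text)):
        if len(record) != len(columns):
            raise ValueError(f"expected {len(columns)} fields, got {len(record)}")
        row = dict(zip(columns, record))
        for name in columns[:-1]:
            row[name] = float(row[name])
        rows.append(row)
    return rows


rows = read_with_schema(RAW, COLUMNS)
```

Rejecting rows with the wrong number of fields is a cheap way to notice when the data and its metadata have drifted apart.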

Process Methodology

 Mathematical method / Scientific method
- Understanding / Characterize from experience &
observation
- Analysis / Hypothesis: a proposed explanation
- Synthesis / Deduction: prediction from the hypothesis
- Review/Extend / Test and experiment
- http://en.wikipedia.org/wiki/Scientific_method#Relationship_with_mathematics

 CRISP-DM
- http://en.wikipedia.org/wiki/Cross_Industry_Standard_Process_for_Data_Mining

 OSEMN ('awesome') (Hilary Mason)
- Obtaining, Scrubbing, Exploring, Modeling, and
iNterpreting data
- http://www.dataists.com/2010/09/a-taxonomy-of-data-science/

Musicmetric (Semetric Ltd)
 Analytics for musical artists (and beyond)
- Collecting data from the Internet/APIs - Provenance of Data
- Linked entities
 Hadoop-based Big Data processing → NoSQL → RESTful API
- Nathan Marz/“Lambda Architecture”
- http://www.ymc.ch/en/lambda-architecture-part-1
 Used by record labels, artist managers, brand owners, festivals, publishers,
broadcasters

Lots of data about lots of entities

I read it on the Internet, it must be true?
 Collection and archiving of web data is not straightforward
 Dealing with noisy or incorrect data
- Issues with data from APIs
- Filter between raw data and data used in analysis (preprocessing/data cleaning)
- Data and metadata retrospectively changing
- Present processed data, with access to raw data
 Sampling frequency
- Collect hourly, present daily
- Interpolation to accommodate irregularities in update frequency
 Anomalies...
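A sketch of "collect hourly, present daily" under simple assumptions: samples are (hour, value) pairs with strictly increasing hours, some hours are missing, and the daily value is the linear interpolation at each midnight. The function name and the numbers are hypothetical.

```python
from bisect import bisect_left


def interpolate_at(samples, t):
    """Linearly interpolate the value at time t from sorted (time, value) pairs."""
    times = [time for time, _ in samples]
    if t < times[0] or t > times[-1]:
        raise ValueError("t lies outside the sampled range")
    i = bisect_left(times, t)
    if times[i] == t:
        return samples[i][1]  # an exact sample exists, no interpolation needed
    (t0, v0), (t1, v1) = samples[i - 1], samples[i]
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)


# Hypothetical fan counts; the collector missed the samples at hours 24 and 25.
samples = [(0, 100), (1, 101), (23, 123), (26, 129), (48, 151)]

# One value per day, taken at hours 0, 24 and 48.
daily = [interpolate_at(samples, 24 * day) for day in range(3)]
```

Here the value for hour 24 is estimated from the neighbouring samples at hours 23 and 26, so the daily series has no gap even though the raw collection did.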

Fake fans
 “Fake Fans” or “Fake Followers”
- Social media activity caused by artificially created and controlled social media user profiles → fraud
- “Buying fans” to get noticed
 Fan count goes up
- Collect more data, detect and remove anomalous data
- “daily diff” time series – how many fans did I gain today? (compared to yesterday)
 Fan count goes down
- Twitter et al try to fix the problem → Massive removal of fans → This is also a problem!
 Data Science for pre-processing
- Predict what is normal using all historical data (for artist, for data source)
- Death event detector :-/
 http://www.musicmetric.com/2013/04/fake-fans-and-anomaly-detection-at-musicmetric/
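A sketch of the "daily diff" series with a simple outlier flag. This is not Musicmetric's detector: it swaps in a basic median and median-absolute-deviation (MAD) rule as a stand-in for "predict what is normal", and the counts and the threshold of 5 are made up for illustration.

```python
from statistics import median


def daily_diff(counts):
    """How many fans were gained each day compared to the previous day."""
    return [today - yesterday for yesterday, today in zip(counts, counts[1:])]


def flag_anomalies(diffs, threshold=5.0):
    """Flag diffs that lie more than `threshold` MADs away from the median."""
    centre = median(diffs)
    mad = median([abs(d - centre) for d in diffs])
    if mad == 0:
        # All typical diffs are identical; anything different is unusual.
        return [d != centre for d in diffs]
    return [abs(d - centre) / mad > threshold for d in diffs]


# Hypothetical fan counts: a bought-fans jump, then a mass removal.
counts = [1000, 1010, 1022, 1031, 6031, 6040, 6052, 2052, 2061]

diffs = daily_diff(counts)
flags = flag_anomalies(diffs)
```

The rule flags both the sudden gain and the sudden loss, matching the point above that a platform's clean-up shows up in the data as an anomaly too.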

Versioning (raw) Data
 Git - https://github.com/blog/1601-see-your-csvs
 Dat: version control for data (Git alternative)
- http://dat-data.com/
- @maxogden http://strataconf.com/strataeu2013/public/schedule/detail/32390
 Figshare
 S3 (e.g. Datasift Twitter firehose)
 Dropbox? (“consumerisation” of enterprise tools)
 TSV not CSV (consider bz2 rather than gzip; don't forget to md5sum)
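A sketch of the last bullet using only the Python standard library: write tab-separated rows into a bz2-compressed file and compute its MD5 checksum (the equivalent of running `md5sum` on it). The file name and the rows are hypothetical.

```python
import bz2
import csv
import hashlib
import os
import tempfile


def write_tsv_bz2(path, header, rows):
    """Write a header and rows as tab-separated text in a bz2 file."""
    with bz2.open(path, "wt", encoding="utf-8", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(header)
        writer.writerows(rows)


def read_tsv_bz2(path):
    """Read every row back from a bz2-compressed TSV file."""
    with bz2.open(path, "rt", encoding="utf-8", newline="") as handle:
        return list(csv.reader(handle, delimiter="\t"))


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as folder:
    archive = os.path.join(folder, "fans.tsv.bz2")
    write_tsv_bz2(archive, ["artist", "fans"], [["example artist", "1000"]])
    round_trip = read_tsv_bz2(archive)
    checksum = md5_of_file(archive)
```

Storing the checksum next to the archive lets anyone who later downloads the file confirm that it is the same data the analysis was run on.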

Using linked data to direct Big Data analytics
 Linked data platform
- Profiles for The Beatles
- Puff Daddy/P Diddy, Prince/TAFKAP
- Macklemore & Ryan Lewis, Simon & Garfunkel
- Canonicalise URLs
- Temporal logic? IDs change; not good, but it happens (MusicBrainz NGS)
- RESTful API / UUIDs / external IDs
 Manual curation separated from data processing
- Resist all temptation for any manual manipulation of data!
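A sketch of two of the ideas above: canonicalising profile URLs and resolving alternative artist names to one stable ID. The normalisation rules (force https, lower-case the host, drop "www.", the query string and any trailing slash), the example.com URLs and the alias table are all assumptions for illustration, not the Musicmetric implementation.

```python
import uuid
from urllib.parse import urlsplit, urlunsplit


def canonicalise_url(url):
    """Map equivalent spellings of a profile URL onto a single form."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path.rstrip("/") or "/"
    return urlunsplit(("https", host, path, "", ""))


# Hypothetical stable ID, derived deterministically from a made-up entity URL.
ARTIST_ID = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.com/artist/puff-daddy")

# The curated alias table lives apart from the data processing code.
ALIASES = {
    "puff daddy": ARTIST_ID,
    "p diddy": ARTIST_ID,
}


def resolve(name):
    """Look up the stable ID for an artist name; raises KeyError if unknown."""
    return ALIASES[name.strip().casefold()]


canonical = canonicalise_url("HTTP://www.Example.com/TheBeatles/?ref=x")
```

Keeping the alias table as data rather than as edits to the collected records is one way to keep manual curation separate from processing.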

Future
 Data to knowledge - Value chain of data
- Provenance is key to this
 Epistemology / Justified True Belief
- Semantic Web
- Big Metadata: internet of things (the archetypes, not the physical objects)

Summary
 How you made your discovery is as important as the discovery
- Reproduce, debug, verify, share, re-apply
 Open Data Science
 How to represent (declarative vs imperative, maths vs software engineering)
- Separate Data Science from Software Engineering in a well defined way
- Don't orphan data from how it was computed
 Don't rely on your input data/metadata
- a) never changing or b) always being available
- Version control and share your (meta)data
 Resist all temptation for any manual manipulation of data!
 Consider the entities you are analysing

Thank You!
 Any questions?
 Jameel Syed
 @tilapia
