Knowledge work adds value to raw data; how this activity is performed is critical for how reliably results can be reproduced and scrutinized. With a brief diversion into epistemology, the presentation will outline the challenges for practitioners and consumers of Big Data analysis, and demonstrate how these were tackled at Inforsense (life sciences workflow analytics platform) and Musicmetric (social media analytics for music).
The talk covers the following issues with concrete examples:
- Representations of provenance
- Considerations to allow analysis computation to be recreated
- Reliable collection of noisy data from the internet
- Archiving of data and accommodating retrospective changes
- Using linked data to direct Big Data analytics
2. Overview
Knowledge work adds value to raw data
How it is performed determines whether results can be reliably reproduced and scrutinized
Solving parts of the problem
- Inforsense (life sciences workflow analytics platform)
- Musicmetric (social media analytics for music)
What is provenance & why it's important
Representations of provenance
Considerations to allow analysis computation to be recreated
Reliable collection of noisy data from the Internet
Archiving of data and accommodating retrospective changes
Using linked data to direct Big Data analytics
3. What is Data (Science) Provenance?
Scientific research is generally held to be of good provenance when it is documented in
detail sufficient to allow reproducibility. Scientific workflows assist scientists and
programmers with tracking their data through all transformations, analyses, and
interpretations. Data sets are reliable when the process used to create them is
reproducible and analyzable for defects. Current initiatives to effectively manage,
share, and reuse ecological data are indicative of the increasing importance of data
provenance.
Reproducibility of data & research process
- Explanation - Why were the end conclusions reached?
- Debugging and verification – Sharing, auditing
- Re-application
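A minimal sketch of what a provenance record for one analysis step might contain, assuming inputs are identified by a content hash. The function names, field names, and the `hypothetical-commit-id` value are illustrative assumptions, not a format used by Inforsense or Musicmetric.

```python
import hashlib
import json


def fingerprint(data: bytes) -> str:
    """Return a SHA-256 hex digest identifying the exact bytes of an input."""
    return hashlib.sha256(data).hexdigest()


def provenance_record(inputs: dict, operation: str, parameters: dict, code_version: str) -> dict:
    """Describe one analysis step: what went in, what was done, and with which code."""
    return {
        # Hash each input so a later change to the data is detectable.
        "inputs": {name: fingerprint(blob) for name, blob in sorted(inputs.items())},
        "operation": operation,
        "parameters": parameters,
        "code_version": code_version,
    }


record = provenance_record(
    inputs={"iris.csv": b"5.1,3.5,1.4,0.2,setosa\n"},
    operation="mean_by_class",               # hypothetical operation name
    parameters={"column": "sepal_length"},   # hypothetical parameter
    code_version="hypothetical-commit-id",   # placeholder, not a real revision
)
print(json.dumps(record, indent=2, sort_keys=True))
```

Stored next to the result, a record like this supports the three uses above: it explains how the result was reached, it can be audited, and it gives enough detail to re-apply the step to new data.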
4. The Economist, October 19th 2013
Last year researchers at one biotech firm,
Amgen, found they could reproduce just six of 53
“landmark” studies in cancer research. Earlier, a
group at Bayer, a drug company, managed to
repeat just a quarter of 67 similarly important
papers.
Ideally, research protocols should be registered
in advance and monitored in virtual notebooks.
This would curb the temptation to fiddle with the
experiment’s design midstream so as to make
the results look more substantial than they
are. ... Where possible, trial data also should be
open for other researchers to inspect and test.
http://econ.st/H3qU5a
Nature, Vol 500, 1st August 2013; http://go.nature.com/zqtrnp
5. Reinhart and Rogoff's spreadsheet error
"Growth in a Time of Debt" paper shaping decisions affecting
national economies
BBC; 20 April 2013 http://www.bbc.co.uk/news/magazine-22223190
- After some correspondence, Reinhart and Rogoff provided
Thomas with the actual working spreadsheet they'd used to obtain
their results. "Everyone says seeing is believing, but I almost
didn't believe my eyes," he says.
- The Harvard professors had accidentally only included 15 of the
20 countries under analysis in their key calculation (of average
GDP growth in countries with high public debt). Australia, Austria,
Belgium, Canada and Denmark were missing.
- Businessweek FAQ http://buswk.co/YZgwSA
"Spreadsheets: The Ununderstood Dark Matter Of IT"
- Y2K bug was not just COBOL!
6. Open Data Science
Open Source Software is the foundation
Open Access to data and methodology - errors happen, but are they found?
Many efforts...
- Open Access publication (PubMed, arXiv.org)
- Mozilla Science Lab @MozillaScience
- Open Knowledge Foundation http://okfn.org
- Open Data Institute http://theodi.org/
Licensing
- Panton Principles
- Creative Commons licensed data
- Non-commercial API access
7. Inforsense
Workflow analytics platform for Life Sciences
- “in silico” research / e-Science
- Process representation and re-use
- Which data sets were used, where are they from, how were they computed?
Spin out from research at Imperial College London
- Discovery Net e-Science project
Used by pharmaceutical and biotech companies
8. “Big Data”
Gene chips (DNA microarrays) – rather than a PhD on a few genes, tens of thousands at a
time (& culmination of the Human Genome Project)
High-throughput screening (HTS) – drug discovery; thousands of automated
experiments per day
What to do with the data?
- Paper published
- Data set sometimes published
- Reproduce and expand methodology manually
http://upload.wikimedia.org/wikipedia/commons/thumb/2/22/Affymetrix-microarray.jpg/150px-Affymetrix-microarray.jpg
9. Representations
How to represent or codify ideas? (beyond writing a traditional paper)
Statistician - Business Intelligence Analyst - Data Scientist - Software engineer
- Some coding?
- How much?
- Scientists have been using Fortran for decades (& S+, R, Matlab...)
- GRAIL (1969, RAND Corporation) flow charts and light pens
- http://www.rand.org/content/dam/rand/pubs/research_memoranda/2006/RM6001.pdf
- Bioinformaticians (back in the day) Perl hackers, open source, sharing data
10. Declarative Workflows
Academic paper & data set → encoded as workflow → computed results
What should the set of operations be?
- Deterministic, no side effects
- Common functions between workflows
Functional composition
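As a rough illustration of a workflow built by functional composition, the sketch below chains deterministic, side-effect-free steps. The step names (`drop_missing`, `normalise`, `mean`) and the `compose` helper are hypothetical and chosen only for this example.

```python
from functools import reduce
from typing import Callable


def compose(*steps: Callable) -> Callable:
    """Chain steps left to right: compose(f, g)(x) == g(f(x))."""
    return lambda value: reduce(lambda acc, step: step(acc), steps, value)


def drop_missing(rows):
    """Return a new list without missing values; the input is not modified."""
    return [r for r in rows if r is not None]


def normalise(rows):
    """Scale values by the maximum (assumes a non-empty list with a non-zero maximum)."""
    top = max(rows)
    return [r / top for r in rows]


def mean(rows):
    """Arithmetic mean of a non-empty list."""
    return sum(rows) / len(rows)


# The workflow is itself a value: it can be stored, shared, and re-run.
workflow = compose(drop_missing, normalise, mean)

raw = [2.0, None, 4.0, 8.0]
result = workflow(raw)
print(result)
```

Because every step returns a new value and leaves its input alone, running `workflow(raw)` twice gives the same answer, and each step can be reused in other workflows.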
11. Functional Programming
"Functional programming combines
the flexibility and power of abstract
mathematics with the intuitive clarity
of abstract mathematics."
http://xkcd.com/1270/
12. Declarative vs Imperative
Maths proof scrutiny
- Axioms and deductive steps; describe assumptions
Functional composition
- No side effects!
- The code documents itself!
Combination - no silver bullet (in-memory speed, out-of-core scale)
- “e-Lab notebook” http://ipython.org/notebook.html
- Inline visualisations (see also Mathematica)
- Hadoop does the heavy lifting (ETL)
- Pig, Hive, Cascading (Scalding, Cascalog), Crunch/Scrunch, Java MR
13. Live vs static
A static representation of knowledge does not allow for
discourse with the data and process
- In Phaedrus, Socrates says:
- "Writing shares a strange feature with painting. The offspring
of painting stand there as if they were alive, but if anyone
asks them anything they are solemnly silent"...
- "alone, it cannot defend itself or come to its own support"
Writing programs or solving problems?
Encapsulate and generalize specific instance of a process
- To run again
- To run on similar data (making a tool to solve problems)
Russell Jurney – Agile Data Science book
14. Metadata of datasets
What is this?
- 5.1,3.5,1.4,0.2,setosa
- 4.9,3.0,1.4,0.2,setosa
- 4.7,3.2,1.3,0.2,setosa
- 4.6,3.1,1.5,0.2,setosa
- 5.0,3.6,1.4,0.2,setosa
- 5.4,3.9,1.7,0.4,setosa
- 4.6,3.4,1.4,0.3,setosa
- 5.0,3.4,1.5,0.2,setosa
- 4.4,2.9,1.4,0.2,setosa
15. Modified version of http://archive.ics.uci.edu/ml/datasets/Iris
1. Title: Iris Plants Database
Updated Sept 21 by C.Blake - Added discrepency information
2. Sources:
(a) Creator: R.A. Fisher
(b) Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
(c) Date: July, 1988
3. Past Usage:
- Publications: too many to mention!!! Here are a few.
1. Fisher,R.A. "The use of multiple measurements in taxonomic problems"
Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions
to Mathematical Statistics" (John Wiley, NY, 1950).
2. Duda,R.O., & Hart,P.E. (1973) Pattern Classification and Scene Analysis.
(Q327.D83) John Wiley & Sons. ISBN 0-471-22361-1. See page 218. ...
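One way to keep such rows from becoming anonymous numbers is to attach the column names and types when parsing. This is a sketch only: the column names and the centimetre unit come from the attribute description in the published Iris documentation, which is not part of the excerpt above, and the `COLUMNS` and `parse` names are hypothetical.

```python
import csv
import io

# (name, type) per column; names and units are an assumption taken from the
# Iris attribute description, not from the excerpt shown on the slide.
COLUMNS = [
    ("sepal_length_cm", float),
    ("sepal_width_cm", float),
    ("petal_length_cm", float),
    ("petal_width_cm", float),
    ("species", str),
]

RAW = """5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
"""


def parse(text):
    """Turn bare CSV rows into dictionaries keyed by documented column names."""
    rows = []
    for values in csv.reader(io.StringIO(text)):
        if len(values) != len(COLUMNS):
            raise ValueError(f"expected {len(COLUMNS)} fields, got {len(values)}")
        rows.append({name: cast(value) for (name, cast), value in zip(COLUMNS, values)})
    return rows


rows = parse(RAW)
print(rows[0])
```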
16. Process Methodology
Mathematical method / Scientific method
- Understanding / Characterize from experience &
observation
- Analysis / Hypothesis: a proposed explanation
- Synthesis / Deduction: prediction from the hypothesis
- Review/Extend / Test and experiment
- http://en.wikipedia.org/wiki/Scientific_method#Relationship_with_mathematics
CRISP-DM
- http://en.wikipedia.org/wiki/Cross_Industry_Standard_Process_for_Data_Mining
OSEMN ('awesome') (Hilary Mason)
- Obtaining, Scrubbing, Exploring, Modeling, and
iNterpreting data
- http://www.dataists.com/2010/09/a-taxonomy-of-data-science/
17. Musicmetric (Semetric Ltd)
Analytics for musical artists (and beyond)
- Collecting data from the Internet/APIs - Provenance of Data
- Linked entities
Hadoop-based Big Data processing → NoSQL → RESTful API
- Nathan Marz/“Lambda Architecture”
- http://www.ymc.ch/en/lambda-architecture-part-1
Used by record labels, artist managers, brand owners, festivals, publishers,
broadcasters
19. I read it on the Internet, it must be true?
Collection and archiving of web data is not straightforward
Dealing with noisy or incorrect data
- Issues with data from APIs
- Filter between raw data and data used in analysis (preprocessing/data cleaning)
- Data and metadata retrospectively changing
- Present processed data, with access to raw data
Sampling frequency
- Collect hourly, present daily
- Interpolation to accommodate irregularities in update frequency
Anomalies...
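The "collect hourly, present daily" idea can be sketched as linear interpolation of irregular samples onto midnight timestamps. This is an illustration under assumptions: the sample values are invented, the function names are hypothetical, and linear interpolation stands in for whatever method was actually used.

```python
from bisect import bisect_left
from datetime import datetime, timedelta


def interpolate_at(samples, when):
    """Linearly interpolate a value at `when` from time-sorted (timestamp, value) samples."""
    times = [t for t, _ in samples]
    i = bisect_left(times, when)
    if i < len(times) and times[i] == when:
        return samples[i][1]          # exact observation, no interpolation needed
    if i == 0 or i == len(times):
        return None                   # outside the observed range: do not extrapolate
    (t0, v0), (t1, v1) = samples[i - 1], samples[i]
    fraction = (when - t0) / (t1 - t0)
    return v0 + fraction * (v1 - v0)


def daily_series(samples, start, days):
    """Return one (midnight, value) pair per day, starting at `start`."""
    points = [start + timedelta(days=d) for d in range(days)]
    return [(p, interpolate_at(samples, p)) for p in points]


# Invented fan counts with irregular collection times (no sample at midnight).
samples = [
    (datetime(2013, 11, 1, 22, 0), 1000.0),
    (datetime(2013, 11, 2, 2, 0), 1040.0),
    (datetime(2013, 11, 2, 23, 0), 1100.0),
    (datetime(2013, 11, 3, 1, 0), 1120.0),
]
daily = daily_series(samples, datetime(2013, 11, 2), 2)
for day, value in daily:
    print(day.date(), value)
```

The raw samples stay untouched; the daily series is a derived view, which keeps the "present processed data, with access to raw data" principle intact.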
20. Fake fans
“Fake Fans” or “Fake Followers”
- Social media activity caused by artificially created and controlled social media user profiles → fraud
- “Buying fans” to get noticed
Fan count goes up
- Collect more data, detect and remove anomalous data
- “daily diff” time series – how many fans did I gain today? (compared to yesterday)
Fan count goes down
- Twitter et al try to fix the problem → Massive removal of fans → This is also a problem!
Data Science for pre-processing
- Predict what is normal using all historical data (for artist, for data source)
- Death event detector :-/
http://www.musicmetric.com/2013/04/fake-fans-and-anomaly-detection-at-musicmetric/
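A small sketch of the "daily diff" idea with a simple robust outlier rule. The median/MAD threshold here is a stand-in chosen for brevity, not the model described in the linked Musicmetric post, and the fan counts are invented.

```python
from statistics import median


def daily_diff(counts):
    """How many fans were gained each day compared with the day before."""
    return [today - yesterday for yesterday, today in zip(counts, counts[1:])]


def flag_anomalies(diffs, threshold=5.0):
    """Flag diffs far from the median, measured in median absolute deviations (MAD)."""
    centre = median(diffs)
    mad = median(abs(d - centre) for d in diffs)
    if mad == 0:
        # Every typical day is identical, so any deviation at all is unusual.
        return [d != centre for d in diffs]
    return [abs(d - centre) / mad > threshold for d in diffs]


# Invented series: a purchased spike of 5000 fans, later a mass removal of 3000.
fan_counts = [1000, 1010, 1022, 1031, 6031, 6040, 6052, 3052, 3060]
diffs = daily_diff(fan_counts)
flags = flag_anomalies(diffs)

for diff, is_anomaly in zip(diffs, flags):
    print(diff, "ANOMALY" if is_anomaly else "ok")
```

Both the jump up and the drop down are flagged, which matches the point above that removals by the platform are as much a data problem as the fake fans themselves.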
21. Versioning (raw) Data
Git - https://github.com/blog/1601-see-your-csvs
Dat - version control for data (git alternative)
- http://dat-data.com/
- @maxogden http://strataconf.com/strataeu2013/public/schedule/detail/32390
Figshare
S3 (e.g. Datasift Twitter firehose)
Dropbox? (“consumerisation” of enterprise tools)
TSV not CSV (consider bz2 rather than gzip; don't forget to md5sum)
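The last point can be sketched with the Python standard library: write rows as TSV, compress with bz2, and record an MD5 checksum of the archive. Everything is kept in memory here to avoid side effects; the helper names and example rows are hypothetical.

```python
import bz2
import csv
import hashlib
import io


def to_tsv_bz2(rows):
    """Serialise rows as tab-separated text and compress the result with bz2."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)
    return bz2.compress(buffer.getvalue().encode("utf-8"))


def md5_hex(blob):
    """Checksum of the archive, equivalent to running md5sum on the stored file."""
    return hashlib.md5(blob).hexdigest()


def from_tsv_bz2(blob):
    """Inverse of to_tsv_bz2; every field comes back as a string."""
    text = bz2.decompress(blob).decode("utf-8")
    return list(csv.reader(io.StringIO(text), delimiter="\t"))


rows = [
    ["artist", "note", "fans"],
    ["Simon & Garfunkel", "commas, like this one, need no quoting in TSV", "1040"],
]
archive = to_tsv_bz2(rows)
checksum = md5_hex(archive)

# Verify before trusting the archive, then read it back.
assert md5_hex(archive) == checksum
restored = from_tsv_bz2(archive)
print(checksum, restored == rows)
```

Storing the checksum alongside the archive lets a later reader confirm that the raw data they are analysing is byte-for-byte what was collected.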
23. Using linked data to direct Big Data analytics
Linked data platform
- Profiles for The Beatles
- Puff Daddy/P Diddy, Prince/TAFKAP
- Macklemore & Ryan Lewis, Simon & Garfunkel
- Canonicalise URLs
- Temporal logic? IDs change; not good, but it happens (MusicBrainz NGS)
- RESTful API / UUIDs / external IDs
Manual curation separated from data processing
- Resist all temptation for any manual manipulation of data!
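A sketch of how curated knowledge can sit beside the data rather than inside it: aliases live in a separate, hand-maintained table, and processing only ever looks them up. The `example.com` namespace, the alias table contents, and the URL normalisation rules are assumptions for illustration, not Musicmetric's actual scheme.

```python
import uuid
from urllib.parse import urlsplit, urlunsplit

# Hypothetical namespace for deriving stable artist identifiers.
ARTIST_NS = uuid.uuid5(uuid.NAMESPACE_URL, "http://example.com/artists")

# Manually curated, versioned separately from the raw data and the code.
CURATED_ALIASES = {
    "p diddy": "puff daddy",
    "tafkap": "prince",
}


def artist_id(name):
    """Map any known spelling of an artist name to one stable UUID."""
    key = " ".join(name.lower().split())      # normalise case and whitespace
    key = CURATED_ALIASES.get(key, key)       # apply curation without editing raw data
    return uuid.uuid5(ARTIST_NS, key)


def canonical_url(url):
    """Assumed rules: lower-case host, drop 'www.', trailing slash, query and fragment."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return urlunsplit((parts.scheme.lower(), host, parts.path.rstrip("/"), "", ""))


print(artist_id("P Diddy") == artist_id("Puff Daddy"))
print(canonical_url("http://WWW.Example.com/artist/the-beatles/?ref=home"))
```

A duo such as "Simon & Garfunkel" gets its own identifier in this sketch; deciding whether it should also link to its members is exactly the kind of curation choice that belongs in the table, not in the pipeline.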
24. Future
Data to knowledge - Value chain of data
- Provenance is key to this
Epistemology / Justified True Belief
- Semantic Web
- Big Metadata: internet of things (the archetypes, not the physical objects)
25. Summary
How you made your discovery is as important as the discovery
- Reproduce, debug, verify, share, re-apply
Open Data Science
How to represent (declarative vs imperative, maths vs software engineering)
- Separate Data Science from Software Engineering in a well defined way
- Don't orphan data from how it was computed
Don't rely on your input data/metadata
- a) never changing b) being available
- Version control and share your (meta)data
Resist all temptation for any manual manipulation of data!
Consider the entities you are analysing