From the webinar presentation "Data Science: Not Just for Big Data", hosted by Kalido and presented by:
David Smith, Data Scientist at Revolution Analytics, and
Gregory Piatetsky, Editor, KDnuggets
These are the slides for David Smith's portion of the presentation.
Watch the full webinar at:
http://www.kalido.com/data-science.htm
2. Big Data: the new oil?
Photo: Sarah&Boston (flickr: pocheco) Creative Commons BY-SA 2.0
Revolution Confidential
3. Big Data is just raw material
Data Distillation
Extract quantities of interest
Find complete cases
Derive missing information
Big Data Pitfalls:
Data cleanliness & accuracy
Observational bias
Do the data I have represent the population I’m interested in?
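The distillation steps above can be sketched in a few lines. This is only an illustration under assumed data: the records, the field names (customer, visits, spend) and the helper functions are hypothetical and do not come from the presentation.

```python
# Hypothetical raw records; None marks a missing value.
records = [
    {"customer": "a", "visits": 12, "spend": 240.0},
    {"customer": "b", "visits": None, "spend": 75.0},
    {"customer": "c", "visits": 3, "spend": None},
    {"customer": "d", "visits": 8, "spend": 96.0},
]


def complete_cases(rows, fields):
    """Keep only rows where every listed field is present."""
    return [r for r in rows if all(r.get(f) is not None for f in fields)]


def spend_per_visit(row):
    """A quantity of interest derived from two raw fields."""
    return row["spend"] / row["visits"]


usable = complete_cases(records, ["visits", "spend"])
distilled = {r["customer"]: spend_per_visit(r) for r in usable}

# Fraction of the raw data that survives distillation.
kept_fraction = len(usable) / len(records)
print(distilled, kept_fraction)
```

Dropping incomplete rows is the simplest choice, but it can itself introduce observational bias if values are not missing at random, so the kept fraction is worth checking before drawing conclusions.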
4. Surveys & Experiments
Even with Big Data, the data you need isn’t always in the building!
… so ask (survey)!
Survey design
Stratified sampling
… or experiment!
A/B Testing
Experimental Design
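As a rough illustration of analyzing an A/B test, the sketch below compares two conversion rates with a pooled two-proportion z-test. The slides do not name a specific test, so this choice, the function name and the example counts are all assumptions.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that A and B convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: 100 of 1000 convert on variant A, 130 of 1000 on B.
z, p_value = two_proportion_z_test(100, 1000, 130, 1000)
print(round(z, 2), round(p_value, 3))
```

The normal approximation used here is reasonable only when each group has a fair number of conversions and non-conversions; for small counts an exact test would be a safer choice.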
5. Data Exploration & Visualization
Limited by pixels
Big data = a big black blob
Extract signal from noise
Aggregations
Heat maps
Smoothing
Small multiples
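One way to picture the aggregation idea: rather than drawing every point, count the points that fall in each cell of a coarse grid and plot the counts as a heat map. The sketch below computes only the counts; the function name, the grid size and the simulated points are assumptions made for illustration.

```python
import random


def bin_to_grid(points, x_range, y_range, nx, ny):
    """Count points per cell of an nx-by-ny grid (the data behind a heat map)."""
    (x_min, x_max), (y_min, y_max) = x_range, y_range
    grid = [[0] * nx for _ in range(ny)]
    for x, y in points:
        if not (x_min <= x <= x_max and y_min <= y <= y_max):
            continue  # ignore points outside the plotting window
        # Points exactly on the upper edge go into the last cell.
        col = min(int((x - x_min) / (x_max - x_min) * nx), nx - 1)
        row = min(int((y - y_min) / (y_max - y_min) * ny), ny - 1)
        grid[row][col] += 1
    return grid


# 100,000 simulated points would overplot into a blob; 64 cell counts do not.
rng = random.Random(0)
points = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(100_000)]
grid = bin_to_grid(points, (-4, 4), (-4, 4), nx=8, ny=8)
in_window = sum(sum(row) for row in grid)
```

The grid can then be passed to any plotting tool that draws a matrix as colored cells; the cell size sets the trade-off between detail and noise.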
6. Statistical Modeling & Forecasting
You don’t always need big data
Sampling can help with observational bias
Model selection
Feature extraction
Confounding?
Interactions?
Model validation
Overfitting
Prediction
Extrapolation
Confidence
http://xkcd.com/605/
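The overfitting and validation points can be illustrated with a hold-out split: fit on one half of the data, then compare the error on that half with the error on the unseen half. The sketch below is an assumption-laden toy, not the presenter's method. It uses simulated data, a straight-line fit, and a deliberately overfit "nearest point" model that memorizes the training set.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sxy / sxx
    a = my - b * mx
    return lambda x: a + b * x


def fit_nearest(xs, ys):
    """Memorize the training data: predict the y of the closest training x."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def mse(model, xs, ys):
    """Mean squared prediction error of a fitted model."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Simulated data: a linear trend plus noise (hypothetical parameters).
rng = random.Random(1)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [2.0 + 0.5 * x + rng.gauss(0, 1) for x in xs]
train_x, test_x = xs[:100], xs[100:]
train_y, test_y = ys[:100], ys[100:]

for name, fit in [("line", fit_line), ("nearest", fit_nearest)]:
    model = fit(train_x, train_y)
    print(name, mse(model, train_x, train_y), mse(model, test_x, test_y))
```

The memorizing model reproduces its training data exactly, so only the held-out error says anything about how it predicts. Neither model has seen x values outside 0 to 10, so predictions beyond that range would be extrapolation.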
7. Summary
Big Data is great, but think of it as the “raw materials” for data science
After refining, “big” isn’t always so “Big”
Use statistical insight to avoid pitfalls:
Inferences: Observational bias / Sampling bias
Predictions: Confounding / Overfitting
Think about variances and means (risk!)
Some data scientists may miss these issues
Look for statistical expertise
Further reading:
ComputerWorld: 12 predictive analytics screw-ups