My keynote talk at the San Diego Superdata conference, looking at the history and current state of Analytics and Data Mining and examining the effects of Big Data
1. Analytics Industry Overview:
To Big Data and Beyond!
Gregory Piatetsky
www.KDnuggets.com/gps.html
(c) KDnuggets 2011 1
2. My Data Path
• PhD in applying Machine Learning to databases
• Researcher at GTE Labs – started first project
on Knowledge Discovery in Databases in 1989
• Organized first 3 KDD workshops (1989-93),
cofounded KDD conferences and ACM SIGKDD
• Chief Scientist at analytics startup 1998-2001
• Chair, SIGKDD, 2005-2009
• Analytics/Data Mining Consultant, 2001-
3. KDnuggets
• Stands for Knowledge Discovery Nuggets
• 1993 - started KDnuggets News email newsletter (~12,000 email subscribers now)
• early website in 1994, www.KDnuggets.com in 1997
– 2011 best year, 45,000-50,000 unique visitors/month
• twitter.com/kdnuggets ~3,000 followers
• facebook.com/kdnuggets page
• group: KDnuggets Analytics & Data Mining
• Recently featured on CNN
4. KDnuggets mission
Cover the Analytics and Data Mining field:
• News, Jobs, Software, Data (most popular)
• Also Academic positions, CFP, Companies,
Consulting, Courses, Meetings, Polls,
Publications, Solutions, Webcasts
• Subscribe to bi-weekly KDnuggets News at
www.kdnuggets.com/subscribe.html
5. Analyzing Data or …
• Statistics
• Data mining
• Knowledge Discovery in Data
• KDD
• Analytics
• Data Science
• …?
Core: Finding Useful Patterns in Data
6. History
• Statistics: 1800 -
• Data dredging, data “fishing” : 1960s
• Data Mining: 1980 –
• Database Mining ~ 1985 (was HNC trademark, not used)
• Knowledge Discovery in Data: 1989 –
– KDD workshop in 1989
• Analytics : 2006 –
– Google Analytics, “Competing on Analytics” book
• Data Science: 2010 –
7. Pre-history
Statistics is the biggest term in the 20th century, but
data mining and analytics appear in the late 1990s
From Google Ngram viewer – English language books
Note: Our analysis uses only English language data.
Other languages, especially Chinese, need to be considered for the full picture
8. Recent History:
Analytics, Data Mining, Knowledge Discovery
Analytics has been used since 1800, but started to rise in 2005
Data Mining jumps around 1996 (soon after first KDD conference) but declines after
2003 (TIA controversy, associated with gov. invasion of privacy).
Knowledge Discovery appears in 1989, jumps in 1996, and plateaus after 2000
9. Google N-gram Results are case sensitive
Different capitalizations change the counts, but using lowercase is probably
appropriate to measure general popularity.
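The effect of capitalization on counts can be illustrated on a toy text. This is only a minimal sketch: the `count_phrase` helper and the sample sentence are hypothetical, and nothing here queries the Google Ngram Viewer itself.

```python
def count_phrase(text, phrase, case_sensitive=True):
    """Count non-overlapping occurrences of phrase in text."""
    if not case_sensitive:
        # Fold both sides to lowercase so every capitalization matches.
        text, phrase = text.lower(), phrase.lower()
    return text.count(phrase)


# Hypothetical sample text with four capitalizations of the same term.
sample = ("Data Mining grew fast. Early data mining papers cite "
          "Data mining and DATA MINING alike.")

exact_lowercase = count_phrase(sample, "data mining")
any_case = count_phrase(sample, "data mining", case_sensitive=False)
print(exact_lowercase, any_case)
```

A case-sensitive lowercase query sees only one of the four mentions here, which is why the choice of capitalization matters when comparing term popularity.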
10. Earliest use of “data mining” 1962?
After eliminating many “following data. Mining cost is ” examples
which refer to Mining of minerals,
and books from “1958” that have a CD attached (errors in book year)
The earliest “data mining” reference I found is
Source: Google Books
12. Google Trends:
Analytics observations
• Google Analytics introduced, Dec 2005
• “Competing on Analytics” book, Apr 2007
• December vacation drop
13. Half of “Analytics” searches are for
“Google Analytics”
23. Data Types w. Most Growth in 2011
• location/geo/mobile data
• music / audio
• time series
• Genomics, according to John Mattison
24. Largest Dataset Analyzed?
2011 median dataset size ~10-20 GB, vs 8-10 GB in 2010.
Increase in the 10 GB to 1 PB range.
www.KDnuggets.com/polls/2011/largest-dataset-analyzed-data-mined.html
26. Which methods/algorithms did you use for data analysis in 2011?
[Bar chart: % of analysts who used each method, axis 0% to 70%. Methods shown:]
Decision Trees
Regression
Clustering
Statistics
Visualization
Time series/Sequence analysis
Support Vector (SVM)
Association rules
Ensemble methods
Text Mining
Neural Nets
Boosting
Bayesian
Bagging
Factor Analysis
Anomaly/Deviation detection
Social Network Analysis
Survival Analysis
Genetic algorithms
Uplift modeling
www.KDnuggets.com/polls/2011/algorithms-analytics-data-mining.html
27. Algorithms with highest
Industry Affinity
www.KDnuggets.com/polls/2011/algorithms-analytics-data-mining.html
31. Shortage of Skills
• McKinsey: shortage by 2018 in the US of
– 140,000-190,000 people with deep analytical skills
– 1.5 M managers/analysts with the know-how to
use the analysis of big data to make effective
decisions.
Source:
www.mckinsey.com/mgi/publications/big_data/
34. “Ground” Analytics (LinkedIn Skills)
~ 75,000 with Data Mining skill
~ 7,000 with Predictive Modeling
Also
~ 20,000 with Predictive Analytics
(not related to Predictive Modeling??)
37. Data Tsunami
• In 2010 enterprises stored 7 exabytes (= 7,000,000,000 GB) of new data (McKinsey)
• 90 percent of the world's data has been generated in the past two years (IBM)
Image with apologies to KDD-2011
38. Big Data Aspects?
• Volume
– Terabytes to Petabytes …
• Velocity
– online streaming
• Variety
– numbers, text, links, images, audio, video, …
39. Volume + Velocity => No consistency
• CAP Theorem (Eric Brewer, 2000)
For highly scalable distributed systems, you can have only
two of the following:
– 1) consistency,
– 2) high availability, and
– 3) (network) partition tolerance (network failure tolerance)
http://www.julianbrowne.com/article/viewer/brewers-cap-theorem
Implication: Big data solutions must stop worrying
about consistency if they want high availability
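The trade-off can be sketched with a toy two-replica store. This is a deliberately simplified illustration, not a real distributed system: the `TwoReplicaStore` class, its `mode` flag and the `partitioned` switch are all hypothetical names introduced here.

```python
class TwoReplicaStore:
    """Toy store holding one value on two replicas.

    mode="CP" refuses writes during a partition (stays consistent).
    mode="AP" accepts writes locally during a partition (stays available).
    """

    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.values = [None, None]   # one slot per replica
        self.partitioned = False     # True = replicas cannot talk

    def write(self, replica, value):
        """Return True if the write was accepted."""
        if not self.partitioned:
            # Healthy network: update both replicas together.
            self.values = [value, value]
            return True
        if self.mode == "CP":
            # Cannot reach the other replica, so reject the write.
            return False
        # AP: accept on the reachable replica only; replicas now diverge.
        self.values[replica] = value
        return True

    def read(self, replica):
        return self.values[replica]


def run_scenario(mode):
    store = TwoReplicaStore(mode)
    store.write(0, "a")              # replicated to both
    store.partitioned = True         # network failure
    accepted = store.write(0, "b")   # write arrives during the partition
    return accepted, store.read(0), store.read(1)


print("CP:", run_scenario("CP"))
print("AP:", run_scenario("AP"))
```

In the CP run the write is refused and both replicas still agree; in the AP run the write succeeds but the two replicas return different values until the partition heals and they are reconciled.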
40. Big Data
• 2nd Industrial Revolution
• Do old activities better
• Create new activities/businesses
41. Application areas
• Doing old things better
– Churn prediction
– Direct marketing/Customer modeling
– Recommendations
– Fraud detection
– Security/Intelligence
–…
• Competition will level the playing field among companies
42. Limit to Predicting Customer Behavior?
• There is fundamental randomness in human
behavior, and once we find first-level
effects, more data or better algorithms will
give diminishing returns in most cases
• Example: Netflix Prize: the most advanced
algorithms were only a few percent better
than basic algorithms
43. Direct Marketing:
Random and Model-sorted Lists
[Chart: CPH (Cumulative Pct Hits, 0 to 100) vs. Pct list (5 to 95), for Random and Model-sorted lists]
5% of a random list has 5% of hits
5% of a model-score ranked list has 21% of hits.
Lift(5%) = 21%/5% = 4.2
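The lift calculation above can be sketched in a few lines. This is a minimal sketch on made-up data: the function names `cumulative_pct_hits` and `lift` and the 20-record toy list are assumptions for illustration, not the data behind the chart.

```python
def cumulative_pct_hits(scores, labels, pct_list):
    """Percent of all hits found in the top pct_list percent of the
    list, after sorting records by model score (highest first)."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    cutoff = max(1, round(len(ranked) * pct_list / 100))
    hits_in_top = sum(label for _, label in ranked[:cutoff])
    return 100.0 * hits_in_top / sum(labels)


def lift(cph, pct_list):
    """Lift = cumulative pct of hits / pct of list contacted."""
    return cph / pct_list


# The slide's numbers: 21% of hits in the top 5% of the list.
slide_lift = lift(21, 5)

# Hypothetical toy list: 20 records, 4 hits (label 1), scores 20 down to 1.
toy_scores = list(range(20, 0, -1))
toy_labels = [1, 1, 0, 1, 0] + [0, 0, 0, 1, 0] + [0] * 10
toy_cph = cumulative_pct_hits(toy_scores, toy_labels, 25)
toy_lift = lift(toy_cph, 25)

print(slide_lift, toy_cph, toy_lift)
```

A random list would give a lift of about 1 at every cutoff, so values above 1 measure how much the model ranking concentrates hits near the top.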
44. Most lift curves are surprisingly similar
Study of lift curves in banking, telecom
Best lift curves are similar
Special point T = Target percentage
Lift(T) ~ sqrt (1/T)
[Chart: Actual lift(T) and Est. lift(T) (Lift, 0 to 14) vs. 100*T% (0 to 25)]
G. Piatetsky-Shapiro, B. Masand, Estimating Campaign Benefits and Modeling Lift, in Proceedings of KDD-99 Conference, ACM Press, 1999.
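The rule of thumb Lift(T) ~ sqrt(1/T) is easy to tabulate. The sketch below only evaluates the slide's formula for a few target percentages; the function name `estimated_lift` is an assumption, and the values are estimates from the formula, not the study's measured lifts.

```python
import math


def estimated_lift(target_fraction):
    """Estimated lift at the special point T, using Lift(T) ~ sqrt(1/T).

    target_fraction is T expressed as a fraction (e.g. 0.04 for 4%).
    """
    if not 0 < target_fraction <= 1:
        raise ValueError("target_fraction must be in (0, 1]")
    return math.sqrt(1.0 / target_fraction)


# Target percentages within the chart's 0 to 25 range.
for pct in (1, 4, 10, 25):
    print(f"T = {pct:>2}%  estimated lift = {estimated_lift(pct / 100):.2f}")
```

The estimate falls quickly as the target percentage grows: rare targets (around 1%) allow a lift near 10, while common targets (25%) allow only about 2.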
45. Big Data Enables New Things!
– Google – first big success of big data
– Social networks (Facebook, Twitter, LinkedIn, …)
success depends on network size, i.e. big data
– Location analytics
– Health-care
• Personalized medicine
– Semantics and AI ?
• Imagine IBM Watson, Siri in 2020 ?
46. Big Data Growth By Industry
Source: http://www.mckinsey.com/mgi/publications/big_data/
47. Research and Industry Disconnect?
• Uplift modeling – needs more research
• Association rules need fewer papers
• Data Mining with Privacy research – industry
use?
• KDD conference aims to bring researchers and
industry people together
48. Hot Growth Areas
• Social Analytics
– Klout
– many Twitter micro-analytics
(twitalyzer, TweetEffect, TweetStats)
• Mobile Analytics
– Privacy and data tracks (KDD Lab, Pisa)