Analytics and Data Mining Industry Overview

Analytics Industry Overview:
To Big Data and Beyond!
Gregory Piatetsky
www.KDnuggets.com/gps.html

(c) KDnuggets 2011

My Data Path
• PhD in applying Machine Learning to databases
• Researcher at GTE Labs – started first project
on Knowledge Discovery in Databases in 1989
• Organized first 3 KDD workshops (1989-93),
cofounded KDD conferences and ACM SIGKDD
• Chief Scientist at analytics startup 1998-2001
• Chair, SIGKDD, 2005-2009
• Analytics/Data Mining Consultant, 2001-

KDnuggets
• Stands for Knowledge Discovery Nuggets
• 1993 - started KDnuggets News email newsletter
(~12,000 email subscribers now)
• early website in 1994, www.KDnuggets.com in 1997
– 2011 best year, 45,000-50,000 unique visitors/month
• twitter.com/kdnuggets ~3,000 followers
• facebook.com/kdnuggets page
• group: KDnuggets Analytics & Data Mining

• Recently featured on CNN


KDnuggets mission
Cover the Analytics and Data Mining field:
• News, Jobs, Software, Data (most popular)
• Also Academic positions, CFP, Companies,
Consulting, Courses, Meetings, Polls,
Publications, Solutions, Webcasts

• Subscribe to bi-weekly KDnuggets News at
www.kdnuggets.com/subscribe.html

Analyzing Data or …
• Statistics
• Data mining
• Knowledge Discovery in Data
• KDD
• Analytics
• Data Science
• …?
Core: Finding Useful Patterns in Data


History
• Statistics: 1800 -
• Data dredging, data “fishing” : 1960s
• Data Mining: 1980 –
• Database Mining ~ 1985 (was HNC trademark, not used)
• Knowledge Discovery in Data: 1989 –
– KDD workshop in 1989
• Analytics : 2006 –
– Google Analytics, “Competing on Analytics” book
• Data Science: 2010 –


Pre-history

Statistics is the biggest term in the 20th century, but
data mining and analytics appear in the late 1990s
From Google Ngram viewer – English language books
Note: Our analysis uses only English language data.
Other languages, especially Chinese, need to be considered for the full picture

Recent History:
Analytics, Data Mining, Knowledge Discovery

Analytics has been used since 1800, but started to rise in 2005
Data Mining jumps around 1996 (soon after first KDD conference) but declines after
2003 (TIA controversy, associated with gov. invasion of privacy).
Knowledge Discovery appears in 1989, jumps in 1996, and plateaus after 2000

Google N-gram results are case sensitive

Different capitalizations change the counts, but using lowercase is probably
appropriate to measure general popularity.


Earliest use of “data mining”: 1962?
After eliminating many “following data. Mining cost is” examples,
which refer to the mining of minerals,
and books from “1958” that have a CD attached (errors in book year),

the earliest “data mining” reference I found is

Source: Google Books


Google Trends:
After 2006, Data Mining < Analytics


Google Trends:
Analytics observations

• Google Analytics introduced, Dec 2005
• Competing on Analytics book, Apr 2007
• December vacation drop

Half of “Analytics” searches are for
“Google Analytics”


Excluding Google Analytics


Google Insights: searches for
data mining, analytics -google
are most popular in India, US


Data Mining >> Predictive Analytics


Business, Predictive, Text Analytics


Analytics > Data Mining > Data Science


Data Science, Big Data


Analytics Today

KDnuggets Polls Findings
www.KDnuggets.com/polls/


Where did you apply analytics/data mining?
[Bar chart: percent of respondents, 0% to 30%; avg 2.4 industries]

CRM/ consumer analytics
Banking
Health care/ HR
Fraud Detection
Direct Marketing/ Fundraising
Finance
Telecom / Cable
Science
Insurance
Advertising
Education
Web usage mining
Credit Scoring
Retail
Medical/ Pharma
Manufacturing
e-Commerce
Social Networks
Search / Web content mining
Government/Military
Biotech/Genomics
Investment / Stocks
Entertainment/ Music
Security / Anti-terrorism
Travel / Hospitality
Social Policy/Survey analysis
Junk email / Anti-spam
Other

www.KDnuggets.com/polls/2010/analytics-data-mining-industries-applications.html

Data Types Analyzed/Mined

www.KDnuggets.com/polls/2011/data-types-analyzed-mined.html

Data Types w. Most Growth in 2011
• location/geo/mobile data

• music / audio

• time series

• Genomics, according to John Mattison


Largest Dataset Analyzed?
2011 median dataset size
~10-20 GB,
vs 8-10 GB in 2010.

Increase in
10 GB to 1 PB range

www.KDnuggets.com/polls/2011/largest-dataset-analyzed-data-mined.html

Largest Dataset Analyzed by Region


Which methods/algorithms did you use for data analysis in 2011?
[Bar chart: % analysts who used it, 0% to 70%]

Decision Trees
Regression
Clustering
Statistics
Visualization
Time series/Sequence analysis
Support Vector (SVM)
Association rules
Ensemble methods
Text Mining
Neural Nets
Boosting
Bayesian
Bagging
Factor Analysis
Anomaly/Deviation detection
Social Network Analysis
Survival Analysis
Genetic algorithms
Uplift modeling

www.KDnuggets.com/polls/2011/algorithms-analytics-data-mining.html

Algorithms with highest
Industry Affinity


“Academic” algorithms
lowest Industry affinity


Cloud Analytics is not common (yet)


JOBS AND SKILLS


Shortage of Skills
• McKinsey: shortage by 2018 in the US of
– 140,000-190,000 people with deep analytical skills

– 1.5 M managers/analysts with the know-how to
use the analysis of big data to make effective
decisions.

Source:
www.mckinsey.com/mgi/publications/big_data/


Job data: Data Scientist


Jobs: Data Mining >> Data Scientist


“Ground” Analytics (LinkedIn Skills)

~ 75,000 with Data Mining skill

~ 7,000 with Predictive Modeling

Also
~ 20,000 with Predictive Analytics
(not related to Predictive Modeling ??)


Cloud (Big Data) Analytics Skills


Analytics LinkedIn Skills

Predictive Analytics
Machine Learning
Text Mining
MapReduce


Data Tsunami
• In 2010 enterprises stored 7 exabytes
= 7,000,000,000 GB of new data (McKinsey)
• 90 percent of the world's data has been
generated in the past two years (IBM)
Image with apologies to KDD-2011
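
The unit conversion behind the 7 exabyte figure can be checked with a short sketch. It assumes decimal (SI) units, which the slide does not state explicitly; the function name is illustrative only.

```python
# Assumption: decimal (SI) units, i.e. 1 GB = 10**9 bytes and 1 EB = 10**18 bytes.
BYTES_PER_GB = 10**9
BYTES_PER_EB = 10**18


def exabytes_to_gigabytes(exabytes: int) -> int:
    """Convert a whole number of exabytes to gigabytes (hypothetical helper)."""
    return exabytes * BYTES_PER_EB // BYTES_PER_GB


new_data_gb = exabytes_to_gigabytes(7)
print(f"{new_data_gb:,} GB")
```

With binary units (1 GiB = 2**30 bytes) the number would differ, so the slide's round figure is best read as an SI conversion.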


Big Data Aspects?
• Volume
– Terabytes to Petabytes …
• Velocity
– online streaming
• Variety
– numbers, text, links, images, audio, video, …


Volume + Velocity => No consistency
• CAP Theorem (Eric Brewer, 2000)
For highly scalable distributed systems, you can only
have two of the following:
– 1) consistency,
– 2) high availability, and
– 3) (network) partition tolerance (network failure tolerance)
http://www.julianbrowne.com/article/viewer/brewers-cap-theorem

Implication: Big data solutions must relax strict
consistency if they want high availability during network partitions
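
The trade-off can be illustrated with a toy model. This is a minimal sketch, not a real distributed store: the class names, the "CP"/"AP" mode labels, and the two-replica setup are all assumptions made for illustration, and the partition is just a flag.

```python
class Replica:
    """One copy of a single stored value."""

    def __init__(self):
        self.value = None


class TwoReplicaStore:
    """Toy register with two replicas; mode is "CP" or "AP" (illustrative labels)."""

    def __init__(self, mode):
        self.mode = mode
        self.a = Replica()
        self.b = Replica()
        self.partitioned = False  # True means the replicas cannot talk to each other

    def write(self, replica, value):
        if not self.partitioned:
            # Healthy network: update both copies together.
            self.a.value = value
            self.b.value = value
        elif self.mode == "CP":
            # Prefer consistency: refuse the write, giving up availability.
            raise RuntimeError("unavailable: cannot reach the other replica")
        else:
            # Prefer availability: accept locally, so the copies may diverge.
            replica.value = value

    def read(self, replica):
        return replica.value


# AP store: stays writable during the partition, but reads can disagree.
ap = TwoReplicaStore("AP")
ap.write(ap.a, "v1")
ap.partitioned = True
ap.write(ap.a, "v2")
ap_reads = (ap.read(ap.a), ap.read(ap.b))

# CP store: rejects the write during the partition, so the copies still agree.
cp = TwoReplicaStore("CP")
cp.write(cp.a, "v1")
cp.partitioned = True
try:
    cp.write(cp.a, "v2")
    cp_write_ok = True
except RuntimeError:
    cp_write_ok = False
cp_reads = (cp.read(cp.a), cp.read(cp.b))
```

In the AP run the two replicas return different values after the partition, while in the CP run the write fails and both replicas keep the old value. Real systems use quorums, versioning, or conflict resolution rather than a single flag.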


Big Data
• 2nd Industrial Revolution

• Do old activities better

• Create new activities/businesses


Application areas
• Doing old things better
– Churn prediction
– Direct marketing/Customer modeling
– Recommendations
– Fraud detection
– Security/Intelligence
–…
• Competition will level companies


Limit to Predicting Customer Behavior?
• There is fundamental randomness in human
behavior and once we find 1-level
effects, more data or better algorithms will
give diminishing returns in most cases
• Example: Netflix Prize: the most advanced
algorithms were only a few percentages better
than basic algorithms


Direct Marketing:
Random and Model-sorted Lists
[Chart: CPH (Cumulative Pct Hits, 0 to 100) vs. Pct list (5 to 95), with two curves: Random and Model]
5% of a random list has 5% of hits
5% of a model-score-ranked list has 21% of hits
Lift(5%) = 21%/5% = 4.2
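
The lift calculation above can be sketched in a few lines. The function names and the 20-record dataset below are made up for illustration and are not from the study; only the 21% and 5% figures come from the slide.

```python
def cumulative_pct_hits(scores, hits, pct_list):
    """Percent of all hits found in the top pct_list percent of the score-ranked list."""
    # Rank records from highest to lowest model score.
    ranked = sorted(zip(scores, hits), key=lambda pair: pair[0], reverse=True)
    n_top = round(len(ranked) * pct_list / 100)
    hits_in_top = sum(hit for _, hit in ranked[:n_top])
    return 100.0 * hits_in_top / sum(hits)


def lift(scores, hits, pct_list):
    """Lift = CPH at pct_list divided by pct_list (a random list has lift 1)."""
    return cumulative_pct_hits(scores, hits, pct_list) / pct_list


# Synthetic, made-up data: 20 records, 4 of which are hits (1 = hit, 0 = no hit).
scores = list(range(20, 0, -1))
hits = [1, 1, 0, 1] + [0] * 7 + [1] + [0] * 8

toy_lift = lift(scores, hits, 25)  # the top 5 of 20 records hold 3 of the 4 hits
slide_lift = 21 / 5                # the slide's figures: 21% of hits in 5% of the list
print(toy_lift, slide_lift)
```

In the toy data the top 25% of the ranked list captures 75% of the hits, a lift of 3. Taking the whole list always gives a lift of 1, which is why lift curves converge at the right edge.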

Most lift curves are surprisingly similar
Study of lift curves in banking, telecom
Best lift curves are similar
Special point T = Target percentage
Lift(T) ~ sqrt(1/T)
[Chart: Actual lift(T) and Est. lift(T) vs. 100*T%, with 100*T% from 0 to 25 and Lift from 0 to 14]
G. Piatetsky-Shapiro, B. Masand, Estimating Campaign Benefits and Modeling Lift, in Proceedings of KDD-99 Conference, ACM Press, 1999.
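
The rule of thumb Lift(T) ~ sqrt(1/T) is easy to tabulate. This sketch assumes T is expressed as a fraction between 0 and 1 (so 100*T% on the chart axis is the percentage); the function name and the sample T values are illustrative and not taken from the paper.

```python
import math


def estimated_lift(target_fraction: float) -> float:
    """Rule-of-thumb lift at the special point T, with T given as a fraction in (0, 1]."""
    if not 0 < target_fraction <= 1:
        raise ValueError("target_fraction must be in (0, 1]")
    return math.sqrt(1.0 / target_fraction)


# Illustrative target fractions only; these are not data points from the study.
for t in (0.01, 0.04, 0.25):
    print(f"T = {100 * t:g}%  estimated lift = {estimated_lift(t):.2f}")
```

The estimate falls as T grows: a 1% target gives an estimated lift of about 10, while a 25% target gives about 2, matching the downward-sloping shape of the chart.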


Big Data Enables New Things !
– Google – first big success of big data
– Social networks (Facebook, Twitter, LinkedIn, …)
success depends on network size, i.e. big data

– Location analytics
– Health-care
• Personalized medicine
– Semantics and AI ?
• Imagine IBM Watson, Siri in 2020 ?


Big Data Growth By Industry

Source: http://www.mckinsey.com/mgi/publications/big_data/

Research and Industry Disconnect?
• Uplift modeling – needs more research
• Association rules – need fewer papers
• Data Mining with Privacy research – industry
use?

• KDD conference aims to bring researchers and
industry people together


Hot Growth Areas
• Social Analytics
– Klout
– many Twitter micro-analytics
(twitalyzer, TweetEffect, TweetStats)

• Mobile Analytics
– Privacy and data tracks (KDD Lab, Pisa)

