Analytics Industry Overview:
To Big Data and Beyond!
Gregory Piatetsky
www.KDnuggets.com/gps.html


             (c) KDnuggets 2011   1
My Data Path
• PhD in applying Machine Learning to databases
• Researcher at GTE Labs – started first project
  on Knowledge Discovery in Databases in 1989
• Organized first 3 KDD workshops (1989-93),
  cofounded KDD conferences and ACM SIGKDD
• Chief Scientist at analytics startup 1998-2001
• Chair, SIGKDD, 2005-2009
• Analytics/Data Mining Consultant, 2001-
KDnuggets
• Stands for Knowledge Discovery Nuggets
• 1993 - started KDnuggets News email newsletter (~12,000 email subscribers now)
• early website in 1994, www.KDnuggets.com in 1997
    – 2011 best year, 45,000-50,000 unique visitors/month
• twitter.com/kdnuggets ~3,000 followers
• facebook.com/kdnuggets page
• group: KDnuggets Analytics & Data Mining

• Recently featured on CNN

KDnuggets mission
Cover the Analytics and Data Mining field:
• News, Jobs, Software, Data (most popular)
• Also Academic positions, CFP, Companies,
  Consulting, Courses, Meetings, Polls,
  Publications, Solutions, Webcasts

• Subscribe to bi-weekly KDnuggets News at
  www.kdnuggets.com/subscribe.html
Analyzing Data or …
• Statistics
• Data mining
• Knowledge Discovery in Data
• KDD
• Analytics
• Data Science
• …?
Core: Finding Useful Patterns in Data

History
• Statistics: 1800 -
• Data dredging, data “fishing” : 1960s
• Data Mining: 1980 –
• Database Mining ~ 1985 (was HNC trademark, not used)
• Knowledge Discovery in Data: 1989 –
   – KDD workshop in 1989
• Analytics : 2006 –
   – Google Analytics, “Competing on Analytics” book
• Data Science: 2010 –

Pre-history




Statistics is the most common term in the 20th century, but
data mining and analytics appear in the late 1990s.
From Google Ngram viewer – English language books
Note: Our analysis uses only English language data.
Other languages, especially Chinese, need to be considered for the full picture.
Recent History:
Analytics, Data Mining, Knowledge Discovery




Analytics has been used since 1800, but started to rise in 2005.
Data Mining jumps around 1996 (soon after the first KDD conference) but declines after 2003 (the TIA controversy associated it with government invasion of privacy).
Knowledge Discovery appears in 1989, jumps in 1996, and plateaus after 2000.
Google N-gram results are case sensitive




Different capitalizations change the counts, but using lowercase is probably
appropriate for measuring general popularity.

Earliest use of “data mining” 1962?
After eliminating many “following data. Mining cost is” examples,
which refer to the mining of minerals,
and books dated “1958” that have a CD attached (errors in the book year),
the earliest “data mining” reference I found is:




  Source: Google Books

Google Trends:
After 2006, Data Mining < Analytics




Google Trends: Analytics observations
[Chart annotations:]
• Google Analytics introduced, Dec 2005
• Competing on Analytics book, Apr 2007
• December vacation drop
Half of “Analytics” searches are for “Google Analytics”




Excluding Google Analytics




Google Insights: searches for
data mining, analytics -google
 are most popular in India, US




Data Mining >> Predictive Analytics




Business, Predictive, Text Analytics




Analytics > Data Mining > Data Science




Data Science, Big Data




Analytics Today

KDnuggets Polls Findings
  www.KDnuggets.com/polls/




Where did you apply analytics/data mining?
[Bar chart: percentage axis 0% to 30%; bar values not preserved. Annotation: avg 2.4 industries. Areas in chart order:]
• CRM/consumer analytics
• Banking
• Health care/HR
• Fraud Detection
• Direct Marketing/Fundraising
• Finance
• Telecom/Cable
• Science
• Insurance
• Advertising
• Education
• Web usage mining
• Credit Scoring
• Retail
• Medical/Pharma
• Manufacturing
• e-Commerce
• Social Networks
• Search/Web content mining
• Government/Military
• Biotech/Genomics
• Investment/Stocks
• Entertainment/Music
• Security/Anti-terrorism
• Travel/Hospitality
• Social Policy/Survey analysis
• Junk email/Anti-spam
• Other
www.KDnuggets.com/polls/2010/analytics-data-mining-industries-applications.html
Data Types Analyzed/Mined




www.KDnuggets.com/polls/2011/data-types-analyzed-mined.html
Data Types with Most Growth in 2011
• location/geo/mobile data

• music / audio

• time series

• Genomics, according to John Mattison

Largest Dataset Analyzed?
2011 median dataset size ~10-20 GB, vs 8-10 GB in 2010.
Increase in the 10 GB to 1 PB range.




www.KDnuggets.com/polls/2011/largest-dataset-analyzed-data-mined.html
Largest Dataset Analyzed by Region




Which methods/algorithms did you use for data analysis in 2011
[Bar chart: % analysts who used it, axis 0% to 70%; bar values not preserved. Methods in chart order:]
• Decision Trees
• Regression
• Clustering
• Statistics
• Visualization
• Time series/Sequence analysis
• Support Vector (SVM)
• Association rules
• Ensemble methods
• Text Mining
• Neural Nets
• Boosting
• Bayesian
• Bagging
• Factor Analysis
• Anomaly/Deviation detection
• Social Network Analysis
• Survival Analysis
• Genetic algorithms
• Uplift modeling
www.KDnuggets.com/polls/2011/algorithms-analytics-data-mining.html
Algorithms with highest Industry Affinity




www.KDnuggets.com/polls/2011/algorithms-analytics-data-mining.html
“Academic” algorithms: lowest Industry affinity




www.KDnuggets.com/polls/2011/algorithms-analytics-data-mining.html
Cloud Analytics is not common (yet)




 www.KDnuggets.com/polls/2011/algorithms-analytics-data-mining.html
JOBS AND SKILLS




Shortage of Skills
• McKinsey: shortage by 2018 in the US of
  – 140,000-190,000 people with deep analytical skills
  – 1.5 M managers/analysts with the know-how to use the analysis of big data to make effective decisions.

  Source:
    www.mckinsey.com/mgi/publications/big_data/

Job data: Data Scientist




Jobs: Data Mining >> Data Scientist




“Ground” Analytics (LinkedIn Skills)

~ 75,000 with Data Mining skill
~ 7,000 with Predictive Modeling
Also ~ 20,000 with Predictive Analytics (not related to Predictive Modeling??)




Cloud (Big Data) Analytics Skills




Analytics LinkedIn Skills




[Chart panels: Predictive Analytics, Machine Learning, Text Mining, MapReduce]
Data Tsunami
• In 2010 enterprises stored 7 exabytes = 7,000,000,000 GB of new data (McKinsey)
• 90 percent of the world's data has been generated in the past two years (IBM)
Image with apologies to KDD-2011
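The unit conversion on this slide can be checked with a short sketch. It assumes decimal (SI) prefixes, where 1 exabyte is 10^18 bytes and 1 gigabyte is 10^9 bytes; the helper name is illustrative only.

```python
# Assumption: decimal (SI) prefixes, 1 EB = 10**18 bytes and 1 GB = 10**9 bytes.
BYTES_PER_GB = 10**9
BYTES_PER_EB = 10**18


def exabytes_to_gigabytes(exabytes: int) -> int:
    """Convert exabytes to gigabytes using decimal prefixes (hypothetical helper)."""
    return exabytes * BYTES_PER_EB // BYTES_PER_GB


new_data_gb = exabytes_to_gigabytes(7)
print(f"7 EB = {new_data_gb:,} GB")  # prints "7 EB = 7,000,000,000 GB"
```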

Big Data Aspects?
• Volume
  – Terabytes to Petabytes …
• Velocity
  – online streaming
• Variety
  – numbers, text, links, images, audio, video, …




Volume + Velocity => No consistency
• CAP Theorem (Eric Brewer, 2000)
  For highly scalable distributed systems, you can have only two of the following:
  – 1) consistency,
  – 2) high availability, and
  – 3) (network) partition tolerance (network failure tolerance)
  http://www.julianbrowne.com/article/viewer/brewers-cap-theorem

  Implication: Big data solutions must stop worrying about consistency if they want high availability

Big Data
• 2nd Industrial Revolution

• Do old activities better

• Create new activities/businesses




Application areas
• Doing old things better
  – Churn prediction
  – Direct marketing/Customer modeling
  – Recommendations
  – Fraud detection
  – Security/Intelligence
  –…
• Competition will level the playing field between companies

Limit to Predicting Customer Behavior?
• There is fundamental randomness in human behavior, and once we find first-level effects, more data or better algorithms will give diminishing returns in most cases
• Example: Netflix Prize: the most advanced algorithms were only a few percent better than basic algorithms


Direct Marketing: Random and Model-sorted Lists
[Chart: CPH (Cumulative Pct Hits, 0 to 100) vs. Pct list (5 to 95), with two curves: Random and Model]
5% of a random list has 5% of the hits.
5% of the model-score-ranked list has 21% of the hits.
Lift(5%) = 21%/5% = 4.2
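
The lift arithmetic above can be sketched in code. This is a minimal illustration, not the author's implementation: the function name and the synthetic list of 2,000 prospects (100 hits, 21 of them ranked in the top 5%) are assumptions chosen only to reproduce the slide's numbers.

```python
from typing import Sequence, Tuple


def cph_and_lift(scores: Sequence[float], hits: Sequence[int], pct: float) -> Tuple[float, float]:
    """Return (cumulative pct hits, lift) for the top `pct` percent of a model-sorted list.

    `hits` holds 1 for a responder and 0 otherwise (illustrative encoding).
    """
    if len(scores) != len(hits) or not scores:
        raise ValueError("scores and hits must be non-empty and of equal length")
    total_hits = sum(hits)
    if total_hits == 0:
        raise ValueError("the list must contain at least one hit")
    # Sort prospects by model score, best first.
    ranked = [h for _, h in sorted(zip(scores, hits), key=lambda pair: pair[0], reverse=True)]
    top_n = max(1, round(len(ranked) * pct / 100))
    cph = 100 * sum(ranked[:top_n]) / total_hits
    # A random list is expected to capture hits in proportion to its size.
    random_cph = 100 * top_n / len(ranked)
    return cph, cph / random_cph


# Synthetic data (assumed): 2,000 prospects, 100 hits, 21 hits among the top 100 scores.
n = 2000
scores = [float(n - i) for i in range(n)]
hits = [1] * 21 + [0] * 79 + [1] * 79 + [0] * (n - 179)

cph, lift = cph_and_lift(scores, hits, pct=5)
print(f"CPH(5%) = {cph:.0f}%, Lift(5%) = {lift:.1f}")  # CPH(5%) = 21%, Lift(5%) = 4.2
```

Dividing by the share a random list would capture, rather than by `pct` itself, keeps the ratio correct when rounding makes the selected slice slightly larger or smaller than the requested percentage.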
Most lift curves are surprisingly similar
Study of lift curves in banking, telecom
Best lift curves are similar
Special point T = Target percentage
Lift(T) ~ sqrt(1/T)
[Chart: Actual lift(T) and Est. lift(T) vs. 100*T% (0 to 25); lift axis 0 to 14]
G. Piatetsky-Shapiro, B. Masand, Estimating Campaign Benefits and Modeling Lift, in Proceedings of KDD-99 Conference, ACM Press, 1999.
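
The rule of thumb Lift(T) ~ sqrt(1/T) is easy to tabulate. The sketch below assumes T is expressed as a fraction between 0 and 1 (the chart plots 100*T%), and the function name is hypothetical; it gives only the estimate, not the actual lift values from the study.

```python
import math


def estimated_lift(target_fraction: float) -> float:
    """Estimate lift at T = target fraction via Lift(T) ~ sqrt(1/T)."""
    if not 0 < target_fraction <= 1:
        raise ValueError("target_fraction must be in (0, 1]")
    return math.sqrt(1.0 / target_fraction)


# Tabulate the estimate over the range shown on the chart's horizontal axis.
for pct in (1, 5, 10, 25):
    t = pct / 100
    print(f"T = {pct:>2}%  estimated lift = {estimated_lift(t):.2f}")
```

For example, a target percentage of 1% gives an estimated lift of about 10, and 25% gives about 2.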
Big Data Enables New Things!
– Google – first big success of big data
– Social networks (Facebook, Twitter, LinkedIn, …)
  success depends on network size, i.e. big data

– Location analytics
– Health-care
   • Personalized medicine
– Semantics and AI ?
   • Imagine IBM Watson, Siri in 2020 ?

Big Data Growth By Industry




  Source: http://www.mckinsey.com/mgi/publications/big_data/
Research and Industry Disconnect?
• Uplift modeling – needs more research
• Association rules need fewer papers
• Data Mining with Privacy research – industry use?

• KDD conference aims to bring researchers and
  industry people together


Hot Growth Areas
• Social Analytics
  – Klout
  – many Twitter micro-analytics
    (twitalyzer, TweetEffect, TweetStats)


• Mobile Analytics
  – Privacy and data tracks (KDD Lab, Pisa)



Big Data Bubble?

[Figure: Gartner Hype Cycle, with Big Data marked]

Contenu connexe

Tendances

DataEd Webinar: Implementing Successful Data Strategies - Developing Organiza...
DataEd Webinar: Implementing Successful Data Strategies - Developing Organiza...DataEd Webinar: Implementing Successful Data Strategies - Developing Organiza...
DataEd Webinar: Implementing Successful Data Strategies - Developing Organiza...DATAVERSITY
 
Building a Data Strategy – Practical Steps for Aligning with Business Goals
Building a Data Strategy – Practical Steps for Aligning with Business GoalsBuilding a Data Strategy – Practical Steps for Aligning with Business Goals
Building a Data Strategy – Practical Steps for Aligning with Business GoalsDATAVERSITY
 
Key Elements of a Successful Data Governance Program
Key Elements of a Successful Data Governance ProgramKey Elements of a Successful Data Governance Program
Key Elements of a Successful Data Governance ProgramDATAVERSITY
 
Managing-the-Risks-of-LLMs-in-FS-Industry-Roundtable-TruEra-QuantU.pdf
Managing-the-Risks-of-LLMs-in-FS-Industry-Roundtable-TruEra-QuantU.pdfManaging-the-Risks-of-LLMs-in-FS-Industry-Roundtable-TruEra-QuantU.pdf
Managing-the-Risks-of-LLMs-in-FS-Industry-Roundtable-TruEra-QuantU.pdfQuantUniversity
 
How to Create a Data Analytics Roadmap
How to Create a Data Analytics RoadmapHow to Create a Data Analytics Roadmap
How to Create a Data Analytics RoadmapCCG
 
Data Quality Best Practices
Data Quality Best PracticesData Quality Best Practices
Data Quality Best PracticesDATAVERSITY
 
Emerging Trends in Data Architecture – What’s the Next Big Thing?
Emerging Trends in Data Architecture – What’s the Next Big Thing?Emerging Trends in Data Architecture – What’s the Next Big Thing?
Emerging Trends in Data Architecture – What’s the Next Big Thing?DATAVERSITY
 
Denodo as the Core Pillar of your API Strategy
Denodo as the Core Pillar of your API StrategyDenodo as the Core Pillar of your API Strategy
Denodo as the Core Pillar of your API StrategyDenodo
 
Massive-Scale Entity Resolution Using the Power of Apache Spark and Graph
Massive-Scale Entity Resolution Using the Power of Apache Spark and GraphMassive-Scale Entity Resolution Using the Power of Apache Spark and Graph
Massive-Scale Entity Resolution Using the Power of Apache Spark and GraphDatabricks
 
Improving Healthcare Operations Using Process Data Mining
Improving Healthcare Operations Using Process Data Mining Improving Healthcare Operations Using Process Data Mining
Improving Healthcare Operations Using Process Data Mining Splunk
 
DAS Slides: Data Governance - Combining Data Management with Organizational ...
DAS Slides: Data Governance -  Combining Data Management with Organizational ...DAS Slides: Data Governance -  Combining Data Management with Organizational ...
DAS Slides: Data Governance - Combining Data Management with Organizational ...DATAVERSITY
 
Merchant Churn Prediction Using SparkML at PayPal with Chetan Nadgire and Ani...
Merchant Churn Prediction Using SparkML at PayPal with Chetan Nadgire and Ani...Merchant Churn Prediction Using SparkML at PayPal with Chetan Nadgire and Ani...
Merchant Churn Prediction Using SparkML at PayPal with Chetan Nadgire and Ani...Databricks
 
How to Build & Sustain a Data Governance Operating Model
How to Build & Sustain a Data Governance Operating Model How to Build & Sustain a Data Governance Operating Model
How to Build & Sustain a Data Governance Operating Model DATUM LLC
 
kinds of analytics
kinds of analyticskinds of analytics
kinds of analyticsBenila Paul
 
Data-Ed: Data-centric Strategy & Roadmap
Data-Ed: Data-centric Strategy & RoadmapData-Ed: Data-centric Strategy & Roadmap
Data-Ed: Data-centric Strategy & RoadmapData Blueprint
 
Data science applications and usecases
Data science applications and usecasesData science applications and usecases
Data science applications and usecasesSreenatha Reddy K R
 

Tendances (20)

DataEd Webinar: Implementing Successful Data Strategies - Developing Organiza...
DataEd Webinar: Implementing Successful Data Strategies - Developing Organiza...DataEd Webinar: Implementing Successful Data Strategies - Developing Organiza...
DataEd Webinar: Implementing Successful Data Strategies - Developing Organiza...
 
Building a Data Strategy – Practical Steps for Aligning with Business Goals
Building a Data Strategy – Practical Steps for Aligning with Business GoalsBuilding a Data Strategy – Practical Steps for Aligning with Business Goals
Building a Data Strategy – Practical Steps for Aligning with Business Goals
 
Key Elements of a Successful Data Governance Program
Key Elements of a Successful Data Governance ProgramKey Elements of a Successful Data Governance Program
Key Elements of a Successful Data Governance Program
 
Managing-the-Risks-of-LLMs-in-FS-Industry-Roundtable-TruEra-QuantU.pdf
Managing-the-Risks-of-LLMs-in-FS-Industry-Roundtable-TruEra-QuantU.pdfManaging-the-Risks-of-LLMs-in-FS-Industry-Roundtable-TruEra-QuantU.pdf
Managing-the-Risks-of-LLMs-in-FS-Industry-Roundtable-TruEra-QuantU.pdf
 
Data Product Architectures
Data Product ArchitecturesData Product Architectures
Data Product Architectures
 
How to Create a Data Analytics Roadmap
How to Create a Data Analytics RoadmapHow to Create a Data Analytics Roadmap
How to Create a Data Analytics Roadmap
 
Three Big Data Case Studies
Three Big Data Case StudiesThree Big Data Case Studies
Three Big Data Case Studies
 
Data Quality Best Practices
Data Quality Best PracticesData Quality Best Practices
Data Quality Best Practices
 
Data Science
Data ScienceData Science
Data Science
 
Emerging Trends in Data Architecture – What’s the Next Big Thing?
Emerging Trends in Data Architecture – What’s the Next Big Thing?Emerging Trends in Data Architecture – What’s the Next Big Thing?
Emerging Trends in Data Architecture – What’s the Next Big Thing?
 
Denodo as the Core Pillar of your API Strategy
Denodo as the Core Pillar of your API StrategyDenodo as the Core Pillar of your API Strategy
Denodo as the Core Pillar of your API Strategy
 
Massive-Scale Entity Resolution Using the Power of Apache Spark and Graph
Massive-Scale Entity Resolution Using the Power of Apache Spark and GraphMassive-Scale Entity Resolution Using the Power of Apache Spark and Graph
Massive-Scale Entity Resolution Using the Power of Apache Spark and Graph
 
Improving Healthcare Operations Using Process Data Mining
Improving Healthcare Operations Using Process Data Mining Improving Healthcare Operations Using Process Data Mining
Improving Healthcare Operations Using Process Data Mining
 
DAS Slides: Data Governance - Combining Data Management with Organizational ...
DAS Slides: Data Governance -  Combining Data Management with Organizational ...DAS Slides: Data Governance -  Combining Data Management with Organizational ...
DAS Slides: Data Governance - Combining Data Management with Organizational ...
 
Merchant Churn Prediction Using SparkML at PayPal with Chetan Nadgire and Ani...
Merchant Churn Prediction Using SparkML at PayPal with Chetan Nadgire and Ani...Merchant Churn Prediction Using SparkML at PayPal with Chetan Nadgire and Ani...
Merchant Churn Prediction Using SparkML at PayPal with Chetan Nadgire and Ani...
 
How to Build & Sustain a Data Governance Operating Model
How to Build & Sustain a Data Governance Operating Model How to Build & Sustain a Data Governance Operating Model
How to Build & Sustain a Data Governance Operating Model
 
kinds of analytics
kinds of analyticskinds of analytics
kinds of analytics
 
Data-Ed: Data-centric Strategy & Roadmap
Data-Ed: Data-centric Strategy & RoadmapData-Ed: Data-centric Strategy & Roadmap
Data-Ed: Data-centric Strategy & Roadmap
 
Data science applications and usecases
Data science applications and usecasesData science applications and usecases
Data science applications and usecases
 
Data Analytics
Data AnalyticsData Analytics
Data Analytics
 

En vedette

NumPy and SciPy for Data Mining and Data Analysis Including iPython, SciKits,...
NumPy and SciPy for Data Mining and Data Analysis Including iPython, SciKits,...NumPy and SciPy for Data Mining and Data Analysis Including iPython, SciKits,...
NumPy and SciPy for Data Mining and Data Analysis Including iPython, SciKits,...Ryan Rosario
 
Machine Learning and Data Mining: 19 Mining Text And Web Data
Machine Learning and Data Mining: 19 Mining Text And Web DataMachine Learning and Data Mining: 19 Mining Text And Web Data
Machine Learning and Data Mining: 19 Mining Text And Web DataPier Luca Lanzi
 
Data Mining: an Introduction
Data Mining: an IntroductionData Mining: an Introduction
Data Mining: an IntroductionAli Abbasi
 
Text mining presentation in Data mining Area
Text mining presentation in Data mining AreaText mining presentation in Data mining Area
Text mining presentation in Data mining AreaMahamudHasanCSE
 
Knowledge Discovery and Data Mining
Knowledge Discovery and Data MiningKnowledge Discovery and Data Mining
Knowledge Discovery and Data MiningAmritanshu Mehra
 
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & KamberChapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & Kambererror007
 
Data warehousing and data mining
Data warehousing and data miningData warehousing and data mining
Data warehousing and data miningSnehali Chake
 
Data Mining and Data Warehousing
Data Mining and Data WarehousingData Mining and Data Warehousing
Data Mining and Data WarehousingAmdocs
 
introduction to data mining tutorial
introduction to data mining tutorial introduction to data mining tutorial
introduction to data mining tutorial Salah Amean
 
Data Mining and Business Intelligence Tools
Data Mining and Business Intelligence ToolsData Mining and Business Intelligence Tools
Data Mining and Business Intelligence ToolsMotaz Saad
 
Data Mining: Application and trends in data mining
Data Mining: Application and trends in data miningData Mining: Application and trends in data mining
Data Mining: Application and trends in data miningDataminingTools Inc
 
Ch 1 Intro to Data Mining
Ch 1 Intro to Data MiningCh 1 Intro to Data Mining
Ch 1 Intro to Data MiningSushil Kulkarni
 
Data mining slides
Data mining slidesData mining slides
Data mining slidessmj
 

En vedette (19)

NumPy and SciPy for Data Mining and Data Analysis Including iPython, SciKits,...
NumPy and SciPy for Data Mining and Data Analysis Including iPython, SciKits,...NumPy and SciPy for Data Mining and Data Analysis Including iPython, SciKits,...
NumPy and SciPy for Data Mining and Data Analysis Including iPython, SciKits,...
 
Machine Learning and Data Mining: 19 Mining Text And Web Data
Machine Learning and Data Mining: 19 Mining Text And Web DataMachine Learning and Data Mining: 19 Mining Text And Web Data
Machine Learning and Data Mining: 19 Mining Text And Web Data
 
Big Data v Data Mining
Big Data v Data MiningBig Data v Data Mining
Big Data v Data Mining
 
Data mining and_big_data_web
Data mining and_big_data_webData mining and_big_data_web
Data mining and_big_data_web
 
Data Mining Overview
Data Mining OverviewData Mining Overview
Data Mining Overview
 
Lecture 01 Data Mining
Lecture 01 Data MiningLecture 01 Data Mining
Lecture 01 Data Mining
 
Data Mining: an Introduction
Data Mining: an IntroductionData Mining: an Introduction
Data Mining: an Introduction
 
Text mining presentation in Data mining Area
Text mining presentation in Data mining AreaText mining presentation in Data mining Area
Text mining presentation in Data mining Area
 
Knowledge Discovery and Data Mining
Knowledge Discovery and Data MiningKnowledge Discovery and Data Mining
Knowledge Discovery and Data Mining
 
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & KamberChapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
 
Data warehousing and data mining
Data warehousing and data miningData warehousing and data mining
Data warehousing and data mining
 
Data Mining and Data Warehousing
Data Mining and Data WarehousingData Mining and Data Warehousing
Data Mining and Data Warehousing
 
introduction to data mining tutorial
introduction to data mining tutorial introduction to data mining tutorial
introduction to data mining tutorial
 
Data Mining and Business Intelligence Tools
Data Mining and Business Intelligence ToolsData Mining and Business Intelligence Tools
Data Mining and Business Intelligence Tools
 
Data Mining: Application and trends in data mining
Data Mining: Application and trends in data miningData Mining: Application and trends in data mining
Data Mining: Application and trends in data mining
 
Ch 1 Intro to Data Mining
Ch 1 Intro to Data MiningCh 1 Intro to Data Mining
Ch 1 Intro to Data Mining
 
DATA WAREHOUSING AND DATA MINING
DATA WAREHOUSING AND DATA MININGDATA WAREHOUSING AND DATA MINING
DATA WAREHOUSING AND DATA MINING
 
Data mining
Data miningData mining
Data mining
 
Data mining slides
Data mining slidesData mining slides
Data mining slides
 

Similaire à Analytics and Data Mining Industry Overview

NoSQL & Big Data Analytics: History, Hype, Opportunities
NoSQL & Big Data Analytics: History, Hype, OpportunitiesNoSQL & Big Data Analytics: History, Hype, Opportunities
NoSQL & Big Data Analytics: History, Hype, OpportunitiesVishy Poosala
 
Data Science - An emerging Stream of Science with its Spreading Reach & Impact
Data Science - An emerging Stream of Science with its Spreading Reach & ImpactData Science - An emerging Stream of Science with its Spreading Reach & Impact
Data Science - An emerging Stream of Science with its Spreading Reach & ImpactDr. Sunil Kr. Pandey
 
New Opportunities for Connected Data - Emil Eifrem @ GraphConnect Boston + Ch...
New Opportunities for Connected Data - Emil Eifrem @ GraphConnect Boston + Ch...New Opportunities for Connected Data - Emil Eifrem @ GraphConnect Boston + Ch...
New Opportunities for Connected Data - Emil Eifrem @ GraphConnect Boston + Ch...Neo4j
 
Trends in DM.pptx
Trends in DM.pptxTrends in DM.pptx
Trends in DM.pptxImXaib
 
Using Graphs to Enable National-Scale Analytics
Using Graphs to Enable National-Scale AnalyticsUsing Graphs to Enable National-Scale Analytics
Using Graphs to Enable National-Scale AnalyticsNeo4j
 
BIMCV, Banco de Imagen Medica de la Comunidad Valenciana. María de la Iglesia
BIMCV, Banco de Imagen Medica de la Comunidad Valenciana. María de la IglesiaBIMCV, Banco de Imagen Medica de la Comunidad Valenciana. María de la Iglesia
BIMCV, Banco de Imagen Medica de la Comunidad Valenciana. María de la IglesiaMaria de la Iglesia
 
Data Mining - Presentation.pptx
Data Mining - Presentation.pptxData Mining - Presentation.pptx
Data Mining - Presentation.pptxfahadusman23
 
Bring survey sampling techniques into big data
Bring survey sampling techniques into big dataBring survey sampling techniques into big data
Bring survey sampling techniques into big dataAntoine Rebecq
 
Presentation emerging tecnology
Presentation  emerging tecnologyPresentation  emerging tecnology
Presentation emerging tecnologyAmalAltarge
 
Manila Workshop Strategies for web data dissemination
Manila Workshop Strategies for web data disseminationManila Workshop Strategies for web data dissemination
Manila Workshop Strategies for web data disseminationZoltan Nagy
 
SC6 Workshop 1: Big Data Europe platform requirements and draft architecture:...
SC6 Workshop 1: Big Data Europe platform requirements and draft architecture:...SC6 Workshop 1: Big Data Europe platform requirements and draft architecture:...
SC6 Workshop 1: Big Data Europe platform requirements and draft architecture:...BigData_Europe
 
Big data a possible game changer for e-governance
Big data   a possible game changer for e-governanceBig data   a possible game changer for e-governance
Big data a possible game changer for e-governanceSomenath Nag
 
BIMCV: The Perfect "Big Data" Storm.
BIMCV: The Perfect "Big Data" Storm. BIMCV: The Perfect "Big Data" Storm.
BIMCV: The Perfect "Big Data" Storm. maigva
 
Data Mining: Future Trends and Applications
Data Mining: Future Trends and ApplicationsData Mining: Future Trends and Applications
Data Mining: Future Trends and ApplicationsIJMER
 

Analytics and Data Mining Industry Overview

  • 1. Analytics Industry Overview: To Big Data and Beyond ! Gregory Piatetsky www.KDnuggets.com/gps.html (c) KDnuggets 2011 1
  • 2. My Data Path • PhD in applying Machine Learning to databases • Researcher at GTE Labs – started first project on Knowledge Discovery in Databases in 1989 • Organized first 3 KDD workshops (1989-93), cofounded KDD conferences and ACM SIGKDD • Chief Scientist at analytics startup 1998-2001 • Chair, SIGKDD, 2005-2009 • Analytics/Data Mining Consultant, 2001-
  • 3. KDnuggets • Stands for Knowledge Discovery Nuggets • 1993 - started KDnuggets News email newsletter (~12,000 email subscribers now) • early website in 1994, www.KDnuggets.com in 1997 – 2011 best year, 45,000-50,000 unique visitors/month • twitter.com/kdnuggets ~3,000 followers • facebook.com/kdnuggets page • group: KDnuggets Analytics & Data Mining • Recently featured on CNN
  • 4. KDnuggets mission: Cover the Analytics and Data Mining field: • News, Jobs, Software, Data (most popular) • Also Academic positions, CFP, Companies, Consulting, Courses, Meetings, Polls, Publications, Solutions, Webcasts • Subscribe to bi-weekly KDnuggets News at www.kdnuggets.com/subscribe.html
  • 5. Analyzing Data or … • Statistics • Data mining • Knowledge Discovery in Data • KDD • Analytics • Data Science • …? Core: Finding Useful Patterns in Data
  • 6. History • Statistics: 1800 – • Data dredging, data “fishing”: 1960s • Data Mining: 1980 – • Database Mining ~ 1985 (was HNC trademark, not used) • Knowledge Discovery in Data: 1989 – (KDD workshop in 1989) • Analytics: 2006 – (Google Analytics, “Competing on Analytics” book) • Data Science: 2010 –
  • 7. Pre-history: Statistics is the biggest term in the 20th century, but data mining and analytics appear in the late 1990s. From Google Ngram Viewer (English-language books). Note: Our analysis uses only English-language data. Other languages, especially Chinese, need to be considered for the full picture.
  • 8. Recent History: Analytics, Data Mining, Knowledge Discovery. Analytics has been used since 1800, but started to rise in 2005. Data Mining jumps around 1996 (soon after the first KDD conference) but declines after 2003 (TIA controversy, associated with government invasion of privacy). Knowledge Discovery appears in 1989, jumps in 1996, and plateaus after 2000.
  • 9. Google N-gram results are case sensitive. Different capitalizations change the counts, but using lowercase is probably appropriate to measure general popularity.
  • 10. Earliest use of “data mining”: 1962? After eliminating many “following data. Mining cost is” examples, which refer to mining of minerals, and books from “1958” that have a CD attached (errors in book year), the earliest “data mining” reference I found is shown on the slide (image not reproduced here). Source: Google Books
  • 11. Google Trends: After 2006, Data Mining < Analytics
  • 12. Google Trends: Analytics observations • Google Analytics introduced, Dec 2005 • Competing on Analytics book, Apr 2007 • December vacation drop
  • 13. Half of “Analytics” searches are for “Google Analytics”
  • 14. Excluding Google Analytics
  • 15. Google Insights: searches for data mining, analytics -google are most popular in India, US
  • 16. Data Mining >> Predictive Analytics
  • 17. Business, Predictive, Text Analytics
  • 18. Analytics > Data Mining > Data Science
  • 19. Data Science, Big Data
  • 20. Analytics Today: KDnuggets Polls Findings www.KDnuggets.com/polls/
  • 21. Where did you apply analytics/data mining? [Bar chart: share of respondents per application area, axis 0% to 30%; chart annotation: avg 2.4] Areas listed: CRM/consumer analytics; Banking; Health care/HR; Fraud Detection; Direct Marketing/Fundraising; Finance; Telecom/Cable; Science; Insurance; Advertising; Education; Web usage mining; Credit Scoring; Retail industries; Medical/Pharma; Manufacturing; e-Commerce; Social Networks; Search/Web content mining; Government/Military; Biotech/Genomics; Investment/Stocks; Entertainment/Music; Security/Anti-terrorism; Travel/Hospitality; Social Policy/Survey analysis; Junk email/Anti-spam; Other. www.KDnuggets.com/polls/2010/analytics-data-mining-industries-applications.html
  • 23. Data Types with Most Growth in 2011 • location/geo/mobile data • music/audio • time series • Genomics, according to John Mattison
  • 24. Largest Dataset Analyzed? 2011 median dataset size ~10-20 GB, vs. 8-10 GB in 2010. Increase in the 10 GB to 1 PB range. www.KDnuggets.com/polls/2011/largest-dataset-analyzed-data-mined.html
  • 25. Largest Dataset Analyzed by Region
  • 26. Which methods/algorithms did you use for data analysis in 2011? [Bar chart: % of analysts who used each method, axis 0% to 70%] Methods listed: Decision Trees; Regression; Clustering; Statistics; Visualization; Time series/Sequence analysis; Support Vector (SVM); Association rules; Ensemble methods; Text Mining; Neural Nets; Boosting; Bayesian; Bagging; Factor Analysis; Anomaly/Deviation detection; Social Network Analysis; Survival Analysis; Genetic algorithms; Uplift modeling. www.KDnuggets.com/polls/2011/algorithms-analytics-data-mining.html
  • 27. Algorithms with highest Industry Affinity www.KDnuggets.com/polls/2011/algorithms-analytics-data-mining.html
  • 28. “Academic” algorithms: lowest Industry affinity www.KDnuggets.com/polls/2011/algorithms-analytics-data-mining.html
  • 29. Cloud Analytics is not common (yet) www.KDnuggets.com/polls/2011/algorithms-analytics-data-mining.html
  • 30. JOBS AND SKILLS
  • 31. Shortage of Skills • McKinsey: shortage by 2018 in the US of – 140,000-190,000 people with deep analytical skills – 1.5 M managers/analysts with the know-how to use the analysis of big data to make effective decisions. Source: www.mckinsey.com/mgi/publications/big_data/
  • 32. Job data: Data Scientist
  • 33. Jobs: Data Mining >> Data Scientist
  • 34. “Ground” Analytics (LinkedIn Skills) ~ 75,000 with Data Mining skill ~ 7,000 with Predictive Modeling. Also ~ 20,000 with Predictive Analytics (not related to Predictive Modeling??)
  • 35. Cloud (Big Data) Analytics Skills
  • 36. Analytics LinkedIn Skills: Predictive Analytics, Machine Learning, Text Mining, MapReduce
  • 37. Data Tsunami • In 2010 enterprises stored 7 exabytes = 7,000,000,000 GB of new data (McKinsey) • 90 percent of the world's data has been generated in the past two years (IBM) [Image with apologies to KDD-2011]
  • 38. Big Data Aspects? • Volume – Terabytes to Petabytes … • Velocity – online streaming • Variety – numbers, text, links, images, audio, video, …
  • 39. Volume + Velocity => No consistency • CAP Theorem (Eric Brewer, 2000): For highly scalable distributed systems, you can only have two of the following: – 1) consistency, – 2) high availability, and – 3) (network) partition tolerance (network failure tolerance) http://www.julianbrowne.com/article/viewer/brewers-cap-theorem Implication: Big data solutions must stop worrying about consistency if they want high availability
  • 40. Big Data • 2nd Industrial Revolution • Do old activities better • Create new activities/businesses
  • 41. Application areas • Doing old things better – Churn prediction – Direct marketing/Customer modeling – Recommendations – Fraud detection – Security/Intelligence – … • Competition will level companies
  • 42. Limit to Predicting Customer Behavior? • There is fundamental randomness in human behavior, and once we find 1-level effects, more data or better algorithms will give diminishing returns in most cases • Example: Netflix Prize: the most advanced algorithms were only a few percent better than basic algorithms
  • 43. Direct Marketing: Random and Model-sorted Lists [Chart: CPH (Cumulative Pct Hits, 0 to 100) vs. Pct list (5 to 95), with Random and Model curves] 5% of the random list has 5% of hits; 5% of the model-score-ranked list has 21% of hits. Lift(5%) = 21%/5% = 4.2
  • 44. Most lift curves are surprisingly similar. Study of lift curves in banking, telecom. [Chart: Actual lift(T) and Est. lift(T) vs. 100*T%, lift axis 0 to 14] Best lift curves are similar. Special point: T = target percentage. Lift(T) ~ sqrt(1/T). G. Piatetsky-Shapiro, B. Masand, Estimating Campaign Benefits and Modeling Lift, in Proceedings of KDD-99 Conference, ACM Press, 1999.
  • 45. Big Data Enables New Things! – Google – first big success of big data – Social networks (Facebook, Twitter, LinkedIn, …): success depends on network size, i.e. big data – Location analytics – Health care • Personalized medicine – Semantics and AI? • Imagine IBM Watson, Siri in 2020?
  • 46. Big Data Growth By Industry. Source: http://www.mckinsey.com/mgi/publications/big_data/
  • 47. Research and Industry Disconnect? • Uplift modeling – needs more research • Association rules need fewer papers • Data Mining with Privacy research – industry use? • KDD conference aims to bring researchers and industry people together
  • 48. Hot Growth Areas • Social Analytics – Klout – many Twitter micro-analytics (twitalyzer, TweetEffect, TweetStats) • Mobile Analytics – Privacy and data tracks (KDD Lab, Pisa)
  • 49. Big Data Bubble? Big Data Gartner Hype Cycle
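
The lift arithmetic on slides 43 and 44 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the toy outcome list are hypothetical and do not come from the slides, and T is treated as a fraction (0.05 for 5%), which is one reading of the "100*T%" axis label on slide 44.

```python
import math


def lift(pct_hits_captured: float, pct_list_contacted: float) -> float:
    """Lift = cumulative percent of hits captured / percent of list contacted."""
    return pct_hits_captured / pct_list_contacted


def cumulative_pct_hits(outcomes_sorted: list[int], pct_list: float) -> float:
    """Percent of all hits found in the top pct_list percent of a list.

    outcomes_sorted holds 0/1 outcomes, already ordered by descending
    model score (hypothetical input format, chosen for this sketch).
    """
    top_k = round(len(outcomes_sorted) * pct_list / 100)
    total_hits = sum(outcomes_sorted)
    return 100.0 * sum(outcomes_sorted[:top_k]) / total_hits


def estimated_lift(target_fraction: float) -> float:
    """Rule of thumb from slide 44: Lift(T) ~ sqrt(1/T), with T as a fraction."""
    return math.sqrt(1.0 / target_fraction)


# Figures from slide 43: the top 5% of the model-ranked list holds 21% of hits.
slide_43_lift = lift(21.0, 5.0)

# Invented toy data: 10 prospects sorted by model score, 3 of them are hits.
toy_outcomes = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
toy_cph_at_20 = cumulative_pct_hits(toy_outcomes, 20.0)
toy_lift_at_20 = lift(toy_cph_at_20, 20.0)

print(f"slide 43 lift at 5%: {slide_43_lift:.1f}")
print(f"toy lift at 20%: {toy_lift_at_20:.2f}")
print(f"estimated lift for T = 0.05: {estimated_lift(0.05):.2f}")
```

A random list gives a lift of 1 at every depth, since contacting x% of it captures about x% of the hits. The square-root estimate is the slide's empirical approximation for good models, not an exact law, so it should be read as a rough planning figure.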

Editor's notes

  1. Boris Evelson (Forrester) also adds a 4th V: Variability (meaning not constant)