SlideShare une entreprise Scribd logo
1  sur  45
Télécharger pour lire hors ligne
Data Mining
Data Mining and Text Mining (UIC 583 @ Politecnico)
Lecture outline                            2



  Why Data Mining?
  What is Data Mining?
  What are the typical tasks?
  What are the primitives?
  What are the typical applications?
  What are the major issues?




                   Prof. Pier Luca Lanzi
Why
Data Mining?
Why Data Mining?                              4
“Necessity is the mother of invention”

  Explosive Growth of Data
     Terabytes of available data
     Data collections and data availability
     Major sources of abundant data

  Pressing need for the automated analysis of massive data




                   Prof. Pier Luca Lanzi
Evolution of Database Technology                        5



  1960s:
     Data collection, database creation, IMS and network DBMS

  1970s:
     Relational data model, relational DBMS implementation

  1980s:
     RDBMS, advanced data models
     (extended-relational, OO, deductive, etc.)
     Application-oriented DBMS
     (spatial, scientific, engineering, etc.)

  1990s:
     Data mining, data warehousing, multimedia databases,
     and Web databases

  2000s
     Stream data management and mining
     Data mining and its applications
     Web technology (XML, data integration)
     Global information systems


                      Prof. Pier Luca Lanzi
Examples                                       6



  In vitro fertilization
     Given: embryos described by 60 features
     Problem: selection of embryos that will survive
     Data: historical records of embryos and outcome

  Cow culling
    Given: cows described by 700 features
    Problem: selection of cows that should be culled
    Data: historical records and farmers’ decisions




                  Prof. Pier Luca Lanzi
Examples                                        7



  Customer attrition
     Given: customer information for the past months
     Problem: predict who is likely to attrite next month,
     or estimate customer value
     Data: historical customer records

  Credit assessment
     Given: a loan application
     Problem: predict whether the bank should
     approve the loan
     Data: records from other loans




                  Prof. Pier Luca Lanzi
What is
Data Mining?
What is Data Mining?                           9



  The non-trivial process of identifying
     valid
     novel
     potentially useful, and
     ultimately understandable patterns in data.

  Alternative names,
     Data Fishing, Data Dredging (1960-)
     Data Mining (1990-), used by DB and business
     Knowledge Discovery in Databases (1989-), used by AI
     Business Intelligence, Information Harvesting,
     Information Discovery, Knowledge Extraction, ...

  Currently, Data Mining and Knowledge Discovery
  are used interchangeably


                  Prof. Pier Luca Lanzi
Example: Credit Risk                     10




      loan
             not repaid                       repaid




                                         salary

                 Prof. Pier Luca Lanzi
Example: Credit Risk                                      11




                                             IF salary<k THEN not repaid
      loan




                                         k               salary

                 Prof. Pier Luca Lanzi
Example: Credit Risk                             12



  Is it valid?
      The pattern has to be valid with respect
      to a certainty level (rule true for the 86%)

  Is it novel?
      The value k should be previously
      unknown or obvious

  Is it useful?
      The pattern should provide information
      useful to the bank for assessing credit risk

  Is it understandable?




                   Prof. Pier Luca Lanzi
What is the general idea?                      13



  Build computer programs that sift through databases
  automatically, seeking regularities or patterns

  There will be problems
     Most patterns are banal and uninteresting
     Most patterns are spurious, inexact, or contingent on
     accidental coincidences in the particular dataset used
     Real data is imperfect: Some parts will be garbled,
     and some will be missing

  Algorithms need to be robust enough to cope with imperfect
  data and to extract regularities that are inexact but useful




                   Prof. Pier Luca Lanzi
What are the related fields?                  14




                                          Machine
        Visualization
                                          Learning

                  Knowledge Discovery
                    And Data Mining

           Statistics                     Databases


                  Prof. Pier Luca Lanzi
Statistics, Machine Learning,                  15
and Data Mining

  Statistics:
     more theory-based, focused on testing hypotheses
  Machine learning
     more heuristic, focused on building program
     that learns, more general than Data Mining
  Knowledge Discovery
     integrates theory and heuristics
     focus on the entire process of discovery, including
     data cleaning, learning, integration and visualization
  Data Mining
     focus on the algorithms to extract patterns from data



                  Distinctions are blurred!

                   Prof. Pier Luca Lanzi
Why Not Traditional Data Analysis?                  16



  Tremendous amount of data
     High scalability to handle terabytes of data

  High-dimensionality of data
     Micro-array may have tens of thousands of dimensions

  High complexity of data
     Data streams and sensor data
     Time-series data, temporal data, sequence data
     Structure data, graphs, social networks
     and multi-linked data
     Heterogeneous databases and legacy databases
     Spatial, spatiotemporal, multimedia, text and Web data
     Software programs, scientific simulations

  New and sophisticated applications



                     Prof. Pier Luca Lanzi
Knowledge Discovery Process                            17



     raw data




      selection


                  cleaning
                                                      evaluation

                        transformation


                                             mining
                     Prof. Pier Luca Lanzi
Knowledge Discovery Process                    18
What are the main steps?

  Learning the application domain to extract
  relevant prior knowledge and goals
  Data selection
  Data cleaning
  Data reduction and transformation
  Mining
     Select the mining approach: classification,
     regression, association, clustering, etc.
     Choosing the mining algorithm(s)
     Perform mining: search for patterns of interest
  Pattern evaluation and knowledge presentation
     visualization, transformation,
     removing redundant patterns, etc.
  Use of discovered knowledge


                  Prof. Pier Luca Lanzi
Knowledge Discovery and                                19
Business Intelligence


    Increasing potential
    to support business                                       End User
         decisions                Making
                                 Decisions


                           Data Presentation                  Business
                        Visualization Techniques               Analyst

                               Data Mining                        Data
                          Information Discovery                 Analyst

                            Data Exploration
              Statistical Analysis, Querying and Reporting

                              OLAP, MDA
                      Data Warehouses / Data Marts
                                                                     DBA
                               Data Sources
       Paper, Files, Information Providers, Database Systems, OLTP

                   Prof. Pier Luca Lanzi
Architecture of a Typical                 20
Knowledge Discovery System



            Graphical user interface


              Pattern evaluation
                                               KB
             Data mining engine


      Database or data warehouse server




               DB                   DW


                 Prof. Pier Luca Lanzi
What tasks?
Major Data Mining Tasks                          22



  Classification: predicting an item class
  Clustering: finding clusters in data
  Associations: frequent occurring events…
  Visualization: to facilitate human discovery
  Summarization: describing a group
  Deviation Detection: finding changes
  Estimation: predicting a continuous value
  Link Analysis: finding relationship




                   Prof. Pier Luca Lanzi
Data Mining Tasks: classification                          23




                                              IF salary<k THEN not repaid
       loan
                   ?
                                                                ?



                                          k               salary

                  Prof. Pier Luca Lanzi
Data Mining Tasks: classification                24



  Classification and Prediction
     Finding models (functions) that describe
     and distinguish classes or concepts
     The goal is to describe the data or
     to make future prediction
     E.g., classify countries based on climate,
     or classify cars based on gas mileage
     Presentation: decision-tree, classification rule, neural
     network
     Prediction: Predict some unknown numerical values




                   Prof. Pier Luca Lanzi
Data Mining Tasks: clustering            25




                 Prof. Pier Luca Lanzi
Data Mining Tasks: clustering                   26



  Cluster analysis
     The class label is unknown
     Group data to form new classes, e.g., cluster houses to
     find distribution patterns
     Clustering based on the principle: maximizing the intra-
     class similarity and minimizing the interclass similarity




                   Prof. Pier Luca Lanzi
Data Mining Tasks: associations                       27




 Bread         Bread                       Steak     Jam
 Peanuts       Jam                         Jam       Soda
 Milk          Soda                        Soda      Peanuts
 Fruit         Chips                       Chips     Milk
 Jam           Milk                        Bread     Fruit
               Fruit


   Is there something interesting?
 Jam            Fruit                      Fruit     Fruit
 Soda           Soda                       Soda      Peanuts
 Chips          Chips                      Peanuts   Cheese
 Milk           Milk                       Milk      Yogurt
 Bread




                   Prof. Pier Luca Lanzi
Data Mining Tasks: associations                28



  Association Rule Mining
     Finds interesting associations and/or correlation
     relationships among large set of data items.
     E.g., 98% of people who purchase tires and auto
     accessories also get automotive services done




                  Prof. Pier Luca Lanzi
Data Mining Tasks: others                          29



  Outlier analysis
    Outlier: a data object that does not comply with the
    general behavior of the data
    It can be considered as noise or exception but is quite
    useful in fraud detection, rare events analysis

  Trend and evolution analysis
     Trend and deviation: regression analysis
     Sequential pattern mining, periodicity analysis
     Similarity-based analysis

  Text Mining, Graph Mining, Data Streams

  Other pattern-directed or statistical analyses


                   Prof. Pier Luca Lanzi
Are all the “Discovered” Patterns              30
Interesting?

  Data Mining may generate thousands of patterns,
  not all of them are interesting.
     Suggested approach: Human-centered, query-based,
     focused mining

  Interestingness measures: a pattern is interesting if it is
  easily understood by humans, valid on new or test data with
  some degree of certainty, potentially useful, novel, or
  validates some hypothesis that a user seeks to confirm

  Objective vs. subjective interestingness measures:
    Objective: based on statistics and structures of patterns,
    e.g., support, confidence, etc.
    Subjective: based on user’s belief in the data, e.g.,
    unexpectedness, novelty, etc.



                  Prof. Pier Luca Lanzi
Can we find all and only                           31
interesting patterns?

  Completeness: Find all the interesting patterns
    Can a data mining system find all
    the interesting patterns?
    Association vs. classification vs. clustering

  Optimization: Search for only interesting patterns:
    Can a data mining system find only
    the interesting patterns?
    Approaches
      • First general all the patterns and then filter out the
        uninteresting ones.
      • Generate only the interesting patterns—mining query
        optimization




                   Prof. Pier Luca Lanzi
Data Mining tasks                              32



  General functionality
    Descriptive data mining
    Predictive data mining

  Different views, different classifications
      Kinds of data to be mined
      Kinds of knowledge to be discovered
      Kinds of techniques utilized
      Kinds of applications adapted




                    Prof. Pier Luca Lanzi
What
primitives?
Primitives that Define a Data Mining Task      34



  Task-relevant data
  Type of knowledge to be mined
  Background knowledge
  Pattern interestingness measurements
  Visualization/presentation of discovered patterns




                   Prof. Pier Luca Lanzi
Primitive 1:                                35
Task-Relevant Data

  Database or data warehouse name
  Database tables or data warehouse cubes
  Condition for data selection
  Relevant attributes or dimensions
  Data grouping criteria




                 Prof. Pier Luca Lanzi
Primitive 2:                               36
Types of Knowledge to Be Mined

  Characterization
  Discrimination
  Association
  Classification/prediction
  Clustering
  Outlier analysis
  Other data mining tasks




                   Prof. Pier Luca Lanzi
Primitive 3:                                  37
Background Knowledge

  A typical kind of background knowledge: Concept hierarchies

  Schema hierarchy
     E.g., Street < City < ProvinceOrState < Country

  Set-grouping hierarchy
     E.g., {20-39} = young, {40-59} = middle_aged

  Operation-derived hierarchy
    email address: hagonzal@cs.uiuc.edu
    login-name < department < university < country

  Rule-based hierarchy
     LowProfitMargin (X) <= Price(X, P1) and Cost (X, P2)
     and (P1 - P2) < $50


                  Prof. Pier Luca Lanzi
Primitive 4:                             38
Pattern Interestingness Measure

  Simplicity
  Certainty
  Utility
  Novelty




                 Prof. Pier Luca Lanzi
Primitive 5:                                      39
Presentation of Discovered Patterns

  Different backgrounds/usages may require
  different forms of representation
      E.g., rules, tables, crosstabs, pie/bar chart, etc.

  Concept hierarchy is also important
    Discovered knowledge might be more understandable
    when represented at high level of abstraction
    Interactive drill up/down, pivoting, slicing and dicing
    provide different perspectives to data

  Different kinds of knowledge require different representation:
  association, classification, clustering, etc.




                    Prof. Pier Luca Lanzi
Integration of Data Mining and                 40
Data Warehousing

  Data mining systems, DBMS, Data warehouse systems
  coupling
     No coupling, loose-coupling, semi-tight-coupling, tight-
     coupling
  On-line analytical mining data
     integration of mining and OLAP technologies
  Interactive mining multi-level knowledge
     Necessity of mining knowledge and patterns at different
     levels of abstraction by drilling/rolling, pivoting,
     slicing/dicing, etc.
  Integration of multiple mining functions
      Characterized classification, first clustering and then
     association




                  Prof. Pier Luca Lanzi
Coupling Data Mining with                     41
Data bases and Datawarehouses

  No coupling—flat file processing, not recommended
  Loose coupling
     Fetching data from DB/DW
  Semi-tight coupling—enhanced DM performance
     Provide efficient implement a few data mining primitives
     in a DB/DW system, e.g., sorting, indexing, aggregation,
     histogram analysis, multiway join, precomputation of
     some stat functions
  Tight coupling—A uniform information processing
  environment
     DM is smoothly integrated into a DB/DW system, mining
     query is optimized based on mining query, indexing,
     query processing methods, etc.




                  Prof. Pier Luca Lanzi
What issues?
Major Issues in Data Mining                                43



  Mining methodology
      Mining different kinds of knowledge from diverse data types, e.g., bio,
      stream, Web
      Performance: efficiency, effectiveness, and scalability
      Pattern evaluation: the interestingness problem
      Incorporation of background knowledge
      Handling noise and incomplete data
      Parallel, distributed and incremental mining methods
      Integration of the discovered knowledge with existing one: knowledge
      fusion

  User interaction
     Data mining query languages and ad-hoc mining
     Expression and visualization of data mining results
     Interactive mining of knowledge at multiple levels of abstraction

  Applications and social impacts
     Domain-specific data mining & invisible data mining
     Protection of data security, integrity, and privacy




                       Prof. Pier Luca Lanzi
Summary
Summary                                        45



 Data mining: Discovering interesting patterns
 from large amounts of data
 A natural evolution of database technology,
 in great demand, with wide applications
 A KDD process includes data cleaning, data integration,
 data selection, transformation, data mining,
 pattern evaluation, and knowledge presentation
 Data mining functionalities: characterization, discrimination,
 association, classification, clustering,
 outlier and trend analysis, etc.
 Data mining systems and architectures
 Major issues in data mining




                  Prof. Pier Luca Lanzi

Contenu connexe

Tendances

Big Data Real Time Applications
Big Data Real Time ApplicationsBig Data Real Time Applications
Big Data Real Time Applications
DataWorks Summit
 
Data mining slides
Data mining slidesData mining slides
Data mining slides
smj
 
Data Warehouse Architectures
Data Warehouse ArchitecturesData Warehouse Architectures
Data Warehouse Architectures
Theju Paul
 
Major issues in data mining
Major issues in data miningMajor issues in data mining
Major issues in data mining
Slideshare
 

Tendances (20)

Data mining in social network
Data mining in social networkData mining in social network
Data mining in social network
 
Data warehouse architecture
Data warehouse architecture Data warehouse architecture
Data warehouse architecture
 
Big Data - Applications and Technologies Overview
Big Data - Applications and Technologies OverviewBig Data - Applications and Technologies Overview
Big Data - Applications and Technologies Overview
 
Big Data Real Time Applications
Big Data Real Time ApplicationsBig Data Real Time Applications
Big Data Real Time Applications
 
Data mining slides
Data mining slidesData mining slides
Data mining slides
 
Data Mining Concepts
Data Mining ConceptsData Mining Concepts
Data Mining Concepts
 
Introduction to Data Mining
Introduction to Data MiningIntroduction to Data Mining
Introduction to Data Mining
 
Data Mining Techniques
Data Mining TechniquesData Mining Techniques
Data Mining Techniques
 
Big Data
Big DataBig Data
Big Data
 
Big Data Ppt PowerPoint Presentation Slides
Big Data Ppt PowerPoint Presentation Slides Big Data Ppt PowerPoint Presentation Slides
Big Data Ppt PowerPoint Presentation Slides
 
DATA WAREHOUSING
DATA WAREHOUSINGDATA WAREHOUSING
DATA WAREHOUSING
 
Data Warehouse Architectures
Data Warehouse ArchitecturesData Warehouse Architectures
Data Warehouse Architectures
 
introduction to data mining tutorial
introduction to data mining tutorial introduction to data mining tutorial
introduction to data mining tutorial
 
Data warehousing
Data warehousingData warehousing
Data warehousing
 
Data Science
Data ScienceData Science
Data Science
 
Presentation on Big Data
Presentation on Big DataPresentation on Big Data
Presentation on Big Data
 
Classification and prediction
Classification and predictionClassification and prediction
Classification and prediction
 
Big Data - The 5 Vs Everyone Must Know
Big Data - The 5 Vs Everyone Must KnowBig Data - The 5 Vs Everyone Must Know
Big Data - The 5 Vs Everyone Must Know
 
Major issues in data mining
Major issues in data miningMajor issues in data mining
Major issues in data mining
 
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; Kamber
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; KamberChapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; Kamber
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; Kamber
 

En vedette

Big Data v Data Mining
Big Data v Data MiningBig Data v Data Mining
Big Data v Data Mining
University of Hertfordshire
 

En vedette (17)

Data Mining: an Introduction
Data Mining: an IntroductionData Mining: an Introduction
Data Mining: an Introduction
 
Text mining presentation in Data mining Area
Text mining presentation in Data mining AreaText mining presentation in Data mining Area
Text mining presentation in Data mining Area
 
Analytics and Data Mining Industry Overview
Analytics and Data Mining Industry OverviewAnalytics and Data Mining Industry Overview
Analytics and Data Mining Industry Overview
 
Machine Learning and Data Mining: 19 Mining Text And Web Data
Machine Learning and Data Mining: 19 Mining Text And Web DataMachine Learning and Data Mining: 19 Mining Text And Web Data
Machine Learning and Data Mining: 19 Mining Text And Web Data
 
Data mining and_big_data_web
Data mining and_big_data_webData mining and_big_data_web
Data mining and_big_data_web
 
Big Data v Data Mining
Big Data v Data MiningBig Data v Data Mining
Big Data v Data Mining
 
Data Mining Overview
Data Mining OverviewData Mining Overview
Data Mining Overview
 
NumPy and SciPy for Data Mining and Data Analysis Including iPython, SciKits,...
NumPy and SciPy for Data Mining and Data Analysis Including iPython, SciKits,...NumPy and SciPy for Data Mining and Data Analysis Including iPython, SciKits,...
NumPy and SciPy for Data Mining and Data Analysis Including iPython, SciKits,...
 
Knowledge Discovery and Data Mining
Knowledge Discovery and Data MiningKnowledge Discovery and Data Mining
Knowledge Discovery and Data Mining
 
Data warehousing and data mining
Data warehousing and data miningData warehousing and data mining
Data warehousing and data mining
 
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & KamberChapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
 
Data Mining and Data Warehousing
Data Mining and Data WarehousingData Mining and Data Warehousing
Data Mining and Data Warehousing
 
Data Mining and Business Intelligence Tools
Data Mining and Business Intelligence ToolsData Mining and Business Intelligence Tools
Data Mining and Business Intelligence Tools
 
Data Mining: Application and trends in data mining
Data Mining: Application and trends in data miningData Mining: Application and trends in data mining
Data Mining: Application and trends in data mining
 
Ch 1 Intro to Data Mining
Ch 1 Intro to Data MiningCh 1 Intro to Data Mining
Ch 1 Intro to Data Mining
 
DATA WAREHOUSING AND DATA MINING
DATA WAREHOUSING AND DATA MININGDATA WAREHOUSING AND DATA MINING
DATA WAREHOUSING AND DATA MINING
 
Data mining
Data miningData mining
Data mining
 

Similaire à Lecture 01 Data Mining

ODSC East 2017: Data Science Models For Good
ODSC East 2017: Data Science Models For GoodODSC East 2017: Data Science Models For Good
ODSC East 2017: Data Science Models For Good
Karry Lu
 
PowerPoint Template
PowerPoint TemplatePowerPoint Template
PowerPoint Template
butest
 
Ci2004-10.doc
Ci2004-10.docCi2004-10.doc
Ci2004-10.doc
butest
 

Similaire à Lecture 01 Data Mining (20)

DMTM Lecture 02 Data mining
DMTM Lecture 02 Data miningDMTM Lecture 02 Data mining
DMTM Lecture 02 Data mining
 
Dbm630_lecture01
Dbm630_lecture01Dbm630_lecture01
Dbm630_lecture01
 
Dbm630 Lecture01
Dbm630 Lecture01Dbm630 Lecture01
Dbm630 Lecture01
 
Text Mining : Experience
Text Mining : ExperienceText Mining : Experience
Text Mining : Experience
 
Graph
GraphGraph
Graph
 
Introduction to-data-mining chapter 1
Introduction to-data-mining  chapter 1Introduction to-data-mining  chapter 1
Introduction to-data-mining chapter 1
 
ODSC East 2017: Data Science Models For Good
ODSC East 2017: Data Science Models For GoodODSC East 2017: Data Science Models For Good
ODSC East 2017: Data Science Models For Good
 
Ml pluss ejan2013
Ml pluss ejan2013Ml pluss ejan2013
Ml pluss ejan2013
 
20120718 linkedopendataandnextgenerationsciencemcguinnessesip final
20120718 linkedopendataandnextgenerationsciencemcguinnessesip final20120718 linkedopendataandnextgenerationsciencemcguinnessesip final
20120718 linkedopendataandnextgenerationsciencemcguinnessesip final
 
UNIT2-Data Mining.pdf
UNIT2-Data Mining.pdfUNIT2-Data Mining.pdf
UNIT2-Data Mining.pdf
 
First, Firster, Firstest: Three lessons from history on information overload
First, Firster, Firstest: Three lessons from history on information overloadFirst, Firster, Firstest: Three lessons from history on information overload
First, Firster, Firstest: Three lessons from history on information overload
 
PowerPoint Template
PowerPoint TemplatePowerPoint Template
PowerPoint Template
 
Ci2004-10.doc
Ci2004-10.docCi2004-10.doc
Ci2004-10.doc
 
Big Data [sorry] & Data Science: What Does a Data Scientist Do?
Big Data [sorry] & Data Science: What Does a Data Scientist Do?Big Data [sorry] & Data Science: What Does a Data Scientist Do?
Big Data [sorry] & Data Science: What Does a Data Scientist Do?
 
Upstate CSCI 525 Data Mining Chapter 1
Upstate CSCI 525 Data Mining Chapter 1Upstate CSCI 525 Data Mining Chapter 1
Upstate CSCI 525 Data Mining Chapter 1
 
isd314-01
isd314-01isd314-01
isd314-01
 
Data mining & Decison Trees
Data mining & Decison TreesData mining & Decison Trees
Data mining & Decison Trees
 
Data mining Introduction
Data mining IntroductionData mining Introduction
Data mining Introduction
 
Exploring the Information Ecosystem
Exploring the Information EcosystemExploring the Information Ecosystem
Exploring the Information Ecosystem
 
Introduction to Data Mining and technologies .ppt
Introduction to Data Mining and technologies .pptIntroduction to Data Mining and technologies .ppt
Introduction to Data Mining and technologies .ppt
 

Plus de Pier Luca Lanzi

Plus de Pier Luca Lanzi (20)

11 Settembre 2021 - Giocare con i Videogiochi
11 Settembre 2021 - Giocare con i Videogiochi11 Settembre 2021 - Giocare con i Videogiochi
11 Settembre 2021 - Giocare con i Videogiochi
 
Breve Viaggio al Centro dei Videogiochi
Breve Viaggio al Centro dei VideogiochiBreve Viaggio al Centro dei Videogiochi
Breve Viaggio al Centro dei Videogiochi
 
Global Game Jam 19 @ POLIMI - Morning Welcome
Global Game Jam 19 @ POLIMI - Morning WelcomeGlobal Game Jam 19 @ POLIMI - Morning Welcome
Global Game Jam 19 @ POLIMI - Morning Welcome
 
Data Driven Game Design @ Campus Party 2018
Data Driven Game Design @ Campus Party 2018Data Driven Game Design @ Campus Party 2018
Data Driven Game Design @ Campus Party 2018
 
GGJ18 al Politecnico di Milano - Presentazione che precede la presentazione d...
GGJ18 al Politecnico di Milano - Presentazione che precede la presentazione d...GGJ18 al Politecnico di Milano - Presentazione che precede la presentazione d...
GGJ18 al Politecnico di Milano - Presentazione che precede la presentazione d...
 
GGJ18 al Politecnico di Milano - Presentazione di apertura
GGJ18 al Politecnico di Milano - Presentazione di aperturaGGJ18 al Politecnico di Milano - Presentazione di apertura
GGJ18 al Politecnico di Milano - Presentazione di apertura
 
Presentation for UNITECH event - January 8, 2018
Presentation for UNITECH event - January 8, 2018Presentation for UNITECH event - January 8, 2018
Presentation for UNITECH event - January 8, 2018
 
DMTM Lecture 20 Data preparation
DMTM Lecture 20 Data preparationDMTM Lecture 20 Data preparation
DMTM Lecture 20 Data preparation
 
DMTM Lecture 19 Data exploration
DMTM Lecture 19 Data explorationDMTM Lecture 19 Data exploration
DMTM Lecture 19 Data exploration
 
DMTM Lecture 18 Graph mining
DMTM Lecture 18 Graph miningDMTM Lecture 18 Graph mining
DMTM Lecture 18 Graph mining
 
DMTM Lecture 17 Text mining
DMTM Lecture 17 Text miningDMTM Lecture 17 Text mining
DMTM Lecture 17 Text mining
 
DMTM Lecture 16 Association rules
DMTM Lecture 16 Association rulesDMTM Lecture 16 Association rules
DMTM Lecture 16 Association rules
 
DMTM Lecture 15 Clustering evaluation
DMTM Lecture 15 Clustering evaluationDMTM Lecture 15 Clustering evaluation
DMTM Lecture 15 Clustering evaluation
 
DMTM Lecture 14 Density based clustering
DMTM Lecture 14 Density based clusteringDMTM Lecture 14 Density based clustering
DMTM Lecture 14 Density based clustering
 
DMTM Lecture 13 Representative based clustering
DMTM Lecture 13 Representative based clusteringDMTM Lecture 13 Representative based clustering
DMTM Lecture 13 Representative based clustering
 
DMTM Lecture 12 Hierarchical clustering
DMTM Lecture 12 Hierarchical clusteringDMTM Lecture 12 Hierarchical clustering
DMTM Lecture 12 Hierarchical clustering
 
DMTM Lecture 11 Clustering
DMTM Lecture 11 ClusteringDMTM Lecture 11 Clustering
DMTM Lecture 11 Clustering
 
DMTM Lecture 10 Classification ensembles
DMTM Lecture 10 Classification ensemblesDMTM Lecture 10 Classification ensembles
DMTM Lecture 10 Classification ensembles
 
DMTM Lecture 09 Other classificationmethods
DMTM Lecture 09 Other classificationmethodsDMTM Lecture 09 Other classificationmethods
DMTM Lecture 09 Other classificationmethods
 
DMTM Lecture 08 Classification rules
DMTM Lecture 08 Classification rulesDMTM Lecture 08 Classification rules
DMTM Lecture 08 Classification rules
 

Dernier

EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
Earley Information Science
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
giselly40
 

Dernier (20)

Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Evaluating the top large language models.pdf
Evaluating the top large language models.pdfEvaluating the top large language models.pdf
Evaluating the top large language models.pdf
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 

Lecture 01 Data Mining

  • 1. Data Mining Data Mining and Text Mining (UIC 583 @ Politecnico)
  • 2. Lecture outline 2 Why Data Mining? What is Data Mining? What are the typical tasks? What are the primitives? What are the typical applications? What are the major issues? Prof. Pier Luca Lanzi
  • 4. Why Data Mining? 4 “Necessity is the mother of invention” Explosive Growth of Data Terabytes of available data Data collections and data availability Major sources of abundant data Pressing need for the automated analysis of massive data Prof. Pier Luca Lanzi
  • 5. Evolution of Database Technology 5 1960s: Data collection, database creation, IMS and network DBMS 1970s: Relational data model, relational DBMS implementation 1980s: RDBMS, advanced data models (extended-relational, OO, deductive, etc.) Application-oriented DBMS (spatial, scientific, engineering, etc.) 1990s: Data mining, data warehousing, multimedia databases, and Web databases 2000s Stream data management and mining Data mining and its applications Web technology (XML, data integration) Global information systems Prof. Pier Luca Lanzi
  • 6. Examples 6 In vitro fertilization Given: embryos described by 60 features Problem: selection of embryos that will survive Data: historical records of embryos and outcome Cow culling Given: cows described by 700 features Problem: selection of cows that should be culled Data: historical records and farmers’ decisions Prof. Pier Luca Lanzi
  • 7. Examples 7 Customer attrition Given: customer information for the past months Problem: predict who is likely to attrite next month, or estimate customer value Data: historical customer records Credit assessment Given: a loan application Problem: predict whether the bank should approve the loan Data: records from other loans Prof. Pier Luca Lanzi
  • 9. What is Data Mining? 9 The non-trivial process of identifying valid novel potentially useful, and ultimately understandable patterns in data. Alternative names, Data Fishing, Data Dredging (1960-) Data Mining (1990-), used by DB and business Knowledge Discovery in Databases (1989-), used by AI Business Intelligence, Information Harvesting, Information Discovery, Knowledge Extraction, ... Currently, Data Mining and Knowledge Discovery are used interchangeably Prof. Pier Luca Lanzi
  • 10. Example: Credit Risk 10 loan not repaid repaid salary Prof. Pier Luca Lanzi
  • 11. Example: Credit Risk 11 IF salary<k THEN not repaid loan k salary Prof. Pier Luca Lanzi
  • 12. Example: Credit Risk 12 Is it valid? The pattern has to be valid with respect to a certainty level (rule true for the 86%) Is it novel? The value k should be previously unknown or obvious Is it useful? The pattern should provide information useful to the bank for assessing credit risk Is it understandable? Prof. Pier Luca Lanzi
  • 13. What is the general idea? 13 Build computer programs that sift through databases automatically, seeking regularities or patterns There will be problems Most patterns are banal and uninteresting Most patterns are spurious, inexact, or contingent on accidental coincidences in the particular dataset used Real data is imperfect: Some parts will be garbled, and some will be missing Algorithms need to be robust enough to cope with imperfect data and to extract regularities that are inexact but useful Prof. Pier Luca Lanzi
  • 14. What are the related fields? 14 Machine Visualization Learning Knowledge Discovery And Data Mining Statistics Databases Prof. Pier Luca Lanzi
  • 15. Statistics, Machine Learning, 15 and Data Mining Statistics: more theory-based, focused on testing hypotheses Machine learning more heuristic, focused on building program that learns, more general than Data Mining Knowledge Discovery integrates theory and heuristics focus on the entire process of discovery, including data cleaning, learning, integration and visualization Data Mining focus on the algorithms to extract patterns from data Distinctions are blurred! Prof. Pier Luca Lanzi
  • 16. Why Not Traditional Data Analysis? 16 Tremendous amount of data High scalability to handle terabytes of data High-dimensionality of data Micro-array may have tens of thousands of dimensions High complexity of data Data streams and sensor data Time-series data, temporal data, sequence data Structure data, graphs, social networks and multi-linked data Heterogeneous databases and legacy databases Spatial, spatiotemporal, multimedia, text and Web data Software programs, scientific simulations New and sophisticated applications Prof. Pier Luca Lanzi
  • 17. Knowledge Discovery Process 17 raw data selection cleaning evaluation transformation mining Prof. Pier Luca Lanzi
  • 18. Knowledge Discovery Process 18 What are the main steps? Learning the application domain to extract relevant prior knowledge and goals Data selection Data cleaning Data reduction and transformation Mining Select the mining approach: classification, regression, association, clustering, etc. Choosing the mining algorithm(s) Perform mining: search for patterns of interest Pattern evaluation and knowledge presentation visualization, transformation, removing redundant patterns, etc. Use of discovered knowledge Prof. Pier Luca Lanzi
  • 19. Knowledge Discovery and 19 Business Intelligence Increasing potential to support business End User decisions Making Decisions Data Presentation Business Visualization Techniques Analyst Data Mining Data Information Discovery Analyst Data Exploration Statistical Analysis, Querying and Reporting OLAP, MDA Data Warehouses / Data Marts DBA Data Sources Paper, Files, Information Providers, Database Systems, OLTP Prof. Pier Luca Lanzi
  • 20. Architecture of a Typical 20 Knowledge Discovery System Graphical user interface Pattern evaluation KB Data mining engine Database or data warehouse server DB DW Prof. Pier Luca Lanzi
  • 22. Major Data Mining Tasks 22 Classification: predicting an item class Clustering: finding clusters in data Associations: frequent occurring events… Visualization: to facilitate human discovery Summarization: describing a group Deviation Detection: finding changes Estimation: predicting a continuous value Link Analysis: finding relationship Prof. Pier Luca Lanzi
  • 23. Data Mining Tasks: classification 23 IF salary<k THEN not repaid loan ? ? k salary Prof. Pier Luca Lanzi
  • 24. Data Mining Tasks: classification 24 Classification and Prediction Finding models (functions) that describe and distinguish classes or concepts The goal is to describe the data or to make future prediction E.g., classify countries based on climate, or classify cars based on gas mileage Presentation: decision-tree, classification rule, neural network Prediction: Predict some unknown numerical values Prof. Pier Luca Lanzi
  • 25. Data Mining Tasks: clustering 25 Prof. Pier Luca Lanzi
  • 26. Data Mining Tasks: clustering 26 Cluster analysis The class label is unknown Group data to form new classes, e.g., cluster houses to find distribution patterns Clustering based on the principle: maximizing the intra- class similarity and minimizing the interclass similarity Prof. Pier Luca Lanzi
  • 27. Data Mining Tasks: associations 27 Bread Bread Steak Jam Peanuts Jam Jam Soda Milk Soda Soda Peanuts Fruit Chips Chips Milk Jam Milk Bread Fruit Fruit Is there something interesting? Jam Fruit Fruit Fruit Soda Soda Soda Peanuts Chips Chips Peanuts Cheese Milk Milk Milk Yogurt Bread Prof. Pier Luca Lanzi
  • 28. Data Mining Tasks: associations 28 Association Rule Mining Finds interesting associations and/or correlation relationships among large set of data items. E.g., 98% of people who purchase tires and auto accessories also get automotive services done Prof. Pier Luca Lanzi
  • 29. Data Mining Tasks: others 29 Outlier analysis Outlier: a data object that does not comply with the general behavior of the data It can be considered as noise or exception but is quite useful in fraud detection, rare events analysis Trend and evolution analysis Trend and deviation: regression analysis Sequential pattern mining, periodicity analysis Similarity-based analysis Text Mining, Graph Mining, Data Streams Other pattern-directed or statistical analyses Prof. Pier Luca Lanzi
  • 30. Are all the “Discovered” Patterns 30 Interesting? Data Mining may generate thousands of patterns, not all of them are interesting. Suggested approach: Human-centered, query-based, focused mining Interestingness measures: a pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm Objective vs. subjective interestingness measures: Objective: based on statistics and structures of patterns, e.g., support, confidence, etc. Subjective: based on user’s belief in the data, e.g., unexpectedness, novelty, etc. Prof. Pier Luca Lanzi
  • 31. Can we find all and only 31 interesting patterns? Completeness: Find all the interesting patterns Can a data mining system find all the interesting patterns? Association vs. classification vs. clustering Optimization: Search for only interesting patterns: Can a data mining system find only the interesting patterns? Approaches • First general all the patterns and then filter out the uninteresting ones. • Generate only the interesting patterns—mining query optimization Prof. Pier Luca Lanzi
  • 32. Data Mining tasks 32 General functionality Descriptive data mining Predictive data mining Different views, different classifications Kinds of data to be mined Kinds of knowledge to be discovered Kinds of techniques utilized Kinds of applications adapted Prof. Pier Luca Lanzi
  • 34. Primitives that Define a Data Mining Task 34 Task-relevant data Type of knowledge to be mined Background knowledge Pattern interestingness measurements Visualization/presentation of discovered patterns Prof. Pier Luca Lanzi
  • 35. Primitive 1: 35 Task-Relevant Data Database or data warehouse name Database tables or data warehouse cubes Condition for data selection Relevant attributes or dimensions Data grouping criteria Prof. Pier Luca Lanzi
  • 36. Primitive 2: 36 Types of Knowledge to Be Mined Characterization Discrimination Association Classification/prediction Clustering Outlier analysis Other data mining tasks Prof. Pier Luca Lanzi
  • 37. Primitive 3: 37 Background Knowledge A typical kind of background knowledge: Concept hierarchies Schema hierarchy E.g., Street < City < ProvinceOrState < Country Set-grouping hierarchy E.g., {20-39} = young, {40-59} = middle_aged Operation-derived hierarchy email address: hagonzal@cs.uiuc.edu login-name < department < university < country Rule-based hierarchy LowProfitMargin (X) <= Price(X, P1) and Cost (X, P2) and (P1 - P2) < $50 Prof. Pier Luca Lanzi
  • 38. Primitive 4: 38 Pattern Interestingness Measure Simplicity Certainty Utility Novelty Prof. Pier Luca Lanzi
  • 39. Primitive 5: 39 Presentation of Discovered Patterns Different backgrounds/usages may require different forms of representation E.g., rules, tables, crosstabs, pie/bar chart, etc. Concept hierarchy is also important Discovered knowledge might be more understandable when represented at high level of abstraction Interactive drill up/down, pivoting, slicing and dicing provide different perspectives to data Different kinds of knowledge require different representation: association, classification, clustering, etc. Prof. Pier Luca Lanzi
  • 40. Integration of Data Mining and 40 Data Warehousing Data mining systems, DBMS, Data warehouse systems coupling No coupling, loose-coupling, semi-tight-coupling, tight- coupling On-line analytical mining data integration of mining and OLAP technologies Interactive mining multi-level knowledge Necessity of mining knowledge and patterns at different levels of abstraction by drilling/rolling, pivoting, slicing/dicing, etc. Integration of multiple mining functions Characterized classification, first clustering and then association Prof. Pier Luca Lanzi
  • 41. Coupling Data Mining with 41 Data bases and Datawarehouses No coupling—flat file processing, not recommended Loose coupling Fetching data from DB/DW Semi-tight coupling—enhanced DM performance Provide efficient implement a few data mining primitives in a DB/DW system, e.g., sorting, indexing, aggregation, histogram analysis, multiway join, precomputation of some stat functions Tight coupling—A uniform information processing environment DM is smoothly integrated into a DB/DW system, mining query is optimized based on mining query, indexing, query processing methods, etc. Prof. Pier Luca Lanzi
  • 43. Major Issues in Data Mining 43 Mining methodology Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web Performance: efficiency, effectiveness, and scalability Pattern evaluation: the interestingness problem Incorporation of background knowledge Handling noise and incomplete data Parallel, distributed and incremental mining methods Integration of the discovered knowledge with existing one: knowledge fusion User interaction Data mining query languages and ad-hoc mining Expression and visualization of data mining results Interactive mining of knowledge at multiple levels of abstraction Applications and social impacts Domain-specific data mining & invisible data mining Protection of data security, integrity, and privacy Prof. Pier Luca Lanzi
  • 45. Summary 45 Data mining: Discovering interesting patterns from large amounts of data A natural evolution of database technology, in great demand, with wide applications A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation Data mining functionalities: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc. Data mining systems and architectures Major issues in data mining Prof. Pier Luca Lanzi