Data Warehousing & Data Mining Lecturer: Dr. Bo Yuan    E-mail: yuanb@sz.tsinghua.edu.cn
Welcome
Mining? Warehousing?
Data Rich, Information Poor
Heterogeneous Data
The Value of Data
Data Integration & Analysis
From Data To Intelligence
  [Figure labels: Decision Models, Decision Support, Data Mining, Knowledge, Preprocessing, Information, Database, Data]
Business Intelligence
Related Areas
Is DM really important?
  Q: Your job sounds extremely interesting. What jobs would you recommend to a young person with an interest, and maybe a bachelors degree, in economics?
  A: If you are looking for a career where your services will be in high demand, you should find something where you provide a scarce, complementary service to something that is getting ubiquitous and cheap. So what’s getting ubiquitous and cheap? Data. And what is complementary to data? Analysis. So my recommendation is to take lots of courses about how to manipulate and analyze data: databases, machine learning, econometrics, statistics, visualization, and so on.
  An interview with Google Chief Economist Hal Varian, from the New York Times
It is all about data …
  Retail, Financial Institutions, WWW, Healthcare, Consulting Companies, Government, Bioinformatics, Telecommunication
Course Profile
  Lecturer: Dr. Bo Yuan
  Contact
    Phone: 2603 6067
    E-mail: yuanb@sz.tsinghua.edu.cn
    Room: F-401A
  Time: 2:00 pm – 3:35 pm, Friday
  Venue: CI-105
  Consultation: 2:00 pm – 3:00 pm, Wednesday
    Appointment via phone or e-mail preferred
Aims & Objectives
  Course Aims
    To gain a good understanding of popular data mining techniques.
    To gain experience in implementing and using data mining methods.
    To gain an appreciation for the basic principles of data warehousing.
  Learning Objectives
    Able to implement and apply data mining techniques to solve problems.
    Understand the main issues and core problems in data mining.
    Understand the relationship between data mining and other fields.
    Appreciate data mining research ideas and practice.
    Get familiar with academic writing and presentation.
  Graduate Attributes
    In-depth knowledge of the field of study
    Effective communication
    Independence and teamwork
    Critical judgment
Learning Activities
  Week 1: Introduction
  Week 2: Principles of Data Warehousing
    ETL, OLAP, Metadata
  Week 3: Data Preprocessing
  Week 4 – Week 7: Data Mining (Foundations)
    Bayesian Classifiers, Decision Trees, Neural Networks, Regression, Clustering, Support Vector Machines, Association Rules
  Week 8: Field Study
  Week 9 – Week 11: Data Mining (Advanced)
    Semi-supervised Learning, Active Learning, Ensemble Learning, Evolutionary Computation
  Week 12 – Week 13: Special Topic A (Text Mining & Web Information Retrieval)
  Week 14: Special Topic B (Bioinformatics, CRM, Privacy Issue)
  Week 15: Project Presentation
Assessment
  Assignment 1
    Type: Class Presentation
    Weight: 10%
    Task Description: Individual 25-minute talks on selected topics
  Assignment 2
    Type: Algorithm Experimentation
    Weight: 10%
    Task Description: Coding and testing of selected data mining algorithms
  Assignment 3
    Type: Problem Solving
    Weight: 30%
    Task Description: Group project on solving real-world data mining problems
  Final Exam
    Type: Closed Book Examination
    Weight: 50%
    Duration: 120 minutes
  Presentation matters!
Learning Resources
  International Conference on Data Mining
  International Conference on Data Engineering
  International Conference on Machine Learning
  Pacific-Asia Conference on Knowledge Discovery and Data Mining
  ACM SIGKDD Conference on Knowledge Discovery and Data Mining
Rules & Policies
  Plagiarism
    Plagiarism is the act of misrepresenting as one's own original work the ideas, interpretations, words or creative works of another.
      Direct copying of paragraphs, sentences, a single sentence or significant parts of a sentence.
      Presenting as independent work done in collaboration with others.
      Copying ideas, concepts, research results, computer code, statistical tables, designs, images, sounds or text or any combination of these.
      Paraphrasing, summarizing or simply rearranging another person's words, ideas, etc., without changing the basic structure and/or meaning of the text.
      Copying or adapting another student's original work into a submitted assessment item.
  Late Submission
    Late submissions will incur a penalty of 10% of the total marks for each day that the submission is late (including weekends).
    Submissions more than 5 days late will not be accepted.
  Assumed Background
    This course will deal with concepts using algorithms and data structures, mathematics, statistics and probability.
10 Minutes …
Data
  Definition
    “Data are pieces of information that represent the qualitative or quantitative attributes of a variable or set of variables. Data are often viewed as the lowest level of abstraction from which information and knowledge are derived.”
  Data Types
    Continuous, Binary
    Discrete, String
    Symbolic
  Storage
    Physical
    Logical
  Major Issues
    Transformation
    Errors and corruption
Database
  Definition
    “A database is an integrated collection of logically related records or files that is stored in a computer system which consolidates records previously stored in separate files into a common pool of data records that provides data for many applications.”
    “A database is a collection of information that is organized so that it can easily be accessed, managed, and updated.”
  Relational Databases
Relational Model
First Normal Form (1NF)
  There's no top-to-bottom ordering to the rows.
  There's no left-to-right ordering to the columns.
  There are no duplicate rows.
  Every cell contains exactly one value from the applicable domain.
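The "exactly one value per cell" and "no duplicate rows" rules can be illustrated with a small sketch. The customer table, its column names and the phone numbers below are hypothetical and do not come from the slides.

```python
# Hypothetical un-normalized rows: the "phones" cell holds several values.
raw_rows = [
    {"customer_id": 1, "name": "Alice", "phones": "555-0101, 555-0102"},
    {"customer_id": 2, "name": "Bob", "phones": "555-0199"},
]

def to_first_normal_form(rows):
    """Emit one row per (customer, phone) so every cell holds a single value."""
    result = set()  # a set has no ordering and cannot contain duplicate rows
    for row in rows:
        for phone in row["phones"].split(","):
            result.add((row["customer_id"], row["name"], phone.strip()))
    return result

normalized = to_first_normal_form(raw_rows)
```

The result repeats the customer name once per phone number; that kind of redundancy is what the higher normal forms are meant to remove.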
Second Normal Form (2NF)
  Definition: A 1NF table is in 2NF if and only if none of its non-prime attributes is functionally dependent on a part (proper subset) of a candidate key.
Third Normal Form (3NF)
  Definition: A table R is in 3NF if and only if it is in 2NF and every non-prime attribute of R is non-transitively dependent (directly dependent) on every key of R.
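As a rough illustration of the two definitions, the sketch below removes a partial dependency (2NF) and a transitive dependency (3NF) by projecting a table onto smaller column sets. Both tables, their keys and the helper `project` are assumptions made up for this example.

```python
def project(rows, columns):
    """Project rows onto the given columns and drop duplicate tuples."""
    return sorted({tuple(row[c] for c in columns) for row in rows})

# Hypothetical table with candidate key (student_id, course_id).
# student_name depends only on student_id: a partial dependency, so not 2NF.
enrollment = [
    {"student_id": 1, "course_id": "DM", "student_name": "Li", "grade": 90},
    {"student_id": 1, "course_id": "DB", "student_name": "Li", "grade": 85},
    {"student_id": 2, "course_id": "DM", "student_name": "Wang", "grade": 78},
]
students = project(enrollment, ["student_id", "student_name"])
grades = project(enrollment, ["student_id", "course_id", "grade"])

# Hypothetical table with key employee_id.
# dept_name depends on dept_id, which depends on the key: transitive, so not 3NF.
employees = [
    {"employee_id": 10, "dept_id": "R", "dept_name": "Research"},
    {"employee_id": 11, "dept_id": "R", "dept_name": "Research"},
    {"employee_id": 12, "dept_id": "S", "dept_name": "Sales"},
]
employee_dept = project(employees, ["employee_id", "dept_id"])
departments = project(employees, ["dept_id", "dept_name"])
```

After the decomposition each student name and each department name is stored once, and the original tables can be recovered by joining on `student_id` and `dept_id`.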
Data Warehouse
  Operational databases are optimized for the preservation of data integrity and the speed of recording business transactions.
  Data warehouses are optimized for the speed of data retrieval.
  A data warehouse is a repository of an organization's electronically stored data, designed to facilitate reporting and analysis.
  W. H. Inmon states that the data warehouse is:
    Subject-oriented
    Time-variant
    Non-volatile
    Integrated
  Data Warehousing
    Business Intelligence Tools
    Tools to extract, transform, and load data into the repository
    Tools to manage and retrieve metadata
Multidimensional Data
  [Figure: OLAP Cube]
Star Schema
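A star schema keeps measures in a central fact table that references surrounding dimension tables, and OLAP-style queries aggregate the facts along those dimensions. The sketch below uses Python's built-in `sqlite3` module with an in-memory database; all table names, column names and figures are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, year INTEGER, quarter TEXT);
CREATE TABLE fact_sales  (product_id INTEGER REFERENCES dim_product(product_id),
                          date_id    INTEGER REFERENCES dim_date(date_id),
                          amount     REAL);
INSERT INTO dim_product VALUES (1, 'Books'), (2, 'Music');
INSERT INTO dim_date VALUES (1, 2009, 'Q1'), (2, 2009, 'Q2');
INSERT INTO fact_sales VALUES (1, 1, 100.0), (1, 2, 150.0), (2, 1, 80.0), (2, 2, 20.0);
""")

# Roll-up: aggregate the fact table along the product category dimension.
rollup = conn.execute("""
    SELECT p.category, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    GROUP BY p.category ORDER BY p.category
""").fetchall()

# Slice: fix one value of the date dimension (Q1) and keep the other dimension.
slice_q1 = conn.execute("""
    SELECT p.category, f.amount
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    JOIN dim_date AS d ON d.date_id = f.date_id
    WHERE d.quarter = 'Q1' ORDER BY p.category
""").fetchall()
```

Here `rollup` holds total sales per category and `slice_q1` holds the Q1 face of the cube.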
To Build a Data Warehouse
  Data must be extracted from multiple, heterogeneous sources such as databases or other data feeds.
  Data must be formatted for consistency within the data warehouse.
    Names, meanings and domains of data from unrelated sources must be reconciled.
  Data must be cleaned to ensure validity.
    Data cleaning is an important part of building a data warehouse and it is one of the most labor-demanding tasks.
  Data must be fitted into the data model of the warehouse.
    Data may have to be converted from relational, object-oriented, or legacy databases.
  Data must be loaded into the warehouse.
    The sheer volume of data in the warehouse makes loading the data a significant task.
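The steps above can be sketched as a tiny extract, transform and load pipeline. The two source systems, their field names, the gender codes and the age validity rule are all assumptions invented for illustration; a Python list stands in for the warehouse.

```python
# Two hypothetical sources with different naming and coding schemes.
crm_rows = [{"cust_name": "Alice", "gender": "F", "age": "34"},
            {"cust_name": "Bob", "gender": "M", "age": "-5"}]  # infeasible age
billing_rows = [{"name": "carol", "sex": "female", "age": "41"}]

GENDER_CODES = {"F": "female", "M": "male", "female": "female", "male": "male"}

def extract():
    """Extract: read each source and map its fields onto shared names."""
    for r in crm_rows:
        yield {"name": r["cust_name"], "gender": r["gender"], "age": r["age"]}
    for r in billing_rows:
        yield {"name": r["name"], "gender": r["sex"], "age": r["age"]}

def transform(row):
    """Transform: reconcile codings and types; return None for invalid rows."""
    age = int(row["age"])
    if not 0 <= age <= 120:  # cleaning rule assumed for this sketch
        return None
    return {"name": row["name"].title(),
            "gender": GENDER_CODES[row["gender"]],
            "age": age}

def load(rows, target):
    """Load: append the cleaned rows to the target store."""
    target.extend(rows)

warehouse = []
load([t for t in map(transform, extract()) if t is not None], warehouse)
```

The row with the infeasible age is rejected during cleaning, and the remaining rows share one naming and coding scheme.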
Data Warehouse vs. Database
Performance Dashboard
5 Minutes …
Data Mining
  People have been analysing and investigating data for centuries.
    Statistics: Mean, Variance, Correlation, Distribution …
  In modern days, data are often far beyond human comprehension.
    Diversity
    Volume
    Dimensionality
  Definition
    Data Mining is the process of automatically extracting interesting and useful hidden patterns from usually massive, incomplete and noisy data.
  Not a fully automatic process
    Human interventions are often inevitable.
      Domain Knowledge
      Data Collection and Pre-processing
  Synonym: Knowledge Discovery
  One Field, Many Techniques, Unlimited Applications
The Process of Data Mining
DM Techniques - Classification
  “Classification is a procedure in which individual items are placed into groups based on quantitative information on one or more characteristics inherent in the items (referred to as variables, characters, etc) and based on a training set of previously labeled items”.
  Given training data {(x1, y1), …, (xn, yn)}, the task is to produce a classifier that maps any unknown object xi to its true classification label yi defined by some unknown mapping.
  Algorithms
    Decision Trees
    K-nearest neighbours
    Neural Networks
    Support Vector Machines
  Applications
    Credit Scoring
    Churn Prediction
    Medical Diagnosis
  [Figure labels: X, Y]
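To make the training-data formulation concrete, here is a minimal sketch of one of the listed algorithms, k-nearest neighbours, in plain Python. The two-dimensional points and the "low risk" / "high risk" labels are invented for illustration, loosely in the spirit of credit scoring.

```python
from collections import Counter
import math

def knn_predict(training_data, x, k=3):
    """Label x by majority vote among its k nearest training points."""
    neighbours = sorted(training_data, key=lambda pair: math.dist(pair[0], x))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical training set {(x1, y1), ..., (xn, yn)}.
training_data = [
    ((1.0, 1.0), "low risk"), ((1.2, 0.8), "low risk"), ((0.9, 1.1), "low risk"),
    ((4.0, 4.2), "high risk"), ((4.1, 3.9), "high risk"), ((3.8, 4.0), "high risk"),
]
prediction = knn_predict(training_data, (1.1, 0.9))
```

The classifier stores the labeled examples and defers all work to prediction time; `math.dist` requires Python 3.8 or later.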
Classification Boundaries
Confusion Matrix
  Accuracy = (TP + TN) / (P + N)
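The accuracy formula can be checked on a small example. The sketch below counts the four cells of a binary confusion matrix, using P = TP + FN and N = TN + FP; the label vectors are made up for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, TN, FP, FN) for a binary classification problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return tp, tn, fp, fn

# Hypothetical true labels and predictions.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
positives, negatives = tp + fn, tn + fp          # P and N
accuracy = (tp + tn) / (positives + negatives)   # (TP + TN) / (P + N)
```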
Receiver Operating Characteristic
Lift
DM Techniques - Clustering
  Clustering is the assignment of a set of observations into subsets (called clusters) so that observations in the same cluster are similar in some sense.
  Distance Metrics
    Euclidean distance
    Manhattan distance
    Mahalanobis distance
  Algorithms
    K-means
    Leader
    RPCL
    Affinity Propagation
  Applications
    Market Research
    Image Segmentation
    Social Network Analysis
  What is the difference between classification and clustering?
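Two of the listed distance metrics and a bare-bones K-means loop can be sketched in plain Python. The points and the starting centroids are hypothetical, and the loop runs a fixed number of iterations instead of testing for convergence.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def k_means(points, centroids, iterations=10):
    """Alternate between assigning points and updating centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: euclidean(p, centroids[i]))
            clusters[nearest].append(p)
        # New centroid = mean of the cluster; an empty cluster keeps its old one.
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical unlabeled observations and initial centroids.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = k_means(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
```

No labels are supplied anywhere: the groups emerge from the distances alone, which is the main contrast with classification.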
Hierarchical Clustering
DM Techniques – Association Rule
Association Rule
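The slides give only the title here, so the following is a sketch of the two standard measures used to evaluate an association rule A → B: support (the fraction of transactions containing both A and B) and confidence (the fraction of transactions containing A that also contain B). The market-basket transactions are invented for illustration.

```python
# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def count(itemset):
    """Number of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions)

def support(itemset):
    return count(itemset) / len(transactions)

def confidence(antecedent, consequent):
    return count(antecedent | consequent) / count(antecedent)

rule_support = support({"diapers", "beer"})          # 3 of 5 transactions
rule_confidence = confidence({"diapers"}, {"beer"})  # 3 of the 4 with diapers
```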
DM Techniques – Regression
Regression
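The slides show regression only as figures. As one possible illustration, the sketch below fits a straight line by ordinary least squares in plain Python; the choice of simple linear regression and the data points are assumptions, not something the slides specify.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical noise-free data generated from y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)
```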
Overfitting – Regression
Overfitting – Classification
Cross Validation
  [Figure labels: Data, Training Set, Test Set, Generated Models, Evaluation]
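A minimal sketch of how k-fold cross validation partitions the data: each sample lands in the test set exactly once, and the model is trained on the remaining folds. The function name and the choice of 10 samples and 3 folds are illustrative assumptions; the indices are not shuffled here, which a real experiment would usually do first.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

splits = list(k_fold_splits(10, 3))
```

Averaging the evaluation score over the folds gives an estimate of performance on unseen data, which is what helps to expose overfitting.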
Seeing is Knowing
Data Preprocessing
  Why data preprocessing?
    Real data are often surprisingly dirty.
      Incomplete Data
      Inconsistent Data
      Noisy Data
  Typical Issues
    Missing Attribute Values
    Different Coding/Naming Schemes
    Infeasible Values
    Outliers
  Data Quality
    Accuracy
    Completeness
    Consistency
    Interpretability
    Credibility
    Timeliness
Data Preprocessing
  Data quality is a crucial factor in successful data mining tasks.
  Data Cleaning
    Fill in missing values.
    Correct inconsistent data.
    Identify outliers and noisy data.
  Data Integration
    Combine data from different sources.
  Data Transformation
    Normalization
    Aggregation
    Type Conversion
  Data Reduction
    Feature Selection
    Sampling
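Two of the steps listed above, filling in missing values and normalization, can be sketched as follows. Mean imputation and min-max scaling are just one common choice for each step, not necessarily the methods used in the course, and the age values are made up.

```python
def fill_missing(values):
    """Data cleaning: replace None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def min_max_normalize(values):
    """Data transformation: rescale values linearly to the range [0, 1].

    Assumes the values are not all equal; otherwise the range would be zero.
    """
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

ages = [20, None, 40, 30]            # hypothetical attribute with a missing value
filled = fill_missing(ages)          # the gap becomes the mean of 20, 40 and 30
scaled = min_max_normalize(filled)
```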
Review
  What is data mining?
  Why is data mining important?
  What are the typical data mining applications?
  What is the general procedure of data mining?
  What are the major techniques in data mining?
  What is the difference between data warehouses and databases?
  What to expect in this course?
  Where to find relevant information?
  How to make the most of this course?
Just in Case Someone Asks …

  • 1. Data Warehousing & Data Mining. Lecturer: Dr. Bo Yuan. E-mail: yuanb@sz.tsinghua.edu.cn
  • 6. The Value of Data
  • 7. Data Integration & Analysis
  • 8. From Data To Intelligence (diagram labels: Decision Models, Decision Support, Data Mining, Knowledge, Preprocessing, Information, Database, Data)
  • 11. Is DM really important? Q: Your job sounds extremely interesting. What jobs would you recommend to a young person with an interest, and maybe a bachelors degree, in economics? A: If you are looking for a career where your services will be in high demand, you should find something where you provide a scarce, complementary service to something that is getting ubiquitous and cheap. So what’s getting ubiquitous and cheap? Data. And what is complementary to data? Analysis. So my recommendation is to take lots of courses about how to manipulate and analyze data: databases, machine learning, econometrics, statistics, visualization, and so on. (From an interview with Google Chief Economist Hal Varian in the New York Times.)
  • 12. It is all about data … Retail, Financial Institutions, WWW, Healthcare, Consulting Companies, Government, Bioinformatics, Telecommunication.
  • 13. Course Profile. Lecturer: Dr. Bo Yuan. Contact Phone: 2603 6067. E-mail: yuanb@sz.tsinghua.edu.cn. Room: F-401A. Time: 2:00 pm – 3:35 pm, Friday. Venue: CI-105. Consultation: 2:00 pm – 3:00 pm, Wednesday (appointment via phone or e-mail preferred).
  • 14. Aims & Objectives. Course Aims: To gain a good understanding of popular data mining techniques. To gain experience in implementing and using data mining methods. To gain an appreciation for the basic principles of data warehousing. Learning Objectives: Able to implement and apply data mining techniques to solve problems. Understand the main issues and core problems in data mining. Understand the relationship between data mining and other fields. Appreciate data mining research ideas and practice. Get familiar with academic writing and presentation. Graduate Attributes: In-depth knowledge of the field of study; effective communication; independence and teamwork; critical judgment.
  • 15. Learning Activities. Week 1: Introduction. Week 2: Principles of Data Warehousing (ETL, OLAP, Metadata). Week 3: Data Preprocessing. Week 4 – Week 7: Data Mining (Foundations): Bayesian Classifiers, Decision Trees, Neural Networks, Regression, Clustering, Support Vector Machines, Association Rules. Week 8: Field Study. Week 9 – Week 11: Data Mining (Advanced): Semi-supervised Learning, Active Learning, Ensemble Learning, Evolutionary Computation. Week 12 – Week 13: Special Topic A (Text Mining & Web Information Retrieval). Week 14: Special Topic B (Bioinformatics, CRM, Privacy Issue). Week 15: Project Presentation.
  • 16. Assessment. Assignment 1: Class Presentation. Weight: 10%. Task Description: Individual 25-minute talks on selected topics. Assignment 2: Algorithm Experimentation. Weight: 10%. Task Description: Coding and testing of selected data mining algorithms. Assignment 3: Problem Solving. Weight: 30%. Task Description: Group project on solving real-world data mining problems. Final Exam: Closed Book Examination. Weight: 50%. Duration: 120 minutes. Presentation matters!
  • 18. Learning Resources: International Conference on Data Mining; International Conference on Data Engineering; International Conference on Machine Learning; Pacific-Asia Conference on Knowledge Discovery and Data Mining; ACM SIGKDD Conference on Knowledge Discovery and Data Mining.
  • 19. Rules & Policies: Plagiarism. Plagiarism is the act of misrepresenting as one's own original work the ideas, interpretations, words or creative works of another. Examples: Direct copying of paragraphs, sentences, a single sentence or significant parts of a sentence. Presenting as independent work done in collaboration with others. Copying ideas, concepts, research results, computer codes, statistical tables, designs, images, sounds or text or any combination of these. Paraphrasing, summarizing or simply rearranging another person's words, ideas, etc. without changing the basic structure and/or meaning of the text. Copying or adapting another student's original work into a submitted assessment item.
  • 20. Rules & Policies. Late Submission: Late submissions will incur a penalty of 10% of the total marks for each day that the submission is late (including weekends). Submissions more than 5 days late will not be accepted. Assumed Background: This course will deal with concepts using algorithms and data structures, mathematics, statistics and probability.
  • 22. Data. Definition: “Data are pieces of information that represent the qualitative or quantitative attributes of a variable or set of variables. Data are often viewed as the lowest level of abstraction from which information and knowledge are derived.” Data Types: Continuous, Binary, Discrete, String, Symbolic. Storage: Physical, Logical. Major Issues: Transformation; Errors and corruption.
  • 23. Database. Definition: “A database is an integrated collection of logically related records or files that is stored in a computer system which consolidates records previously stored in separate files into a common pool of data records that provides data for many applications.” “A database is a collection of information that is organized so that it can easily be accessed, managed, and updated.” Relational Databases.
  • 25. First Normal Form (1NF): There's no top-to-bottom ordering to the rows. There's no left-to-right ordering to the columns. There are no duplicate rows. Every cell contains exactly one value from the applicable domain.
  • 28. Second Normal Form (2NF). Definition: A 1NF table is in 2NF if and only if none of its non-prime attributes are functionally dependent on a part (proper subset) of a candidate key.
  • 30. Third Normal Form (3NF). Definition: Every non-prime attribute of R is non-transitively dependent (directly dependent) on every key of R.
  • 32. Data Warehouse. Operational databases are optimized for the preservation of data integrity and speed of recording of business transactions. Data warehouses are optimized for the speed of data retrieval. A data warehouse is a repository of an organization's electronically stored data, designed to facilitate reporting and analysis. W. H. Inmon states that the data warehouse is: Subject-oriented, Time-variant, Non-volatile, Integrated. Data Warehousing includes: business intelligence tools; tools to extract, transform, and load data into the repository; tools to manage and retrieve metadata.
  • 35. To Build a Data Warehouse. Data must be extracted from multiple, heterogeneous sources such as databases or other data feeds. Data must be formatted for consistency within the data warehouse. Names, meanings and domains of data from unrelated sources must be reconciled. Data must be cleaned to ensure validity. Data cleaning is an important part of building a data warehouse and one of the most labor-intensive tasks. Data must be fitted into the data model of the warehouse. Data may have to be converted from relational, object-oriented, or legacy databases. Data must be loaded into the warehouse. The sheer volume of data in the warehouse makes loading the data a significant task.
  • 36. Data Warehouse vs. Database
  • 39. Data Mining. People have been analysing and investigating data for centuries. Statistics: Mean, Variance, Correlation, Distribution … In modern days, data are often far beyond human comprehension: Diversity, Volume, Dimensionality. Definition: Data Mining is the process of automatically extracting interesting and useful hidden patterns from usually massive, incomplete and noisy data. Not a fully automatic process: human interventions are often inevitable (Domain Knowledge; Data Collection and Pre-processing). Synonym: Knowledge Discovery. One Field, Many Techniques, Unlimited Applications.
  • 40. The Process of Data Mining
  • 41. DM Techniques - Classification. “Classification is a procedure in which individual items are placed into groups based on quantitative information on one or more characteristics inherent in the items (referred to as variables, characters, etc.) and based on a training set of previously labeled items.” Given training data {(x1, y1), …, (xn, yn)}, the task is to produce a classifier that maps any unknown object xi to its true classification label yi defined by some unknown mapping. Algorithms: Decision Trees, K-nearest neighbours, Neural Networks, Support Vector Machines. Applications: Credit Scoring, Churn Prediction, Medical Diagnosis.
  • 43. Confusion Matrix. Accuracy = (TP + TN) / (P + N)
  • 46. DM Techniques - Clustering. Clustering is the assignment of a set of observations into subsets (called clusters) so that observations in the same cluster are similar in some sense. Distance Metrics: Euclidean distance, Manhattan distance, Mahalanobis distance. Algorithms: K-means, Leader, RPCL, Affinity Propagation. Applications: Market Research, Image Segmentation, Social Network Analysis. What is the difference between classification and clustering?
  • 48. DM Techniques – Association Rule
  • 50. DM Techniques – Regression
  • 54. Cross Validation (diagram labels: Training Set, Generated Models, Evaluation, Data, Test Set)
  • 56. Data Preprocessing. Why data preprocessing? Real data are often surprisingly dirty: Incomplete Data, Inconsistent Data, Noisy Data. Typical Issues: Missing Attribute Values, Different Coding/Naming Schemes, Infeasible Values, Outliers. Data Quality: Accuracy, Completeness, Consistency, Interpretability, Credibility, Timeliness.
  • 57. Data Preprocessing. Data quality is a crucial factor in successful data mining tasks. Data Cleaning: Fill in missing values. Correct inconsistent data. Identify outliers and noisy data. Data Integration: Combine data from different sources. Data Transformation: Normalization, Aggregation, Type Conversion. Data Reduction: Feature Selection, Sampling.
  • 58. Review. What is data mining? Why is data mining important? What are the typical data mining applications? What is the general procedure of data mining? What are the major techniques in data mining? What is the difference between data warehouses and databases? What to expect in this course? Where to find relevant information? How to make the most of this course?
  • 59. Just in Case Someone Asks …
  • 60. Just in Case Someone Asks …
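
The accuracy formula on slide 43, Accuracy = (TP + TN) / (P + N), can be sketched in a few lines of Python. This is a minimal illustration and not part of the original slides: the function names `confusion_counts` and `accuracy`, the label encoding (1 = positive, 0 = negative), and the toy labels are all assumptions made for the example.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count the four confusion-matrix cells for a binary problem.

    The encoding (1 = positive class) is an assumption for this sketch.
    """
    tp = tn = fp = fn = 0
    for truth, guess in zip(y_true, y_pred):
        if guess == positive:
            if truth == positive:
                tp += 1  # predicted positive, actually positive
            else:
                fp += 1  # predicted positive, actually negative
        else:
            if truth == positive:
                fn += 1  # predicted negative, actually positive
            else:
                tn += 1  # predicted negative, actually negative
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Accuracy = (TP + TN) / (P + N), as on the Confusion Matrix slide."""
    p = tp + fn  # P: all actual positives
    n = tn + fp  # N: all actual negatives
    return (tp + tn) / (p + n)


# Toy labels, invented purely for illustration.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(tp, tn, fp, fn)            # 3 3 1 1
print(accuracy(tp, tn, fp, fn))  # 0.75
```

Here P is taken as TP + FN and N as TN + FP, so the denominator is simply the total number of labeled items. Accuracy alone can be misleading when the classes are heavily imbalanced, which is one reason to look at the full matrix rather than the single number.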