11. Is DM really important?
Q: Your job sounds extremely interesting. What jobs would you recommend to a young person with an interest, and maybe a bachelors degree, in economics?
A: If you are looking for a career where your services will be in high demand, you should find something where you provide a scarce, complementary service to something that is getting ubiquitous and cheap. So what’s getting ubiquitous and cheap? Data. And what is complementary to data? Analysis. So my recommendation is to take lots of courses about how to manipulate and analyze data: databases, machine learning, econometrics, statistics, visualization, and so on.
An interview with Google Chief Economist Hal Varian from the New York Times
12. It is all about data …
  Retail
  Financial Institutions
  WWW
  Healthcare
  Consulting Companies
  Government
  Bioinformatics
  Telecommunication
13. Course Profile
Lecturer: Dr. Bo Yuan
Contact
  Phone: 2603 6067
  E-mail: yuanb@sz.tsinghua.edu.cn
  Room: F-401A
Time: 2:00 pm – 3:35 pm, Friday
Venue: CI-105
Consultation: 2:00 pm – 3:00 pm, Wednesday
  Appointment via phone or e-mail preferred
14. Aims & Objectives
Course Aims
  To gain a good understanding of popular data mining techniques.
  To gain experience in implementing and using data mining methods.
  To gain an appreciation for the basic principles of data warehousing.
Learning Objectives
  Able to implement and apply data mining techniques to solve problems.
  Understand the main issues and core problems in data mining.
  Understand the relationship between data mining and other fields.
  Appreciate data mining research ideas and practice.
  Get familiar with academic writing and presentation.
Graduate Attributes
  In-depth knowledge of the field of study
  Effective communication
  Independence and teamwork
  Critical judgment
15. Learning Activities
Week 1: Introduction
Week 2: Principles of Data Warehousing
  ETL, OLAP, Metadata
Week 3: Data Preprocessing
Week 4 – Week 7: Data Mining (Foundations)
  Bayesian Classifiers, Decision Trees, Neural Networks, Regression, Clustering, Support Vector Machines, Association Rules
Week 8: Field Study
Week 9 – Week 11: Data Mining (Advanced)
  Semi-supervised Learning, Active Learning, Ensemble Learning, Evolutionary Computation
Week 12 – Week 13: Special Topic A (Text Mining & Web Information Retrieval)
Week 14: Special Topic B (Bioinformatics, CRM, Privacy Issues)
Week 15: Project Presentation
16. Assessment
Assignment 1
  Type: Class Presentation
  Weight: 10%
  Task Description: Individual 25-minute talks on selected topics
Assignment 2
  Type: Algorithm Experimentation
  Weight: 10%
  Task Description: Coding and testing of selected data mining algorithms
Assignment 3
  Type: Problem Solving
  Weight: 30%
  Task Description: Group project on solving real-world data mining problems
Final Exam
  Type: Closed Book Examination
  Weight: 50%
  Duration: 120 minutes
Presentation matters!
18. Learning Resources
  International Conference on Data Mining
  International Conference on Data Engineering
  International Conference on Machine Learning
  Pacific-Asia Conference on Knowledge Discovery and Data Mining
  ACM SIGKDD Conference on Knowledge Discovery and Data Mining
19. Rules & Policies
Plagiarism
Plagiarism is the act of misrepresenting as one's own original work the ideas, interpretations, words or creative works of another.
  Direct copying of paragraphs, sentences, a single sentence or significant parts of a sentence.
  Presenting work done in collaboration with others as independent work.
  Copying ideas, concepts, research results, computer code, statistical tables, designs, images, sounds or text, or any combination of these.
  Paraphrasing, summarizing or simply rearranging another person's words, ideas, etc., without changing the basic structure and/or meaning of the text.
  Copying or adapting another student's original work into a submitted assessment item.
20. Rules & Policies
Late Submission
  Late submissions will incur a penalty of 10% of the total marks for each day that the submission is late (including weekends).
  Submissions more than 5 days late will not be accepted.
Assumed Background
  This course deals with concepts that draw on algorithms and data structures, mathematics, statistics and probability.
22. Data
Definition
  “Data are pieces of information that represent the qualitative or quantitative attributes of a variable or set of variables. Data are often viewed as the lowest level of abstraction from which information and knowledge are derived.”
Data Types
  Continuous, Binary, Discrete, String, Symbolic
Storage
  Physical
  Logical
Major Issues
  Transformation
  Errors and corruption
23. Database
Definition
  “A database is an integrated collection of logically related records or files that is stored in a computer system which consolidates records previously stored in separate files into a common pool of data records that provides data for many applications.”
  “A database is a collection of information that is organized so that it can easily be accessed, managed, and updated.”
Relational Databases
25. First Normal Form (1NF)
  There's no top-to-bottom ordering to the rows.
  There's no left-to-right ordering to the columns.
  There are no duplicate rows.
  Every cell contains exactly one value from the applicable domain.
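The last two conditions can be illustrated with a minimal Python sketch. The table, its column names (`student`, `courses`) and the function `to_first_normal_form` are hypothetical and chosen only for illustration.

```python
# Hypothetical rows that violate 1NF: the "courses" cell holds several values.
raw_rows = [
    {"student": "Alice", "courses": "Data Mining, Statistics"},
    {"student": "Bob", "courses": "Data Mining"},
]

def to_first_normal_form(rows):
    """Split each multi-valued cell so every cell holds exactly one value."""
    flat = set()  # a set has no duplicate rows and no row ordering
    for row in rows:
        for course in row["courses"].split(","):
            flat.add((row["student"], course.strip()))
    return flat

normalized = to_first_normal_form(raw_rows)
```

Storing the result as a set of tuples mirrors the first and third conditions: the rows carry no order and cannot repeat.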
28. Second Normal Form (2NF)
Definition
  A 1NF table is in 2NF if and only if none of its non-prime attributes are functionally dependent on a part (proper subset) of a candidate key.
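A small sketch may make the definition concrete. It assumes a hypothetical enrolment table whose candidate key is (`student_id`, `course_id`); the attribute `student_name` depends on `student_id` alone, so the table is not in 2NF until it is decomposed. All names and values are invented for illustration.

```python
# Hypothetical 1NF table with candidate key (student_id, course_id).
# student_name depends on student_id alone, a proper subset of the key,
# so the table violates 2NF.
enrolments = [
    (1, "DM101", "Alice", 90),
    (1, "ST201", "Alice", 85),
    (2, "DM101", "Bob", 78),
]

def decompose_to_2nf(rows):
    """Split the table so each non-prime attribute depends on a whole key."""
    students = {}  # student_id -> student_name
    grades = {}    # (student_id, course_id) -> grade
    for student_id, course_id, name, grade in rows:
        students[student_id] = name
        grades[(student_id, course_id)] = grade
    return students, grades

students, grades = decompose_to_2nf(enrolments)
```

After the split, each student name is stored once, and the grade remains keyed by the full (student, course) pair.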
32. Data Warehouse
Operational databases are optimized for the preservation of data integrity and speed of recording of business transactions.
Data warehouses are optimized for the speed of data retrieval.
A data warehouse is a repository of an organization's electronically stored data, designed to facilitate reporting and analysis.
W. H. Inmon states that the data warehouse is:
  Subject-oriented
  Time-variant
  Non-volatile
  Integrated
Data Warehousing
  Business Intelligence Tools
  Tools to extract, transform, and load data into the repository
  Tools to manage and retrieve metadata
35. To Build a Data Warehouse
Data must be extracted from multiple, heterogeneous sources such as databases or other data feeds.
Data must be formatted for consistency within the data warehouse.
  Names, meanings and domains of data from unrelated sources must be reconciled.
Data must be cleaned to ensure validity.
  Data cleaning is an important part of building a data warehouse and one of the most labor-intensive tasks.
Data must be fitted into the data model of the warehouse.
  Data may have to be converted from relational, object-oriented, or legacy databases.
Data must be loaded into the warehouse.
  The sheer volume of data in the warehouse makes loading the data a significant task.
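The extract, format, clean and load steps above can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the two in-memory sources, their field names, the gender codes and the age range used as a validity check. A real warehouse would read from databases or data feeds instead.

```python
# Two hypothetical sources that describe customers with different
# field names and gender codes.
source_a = [
    {"cust_id": 1, "sex": "M", "age": 34},
    {"cust_id": 2, "sex": "F", "age": -5},  # infeasible age
]
source_b = [{"id": 3, "gender": "female", "age": 41}]

GENDER_CODES = {"M": "male", "F": "female", "male": "male", "female": "female"}

def extract():
    """Pull records from every source and map them onto one naming scheme."""
    for r in source_a:
        yield {"customer_id": r["cust_id"], "gender": r["sex"], "age": r["age"]}
    for r in source_b:
        yield {"customer_id": r["id"], "gender": r["gender"], "age": r["age"]}

def transform(records):
    """Reconcile codes and skip records that fail a simple validity check."""
    for r in records:
        if not 0 <= r["age"] <= 120:
            continue  # a real pipeline would log or repair such records
        yield {**r, "gender": GENDER_CODES[r["gender"]]}

def load(records, warehouse):
    """Append the cleaned records to the (in-memory) warehouse table."""
    warehouse.extend(records)
    return warehouse

warehouse = load(transform(extract()), [])
```

Using generators keeps each stage separate, so a record flows through extraction, transformation and loading one at a time.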
39. Data Mining
People have been analysing and investigating data for centuries.
  Statistics: Mean, Variance, Correlation, Distribution …
Today, data are often far beyond human comprehension.
  Diversity
  Volume
  Dimensionality
Definition
  Data Mining is the process of automatically extracting interesting and useful hidden patterns from usually massive, incomplete and noisy data.
Not a fully automatic process
  Human intervention is often unavoidable.
  Domain Knowledge
  Data Collection and Pre-processing
Synonym: Knowledge Discovery
One Field, Many Techniques, Unlimited Applications
41. DM Techniques - Classification
“Classification is a procedure in which individual items are placed into groups based on quantitative information on one or more characteristics inherent in the items (referred to as variables, characters, etc.) and based on a training set of previously labeled items.”
Given training data {(x1, y1), …, (xn, yn)}, the task is to produce a classifier that maps any unknown object x to its true classification label y, defined by some unknown mapping.
Algorithms
  Decision Trees
  K-nearest neighbours
  Neural Networks
  Support Vector Machines
Applications
  Credit Scoring
  Churn Prediction
  Medical Diagnosis
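As a minimal sketch of one of the listed algorithms, k-nearest neighbours can be written in a few lines of Python. The training data below (income and debt figures with "good"/"bad" labels) are invented for illustration, and the function name `knn_predict` and the choice of k = 3 are assumptions rather than anything prescribed by the course.

```python
from collections import Counter
import math

def knn_predict(training, x, k=3):
    """Return the majority label among the k training points closest to x."""
    nearest = sorted(training, key=lambda pair: math.dist(pair[0], x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical credit-scoring data: (income, debt) -> label
training = [
    ((50.0, 5.0), "good"), ((60.0, 10.0), "good"), ((55.0, 8.0), "good"),
    ((20.0, 30.0), "bad"), ((25.0, 35.0), "bad"), ((15.0, 28.0), "bad"),
]
prediction = knn_predict(training, (52.0, 7.0))
```

The classifier needs no separate training phase: the labeled pairs themselves are the model, and the label of a new object is decided by a vote among its neighbours.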
46. DM Techniques - Clustering
Clustering is the assignment of a set of observations into subsets (called clusters) so that observations in the same cluster are similar in some sense.
Distance Metrics
  Euclidean distance
  Manhattan distance
  Mahalanobis distance
Algorithms
  K-means
  Leader
  RPCL
  Affinity Propagation
Applications
  Market Research
  Image Segmentation
  Social Network Analysis
What is the difference between classification and clustering?
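Two of the distance metrics and the K-means algorithm can be sketched as follows. This is an illustrative version under stated assumptions: the initial centroids are supplied by the caller rather than chosen at random, the number of iterations is fixed, and the sample points are invented.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(p - q) for p, q in zip(a, b))

def k_means(points, centroids, iterations=10):
    """K-means with Euclidean distance and caller-supplied initial centroids."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: euclidean(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster;
        # an empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical two-dimensional observations forming two visible groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, clusters = k_means(points, [(0.0, 0.0), (10.0, 10.0)])
```

Unlike the classification setting, no labels are given here: the groups emerge only from the distances between observations.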
56. Data Preprocessing
Why data preprocessing? Real data are often surprisingly dirty.
  Incomplete Data
  Inconsistent Data
  Noisy Data
Typical Issues
  Missing Attribute Values
  Different Coding/Naming Schemes
  Infeasible Values
  Outliers
Data Quality
  Accuracy
  Completeness
  Consistency
  Interpretability
  Credibility
  Timeliness
57. Data Preprocessing
Data quality is a crucial factor in successful data mining tasks.
Data Cleaning
  Fill in missing values.
  Correct inconsistent data.
  Identify outliers and noisy data.
Data Integration
  Combine data from different sources.
Data Transformation
  Normalization
  Aggregation
  Type Conversion
Data Reduction
  Feature Selection
  Sampling
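Two of these steps, filling in missing values and normalization, can be sketched in Python. Mean imputation and min-max scaling are only one possible choice for each step, and the function names and the sample `ages` column are assumptions made for illustration.

```python
def fill_missing_with_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def min_max_normalize(values):
    """Rescale values linearly onto the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: nothing to rescale
    return [(v - lo) / (hi - lo) for v in values]

ages = [20, None, 40, 60]  # hypothetical attribute with a missing value
cleaned = fill_missing_with_mean(ages)
scaled = min_max_normalize(cleaned)
```

Cleaning comes before transformation here because the minimum and maximum used for scaling are only defined once every value is present.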
58. Review
  What is data mining?
  Why is data mining important?
  What are the typical data mining applications?
  What is the general procedure of data mining?
  What are the major techniques in data mining?
  What is the difference between data warehouses and databases?
  What to expect in this course?
  Where to find relevant information?
  How to make the most of this course?