2. INTRODUCTION
• New buzzword, old idea.
• Inferring new information from already
collected data.
• Traditionally the job of data analysts.
• Computers have changed this.
It is far more efficient to comb through
data with a machine than to eyeball
statistical data.
3. DEFINITION
“Data mining is the entire process of
applying computer-based methodology,
including new techniques for knowledge
discovery, from data.”
4. Two Main Components
Knowledge Discovery
Concrete information gleaned from known
data: facts you may not have known before, but
which are supported by the recorded data.
Knowledge Prediction
Uses known data to forecast future trends,
events, etc. (e.g., stock market predictions)
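
As a rough illustration of knowledge prediction, the sketch below fits a straight line to past observations by least squares and extends it one step ahead. The function names and the sales figures are made up for the example, and a straight-line trend is only an assumption; real forecasting tasks usually need richer models.

```python
def fit_line(ys):
    """Least-squares fit of y = intercept + slope * t for t = 0, 1, ..., n-1."""
    n = len(ys)
    t_mean = (n - 1) / 2
    y_mean = sum(ys) / n
    cov = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(ys))
    var = sum((t - t_mean) ** 2 for t in range(n))
    slope = cov / var
    intercept = y_mean - slope * t_mean
    return intercept, slope


def forecast(ys, steps_ahead=1):
    """Extend the fitted line steps_ahead periods past the last observation."""
    intercept, slope = fit_line(ys)
    return intercept + slope * (len(ys) - 1 + steps_ahead)


# Hypothetical monthly sales figures, invented for this example.
monthly_sales = [10.0, 12.0, 14.0, 16.0]
next_month = forecast(monthly_sales)
```

The fit needs at least two observations; with a single point the variance term is zero and the division fails.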
5. Uses of Data Mining
• AI/Machine Learning
Combinatorial/Game Data Mining
Good for analyzing winning strategies in games, and
thus for developing intelligent AI opponents (e.g., chess).
• Business Strategies
Market Basket Analysis
Identify customer demographics, preferences, and
purchasing patterns.
• Risk Analysis
Product Defect Analysis
Analyze product defect rates for given plants and
predict possible complications (read: lawsuits) down
the line.
6. (Continued)
• User Behavior Validation
Fraud Detection
Cell phones: comparing current phone
activity with historical calling records
can help detect calls made on cloned
phones.
Credit cards: similarly, comparing new
purchases with historical purchases can
detect activity on stolen cards.
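
One simple way to compare a new purchase with historical purchases is to measure how many standard deviations it lies from the customer's usual amount. This is a minimal sketch, not a production fraud model: the function name, the threshold of 3 and the purchase amounts are all assumptions chosen for illustration.

```python
from statistics import mean, stdev


def is_suspicious(history, amount, threshold=3.0):
    """Flag an amount that lies more than `threshold` standard
    deviations away from the mean of past purchase amounts."""
    mu = mean(history)
    sigma = stdev(history)  # sample standard deviation; needs >= 2 values
    if sigma == 0:
        # All past purchases were identical: anything different stands out.
        return amount != mu
    return abs(amount - mu) / sigma > threshold


# Hypothetical purchase history for one card, invented for this example.
past_purchases = [22.5, 18.0, 30.0, 25.0, 19.5, 27.0]
typical = is_suspicious(past_purchases, 24.0)
outlier = is_suspicious(past_purchases, 950.0)
```

Here the 24.0 purchase is close to the historical mean and is not flagged, while the 950.0 purchase is. Real systems also look at location, merchant and timing, not just the amount.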
7. Sources of Data for Mining
Databases (most obvious)
Text Documents
Computer Simulations
Social Networks
9. Database Processing vs. Data Mining Processing
          Database Processing     Data Mining
• Query   Well defined            Poorly defined
          SQL                     No precise query language
• Data    Operational data        Not operational data
• Output  Precise                 Fuzzy
          Subset of database      Not a subset of database
11. Basic Data Mining Tasks
• Classification maps data into predefined
groups or classes
– Supervised learning
– Pattern recognition
– Prediction
• Regression maps a data item
to a real-valued prediction variable.
• Clustering groups similar data together into
clusters.
– Unsupervised learning
– Segmentation
– Partitioning
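
To make the clustering task concrete, here is a small sketch of k-means (Lloyd's algorithm) on one-dimensional data. No class labels are supplied, which is what makes it unsupervised; a classifier would instead need labeled examples. The function name, the starting centers and the age values are assumptions made up for this example.

```python
def kmeans_1d(values, centers, iterations=10):
    """Group 1-D values around the given starting centers.

    Each pass assigns every value to its nearest center, then moves
    each center to the mean of the values assigned to it.
    """
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # An empty cluster keeps its previous center.
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters


# Hypothetical customer ages, invented for this example.
ages = [21, 23, 22, 47, 51, 49]
centers, clusters = kmeans_1d(ages, centers=[20, 50])
```

With these inputs the values split into a younger and an older group. The result of k-means depends on the starting centers, so in practice it is often run several times with different starts.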
12. (cont’d)
• Summarization maps data into subsets with
associated simple descriptions.
– Characterization
– Generalization
• Link Analysis uncovers relationships
among data.
– Affinity Analysis
– Association Rules
– Sequential Analysis determines sequential
patterns.
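
Association rules are usually scored by support (how often the items occur together) and confidence (how often the consequent appears when the antecedent does). The sketch below computes both for a rule such as "bread implies milk". The helper names and the baskets are made up for illustration.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Support of the combined itemset divided by support of the antecedent.

    Raises ZeroDivisionError if the antecedent never occurs.
    """
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical shopping baskets, invented for this example.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
bread_milk_support = support(baskets, {"bread", "milk"})
bread_to_milk = confidence(baskets, {"bread"}, {"milk"})
```

In this toy data bread and milk appear together in half of the baskets, and two of the three baskets that contain bread also contain milk.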
13. Data Mining Techniques
• Statistical
– Point Estimation
– Models Based on Summarization
– Bayes Theorem
– Hypothesis Testing
– Regression and Correlation
• Similarity Measures
• Decision Trees
• Neural Networks
– Activation Functions
• Genetic Algorithms
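
The slide does not say which similarity measures are meant, so the sketch below shows two common ones as an assumption: cosine similarity for numeric vectors and Jaccard similarity for sets. The function names and sample inputs are chosen only for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors.

    Undefined (division by zero) if either vector is all zeros.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def jaccard_similarity(a, b):
    """Size of the intersection divided by size of the union of two sets.

    Undefined (division by zero) if both sets are empty.
    """
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


# Made-up inputs for this example.
orthogonal = cosine_similarity([1, 0], [0, 1])
parallel = cosine_similarity([1, 2], [2, 4])
overlap = jaccard_similarity({"a", "b"}, {"b", "c"})
```

Both measures return values near 1 for similar items and near 0 for dissimilar ones, which makes them usable as building blocks for clustering and nearest-neighbor classification.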
14. Challenges of Data Mining
• Scalability
• Dimensionality
• Complex and Heterogeneous Data
• Data Quality
• Data Ownership and Distribution
• Privacy Preservation
• Streaming Data