2. Motivation: Data Flood
Data explosion problem
Automated data collection tools and mature database
technology lead to tremendous amounts of data stored in
databases, data warehouses and other information repositories.
We are drowning in data, but starving for knowledge!
Solution: Data warehousing and data mining
Data warehousing and on-line analytical processing
Extraction of interesting knowledge (rules, regularities,
patterns, constraints) from data in large databases
3. Data mining (knowledge mining from data) is an
area of research and practice that is focused on
discovering novel patterns in data using
algorithms and computers. It is good at finding
the hidden patterns of a dataset by analyzing
correlations among attribute values.
4. Today we have software that
can search through massive
data haystacks looking for lots
of interesting and usable
needles.
6. Data Mining Problems
• What other products are purchased together with a digital
camera?
– Based on previous purchases (shopping cart)
– E.g., If a digital camera is purchased, flash memory, battery, printer
are also purchased.
Association Analysis
• Similar questions:
– What products should be recommended in online stores such as
Amazon.com, movie rental, wireless themes, etc.?
– What items should be displayed together in a store?
– What genes appear together in toxic mushrooms?
7. Data Mining Problems (cont.)
• Is this student going to go to college?
– Based on Gender, ParentIncome, ParentEncouragement, IQ, etc.
– E.g., if ParentEncouragement=Yes and IQ>100, College=Yes
Classification (prediction)
• Similar questions:
– Is this a spam email? (spam filtering)
– How good/bad is your credit? (credit scoring)
– Recognition of hand-written letters (pen recognition)
– What is this gene like? (bioinformatics)
– Does this person behave like a terrorist?
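The college rule above can be written as a tiny classifier. This is a minimal sketch in R: the student records and the helper name predict_college are hypothetical. A real classifier would learn such a rule from labeled data rather than have it typed in by hand.

```r
# Hypothetical student records; the values are made up for illustration.
students <- data.frame(
  ParentEncouragement = c("Yes", "Yes", "No", "No"),
  IQ = c(115, 95, 120, 90),
  stringsAsFactors = FALSE
)

# Rule from the slide: if ParentEncouragement=Yes and IQ>100, College=Yes.
predict_college <- function(df) {
  ifelse(df$ParentEncouragement == "Yes" & df$IQ > 100, "Yes", "No")
}

students$College <- predict_college(students)
print(students)
```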
8. Data Mining Problems (cont.)
• What is the age of a person?
– Based on Hobby, MaritalStatus, NumberOfChildren, Income,
HouseOwnership, NumberOfCars, …
– E.g., If MaritalStatus=Yes, Age =
20+4*NumberOfChildren+0.0001*Income+…
Regression (prediction)
• Similar questions:
– What’s the sales amount of ice cream next month? (sales prediction)
– What’s the stock price of A next week? (stock prediction)
– What’s the income of a customer? (marketing)
– What’s the life-time of a software bug? (bug tracking)
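A regression model estimates a numeric formula like the Age example from data. The sketch below assumes synthetic, noise-free data generated only from the terms visible on the slide (the elided "…" terms are left out), and uses base R's lm() to recover the coefficients.

```r
# Synthetic data built only from the terms visible on the slide;
# the elided "…" terms are deliberately not guessed.
set.seed(1)
n <- 50
people <- data.frame(
  NumberOfChildren = sample(0:4, n, replace = TRUE),
  Income = round(runif(n, 20000, 90000))
)
people$Age <- 20 + 4 * people$NumberOfChildren + 0.0001 * people$Income

# Fit a linear regression; with noise-free data lm() recovers the coefficients.
fit <- lm(Age ~ NumberOfChildren + Income, data = people)
print(round(coef(fit), 4))

# Predict the age of a new, hypothetical person.
new_person <- data.frame(NumberOfChildren = 2, Income = 50000)
print(predict(fit, new_person))
```

With real data the fit would not be exact, and the residuals would show how far the linear formula is from the truth.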
9. Data Mining Problems (cont.)
• Who are my Web visitors?
– Identify similar groups based on demographics, visiting patterns
– E.g., Daily news readers, email users, shoppers, short-stayers, etc.
Segmentation (clustering)
• Similar questions:
– Identify groups of genes (bioinformatics)
– Identify groups of locations of Cholera incidents in London (spatial
data mining)
– Identify groups of customers of merchants (Amazon, eBay, MSN,
Walmart, etc.) (target marketing)
– Identify groups of documents (text categorization)
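Segmentation of Web visitors can be sketched with k-means, one common clustering method (the slides do not name a specific algorithm). The visitor features and group sizes below are hypothetical.

```r
# Hypothetical visitor features: pages per visit and minutes on site.
set.seed(42)
visitors <- rbind(
  cbind(pages = rnorm(20, mean = 2,  sd = 0.5),
        minutes = rnorm(20, mean = 1,  sd = 0.3)),
  cbind(pages = rnorm(20, mean = 15, sd = 1.5),
        minutes = rnorm(20, mean = 20, sd = 2.0))
)

# Scale the features so neither dominates the distance, then ask for two groups.
clusters <- kmeans(scale(visitors), centers = 2, nstart = 10)

# Compare cluster sizes and the average behavior inside each cluster.
print(table(clusters$cluster))
print(aggregate(as.data.frame(visitors),
                by = list(cluster = clusters$cluster), FUN = mean))
```

Note that the number of clusters is an input here. Choosing it is part of the exploratory work.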
10. Data Mining Problems (cont.)
• Could this network packet be from a virus
attack?
– Predict likelihood of the network packet pattern
Anomaly detection (outlier detection)
• Similar questions:
– Are the hospital lab results normal? (adverse drug effect
detection)
– Is this credit transaction fraudulent? (fraud detection)
– Does this person behave unusually, perhaps warranting a higher
level of security check?
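One simple way to flag anomalies is to measure how far each value lies from the typical value. The sketch below assumes hypothetical transaction amounts and a cutoff of 3.5 robust deviations. Both are illustrative choices, not taken from the slides.

```r
# Hypothetical transaction amounts; the last one is deliberately unusual.
amounts <- c(23, 31, 27, 25, 30, 28, 26, 29, 24, 950)

# Robust z-score: distance from the median in units of the
# median absolute deviation (MAD), which outliers barely affect.
robust_z <- abs(amounts - median(amounts)) / mad(amounts)

# Flag values more than 3.5 robust deviations from the median (assumed cutoff).
is_outlier <- robust_z > 3.5
print(amounts[is_outlier])
```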
11. Data mining and machine learning
• Machine learning focuses on creating computer algorithms
that can use pre-existing inputs to refine and improve their
own capabilities for dealing with future inputs.
• Machine learning is not exactly the same thing as data mining
and vice versa. Not all data mining techniques rely on what
researchers would consider machine learning.
• Machine learning is also used in areas such as robotics that we
don’t commonly associate with data mining.
• Data mining is an area that has taken much of its inspiration
and techniques from machine learning (and some, also, from
statistics), but is put to different ends.
12. Data mining as a step in the process of knowledge discovery.
13. • 1. Data cleaning (to remove noise and inconsistent data).
• 2. Data integration (where multiple data sources may be
combined).
• 3. Data selection (where data relevant to the analysis task
are retrieved from the database).
• 4. Data transformation (where data are transformed or
consolidated into forms appropriate for mining by
performing summary or aggregation operations, for
instance).
• 5. Data mining (an essential process where intelligent
methods are applied in order to extract data patterns)
• 6. Pattern evaluation (to identify the truly interesting
patterns representing knowledge based on some
interestingness measures)
• 7. Knowledge presentation (where visualization and
knowledge representation techniques are used to present
the mined knowledge to the user)
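Steps 1 to 4 above prepare the data before any mining happens. They can be sketched in base R as follows. The two tables and their column names are hypothetical.

```r
# Two hypothetical sources: sales records and a customer table.
sales <- data.frame(
  customer_id = c(1, 2, 2, 3, 3, NA),
  amount      = c(20, 35, NA, 15, 40, 10)
)
customers <- data.frame(
  customer_id = c(1, 2, 3),
  region      = c("North", "South", "North"),
  stringsAsFactors = FALSE
)

# 1. Data cleaning: drop rows with missing values.
clean <- na.omit(sales)

# 2. Data integration: combine the two sources on their shared key.
combined <- merge(clean, customers, by = "customer_id")

# 3. Data selection: keep only the columns relevant to the analysis.
selected <- combined[, c("region", "amount")]

# 4. Data transformation: aggregate to total sales per region.
per_region <- aggregate(amount ~ region, data = selected, FUN = sum)
print(per_region)
```

Dropping incomplete rows is only one cleaning strategy. Filling in or repairing values is often preferable when data are scarce.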
14. According to this view, data mining is only one step in the entire
process.
We agree that data mining is a step in the knowledge discovery
process. However, in industry, in media, and in the database
research milieu, the term data mining is becoming more popular
than the longer term knowledge discovery from data.
15.
16. Database, data warehouse, World Wide Web, or other information repository: This
is one or a set of databases, data warehouses, spreadsheets, or other kinds of information
repositories. Data cleaning and data integration techniques may be performed
on the data.
Database or data warehouse server: The database or data warehouse server is responsible
for fetching the relevant data, based on the user’s data mining request.
Knowledge base: This is the domain knowledge that is used to guide the search or
evaluate the interestingness of resulting patterns. Such knowledge can include concept
hierarchies.
Data mining engine: This is essential to the data mining system and ideally consists of
a set of functional modules for tasks such as characterization, association and correlation
analysis, classification, prediction, cluster analysis, outlier analysis, and evolution
analysis.
Pattern evaluation module: This component typically employs interestingness measures
and interacts with the data mining modules so as to focus the search toward interesting
patterns. It may use interestingness thresholds to filter
out discovered patterns.
User interface: This module communicates between users and the data mining system,
allowing the user to interact with the system by specifying a data mining query or
task.
17. Data mining typically consists of four processes:
1) Data preparation.
2) Exploratory data analysis.
3) Model development.
4) Interpretation of results.
18. Step 1
Data preparation involves making sure that the data are organized in the right way, that
missing data fields are filled in, that inaccurate data are located and repaired
or deleted, and that data are "recoded" as necessary to make them amenable
to the kind of analysis we have in mind.
Step 2
Exploratory data analysis means getting to know the data using histograms and other
visualization tools, and looking for preliminary hints that will guide our model choice.
The exploration process also involves figuring out the right values for key parameters.
Step 3
Choosing and developing a model is by far the most complex and most
interesting of the activities of a data miner. It is here where you test out a
selection of the most appropriate data mining techniques. Depending upon
the structure of a dataset, there may be dozens of options, and choosing the
most promising one has as much art in it as science.
Step 4
The interpretation of results focuses on making sense out of what the data
mining algorithm has produced. This is the most important step from the
perspective of the data user, because this is where an actionable conclusion is
formed.
20. Confidence: how frequently a particular pair occurs among all the
times when the first item is present.
Support: the proportion of times that a particular
pairing occurs across all shopping carts.
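Both measures can be computed directly from a list of carts. A minimal sketch in base R, assuming five hypothetical carts and the pair camera / flash memory:

```r
# Hypothetical shopping carts, one character vector per cart.
carts <- list(
  c("camera", "flash memory", "battery"),
  c("camera", "flash memory"),
  c("camera", "printer"),
  c("battery"),
  c("flash memory", "printer")
)

# Support: share of all carts that contain both items of the pair.
has_pair <- sapply(carts, function(cart) all(c("camera", "flash memory") %in% cart))
support <- mean(has_pair)

# Confidence: share of carts with the first item that also contain the second.
has_first <- sapply(carts, function(cart) "camera" %in% cart)
confidence <- sum(has_pair) / sum(has_first)

print(c(support = support, confidence = confidence))
```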
We can also evaluate a long list of these rules using a value called lift.
Lift: takes into account the support for a rule, but also gives more
weight to rules where the LHS and/or the RHS occur less
frequently. In other words, lift favors situations where LHS and RHS
are not abundant but where the relatively few occurrences always
happen together. The larger the value of lift, the more
"interesting" the rule may be.
21. We can get started with association rules mining very easily using
the R package known as "arules" and its Groceries data set, which is
ready to be analyzed. So we are skipping right to Step 2 of our
four-step process, exploratory data analysis:
> install.packages("arules")
> library("arules")
You can make the Groceries data set ready with this command:
> data(Groceries)
Run the summary() function on Groceries so that we can see what
is in there:
> summary(Groceries)
22.
23. Notes
Groceries is stored as an itemMatrix in sparse format.
It has a rectangular data structure with 9835 rows and 169
columns. It is called "sparse" because very few of these
items exist in any given grocery basket.
When an item appears in a basket, its cell contains a
one, while if an item is not in a basket, its cell contains a
zero.
Every cart has at least one item. The output also shows us
which items occur in grocery baskets most frequently.
Any non-zero amount of whole milk is represented by
a one. Other data mining techniques could take
advantage of knowing the exact amount of a product,
but association rules do not need to know that
amount.
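The one/zero encoding described in these notes can be illustrated with a small dense matrix. This is only a sketch with hypothetical baskets. The arules package stores the same information in a sparse format instead of a full matrix.

```r
# Hypothetical baskets with purchased quantities.
baskets <- list(
  c(milk = 2, yogurt = 1),
  c(bread = 1),
  c(milk = 1, bread = 3, jam = 1)
)

items <- sort(unique(unlist(lapply(baskets, names))))

# One row per basket, one column per item; any non-zero quantity becomes 1.
incidence <- t(sapply(baskets, function(b) as.integer(items %in% names(b))))
colnames(incidence) <- items
print(incidence)

# Share of cells that are 1; real basket data is far sparser than this toy.
print(mean(incidence))
```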
24. The item "yogurt" appeared in 1372 out of
9835 rows, or about 14% of cases. So we can
set the support parameter to somewhere
around 10%-15% in order to get a
manageable number of items.
An item that occurs only very rarely in the
grocery baskets is unlikely to be of much use
to us in terms of creating meaningful rules.
We want to focus our attention on items
that occur with some meaningful frequency in
the dataset.
> itemFrequencyPlot(Groceries,support=0.1)
Bar graph
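The relative item frequency behind this plot is simply the column mean of the one/zero matrix. A minimal sketch in base R, assuming a hypothetical ten-basket matrix and an illustrative cutoff of 0.3:

```r
# The yogurt figure from the summary: 1372 of 9835 baskets.
print(round(1372 / 9835, 3))

# Toy 0/1 incidence matrix (hypothetical data): rows are baskets, columns items.
incidence <- matrix(
  c(1, 0, 1,
    1, 1, 0,
    1, 0, 0,
    0, 0, 1,
    1, 0, 0,
    1, 0, 1,
    1, 0, 0,
    0, 0, 1,
    1, 0, 0,
    1, 0, 0),
  ncol = 3, byrow = TRUE,
  dimnames = list(NULL, c("milk", "caviar", "bread"))
)

# Relative frequency of each item = column mean of the 0/1 matrix.
item_freq <- colMeans(incidence)

# Keep only items at or above the minimum support.
frequent <- item_freq[item_freq >= 0.3]
print(sort(frequent, decreasing = TRUE))
```

Passing the filtered frequencies to barplot() would give a chart similar in spirit to the one drawn by itemFrequencyPlot().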
25. The term "apriori" refers to the specific algorithm that R will use to scan
the data set for appropriate rules. The Apriori algorithm is used for finding
rules in transaction data.
• Rules are in the form of "if LHS then RHS." Each rule states that when
the thing or things on the left hand side of the equation occur(s), the
thing on the right hand side occurs a certain percentage of the time.
• For example,
if Milk and Butter occur together in 10% of the grocery carts (that is
"support"), and Milk (by itself, ignoring Butter) occurs in 25% of the
carts, then the confidence of the Milk/Butter rule is 0.10/0.25 = 0.40.
> apriori(Groceries,parameter=list(support=0.005,
+ confidence=0.5))
Apriori
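The level-wise idea behind Apriori can be sketched in base R for single items and pairs. This is a simplified illustration on hypothetical carts, not the arules implementation. Lift is computed here as the pair's support divided by the product of the two item supports, which is the usual definition but is not spelled out on the slides.

```r
# Hypothetical shopping carts, one character vector per cart.
carts <- list(
  c("milk", "butter", "bread"),
  c("milk", "butter"),
  c("milk", "bread"),
  c("bread", "jam"),
  c("milk", "butter", "jam")
)
min_support    <- 0.5
min_confidence <- 0.7

# Fraction of carts that contain every item in `items`.
support <- function(items) {
  mean(sapply(carts, function(cart) all(items %in% cart)))
}

# Level 1: keep single items that meet the minimum support.
all_items <- sort(unique(unlist(carts)))
item_support <- sapply(all_items, support)
frequent_items <- names(item_support)[item_support >= min_support]

# Level 2: build candidate pairs only from frequent single items,
# since a pair cannot be frequent if one of its items is not.
pairs <- combn(frequent_items, 2, simplify = FALSE)
frequent_pairs <- Filter(function(p) support(p) >= min_support, pairs)

# Turn each frequent pair into two "if LHS then RHS" rules with quality measures.
rules <- do.call(rbind, lapply(frequent_pairs, function(p) {
  data.frame(
    lhs = p, rhs = rev(p),
    support = support(p),
    confidence = support(p) / sapply(p, support),
    lift = support(p) / (sapply(p, support) * sapply(rev(p), support)),
    stringsAsFactors = FALSE, row.names = NULL
  )
}))

# Keep only rules that reach the minimum confidence.
rules <- rules[rules$confidence >= min_confidence, ]
print(rules)
```

The real algorithm continues to larger item sets in the same way, extending only those sets that were frequent at the previous level.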
26.
27. The "minlen" and "maxlen" parameters also have
sensible defaults: these refer to the minimum and
maximum number of items in an item set.
Obviously you can’t generate a rule unless you have
at least one item in an item set.
28. Later we will examine ways of making sense out of a
large number of rules, but for now let’s agree that 15 is
too many rules to examine.
We will store the resulting rules in a
data structure called ruleset:
> ruleset <- apriori(Groceries,
+ parameter=list(support=0.01,confidence=0.5))
31. Notes
Rules 7 and 8 have the highest level of lift: the fruits
and vegetables involved in these two rules have a
relatively low frequency of occurrence, but their
support and confidence are both relatively high.
Contrast these two rules with Rule 1, which also has
high confidence, but which has low lift. The
reason for this is that milk is a frequently occurring
item, so there is not much novelty to that rule. On
the other hand, the combination of fruits, root
vegetables, and other vegetables suggests a need to
find out more about customers whose carts may
contain only vegetarian or vegan items.
32. To gain better insights, we can use a data visualization
package to help explore this possibility.
The R package called arulesViz has methods of
visualizing the rule sets generated by apriori() that
can help us examine a larger set of rules. First, install
and load the arulesViz package:
> install.packages("arulesViz")
> library(arulesViz)
34. Notes
The lift is shown by the darkness of a dot that appears
on the plot. The darker the dot, the closer the lift of
that rule is to 4.0.
The support of rules ranges from somewhere below
1% all the way up above 7%. All of the rules with high
lift seem to have support below 1%. On the other
hand, there are rules with high lift and high
confidence, which sounds quite positive.
35. Focus on a smaller set of rules that only
have the very highest levels of lift.
> goodrules <- ruleset[quality(ruleset)$lift > 3.5]
Note that the use of the square brackets
with our data structure ruleset allows
us to index only those elements that meet the condition.
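The same square-bracket indexing works on any R structure. Here is a minimal sketch using a hypothetical data frame that stands in for the table returned by quality(ruleset). The rule names and numbers are made up.

```r
# Hypothetical rule quality table, shaped like the output of quality(ruleset).
rule_quality <- data.frame(
  rule       = c("A => B", "C => D", "E => F", "G => H"),
  support    = c(0.012, 0.010, 0.015, 0.011),
  confidence = c(0.58, 0.52, 0.55, 0.60),
  lift       = c(2.3, 3.7, 1.9, 3.9),
  stringsAsFactors = FALSE
)

# The comparison produces one TRUE/FALSE value per rule...
keep <- rule_quality$lift > 3.5
print(keep)

# ...and square brackets keep only the rows where it is TRUE.
good <- rule_quality[keep, ]
print(good)
```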
> inspect(goodrules)
36.
37. Notes
There seems to be evidence that shoppers are purchasing
particular combinations of items that go together in
recipes. The first three rules really seem like soup! Rules
four and five seem like a fruit platter with dip.
We might recommend that recipes be published
along with coupons, and that popular recipes, such as for
homemade soup, have all of their ingredients
grouped together in the store along with signs saying,
"Mmmm, homemade soup!"
38. R Functions Used in This Chapter
• apriori() - Uses the algorithm of the same name to analyze a
transaction data set and generate rules.
• itemFrequencyPlot() - Shows the relative frequency of commonly
occurring items in the sparse occurrence matrix.
• inspect() - Shows the contents of the data object generated by
apriori(), that is, the association rules.
• install.packages() - Installs a package from the CRAN repository.
• summary() - Provides an overview of the contents of a data
structure.
39. REFERENCES
• Book: Introduction to Data Science
• Book: Data Mining: Concepts and Techniques,
Second Edition
• Slides: Dr. Bassel Alkteeb