Motivation: Data Flood
• Data explosion problem
– Automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses, and other information repositories.
• We are drowning in data, but starving for knowledge!
• Solution: data warehousing and data mining
– Data warehousing and on-line analytical processing
– Extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases
Data mining (knowledge mining from data) is an area of research and practice that is focused on discovering novel patterns in data using algorithms and computers. It is good at finding the hidden patterns of a dataset by analyzing correlations among attribute values.
Today we have software that
can search through massive
data haystacks looking for lots
of interesting and usable
needles.
Data Mining Tasks
• Classification
• Regression
• Segmentation
• Association analysis
• Anomaly detection
• Sequence analysis
• Time-series analysis
• Text categorization
• Advanced insights discovery
• Others
Data Mining Problems
• What other products are purchased together with a digital camera?
– Based on previous purchases (shopping cart)
– E.g., if a digital camera is purchased, flash memory, a battery, and a printer are also purchased.
→ Association analysis
• Similar questions:
– What products to recommend in on-line stores (Amazon.com, movie rental, wireless themes, etc.)?
– What items should be displayed together in a store?
– What genes appear together in toxic mushrooms?
Data Mining Problems (cont.)
• Is this student going to go to college?
– Based on Gender, ParentIncome, ParentEncouragement, IQ, etc.
– E.g., if ParentEncouragement=Yes and IQ>100, then College=Yes
→ Classification (prediction)
• Similar questions:
– Is this a spam email? (spam filtering)
– How good/bad is your credit? (credit scoring)
– Recognition of hand-written letters (pen recognition)
– What is this gene like? (bioinformatics)
– Does this person behave like a terrorist?
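The example rule on this slide can be written down directly as code. This is only a minimal sketch: the column names follow the slide, while the three students and the helper name predict_college() are made up for illustration. A real classifier would learn such a rule from data rather than have it typed in by hand.

```r
# Toy rule-based classifier mirroring the slide's example rule.
# The three students below are invented for illustration.
students <- data.frame(
  ParentEncouragement = c("Yes", "Yes", "No"),
  IQ = c(115, 95, 120),
  stringsAsFactors = FALSE
)

# Hypothetical helper: applies "if ParentEncouragement=Yes and IQ>100, College=Yes".
predict_college <- function(df) {
  ifelse(df$ParentEncouragement == "Yes" & df$IQ > 100, "Yes", "No")
}

students$College <- predict_college(students)
print(students)
```

Only the first student satisfies both conditions of the rule, so only that row is predicted to go to college.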
Data Mining Problems (cont.)
• What is the age of a person?
– Based on Hobby, MaritalStatus, NumberOfChildren, Income, HouseOwnership, NumberOfCars, …
– E.g., if MaritalStatus=Yes, then Age = 20 + 4*NumberOfChildren + 0.0001*Income + …
→ Regression (prediction)
• Similar questions:
– What’s the sales amount of ice cream next month? (sales prediction)
– What’s the stock price of A next week? (stock prediction)
– What’s the income of a customer? (marketing)
– What’s the lifetime of a software bug? (bug tracking)
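A regression model estimates the coefficients of a formula like the one on this slide from data. The sketch below is an assumption-laden toy: the six people are invented, their ages are generated from only the two terms the slide spells out (the elided terms are left out), and base R's lm() is used to show that the coefficients can be recovered.

```r
# Toy data, made up for illustration. Ages are generated exactly from the
# two terms the slide spells out; the elided terms are not filled in.
people <- data.frame(
  NumberOfChildren = c(0, 1, 2, 3, 0, 2),
  Income = c(30000, 50000, 40000, 80000, 60000, 20000)
)
people$Age <- 20 + 4 * people$NumberOfChildren + 0.0001 * people$Income

# Fit a linear regression; with noise-free data lm() recovers the coefficients.
fit <- lm(Age ~ NumberOfChildren + Income, data = people)
print(coef(fit))

# Predict the age of a new (hypothetical) person.
new_person <- data.frame(NumberOfChildren = 1, Income = 45000)
predicted_age <- predict(fit, newdata = new_person)
print(predicted_age)
```

With real data the fit would not be exact, and the quality of the prediction would have to be checked on cases the model has not seen.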
Data Mining Problems (cont.)
• Who are my Web visitors?
– Identify similar groups based on demographics, visiting patterns
– E.g., daily news readers, email users, shoppers, short-stayers, etc.
→ Segmentation (clustering)
• Similar questions:
– Identify groups of genes (bioinformatics)
– Identify groups of locations of Cholera incidents in London (spatial data mining)
– Identify groups of customers of merchants (Amazon, E-Bay, MSN, WalMart, etc.) (target marketing)
– Identify groups of documents (text categorization)
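Segmentation can be tried out with base R's kmeans() function. This is a sketch under stated assumptions: the visitor data and column names are invented, k-means is just one of several clustering techniques, and the number of groups (two) is chosen by hand.

```r
# Invented Web visitor data: three short-stayers and three heavy users.
visitors <- data.frame(
  visits_per_week   = c(1, 2, 1, 20, 22, 21),
  minutes_per_visit = c(2, 3, 2, 30, 28, 33)
)

set.seed(42)  # k-means starts from random centers; fix the seed for repeatability
fit <- kmeans(visitors, centers = 2, nstart = 10)

# Cluster labels are arbitrary numbers; what matters is who shares a label.
visitors$segment <- fit$cluster
print(visitors)
```

No labels are supplied in advance: the algorithm groups visitors only by how similar their rows are, which is what separates segmentation from classification.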
Data Mining Problems (cont.)
• Could this network packet be from a virus attack?
– Predict likelihood of the network packet pattern
→ Anomaly detection (outlier detection)
• Similar questions:
– Are the hospital lab results normal? (adverse drug effect detection)
– Is this credit transaction fraudulent? (fraud detection)
– Does this person behave unusually, perhaps warranting a high-level security check?
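One very simple way to flag anomalies is to measure how far each value lies from the mean, in standard deviations (a z-score). The sketch below assumes invented packet sizes and a hand-picked cutoff of 2; it illustrates the idea only and is not the method behind any particular intrusion or fraud detection system.

```r
# Invented packet sizes; the last one is deliberately unusual.
packet_sizes <- c(10, 11, 9, 10, 12, 10, 11, 95)

# z-score: distance from the mean in units of standard deviation.
z <- (packet_sizes - mean(packet_sizes)) / sd(packet_sizes)

# Assumed cutoff: anything more than 2 standard deviations away is flagged.
outliers <- which(abs(z) > 2)
print(packet_sizes[outliers])
```

A single extreme value also inflates the mean and the standard deviation, so more robust measures may be preferable on real data.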
Data mining and machine learning
• Machine learning focuses on creating computer algorithms
that can use pre-existing inputs to refine and improve their
own capabilities for dealing with future inputs.
• Machine learning is not exactly the same thing as data mining
and vice versa. Not all data mining techniques rely on what
researchers would consider machine learning.
• Machine learning is used in areas like robotics that we don’t
commonly think of when we are thinking of data mining as
such.
• Data mining is an area that has taken much of its inspiration
and techniques from machine learning (and some, also, from
statistics), but is put to different ends.
Data mining as a step in the process of knowledge discovery
1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be combined)
3. Data selection (where data relevant to the analysis task are retrieved from the database)
4. Data transformation (where data are transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations, for instance)
5. Data mining (an essential process where intelligent methods are applied in order to extract data patterns)
6. Pattern evaluation (to identify the truly interesting patterns representing knowledge based on some interestingness measures)
7. Knowledge presentation (where visualization and knowledge representation techniques are used to present the mined knowledge to the user)
According to this view, data mining is only one step in the entire process.
We agree that data mining is a step in the knowledge discovery process. However, in industry, in media, and in the database research milieu, the term data mining is becoming more popular than the longer term knowledge discovery from data.
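The preparatory steps of the knowledge discovery process (cleaning, integration, selection, transformation) can be sketched in a few lines of base R. Everything here is an assumption made for illustration: the two tiny tables, their column names, and the choice of na.omit(), merge(), and aggregate() as stand-ins for each step.

```r
# Two invented data sources.
sales <- data.frame(customer_id = c(1, 2, 2, 3, NA),
                    amount      = c(10, 25, 5, NA, 8))
customers <- data.frame(customer_id = c(1, 2, 3),
                        segment     = c("student", "family", "family"))

# 1. Data cleaning: drop rows with missing values.
cleaned <- na.omit(sales)

# 2. Data integration: combine the two sources on their shared key.
integrated <- merge(cleaned, customers, by = "customer_id")

# 3. Data selection: keep only the columns relevant to the analysis.
selected <- integrated[, c("segment", "amount")]

# 4. Data transformation: aggregate amounts per segment.
transformed <- aggregate(amount ~ segment, data = selected, FUN = sum)
print(transformed)
```

The resulting table is the kind of consolidated input on which the mining step itself would then run.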
• Database, data warehouse, World Wide Web, or other information repository: This is one or a set of databases, data warehouses, spreadsheets, or other kinds of information repositories. Data cleaning and data integration techniques may be performed on the data.
• Database or data warehouse server: The database or data warehouse server is responsible for fetching the relevant data, based on the user’s data mining request.
• Knowledge base: This is the domain knowledge that is used to guide the search or evaluate the interestingness of resulting patterns. Such knowledge can include concept hierarchies.
• Data mining engine: This is essential to the data mining system and ideally consists of a set of functional modules for tasks such as characterization, association and correlation analysis, classification, prediction, cluster analysis, outlier analysis, and evolution analysis.
• Pattern evaluation module: This component typically employs interestingness measures and interacts with the data mining modules so as to focus the search toward interesting patterns. It may use interestingness thresholds to filter out discovered patterns.
• User interface: This module communicates between users and the data mining system, allowing the user to interact with the system by specifying a data mining query or task.
Data mining typically consists of four processes:
1) Data preparation
2) Exploratory data analysis
3) Model development
4) Interpretation of results
• Step 1, data preparation, involves making sure that the data are organized in the right way, that missing data fields are filled in, that inaccurate data are located and repaired or deleted, and that data are "recoded" as necessary to make them amenable to the kind of analysis we have in mind.
• Step 2, exploratory data analysis, means getting to know the data using histograms and other visualization tools, and looking for preliminary hints that will guide our model choice. The exploration process also involves figuring out the right values for key parameters.
• Step 3, choosing and developing a model, is by far the most complex and most interesting of the activities of a data miner. It is here where you test out a selection of the most appropriate data mining techniques. Depending upon the structure of a dataset, there may be dozens of options, and choosing the most promising one has as much art in it as science.
• Step 4, the interpretation of results, focuses on making sense out of what the data mining algorithm has produced. This is the most important step from the perspective of the data user, because this is where an actionable conclusion is formed.
Association Rules Mining
Support: the proportion of all shopping carts in which a particular pairing occurs.
Confidence: how frequently a particular pair occurs among all the times when the first item is present.
We can also evaluate a long list of these rules using a value called lift.
Lift: takes into account the support for a rule, but also gives more weight to rules where the LHS and/or the RHS occur less frequently. It is the confidence of the rule divided by the overall frequency of the RHS, so a lift of 1 means the LHS and RHS occur together no more often than chance would predict. In other words, lift favors situations where LHS and RHS are not abundant but where the relatively few occurrences always happen together. The larger the value of lift, the more "interesting" the rule may be.
We can get started with association rules mining very easily using the R package known as "arules" and its Groceries data set, which is ready to be analyzed. So we can skip right to Step 2 of our four-step process, exploratory data analysis:
> install.packages("arules")
> library("arules")
You can make the Groceries data set ready with this command:
> data(Groceries)
Run the summary() function on Groceries so that we can see what is in there:
> summary(Groceries)
Notes
• Groceries is an itemMatrix object in sparse format: a rectangular data structure with 9835 rows and 169 columns. It is called "sparse" because very few of these items exist in any given grocery basket. When an item appears in a basket, its cell contains a one; if an item is not in a basket, its cell contains a zero.
• Every cart has at least one item. The output also shows us which items occur in grocery baskets most frequently.
• Any non-zero amount of whole milk is represented by a one. Other data mining techniques could take advantage of knowing the exact amount of a product, but association rules mining does not need to know that amount.
• The item "yogurt" appeared in 1372 out of 9835 rows, or about 14% of cases. So we can set the support parameter to somewhere around 10%-15% in order to get a manageable number of items.
• An item that occurs only very rarely in the grocery baskets is unlikely to be of much use to us in terms of creating meaningful rules. We want to focus our attention on items that occur with some meaningful frequency in the dataset.
> itemFrequencyPlot(Groceries,support=0.1)
[Figure: bar graph of the relative frequency of items with support of at least 0.1]
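The item frequencies discussed in these notes are simple proportions, which can be checked by hand. The sketch below uses the yogurt counts quoted above plus a tiny invented 0/1 basket matrix; the matrix, its item names, and the 0.5 cutoff are assumptions for illustration and are not taken from the Groceries data.

```r
# Yogurt: 1372 of 9835 baskets, the figures quoted in the notes above.
yogurt_support <- 1372 / 9835
print(round(yogurt_support, 4))

# Invented 0/1 incidence matrix: rows are baskets, columns are items.
m <- matrix(c(1, 0, 1,
              1, 1, 0,
              1, 0, 0,
              0, 0, 1),
            nrow = 4, byrow = TRUE,
            dimnames = list(NULL, c("whole milk", "yogurt", "rolls")))

# Relative frequency of each item = share of baskets containing it.
freq <- colMeans(m)

# Keep only items at or above an assumed support threshold of 0.5.
frequent <- freq[freq >= 0.5]
print(frequent)
```

Filtering by a minimum frequency is the same idea that the support argument applies before the bar graph is drawn.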
The term "apriori" refers to the specific algorithm that R will use to scan the data set for appropriate rules. The Apriori algorithm is used for finding rules in transaction data.
• Rules are in the form of "if LHS then RHS": each rule states that when the thing or things on the left-hand side of the rule occur(s), the thing on the right-hand side occurs a certain percentage of the time.
• For example, if Milk and Butter occur together in 10% of the grocery carts (that is "support"), and Milk (by itself, ignoring Butter) occurs in 25% of the carts, then the confidence of the Milk/Butter rule is 0.10/0.25 = 0.40.
> apriori(Groceries,parameter=list(support=0.005,confidence=0.5))
[Figure: output of apriori()]
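The Milk/Butter arithmetic above can be reproduced by counting baskets directly. This is a minimal base R sketch, not how apriori() works internally: the 20 baskets are invented so that the proportions match the example (10% contain both items, 25% contain milk), and the helper name contains() is hypothetical.

```r
# 20 invented baskets: 2 with milk and butter, 3 with milk only,
# 2 with butter only, 13 with neither.
baskets <- c(
  rep(list(c("milk", "butter")), 2),
  rep(list("milk"), 3),
  rep(list("butter"), 2),
  rep(list("bread"), 13)
)

# Hypothetical helper: TRUE for each basket that holds all the given items.
contains <- function(items) {
  vapply(baskets, function(b) all(items %in% b), logical(1))
}

support    <- mean(contains(c("milk", "butter")))   # share of carts with both
confidence <- support / mean(contains("milk"))      # among carts with milk
lift       <- confidence / mean(contains("butter")) # relative to butter overall

print(c(support = support, confidence = confidence, lift = lift))
```

In this toy data butter appears in 20% of all carts but in 40% of carts with milk, so the rule has a lift of 2.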
• The "minlen" and "maxlen" parameters also have sensible defaults: these refer to the minimum and maximum number of items in an item set.
• Obviously you can’t generate a rule unless you have at least one item in an item set.
Later we will examine ways of making sense of a large number of rules, but for now let’s agree that this run returns too many rules to examine. Raising the support threshold to 0.01 leaves 15 rules, which we store in a data structure called ruleset:
> ruleset <- apriori(Groceries,parameter=list(support=0.01,confidence=0.5))
The inspect() command shows the rules:
> inspect(ruleset)
Notes
• Rules 7 and 8 have the highest level of lift: the fruits and vegetables involved in these two rules have a relatively low frequency of occurrence, but their support and confidence are both relatively high.
• Contrast these two rules with Rule 1, which also has high confidence, but which has low support. The reason for this is that milk is a frequently occurring item, so there is not much novelty to that rule. On the other hand, the combination of fruits, root vegetables, and other vegetables suggests a need to find out more about customers whose carts may contain only vegetarian or vegan items.
• To gain better insights, we can use a data visualization package to help explore this possibility.
• The R package called arulesViz has methods of visualizing the rule sets generated by apriori() that can help us examine a larger set of rules. First, install and load the arulesViz package:
> install.packages("arulesViz")
> library(arulesViz)
> ruleset <- apriori(Groceries,parameter=list(support=0.005,confidence=0.35))
This generates 357 rules.
> plot(ruleset)
Notes
• The lift is shown by the darkness of a dot that appears on the plot. The darker the dot, the closer the lift of that rule is to 4.0.
• The support of rules ranges from somewhere below 1% all the way up above 7%. All of the rules with high lift seem to have support below 1%. On the other hand, there are rules with high lift and high confidence, which sounds quite positive.
Let’s focus on a smaller set of rules that only have the very highest levels of lift:
> goodrules <- ruleset[quality(ruleset)$lift > 3.5]
Note that the use of the square brackets with our data structure ruleset allows us to index only those elements that meet the condition, here a lift above 3.5.
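Indexing with a condition inside square brackets is ordinary R behavior and can be tried on a plain data frame. In the sketch below the five rules and their quality measures are invented for illustration; the data frame stands in for a rule set only loosely and does not reproduce the arules rules object.

```r
# Hypothetical quality measures for five rules (numbers made up).
rule_quality <- data.frame(
  rule       = c("r1", "r2", "r3", "r4", "r5"),
  support    = c(0.012, 0.006, 0.007, 0.020, 0.005),
  confidence = c(0.55, 0.40, 0.62, 0.36, 0.45),
  lift       = c(2.1, 3.7, 3.9, 1.4, 3.6),
  stringsAsFactors = FALSE
)

# The comparison yields one TRUE/FALSE per rule ...
keep <- rule_quality$lift > 3.5

# ... and the square brackets keep only the rows marked TRUE.
good <- rule_quality[keep, ]

# Sort the survivors so the highest lift comes first.
good <- good[order(-good$lift), ]
print(good)
```

The same pattern of building a logical vector and then indexing with it is what selects the high-lift rules.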
> inspect(goodrules)
Notes
• There seems to be evidence that shoppers are purchasing particular combinations of items that go together in recipes. The first three rules really seem like soup! Rules four and five seem like a fruit platter with dip.
• We might recommend that recipes be published along with coupons, and that popular recipes, such as homemade soup, have all of their ingredients grouped together in the store along with signs saying, "Mmmm, homemade soup!"
R Functions Used in This Chapter
• apriori() - Uses the algorithm of the same name to analyze a transaction data set and generate rules.
• itemFrequencyPlot() - Shows the relative frequency of commonly occurring items in the sparse occurrence matrix.
• inspect() - Shows the contents of the data object generated by apriori(), that is, the association rules.
• install.packages() - Downloads and installs a package from the CRAN repository.
• summary() - Provides an overview of the contents of a data structure.
REFERENCES
• Book: Introduction to Data Science
• Book: Data Mining: Concepts and Techniques, Second Edition
Slides: Dr. Bassel Alkteeb
THANK YOU

Data mining

  • 1.
  • 2. Motivation : Data Flood Data explosion problem Automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses and other information repositories. We are drowning in data, but starving for knowledge! Solution: Data warehousing and data mining Data warehousing and on-line analytical processing Extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases
  • 3. Data Mining(knowledge mining from data) is an area of research and practice that is focused on discovering novel patterns in data using algorithms and computer , it is good at finding the hidden patterns of a dataset by analyzing correlations among attribute values.
  • 4. Today we have software that can search through massive data haystacks looking for lots of interesting and usable needles.
  • 5. Data Mining Tasks • Classification • Regression • Segmentation • Association Analysis • Anomaly detection • Sequence Analysis • Time-series Analysis • Text categorization • Advanced insights discovery • Others
  • 6. Data Mining Problems • What other products are purchased together with a digital camera? – Based on previous purchases (shopping cart) – E.g., If a digital camera is purchased, flash memory, battery, printer are also purchased.  Association Analysis • Similar questions: – What products to recommend in on-line stores such as Amazon.com, movie rental, wireless themes, etc. – What items should be displayed together in merchant. – What genes appear together in toxic mushrooms.
  • 7. Data Mining Problems (cont.) • Is this student going to go to a college? – Based on Gender, ParentIncome, ParentEncouragement, IQ, etc. – E.g., if ParentEncouragement=Yes and IQ>100, College=Yes  Classification (prediction) • Similar questions: – Is this a spam email? (spam filtering) – How good/bad is your credit? (credit scoring) – Recognition of hand-written letters (pen recognition) – What is this gene like? (bioinformatics) – Does this person behave like a terrorist?
  • 8. Data Mining Problems (cont.) • What is the age of a person? – Based on Hobby, MaritalStatus, NumberOfChildren, Income, HouseOwnership, NumberOfCars, … – E.g., If MaritalStatus=Yes, Age = 20+4*NumberOfChildren+0.0001*Income+…  Regression (prediction) • Similar questions: – What’s the sales amount of ice cream next month? (sales prediction) – What’s the stock price of A next week? (stock prediction) – What’s the income of a customer? (marketing) – What’s the life-time of a software bug? (bug tracking)
  • 9. Data Mining Problems (cont.) • Who are my Web visitor? – Identify similar groups based on demographics, visiting patterns – E.g., Daily news readers, email users, shoppers, short-stayers, etc  Segmentation (clustering) • Similar questions: – Identify groups of genes (bioinformatics) – Identify groups of locations of Cholera incidents in London (spatial data mining) – Identify group of customers in merchants (Amazon, E-Bay, MSN, WalMart, etc) (target marketing) – Identify groups of documents. (text categorization)
  • 10. Data Mining Problems (cont.) • Could this network packet be from a virus attack? – Predict likelihood of the network packet pattern  Anomaly detection (outlier detection) • Similar questions: – Are the hospital lab results normal (Adverse drug effect detection) – Is this credit transaction fraudulent? (fraud detection) – Does this person behave unusual, maybe worth high-level of security clearance?
  • 11. Data mining and machine learning • Machine learning focuses on creating computer algorithms that can use pre-existing inputs to refine and improve their own capabilities for dealing with future inputs. • Machine learning is not exactly the same thing as data mining and vice versa. Not all data mining techniques rely on what researchers would consider machine learning. • machine learning is used in areas like robotics that we don’t commonly think of when we are thinking of data mining as such. • Data mining is an area that has taken much of its inspiration and techniques from machine learning (and some, also, from statistics), but is put to different ends.
  • 12. Data mining as a step in the process of knowledge discovery.
  • 13. • 1. Data cleaning (to remove noise and inconsistent data). • 2. Data integration (where multiple data sources may be combined). • 3. Data selection (where data relevant to the analysis task are retrieved from the database). • 4. Data transformation (where data are transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations, for instance). • 5. Data mining (an essential process where intelligent methods are applied in order to extract data patterns) • 6. Pattern evaluation (to identify the truly interesting patterns representing knowledge based on some interestingness measures) • 7. Knowledge presentation (where visualization and knowledge representation techniques are used to present the mined knowledge to the user)
  • 14. according to this view, data mining is only one step in the entire process . We agree that data mining is a step in the knowledge discovery process. However, in industry, in media, and in the database research milieu, the term data mining is becoming more popular than the longer term of knowledge discovery from data.
  • 15.
  • 16.  Database, data warehouse ,WorldWideWeb, or other information repository: This is one or a set of databases, data warehouses, spreadsheets, or other kinds of information repositories. Data cleaning and data integration techniques may be performed on the data.  Database or data warehouse server: The database or data warehouse server is responsible for fetching the relevant data, based on the user’s data mining request.  Knowledge base: This is the domain knowledge that is used to guide the search or evaluate the interestingness of resulting patterns. Such knowledge can include concept Hierarchies.  Data mining engine: This is essential to the data mining system and ideally consists of a set of functional modules for tasks such as characterization, association and correlation analysis, classification, prediction, cluster analysis, outlier analysis, and evolution analysis.  Pattern evaluation module: This component typically employs interestingness measures and interacts with the data mining modules so as to focus the search toward interesting patterns . It may use interestingness thresholds to filter out discovered patterns.  User interface: This module communicates between users and the data mining system, allowing the user to interact with the system by specifying a data mining query or Task.
  • 17. Data mining typically consists of four processes: 1) data preparation. 2) exploratory data analysis. 3) model development. 4) Interpretation of results.
  • 18.  Step 1 involves making sure that the data are organized in the right way , that missing data fields are filled in, that inaccurate data are located and repaired or deleted, and that data are "recoded" as necessary to make them amenable to the kind of analysis we have in mind.  step2 getting to know the data using histograms and other visualization tools, and looking for preliminary hints that will guide our model choice. The exploration process also involves figuring out the right values for key parameters.  Step 3 choosing and developing a model - is by far the most complex and most interesting of the activities of a data miner. It is here where you test out a selection of the most appropriate data mining techniques. Depending upon the structure of a dataset, there may be dozens of options, and choosing the most promising one has as much art in it as science.  Step 4 the interpretation of results - focuses on making sense out of what the data mining algorithm has produced. This is the most important step from the perspective of the data user, because this is where an actionable conclusion is formed.
  • 20. Confidence: how frequently a particular pair occurs among all the times when the first item (the LHS) is present. Support: the proportion of times that a particular pairing occurs across all shopping carts. To evaluate a long list of these rules, we can also use a value called lift. Lift takes into account the support for a rule, but also gives more weight to rules where the LHS and/or the RHS occur less frequently. In other words, lift favors situations where LHS and RHS are not abundant but where the relatively few occurrences always happen together. The larger the value of lift, the more "interesting" the rule may be.
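The three measures can be checked by hand on a small example. The sketch below uses a made-up list of ten carts (hypothetical data, not the Groceries set) and a helper function `support()` that is introduced here only for illustration. The slides describe lift only in words; the sketch assumes the usual definition, confidence divided by the support of the RHS.

```r
# Toy shopping carts (hypothetical data, for illustration only)
carts <- list(
  c("milk", "butter", "bread"),
  c("milk", "butter"),
  c("milk", "bread"),
  c("bread", "eggs"),
  c("milk", "eggs"),
  c("butter", "bread"),
  c("milk", "butter", "eggs"),
  c("bread"),
  c("milk"),
  c("eggs")
)

# Proportion of carts that contain every item in `items`
support <- function(items, carts) {
  mean(vapply(carts, function(cart) all(items %in% cart), logical(1)))
}

lhs <- "milk"
rhs <- "butter"

rule_support    <- support(c(lhs, rhs), carts)            # both items together
rule_confidence <- rule_support / support(lhs, carts)     # pair, given the LHS
rule_lift       <- rule_confidence / support(rhs, carts)  # relative to RHS base rate

print(c(support = rule_support,
        confidence = rule_confidence,
        lift = rule_lift))
```

A lift above 1 means the RHS shows up more often in carts containing the LHS than it does in carts overall.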
  • 21. We can get started with association rules mining very easily using the R package known as "arules" and its built-in Groceries data set, which is ready to be analyzed. So we are skipping right to Step 2 in our four-step process, exploratory data analysis. Install and load the package: > install.packages("arules") > library("arules") You can make the Groceries data set ready with this command: > data(Groceries) Run the summary() function on Groceries so that we can see what is in there: > summary(Groceries)
  • 23. Notes • Groceries is an itemMatrix object in sparse format. It is a rectangular data structure with 9835 rows and 169 columns. It is called "sparse" because very few of these items exist in any given grocery basket. When an item appears in a basket, its cell contains a one, while if an item is not in a basket, its cell contains a zero. • Every cart has at least one item. The output also shows us which items occur in grocery baskets most frequently. • Any non-zero amount of whole milk is represented by a one. Other data mining techniques could take advantage of knowing the exact amount of a product, but association rules mining does not need to know that amount.
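To make the one/zero layout concrete, here is a minimal base R sketch that builds the same kind of basket-by-item matrix from a handful of made-up baskets. The baskets are hypothetical, and the matrix is stored densely for readability, whereas arules keeps Groceries in a sparse representation.

```r
# Five hypothetical baskets (not taken from the Groceries data)
baskets <- list(
  c("whole milk", "yogurt"),
  c("rolls/buns"),
  c("whole milk", "rolls/buns", "soda"),
  c("yogurt"),
  c("whole milk")
)

# One column per distinct item
items <- sort(unique(unlist(baskets)))

# One row per basket: 1 if the item is in the basket, 0 otherwise
incidence <- t(vapply(baskets,
                      function(b) as.integer(items %in% b),
                      integer(length(items))))
colnames(incidence) <- items

print(incidence)

# Share of cells that are zero: a rough measure of how sparse the matrix is
sparsity <- mean(incidence == 0)
```

With 169 items and only a few items per cart, almost every cell of the real matrix is zero, which is why a sparse format saves so much space.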
  • 24. • The item "yogurt" appeared in 1372 out of 9835 rows, or about 14% of cases. So we can set the support parameter to somewhere around 10%-15% in order to get a manageable number of items. • An item that occurs only very rarely in the grocery baskets is unlikely to be of much use to us in terms of creating meaningful rules. We want to focus our attention on items that occur with some meaningful frequency in the dataset. > itemFrequencyPlot(Groceries, support=0.1) This produces a bar graph of the items that meet the support threshold.
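The thresholding idea can be sketched in base R. The yogurt figure below is the one quoted in the slide; the other counts and the basket total of 1000 are invented for illustration and are not Groceries values.

```r
# Figure quoted in the slide: yogurt appears in 1372 of 9835 baskets
yogurt_support <- 1372 / 9835
print(round(yogurt_support, 2))

# Hypothetical item counts out of a hypothetical 1000 baskets
n_baskets <- 1000
item_counts <- c("whole milk" = 256, "yogurt" = 140, "soda" = 174,
                 "candles" = 9, "baby food" = 1)

# Relative frequency (support) of each single item
item_support <- item_counts / n_baskets

# Keep only items that reach the 10% support threshold, most frequent first
frequent <- sort(item_support[item_support >= 0.1], decreasing = TRUE)
print(frequent)
```

Rare items such as the last two drop out, which is the same filtering that the support argument of itemFrequencyPlot() applies before drawing the bars.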
  • 25. The term "apriori" refers to the specific algorithm that R will use to scan the data set for appropriate rules. The Apriori algorithm is used for finding rules in transaction data. • Rules are in the form of "if LHS then RHS." Each rule states that when the thing or things on the left-hand side of the rule occur(s), the thing on the right-hand side occurs a certain percentage of the time. • For example, if Milk and Butter occur together in 10% of the grocery carts (that is "support"), and Milk (by itself, ignoring Butter) occurs in 25% of the carts, then the confidence of the Milk/Butter rule is 0.10/0.25 = 0.40. > apriori(Groceries, parameter=list(support=0.005, confidence=0.5))
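As a script rather than console input, the call above might look like the following sketch. It assumes the arules package is already installed; the variable name `rules` is chosen here for illustration.

```r
# install.packages("arules")  # run once if the package is not installed
library(arules)

data(Groceries)  # example transaction data shipped with arules

# Mine rules with at least 0.5% support and 50% confidence
rules <- apriori(Groceries,
                 parameter = list(support = 0.005, confidence = 0.5))

summary(rules)           # number of rules and spread of the quality measures
inspect(head(rules, 3))  # show the first few rules in "{LHS} => {RHS}" form
```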
  • 27. • The "minlen" and "maxlen" parameters also have sensible defaults: these refer to the minimum and maximum number of items in an item set. • Obviously you can't generate a rule unless you have at least one item in an item set.
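Both parameters can also be set explicitly in the same parameter list. The sketch below is one possible setting, not the one used in the slides: it assumes the arules package is installed, and the values 2 and 3 are chosen only to illustrate the effect.

```r
library(arules)
data(Groceries)

# minlen and maxlen count the items on both sides of a rule (LHS plus RHS).
# minlen = 2 asks for at least one item on each side; maxlen = 3 caps the size.
short_rules <- apriori(Groceries,
                       parameter = list(support = 0.005, confidence = 0.5,
                                        minlen = 2, maxlen = 3))

# How many rules of each length were found
print(table(size(short_rules)))
```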
  • 28. Later we will examine ways of making sense out of a large number of rules, but for now let's agree that 15 is too many rules to examine. We will store the resulting rules in a data structure called ruleset: > ruleset <- apriori(Groceries, parameter=list(support=0.01, confidence=0.5))
  • 31. Notes • Rules 7 and 8 have the highest level of lift: the fruits and vegetables involved in these two rules have a relatively low frequency of occurrence, but their support and confidence are both relatively high. • Contrast these two rules with Rule 1, which also has high confidence, but which has low lift. The reason for this is that milk is a frequently occurring item, so there is not much novelty to that rule. On the other hand, the combination of fruits, root vegetables, and other vegetables suggests a need to find out more about customers whose carts may contain only vegetarian or vegan items.
  • 32. • To gain better insights, we can use a data visualization package to help explore this possibility. • The R package called arulesViz has methods of visualizing the rule sets generated by apriori() that can help us examine a larger set of rules. First, install and load the arulesViz package: > install.packages("arulesViz") > library(arulesViz)
  • 34. Notes • The lift is shown by the darkness of a dot that appears on the plot. The darker the dot, the closer the lift of that rule is to 4.0. • The support of rules ranges from somewhere below 1% all the way up above 7%, and all of the rules with high lift seem to have support below 1%. On the other hand, there are rules with high lift and high confidence, which sounds quite positive.
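A plot of this kind can be produced with the plot() method that arulesViz provides for rule sets. The sketch below is an assumption about how the figure was made, since the plotting command itself does not appear in these notes. Because the notes mention rules with support below 1%, it mines the rules again with the lower 0.005 support threshold. Both packages are assumed to be installed.

```r
library(arules)
library(arulesViz)

data(Groceries)

# Assumed rule set: support 0.005, so rules below 1% support are included
ruleset <- apriori(Groceries,
                   parameter = list(support = 0.005, confidence = 0.5))

# Scatter plot: support on x, confidence on y, darker points for higher lift
plot(ruleset, measure = c("support", "confidence"), shading = "lift")
```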
  • 35. Focus on a smaller set of rules that only have the very highest levels of lift: > goodrules <- ruleset[quality(ruleset)$lift > 3.5] Note that the use of square brackets with our data structure ruleset allows us to index only those elements that meet the condition. > inspect(goodrules)
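Put together as a script, the filtering step might look like the sketch below. It assumes the arules package is installed and that ruleset was mined with the 0.005 support threshold, since a lift cutoff of 3.5 only makes sense for a rule set that contains lifts near 4.0. The sort() call is an addition for readability and is not part of the original commands.

```r
library(arules)
data(Groceries)

# Assumed rule set: support 0.005, confidence 0.5
ruleset <- apriori(Groceries,
                   parameter = list(support = 0.005, confidence = 0.5))

# quality() returns a data frame of measures, one row per rule;
# the logical vector inside the brackets keeps only rules with lift above 3.5
goodrules <- ruleset[quality(ruleset)$lift > 3.5]

# Optional: order the surviving rules from highest to lowest lift
goodrules <- sort(goodrules, by = "lift")

inspect(goodrules)
```

The same selection can be written as subset(ruleset, lift > 3.5), which some readers find easier to scan.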
  • 37. Notes • There seems to be evidence that shoppers are purchasing particular combinations of items that go together in recipes. The first three rules really seem like soup! Rules four and five seem like a fruit platter with dip. • We might recommend that recipes be published along with coupons, and that popular recipes, such as homemade soup, have all of their ingredients grouped together in the store along with signs saying, "Mmmm, homemade soup!"
  • 38. R Functions Used in This Chapter • apriori() - Uses the algorithm of the same name to analyze a transaction data set and generate rules. • itemFrequencyPlot() - Shows the relative frequency of commonly occurring items in the sparse occurrence matrix. • inspect() - Shows the contents of the data object generated by apriori(), that is, the association rules. • install.packages() - Downloads and installs a package from the CRAN repository. • summary() - Provides an overview of the contents of a data structure.
  • 39. REFERENCES • Book: Introduction to Data Science • Book: Data Mining: Concepts and Techniques, Second Edition • Slides: Dr. Bassel Alkteeb