2. Myth #1: Data mining provides instant
crystal-ball predictions
Data mining is neither a crystal ball nor a
technology where answers magically appear after
pushing a single button.
It's a multi-step process that includes: defining the
business problem, exploring and conditioning data,
developing the model, and deploying the
knowledge gained.
Typically, companies spend the bulk of their time
preprocessing and conditioning the data to make
sure it is clean, consistent, and combined properly
to deliver business intelligence on which they can
rely. Data mining is all about the data -- successful
data mining requires data that accurately reflects
the business.
3. Myth #2: Data mining is not yet viable
for business application
Data mining is a viable technology and highly prized
for its business results.
The myth tends to be perpetuated by those who
need to explain why they are not yet using the
process and revolves around two related
statements.
4. Myth #3: Data mining requires a data
warehouse
It is true that data mining can benefit from warehoused
data that is well organized, relatively clean, and easy to
access.
This is particularly true if the warehouse has been
constructed with data mining specifically in mind and
with knowledge of the requirements of the data mining
project.
However, the warehoused data may be less useful for
data mining than the source or operational data. In the
worst case, warehoused data may be completely
useless (for example, if only summary data are stored).
5. Myth #4: DM is all about algorithms
People often assume that "All you need for data
mining is good algorithms. The better your algorithms,
the better your data mining; advancing the effectiveness
of data mining means advancing our knowledge of
algorithms."
This misunderstands the data mining process.
Data mining is a process consisting of many elements,
such as formulating business goals, mapping business
goals to data mining goals, acquiring, understanding,
and pre-processing the data, evaluating and presenting
the results of analysis and deploying these results to
achieve business benefits.
This is not to minimize the importance of new or
improved data mining algorithms.
6. Myth #5: DM should be done by
technology experts
Quite the opposite is true, due to the paramount
importance of business knowledge in data mining.
When performed without business knowledge, data
mining can produce nonsensical or useless results so
it is essential that data mining be performed by
someone with extensive knowledge of the business
problem.
Very seldom is this the same person with extensive
knowledge of the data mining technology. It is the
responsibility of data mining tool providers to ensure
that tools are accessible to business users.
7. Myth #6: Data mining is for large companies
with lots of customer data
The plain fact is that if a company, large or
small, has data that accurately reflects the
business or its customers, it can build models
against that data that lend insights into
important business challenges. The amount
of customer data a company possesses has
never been the issue.
8. Fundamental concepts of DM
Classification
Classification is the operation most commonly
supported by commercial data mining tools.
It is the process of sub-dividing a data set with
regard to a number of specific outcomes.
For example, classifying customers into ‘high’
and ‘low’ categories with regard to credit risk.
The category or ‘class’ into which each
customer is placed is the ‘outcome’ of the
classification.
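As a minimal illustration of the credit-risk example, here is a hypothetical rule-based classifier; the customer fields and the thresholds are invented for illustration and are not taken from any real scoring scheme.

```python
# Hypothetical rule-based classifier: place each customer into a 'high'
# or 'low' credit-risk class. Fields and thresholds are illustrative only.
def classify_credit_risk(customer):
    # A customer with many missed payments or a high debt ratio is 'high' risk.
    if customer["missed_payments"] > 2 or customer["debt_ratio"] > 0.6:
        return "high"
    return "low"

customers = [
    {"id": 1, "missed_payments": 0, "debt_ratio": 0.25},
    {"id": 2, "missed_payments": 4, "debt_ratio": 0.10},
    {"id": 3, "missed_payments": 1, "debt_ratio": 0.75},
]

# The class each customer is placed in is the 'outcome' of the classification.
outcomes = {c["id"]: classify_credit_risk(c) for c in customers}
print(outcomes)  # → {1: 'low', 2: 'high', 3: 'high'}
```

A commercial tool would learn such rules from data rather than hard-code them, but the outcome structure is the same.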
9. Prediction
Prediction estimates future data states based
on past and current data. Prediction can be
viewed as a type of classification. Ex:
Predicting floods
Techniques for classification and prediction:
decision trees, neural networks, nearest-neighbour
algorithms.
10. Understanding vs Prediction
Sophisticated classification techniques enable us to
discover new patterns in large and complex data
sets.
Classification is a powerful aid to understanding a
particular problem. In some cases, improved
understanding is sufficient. It may suggest new
initiatives and provide information that improves
future decision making.
Often the reason for developing an accurate
classification model is to improve our capability for
prediction.
11. Training
A classification model is said to be ‘trained’
on historical data, for which the outcome is
known for each record.
But beware overfitting: for example, 100 per
cent of customers called Smith who live at 28
Arcadia Street responded to the offer.
One would then use a separate test dataset
of historical data to validate the model.
The model could then be applied to a new,
unclassified data set in order to predict the
outcome for each record.
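The train / validate / apply workflow above can be sketched with a deliberately tiny one-feature model; the data and the threshold rule are invented for illustration, not drawn from any real tool.

```python
# Sketch of training on historical data, validating on a separate test set,
# and applying the model to new records. Data and model are illustrative.
def train_threshold(data):
    # 'Train' a one-feature model: threshold halfway between class means.
    pos = [x for x, y in data if y == 1]
    neg = [x for x, y in data if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def predict(model, x):
    return 1 if x >= model else 0

def accuracy(model, data):
    return sum(predict(model, x) == y for x, y in data) / len(data)

# Historical data where the outcome (1 = responded) is known for each record.
train_set = [(10, 0), (20, 0), (30, 0), (60, 1), (70, 1), (80, 1)]
test_set = [(15, 0), (25, 0), (65, 1), (75, 1)]

model = train_threshold(train_set)              # trained on historical data
print(accuracy(model, test_set))                # validated on a test set → 1.0
print([predict(model, x) for x in (18, 90)])    # applied to new data → [0, 1]
```

A model that memorised the training records (the overfitting warning above) would score perfectly on the training set but poorly on the held-out test set, which is exactly why the separate validation step exists.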
12. Clustering
It is used to find groupings of similar records
in a data set without any preconditions as to
what that similarity may involve.
Clustering is used to identify interesting
groups in a customer base that may not have
been recognised before. Often undertaken as
an exploratory exercise before doing further
data mining using a classification technique.
Techniques for Clustering - cluster analysis,
neural networks
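A minimal sketch of clustering, assuming a single numeric attribute (monthly spend, with invented figures) and a hand-rolled one-dimensional k-means rather than a commercial cluster-analysis tool:

```python
# Minimal 1-D k-means: groups of similar records emerge without any
# precondition on what the similarity should involve. Data is invented.
def kmeans_1d(values, k=2, iters=20):
    # Initialise centroids spread across the sorted data.
    vs = sorted(values)
    centroids = [vs[i * (len(vs) - 1) // (k - 1)] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Move each centroid to the mean of its cluster.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return clusters

# Monthly spend of ten customers: two natural groups should emerge.
spend = [20, 25, 22, 30, 28, 200, 210, 195, 205, 220]
low, high = kmeans_1d(spend)
print(sorted(low), sorted(high))  # → [20, 22, 25, 28, 30] [195, 200, 205, 210, 220]
```

The two groups were never labelled in advance; the exploratory step described above would then inspect them to decide whether they are interesting enough for further mining.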
13. Association analysis
Association analysis looks for links between
records in a data set.
Sometimes referred to as ‘market basket
analysis’, its most common aim is to discover
which items are generally purchased at the
same time.
14. Example of Association Analysis
Consider the following beer and nappy
example:
500,000 transactions
20,000 transactions contain nappies (4%)
30,000 transactions contain beer (6%)
10,000 transactions contain both nappies and
beer (2%)
15. Sequential analysis
Sequential analysis looks for temporal links
between purchases, rather than relationships
between items in a single transaction.
16. Support (or prevalence)
Measures how often items occur together, as
a percentage of the total transactions. In this
example, beer and nappies occur together
2% of the time (10,000/500,000).
17. Confidence (or predictability)
Measures how much a particular item is dependent
on another.
Because 20,000 transactions contain nappies and
10,000 of these transactions contain beer, when
people buy nappies, they also buy beer 50% of the
time.
The confidence of the rule 'when people buy
nappies they also buy beer' is therefore 50%.
Because 30,000 transactions contain beer and
10,000 of these transactions contain nappies, when
people buy beer, they also buy nappies 33.33% of
the time.
18. Expected Confidence
In the absence of any knowledge about what
else was bought, we can also make the
following assertions from the available data:
People buy nappies 4% of the time.
People buy beer 6% of the time.
These numbers - 4% and 6% - are called the
expected confidence of buying nappies or
beer, regardless of what else is purchased.
19. Lift
Measures the ratio between the confidence of
a rule and the expected confidence that the
second product will be purchased. Lift is a
measure of the strength of an effect.
In our example, the confidence of the
nappies-beer buying rule is 50%, whilst the
expected confidence is 6% that an arbitrary
customer will buy beer. So, the lift provided
by the nappies-beer rule is 8.33 (= 50%/6%).
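The support, confidence, and lift figures from the beer-and-nappies example can be recomputed directly from the transaction counts:

```python
# Recomputing support, confidence, and lift from the example counts.
total = 500_000
nappies = 20_000      # 4% of transactions contain nappies
beer = 30_000         # 6% of transactions contain beer
both = 10_000         # 2% contain both nappies and beer

support = both / total                   # how often the items co-occur
conf_nappies_to_beer = both / nappies    # P(beer | nappies)
conf_beer_to_nappies = both / beer       # P(nappies | beer)
expected_conf_beer = beer / total        # P(beer), regardless of basket
lift = conf_nappies_to_beer / expected_conf_beer

print(f"support = {support:.0%}")                                  # 2%
print(f"confidence(nappies -> beer) = {conf_nappies_to_beer:.0%}") # 50%
print(f"confidence(beer -> nappies) = {conf_beer_to_nappies:.2%}") # 33.33%
print(f"lift = {lift:.2f}")                                        # 8.33
```

A lift well above 1 (here 8.33) indicates that buying nappies makes buying beer far more likely than chance, which is what makes the rule interesting.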
20. Forecasting
Forecasting (unlike prediction based on classification
models) concerns the prediction of continuous
values, such as a person's income based on various
personal details, or the level of the stock market.
A simpler forecasting problem involves a single
continuous value based on a series of unordered
examples. A more complex problem is to predict one or
more values based on a sequential pattern.
Techniques include statistical time-series analysis as
well as neural networks.
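A sketch of the simpler kind of forecasting problem, using a moving average over invented monthly sales figures (real forecasting would use proper time-series techniques such as those mentioned above):

```python
# Simple forecast of the next continuous value: the average of the
# last few observations. The sales series is invented for illustration.
def moving_average_forecast(series, window=3):
    return sum(series[-window:]) / window

monthly_sales = [100, 104, 108, 112, 116, 120]
print(moving_average_forecast(monthly_sales))  # → 116.0 (forecast for next month)
```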
21. Techniques used in DM
Regression:
This is used to map a data item to a real-valued
prediction variable. Ex: a college professor
wishing to calculate his future savings
Time series analysis:
Here the value of an attribute is examined
as it varies over time. Ex: a company trying to
decide from which of X, Y, or Z stock should
be purchased
22. Techniques used in DM (contd..)
Summarization:
This maps the data into subsets with associated simple
descriptions. It is also called characterization or
generalization. Ex: comparing universities in the US
by average SAT or ACT score.
Association rules:
This is a model that identifies specific types of data
associations. Ex: a grocery store trying to decide
whether to put bread on sale.
24. Data Mining Steps
Collect the Data
Clean the Data
Determine what is desired
Determine optimal method/tool
Mine the data
Analyze and verify the results
Use the results
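The steps above can be sketched as a hypothetical pipeline of functions; each stage is a stub, since real implementations depend on the data sources and the mining tool chosen.

```python
# Illustrative pipeline for the data mining steps. All stages are stubs
# with invented data; a real system would plug in actual sources and tools.
def collect():          # gather raw records from the sources
    return [{"age": 34, "spend": 120}, {"age": None, "spend": 80}]

def clean(records):     # drop records with missing values
    return [r for r in records if all(v is not None for v in r.values())]

def mine(records):      # stand-in for the chosen mining method/tool
    return {"avg_spend": sum(r["spend"] for r in records) / len(records)}

def verify(results):    # analyse and sanity-check before use
    assert results["avg_spend"] > 0
    return results

results = verify(mine(clean(collect())))
print(results)  # → {'avg_spend': 120.0}
```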
26. Data Mining Input
Data mining can effectively deal with
inconsistencies in your data. Even if your
sources are clean, integrated, and validated,
they may contain data about the real world
that is simply not true. This noise can, for
example, be caused by errors in user input or
just plain mistakes of customers filling in
questionnaires. If it does not occur too often,
data mining tools are able to ignore the noise
and still find the overall patterns that exist in
your data.
27. Data Mining Output
The output of data mining can provide you with more
flexibility. For example, if you have a budget to mail
information to 1000 people about a new product,
queries or OLAP analysis directly on your data will
never be able to select exactly that number of people
from your database. By enhancing your data with an
attribute that you can use in your query or OLAP
analysis, data mining enables you to find the 1000
people most likely to respond. This example also
shows that data mining is not replacing OLAP, but
enhancing it.
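The mailing example can be sketched as follows, assuming the mining step has already attached a response-likelihood score to each record (the scores below are invented stand-ins for a model's output):

```python
# Data mining enriches each record with a score; a plain query can then
# select exactly the N most likely responders. Scores are invented.
import heapq

customers = [{"id": i, "score": (i * 37 % 100) / 100} for i in range(5000)]

def top_n(records, n):
    # Select exactly n records with the highest model score.
    return heapq.nlargest(n, records, key=lambda r: r["score"])

mailing_list = top_n(customers, 1000)
print(len(mailing_list))  # → exactly 1000 people
```

This is the point of the OLAP comparison above: the query itself stays simple, because the hard part (the score) was added by the mining step.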
28. The Future of Data Mining
In the short term, the results of data mining will be in
profitable, if mundane, business-related areas.
Micromarketing campaigns will explore new niches.
Advertising will target potential customers with new
precision.
In the medium term, data mining may be as common
and easy to use as e-mail. We may use these tools to
find the best airfare to New York, root out a phone
number of a long-lost classmate, or find the best prices
on lawn mowers.
The long-term prospects are truly exciting. Imagine
intelligent agents turned loose on medical research data
or on sub-atomic particle data. Computers may reveal
new treatments for diseases or new insights into the
nature of the universe. There are potential dangers,
though, as discussed below.
29. Privacy Concerns
What if every telephone call you make, every credit card
purchase you make, every flight you take, every visit to
the doctor you make, every warranty card you send in,
every employment application you fill out, every school
record you have, your credit record, every web page you
visit ... was all collected together? A lot would be known
about you! This is an all-too-real possibility.
Surely a database holding so much information about
so many people could never make sense to anybody? Not
with data mining tools running on massively parallel
processing computers! Would you feel comfortable about
someone having access to all this data about you? And
remember, all this data does not have to reside in one
physical location; as the net grows, information of this
type becomes more available to more people.
30. Proposed solutions might be…
Data can be intentionally modified from its original
version, in order to misinform recipients or to protect
privacy and security.
Legislation designed to protect consumers against
data security failures by, among other things,
requiring companies to notify consumers when their
personal information has been compromised.
31. Expanding universe of data
Nowadays, the world is regarded as an
expanding universe of data. We have a seemingly
infinite amount of data, yet little information.
Some people look at this phenomenon as a
new paradox of the growth of data, that is,
more data means less information. Therefore,
there is an urgent need for the development
of new techniques to find the required
information from huge amounts of data.
32. Expanding universe of data
The following factors make data mining a very
important technique for extracting implicit,
previously unknown, and potentially useful
knowledge from data.
Data mining algorithms can find "optimal"
clustering or interesting regularities in a Database.
Data mining algorithms typically zoom in on
interesting sub-parts of the Databases.
Networks make it easy to connect Databases.
Machine learning techniques make it easier to
find interesting connections in Database.
Client/Server revolution.
33. Information as a factor of
production
Increase in available data
Exacerbated by the World Wide Web
Information overload
Computer assistance to filter, select and
interpret data
Extend this to allow computers to discover
relevant information
In the future machine assistance will become
more and more important
35. Components explained
Database, data warehouse, or other
information repository: This is one or a set
of databases, data warehouses, spread
sheets, or other kinds of information
repositories. Data cleaning and data
integration techniques may be performed on
the data.
Database or data warehouse server: The
database or data warehouse server is
responsible for fetching the relevant data,
based on the user's data mining request.
36. Components explained
Knowledge base: This is the domain knowledge that
is used to guide the search, or evaluate the
interestingness of resulting patterns.
Such knowledge can include concept hierarchies,
used to organize attributes or attribute values into
different levels of abstraction.
Knowledge such as user beliefs, which can be used
to assess a pattern's interestingness based on its
unexpectedness, may also be included.
Other examples of domain knowledge are additional
interestingness constraints or thresholds, and
metadata (e.g., describing data from multiple
heterogeneous sources).
37. Components explained
Data mining engine: This is essential to the data
mining system and ideally consists of a set of
functional modules for tasks such as characterization,
association analysis, classification, evolution and
deviation analysis.
38. Components explained
Pattern evaluation module: This component
typically employs interestingness measures and
interacts with the data mining modules so as to focus
the search towards interesting patterns.
It may access interestingness thresholds stored in
the knowledge base.
Alternatively, the pattern evaluation module may be
integrated with the mining module, depending on the
implementation of the data mining method used.
39. Components explained
Graphical user interface: This module
communicates between users and the data mining
system, allowing the user to interact with the system
by specifying a data mining query or task, providing
information to help focus the search, and performing
exploratory data mining based on the intermediate
data mining results.
In addition, this component allows the user to browse
database and data warehouse schemas or data
structures, evaluate mined patterns, and visualize the
patterns in different forms.
40. Classification of DM
Classification according to the kinds of
databases mined.
A data mining system can be classified
according to the kinds of databases mined.
Database systems themselves can be
classified according to different criteria (such
as data models, or the types of data or
applications involved), each of which may
require its own data mining technique. Data
mining systems can therefore be classified
accordingly.
41. Classification of DM
Classification according to the kinds of databases
mined.
For instance, if classifying according to data
models, we may have a relational, transactional,
object-oriented, object-relational, or data
warehouse mining system. If classifying according
to the special types of data handled, we may have
a spatial, time-series, text, or multimedia data
mining system, or a World-Wide Web mining
system. Other system types include
heterogeneous data mining systems, and legacy
data mining systems.
42. Classification of DM
Classification according to the kinds of knowledge
mined.
Data mining systems can be categorized according to
the kinds of knowledge they mine, i.e., based on data
mining functionalities, such as characterization,
discrimination, association, classification, clustering,
trend and evolution analysis, deviation analysis,
similarity analysis, etc.
43. Classification of DM
Classification according to the kinds of techniques
utilized.
These techniques can be described according to the
degree of user interaction involved (e.g., autonomous
systems, interactive exploratory systems, query-driven
systems), or the methods of data analysis employed
(e.g., database-oriented or data warehouse-oriented
techniques, machine learning, statistics, visualization,
pattern recognition, neural networks, and so on).
A sophisticated data mining system will often adopt
multiple data mining techniques or work out an
effective, integrated technique which combines the
merits of a few individual approaches.
44. Decision Support Systems (DSS)
A decision support system is a computer-
based system that supports the decision
making process
• Assist decision makers in semi-structured
tasks
• Support not replace human judgment
• Highly interactive
• Improve effectiveness of human decision
makers
45. DSS characteristics
Provide support in semi-structured and
unstructured situations, includes human
judgment and computerized information
Support for various managerial levels
Support to individuals and groups
Support to interdependent and/or sequential
decisions
Support all phases of the decision-making
process
Support a variety of decision-making
processes and styles
46. DSS characteristics
Are adaptive
Have user friendly interfaces
Goal: improve effectiveness of decision
making
The decision maker controls the decision-making process
End-users can build simple systems
Utilizes models for analysis
Provides access to a variety of data sources,
formats, and types
47. Why DSS?
• Increasing complexity of decisions
o Technology
o Information:
• “Data, data everywhere, and not the time to
think!”
o Number and complexity of options
o Pace of change
48. Why DSS?
• Increasing availability of computerized support
o Inexpensive high-powered computing
o Better software
o More efficient software development process
• Increasing usability of computers
o COTS (Commercial Off The Shelf) tools
o Customization
49. Types of Problems
• Structured
o Repetitive
o Standard solution methods exist
o Complete automation may be feasible
• Unstructured
o One-time
o No standard solutions
o Rely on judgment
o Automation is usually infeasible
• Semi-structured
o Some elements and/or phases of decision making
process have repetitive elements
50. Decision Support Trends
• IT is increasingly pervasive
• Users are increasingly computer savvy
• Computer hardware is increasingly smaller and more
powerful
• Systems are increasingly interconnected
• The Web is increasingly interwoven into all aspects of
our lives
• Demand for usable, flexible, powerful decision support
will continue to grow
• Decision support will be embedded into a wide variety
of consumer and business products
51. Humans and Computers:
Complementary Strengths
• Human decision makers
o Good at seeing patterns
o Can work with incomplete problem
representations
o Exercise subtle judgment we do not know how
to automate
o Often unaware of how they perform tasks
o Poor at integrating large numbers of cues
o Unreliable and slow at tedious bookkeeping
tasks and complex calculations
52. Humans and Computers:
Complementary Strengths
Computers
o Still inferior to humans at pattern recognition,
messy unstructured problems
o Good at integrating large numbers of features
o Good at tedious bookkeeping
o Rapid and accurate at complex calculations
53. DSS classifications
Model Driven DSS: A model-driven DSS
emphasizes access to and manipulation of financial,
optimization and/or simulation models. Simple
quantitative models provide the most elementary
level of functionality.
Data Driven DSS: Data-driven DSS emphasizes
access to and manipulation of a time-series of
internal company data and sometimes external and
real-time data. Simple file systems accessed by
query and retrieval tools provide the most elementary
level of functionality.
54. DSS classifications
Communication Driven DSS: Communications-
driven DSS use network and communications
technologies to facilitate decision-relevant
collaboration and communication. In these systems,
communication technologies are the dominant
architectural component.
Document Driven DSS: Document-driven DSS uses
computer storage and processing technologies to
provide document retrieval and analysis. Large
document databases may include scanned
documents, hypertext documents, images, sounds
and video.
56. Data Management subsystem
Consists of the DSS database, database
management system, data directory, and query
facility. It does the following:
Captures/extracts data for inclusion in a DSS
database
Updates (adds, deletes, edits, changes) data
records and files
Interrelates data from different sources
Retrieves data from the database for queries
and reports
57. Data Management subsystem
Provides comprehensive data
security (protection from unauthorised access,
recovery capabilities, etc.)
Handles personal and unofficial data so that
users can experiment with alternative
solutions based on their own judgement
Performs complex data manipulation tasks
based on queries
Tracks data use within DSS
Manages data through a data dictionary
58. Model Management Subsystem
Consists of the model base, model base management
system, modeling language, model directory, and
model execution, integration, and command
processor (an analog of the database management
subsystem).
Strategic Models: Non routine mergers, impact
analysis, capital budgeting
Tactical Models: Allocation & Control labor
requirements, sales promotion planning
Operational Models: Routine-day-to-day
production scheduling, inventory control, quality
control
Analytical Models: SAS, SPSS, OR, data mining
59. KBS
Knowledge based Subsystem
Provides expertise in solving complex
unstructured and semi-structured problems
Expertise provided by an expert system or
other intelligent system
Advanced DSS have a knowledge based
(management) component
Leads to intelligent DSS
Example: Data mining
60. User interface
User Interface sub system
Includes all communication between a user
and the MSS
Graphical user interfaces (GUI)
Voice recognition and speech synthesis
possible
61. User
Different usage patterns for the user, the
manager, or the decision maker
Managers
Staff specialists
Intermediaries:
1. Staff assistant
2. Expert tool user
3. Business (system) analyst
4. GSS facilitator