2. Introduction
Data mining is a powerful new
technology with great potential to help
companies focus on the most important
information in the data they have
collected about the behavior of their
customers and potential customers.
3. Data collections in the real world
Ten largest transaction-processing
databases range from 3 to 18
Terabytes
Ten largest decision support databases
range from 10 to 29 Terabytes
Sizes doubled or tripled between
2001 and the end of 2003
4. Questions arise
Is there any new, unexpected and
potentially useful information contained
in this data?
Can we use historical data to predict
future outcomes?
(e.g. customer behavior, fraud
detection, etc.)
5. Some examples of data mining
1. Telecommunications
Huge amounts of data are collected daily
Transactional data (about each phone call)
Data on mobile phones, house-based phones, Internet, etc.
Other customer data (billing, personal information, etc.)
Additional data (network load, faults, etc.)
Questions arise
Which customer group is highly profitable, which one is not?
To which customers should we advertise what kind of special
offers?
What kind of call rates would increase profit without losing good
customers?
How do customer profiles change over time?
Fraud detection (stolen mobile phones or phone cards)
6. Another example
2. Health
Different aspects of the health system
Personal health records (at GPs, specialists, etc.)
Hospital data (e.g. admission data, midwives data,
surgery data)
Billing information (Medicare, PBS)
Questions
Are doctors following the procedures (e.g. prescription of
medication)?
Adverse drug reactions (analysis of different data
collections to find correlations)
Are people committing fraud (e.g. doctor shoppers)?
Correlations between social and environmental issues
and people's health?
7. What is data mining?
Data Mining is the automated extraction
of previously unrealized information
from large data sources for the
purpose of supporting business actions.
8. Some more definitions
Knowledge discovery in databases is the
non-trivial process of identifying valid, novel,
potentially useful, and ultimately
understandable patterns in data.
An information extraction activity whose goal
is to discover hidden facts contained in
databases.
Data mining, or knowledge discovery, is the
computer-assisted process of digging through
and analyzing enormous sets of data and
then extracting the meaning of the data.
10. Data mining process
Extract, transform, and load transaction
data onto the data warehouse system.
Store and manage the data in a
multidimensional database system.
Provide data access to business
analysts and information technology
professionals.
11. Data mining process
Analyze the data with application
software.
Present the data in a useful format,
such as a graph or table.
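The five process steps above can be sketched end to end with Python's standard library. This is only a minimal sketch, not a real warehouse: an in-memory SQLite database stands in for the data warehouse, and the table name `sales`, the column names, and the sample transactions are all hypothetical.

```python
import sqlite3

# Hypothetical raw transactions as they might arrive from a source system.
raw_transactions = [
    {"customer": " alice ", "product": "phone", "amount": "120.50"},
    {"customer": "BOB", "product": "phone", "amount": "80"},
    {"customer": "alice", "product": "card", "amount": "15.25"},
]

def transform(row):
    """Normalize one raw record: trim and lowercase names, parse amounts."""
    return (row["customer"].strip().lower(), row["product"], float(row["amount"]))

def load(conn, rows):
    """Load cleaned rows into the stand-in warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (customer TEXT, product TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()

def revenue_by_customer(conn):
    """Analysis step: total revenue per customer, highest first."""
    cur = conn.execute(
        "SELECT customer, SUM(amount) FROM sales "
        "GROUP BY customer ORDER BY SUM(amount) DESC"
    )
    return cur.fetchall()

conn = sqlite3.connect(":memory:")                     # store and manage
load(conn, [transform(r) for r in raw_transactions])   # extract, transform, load
report = revenue_by_customer(conn)                     # provide access, analyze

# Present the result as a simple table.
for customer, total in report:
    print(f"{customer:<10}{total:>10.2f}")
```

In practice each step would be handled by dedicated tooling; the sketch only shows how the steps hand data to one another.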
13. What they do
Detect patterns in data: Rules, patterns,
classes, associations and functional
dependencies, outliers, data distributions,
clusters
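As one concrete instance of the pattern types listed above, outliers can be flagged with a simple z-score rule. This is a minimal sketch under stated assumptions: the data is numeric, the threshold of 2 standard deviations is an arbitrary illustrative choice, and the call durations are made up. Real tools generally use more robust methods.

```python
from statistics import mean, pstdev

def find_outliers(values, threshold=2.0):
    """Return values lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical call durations in minutes; one call is unusually long.
call_minutes = [3, 4, 5, 4, 3, 5, 4, 60]
outliers = find_outliers(call_minutes)
print(outliers)
```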
14. How they do it
Search through data and pattern space,
non-parametric modelling, filtering,
aggregation
How well they do it
Errors and biases, over-fitting,
confounding effects, speed, scalability
15. Challenges in DM
Data size
Size of data collections grows faster than
linearly, doubling every 18 months
Scalable algorithms are needed
Data complexity
Different types of data (free text, HTML, XML,
multimedia)
Dimensionality of the data increases (more
attributes)
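To make the data-size challenge above concrete, the arithmetic behind "doubling every 18 months" can be written out. This is a small sketch; the 10 TB starting size is a hypothetical figure chosen only for illustration, and the function name is not from the slides.

```python
def projected_size(initial_tb, months, doubling_months=18):
    """Size after `months` if the collection doubles every `doubling_months`."""
    return initial_tb * 2 ** (months / doubling_months)

# Hypothetical 10 TB collection projected over several years.
for years in (0, 3, 6):
    print(years, "years:", projected_size(10, years * 12), "TB")
```

Under this assumption the collection quadruples every three years, which is why algorithms whose cost grows faster than the data quickly become impractical.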
16. Challenges in DM (contd.)
The curse of dimensionality affects many
algorithms
(for example, finding nearest neighbors in high
dimensions)
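The nearest-neighbor problem mentioned above can be illustrated with a small experiment. This is a sketch under assumptions: points are drawn uniformly at random from the unit cube, and the point count, seed, and dimensions are arbitrary choices. It measures how much closer the nearest point is than the farthest one; as dimensionality rises, the ratio moves toward 1 and "nearest" loses meaning.

```python
import math
import random

def distance_contrast(dim, n_points=200, seed=0):
    """Ratio of nearest to farthest distance from a random query point.

    A ratio near 1 means the nearest neighbor is barely closer than the farthest.
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    return min(dists) / max(dists)

low = distance_contrast(2)      # few attributes
high = distance_contrast(500)   # many attributes
print(f"2 dimensions:   {low:.3f}")
print(f"500 dimensions: {high:.3f}")
```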
Data quality
Real world data is messy and dirty
(missing and out-of-date values,
typographical errors, different
coding/formats, etc.)
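A short cleaning pass shows how the data-quality problems listed above might be handled before mining. This is a minimal sketch: the records, field names, and the two accepted date formats are hypothetical, and missing values are simply marked as `None` rather than imputed.

```python
from datetime import datetime

# Hypothetical raw records showing inconsistent formats and a missing value.
raw = [
    {"name": "Ann Lee",  "state": "nsw", "joined": "2003-01-15"},
    {"name": "ann lee ", "state": "NSW", "joined": "15/01/2003"},
    {"name": "Bo Chan",  "state": "",    "joined": "2002-11-03"},
]

DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")  # assumed formats for this example

def parse_date(text):
    """Try each known format; return None if the value cannot be parsed."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None

def clean(record):
    """Normalize whitespace, letter case, and date format; mark blanks as None."""
    return {
        "name": " ".join(record["name"].split()).title(),
        "state": record["state"].strip().upper() or None,
        "joined": parse_date(record["joined"]),
    }

cleaned = [clean(r) for r in raw]

# After normalization, records that differed only in formatting collapse into one.
unique = list({(c["name"], c["state"], c["joined"]): c for c in cleaned}.values())
print(len(raw), "raw records ->", len(unique), "unique records")
```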
17. Why mine data?
Data is being recorded
Recorded data is being warehoused
Computing power is affordable
Competitive pressure is strong
Commercial DM products are available
It provides support for business
decisions
18. Value to business
Market segmentation - Identify the
common characteristics of customers
who buy the same products from your
company.
Customer churn - Predict which
customers are likely to leave your
company and go to a competitor.
Fraud detection - Identify which
transactions are most likely to be
fraudulent.
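The market segmentation idea above, finding the common characteristics of customers who buy the same product, can be sketched by grouping customers per product and reporting the most frequent value of each attribute. This is a deliberately simple stand-in (real segmentation commonly uses clustering), and the customer records and attribute names are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical customer records.
customers = [
    {"id": 1, "age_group": "18-29", "region": "urban", "product": "prepaid"},
    {"id": 2, "age_group": "18-29", "region": "urban", "product": "prepaid"},
    {"id": 3, "age_group": "30-49", "region": "rural", "product": "landline"},
    {"id": 4, "age_group": "18-29", "region": "rural", "product": "prepaid"},
    {"id": 5, "age_group": "50+",   "region": "rural", "product": "landline"},
]

def segment_profiles(rows, by="product", attributes=("age_group", "region")):
    """For each product, report the most common value of each attribute."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[by]].append(row)
    profiles = {}
    for key, members in groups.items():
        profiles[key] = {
            attr: Counter(m[attr] for m in members).most_common(1)[0][0]
            for attr in attributes
        }
    return profiles

profiles = segment_profiles(customers)
for product, profile in profiles.items():
    print(product, profile)
```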
19. Value to business
Interactive marketing - Predict what each
individual accessing a Web site is most
likely interested in seeing.
Market basket analysis - Understand what
products or services are commonly
purchased together; e.g., beer and
diapers.
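Market basket analysis is usually expressed through two measures: support (how often items appear together) and confidence (how often the second item appears given the first). The sketch below computes both for item pairs. The baskets are made up for illustration, and the function names are not taken from any particular product.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets.
baskets = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"milk", "bread"},
    {"beer", "chips"},
    {"diapers", "milk"},
]

def pair_support(baskets):
    """Fraction of baskets containing each pair of items."""
    counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    n = len(baskets)
    return {pair: c / n for pair, c in counts.items()}

def confidence(baskets, antecedent, consequent):
    """Of the baskets containing `antecedent`, the fraction also containing `consequent`."""
    with_a = [b for b in baskets if antecedent in b]
    if not with_a:
        return 0.0
    return sum(consequent in b for b in with_a) / len(with_a)

support = pair_support(baskets)
print("support(beer, diapers):", support[("beer", "diapers")])
print("confidence(beer -> diapers):", round(confidence(baskets, "beer", "diapers"), 2))
```

Counting every pair is fine for a handful of baskets; on real transaction volumes, algorithms such as Apriori prune the candidate pairs first.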
20. Value to business
Trend analysis - Reveal the difference
between a typical customer this month
and last.
Data mining can also effectively deal with
missing, inconsistent, and noisy data.
Direct marketing - Identify which prospects
should be included in a mailing list to
obtain the highest response rate.
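The direct marketing case above can be sketched as follows: estimate response rates per customer segment from a past campaign, then mail only prospects in segments that clear a chosen rate. Everything here is a hypothetical illustration, including the segment labels, the campaign history, and the 0.5 cut-off.

```python
from collections import defaultdict

# Hypothetical past campaign: (segment, responded?)
past_campaign = [
    ("student", True), ("student", False), ("student", False), ("student", False),
    ("retiree", True), ("retiree", True), ("retiree", False),
    ("family", True), ("family", False),
]

def response_rates(history):
    """Observed response rate per segment."""
    sent = defaultdict(int)
    responded = defaultdict(int)
    for segment, did_respond in history:
        sent[segment] += 1
        responded[segment] += did_respond
    return {s: responded[s] / sent[s] for s in sent}

def build_mailing_list(prospects, rates, min_rate):
    """Keep prospects whose segment historically responded at `min_rate` or better."""
    return [name for name, segment in prospects if rates.get(segment, 0.0) >= min_rate]

rates = response_rates(past_campaign)
prospects = [("Ana", "student"), ("Ben", "retiree"), ("Cho", "family"), ("Dee", "unknown")]
mailing_list = build_mailing_list(prospects, rates, min_rate=0.5)
print(mailing_list)
```

Prospects from segments with no history are excluded here, which is a conservative design choice; a real campaign might instead mail a small sample of them to gather data.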