I was invited to talk about some of the data mining and knowledge discovery work that was going on at Sabre. This is an overview of some of the projects that I could talk about. The photo for the title slide was homemade; that's my wife's geologist's hammer.
2. Overview
• What are the challenges?
– Missing and/or noisy data
– Joining data from multiple data sources
– Very large data sets
– Designing and testing new models
– Explaining the results of your data mining exercise to decision makers
• Case studies
– Employee fraud detection
– Web page analysis
– Customer choice models
• Conclusions
• Questions to think about
3. Employee Fraud Detection
• Liquor sales
– Many airlines give away drinks in first class, but charge for them in economy
– Dishonest staff could sell in economy and report drinks given away in first class, then pocket the revenue
• Requirements
– A formal and objective method to flag an individual as a candidate for further investigation
4. Employee Fraud Detection
• Choosing a measure
– Total Revenue Per Passenger (TRPP)
– Total revenue is not a good measure, as it depends on the number of passengers on the aircraft
• Data quality
– Revenue amounts come from handwritten reports that are later entered into a computer system
– Noisy data
– Missing values
5. Employee Fraud Detection
• Additional variables
– Data varies by time of day (see chart below)
– May also vary by day of week or on holidays
– Need to ensure that we’ve gathered other variables that may be correlated with variance in sales
[Chart: Number of Flights by time of day (Morning, Mid Day, Evening, Late Night, All), broken out by TRPP band: 0.0–0.2, 0.2–0.4, 0.4–0.6, 0.6–0.8, 0.8+]
6. Employee Fraud Detection
Rank the TRPP values for each Day/Time Period into deciles.

[Chart: TRPP values ($) divided into deciles 1–10, each decile containing 10% of the observations]
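The decile-ranking step can be sketched as follows; a minimal Python illustration with made-up TRPP values, not the production code:

```python
def decile_ranks(values):
    """Assign each value a decile 1..10 within its peer group
    (1 = lowest 10% of the group)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for pos, i in enumerate(order):
        ranks[i] = 1 + (10 * pos) // len(values)
    return ranks

# Ten made-up TRPP values for one day/time-period peer group:
trpp = [4.0, 1.5, 9.2, 0.3, 6.1, 2.2, 7.7, 5.0, 3.1, 8.4]
print(decile_ranks(trpp))  # [5, 2, 10, 1, 7, 3, 8, 6, 4, 9]
```

In practice each employee's TRPP values would be ranked against the peer group for the same day-of-week and time period, so the decile accounts for the variation shown in the chart above.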
7. Employee Fraud Detection
• Binomial Approach
– Probability for a single day’s sales
– P(TRPP in decile 10 for one day) = 0.1
• What about two days in row?
– Like tossing two heads in a row
– P(TRPP in decile 10 for two consecutive days)
– (0.10)² = 0.01
• Why use ranks?
– Not affected by outliers
8. Employee Fraud Detection
• Variables
– n = number of observations for an employee
– x = number of 10th decile rankings
• Use binomial theorem to compute probabilities
P(x or more lowest-decile rankings) = Σ (i = x to n) C(n, i) (0.1)^i (0.9)^(n−i)

Where: C(n, i) = n! / (i! (n − i)!)
9. Employee Fraud Detection
• Example
– An employee reports 100 TRPP values
– There are 30 observations in lowest decile
– P(30 or more in lowest) = 2.45 × 10⁻⁸
• How probable is this?
– Texas Lotto probability is 3.87 × 10⁻⁸
– Lotto’s advantages
• You get more money
• You don’t go to jail
• Results
– This work was successful in identifying people for investigation
– But, as we stressed earlier, the results don’t prove or disprove guilt
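The probability quoted above follows directly from the binomial tail formula on the previous slide; a quick Python check:

```python
from math import comb

def p_at_least(n, x, p=0.1):
    """P(x or more lowest-decile rankings in n observations),
    summing the binomial tail from i = x to n."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(x, n + 1))

# An employee with 30 of 100 TRPP values in the lowest decile:
print(p_at_least(100, 30))  # ≈ 2.45e-8
```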
10. Web page analysis
• How do users interact with a large website?
– What paths lead to sales?
– What paths lead to abandonment?
– Which users are actually robots pounding your system?
• What we did
– Gathered page hit information from data warehouse
– Built a version of the Apriori algorithm to find sequential patterns
– Iterative process to discover useful, actionable results
11. Web page analysis
• Data collection
– We were fortunate
• Travelocity’s web site went live in March 1996
• The data warehouse started at the same time
• Initially on Oracle, migrated to Teradata in 1Q00
• All the page hit data we needed was stored in Teradata, along with a lot of other data about user sessions
– Teradata is a shared-nothing database system, optimized for warehouse and VLDB applications
• Tables are partitioned by hash values
• Extensive parallel join facilities
12. Web page analysis
• Consider a set of three sample sessions
– S1: A, B, C, D, E
– S2: A, B, X
– S3: A, B, C, Q
• Some sequential patterns
– A → B (confidence = 100%)
– A,B → C (confidence = 67%)
– A,B,C → D (confidence = 33%)
13. Web page analysis
• Confidence
– A,B → C, confidence = 67%
– If A,B occurs, then C follows, with a 67% chance
– More formally, confidence = P(C | A,B)
• Support
– Number of cases in which this sequence occurs
– Used to eliminate high-probability sequences that only occurred once or twice
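Confidence and support for sequences like these can be computed directly; a minimal Python sketch using the three sample sessions from the previous slide (assuming a pattern must appear as a contiguous run of pages within a session):

```python
def occurs(pattern, session):
    """True if pattern appears as a contiguous run of pages."""
    k = len(pattern)
    return any(list(session[i:i + k]) == list(pattern)
               for i in range(len(session) - k + 1))

def support(pattern, sessions):
    """Number of sessions in which the pattern occurs."""
    return sum(occurs(pattern, s) for s in sessions)

def confidence(antecedent, page, sessions):
    """confidence = P(page follows | antecedent occurs)."""
    return (support(antecedent + [page], sessions)
            / support(antecedent, sessions))

sessions = [list("ABCDE"), list("ABX"), list("ABCQ")]
print(confidence(["A"], "B", sessions))       # 1.0
print(confidence(["A", "B"], "C", sessions))  # 2/3 ≈ 0.67
```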
14. Web page analysis
• SPuD (Sequential Pattern Discoverer)
– About 1,000 lines of C++, using STL
– Ports to any platform
– Command line, reads stdin, writes stdout
– Variant of the Apriori algorithm
• Command line options
– Minimum confidence & support (-c, -s)
– Min / Max pattern length (-l, -m)
– Include / Exclude pages (-i, -x)
– Help with options (-h, -?)
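SPuD itself isn’t reproduced here, but the core Apriori idea it builds on — generate longer candidate sequences only by extending sequences already known to be frequent — can be sketched in a few lines (assuming contiguous page runs and session-level support counts; an illustration, not the SPuD implementation):

```python
def frequent_sequences(sessions, min_support):
    """Level-wise (Apriori-style) search for frequent page sequences:
    a length k+1 candidate is generated only by extending a frequent
    length-k sequence, which prunes the search space."""
    def support(pat):
        k = len(pat)
        return sum(any(tuple(s[i:i + k]) == pat
                       for i in range(len(s) - k + 1))
                   for s in sessions)

    pages = sorted({p for s in sessions for p in s})
    level = [(p,) for p in pages if support((p,)) >= min_support]
    result = []
    while level:
        result.extend(level)
        level = [pat + (p,) for pat in level for p in pages
                 if support(pat + (p,)) >= min_support]
    return result

sessions = [list("ABCDE"), list("ABX"), list("ABCQ")]
print(frequent_sequences(sessions, min_support=2))
```

On the sample sessions this finds the singletons A, B, C, then A,B and B,C, then A,B,C, and stops: no length-4 candidate reaches the support threshold.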
15. Web page analysis
• Performance goals
– ONE MILLION RECORDS!!!
• Test results
– 62 seconds elapsed
– 500 MHz Pentium
– 256 MB RAM
• Observation
– The textbook examples are all small datasets
– One million records is not a large dataset in practice
16. Web page analysis
These rules show repetition. For example, if a user looks at page 2841 three times in a row, we’re 99% sure they’ll hit it again:

2827,2827,2827 → 2827; conf=0.68; supp=0.10
3157,3158,3163 → 3163; conf=0.71; supp=0.11
3157,3157,3157 → 3157; conf=0.73; supp=0.23
2841,2841,2841 → 2841; conf=0.99; supp=0.29

Some more example rules:

6016 → 3162; conf=0.90; supp=0.12
3162 → 3157; conf=0.62; supp=0.35
2432 → 2827; conf=0.61; supp=0.34
3157,3158 → 3163; conf=0.55; supp=0.16

There is still the challenge of deciding what this information means. Does spinning on the same page mean the user can’t find what they want? Is it a web crawler gathering data? Or something else?
17. Web page analysis
• Challenges
– The Apriori algorithm generates a lot of patterns
• Most are obvious, such as the path people follow as they fill in personal information and pay for a reservation
• We added filters to only generate patterns that use a certain page or exclude a certain page, plus min/max pattern length
• Additional variables
– Things we know about the session
• Look vs. book
• What did they book (air / car / hotel / other)?
– Things about the user
• Registered user
• Frequent buyer
18. Web page analysis
• Concept hierarchy
– Too many distinct values of page ID for any categorical data analysis
– Need to build a hierarchy
– This is harder than it looks; every business person will come up with a different classification
[Diagram: a concept hierarchy with Travelocity at the root, branches such as Air and Cruise, Air split into Air_shop and Air_book, and individual page IDs (2123, 2124, 3123, 2234, 2235, 5770, 5771) as the leaves]
19. Customer choice modeling
• Predicting probabilities
– Linear regression finds y ∈ (−∞, ∞)
y = c0 + c1x1 + c2x2 + … + cnxn + ε
– This won’t work for probability, since P(event) ∈ [0, 1]
– A non-linear transform maps y → p
p = e^y / (1 + e^y)
y = c0 + c1x1 + c2x2 + … + cnxn + ε
– This transform is called a logistic function
– Alternatively…
log_e[p / (1 − p)] = c0 + c1x1 + c2x2 + … + cnxn + ε
• Based on logit choice models [Ben-Akiva & Lerman, 1985]
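The logistic transform and its log-odds inverse are easy to demonstrate; a minimal sketch:

```python
from math import exp, log

def logistic(y):
    """Map an unbounded linear score y onto a probability in (0, 1)."""
    return exp(y) / (1.0 + exp(y))

def logit(p):
    """The inverse transform: log-odds, log(p / (1 - p))."""
    return log(p / (1.0 - p))

print(logistic(0.0))         # 0.5 -- a zero score means even odds
print(logit(logistic(2.0)))  # ≈ 2.0 -- the two transforms are inverses
```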
20. Customer choice modeling
• Derived from logistic regression
– Equivalent to logistic regression when there are only two choices
– Forecast the probability a customer will choose an item from the choice set
– The utility of each choice i is denoted ui
– Each ui is a linear combination of indicator variables and/or continuous variables, such as price
P(Buy_k) = e^(u_k) / Σ (i = 1 to n) e^(u_i)

u_k = β_(k,1) x_(k,1) + … + β_(k,m) x_(k,m)

x_(k,1) = 1 if non-stop flight, 0 otherwise
x_(k,2) = 1 if connecting flight, 0 otherwise
x_(k,m) = Price
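The choice probabilities follow from the utilities by the multinomial logit formula above; a minimal sketch with hypothetical, uncalibrated utility values (the numbers are illustrative only):

```python
from math import exp

def choice_probs(utilities):
    """Multinomial logit: P(buy k) = e^(u_k) / sum over i of e^(u_i)."""
    exps = [exp(u) for u in utilities]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical utilities for three itineraries in a choice set,
# e.g. a nonstop (1.2) vs. two connecting options (0.4, -0.5):
probs = choice_probs([1.2, 0.4, -0.5])
print([round(p, 3) for p in probs])  # [0.613, 0.275, 0.112]
```

Note that the probabilities always sum to one across the choice set, which is what makes the model usable as a market share forecast.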
21. Customer choice modeling
• Choice model is used to determine
– What will someone pay for a non-stop vs connecting flight?
– Does this vary by airline?
– Does this vary by time-of-day or day-of-week?
• What is it good for?
– Price determination
– Dynamic discounts and packages
• Other methods for categorical data
– Decision-tree induction (e.g., C4.5)
– Neural networks can forecast y ∈ [0, 1], but don’t extend easily to create a market share model
22. Customer choice modeling
One use is to model the probability that a user will choose one of the many itineraries displayed on the web site.

We can look at the price, the type of itinerary (Nonstop, 1 Stop, etc.), and the time of day to estimate the probability of selling each option.
23. Customer choice modeling
• Implementation
– We use SAS for data preprocessing and model calibration.
• PROC MDC (multinomial discrete choice) in the Econometrics and Time Series (ETS) package
• SAS is also very good with large datasets
– Although not a problem here, data collection is often a challenge for customer choice modeling
• Results
– We’ve been using logistic regression and similar models for many years
– Can sometimes be hard to explain, as few people understand the statistics
– The upside is that the model predicts probabilities and share
– Also combines continuous variables (price) with discrete ones (service type)
24. Conclusions
• Data mining is a process, not a product
– Data collection and preparation is an involved process
– Customized techniques are still needed
– Large datasets are typical
• How to be a data miner?
– Learn tools for large scale data manipulation, such as SQL, SAS, etc.
– The math is important: even if the tool has a GUI and is simple to use, you have to understand the results and limitations
– Be prepared to spend significant time presenting and explaining what you’ve discovered; data mining is an iterative process
25. Questions to think about…
• Employee fraud detection
– How could an employee be consistently in the bottom 10% and not be committing fraud?
– Suppose you were a crooked employee, how could you beat the system?
• Web page analysis
– What other data mining techniques could you use to analyze this data?
– How could I detect a web crawler? How is it different from a real person?
• Customer choice modeling
– What other data mining techniques could you use to analyze this data?
– What other variables might you add to the model to explain choice?
– What other factors might explain abandonment at a web site? Which of these can you measure?