I was invited to talk about some of the data mining and knowledge discovery work that was going on at Sabre. This is an overview of some of the projects that I could talk about. The photo for the title slide was homemade; that's my wife's geologist's hammer.
2. Overview
• What are the challenges?
– Missing and/or noisy data
– Joining data from multiple data sources
– Very large data sets
– Designing and testing new models
– Explaining the results of your data mining exercise to decision makers
• Case studies
– Employee fraud detection
– Web page analysis
– Customer choice models
• Conclusions
• Questions to think about
3. Employee Fraud Detection
• Liquor sales
– Many airlines give away drinks in first class, but charge for them in economy
– Dishonest staff could sell in economy and report drinks given away in first class, then pocket the revenue
• Requirements
– A formal and objective method to flag an individual as a candidate for further investigation
4. Employee Fraud Detection
• Choosing a measure
– Total Revenue Per Passenger (TRPP)
– Total revenue is not a good measure, as it depends on the number of passengers on the aircraft
• Data quality
– Revenue amounts come from handwritten reports that are later entered into a computer system
– Noisy data
– Missing values
5. Employee Fraud Detection
• Additional variables
– Data varies by time of day (see chart below)
– May also vary by day of week or on holidays
– Need to ensure that we’ve gathered other variables that may be correlated with variance in sales
[Chart: Number of Flights by time of day (Morning, Mid Day, Evening, Late Night, All), broken out by TRPP band: 0.0–0.2, 0.2–0.4, 0.4–0.6, 0.6–0.8, 0.8+]
6. Employee Fraud Detection
Rank the TRPP values for each Day/Time Period into deciles.

[Chart: TRPP values ($) divided into deciles 1–10, each decile containing 10% of the observations]
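The decile-ranking step can be sketched as follows; a minimal Python illustration with made-up TRPP values, not the production code:

```python
def decile_ranks(values):
    """Assign each value a decile 1..10 within its peer group
    (1 = lowest 10% of the group)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for pos, i in enumerate(order):
        ranks[i] = 1 + (10 * pos) // len(values)
    return ranks

# Ten made-up TRPP values for one day/time-period peer group:
trpp = [4.0, 1.5, 9.2, 0.3, 6.1, 2.2, 7.7, 5.0, 3.1, 8.4]
print(decile_ranks(trpp))  # [5, 2, 10, 1, 7, 3, 8, 6, 4, 9]
```

In practice each employee's TRPP values would be ranked against the peer group for the same day-of-week and time period, so the decile accounts for the variation shown in the chart above.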
7. Employee Fraud Detection
• Binomial Approach
– Probability for a single day’s sales
– P(TRPP in decile 10 for one day) = 0.1
• What about two days in row?
– Like tossing two heads in a row
– P(TRPP in decile 10 for two consecutive days)
– (0.10)² = 0.01
• Why use ranks?
– Not affected by outliers
8. Employee Fraud Detection
• Variables
– n = number of observations for an employee
– x = number of 10th decile rankings
• Use binomial theorem to compute probabilities
P(x or more lowest-decile rankings) = Σ (i = x to n) C(n, i) (0.1)^i (0.9)^(n−i)

Where: C(n, i) = n! / (i! (n − i)!)
9. Employee Fraud Detection
• Example
– An employee reports 100 TRPP values
– There are 30 observations in lowest decile
– P(30 or more in lowest) = 2.45 × 10⁻⁸
• How probable is this?
– Texas Lotto probability is 3.87 × 10⁻⁸
– Lotto’s advantages
• You get more money
• You don’t go to jail
• Results
– This work was successful in identifying people for investigation
– But, as we stressed earlier, the results don’t prove or disprove guilt
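The probability quoted above follows directly from the binomial tail formula on the previous slide; a quick Python check:

```python
from math import comb

def p_at_least(n, x, p=0.1):
    """P(x or more lowest-decile rankings in n observations),
    summing the binomial tail from i = x to n."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(x, n + 1))

# An employee with 30 of 100 TRPP values in the lowest decile:
print(p_at_least(100, 30))  # ≈ 2.45e-8
```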
10. Web page analysis
• How do users interact with a large website?
– What paths lead to sales?
– What paths lead to abandonment?
– Which users are actually robots pounding your system?
• What we did
– Gathered page hit information from data warehouse
– Built a version of the Apriori algorithm to find sequential patterns
– Iterative process to discover useful, actionable results
11. Web page analysis
• Data collection
– We were fortunate
• Travelocity’s web site went live in March 1996
• The data warehouse started at the same time
• Initially on Oracle, migrated to Teradata in 1Q00
• All the page hit data we needed was stored in Teradata, along with a lot of other data about user sessions
– Teradata is a shared-nothing database system, optimized for warehouse and VLDB applications
• Tables are partitioned by hash values
• Extensive parallel join facilities
12. Web page analysis
• Consider a set of three sample sessions
– S1: A, B, C, D, E
– S2: A, B, X
– S3: A, B, C, Q
• Some sequential patterns
– A → B (confidence = 100%)
– A,B → C (confidence = 67%)
– A,B,C → D (confidence = 33%)
13. Web page analysis
• Confidence
– A,B → C, confidence = 67%
– If A,B occurs, then C follows, with a 67% chance
– More formally, confidence = P(C | A,B)
• Support
– Number of cases in which this sequence occurs
– Used to eliminate high-probability sequences that only occurred once or twice
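Confidence and support for sequences like these can be computed directly; a minimal Python sketch using the three sample sessions from the previous slide (assuming a pattern must appear as a contiguous run of pages within a session):

```python
def occurs(pattern, session):
    """True if pattern appears as a contiguous run of pages."""
    k = len(pattern)
    return any(list(session[i:i + k]) == list(pattern)
               for i in range(len(session) - k + 1))

def support(pattern, sessions):
    """Number of sessions in which the pattern occurs."""
    return sum(occurs(pattern, s) for s in sessions)

def confidence(antecedent, page, sessions):
    """confidence = P(page follows | antecedent occurs)."""
    return (support(antecedent + [page], sessions)
            / support(antecedent, sessions))

sessions = [list("ABCDE"), list("ABX"), list("ABCQ")]
print(confidence(["A"], "B", sessions))       # 1.0
print(confidence(["A", "B"], "C", sessions))  # 2/3 ≈ 0.67
```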
14. Web page analysis
• SPuD (Sequential Pattern Discoverer)
– About 1,000 lines of C++, using STL
– Ports to any platform
– Command line, reads stdin, writes stdout
– Variant of the Apriori algorithm
• Command line options
– Minimum confidence & support (-c, -s)
– Min / Max pattern length (-l, -m)
– Include / Exclude pages (-i, -x)
– Help with options (-h, -?)
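SPuD itself isn’t reproduced here, but the core Apriori idea it builds on — generate longer candidate sequences only by extending sequences already known to be frequent — can be sketched in a few lines (assuming contiguous page runs and session-level support counts; an illustration, not the SPuD implementation):

```python
def frequent_sequences(sessions, min_support):
    """Level-wise (Apriori-style) search for frequent page sequences:
    a length k+1 candidate is generated only by extending a frequent
    length-k sequence, which prunes the search space."""
    def support(pat):
        k = len(pat)
        return sum(any(tuple(s[i:i + k]) == pat
                       for i in range(len(s) - k + 1))
                   for s in sessions)

    pages = sorted({p for s in sessions for p in s})
    level = [(p,) for p in pages if support((p,)) >= min_support]
    result = []
    while level:
        result.extend(level)
        level = [pat + (p,) for pat in level for p in pages
                 if support(pat + (p,)) >= min_support]
    return result

sessions = [list("ABCDE"), list("ABX"), list("ABCQ")]
print(frequent_sequences(sessions, min_support=2))
```

On the sample sessions this finds the singletons A, B, C, then A,B and B,C, then A,B,C, and stops: no length-4 candidate reaches the support threshold.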
15. Web page analysis
• Performance goals
– ONE MILLION RECORDS!!!
• Test results
– 62 seconds elapsed
– 500 MHz Pentium
– 256 MB RAM
• Observation
– The textbook examples are all small datasets
– One million records is not a large dataset in practice
16. Web page analysis
These rules show repetition. For example, if a user looks at page 2841 three times in a row, we’re 99% sure they’ll hit it again:

2827,2827,2827 → 2827; conf=0.68; supp=0.10
3157,3158,3163 → 3163; conf=0.71; supp=0.11
3157,3157,3157 → 3157; conf=0.73; supp=0.23
2841,2841,2841 → 2841; conf=0.99; supp=0.29

Some more example rules:

6016 → 3162; conf=0.90; supp=0.12
3162 → 3157; conf=0.62; supp=0.35
2432 → 2827; conf=0.61; supp=0.34
3157,3158 → 3163; conf=0.55; supp=0.16

There is still the challenge of deciding what this information means. Does spinning on the same page mean the user can’t find what they want? Is it a web crawler gathering data? Or something else?
17. Web page analysis
• Challenges
– The Apriori algorithm generates a lot of patterns
• Most are obvious, such as the path people follow as they fill in personal information and pay for a reservation
• We added filters to only generate patterns that use a certain page or exclude a certain page, plus min/max pattern length
• Additional variables
– Things we know about the session
• Look vs. book
• What did they book (air / car / hotel / other)?
– Things about the user
• Registered user
• Frequent buyer
18. Web page analysis
• Concept hierarchy
– Too many distinct values of page ID for any categorical data analysis
– Need to build a hierarchy
– This is harder than it looks; every business person will come up with a different classification
[Diagram: a concept hierarchy with Travelocity at the root, branches such as Air and Cruise, Air split into Air_shop and Air_book, and individual page IDs (2123, 2124, 3123, 2234, 2235, 5770, 5771) as the leaves]
19. Customer choice modeling
• Predicting probabilities
– Linear regression finds y ∈ (−∞, ∞)
y = c0 + c1x1 + c2x2 + … + cnxn + ε
– This won’t work for probability, since P(event) ∈ [0, 1]
– A non-linear transform maps y → p
p = e^y / (1 + e^y)
y = c0 + c1x1 + c2x2 + … + cnxn + ε
– This transform is called a logistic function
– Alternatively…
log_e[p / (1 − p)] = c0 + c1x1 + c2x2 + … + cnxn + ε
• Based on logit choice models [Ben-Akiva & Lerman, 1985]
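The logistic transform and its log-odds inverse are easy to demonstrate; a minimal sketch:

```python
from math import exp, log

def logistic(y):
    """Map an unbounded linear score y onto a probability in (0, 1)."""
    return exp(y) / (1.0 + exp(y))

def logit(p):
    """The inverse transform: log-odds, log(p / (1 - p))."""
    return log(p / (1.0 - p))

print(logistic(0.0))         # 0.5 -- a zero score means even odds
print(logit(logistic(2.0)))  # ≈ 2.0 -- the two transforms are inverses
```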
20. Customer choice modeling
• Derived from logistic regression
– Equivalent to logistic regression when there are only two choices
– Forecast the probability a customer will choose an item from the choice set
– The utility of each choice i is denoted ui
– Each ui is a linear combination of indicator variables and/or continuous variables, such as price
P(Buy_k) = e^(u_k) / Σ (i = 1 to n) e^(u_i)

u_k = β_(k,1) x_(k,1) + … + β_(k,m) x_(k,m)

x_(k,1) = 1 if non-stop flight, 0 otherwise
x_(k,2) = 1 if connecting flight, 0 otherwise
x_(k,m) = Price
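The choice probabilities follow from the utilities by the multinomial logit formula above; a minimal sketch with hypothetical, uncalibrated utility values (the numbers are illustrative only):

```python
from math import exp

def choice_probs(utilities):
    """Multinomial logit: P(buy k) = e^(u_k) / sum over i of e^(u_i)."""
    exps = [exp(u) for u in utilities]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical utilities for three itineraries in a choice set,
# e.g. a nonstop (1.2) vs. two connecting options (0.4, -0.5):
probs = choice_probs([1.2, 0.4, -0.5])
print([round(p, 3) for p in probs])  # [0.613, 0.275, 0.112]
```

Note that the probabilities always sum to one across the choice set, which is what makes the model usable as a market share forecast.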
21. Customer choice modeling
• Choice model is used to determine
– What will someone pay for a non-stop vs connecting flight?
– Does this vary by airline?
– Does this vary by time-of-day or day-of-week?
• What is it good for?
– Price determination
– Dynamic discounts and packages
• Other methods for categorical data
– Decision-tree induction (e.g., C4.5)
– Neural networks can forecast y ∈ [0, 1], but don’t extend easily to create a market share model
22. Customer choice modeling
One use is to model the probability that a user will choose one of the many itineraries displayed on the web site.

We can look at the price, the type of itinerary (Nonstop, 1 Stop, etc.), and the time of day to estimate the probability of selling each option.
23. Customer choice modeling
• Implementation
– We use SAS for data preprocessing and model calibration.
• PROC MDC (multinomial discrete choice) in the Econometrics and Time Series (ETS) package
• SAS is also very good with large datasets
– Although not a problem here, data collection is often a challenge for customer choice modeling
• Results
– We’ve been using logistic regression and similar models for many years
– Can sometimes be hard to explain, as few people understand the statistics
– The upside is that the model predicts probabilities and share
– Also combines continuous variables (price) with discrete ones (service type)
24. Conclusions
• Data mining is a process, not a product
– Data collection and preparation is an involved process
– Customized techniques are still needed
– Large datasets are typical
• How to be a data miner?
– Learn tools for large scale data manipulation, such as SQL, SAS, etc.
– The math is important: even if the tool has a GUI and is simple to use, you have to understand the results and limitations
– Be prepared to spend significant time presenting and explaining what you’ve discovered; data mining is an iterative process
25. Questions to think about…
• Employee fraud detection
– How could an employee be consistently in the bottom 10% and not be committing fraud?
– Suppose you were a crooked employee, how could you beat the system?
• Web page analysis
– What other data mining techniques could you use to analyze this data?
– How could I detect a web crawler? How is it different from a real person?
• Customer choice modeling
– What other data mining techniques could you use to analyze this data?
– What other variables might you add to the model to explain choice?
– What other factors might explain abandonment at a web site? Which of these can you measure?