Data Mining
Some Real-World Experiences

Alan Walker
VP Sabre Labs
April 12th, 2004

1
Overview
• What are the challenges?
– Missing and/or noisy data
– Joining data from multiple data sources
– Very large data sets
– Designing and testing new models
– Explaining the results of your data mining exercise to decision makers

• Case studies
– Employee fraud detection
– Web page analysis
– Customer choice models

• Conclusions
• Questions to think about
2
Employee Fraud Detection
• Liquor sales
– Many airlines give away drinks in first
class, but charge for them in economy
– Dishonest staff could sell in economy
and report drinks given away in first
class, then pocket the revenue

• Requirements
– Formal and objective method to flag an
individual as a candidate for further
investigation

3
Employee Fraud Detection
• Choosing a measure
– Total Revenue Per Passenger (TRPP)
– Total revenue is not a good measure, as it depends on the number of
passengers on the aircraft

• Data quality
– Revenue amounts come from hand written reports that are later entered
into a computer system
– Noisy data
– Missing values

4
Employee Fraud Detection
• Additional variables
– Data varies by time of day (see below)
– May also vary by day of week or on holidays
– Need to ensure that we’ve gathered other variables that may be correlated with
variance in sales

[Figure: number of flights (y-axis, 0 to 800,000) by time of day (Morning, Mid Day, Evening, Late Night, All); legend bands 0.0-0.2, 0.2-0.4, 0.4-0.6, 0.6-0.8, 0.8+]

5
Employee Fraud Detection
Rank the TRPP values for each Day/ Time
Period into deciles.

[Figure: TRPP values ($) split into ten deciles labeled 0 to 9, each containing 10% of the observations]
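The ranking step can be sketched in a few lines of Python. This is only an illustration, not the code used in the original work: the record layout `(employee, period, trpp)`, the function name `decile_ranks`, and the tie-breaking rule (ties keep input order) are all assumptions. Decile 0 is the lowest 10% within each day/time period, matching the 0 to 9 labels in the figure.

```python
from collections import defaultdict

def decile_ranks(records):
    """Return one decile (0 = lowest 10%, 9 = highest) per record.

    records: list of (employee, period, trpp) tuples (hypothetical layout).
    Ranking is done separately inside each period, so a slow late-night
    flight is only compared with other late-night flights.
    """
    by_period = defaultdict(list)
    for idx, (_, period, trpp) in enumerate(records):
        by_period[period].append((trpp, idx))

    deciles = [None] * len(records)
    for rows in by_period.values():
        rows.sort()                      # ascending TRPP, ties by input order
        n = len(rows)
        for rank, (_, idx) in enumerate(rows):
            deciles[idx] = rank * 10 // n
    return deciles

# An extreme value changes nothing about the ranks of the other rows.
sample = [("e1", "Morning", v) for v in (0.1, 0.2, 0.3, 0.4, 0.5,
                                         0.6, 0.7, 0.8, 0.9, 500.0)]
print(decile_ranks(sample))
```

Because only the order of the values matters, the outlier of 500.0 simply lands in the top decile and leaves every other rank unchanged.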
6
Employee Fraud Detection
• Binomial Approach
– Probability for a single day’s sales
– P(TRPP in decile 10 for one day) = 0.1

• What about two days in a row?
– Like tossing two heads in a row
– P(TRPP in decile 10 for two consecutive days) = (0.10)^2 = 0.01

• Why use ranks?
– Not affected by outliers

7
Employee Fraud Detection
• Variables
– n = number of observations for an employee
– x = number of 10th decile rankings

• Use binomial theorem to compute probabilities

P(x or more lowest decile rankings) = $\sum_{i=x}^{n} \binom{n}{i} (0.1)^{i} (0.9)^{n-i}$

Where: $\binom{n}{x} = \frac{n!}{x!\,(n-x)!}$
8
Employee Fraud Detection
• Example
– An employee reports 100 TRPP values
– There are 30 observations in lowest decile
– P(30 or more in lowest) = 2.45 × 10^-8

• How probable is this?
– Texas Lotto probability is 3.87 × 10^-8
– Lotto’s advantages
• You get more money
• You don’t go to jail

• Results
– This work was successful in identifying people for investigation
– But, as we stressed earlier, the results don’t prove or disprove guilt
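
The example numbers can be checked with a short Python sketch of the binomial tail sum. The function name `binomial_tail` is hypothetical, and the lottery comparison assumes a 6-of-54 draw, which is an assumption on my part that happens to reproduce the quoted figure; the slides do not state the game format.

```python
from math import comb

def binomial_tail(n, x, p=0.1):
    """P(X >= x) when X ~ Binomial(n, p): chance of x or more hits in n tries."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(x, n + 1))

# 100 reported TRPP values, 30 of them in the flagged decile.
p_example = binomial_tail(100, 30)

# Assumption: jackpot odds for picking 6 numbers out of 54.
p_lotto = 1 / comb(54, 6)

print(f"P(30 or more of 100) = {p_example:.2e}")
print(f"P(lotto jackpot)     = {p_lotto:.2e}")
```

Both values should land close to the figures quoted above. The calculation treats each observation as independent, which is the same assumption the coin-toss analogy makes.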
9
Web page analysis
• How do users interact with a large website?
– What paths lead to sales?
– What paths lead to abandonment?
– What users are actually robots pounding your system?

• What we did
– Gathered page hit information from data warehouse
– Built a version of the Apriori algorithm to find sequential patterns
– Iterative process to discover useful, actionable results

10
Web page analysis
• Data collection
– We were fortunate
• Travelocity’s web site went live in March 1996
• The data warehouse started at the same time
• Initially on Oracle, migrated to Teradata 1Q00
• All the page hit data we needed was stored in Teradata, along with a lot of other data about user sessions

– Teradata is a shared-nothing database system,
optimized for warehouse and VLDB
applications
• Tables are partitioned by hash values
• Extensive parallel join facilities

11
Web page analysis
• Consider a set of three sample sessions
– S1: A, B, C, D, E
– S2: A, B, X
– S3: A, B, C, Q

• Some sequential patterns
– A B
– A,B C
– A,B,C D

confidence=100%
confidence=67%
confidence=33%

12
Web page analysis
• Confidence
– A,B C, confidence=67%
– If A,B occurs, then C follows, with 67% chance
– More formally, confidence = P(C | A,B)

• Support
– Number of cases in which this sequence occurs
– Used to eliminate high probability sequences that only occurred once or
twice
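
To make the two measures concrete, here is a minimal Python sketch that computes them for the three sample sessions. It assumes a pattern must appear as a contiguous run of pages and that each session counts at most once; the slides do not spell out either choice, and the helper names are hypothetical.

```python
sessions = [
    ["A", "B", "C", "D", "E"],  # S1
    ["A", "B", "X"],            # S2
    ["A", "B", "C", "Q"],       # S3
]

def contains_run(session, pattern):
    """True if `pattern` occurs as a contiguous run somewhere in `session`."""
    k = len(pattern)
    return any(session[i:i + k] == pattern
               for i in range(len(session) - k + 1))

def support(prefix, nxt):
    """Number of sessions in which prefix is immediately followed by nxt."""
    return sum(contains_run(s, prefix + [nxt]) for s in sessions)

def confidence(prefix, nxt):
    """Estimate of P(nxt | prefix) over the sessions that contain prefix."""
    n_prefix = sum(contains_run(s, prefix) for s in sessions)
    return support(prefix, nxt) / n_prefix if n_prefix else 0.0

for prefix, nxt in [(["A"], "B"), (["A", "B"], "C"), (["A", "B", "C"], "D")]:
    print(prefix, "->", nxt,
          f"conf={confidence(prefix, nxt):.2f}", f"supp={support(prefix, nxt)}")
```

With confidence defined as P(next page | prefix), A,B,C → D evaluates to 1/2 on these sessions, because only S1 and S3 contain A,B,C and only S1 continues to D.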

13
Web page analysis
• SPuD (Sequential Pattern Discoverer)
– About 1,000 lines of C++, using STL
– Ports to any platform
– Command line, reads stdin, writes stdout
– Variant of the Apriori Algorithm

• Command line options
– Minimum confidence & support (-c, -s)
– Min / Max pattern length (-l, -m)
– Include / Exclude pages (-i, -x)
– Help with options (-h, -?)
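
SPuD's source is not part of these slides, so the following is only a rough Python sketch of the level-wise idea behind an Apriori-style sequence miner, not a port of the C++ tool. It assumes contiguous patterns and per-session support counts; `count_runs`, `mine`, and their parameters are hypothetical names that loosely echo the minimum support, minimum confidence, and maximum length options.

```python
from collections import Counter

def count_runs(sessions, k):
    """Count how many sessions contain each contiguous run of k pages."""
    counts = Counter()
    for pages in sessions:
        counts.update({tuple(pages[i:i + k])
                       for i in range(len(pages) - k + 1)})
    return counts

def mine(sessions, min_support=2, min_confidence=0.5, max_len=4):
    """Return rules as (prefix, next_page, confidence, support).

    A run of length k is kept only if its (k-1)-page prefix was itself
    frequent, which is the pruning idea Apriori relies on.
    """
    rules = []
    frequent = {run: c for run, c in count_runs(sessions, 1).items()
                if c >= min_support}
    k = 2
    while frequent and k <= max_len:
        next_frequent = {}
        for run, c in count_runs(sessions, k).items():
            prefix = run[:-1]
            if prefix in frequent and c >= min_support:
                next_frequent[run] = c
                conf = c / frequent[prefix]
                if conf >= min_confidence:
                    rules.append((prefix, run[-1], conf, c))
        frequent = next_frequent
        k += 1
    return rules

demo = [["A", "B", "C", "D", "E"], ["A", "B", "X"], ["A", "B", "C", "Q"]]
for prefix, nxt, conf, supp in mine(demo):
    print(",".join(prefix), "->", nxt, f"conf={conf:.2f}", f"supp={supp}")
```

For simplicity the sketch recounts every run at each level and then filters; a production miner would generate candidates only from the frequent runs of the previous level to avoid that work on large inputs.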

14
Web page analysis
• Performance goals
– ONE MILLION RECORDS!!!

• Test results
– 62 seconds elapsed
– 500 MHz Pentium
– 256 MB RAM

• Observation
– The textbook examples are all small
datasets
– One million records is not a large
dataset in practice

15
Web page analysis
These rules show repetition. For example, if a user looks at page 2841 three times in a row, we’re 99% sure they’ll hit it again
2827,2827,2827 → 2827; conf=0.68; supp=0.10
3157,3158,3163 → 3163; conf=0.71; supp=0.11
3157,3157,3157 → 3157; conf=0.73; supp=0.23
2841,2841,2841 → 2841; conf=0.99; supp=0.29

Some more example rules
6016 → 3162; conf=0.90; supp=0.12
3162 → 3157; conf=0.62; supp=0.35
2432 → 2827; conf=0.61; supp=0.34
3157,3158 → 3163; conf=0.55; supp=0.16

There is still the challenge of deciding what this information means. Does spinning on the same page mean the user can’t find what they want? Is it a web crawler gathering data? Or something else?
16
Web page analysis
• Challenges
– The Apriori algorithm generates a lot of patterns
• Most are obvious, such as the path people follow as they fill in personal
information and pay for a reservation
• We added some filters to only generate patterns that use a certain page, or
exclude a certain page, also min/max pattern length

• Additional variables
– Things we know about the session
• Look vs. book
• What did they book (air / car / hotel / other)?

– Things about the user
• Registered user
• Frequent buyer

17
Web page analysis
• Concept hierarchy
– Too many distinct values of page ID for any categorical data analysis
– Need to build a hierarchy
– This is harder than it looks, every business person will come up with a different
classification

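[Figure: example concept hierarchy with Travelocity at the root; categories Air (subdivided into Air_shop and Air_book) and Cruise; leaf page IDs 2123, 2124, 3123, 2234, 2235, 5770, 5771]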
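
One way to use such a hierarchy is to replace page IDs with their category before running any categorical analysis. The Python sketch below illustrates that roll-up. The assignment of individual page IDs to categories is a guess made for illustration only, since the figure layout did not survive, and `PARENT` and `generalize` are hypothetical names.

```python
# Hypothetical child -> parent links; the leaf assignments are illustrative.
PARENT = {
    "2123": "Air_shop", "2124": "Air_shop", "3123": "Air_shop",
    "2234": "Air_book", "2235": "Air_book",
    "5770": "Cruise", "5771": "Cruise",
    "Air_shop": "Air", "Air_book": "Air",
    "Air": "Travelocity", "Cruise": "Travelocity",
}

def generalize(page, levels=1):
    """Climb `levels` steps up the hierarchy, stopping at the root."""
    node = page
    for _ in range(levels):
        node = PARENT.get(node, node)
    return node

session = ["2123", "2124", "2234", "5770"]
print([generalize(p) for p in session])            # one level up
print([generalize(p, levels=2) for p in session])  # two levels up
```

Keeping the hierarchy as data rather than code makes it easier to swap in a different classification when business users disagree about the grouping.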
18
Customer choice modeling
• Predicting probabilities
– Linear regression finds $y \in (-\infty, \infty)$
$y = c_0 + c_1 x_1 + c_2 x_2 + \dots + c_n x_n + \varepsilon$
– This won’t work for probability, since $P(\text{event}) \in [0,1]$
– A non-linear transform maps $y \to p$
$p = e^{y} / (1 + e^{y})$
$y = c_0 + c_1 x_1 + c_2 x_2 + \dots + c_n x_n + \varepsilon$
– This transform is called a logistic function
– Alternatively:
$\log_e[p/(1-p)] = c_0 + c_1 x_1 + c_2 x_2 + \dots + c_n x_n + \varepsilon$

• Based on logit-choice [Ben-Akiva & Lerman, 1985]
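
A small Python sketch of the transform and its inverse may help here. The function names are my own, and the form 1 / (1 + e^-y) is used because it is algebraically the same as e^y / (1 + e^y).

```python
import math

def logistic(y):
    """Map any real y to a probability in (0, 1); equals e^y / (1 + e^y)."""
    return 1.0 / (1.0 + math.exp(-y))

def logit(p):
    """Inverse transform: the log-odds log(p / (1 - p)) for 0 < p < 1."""
    return math.log(p / (1.0 - p))

for y in (-4.0, -1.0, 0.0, 1.0, 4.0):
    p = logistic(y)
    print(f"y={y:+.1f}  p={p:.4f}  logit(p)={logit(p):+.4f}")
```

Applying `logit` to the output of `logistic` recovers the original linear score, which is why the model can be fitted on the log-odds scale and still report probabilities.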

19
Customer choice modeling
• Derived from logistic regression
– Equivalent to logistic regression when there are only two choices
– Forecast the probability a customer will choose an item from the choice set
– The utility of each choice i is denoted $u_i$
– Each $u_i$ is a linear combination of indicator variables and/or continuous variables, such as price

$P(\mathrm{Buy}_k) = \dfrac{e^{u_k}}{\sum_{i=1}^{n} e^{u_i}}$

$u_k = \beta_{k,1} x_{k,1} + \dots + \beta_{k,m} x_{k,m}$

$x_{k,1}$ = 1 if non-stop flight, 0 otherwise
$x_{k,2}$ = 1 if connecting flight, 0 otherwise
$x_{k,m}$ = Price
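
As a rough illustration, the Python sketch below turns utilities into choice probabilities for a small set of itineraries. The coefficient values are invented for the example, not calibrated results, and for brevity the same coefficients are shared across all alternatives instead of being indexed by k.

```python
import math

# Hypothetical coefficients, for illustration only.
BETA_NONSTOP = 1.2    # utility bonus for a non-stop flight
BETA_CONNECT = 0.0    # connecting flight used as the baseline
BETA_PRICE = -0.01    # utility lost per dollar of price

def utility(nonstop, price):
    """Linear utility from two indicator variables and one continuous one."""
    connecting = 1 - nonstop
    return BETA_NONSTOP * nonstop + BETA_CONNECT * connecting + BETA_PRICE * price

def choice_probabilities(utilities):
    """P(Buy_k) = exp(u_k) / sum_i exp(u_i), shifted by max(u) for stability."""
    top = max(utilities)
    weights = [math.exp(u - top) for u in utilities]
    total = sum(weights)
    return [w / total for w in weights]

itineraries = [(1, 420.0), (0, 310.0), (0, 295.0)]   # (nonstop flag, price)
probs = choice_probabilities([utility(n, p) for n, p in itineraries])
for (nonstop, price), prob in zip(itineraries, probs):
    print(f"nonstop={nonstop} price={price:.0f} share={prob:.3f}")
```

With only two alternatives the formula reduces to the logistic function of the utility difference, which is the equivalence with logistic regression mentioned above.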
20
Customer choice modeling
• Choice model is used to determine
– What will someone pay for a non-stop vs connecting flight?
– Does this vary by airline?
– Does this vary by time-of-day or day-of-week?

• What is it good for?
– Price determination
– Dynamic discounts and packages

• Other methods for categorical data
– Decision-tree induction (e.g. C4.5)
– Neural networks can forecast y ∈ [0,1], but don’t extend easily to create a
market share model

21
Customer choice modeling
One use is to model the
probability that a user will
choose one of the many
itineraries displayed on
the web site.
We can look at the price,
the type of itinerary
(Nonstop, 1 Stop, etc.), and the
time of day to estimate the
probability of selling each
option.

22
Customer choice modeling
• Implementation
– We use SAS for data preprocessing and model calibration.
• PROC MDC (multinomial discrete choice) in the Econometrics and Time
Series (ETS) package
• SAS is also very good with large datasets

– Although not a problem here, data collection is often a challenge for
customer choice modeling

• Results
– We’ve been using logistic regression and similar models for many years
– Can sometimes be hard to explain as few people understand the statistics
– The upside is that the model predicts probabilities and share
– Also combines continuous variables (price) with discrete (service type)

23
Conclusions
• Data mining is a process, not a product
– Data collection and preparation is an involved process
– Customized techniques are still needed
– Large datasets are typical

• How to be a data miner?
– Learn tools for large scale data manipulation, such as SQL, SAS, etc.
– The math is important, even if the tool has a GUI and is simple to use,
you have to understand the results and limitations
– Be prepared to spend significant time presenting and explaining what
you’ve discovered. Data mining is an iterative process

24
Questions to think about…
• Employee fraud detection
– How could an employee be consistently in the bottom 10% and not be
committing fraud?
– Suppose you were a crooked employee, how could you beat the system?

• Web page analysis
– What other data mining techniques could you use to analyze this data?
– How could I detect a web-crawler? How are they different than a real
person?

• Customer choice modeling
– What other data mining techniques could you use to analyze this data?
– What other variables might you add to the model to explain choice?
– What other factors might explain abandonment at a web site? Which of
these can you measure?
25

Data mining guest lecture (CSE6331 University of Texas, Arlington) 2004
