Duplicate Detection via
Topic Modeling
HomeAway Key Facts
● 1,300,000+ global vacation rental listings
● 200,000,000+ vacation days / year
● ~190 countries, 22 languages
● HQ in Austin, TX; part of Expedia, Inc
--> Capable competition and fraud vectors
Competitive Intelligence
Over 2 million global HomeAway (HA) and competitor documents, plus metadata
Breckenridge, Colorado (HomeAway in blue)
Breckenridge, zoomed in
Same Property
The Property Descriptions
Why Property Descriptions?
● Almost identical text
● Similar descriptions seemed probable
  ○ Consistent owner branding, easy to replicate
● Tech team wanted to use natural language processing techniques
● Didn’t know if this would work when we began
The Other Guys
There are truly inspiring views at High Point Retreat and
plenty of places to sit and enjoy them. Take a load off in one
of the many rooms with views of the ski mountain and
remember how lucky you are to live like this. Cozy up with
family in the sunken living room and chat for hours on end.
Sit in a circle of tree stumps around the outdoor fire pit and
roast marshmallows. After all that sitting, youll be more than
happy to walk 250 yards to the free shuttle to get the blood
pumping again. Then, have a seat and enjoy your free ride.
Best. Vacation. Ever. Vacation homes allow families to
stay...together. At InvitedHome, we think that's pretty
important, so we do everything in our power to make your
vacation totally epic. Not only do we choose the best homes
in the best destinations, but we make the experience
effortless so you can really enjoy yourself. Our team will
stock your fridge, babysit the kids, cater your party, plan your
day trip, make reservations, and do whatever we can to
make sure you have the Best. Vacation. Ever.
HomeAway
There are truly inspiring views at High Point Retreat and
plenty of places to sit and enjoy them. Take a load off in one
of the many rooms with views of the ski mountain and
remember how lucky you are to live like this. Cozy up with
family in the sunken living room and chat for hours on end.
Sit in a circle of tree stumps around the outdoor fire pit and
roast marshmallows. After all that sitting, you’ll be more
than happy to walk 250 yards to the free shuttle to get the
blood pumping again. Then, have a seat and enjoy your free
ride.
Best.Vacation.Ever. Vacation homes allow families to stay...
together. At InvitedHome, we think that's pretty important,
so we do everything in our power to make your vacation
totally epic. Not only do we choose the best homes in the
best destinations, but we make the experience effortless so
you can really enjoy yourself. Let us connect you with the
best options in town for babysitting, equipment rental,
transportation, catering, day trips, shopping, dining, and
even stocking your fridge with groceries! We’ll do everything
in our power to make sure you have the Best. Vacation.
Ever.
Hypothesis
We can detect properties listed on
HomeAway and the competition by
comparing the text in the property
descriptions
Initial Approach: TF-IDF and Cosine Distance
Worked great, but...
● “Large” vocabulary size: ~10K tokens -> 10K dimensions and millions of sparse vectors
● A little slow (took a week to process the US)
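A minimal sketch of the TF-IDF and cosine distance idea, in plain Scala with no Spark dependency. The object name TfIdfCosine, the helper names, and the smoothed IDF formula are assumptions for illustration; the slides do not say which weighting or tokenizer was used.

```scala
object TfIdfCosine {
  // One sparse vector per document: token -> TF-IDF weight.
  type SparseVec = Map[String, Double]

  // Term frequency times a smoothed inverse document frequency (assumed weighting).
  def tfIdf(docs: Seq[Seq[String]]): Seq[SparseVec] = {
    val n = docs.size.toDouble
    // Number of documents each token appears in.
    val docFreq: Map[String, Int] =
      docs.flatMap(_.distinct).groupBy(identity).map { case (t, xs) => t -> xs.size }
    docs.map { tokens =>
      tokens.groupBy(identity).map { case (t, xs) =>
        val idf = math.log((n + 1.0) / (docFreq(t) + 1.0)) + 1.0
        t -> xs.size * idf
      }
    }
  }

  // 1 - cosine similarity; 0 for identical directions, 1 for no shared tokens.
  def cosineDistance(a: SparseVec, b: SparseVec): Double = {
    val dot = a.iterator.map { case (t, w) => w * b.getOrElse(t, 0.0) }.sum
    val normA = math.sqrt(a.values.map(w => w * w).sum)
    val normB = math.sqrt(b.values.map(w => w * w).sum)
    if (normA == 0.0 || normB == 0.0) 1.0 else 1.0 - dot / (normA * normB)
  }
}
```

Every distinct token becomes its own dimension, which is why a ~10K-token vocabulary leads to 10K-dimensional sparse vectors.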
Spark clusters?
Topic modeling?
Other distance metrics?
Hypothesis
We can detect properties listed on
HomeAway and the competition by
comparing the text in the property
descriptions
We can leverage Topic Modeling to do
it
Latent Dirichlet Allocation (Topic Modeling)
Communications of the ACM, Vol. 55 No. 4, Pages 77-84, doi:10.1145/2133806.2133826
Topic Modeling and LDA
In natural language processing,
Latent Dirichlet allocation
(LDA) is a generative model
that allows sets of
observations to be explained
by unobserved groups that
explain why some parts of the
data are similar.
(Wikipedia)
Document A: Cat, Dog, Fish, Turtle, Hamster
Document B: Cat, Dog, Mass, Hysteria, Living, Together
Document C: Cat, Dog, Cold, Rain, Hot, Temperature
Some Example Topics from Breckenridge
● time, setting, wifi, elk, central, enjoying, spend, marijuana, sleepers, brittany
● buffalo, soaking, pubs, titles, washroom, pristine, ratedgas, multiple, especially, scrumptious
● apartment, weekend, maintained, company, bedroom, bed, sized, bathroom, walk, queen
● golf, course, chateau, sole, beauty, payment, splendor, championship, rooftop, stonehaven
● smoking, allowed, deposit, damage, fee, owner, dates, paid, balance, zone
Topic Modeling Motivations
● Smaller dimensional space
● Faster processing times?
● At the end, we’d have Topic Models
Must be useful for duplicate detection
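We used Spark’s ML APIs for this:
import org.apache.spark.ml.clustering.LDA

// numTopics, params, and featureCol are defined elsewhere in the job
val countLDA = new LDA()
  .setK(numTopics)                               // number of topics
  .setMaxIter(params.maxIterations)
  .setSeed(params.randomSeed)                    // fixed seed for repeatable runs
  .setFeaturesCol(featureCol)                    // input feature vectors
  .setTopicDistributionCol("topicDistribution")  // per-document topic mixture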
Distances between Topic Distributions
Euclidean Manhattan Cosine
Jensen-Shannon Hellinger
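The five measures named above can be sketched in plain Scala as follows. This assumes each topic distribution is an Array[Double] of equal length that sums to 1; the object name TopicDistances is hypothetical, and the slides do not say whether the Jensen-Shannon divergence or its square root was used, so the divergence (base 2) is shown.

```scala
object TopicDistances {
  type Dist = Array[Double]

  def euclidean(p: Dist, q: Dist): Double =
    math.sqrt(p.zip(q).map { case (a, b) => (a - b) * (a - b) }.sum)

  def manhattan(p: Dist, q: Dist): Double =
    p.zip(q).map { case (a, b) => math.abs(a - b) }.sum

  // 1 - cosine similarity.
  def cosine(p: Dist, q: Dist): Double = {
    val dot = p.zip(q).map { case (a, b) => a * b }.sum
    val norms = math.sqrt(p.map(a => a * a).sum) * math.sqrt(q.map(b => b * b).sum)
    if (norms == 0.0) 1.0 else 1.0 - dot / norms
  }

  // Kullback-Leibler divergence in bits; terms with a == 0 contribute nothing.
  private def kl(p: Dist, q: Dist): Double =
    p.zip(q).collect { case (a, b) if a > 0.0 => a * math.log(a / b) / math.log(2.0) }.sum

  // Jensen-Shannon divergence (base 2, so the value lies in [0, 1]).
  def jensenShannon(p: Dist, q: Dist): Double = {
    val m = p.zip(q).map { case (a, b) => (a + b) / 2.0 }
    0.5 * kl(p, m) + 0.5 * kl(q, m)
  }

  // Hellinger distance, also bounded by [0, 1] for probability vectors.
  def hellinger(p: Dist, q: Dist): Double =
    math.sqrt(p.zip(q).map { case (a, b) =>
      val d = math.sqrt(a) - math.sqrt(b)
      d * d
    }.sum / 2.0)
}
```

Jensen-Shannon and Hellinger are defined on probability distributions, which is what LDA produces per document; Euclidean, Manhattan, and cosine treat the same vectors as plain points.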
Create an experimental dataset
Original Corpus
Random selection
Duplicate (with optional degradation)...
… and see if we can find those duplicates
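One way to build such a dataset, sketched in plain Scala. The Doc type, the function names, and the choice of random token dropping as the "degradation" are all assumptions for illustration; the slides do not describe how the copies were degraded.

```scala
import scala.util.Random

object SyntheticDuplicates {
  final case class Doc(id: String, tokens: Seq[String])

  // Drop each token independently with probability dropRate (0.0 keeps an exact copy).
  def degrade(tokens: Seq[String], dropRate: Double, rng: Random): Seq[String] =
    tokens.filter(_ => rng.nextDouble() >= dropRate)

  // Pick `count` documents at random and return (originalId, degraded copy) pairs.
  // The pairs are the ground truth that a duplicate detector should recover.
  def plantDuplicates(corpus: Seq[Doc], count: Int, dropRate: Double, seed: Long): Seq[(String, Doc)] = {
    val rng = new Random(seed)
    rng.shuffle(corpus).take(count).map { d =>
      d.id -> Doc(d.id + "-dup", degrade(d.tokens, dropRate, rng))
    }
  }
}
```

Appending the planted copies to the original corpus gives a set with known answers, so any distance measure can be scored by how often the planted copy comes back as the nearest neighbour of its source.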
How to make something useful?
Machine Learning Effort
Interquartile Ranges are more resilient to outliers than
standard deviations
IQRs bring information about the entire set of possible
duplicates
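A small sketch of why the IQR is less sensitive to outliers, and of one possible IQR-based feature. The slides do not define how the iqrs feature was computed, so iqrsBelowQ1 below is a hypothetical reading (how many IQRs the closest candidate sits below the first quartile of all candidate distances), and the interpolating quantile is just one common convention. Inputs are assumed non-empty.

```scala
object IqrFeature {
  // Linear-interpolation quantile on a sorted copy (one convention among several).
  def quantile(xs: Seq[Double], q: Double): Double = {
    val s = xs.sorted
    val pos = q * (s.size - 1)
    val lo = math.floor(pos).toInt
    val hi = math.ceil(pos).toInt
    s(lo) + (pos - lo) * (s(hi) - s(lo))
  }

  def iqr(xs: Seq[Double]): Double = quantile(xs, 0.75) - quantile(xs, 0.25)

  // Population standard deviation, for comparison with the IQR.
  def stdDev(xs: Seq[Double]): Double = {
    val mean = xs.sum / xs.size
    math.sqrt(xs.map(x => (x - mean) * (x - mean)).sum / xs.size)
  }

  // Hypothetical feature: how many IQRs the closest candidate sits below the first quartile.
  def iqrsBelowQ1(distances: Seq[Double]): Double = {
    val spread = iqr(distances)
    if (spread == 0.0) 0.0 else (quantile(distances, 0.25) - distances.min) / spread
  }
}
```

For the values 1, 2, 3, 4, 5 and 1, 2, 3, 4, 100 the IQR is 2 in both cases under this convention, while the standard deviation grows from about 1.4 to about 39.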
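Random Forest Model (R):
library(caret)         # createDataPartition
library(randomForest)  # randomForest

# Keep 90% of the rows for training, stratified on the match label
trainIdx <- createDataPartition(dupesFoundByTopic$match, p=0.9, list=FALSE, times=1)
train <- dupesFoundByTopic[trainIdx,]
# Classify match / non-match from the topic distance and the IQR feature
fit <- randomForest(as.factor(match) ~ distance + iqrs, data=train)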
Combining Distance and IQR

Feature     Mean Decrease Gini
distance    498
IQR         57

            Reference
Pred.       FALSE   TRUE
FALSE       204     2
TRUE        4       32
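The usual summary metrics follow from the confusion matrix by simple arithmetic. The sketch below assumes the layout has predictions in rows and reference labels in columns, so that 204 are true negatives, 2 false negatives, 4 false positives, and 32 true positives; the names ConfusionMetrics and Confusion are illustrative.

```scala
object ConfusionMetrics {
  final case class Confusion(tn: Int, fn: Int, fp: Int, tp: Int) {
    def total: Int = tn + fn + fp + tp
    def accuracy: Double = (tp + tn).toDouble / total
    // Of the pairs flagged as duplicates, the share that really are.
    def precision: Double = tp.toDouble / (tp + fp)
    // Of the real duplicates, the share that were flagged.
    def recall: Double = tp.toDouble / (tp + fn)
  }

  // Assumed reading of the slide: rows = prediction, columns = reference.
  val slide: Confusion = Confusion(tn = 204, fn = 2, fp = 4, tp = 32)
}
```

Under that reading, the 242 evaluated pairs give an accuracy of 236/242 (about 0.975), a precision of 32/36 (about 0.889), and a recall of 32/34 (about 0.941).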
Current Status
● Topic Models / Topic Distances seem useful
  ○ Esp. when part of a multi-signal model (e.g. images)
● Hybrid Spark and R approach
  ○ Moving to 100% Spark in future for speed
● Topic Models just sitting there, waiting for exploitation
  ○ “Programmatic” Marketing Efforts, etc.
● But what about Locality Sensitive Hashing?
Questions?
Brent Schneeman
Principal Data Scientist
HomeAway, Inc.
brent@homeaway.com
careers.homeaway.com
@schnee
← https://www.homeaway.com/vacation-rental/p3482065

Data Day Seattle Duplicate Detection via Topic Modeling
