2. HomeAway Key Facts
● 1,300,000+ global vacation rental listings
● 200,000,000+ vacation days / year
● ~190 countries, 22 languages
● HQ in Austin, TX; part of Expedia, Inc.
--> Capable competition and fraud vectors
7. The Property Descriptions
Why Property Descriptions?
● Almost identical text
● Similar descriptions seemed probable
○ Consistent owner branding, easy to replicate
● Tech team wanted to use natural language processing techniques
● Didn’t know if this would work when we began
The Other Guys
There are truly inspiring views at High Point Retreat and
plenty of places to sit and enjoy them. Take a load off in one
of the many rooms with views of the ski mountain and
remember how lucky you are to live like this. Cozy up with
family in the sunken living room and chat for hours on end.
Sit in a circle of tree stumps around the outdoor fire pit and
roast marshmallows. After all that sitting, youll be more than
happy to walk 250 yards to the free shuttle to get the blood
pumping again. Then, have a seat and enjoy your free ride.
Best. Vacation. Ever. Vacation homes allow families to
stay...together. At InvitedHome, we think that's pretty
important, so we do everything in our power to make your
vacation totally epic. Not only do we choose the best homes
in the best destinations, but we make the experience
effortless so you can really enjoy yourself. Our team will
stock your fridge, babysit the kids, cater your party, plan your
day trip, make reservations, and do whatever we can to
make sure you have the Best. Vacation. Ever.
HomeAway
There are truly inspiring views at High Point Retreat and
plenty of places to sit and enjoy them. Take a load off in one
of the many rooms with views of the ski mountain and
remember how lucky you are to live like this. Cozy up with
family in the sunken living room and chat for hours on end.
Sit in a circle of tree stumps around the outdoor fire pit and
roast marshmallows. After all that sitting, you’ll be more
than happy to walk 250 yards to the free shuttle to get the
blood pumping again. Then, have a seat and enjoy your free
ride.
Best.Vacation.Ever. Vacation homes allow families to stay...
together. At InvitedHome, we think that's pretty important,
so we do everything in our power to make your vacation
totally epic. Not only do we choose the best homes in the
best destinations, but we make the experience effortless so
you can really enjoy yourself. Let us connect you with the
best options in town for babysitting, equipment rental,
transportation, catering, day trips, shopping, dining, and
even stocking your fridge with groceries! We’ll do everything
in our power to make sure you have the Best. Vacation.
Ever.
8. Hypothesis
We can detect properties listed on HomeAway and the competition by comparing the text in the property descriptions.
9. Initial Approach: TF-IDF and Cosine Distance
Worked great, but...
● “Large” vocabulary size: ~10K tokens -> 10K dimensions and millions of sparse vectors
● A little slow (took a week to process the US)
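The TF-IDF and cosine distance comparison named on this slide can be sketched in plain Scala. This is a minimal illustration, not the production pipeline: the object name `TfIdfSketch`, the naive tokenizer, and the smoothed IDF formula are all assumptions made for the example. Vectors are stored sparsely, since each description uses only a small part of the ~10K-token vocabulary.

```scala
object TfIdfSketch {
  // Sparse vector: only terms that occur in the document are stored.
  type SparseVec = Map[String, Double]

  // Naive tokenizer (assumption): lowercase, split on anything but letters and apostrophes.
  def tokenize(text: String): Seq[String] =
    text.toLowerCase.split("[^a-z']+").filter(_.nonEmpty).toSeq

  // Smoothed inverse document frequency (assumption): log((N + 1) / (df + 1)) + 1.
  def idf(docs: Seq[Seq[String]]): Map[String, Double] = {
    val n = docs.size.toDouble
    docs.flatMap(_.distinct)
      .groupBy(identity)
      .map { case (term, hits) => term -> (math.log((n + 1) / (hits.size + 1)) + 1.0) }
  }

  // Raw term count times IDF for each term in the document.
  def tfidf(tokens: Seq[String], idfs: Map[String, Double]): SparseVec =
    tokens.groupBy(identity).map { case (t, occ) => t -> occ.size * idfs.getOrElse(t, 0.0) }

  def cosineSimilarity(a: SparseVec, b: SparseVec): Double = {
    val dot = a.iterator.map { case (t, w) => w * b.getOrElse(t, 0.0) }.sum
    val norm = (v: SparseVec) => math.sqrt(v.values.map(w => w * w).sum)
    val denom = norm(a) * norm(b)
    if (denom == 0.0) 0.0 else dot / denom
  }

  // Cosine distance: 0 for identical direction, 1 for no shared terms.
  def cosineDistance(a: SparseVec, b: SparseVec): Double =
    1.0 - cosineSimilarity(a, b)
}
```

Comparing every listing against every other listing this way is quadratic in the number of listings, which is consistent with the slow run time noted on the slide.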
11. Hypothesis
We can detect properties listed on HomeAway and the competition by comparing the text in the property descriptions.
We can leverage Topic Modeling to do it.
12. Latent Dirichlet Allocation (Topic Modeling)
Communications of the ACM, Vol. 55 No. 4, Pages 77-84, DOI: 10.1145/2133806.2133826
13. Topic Modeling and LDA
“In natural language processing, Latent Dirichlet allocation (LDA) is a generative model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar.” (Wikipedia)
Document A: Cat, Dog, Fish, Turtle, Hamster
Document B: Cat, Dog, Mass, Hysteria, Living, Together
Document C: Cat, Dog, Cold, Rain, Hot, Temperature
15. Topic Modeling Motivations
● Smaller dimensional space
● Faster processing times?
● At the end, we’d have Topic Models
Must be useful for duplicate detection
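We used Spark’s ML APIs for this:
import org.apache.spark.ml.clustering.LDA

// numTopics, params, and featureCol are defined elsewhere in the job
val countLDA = new LDA()
  .setK(numTopics)                               // number of topics to learn
  .setMaxIter(params.maxIterations)
  .setSeed(params.randomSeed)                    // fixed seed for reproducible runs
  .setFeaturesCol(featureCol)
  .setTopicDistributionCol("topicDistribution")  // output: per-document topic mixture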
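Once each description has a topic distribution, candidate duplicates can be ranked by how close their distributions are. The slides do not say which distance was used, so the sketch below assumes the Hellinger distance, a common choice for comparing probability distributions. The names `TopicDistanceSketch`, `hellinger`, and `rank` are hypothetical. In Spark, the values in the `topicDistribution` column could be turned into arrays with `Vector.toArray` before calling code like this.

```scala
object TopicDistanceSketch {
  // Hellinger distance between two discrete distributions of equal length.
  // The result lies in [0, 1]: 0 for identical distributions, 1 for disjoint support.
  def hellinger(p: Array[Double], q: Array[Double]): Double = {
    require(p.length == q.length, "topic vectors must have the same number of topics")
    val sumSq = p.zip(q).map { case (a, b) =>
      val d = math.sqrt(a) - math.sqrt(b)
      d * d
    }.sum
    math.sqrt(sumSq / 2.0)
  }

  // Rank candidate listings by ascending topic distance to one query listing.
  def rank(query: Array[Double],
           candidates: Map[String, Array[Double]]): Seq[(String, Double)] =
    candidates.toSeq
      .map { case (id, dist) => id -> hellinger(query, dist) }
      .sortBy(_._2)
}
```

With K topics, each comparison touches K numbers instead of a vocabulary-sized vector, which is the smaller dimensional space the motivation slide refers to.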
22. Create an experimental dataset
Original Corpus
Random selection
Duplicate (with optional degradation)...
… and see if we can find those duplicates
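The experiment above can be sketched as follows: select random documents, plant a copy of each (optionally degraded), and keep the ground truth so recovered duplicates can be scored. The slides do not describe how the degradation was done, so randomly dropping words is an assumption here, as are the names `DuplicateExperimentSketch`, `degrade`, `plantDuplicates`, and the `dup-` id prefix.

```scala
import scala.util.Random

object DuplicateExperimentSketch {
  // Degrade a description by dropping each word with probability dropRate
  // (one possible degradation; the original method is not stated).
  def degrade(text: String, dropRate: Double, rng: Random): String =
    text.split("\\s+").filter(_ => rng.nextDouble() >= dropRate).mkString(" ")

  // Pick n random documents, add a (possibly degraded) copy of each to the corpus,
  // and return the augmented corpus plus ground truth (planted id -> source id).
  def plantDuplicates(
      corpus: Map[String, String],
      n: Int,
      dropRate: Double,
      seed: Long): (Map[String, String], Map[String, String]) = {
    val rng = new Random(seed)
    // Sort the ids first so the selection is reproducible for a given seed.
    val chosen = rng.shuffle(corpus.keys.toSeq.sorted).take(n)
    val planted = chosen.map(id => s"dup-$id" -> degrade(corpus(id), dropRate, rng)).toMap
    val truth = chosen.map(id => s"dup-$id" -> id).toMap
    (corpus ++ planted, truth)
  }
}
```

A detector can then be scored by checking, for each planted id, whether its nearest neighbour in the augmented corpus is the source id recorded in the ground truth.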
27. How to make something useful?
Machine Learning Effort
30. Interquartile Ranges (IQRs) are more resilient to outliers than standard deviations
IQRs bring information about the entire set of possible duplicates
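The resilience claim can be checked on a toy set of distances. The sketch below is only an illustration with hypothetical names (`IqrSketch`, `quantile`, `iqr`, `stdDev`); it uses linear-interpolation quantiles, the same rule as R's default `quantile` (type 7), and the sample standard deviation. How the IQR was turned into the `iqrs` model feature is not shown in the slides and is not reproduced here.

```scala
object IqrSketch {
  // Linear-interpolation quantile on already-sorted data (R's default, type 7).
  def quantile(sorted: IndexedSeq[Double], p: Double): Double = {
    val h = (sorted.length - 1) * p
    val lo = math.floor(h).toInt
    val hi = math.ceil(h).toInt
    sorted(lo) + (h - lo) * (sorted(hi) - sorted(lo))
  }

  // Interquartile range: spread of the middle half of the values.
  def iqr(xs: Seq[Double]): Double = {
    val s = xs.sorted.toIndexedSeq
    quantile(s, 0.75) - quantile(s, 0.25)
  }

  // Sample standard deviation (divides by n - 1).
  def stdDev(xs: Seq[Double]): Double = {
    val mean = xs.sum / xs.size
    math.sqrt(xs.map(x => (x - mean) * (x - mean)).sum / (xs.size - 1))
  }
}
```

Take the distances 0.40, 0.42, 0.44, 0.46, 0.48 and add one very close candidate at 0.02. The standard deviation grows several times over, while the IQR moves only slightly.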
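Random Forest Model (R):
library(caret)         # createDataPartition
library(randomForest)  # randomForest
# Train on a 90% partition of the candidate pairs
trainIdx <- createDataPartition(dupesFoundByTopic$match,
                                p=0.9, list=FALSE, times=1)
train <- dupesFoundByTopic[trainIdx,]
# Classify match / non-match from the distance and IQR features
fit <- randomForest(as.factor(match) ~ distance + iqrs,
                    data=train)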
Combining Distance and IQR
Feature    Mean Decrease Gini
distance   498
IQR        57

Confusion matrix (rows: prediction, columns: reference)
Pred.    Reference FALSE   Reference TRUE
FALSE    204               2
TRUE     4                 32
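Summary metrics follow directly from the confusion matrix. The sketch below treats TRUE (a duplicate) as the positive class, which is an assumption, and reads rows as predictions and columns as reference labels, as the table headers indicate. The names `ConfusionMatrixSketch`, `Confusion`, and `fromSlide` are hypothetical.

```scala
object ConfusionMatrixSketch {
  final case class Confusion(tp: Int, fp: Int, fn: Int, tn: Int) {
    def total: Int = tp + fp + fn + tn
    def accuracy: Double = (tp + tn).toDouble / total   // share of all pairs classified correctly
    def precision: Double = tp.toDouble / (tp + fp)     // share of predicted duplicates that are real
    def recall: Double = tp.toDouble / (tp + fn)        // share of real duplicates that were found
  }

  // Counts taken from the slide: Pred. TRUE / Ref. TRUE = 32, Pred. TRUE / Ref. FALSE = 4,
  // Pred. FALSE / Ref. TRUE = 2, Pred. FALSE / Ref. FALSE = 204.
  val fromSlide: Confusion = Confusion(tp = 32, fp = 4, fn = 2, tn = 204)
}
```

On these 242 held-out pairs, accuracy is 236/242 (about 0.975), precision is 32/36 (about 0.889), and recall is 32/34 (about 0.941).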
31. Current Status
● Topic Models / Topic Distances seem useful
○ Esp. when part of a multi-signal model (i.e. images)
● Hybrid Spark and R approach
○ Moving to 100% Spark in future for speed
● Topic Models just sitting there, waiting for exploitation
○ “Programmatic” Marketing Efforts, &c
● But what about Locality Sensitive Hashing?
32. Questions?
Brent Schneeman
Principal Data Scientist
HomeAway, Inc.
brent@homeaway.com
careers.homeaway.com
@schnee
← https://www.homeaway.com/vacation-rental/p3482065