Duplicate Detection via Topic Modeling
HomeAway Key Facts
● 1,300,000+ global vacation rental listings
● 200,000,000+ vacation days / year
● ~190 countries, 22 languages
● HQ in Austin, TX; part of Expedia, Inc
--> Capable competition and fraud vectors
Competitive Intelligence
Breckenridge Colorado
HomeAway in blue
Breckenridge, zoomed in
Same Property
The Property Descriptions
Why Property Descriptions?
● Almost identical text
● Similar descriptions seemed probable
○ Consistent owner branding, easy to replicate
● Tech team wanted to use natural language processing techniques
● Didn’t know if this would work when we began
The Other Guys
There are truly inspiring views at High Point Retreat and
plenty of places to sit and enjoy them. Take a load off in one
of the many rooms with views of the ski mountain and
remember how lucky you are to live like this. Cozy up with
family in the sunken living room and chat for hours on end.
Sit in a circle of tree stumps around the outdoor fire pit and
roast marshmallows. After all that sitting, youll be more than
happy to walk 250 yards to the free shuttle to get the blood
pumping again. Then, have a seat and enjoy your free ride.
Best. Vacation. Ever. Vacation homes allow families to
stay...together. At InvitedHome, we think that's pretty
important, so we do everything in our power to make your
vacation totally epic. Not only do we choose the best homes
in the best destinations, but we make the experience
effortless so you can really enjoy yourself. Our team will
stock your fridge, babysit the kids, cater your party, plan your
day trip, make reservations, and do whatever we can to
make sure you have the Best. Vacation. Ever.
HomeAway
There are truly inspiring views at High Point Retreat and
plenty of places to sit and enjoy them. Take a load off in one
of the many rooms with views of the ski mountain and
remember how lucky you are to live like this. Cozy up with
family in the sunken living room and chat for hours on end.
Sit in a circle of tree stumps around the outdoor fire pit and
roast marshmallows. After all that sitting, you’ll be more
than happy to walk 250 yards to the free shuttle to get the
blood pumping again. Then, have a seat and enjoy your free
ride.
Best.Vacation.Ever. Vacation homes allow families to stay...
together. At InvitedHome, we think that's pretty important,
so we do everything in our power to make your vacation
totally epic. Not only do we choose the best homes in the
best destinations, but we make the experience effortless so
you can really enjoy yourself. Let us connect you with the
best options in town for babysitting, equipment rental,
transportation, catering, day trips, shopping, dining, and
even stocking your fridge with groceries! We’ll do everything
in our power to make sure you have the Best. Vacation.
Ever.
Initial Approach: TF-IDF and Cosine Distance
Worked great, but...
● “Large” vocabulary size: ~6300 tokens -> 6300 dimensions and millions of sparse vectors
● A little slow (took a week to process the US)
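To make the TF-IDF and cosine distance baseline concrete, here is a minimal pure-Python sketch. It is not the pipeline from the talk: the tokenizer, the smoothed IDF formula, the function names, and the third sample listing are all assumptions chosen for illustration. Only the first sentence is taken from the property description quoted earlier.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Naive tokenizer (assumption): lowercase, keep runs of ASCII letters.
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse {token: weight} dict per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each token appears.
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        # Term frequency times a smoothed IDF, so tokens that occur in
        # every document still keep a small nonzero weight.
        vectors.append({
            tok: (count / len(toks))
            * (math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1.0)
            for tok, count in counts.items()
        })
    return vectors


def cosine_distance(a, b):
    """1 - cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(tok, 0.0) for tok, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


listings = [
    "Sit in a circle of tree stumps around the outdoor fire pit and roast marshmallows.",
    "Sit in a circle of tree stumps around the outdoor fire pit and roast marshmallows!",
    # Hypothetical unrelated listing, invented for this example.
    "Modern downtown loft close to restaurants and the convention center.",
]
vecs = tfidf_vectors(listings)
near_duplicate = cosine_distance(vecs[0], vecs[1])
unrelated = cosine_distance(vecs[0], vecs[2])
print(near_duplicate, unrelated)
```

Each vector has one dimension per vocabulary token, which is why a ~6300-token vocabulary leads to 6300-dimensional sparse vectors and a pairwise comparison that gets slow at scale.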
Spark Clusters?
Topic Modeling?
Other Distance Metrics?
Latent Dirichlet Allocation (Topic Modeling)
David M. Blei, “Probabilistic Topic Models,” Communications of the ACM, Vol. 55 No. 4, Pages 77-84, doi:10.1145/2133806.2133826
Topic Modeling Motivations
● Smaller dimensional space
● Faster processing times
● At the end, we’d have Topic Models
Must be useful for duplicate detection
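We used Spark’s ML APIs for this:
    import org.apache.spark.ml.clustering.LDA

    // numTopics, params, and featureCol are defined elsewhere in the job
    val countLDA = new LDA()
      .setK(numTopics)
      .setMaxIter(params.maxIterations)
      .setSeed(params.randomSeed)
      .setFeaturesCol(featureCol)
      .setTopicDistributionCol("topicDistribution")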
Distances between Topic Distributions
● Euclidean
● Manhattan
● Cosine
● Jensen-Shannon
● Hellinger
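The five measures named above can be sketched in plain Python for two topic distributions (lists of non-negative weights that sum to 1). This is an illustrative sketch, not the talk's Spark code: the function names are assumptions, and it is also an assumption that Jensen-Shannon here means the base-2 divergence rather than its square root.

```python
import math


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))


def cosine_distance(p, q):
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return 1.0 - dot / norm


def kl_divergence(p, q):
    # Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0.
    return sum(a * math.log2(a / b) for a, b in zip(p, q) if a > 0.0)


def jensen_shannon(p, q):
    # Symmetrized KL against the midpoint distribution; bounded in [0, 1]
    # when computed with base-2 logarithms.
    m = [(a + b) / 2.0 for a, b in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def hellinger(p, q):
    # Bounded in [0, 1]; 0 for identical distributions.
    total = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(total) / math.sqrt(2.0)


# Hypothetical 3-topic distributions for two listings.
listing_a = [0.7, 0.2, 0.1]
listing_b = [0.1, 0.2, 0.7]
for name, fn in [("euclidean", euclidean), ("manhattan", manhattan),
                 ("cosine", cosine_distance), ("jensen-shannon", jensen_shannon),
                 ("hellinger", hellinger)]:
    print(name, fn(listing_a, listing_b))
```

Jensen-Shannon and Hellinger are defined specifically for probability distributions, which is one reason they are natural candidates once each listing is represented by its topic distribution rather than by raw token weights.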
How to make something useful?
This is a machine learning effort
Interquartile Ranges are more resilient to outliers than standard deviations
IQRs bring information about the entire set of possible duplicates
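The robustness claim can be checked with a small Python sketch. The distance values are made up, and `iqrs_below_median` is a hypothetical feature: the slides do not define how the `iqrs` column was computed, so this is only one plausible reading (how many IQRs a candidate's distance lies below the median of all candidates).

```python
import statistics


def iqr(values):
    # Interquartile range Q3 - Q1, using the default 'exclusive' method
    # of statistics.quantiles (Python 3.8+).
    q1, _median, q3 = statistics.quantiles(values, n=4)
    return q3 - q1


def iqrs_below_median(distance, candidate_distances):
    # Hypothetical feature (assumption): number of IQRs by which a distance
    # sits below the median distance of the whole candidate set.
    spread = iqr(candidate_distances)
    if spread == 0.0:
        return 0.0
    return (statistics.median(candidate_distances) - distance) / spread


# Made-up topic distances from one listing to its candidate matches.
typical = [0.40, 0.42, 0.44, 0.46, 0.48, 0.50, 0.52, 0.54]
# The same set plus one very close candidate (a likely duplicate).
with_duplicate = typical + [0.01]

stdev_ratio = statistics.stdev(with_duplicate) / statistics.stdev(typical)
iqr_ratio = iqr(with_duplicate) / iqr(typical)
score = iqrs_below_median(0.01, with_duplicate)
print(stdev_ratio, iqr_ratio, score)
```

In this example the single extreme value inflates the standard deviation several times over while the IQR barely moves, so a distance expressed in IQR units still stands out clearly against the rest of the candidate set.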
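Random Forest Model (R):
    library(caret)         # createDataPartition
    library(randomForest)  # randomForest

    # Train on 90% of the labeled candidate pairs
    trainIdx <- createDataPartition(dupesFoundByTopic$match,
                                    p=0.9, list=FALSE, times=1)
    train <- dupesFoundByTopic[trainIdx,]
    fit <- randomForest(as.factor(match) ~ distance + iqrs,
                        data=train)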
Combining Distance and IQR

Feature     Mean Decrease Gini
distance    498
IQR         57

            Reference
Pred.       FALSE    TRUE
FALSE       204      2
TRUE        4        32
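The confusion matrix can be turned into the usual summary metrics with a few lines of Python. This sketch assumes that rows are predictions and columns are reference labels (as the "Reference" / "Pred." headers suggest) and that TRUE, meaning a duplicate, is the positive class. The metric names are standard, but the slides themselves do not report these values.

```python
# Counts read from the table, under the row/column assumption above.
true_negatives, false_negatives = 204, 2   # predicted FALSE row
false_positives, true_positives = 4, 32    # predicted TRUE row

total = true_negatives + false_negatives + false_positives + true_positives

# Share of all pairs classified correctly.
accuracy = (true_positives + true_negatives) / total
# Share of predicted duplicates that really are duplicates.
precision = true_positives / (true_positives + false_positives)
# Share of real duplicates that the model found.
recall = true_positives / (true_positives + false_negatives)
# Harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(f"n={total} accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

Because only 34 of the 242 pairs are true duplicates, precision and recall say more about the model than accuracy does: always predicting FALSE would already be right for 208 of 242 pairs.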
Current Status
● Topic Models / Topic Distances seem useful
○ Esp. when part of a multi-signal model (e.g., images)
● Hybrid Spark and R approach
○ Moving to 100% Spark in future for speed
● Topic Models just sitting there, waiting for exploitation
○ “Programmatic” Marketing Efforts, &c.
Questions?
Brent Schneeman
Principal Data Scientist
HomeAway, Inc.
brent@homeaway.com
careers.homeaway.com
@schnee
← https://www.homeaway.com/vacation-rental/p3482065
