Talk given on NLP at the Elasticsearch meetup in Berlin in February 2017. Discusses word embeddings for product classification, generation of product descriptions and chat bots.
1. Applying NLP to Product Comparison at Visual Meta
Ross Turner
Elasticsearch Meetup Berlin 22/02/17
2. Overview
Applying NLP to Product Comparison
1. Product Comparison on the Visual Meta Platform
2. Using NLP to Maintain a Product Catalogue
3. Making Product Discovery Conversational
3. About Me
Previously…
• Researcher in Natural Language Generation (NLG)
• Software Engineer on Local Search
• Co-founder and Principal Engineer at an NLG Start Up
Currently…
• Engineering Head at Visual Meta
5. Product Comparison at Visual Meta
‘All shops, one site’
• Online marketing platform with
shopping portals in 12 different
countries
• 3 brands: Ladenzeile, ShopAlike,
UmSóLugar
• 100,000,000+ items
• 6,000+ partner shops
6. Faceted Search at Visual Meta
Discover fashion, furniture and more…
• 800,000 platform visits per day
• 80 filter types across 21
categories
• Currently porting filter search to Elasticsearch
7. Maintaining a Product Catalogue at Visual Meta
Product feeds are continuously synced from partner shops:
• Feed items must be categorised in order to be discoverable on the platform
We want to:
• Identify all variants of a product
• Compare offers across shops
• Make it easy for our users to browse through millions of products
Model            Colour      Memory
Apple iPhone 6s  Space Grey  32GB
Apple iPhone 6s  Space Grey  128GB
Apple iPhone 6s  Gold        32GB
Apple iPhone 6s  Gold        128GB
Apple iPhone 6s  Rose Gold   32GB
Apple iPhone 6s  Rose Gold   128GB
Apple iPhone 6s  Silver      32GB
Apple iPhone 6s  Silver      128GB
9. String Matching
Index item names and descriptions, query product variant tag names against the index
Lucene query:
• +(Name:apple Description:apple) +(Name:iphone Description:iphone) +(Name:6s Description:6s)
+(Name:16gb Description:16gb) +(Name:space Description:space) +(Name:grey)
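A minimal sketch of how such a query string could be assembled from a variant's tag tokens. The function name `build_variant_query` and its parameters are hypothetical, not taken from the talk. It emits every clause over both fields and does not escape Lucene special characters.

```python
def build_variant_query(tokens, fields=("Name", "Description")):
    """Build a Lucene query string in which every token is a required clause.

    Each clause matches if the token occurs in any of the given fields.
    Hypothetical helper: special characters are not escaped.
    """
    clauses = []
    for token in tokens:
        term = token.lower()
        alternatives = " ".join(f"{field}:{term}" for field in fields)
        clauses.append(f"+({alternatives})")
    return " ".join(clauses)


query = build_variant_query(["Apple", "iPhone", "6s", "16GB", "Space", "Grey"])
print(query)
```

Because every clause is required (`+`), an item that omits any one token, such as the colour, will not match. This is one way the recall shortfall can arise.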
Test by manually assigning items to a random sample of products
Recall  Precision  Fscore
0.59    0.64       0.61
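As a check on the figures, assuming "Fscore" means the balanced F1 measure (the harmonic mean of precision and recall), the reported value follows from the other two:

```python
def f_score(precision, recall):
    """Balanced F1: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


score = f_score(precision=0.64, recall=0.59)
print(round(score, 2))
```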
10. Error Analysis
Naming for the same product is not consistent across feeds:
1. abc.com: “Apple iPhone 6 (Space Grey, 64GB)”
2. efg.com: “Apple iPhone 6 64 GB Space Grey”
3. xyz.com: “Apple iPhone 6”
Naming for the same product is not consistent within the same feed:
1. “Apple Iphone 6 - 64GB”
2. “Apple Iphone 6 64GB Space Grey”
3. “Kamakshi Apple iPhone 6 (Latest Model) - 64 GB - Space Gray - Smartphone”
Wrongly categorised products in the feed:
• “Cover for Apple Iphone 6 - 64GB”
14. Language Models
Drawbacks of bag of words / n-grams:
• Words are equally distant
• Vectors are sparse
Word embeddings capture semantics:
• Vectors are continuous
• Similar words are close in vector space
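The "equally distant" drawback can be illustrated with one-hot vectors, the simplest bag-of-words representation of a single word. This is a toy sketch with a made-up three-word vocabulary. Every pair of distinct words gets a similarity of zero, so a related word is no closer than an unrelated one.

```python
import math


def one_hot(word, vocabulary):
    """Sparse representation: 1.0 at the word's position, 0.0 elsewhere."""
    return [1.0 if word == entry else 0.0 for entry in vocabulary]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


vocabulary = ["galaxy", "samsung", "cover"]  # made-up vocabulary
galaxy = one_hot("galaxy", vocabulary)
samsung = one_hot("samsung", vocabulary)
cover = one_hot("cover", vocabulary)

# Both similarities are zero: the representation cannot tell
# a related word from an unrelated one.
sim_related = cosine_similarity(galaxy, samsung)
sim_unrelated = cosine_similarity(galaxy, cover)
print(sim_related, sim_unrelated)
```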
1. T Mikolov, K Chen, G Corrado, J Dean (2013). Efficient Estimation of Word Representations in Vector Space. arXiv preprint arXiv:1301.3781
15. Word2Vec for Mobile Phone Items
Mobile phone item corpus:
• 7,890 feed items
• 863k tokens, 41.5k unique
Closest words to “Galaxy”:
Rank  Word     Cosine Distance
1     Samsung  0.51
2     S2       0.48
3     S5       0.46
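A minimal sketch of how a "closest words" list can be computed once embeddings exist. The three-dimensional vectors below are made up for illustration and are not the trained Word2Vec vectors from the talk. Real embeddings would typically have a hundred or more dimensions.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def closest_words(target, embeddings, top_n=3):
    """Rank all other words by cosine similarity to the target word."""
    target_vector = embeddings[target]
    scored = [
        (word, cosine_similarity(target_vector, vector))
        for word, vector in embeddings.items()
        if word != target
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


# Hypothetical toy embeddings, chosen by hand.
toy_embeddings = {
    "galaxy": [0.9, 0.8, 0.1],
    "samsung": [0.8, 0.9, 0.2],
    "s5": [0.7, 0.6, 0.3],
    "cover": [0.1, 0.2, 0.9],
}

neighbours = closest_words("galaxy", toy_embeddings)
for word, similarity in neighbours:
    print(word, round(similarity, 2))
```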
18. Two Descriptions of a Samsung TV
Samsung UE40H6400AK. Display diagonal:
101.6 cm (40"), HD type: Full HD, Display
resolution: 1920 x 1080 pixels. Tuner type:
Analog & Digital, Digital signal format
system: DVB-C, DVB-T. RMS rated power:
20 W. Consumer Electronics Control (CEC):
Anynet+. Picture processing technology:
Samsung Wide Color Enhancer
The Samsung UE40H6400 has a 101.6cm
screen size and a resolution of 1920 x
1080 pixels. It is a Full HD TV, has an
Analog & Digital tuner and comes with
Anynet+.
19. Generating Product Descriptions
• Choosing what to say
• Deciding how to say it
3. E Reiter (2007). An Architecture for Data-to-Text Systems. In Proceedings of ENLG-2007, pages 97-104
20. Two Descriptions of a Samsung Smartphone
Samsung SM-G920F, Galaxy. Display
diagonal: 12.9 cm (5.1"), Display
resolution: 2560 x 1440 pixels, Display
type: SAMOLED. Processor frequency: 2.1
GHz, Coprocessor frequency: 1.5 GHz.
Internal storage capacity: 32 GB, Internal
RAM: 3072 MB. Main camera resolution
(numeric): 16 MP, Video recording modes:
1080p, 2160p, Maximum frame rate: 30
fps. SIM card capability: Single SIM, SIM
card type: NanoSIM, 2G standards: GSM
The Samsung Galaxy S6 has a 12.9 cm
display with 2560 x 1440 pixel resolution.
It has a 2.1 GHz processor, a 16 megapixel
camera and 3072 MB of internal RAM with
32 GB of internal storage capacity.
21. Building Messages from a Product Catalogue
The Samsung Galaxy S6 has a 12.9 cm display
with 2560 x 1440 pixel resolution. It has a
2.1 GHz processor, a 16 megapixel camera
and 3072 MB of internal RAM with 32 GB of
internal storage capacity.
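A minimal template-based sketch of turning a catalogue record into a description like this one. It is an illustration only, not the NLG system described in the talk. The record keys and the function `describe_phone` are hypothetical. Content selection ("what to say") is reduced to the fixed set of attributes the template reads, and realisation ("how to say it") to two fixed sentence templates.

```python
def describe_phone(spec):
    """Render a two-sentence description from selected catalogue attributes."""
    first = (
        f"The {spec['brand']} {spec['model']} has a {spec['display_cm']} cm display "
        f"with {spec['resolution']} pixel resolution."
    )
    second = (
        f"It has a {spec['cpu_ghz']} GHz processor, a {spec['camera_mp']} megapixel camera "
        f"and {spec['ram_mb']} MB of internal RAM with {spec['storage_gb']} GB "
        f"of internal storage capacity."
    )
    return f"{first} {second}"


# Hypothetical record; values taken from the specification text on the slide.
galaxy_s6 = {
    "brand": "Samsung",
    "model": "Galaxy S6",
    "display_cm": 12.9,
    "resolution": "2560 x 1440",
    "cpu_ghz": 2.1,
    "camera_mp": 16,
    "ram_mb": 3072,
    "storage_gb": 32,
    "sim_type": "NanoSIM",  # present in the record but not selected for the text
}

description = describe_phone(galaxy_s6)
print(description)
```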
23. Entity Recognition for Voice Search
Input - “I’d like some red adidas trainers”
Output:
• <brands, [adidas]>
• <categories, [trainers]>
• <colours, [red]>
4. http://visual-meta.com/tech-corner/hi-lara-building-a-conversational-agent-for-visual-metas-first-hackathon.html
24. Using the Product Catalogue to Parse Queries
A Lucene index is built from labels to tag tree tokens
1. Word shingles are extracted from the input query
2. Each shingle is queried against the index (top down, greedy)
Labelled tokens are used to:
1. Query the product index
2. Keep track of the dialogue state
• “I’d like some red adidas trainers”
• “I’d like some red adidas”
• “like some red adidas trainers”
• “I’d like some red”
• “like some red adidas”
• “some red adidas trainers”
• ...
• “red”
• “adidas”
• “trainers”
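The shingle lookup can be sketched as follows. Shingles are generated longest first and the first match for a span of words wins, which is one reading of "top down, greedy". A plain dictionary stands in for the Lucene index here, and the names `word_shingles`, `tag_query` and `label_index` are hypothetical.

```python
def word_shingles(tokens):
    """Yield (start, end) spans, longest first, left to right within a length."""
    count = len(tokens)
    for size in range(count, 0, -1):
        for start in range(0, count - size + 1):
            yield start, start + size


def tag_query(query, label_index):
    """Greedily label the longest shingles that match a known label."""
    tokens = query.lower().split()
    used = [False] * len(tokens)
    tags = {}
    for start, end in word_shingles(tokens):
        if any(used[start:end]):
            continue  # part of this span was already labelled by a longer shingle
        shingle = " ".join(tokens[start:end])
        tag_type = label_index.get(shingle)
        if tag_type is not None:
            tags.setdefault(tag_type, []).append(shingle)
            used[start:end] = [True] * (end - start)
    return tags


# Stand-in for the label index: label text -> tag type.
label_index = {"adidas": "brands", "trainers": "categories", "red": "colours"}

tags = tag_query("I'd like some red adidas trainers", label_index)
print(tags)
```

Trying long shingles first means that a multi-word label, if present in the index, is preferred over its individual words.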
25. Putting It all Together: Answering Queries
How big is the Samsung Galaxy S6’s screen?
The Samsung Galaxy S6 has a 12.9 cm display
How much RAM does it have?
It has 3072MB of RAM
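A toy sketch of how dialogue state lets the follow-up question resolve "it". The last product mentioned is remembered and reused when a question names no product. The catalogue contents, the keyword matching and the names `DialogueState` and `answer` are all hypothetical simplifications. A real system would use the tagged tokens from the query parser instead of substring checks.

```python
# Hypothetical mini-catalogue: product key -> display name and attribute phrases.
CATALOGUE = {
    "samsung galaxy s6": {
        "name": "Samsung Galaxy S6",
        "screen": "a 12.9 cm display",
        "ram": "3072MB of RAM",
    }
}
ATTRIBUTES = ("screen", "ram")


class DialogueState:
    """Remembers the product currently under discussion."""

    def __init__(self):
        self.product = None


def answer(question, state):
    text = question.lower()
    mentioned = False
    for key, product in CATALOGUE.items():
        if key in text:
            state.product = product  # update the dialogue state
            mentioned = True
    if state.product is None:
        return "Which product do you mean?"
    for attribute in ATTRIBUTES:
        if attribute in text:
            subject = f"The {state.product['name']}" if mentioned else "It"
            return f"{subject} has {state.product[attribute]}"
    return "Sorry, I don't know that."


state = DialogueState()
print(answer("How big is the Samsung Galaxy S6's screen?", state))
print(answer("How much RAM does it have?", state))
```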
27. Takeaways
1. Word embeddings, even when trained on limited data, can:
a. provide significant improvement over bag of words models for text classification; and
b. reduce the amount of manually curated data required for the task
2. Product catalogues provide a rich information source for conversational apps
3. NLG can be utilised for product feed enhancement as well as conversation