Entity matching of ecommerce offers

Paul Puget

Objectives

• Identify if two webpages present offers of the same product.
• Define a methodology to compare HTML pages of ecommerce offers.
• Respect context constraints.

Figure: one example of two different webpages representing similar offers.

Methodology
  
I. Parsing

• From HTML pages to product information (name, description, image, …).
• Extensive use of the lxml library to query HTML via a language derived from XPath.
  	
  
Name: Crème avene 40mL
Image: discount.fr/prodim.jpg
Description: This cream will have an immediate effect on …

Figure: from HTML to JSON product fields.
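
The parsing step can be sketched as follows. This is a minimal illustration under stated assumptions, not the parser used for the poster: the page snippet, the class names (`product-name`, `product-image`, `product-description`) and the `parse_offer` helper are hypothetical. To keep the example free of dependencies, the standard library's `xml.etree.ElementTree` stands in for lxml; it supports only a subset of XPath and requires well-formed markup. With lxml, the equivalent calls would be `lxml.html.fromstring(page)` followed by `tree.xpath(...)`.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical, well-formed product page; real pages are messier.
PAGE = """
<html>
  <body>
    <h1 class="product-name">Crème avene 40mL</h1>
    <img class="product-image" src="discount.fr/prodim.jpg" />
    <p class="product-description">This cream will have an immediate effect on …</p>
  </body>
</html>
"""

# One XPath-like query per product field (assumed page layout).
FIELD_QUERIES = {
    "name": ".//h1[@class='product-name']",
    "image": ".//img[@class='product-image']",
    "description": ".//p[@class='product-description']",
}


def parse_offer(page: str) -> dict:
    """Turn an HTML page into a flat dict of product fields."""
    root = ET.fromstring(page.strip())
    fields = {}
    for field, query in FIELD_QUERIES.items():
        node = root.find(query)
        if node is None:
            fields[field] = None  # field missing on this page
        elif field == "image":
            fields[field] = node.get("src")  # images live in an attribute
        else:
            fields[field] = (node.text or "").strip()
    return fields


offer = parse_offer(PAGE)
print(json.dumps(offer, ensure_ascii=False, indent=2))
```

Keeping the queries in a per-site dictionary means that supporting a new shop only requires a new set of queries, which is where the manual work of this step tends to concentrate.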
  	
  
II. Feature extraction

• Extract and normalize explicit features from product data.
• First clean and tokenize text using text cleaning techniques.
• Then extract data based on dynamically built dictionaries and context.
  
Figure: extraction and normalisation process of a simple three-word string, "Cream JPG 40mL".
Extraction: Manufacturer: JPG; Volume: 40mL.
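
A minimal sketch of this extraction and normalisation step is given below. The dictionary contents, the regular expression and the function names are assumptions made for illustration; the poster builds its dictionaries dynamically and also uses context, neither of which is reproduced here. The target values (Jean-Paul Gaultier, 0.04L) are the normalised outputs shown on the poster.

```python
import re

# Hypothetical alias dictionary; the poster builds such dictionaries dynamically.
MANUFACTURERS = {"jpg": "Jean-Paul Gaultier"}

# Conversion factors to litres for the units handled in this sketch.
UNIT_TO_LITRES = {"ml": 0.001, "cl": 0.01, "l": 1.0}

# A number immediately followed by a volume unit, e.g. "40ml" or "0,5l".
VOLUME_RE = re.compile(r"^(\d+(?:[.,]\d+)?)(ml|cl|l)$")


def tokenize(text: str) -> list[str]:
    """Lower-case the text and split it on whitespace."""
    return text.lower().split()


def extract_features(text: str) -> dict:
    """Extract and normalise manufacturer and volume from a product name."""
    features = {}
    for token in tokenize(text):
        if token in MANUFACTURERS:
            features["manufacturer"] = MANUFACTURERS[token]
            continue
        match = VOLUME_RE.match(token)
        if match:
            value = float(match.group(1).replace(",", "."))
            litres = value * UNIT_TO_LITRES[match.group(2)]
            # Express every volume in litres so that "40mL" and "4cL" compare equal.
            features["volume"] = f"{round(litres, 6):g}L"
    return features


print(extract_features("Cream JPG 40mL"))
```

Normalising to a single unit before comparison is what makes a strict-equality matcher usable on volumes written differently across shops.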
  
III. Feature matching

• From the features previously extracted, we compute a series of matching scores.
• Two types of matchers were mainly used.
  
	
  	
  	
  	
  
Boolean matching is based on strict equality. A boolean matcher can be of one or more of these three subtypes:

• Negative: a negative result means the offers are different (e.g. volume, SKU, manufacturer).
• Positive: a positive result means the offers are the same (only SKU is in this case).
• Neutral: neither a match nor a mismatch allows a conclusion.
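
The three subtypes can be sketched as a single strict-equality function with two flags. The `Outcome` enum, the `boolean_match` name and the treatment of missing values as inconclusive are assumptions of this sketch, not details given on the poster.

```python
from enum import Enum


class Outcome(Enum):
    SAME = "same"            # conclusive: the offers match
    DIFFERENT = "different"  # conclusive: the offers do not match
    UNKNOWN = "unknown"      # nothing can be concluded


def boolean_match(a, b, positive=False, negative=False):
    """Compare two feature values by strict equality.

    positive: equality proves the offers are the same.
    negative: inequality proves the offers are different.
    With both flags off the matcher is neutral and never concludes.
    """
    if a is None or b is None:
        return Outcome.UNKNOWN  # assumption: a missing feature proves nothing
    if a == b:
        return Outcome.SAME if positive else Outcome.UNKNOWN
    return Outcome.DIFFERENT if negative else Outcome.UNKNOWN


# SKU is both positive and negative; volume is negative only.
print(boolean_match("A1", "A1", positive=True, negative=True))
print(boolean_match("0.04L", "0.04L", negative=True))
print(boolean_match("0.04L", "0.1L", negative=True))
```

Equal volumes return `UNKNOWN` rather than `SAME` because two different products can share a volume, whereas two different volumes rule out a match.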
  
Continuous matching gives a score between 0 and 1 depending on the similarity of features.

• Price: absolute and relative difference.
• Name: token differences + jaro_winkler difference (jellyfish package).
• Images: color comparison (numpy + scipy).
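
The price and name scores might look like the sketch below. The exact formulas are not given on the poster, so the ones here are assumptions: the relative price difference is taken against the higher price, token difference is measured as Jaccard overlap, and `difflib.SequenceMatcher` from the standard library stands in for the Jaro-Winkler measure of the jellyfish package. Image comparison is left out.

```python
from difflib import SequenceMatcher


def price_score(p1: float, p2: float) -> float:
    """1 minus the relative price difference, clipped to [0, 1]."""
    highest = max(p1, p2)
    if highest <= 0:
        return 1.0 if p1 == p2 else 0.0
    return max(0.0, 1.0 - abs(p1 - p2) / highest)


def token_score(name1: str, name2: str) -> float:
    """Jaccard overlap of the lower-cased token sets."""
    t1, t2 = set(name1.lower().split()), set(name2.lower().split())
    if not t1 and not t2:
        return 1.0
    return len(t1 & t2) / len(t1 | t2)


def string_score(name1: str, name2: str) -> float:
    """Character-level similarity (difflib ratio, not Jaro-Winkler)."""
    return SequenceMatcher(None, name1.lower(), name2.lower()).ratio()


a, b = "Crème avene 40mL", "Avene crème hydratante 40ml"
print(price_score(9.90, 11.50), token_score(a, b), string_score(a, b))
```

Each function returns a value in [0, 1], so the scores can be fed side by side into a classifier as features of a pair of offers.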
  
Figure (extraction example, normalised output): Manufacturer: Jean-Paul Gaultier; Volume: 0.04L. Diagram steps: extraction, then normalization.
  
Conclusion and perspectives

• Classification accuracy is superior to that reported in recent literature, which does not go beyond 80% accuracy.
• The methodology is not specific to one sector, whereas most studies in the literature are tested on hi-tech products.
• However, results depend on the first two parts (parsing and extraction), which may require manual work.
• For further improvements, feature engineering seems to be the most promising direction.
• Using more advanced semantic techniques, such as those implemented in NLTK, and shape comparison techniques with scikit-image would be the next steps.
  
IV.a Web offer matching: main scoring technique

• The problem of matching web offers is modeled as a classification problem: pairs of web offers are classified as valid or invalid.
• A dataset of pairs is created using boolean positive matching and completed by manual matching.
• The model that proved most accurate is the decision tree classifier as implemented in scikit-learn.
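
As a rough illustration of this setup, the sketch below trains scikit-learn's `DecisionTreeClassifier` on a handful of pairs. The feature columns (name, price and image scores), every numeric value and the hyperparameters are invented for the example; they are not the poster's data or settings.

```python
from sklearn.tree import DecisionTreeClassifier

# Each row describes one pair of offers by three matching scores:
# name similarity, price score, image score. Values are invented.
X = [
    [0.95, 0.90, 0.80],
    [0.90, 0.70, 0.90],
    [0.85, 0.95, 0.60],
    [0.30, 0.90, 0.70],
    [0.20, 0.40, 0.10],
    [0.10, 0.80, 0.50],
]
y = [1, 1, 1, 0, 0, 0]  # 1 = valid pair, 0 = invalid pair

clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X, y)

# Score an unseen pair: predicted class and class probabilities.
new_pair = [[0.88, 0.85, 0.75]]
print(clf.predict(new_pair))
print(clf.predict_proba(new_pair))
```

In practice the labels would come from the positive boolean matcher (shared SKU) plus manual annotation, as described above, and accuracy would be measured on held-out pairs rather than on the training set.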
  
IV.b Web offer matching: optimisations

• Negative matchers make it possible, via pandas DataFrame operations, to eliminate most negative pairs. This saves a lot of computation time.
• When comparing two ecommerce catalogues, we can improve accuracy by using the hypothesis that products are unique within a catalogue. In this case we can use an assignment algorithm to choose the best pairs.
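
Both optimisations can be sketched in plain Python. The catalogue records, the scores and the function names below are hypothetical, and the code stands in for the pandas operations and the assignment algorithm mentioned above rather than reproducing them. A real implementation might use `DataFrame.merge` on the negative feature and `scipy.optimize.linear_sum_assignment` in place of the exhaustive search.

```python
from itertools import permutations

# Hypothetical catalogues with one normalised negative feature (volume).
catalogue_a = [
    {"id": "a1", "volume": "0.04L"},
    {"id": "a2", "volume": "0.1L"},
]
catalogue_b = [
    {"id": "b1", "volume": "0.04L"},
    {"id": "b2", "volume": "0.1L"},
    {"id": "b3", "volume": "0.2L"},
]


def candidate_pairs(cat_a, cat_b, key="volume"):
    """Keep only pairs that agree on a negative feature (blocking)."""
    by_key = {}
    for offer in cat_b:
        by_key.setdefault(offer[key], []).append(offer)
    return [(a, b) for a in cat_a for b in by_key.get(a[key], [])]


def best_assignment(scores):
    """Exhaustively pick the one-to-one pairing with the highest total score.

    scores[i][j] is the score of pairing row i with column j; requires
    at least as many columns as rows. Exponential cost: small inputs only.
    """
    n_rows, n_cols = len(scores), len(scores[0])
    best, best_total = None, float("-inf")
    for cols in permutations(range(n_cols), n_rows):
        total = sum(scores[i][j] for i, j in enumerate(cols))
        if total > best_total:
            best, best_total = list(enumerate(cols)), total
    return best


pairs = candidate_pairs(catalogue_a, catalogue_b)
print([(a["id"], b["id"]) for a, b in pairs])

# Invented classifier scores: both rows prefer column 0 on their own.
scores = [[0.90, 0.80, 0.10],
          [0.85, 0.30, 0.20]]
print(best_assignment(scores))
```

Blocking reduces the six possible pairs to two before any score is computed. The assignment step resolves the conflict where both rows prefer the same column by choosing the pairing with the best total, which a per-row maximum would not do.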
  
Figure: classification score depending on the portion of classified pairs (defined using classification probability). The test was conducted on a dataset of 50,000 web offer pairs.
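
One plausible reading of this figure is that only pairs whose predicted probability is confident enough are classified, and accuracy is then reported against the portion of pairs that were classified. The sketch below follows that reading, which is an assumption; the function name, the probabilities and the labels are invented for illustration.

```python
def accuracy_at_coverage(probas, labels, threshold):
    """Classify only pairs whose predicted class has probability >= threshold.

    probas: predicted probability that each pair is valid.
    labels: true labels (1 = valid, 0 = invalid).
    Returns (portion of pairs classified, accuracy on those pairs).
    """
    decided = [
        (p >= 0.5, bool(y))
        for p, y in zip(probas, labels)
        if max(p, 1.0 - p) >= threshold
    ]
    if not decided:
        return 0.0, None  # no pair is confident enough
    correct = sum(pred == truth for pred, truth in decided)
    return len(decided) / len(probas), correct / len(decided)


probas = [0.95, 0.90, 0.60, 0.40, 0.10, 0.55]  # invented values
labels = [1, 1, 0, 0, 0, 1]

for threshold in (0.5, 0.8):
    print(threshold, accuracy_at_coverage(probas, labels, threshold))
```

Raising the threshold classifies fewer pairs but tends to make fewer mistakes on them, so sweeping it traces the kind of curve the caption describes.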
  	
  
