Document semantic characterization

Presentation given at the 2018 IEEE International Conference on Intelligent Systems (IS2018) for the article:
L. Mazzola, P. Siegfried, A. Waldis, M. Kaufmann, A. Denzler (2018). "A Domain Specific ESA Inspired Approach for Document Semantic Description". In Proceedings of IEEE IS2018, Madeira Island, Portugal, pp. xx-xx. ISBN: 978-1-5386-7097-2/18

  1. A Domain Specific ESA-inspired Approach for Document Semantic Description
     Luca Mazzola, Patrick Siegfried, Andreas Waldis, Michael Kaufmann, and Alexander Denzler
     HSLU - Lucerne University of Applied Sciences, School of Information Technology, 6343 Rotkreuz, Switzerland
     9th IEEE International Conference on Intelligent Systems (IS2018), 25/09/2018
  2. Motivation
     - DSS: Decision Support System for
       - job placement
       - further education suggestion
       - profile (CV) similarity identification
     - Data driven
     - Automatically evolving (no rule definition needed)
     - Limiting the cold-start problem
     Key points: DSS; data-driven; limited cold-start
  3. Issues
     - Unstructured/semi-structured documents
       - CV/résumé
       - job offer
       - education description (high school, professional instruction, Bachelor, Master, executive education, ...)
       - other general-purpose documents (e.g. websites)
     - Mixing with on-the-job training
       - No formal learning objective, no uniform description
       - Consideration of competences gained through job experience
     Key points: unstructured data; different origins/standards; informal and semi-formal
  4. Our Approach
     - External, crowd-based, available corpus: Wikipedia
       - Good quality
       - Concepts = existing page titles
       - Vocabulary = page content (stems)
       - Metric = normalized TF-IDF
       - As suggested by ESA (Explicit Semantic Analysis), but transposed
     - Domain-specific filtering
       - Noise reduction by removal of “irrelevant” concepts / vocabulary
     Key points: Wikipedia as data source (ESA); nTF-IDF; domain specific (noise limiting)
  5. Semantic matrix building process
     - Enriching (no disambiguation, virtual pages for redirects)
     - Filtering
     Data characterization:
       DEWiki:                 ~2.5M
       CVs:                    ~27K
       Job offers:             ~30K
       Education descriptions: ~1.1K
       Valid “concepts”:       ~40K
       Valid “stems”:          ~66K
  6. Reference Model building
     - Additional distribution data
     - Dynamic filtering
  7. Requirements
     - Develop a metric to compare documents based on a common set of attributes
     - Compare two given documents:
       - identify similarities
       - extract common “concepts”
     - Compare a given document against a set:
       - assign relevant CVs to a job post
       - match educational experiences to a CV on a common skill set
       - find CVs similar to a given one
     Key points: set of requirements
  8. Non-randomness verification
     - Ranked matching between 17 CVs and 44 educational experiences
     - Gold standard: manual annotation by the business partner (ordered top-3 educations for each CV)
     - Weighted according to the table:

       Rank     #1     #2     #3
       Top-1    2      -      -
       Top-2    1/2    3/2    -
       Top-3    1/3    3/3    5/3
       Top-5    1/5    3/5    5/5
       Top-10   1/10   3/10   5/10

       -> Expected value for pure random assignment: E[Q] ~ 0.32
     - Obtained result -> Q = 6.62 and sd[Q] = 1.68
     - Additional analysis, for 5 representative cases
     Key points: Wikipedia as data source (ESA); nTF-IDF; domain specific (noise limiting)
  9. Experiment
     - We identified a set of 10 heterogeneous documents in German:
       - Doc1 Automobile Mechatroniker EZF (educ exp)
       - Doc2 Software Entwickler (JOB offer)
       - Doc3 B.Sc. Medizin-Informatiker/in BFH (educ exp)
       - Doc4 AutoMechatroniker (JOB offer)
       - Doc5 Webpage of «Data Intelligence» team at HSLU (website)
       - Doc6 Dipl. Pflegefachperson HF/FH (Privatabteilung) (JOB offer)
       - Doc7 Luzerner Kantonspital website - general page (website)
       - Doc8 Zuger Kantonspital website – «about us» (website)
       - Doc9 Visa hat technische Probleme in ganz Europa (news, 01 Jun)
       - Doc10 Bayer übernimmt Monsanto für 63 Milliarden (news, 07 Jun)
       Doc9 and Doc10 are noise, taken from http://www.20min.ch
     - Analysis to discover relationships (similarities) amongst them
     Key points: experiment setup
  10. Results: pairwise similarities [figure]
  11. Results: final R measure [figure]
  12. Results: dendrogram by spectral clustering [figure]
  13. Achievements
     - An ESA-inspired approach for document comparison
     - Able to work on heterogeneous documents
       - Language
       - Structure
     - Domain filtering for better specificity (less noise)
     - Better results than random assignment
     - Positive manual evaluation by humans
     - Clustering capabilities
       - Meaningful
       - Able to spot and “separate” outliers in a dataset (noise)
     Key points: new approach; good performance; outlier “detection”
  14. Limits
     - Language dependent
       - Currently in German
     - No interpretation of the absolute distance between documents
       - Only comparisons are meaningful
     - No completely meaningful explicit signature of a document (such as the one offered by ESA)
     - Computational complexity of model creation
       - But dynamic adjustment partially compensates
     Key points: language dependency; adopted metrics; explicit semantic interpretation
  15. Next Steps
     - Granular approach usage
       - Using, if available, the CV semi-structure
     - Customizable metrics for stem weighting
     - Different metrics for vector comparison
     - Multilanguage version
       - Using the Wikipedia metadata for “translated” pages
     - Granular map of the Swiss (CH) educational panorama
     Key points: improve model (metrics); multilanguage support; towards a map of CH education
  16. Questions
     Dr. Luca Mazzola, Research Associate, Research, Rotkreuz
     T direct +41 41 757 68 90
     luca.mazzola@hslu.ch
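
The approach on slide 4 (concepts = page titles, vocabulary = page content, metric = normalized TF-IDF) and the comparison requirements on slide 7 can be illustrated with a minimal sketch. It follows the classical ESA layout and is not the authors' implementation: the slides do not specify the "transposed" variant, the normalization, or the comparison metric, so the L2 normalization, the cosine similarity, the plain lowercase tokenizer standing in for German stemming, and all function names and toy texts below are assumptions made for illustration only.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: lowercase word tokens stand in for real stemming.
    return re.findall(r"\w+", text.lower())


def build_semantic_matrix(concept_pages):
    """Map each term to {concept: L2-normalized TF-IDF weight}."""
    n = len(concept_pages)
    tf = {c: Counter(tokenize(t)) for c, t in concept_pages.items()}
    # Document frequency: in how many concept pages each term occurs.
    df = Counter(term for counts in tf.values() for term in counts)
    matrix = defaultdict(dict)
    for concept, counts in tf.items():
        weights = {t: cnt * math.log(n / df[t]) for t, cnt in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        if norm == 0:
            continue
        for term, w in weights.items():
            if w > 0:  # terms present in every page carry no information
                matrix[term][concept] = w / norm
    return matrix


def concept_vector(text, matrix):
    """Describe a document as a weighted sum of its terms' concept vectors."""
    vec = defaultdict(float)
    for term, cnt in Counter(tokenize(text)).items():
        for concept, w in matrix.get(term, {}).items():
            vec[concept] += cnt * w
    return dict(vec)


def cosine(u, v):
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def rank(query_text, candidates, matrix):
    """Rank a set of documents by similarity to one query document."""
    q = concept_vector(query_text, matrix)
    scored = [(name, cosine(q, concept_vector(text, matrix)))
              for name, text in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy corpus; real input would be domain-filtered Wikipedia pages.
concept_pages = {
    "Software": "software developer code programming java",
    "Nursing": "nurse patient hospital care",
    "Automotive": "car engine mechanic repair workshop",
}
matrix = build_semantic_matrix(concept_pages)
cv = "java programming code"
job_offers = {
    "dev": "software developer java",
    "nurse": "hospital nurse patient care",
}
ranking = rank(cv, job_offers, matrix)
print(ranking)
```

In this toy setting the CV and the developer job offer share no words beyond "java", yet both map onto the same concept, which is the point of comparing documents in concept space rather than word space. As noted on slide 14, only the relative order of such scores is meaningful, not their absolute values.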
