Master’s Thesis
Sociopath: automatic local events extractor
Czech Technical University in Prague
Faculty of Electrical Engineering
Department of Computer Science and Engineering
June, 2017
Supervisor: Ing. Jan Drchal, PhD
Galina Alperovich
Problem: Extract event info automatically from any web page
2
Requirements
- Extract: name, date, location, description of the event
- Automatic extraction regardless of design and web page structure
- High accuracy
3
4
Examples of different designs
Motivation
- Information extraction task on the Web
- Web technologies
- Machine learning: 4 classification tasks (date, name, location, description)
- A popular type of problem for search engines
An interesting and non-trivial task
5
Classification problem
Web element
id: //a[@id = 'name-id']
tag: <div>
text: “Summer classic concert”
font size: 16px
font weight: 300
block height: 35 px
block width: 79 px
X, Y coords: [155, 230]
# of siblings: 2
….
Class: “Event name”
6
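A minimal sketch of how one such web element could be held as a record before it becomes a row of training data. The class name `WebElement` and its field names are assumptions made for illustration; only the attribute values come from the slide.

```python
from dataclasses import dataclass, asdict


@dataclass
class WebElement:
    """One rendered DOM element with the attributes listed on the slide."""
    xpath: str
    tag: str
    text: str
    font_size_px: float
    font_weight: int
    block_height_px: float
    block_width_px: float
    x: float
    y: float
    n_siblings: int
    label: str = "no_event"  # target class; "no_event" marks non-event elements


element = WebElement(
    xpath="//a[@id = 'name-id']",
    tag="div",
    text="Summer classic concert",
    font_size_px=16,
    font_weight=300,
    block_height_px=35,
    block_width_px=79,
    x=155,
    y=230,
    n_siblings=2,
    label="name",
)

# A flat dict is convenient as one row of a feature table.
row = asdict(element)
```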
What would the training data look like?
ID    URL    Class        Tag   Text                          Font size  Color_1  X    ...
id_1  url_1  name         div   “Summer festival”             57         240      120  ..
id_2  url_1  location     li    “Central park”                17         210      130  ..
id_3  url_1  description  span  “Summer is a perfect time..”  36         100      100  ..
id_4  url_2  no_event     a     “http://...”                  ..         ..       ..   ..
id_5  url_2  date         ..    ..                            ..         ..       ..   ..
7
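Since the task is framed as four separate classification problems, one plausible way to turn the single class column into per-component targets is sketched below. The helper `binary_targets` is hypothetical, and the binary one-vs-rest encoding is an assumption rather than a confirmed detail of the thesis.

```python
COMPONENTS = ("name", "date", "location", "description")


def binary_targets(labels):
    """Return one 0/1 target list per event component (one-vs-rest)."""
    return {
        component: [1 if label == component else 0 for label in labels]
        for component in COMPONENTS
    }


# Class column of the five example rows in the table above.
labels = ["name", "location", "description", "no_event", "date"]
targets = binary_targets(labels)
```

Each entry of `targets` can then be paired with the same feature matrix to train one classifier per component.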
Difficulties
- No training data available ⇒ we need to create it
- Specify the list of relevant features
- Web pages are very different and diverse
- Full web page rendering is not fast
- Little previous research
8
Thesis structure
1. Literature review
2. Training data collection
3. Data cleaning
4. Exploratory data analysis
5. Modelling and Evaluation
Architecture of the application
9
Implementation of training data collection
- Schema.org + Microdata semantic HTML markup: Event, Person, Product, Article, etc.
- Web Data Commons - huge online archive of the URLs with semantic markup
- MetaCentrum - pages crawled in parallel to extract features for the Event schema elements
Training dataset where we know exactly where event components are!
10
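The labelling idea on this slide can be sketched with the standard library alone: elements carrying an `itemprop` inside a schema.org Event scope already say which event component they hold. The parser class and the sample HTML below are illustrative assumptions, not the thesis crawler, and the sketch does not handle an `itemprop` element nested inside another element with the same tag.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}  # tags without an end tag


class EventMicrodataParser(HTMLParser):
    """Collect (itemprop, tag, text) triples found inside a schema.org Event."""

    def __init__(self):
        super().__init__()
        self.depth = 0           # nesting depth of open non-void tags
        self.event_depth = None  # depth at which the Event itemscope opened
        self.current = None      # [itemprop, tag, text] of the element being read
        self.found = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag not in VOID_TAGS:
            self.depth += 1
        itemtype = attrs.get("itemtype") or ""
        if self.event_depth is None:
            if itemtype.endswith("schema.org/Event"):
                self.event_depth = self.depth
        elif "itemprop" in attrs:
            if attrs.get("content") is not None:
                # Value given directly in the markup, e.g. <meta content="...">
                self.found.append((attrs["itemprop"], tag, attrs["content"]))
            elif tag not in VOID_TAGS and self.current is None:
                self.current = [attrs["itemprop"], tag, ""]

    def handle_data(self, data):
        if self.current is not None:
            self.current[2] += data

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.current is not None and tag == self.current[1]:
            prop, tag_name, text = self.current
            self.found.append((prop, tag_name, text.strip()))
            self.current = None
        if self.event_depth == self.depth:
            self.event_depth = None  # left the Event scope
        self.depth -= 1


html = """
<div itemscope itemtype="http://schema.org/Event">
  <h1 itemprop="name">Summer classic concert</h1>
  <meta itemprop="startDate" content="2017-06-01">
  <span itemprop="location">Central park</span>
</div>
<p itemprop="name">Outside any event</p>
"""

parser = EventMicrodataParser()
parser.feed(html)
```

After `feed`, `parser.found` holds the labelled elements; the paragraph outside the Event scope is ignored.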
Data cleaning and feature extraction
Features 300 + 30
Rows 1.6M 170K
11
Some Features
DOM tree-related
- HTML tag
- Siblings in a tree
- Children in a tree
- Depth
Visual
- Color of the text
- Text alignment
- Family, size and weight of the font
- Padding
Spatial
- X and Y coordinates
- Visual properties of a block (h, w)
Textual
- Tf-Idf matrix
- Punctuations and Digits
- Upper case letters
- Length of the text
12
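The simpler textual features in the list above can be computed directly from an element's text. This is a sketch only: the function name and the dictionary keys are assumptions, and the exact definitions used in the thesis may differ.

```python
import string


def textual_features(text):
    """Count-based textual features for the text of one web element."""
    return {
        "length": len(text),
        "n_digits": sum(ch.isdigit() for ch in text),
        "n_upper": sum(ch.isupper() for ch in text),
        "n_punct": sum(ch in string.punctuation for ch in text),
    }


features = textual_features("Summer Classic Concert, 21 June 2017!")
```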
Not all features are important
Feature importance for the ‘name’ (Random Forest)
Top-5 for Event name:
1. Font family
2. Tag
3. Block width
4. Font size
5. Number of uppercase letters
13
Evaluation
              Name   Date   Location   Description
Accuracy      0.86   0.91   0.81       0.87
Precision     0.86   0.90   0.81       0.83
Recall        0.90   0.95   0.91       0.91
F1-measure    0.86   0.91   0.82       0.86
Highest value of each metric for every event component.
Cross-validation with k = 5
Extreme Random Forest showed the best results on average
Classification models
Random forest
SVM
Logistic regression
Extreme Random Forest
14
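For reference, the four reported metrics can be computed from binary predictions as sketched below. The function name and the toy labels are made up for illustration and are not results from the thesis.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for 0/1 labels."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy example: 1 = "is an event name", 0 = anything else.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
metrics = binary_metrics(y_true, y_pred)
```

With 5-fold cross-validation, these values would be computed on each held-out fold and then averaged.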
Tools
- Python: sklearn, seaborn
- PhantomJS for page rendering
- Scrapy, HTML features
- MetaCentrum (parallel crawling)
Feature engineering
- TF-IDF for word importance
- PCA, t-SNE
- Feature importance from XGBoost and Random Forest
15
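To make the TF-IDF item concrete, here is a textbook version (term frequency times log of inverse document frequency) in plain Python. It is a sketch under that assumption: sklearn's `TfidfVectorizer` applies smoothing and normalisation by default, so its numbers would differ, and the sample texts are invented.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


docs = ["summer concert", "summer festival", "central park"]
weights = tf_idf(docs)
```

A term that appears in fewer element texts ("concert") gets a higher weight than one shared across several ("summer").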
Conclusion
- Review of modern Web extraction methods
- Parallel automatic collection of the training dataset
- Engineering of DOM-tree, visual, textual and spatial features
- Extensive dataset cleaning
- Insights on dataset
- Several classification models for every event component
- The dataset is now public and the whole process is published on GitHub
- Proof of concept of automatic training set collection
16
Thank you!
17
The headless browser PhantomJS is no longer supported. Does that affect
possible future work?
PhantomJS is a headless browser used for web testing. It has to keep up with modern web
browsers, so timely updates are important.
If it is not actively supported, alternatives will take its place for testing (Nightmare is
one example), because automated testing of web interfaces is standard practice
today.
18
Is it possible to render vector format pictures with matplotlib?
Yes :)
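from matplotlib import pyplot as plt
fig, ax = plt.subplots()
ax.plot([0, 1, 2], [0, 1, 4])  # placeholder data so the figure is not empty
# EPS is a vector format, so the saved figure scales without pixelation
fig.savefig('filename.eps', format='eps')
19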
Disadvantages of separate classification problems for every event
component?
- I consider every element independently of the others ⇒ information is lost
- Mutual positions and other relative features would probably improve the results
20
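As an illustration of what such relative features might look like, the sketch below derives a few values from the positions of the other elements on the same page. The feature names and the choice of distances are hypothetical; the thesis does not specify them.

```python
import math


def relative_features(element, others):
    """Position of one element relative to the other elements on the page."""
    if not others:
        return {"min_dist": 0.0, "mean_dist": 0.0, "n_above": 0}
    dists = [
        math.hypot(o["x"] - element["x"], o["y"] - element["y"])
        for o in others
    ]
    return {
        "min_dist": min(dists),
        "mean_dist": sum(dists) / len(dists),
        # Smaller Y means higher up on the rendered page.
        "n_above": sum(o["y"] < element["y"] for o in others),
    }


name_candidate = {"x": 155, "y": 230}
other_elements = [{"x": 155, "y": 270}, {"x": 185, "y": 190}]
rel = relative_features(name_candidate, other_elements)
```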
Do you plan to further utilize/promote your system?
Probably yes, I want to try to create a scalable system for events in different cities. It
would be easy to find them with such a framework.
21
