web page classification
with naïve bayes classifiers

nabeelah ali
27 november 2013

  1. web page classification with naïve bayes classifiers. nabeelah ali, 27 november 2013
  2. outline • what is web page classification • motivation • literature review • project design • experiments • evaluation
  3. description & motivation
  4. what is classification?
  5. web page classification: web page classification can be seen as a type of document classification
  6. documents vs web pages • web pages have structure • HTML indicates headings, paragraphs, meta-information • web pages are interconnected • they contain hyperlinks to other pages • they have locations (URLs)
  7. why? web directories
  8. why? improving search results
  9. why? • user profile mining • information filtering • creation of domain-specific search engines
  10. literature review
  11. bag of words: text is represented as an unordered list of words
  12. n-gram representation • document is represented by a vector of features • concepts expressed by phrases can be captured (e.g. “New York” vs “new” and “york”)
  13. using html structure • assign weight depending on HTML tags, and make the feature a linear combination of these • e.g. headings would have a greater weight • four main elements are considered: title, headings, metadata and main text. Source: Golub, Koraljka, and Anders Ardö. "Importance of HTML structural elements and metadata in automated subject classification." Research and Advanced Technology for Digital Libraries. Springer Berlin Heidelberg, 2005. 368-378.
  14. visual analysis • visual representation by web browser is important • each web page is visualised as an adjacency multigraph, with each section representing a different kind of content. Source: Kovacevic, Milos, et al. "Visual adjacency multigraphs—a novel approach for a Web page classification." Proceedings of SAWM04 workshop, ECML2004. 2004.
  15. URL features • pages do not need to be fetched or analysed • fast! • derives tokens from the URL and uses these tokens as features. Source: Kan, Min-Yen, and Hoang Oanh Nguyen Thi. "Fast webpage classification using URL features." Proceedings of the 14th ACM international conference on Information and knowledge management. ACM, 2005.
  16. web page classification: project design
  17. dataset • 4 universities dataset (cornell, texas, washington, wisconsin) • each page must be classified into a category: course, department, faculty, project, staff, student, other. Source: http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/
  18. document classification • single label classification: one and only one class label is assigned to each instance • hard classification: an instance can either be or not be in a particular class, with no intermediate state • multi-class classification: instances can be divided into more than two categories
  19. details of the dataset
  20. experiment #1: bag of words. use the words, unweighted, as features [figure: word cloud of single words, e.g. assistant, CS, Dr, intern, admission, Professor, room, research]
  21. experiment #2: HTML tag weighting. use words weighted by the HTML tags (e.g. words in <h1> tags will be weighted more heavily than those in <p> tags) [figure: the same word cloud with words at different sizes]
  22. experiment #3: n-gram. use phrases instead of single words as features [figure: word cloud of phrases, e.g. research assistant, contact information, program description, course outline]
  23. evaluation: k-fold cross validation. From http://chrisjmccormick.wordpress.com/2013/07/31/k-fold-cross-validation-with-matlab-code/
  24. evaluation: confusion matrix. http://en.wikipedia.org/wiki/Confusion_matrix
  25. bibliography. B. Choi and Z. Yao: Web Page Classification, StudFuzz 180, 221–274 (2005). Qi, Xiaoguang, and Brian D. Davison. "Web page classification: Features and algorithms." ACM Computing Surveys (CSUR) 41.2 (2009): 12. Golub, Koraljka, and Anders Ardö. "Importance of HTML structural elements and metadata in automated subject classification." Research and Advanced Technology for Digital Libraries. Springer Berlin Heidelberg, 2005. 368-378. Kan, Min-Yen, and Hoang Oanh Nguyen Thi. "Fast webpage classification using URL features." Proceedings of the 14th ACM international conference on Information and knowledge management. ACM, 2005. Kovacevic, Milos, et al. "Visual adjacency multigraphs—a novel approach for a Web page classification." Proceedings of SAWM04 workshop, ECML2004. 2004.
  26. questions?
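
The bag-of-words experiment (slides 11 and 20) can be illustrated with a small multinomial naïve Bayes classifier. This is only a sketch under stated assumptions: the slides do not say which naïve Bayes variant, tokenizer, or smoothing was used, so the add-one (Laplace) smoothing, the `tokenize` helper, and the toy training pages below are illustrative choices, not the author's setup.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def train_naive_bayes(labelled_docs):
    """Collect per-class document and word counts from (text, label) pairs."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labelled_docs:
        class_docs[label] += 1
        bag = Counter(tokenize(text))  # bag of words: word order is discarded
        word_counts[label].update(bag)
        vocabulary.update(bag)
    return class_docs, word_counts, vocabulary


def classify(text, model):
    """Return the class with the highest log-posterior (add-one smoothing)."""
    class_docs, word_counts, vocabulary = model
    total_docs = sum(class_docs.values())
    bag = Counter(tokenize(text))
    best_label, best_score = None, -math.inf
    for label in class_docs:
        score = math.log(class_docs[label] / total_docs)  # log prior
        denom = sum(word_counts[label].values()) + len(vocabulary)
        for word, count in bag.items():
            if word in vocabulary:  # words never seen in training are ignored
                score += count * math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy, made-up training pages (not taken from the 4 universities dataset).
training = [
    ("professor research publications office hours", "faculty"),
    ("professor teaching research interests", "faculty"),
    ("course syllabus homework assignments lecture", "course"),
    ("lecture notes homework exam schedule", "course"),
]
model = train_naive_bayes(training)
predicted = classify("homework and lecture schedule", model)
```

Scores are summed in log space so that long pages do not underflow to zero probability.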
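
The HTML tag weighting idea (slides 13 and 21) might look roughly like the following, where each word's feature value is the sum of the weights of the tags it appears in. The weight values in `TAG_WEIGHTS` and the class name `WeightedWordCounter` are hypothetical; the slides say only that headings should weigh more than paragraph text.

```python
from collections import Counter
from html.parser import HTMLParser

# Hypothetical weights for illustration only; the slides give no values.
TAG_WEIGHTS = {"title": 4.0, "h1": 3.0, "h2": 2.0, "p": 1.0}
DEFAULT_WEIGHT = 1.0


class WeightedWordCounter(HTMLParser):
    """Accumulate, per word, the weights of the tags it occurs in."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.weights = Counter()

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed tags.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        # Use the innermost enclosing tag that has a configured weight.
        weight = DEFAULT_WEIGHT
        for tag in reversed(self.open_tags):
            if tag in TAG_WEIGHTS:
                weight = TAG_WEIGHTS[tag]
                break
        for word in data.lower().split():
            self.weights[word] += weight


def weighted_features(html):
    """Return a {word: weighted count} dictionary for one page."""
    parser = WeightedWordCounter()
    parser.feed(html)
    parser.close()
    return dict(parser.weights)


page = (
    "<html><head><title>Databases</title></head>"
    "<body><h1>Databases course</h1><p>course notes</p></body></html>"
)
features = weighted_features(page)
```

Here "databases" scores higher than "notes" because it appears in the title and a heading rather than in body text.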
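
For the n-gram representation (slides 12 and 22), a minimal sketch of extracting word n-grams is shown below. The function names and the choice to combine unigrams with bigrams are assumptions for illustration; the slides do not state which values of n were used.

```python
import re
from collections import Counter


def word_ngrams(text, n):
    """Return the list of n-word phrases in reading order."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_features(text, max_n=2):
    """Count all n-grams for n = 1..max_n as one feature dictionary."""
    features = Counter()
    for n in range(1, max_n + 1):
        features.update(word_ngrams(text, n))
    return features


features = ngram_features("Research assistant in New York")
```

With bigrams included, "new york" becomes a feature of its own alongside "new" and "york", which is the distinction slide 12 draws.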
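
The k-fold cross validation mentioned on slide 23 can be sketched as an index splitter. The helper name `k_fold_splits` is hypothetical, and the value of k is not given in the slides. This version assigns items to folds round-robin without shuffling, so the data should be shuffled beforehand if it is ordered by class.

```python
def k_fold_splits(n_items, k):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    if not 2 <= k <= n_items:
        raise ValueError("k must be between 2 and the number of items")
    indices = list(range(n_items))
    folds = [indices[i::k] for i in range(k)]  # round-robin assignment
    for held_out in range(k):
        test = folds[held_out]
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, test


splits = list(k_fold_splits(10, 5))
```

Each item is held out exactly once, so every page contributes to the test results while never being used to train the model that classifies it.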
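
The confusion matrix from slide 24 can be built with a few lines of code. The labels and predictions below are invented for illustration and are not results from the project.

```python
from collections import Counter


def confusion_matrix(actual, predicted, labels):
    """Rows are actual classes, columns are predicted classes."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]


def accuracy(matrix):
    """Fraction of instances on the diagonal (correctly classified)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total if total else 0.0


# Made-up example data, not experimental results.
labels = ["course", "faculty", "student"]
actual = ["course", "course", "faculty", "student", "student", "faculty"]
predicted = ["course", "faculty", "faculty", "student", "course", "faculty"]
matrix = confusion_matrix(actual, predicted, labels)
```

Off-diagonal cells show which classes are mistaken for which, something a single accuracy figure hides.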
