SlideShare une entreprise Scribd logo
1  sur  18
Modified Naive Bayes Model for
Improved Web Page
Classification
Overview
• Abstract
• Objective
• Related Works
• Traditional Naive Bayes Model
• Proposed Modification
• Classification Algorithm
• Experimental Setup
• Experimental Results
• Conclusions
• References
Abstract
• World Wide Web is a large repository of information and it
keeps on growing exponentially. The fact is that most of the
information stored in it is in either unstructured or semi
structured way. So it is not that much easy to get desired piece
of information out of that large collection of unprocessed data.
Several strategies has been employed to mine the data in
World Wide Web for finding interesting information and to
organize them in a meaningful way. Web data mining is a field
of active research interest and computer scientists across the
world is looking into development and improvement of web
mining strategies. In this paper we will present a modified
probabilistic model based on naive bayes theorem, which will
help to classify webpages based on their textual content, in a
better way than traditional naive bayes model.
ObjectiveObjective
• Explore traditional naive bayes textExplore traditional naive bayes text
classification model, and optimize it forclassification model, and optimize it for
improved performance in terms of costimproved performance in terms of cost
and time.and time.
• Use above algorithm for automatedUse above algorithm for automated
classification of web pages based on theirclassification of web pages based on their
textual content.textual content.
Previous WorksPrevious Works
• Following are some of the methods presented
for text classification
– Decision tree method[5]
– Rule based classification method[6]
– SVM(Space Vector Model)[7][8][9]
– Neural network classifiers[10][11][12]
– Bayesian Classifiers[13][14]
– K-nearest neighbour approach[16]
• Most of these text classification methods were
further extended for web page classification by
taking textual content or hierarchy of web page into
consideration[17]
Traditional Naive Bayes ModelTraditional Naive Bayes Model
• Bayesian classifiers are statistical classifiers
which predict the class membership probabilities of
tuples, which means probability of tuples to be in a
particular class.
•Naive Bayes classifiers are Bayesian Classifiers
that assume 'class conditional independence'.
•According to class conditional independece,
'Effect of value an attribute in a class is
independent of values of other attributes'.
Classification Algorithm
INPUT
freq-> word frequency list of test web page.
Database-> 2d hash table containing list of word frequencies in each category
categories->List of available categories for classification
OUTPUT
category->category of given web page
function probability_model(freq_list,database,categories):
v=total_number_words_in_database
pc={} // A hash function for storing probability of each category
for each category in categories:
attributes=database[category]
n=total_number_of_words_in_category
pc[category]=0
for each word in freq_list:
if word not in attributes:
pc[category]=pc[category]+(1.0/(n+v))
else:
pc[category]=pc[category] + ((1+attributes[word])/(n+v))
Category=category for which pc[category] is maximum
return Category
Experimental Setup
• Dataset
– Standard 20 newsgroup data set was used for
testing purpose.
– The 20 newsgroup dataset contains 19997
documents classified into 20 different groups
uniformly. Some of the newsgroups are very
closely related to each other while others are
highly unrelated
• Testing Methods
– Random Sub Set Sampling
– K-Fold cross validation
Experimental Results
Number of training
documents
Accuracy With
Traditional Model
Accuracy with
proposed Model
100 68.63 68.96
200 74.52 75.04
300 77.78 78.66
400 79.10 80.28
500 80.23 81.46
600 81.61 82.62
700 82.25 83.41
800 83.06 83.93
900 82.91 84.36
Random Subset Sampling : Traditional Naive Bayes Model V/S Modified Model
Experimental Results(contd.)
Random Subset Sampling : Traditional Naive Bayes Model V/S Modified Model
Experimental Results(contd.)
Number of folds Accuracy with
traditional Model
Accuracy with
Proposed Model
2 73.26 74.67
3 76.51 77.75
4 77.29 78.74
5 77.66 79.24
6 78.22 79.66
7 78.70 80.05
8 79.07 80.49
9 79.51 80.93
10 79.70 81.12
K-Fold Cross Validation : Traditional Naive Bayes Model V/S Modified Model
Experimental Results(contd.)
K-Fold Cross Validation : Traditional Naive Bayes Model V/S Modified Model
Conclusions
• proposed modified naive bayes classifier is
better than traditional model because of
following reasons.
– Experimental results shows that it provide more
accuracy than traditional model.
– Long multiplication operations an be replaced by less
expensive addition operations
– No need use Laplacian correction, since zero
probability of individual terms will not lead to zero
probability of whole category
– Floating point underflow problem can be avoided
which may arise due to continuous multiplication of
small numbers.
References
1. http://www.sciencedaily.com/releases/2013/05/130522085217.htm
2. http://www.worldwidewebsize.com/
3. http://www.thecultureist.com/2013/05/09/how-many-people-use-the-internet-more-than-2-billion-
infographic/
4. Automated Classification of Web Sites using Naive Bayesian Algorithm. Ajay S. Patil and B.V
Pawar IMECS 2012.
5. A decision-tree-based symbolic rule induction system for text categorization. DE. Johnson, FJ
Oles, T Zhang, T Goets 2002
6. Automated learning of decision rules for text categorization. Chidanand and Fred Damerau ACM
1994
7. A statistical learning model of Text classification for Support Vector Machines. Thorsten
Joachims ACM 2001
8. Text categorization with support vector machines: Learning with many relevant features. Thorsten
Joachims
9. Support vector machines for text categorization. A. Basu, C. Watters, and M. Shepherd IEEE
2002
10. Automated text classification using a dynamic artificial neural network model M. Ghiassi,M.
Olschimke, B.Moon,P. Arnaudo Elsevier 2012
11. Study a text classification method based on neural network model Jian Chen,Hailan Pan, Qinyum
Ao Springer 2012
12. Autmatic text classification using artificial neural network. Springer volume 172 2005
13. Naive Bayes text classificaton - http://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-
classification-1.html
14. Naive Bayesian text classification - John Graham-cuming 2005
15. A survey of text classification algorithms- Charu C Aggarwal, ChengXiang Zhai
16. improved K-nearest neighbour algorithm for text categorization. Li Baoli, Yu Shiven, Lu Qin
ICCPOL 2003
References[contd.]
17. Recent research in web page classification-A review . Alamelu Mangai, Santhosh Kumar,
Sugumaran IJCET 2010
18. Data Mining: Practical machine learning tools and techniques, Ian H. Witten and Eibe Frank 2nd
Edition, Morgan Kaufmann, San Francisco, 2005.
19. Fast categorizations of large document collections Shanks, V. and H. E. Williams. SPIRE 2001
20. Simple and accurate feature selection for hierarchical categorization. Wibowo, W. and H.
E.Williams. ACM 2002
21. Computational Approaches to Analyzing Weblogs- Mihalcea, R. and H. Liu.
22. Graph-based text classification: Learn from your neighbors Angelova, R. and G. Weikum SIGIR
2006
23. SM.F. Porter, “An algorithm for suffix stripping”, Program, Vo.14, no. 3, pp. 130-137, Jul. 1980.
24. Python beautiful soup library.http://www.crummy.com/software/BeautifulSoup/
25.http://tartarus.org/martin/PorterStemmer/index.html Porter Stemming algorithm, with various
implementations.
26. http://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html Stanford
NLP , naïve bayes textclassification
27. The BOW or libbow C Library [Online] Available: http://www.cs.cmu.edu/~mccallum/bow/
28. Bird, Steven, Edward Loper and Ewan Klein (2009). Natural Language Processing with Python.
O’Reilly Media Inc. Python NLTK
29. www.matplotlib.org Python based open source mathematical analysis toolkit.
30. Data Mining concepts and techniques Han Kamber Lee Morgan Kaufman publications. 3 rd
Edition 2012.
31. DMOZ open directory project. [Online]. Available: http://dmoz.org/
32. Home Page for 20 newsgroup dataset http://qwone.com/~jason/20Newsgroups/

Contenu connexe

En vedette

"Naive Bayes Classifier" @ Papers We Love Bucharest
"Naive Bayes Classifier" @ Papers We Love Bucharest"Naive Bayes Classifier" @ Papers We Love Bucharest
"Naive Bayes Classifier" @ Papers We Love BucharestStefan Adam
 
02. naive bayes classifier revision
02. naive bayes classifier   revision02. naive bayes classifier   revision
02. naive bayes classifier revisionJeonghun Yoon
 
Naive Bayes Classifier
Naive Bayes ClassifierNaive Bayes Classifier
Naive Bayes ClassifierYiqun Hu
 
09 hc -_bai_tap_nop_so_4
09 hc -_bai_tap_nop_so_409 hc -_bai_tap_nop_so_4
09 hc -_bai_tap_nop_so_4Vitalify Asia
 
Tutorial 8 - Creating Effective Web Pages
Tutorial 8 - Creating Effective Web PagesTutorial 8 - Creating Effective Web Pages
Tutorial 8 - Creating Effective Web Pagesdpd
 
SBC 2012 - Một số thuật toán phân lớp và ứng dụng trong IDS (Nguyễn Đình Chiểu)
SBC 2012 - Một số thuật toán phân lớp và ứng dụng trong IDS (Nguyễn Đình Chiểu)SBC 2012 - Một số thuật toán phân lớp và ứng dụng trong IDS (Nguyễn Đình Chiểu)
SBC 2012 - Một số thuật toán phân lớp và ứng dụng trong IDS (Nguyễn Đình Chiểu)Security Bootcamp
 
Sentiment analysis using naive bayes classifier
Sentiment analysis using naive bayes classifier Sentiment analysis using naive bayes classifier
Sentiment analysis using naive bayes classifier Dev Sahu
 
Modeling Social Data, Lecture 6: Classification with Naive Bayes
Modeling Social Data, Lecture 6: Classification with Naive BayesModeling Social Data, Lecture 6: Classification with Naive Bayes
Modeling Social Data, Lecture 6: Classification with Naive Bayesjakehofman
 
2013-1 Machine Learning Lecture 03 - Naïve Bayes Classifiers
2013-1 Machine Learning Lecture 03 - Naïve Bayes Classifiers2013-1 Machine Learning Lecture 03 - Naïve Bayes Classifiers
2013-1 Machine Learning Lecture 03 - Naïve Bayes ClassifiersDongseo University
 
Single Page Web Applications with CoffeeScript, Backbone and Jasmine
Single Page Web Applications with CoffeeScript, Backbone and JasmineSingle Page Web Applications with CoffeeScript, Backbone and Jasmine
Single Page Web Applications with CoffeeScript, Backbone and JasminePaulo Ragonha
 

En vedette (15)

"Naive Bayes Classifier" @ Papers We Love Bucharest
"Naive Bayes Classifier" @ Papers We Love Bucharest"Naive Bayes Classifier" @ Papers We Love Bucharest
"Naive Bayes Classifier" @ Papers We Love Bucharest
 
02. naive bayes classifier revision
02. naive bayes classifier   revision02. naive bayes classifier   revision
02. naive bayes classifier revision
 
Naive Bayes Classifier
Naive Bayes ClassifierNaive Bayes Classifier
Naive Bayes Classifier
 
09 hc -_bai_tap_nop_so_4
09 hc -_bai_tap_nop_so_409 hc -_bai_tap_nop_so_4
09 hc -_bai_tap_nop_so_4
 
Tutorial 8 - Creating Effective Web Pages
Tutorial 8 - Creating Effective Web PagesTutorial 8 - Creating Effective Web Pages
Tutorial 8 - Creating Effective Web Pages
 
SBC 2012 - Một số thuật toán phân lớp và ứng dụng trong IDS (Nguyễn Đình Chiểu)
SBC 2012 - Một số thuật toán phân lớp và ứng dụng trong IDS (Nguyễn Đình Chiểu)SBC 2012 - Một số thuật toán phân lớp và ứng dụng trong IDS (Nguyễn Đình Chiểu)
SBC 2012 - Một số thuật toán phân lớp và ứng dụng trong IDS (Nguyễn Đình Chiểu)
 
Naive Bayes
Naive Bayes Naive Bayes
Naive Bayes
 
Sentiment analysis using naive bayes classifier
Sentiment analysis using naive bayes classifier Sentiment analysis using naive bayes classifier
Sentiment analysis using naive bayes classifier
 
Lecture10 - Naïve Bayes
Lecture10 - Naïve BayesLecture10 - Naïve Bayes
Lecture10 - Naïve Bayes
 
Naive Bayes Presentation
Naive Bayes PresentationNaive Bayes Presentation
Naive Bayes Presentation
 
Naive bayes
Naive bayesNaive bayes
Naive bayes
 
Modeling Social Data, Lecture 6: Classification with Naive Bayes
Modeling Social Data, Lecture 6: Classification with Naive BayesModeling Social Data, Lecture 6: Classification with Naive Bayes
Modeling Social Data, Lecture 6: Classification with Naive Bayes
 
2013-1 Machine Learning Lecture 03 - Naïve Bayes Classifiers
2013-1 Machine Learning Lecture 03 - Naïve Bayes Classifiers2013-1 Machine Learning Lecture 03 - Naïve Bayes Classifiers
2013-1 Machine Learning Lecture 03 - Naïve Bayes Classifiers
 
Bai 4 Phan Lop
Bai 4 Phan LopBai 4 Phan Lop
Bai 4 Phan Lop
 
Single Page Web Applications with CoffeeScript, Backbone and Jasmine
Single Page Web Applications with CoffeeScript, Backbone and JasmineSingle Page Web Applications with CoffeeScript, Backbone and Jasmine
Single Page Web Applications with CoffeeScript, Backbone and Jasmine
 

Similaire à Modified naive bayes model for improved web page classification

Data Mining and the Web_Past_Present and Future
Data Mining and the Web_Past_Present and FutureData Mining and the Web_Past_Present and Future
Data Mining and the Web_Past_Present and Futurefeiwin
 
2014 IEEE JAVA DATA MINING PROJECT Mining weakly labeled web facial images fo...
2014 IEEE JAVA DATA MINING PROJECT Mining weakly labeled web facial images fo...2014 IEEE JAVA DATA MINING PROJECT Mining weakly labeled web facial images fo...
2014 IEEE JAVA DATA MINING PROJECT Mining weakly labeled web facial images fo...IEEEFINALYEARSTUDENTPROJECT
 
2014 IEEE JAVA DATA MINING PROJECT Mining weakly labeled web facial images fo...
2014 IEEE JAVA DATA MINING PROJECT Mining weakly labeled web facial images fo...2014 IEEE JAVA DATA MINING PROJECT Mining weakly labeled web facial images fo...
2014 IEEE JAVA DATA MINING PROJECT Mining weakly labeled web facial images fo...IEEEMEMTECHSTUDENTSPROJECTS
 
IEEE 2014 JAVA DATA MINING PROJECTS Mining weakly labeled web facial images f...
IEEE 2014 JAVA DATA MINING PROJECTS Mining weakly labeled web facial images f...IEEE 2014 JAVA DATA MINING PROJECTS Mining weakly labeled web facial images f...
IEEE 2014 JAVA DATA MINING PROJECTS Mining weakly labeled web facial images f...IEEEFINALYEARSTUDENTPROJECTS
 
View the Microsoft Word document.doc
View the Microsoft Word document.docView the Microsoft Word document.doc
View the Microsoft Word document.docbutest
 
View the Microsoft Word document.doc
View the Microsoft Word document.docView the Microsoft Word document.doc
View the Microsoft Word document.docbutest
 
View the Microsoft Word document.doc
View the Microsoft Word document.docView the Microsoft Word document.doc
View the Microsoft Word document.docbutest
 
Web Information Extraction for the Database Research Domain
Web Information Extraction for the Database Research DomainWeb Information Extraction for the Database Research Domain
Web Information Extraction for the Database Research DomainMichael Genkin
 
Web Information Extraction for the DB Research Domain
Web Information Extraction for the DB Research DomainWeb Information Extraction for the DB Research Domain
Web Information Extraction for the DB Research Domainliat_kakun
 
Memory efficient java tutorial practices and challenges
Memory efficient java tutorial practices and challengesMemory efficient java tutorial practices and challenges
Memory efficient java tutorial practices and challengesmustafa sarac
 
Presentation - Predicting Online Purchases Using Conversion Prediction Modeli...
Presentation - Predicting Online Purchases Using Conversion Prediction Modeli...Presentation - Predicting Online Purchases Using Conversion Prediction Modeli...
Presentation - Predicting Online Purchases Using Conversion Prediction Modeli...Christopher Sneed, MSDS, PMP, CSPO
 
NoSQL Introduction, Theory, Implementations
NoSQL Introduction, Theory, ImplementationsNoSQL Introduction, Theory, Implementations
NoSQL Introduction, Theory, ImplementationsFirat Atagun
 
International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)IJERD Editor
 
BayesiaLab_Book_V18 (1)
BayesiaLab_Book_V18 (1)BayesiaLab_Book_V18 (1)
BayesiaLab_Book_V18 (1)Bayesia USA
 
Creating Linked Data from Relational Databases
Creating Linked Data from Relational DatabasesCreating Linked Data from Relational Databases
Creating Linked Data from Relational DatabasesNikolaos Konstantinou
 
A schema generation approach for column oriented no sql data stores
A schema generation approach for column oriented no sql data storesA schema generation approach for column oriented no sql data stores
A schema generation approach for column oriented no sql data storesKIRAN V
 
COST-SENSITIVE TOPICAL DATA ACQUISITION FROM THE WEB
COST-SENSITIVE TOPICAL DATA ACQUISITION FROM THE WEBCOST-SENSITIVE TOPICAL DATA ACQUISITION FROM THE WEB
COST-SENSITIVE TOPICAL DATA ACQUISITION FROM THE WEBIJDKP
 
NS-CUK Seminar: H.B.Kim, Review on "metapath2vec: Scalable representation le...
NS-CUK Seminar: H.B.Kim,  Review on "metapath2vec: Scalable representation le...NS-CUK Seminar: H.B.Kim,  Review on "metapath2vec: Scalable representation le...
NS-CUK Seminar: H.B.Kim, Review on "metapath2vec: Scalable representation le...ssuser4b1f48
 

Similaire à Modified naive bayes model for improved web page classification (20)

A1303060109
A1303060109A1303060109
A1303060109
 
A1303060109
A1303060109A1303060109
A1303060109
 
Data Mining and the Web_Past_Present and Future
Data Mining and the Web_Past_Present and FutureData Mining and the Web_Past_Present and Future
Data Mining and the Web_Past_Present and Future
 
2014 IEEE JAVA DATA MINING PROJECT Mining weakly labeled web facial images fo...
2014 IEEE JAVA DATA MINING PROJECT Mining weakly labeled web facial images fo...2014 IEEE JAVA DATA MINING PROJECT Mining weakly labeled web facial images fo...
2014 IEEE JAVA DATA MINING PROJECT Mining weakly labeled web facial images fo...
 
2014 IEEE JAVA DATA MINING PROJECT Mining weakly labeled web facial images fo...
2014 IEEE JAVA DATA MINING PROJECT Mining weakly labeled web facial images fo...2014 IEEE JAVA DATA MINING PROJECT Mining weakly labeled web facial images fo...
2014 IEEE JAVA DATA MINING PROJECT Mining weakly labeled web facial images fo...
 
IEEE 2014 JAVA DATA MINING PROJECTS Mining weakly labeled web facial images f...
IEEE 2014 JAVA DATA MINING PROJECTS Mining weakly labeled web facial images f...IEEE 2014 JAVA DATA MINING PROJECTS Mining weakly labeled web facial images f...
IEEE 2014 JAVA DATA MINING PROJECTS Mining weakly labeled web facial images f...
 
View the Microsoft Word document.doc
View the Microsoft Word document.docView the Microsoft Word document.doc
View the Microsoft Word document.doc
 
View the Microsoft Word document.doc
View the Microsoft Word document.docView the Microsoft Word document.doc
View the Microsoft Word document.doc
 
View the Microsoft Word document.doc
View the Microsoft Word document.docView the Microsoft Word document.doc
View the Microsoft Word document.doc
 
Web Information Extraction for the Database Research Domain
Web Information Extraction for the Database Research DomainWeb Information Extraction for the Database Research Domain
Web Information Extraction for the Database Research Domain
 
Web Information Extraction for the DB Research Domain
Web Information Extraction for the DB Research DomainWeb Information Extraction for the DB Research Domain
Web Information Extraction for the DB Research Domain
 
Memory efficient java tutorial practices and challenges
Memory efficient java tutorial practices and challengesMemory efficient java tutorial practices and challenges
Memory efficient java tutorial practices and challenges
 
Presentation - Predicting Online Purchases Using Conversion Prediction Modeli...
Presentation - Predicting Online Purchases Using Conversion Prediction Modeli...Presentation - Predicting Online Purchases Using Conversion Prediction Modeli...
Presentation - Predicting Online Purchases Using Conversion Prediction Modeli...
 
NoSQL Introduction, Theory, Implementations
NoSQL Introduction, Theory, ImplementationsNoSQL Introduction, Theory, Implementations
NoSQL Introduction, Theory, Implementations
 
International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)
 
BayesiaLab_Book_V18 (1)
BayesiaLab_Book_V18 (1)BayesiaLab_Book_V18 (1)
BayesiaLab_Book_V18 (1)
 
Creating Linked Data from Relational Databases
Creating Linked Data from Relational DatabasesCreating Linked Data from Relational Databases
Creating Linked Data from Relational Databases
 
A schema generation approach for column oriented no sql data stores
A schema generation approach for column oriented no sql data storesA schema generation approach for column oriented no sql data stores
A schema generation approach for column oriented no sql data stores
 
COST-SENSITIVE TOPICAL DATA ACQUISITION FROM THE WEB
COST-SENSITIVE TOPICAL DATA ACQUISITION FROM THE WEBCOST-SENSITIVE TOPICAL DATA ACQUISITION FROM THE WEB
COST-SENSITIVE TOPICAL DATA ACQUISITION FROM THE WEB
 
NS-CUK Seminar: H.B.Kim, Review on "metapath2vec: Scalable representation le...
NS-CUK Seminar: H.B.Kim,  Review on "metapath2vec: Scalable representation le...NS-CUK Seminar: H.B.Kim,  Review on "metapath2vec: Scalable representation le...
NS-CUK Seminar: H.B.Kim, Review on "metapath2vec: Scalable representation le...
 

Dernier

WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
What is Artificial Intelligence?????????
What is Artificial Intelligence?????????What is Artificial Intelligence?????????
What is Artificial Intelligence?????????blackmambaettijean
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 

Dernier (20)

WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
What is Artificial Intelligence?????????
What is Artificial Intelligence?????????What is Artificial Intelligence?????????
What is Artificial Intelligence?????????
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 

Modified naive bayes model for improved web page classification

  • 1. Modified Naive Bayes Model for Improved Web Page Classification
  • 2. Overview • Abstract • Objective • Related Works • Traditional Naive Bayes Model • Proposed Modification • Classification Algorithm • Experimental Setup • Experimental Results • Conclusions • References
  • 3. Abstract • World Wide Web is a large repository of information and it keeps on growing exponentially. The fact is that most of the information stored in it is in either unstructured or semi structured way. So it is not that much easy to get desired piece of information out of that large collection of unprocessed data. Several strategies has been employed to mine the data in World Wide Web for finding interesting information and to organize them in a meaningful way. Web data mining is a field of active research interest and computer scientists across the world is looking into development and improvement of web mining strategies. In this paper we will present a modified probabilistic model based on naive bayes theorem, which will help to classify webpages based on their textual content, in a better way than traditional naive bayes model.
  • 4. ObjectiveObjective • Explore traditional naive bayes textExplore traditional naive bayes text classification model, and optimize it forclassification model, and optimize it for improved performance in terms of costimproved performance in terms of cost and time.and time. • Use above algorithm for automatedUse above algorithm for automated classification of web pages based on theirclassification of web pages based on their textual content.textual content.
  • 5. Previous WorksPrevious Works • Following are some of the methods presented for text classification – Decision tree method[5] – Rule based classification method[6] – SVM(Space Vector Model)[7][8][9] – Neural network classifiers[10][11][12] – Bayesian Classifiers[13][14] – K-nearest neighbour approach[16] • Most of these text classification methods were further extended for web page classification by taking textual content or hierarchy of web page into consideration[17]
  • 6. Traditional Naive Bayes ModelTraditional Naive Bayes Model • Bayesian classifiers are statistical classifiers which predict the class membership probabilities of tuples, which means probability of tuples to be in a particular class. •Naive Bayes classifiers are Bayesian Classifiers that assume 'class conditional independence'. •According to class conditional independece, 'Effect of value an attribute in a class is independent of values of other attributes'.
  • 7.
  • 8.
  • 9.
  • 10. Classification Algorithm INPUT freq-> word frequency list of test web page. Database-> 2d hash table containing list of word frequencies in each category categories->List of available categories for classification OUTPUT category->category of given web page function probability_model(freq_list,database,categories): v=total_number_words_in_database pc={} // A hash function for storing probability of each category for each category in categories: attributes=database[category] n=total_number_of_words_in_category pc[category]=0 for each word in freq_list: if word not in attributes: pc[category]=pc[category]+(1.0/(n+v)) else: pc[category]=pc[category] + ((1+attributes[word])/(n+v)) Category=category for which pc[category] is maximum return Category
  • 11. Experimental Setup • Dataset – Standard 20 newsgroup data set was used for testing purpose. – The 20 newsgroup dataset contains 19997 documents classified into 20 different groups uniformly. Some of the newsgroups are very closely related to each other while others are highly unrelated • Testing Methods – Random Sub Set Sampling – K-Fold cross validation
  • 12. Experimental Results Number of training documents Accuracy With Traditional Model Accuracy with proposed Model 100 68.63 68.96 200 74.52 75.04 300 77.78 78.66 400 79.10 80.28 500 80.23 81.46 600 81.61 82.62 700 82.25 83.41 800 83.06 83.93 900 82.91 84.36 Random Subset Sampling : Traditional Naive Bayes Model V/S Modified Model
  • 13. Experimental Results(contd.) Random Subset Sampling : Traditional Naive Bayes Model V/S Modified Model
  • 14. Experimental Results(contd.) Number of folds Accuracy with traditional Model Accuracy with Proposed Model 2 73.26 74.67 3 76.51 77.75 4 77.29 78.74 5 77.66 79.24 6 78.22 79.66 7 78.70 80.05 8 79.07 80.49 9 79.51 80.93 10 79.70 81.12 K-Fold Cross Validation : Traditional Naive Bayes Model V/S Modified Model
  • 15. Experimental Results(contd.) K-Fold Cross Validation : Traditional Naive Bayes Model V/S Modified Model
  • 16. Conclusions • proposed modified naive bayes classifier is better than traditional model because of following reasons. – Experimental results shows that it provide more accuracy than traditional model. – Long multiplication operations an be replaced by less expensive addition operations – No need use Laplacian correction, since zero probability of individual terms will not lead to zero probability of whole category – Floating point underflow problem can be avoided which may arise due to continuous multiplication of small numbers.
  • 17. References 1. http://www.sciencedaily.com/releases/2013/05/130522085217.htm 2. http://www.worldwidewebsize.com/ 3. http://www.thecultureist.com/2013/05/09/how-many-people-use-the-internet-more-than-2-billion- infographic/ 4. Automated Classification of Web Sites using Naive Bayesian Algorithm. Ajay S. Patil and B.V Pawar IMECS 2012. 5. A decision-tree-based symbolic rule induction system for text categorization. DE. Johnson, FJ Oles, T Zhang, T Goets 2002 6. Automated learning of decision rules for text categorization. Chidanand and Fred Damerau ACM 1994 7. A statistical learning model of Text classification for Support Vector Machines. Thorsten Joachims ACM 2001 8. Text categorization with support vector machines: Learning with many relevant features. Thorsten Joachims 9. Support vector machines for text categorization. A. Basu, C. Watters, and M. Shepherd IEEE 2002 10. Automated text classification using a dynamic artificial neural network model M. Ghiassi,M. Olschimke, B.Moon,P. Arnaudo Elsevier 2012 11. Study a text classification method based on neural network model Jian Chen,Hailan Pan, Qinyum Ao Springer 2012 12. Autmatic text classification using artificial neural network. Springer volume 172 2005 13. Naive Bayes text classificaton - http://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text- classification-1.html 14. Naive Bayesian text classification - John Graham-cuming 2005 15. A survey of text classification algorithms- Charu C Aggarwal, ChengXiang Zhai 16. improved K-nearest neighbour algorithm for text categorization. Li Baoli, Yu Shiven, Lu Qin ICCPOL 2003
  • 18. References[contd.] 17. Recent research in web page classification-A review . Alamelu Mangai, Santhosh Kumar, Sugumaran IJCET 2010 18. Data Mining: Practical machine learning tools and techniques, Ian H. Witten and Eibe Frank 2nd Edition, Morgan Kaufmann, San Francisco, 2005. 19. Fast categorizations of large document collections Shanks, V. and H. E. Williams. SPIRE 2001 20. Simple and accurate feature selection for hierarchical categorization. Wibowo, W. and H. E.Williams. ACM 2002 21. Computational Approaches to Analyzing Weblogs- Mihalcea, R. and H. Liu. 22. Graph-based text classification: Learn from your neighbors Angelova, R. and G. Weikum SIGIR 2006 23. SM.F. Porter, “An algorithm for suffix stripping”, Program, Vo.14, no. 3, pp. 130-137, Jul. 1980. 24. Python beautiful soup library.http://www.crummy.com/software/BeautifulSoup/ 25.http://tartarus.org/martin/PorterStemmer/index.html Porter Stemming algorithm, with various implementations. 26. http://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html Stanford NLP , naïve bayes textclassification 27. The BOW or libbow C Library [Online] Available: http://www.cs.cmu.edu/~mccallum/bow/ 28. Bird, Steven, Edward Loper and Ewan Klein (2009). Natural Language Processing with Python. O’Reilly Media Inc. Python NLTK 29. www.matplotlib.org Python based open source mathematical analysis toolkit. 30. Data Mining concepts and techniques Han Kamber Lee Morgan Kaufman publications. 3 rd Edition 2012. 31. DMOZ open directory project. [Online]. Available: http://dmoz.org/ 32. Home Page for 20 newsgroup dataset http://qwone.com/~jason/20Newsgroups/