Category & Training Texts Selection for
Scientific Article Categorization in
an Expert Search System
By
Gan Keng Hoon*, Chua San Thai,
Khoh Zhuo Yan, Goh Kau Yang
School of Computer Sciences,
Universiti Sains Malaysia
Motivation
Scientific articles are produced as results of research.
Organizing scientific articles into subject areas or topics
helps in discovery, navigation, etc.
Microsoft Academic
Google Scholar
Takahiro Komamizu, Toshiyuki Amagasa, Hiroyuki Kitagawa
(2015), "Facet-value extraction scheme from textual contents in XML
data".
Scope
Application oriented research
Expert Search System
DBLP Dataset
School of Computer Sciences, USM
Goal
Improving the categorization of scientific articles
For
Capturing experts' expertise based on their publications.
Enabling category filtering during search.
Existing Approaches
Labelled Scientific Article
Supervised Learning method to train and test
Feature Selection
Bag of Words, N-gram, POS, Term Frequency, TF-IDF
This research
Train with Labelled Scientific Related Domain Texts
Test with Scientific Article
Research Justification
Avoid the use of a large number of labelled training texts.
Focus on distinguishing good sources of training texts.
Use a reasonably small number of training texts to build
a subject category model.
Process of category model construction for the
scientific article domain.
Feature Selection
Feature Term Generation
The N-gram technique is used to generate candidate terms from the training text. E.g.
D = “Search engine is an artificial intelligence system.”
2-gram word: Array ([0] => Search engine [1] => engine is [2] => is an [3] => an artificial [4] =>
artificial intelligence [5] => intelligence system)
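The 2-gram example above can be reproduced with a short sketch. The function name `word_ngrams` and the punctuation stripping are assumptions made for illustration; the slides do not say how the text was tokenized.

```python
def word_ngrams(text, n=2):
    """Return the n-word sequences found in `text`, in reading order."""
    # Split on whitespace and strip surrounding punctuation from each token
    # (an assumed tokenization; the slides do not specify one).
    tokens = [tok.strip('.,;:!?"“”') for tok in text.split()]
    tokens = [tok for tok in tokens if tok]
    # Slide a window of n tokens across the token list.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


D = "Search engine is an artificial intelligence system."
bigrams = word_ngrams(D, 2)
for index, gram in enumerate(bigrams):
    print(f"[{index}] => {gram}")
```

Changing `n` gives longer candidates, for example `word_ngrams(D, 3)` for 3-gram terms.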
Feature Selection by TF-IDF
Term Frequency Inverse Document Frequency (TF-IDF) is a common method for keyword
weighting: the TF-IDF value of each term is computed, and the terms with the top N TF-IDF
values are selected as features. This method penalizes a term that occurs in many different
training texts. The TF-IDF values are computed as
\[ \text{TF-IDF}_{Term_i} = \text{TF}_{Term_i} \times \log \frac{N_D}{\text{DF}_{Term_i}} \]
where $\text{DF}_{Term_i}$ is the number of documents containing the term $Term_i$, and $N_D$ is the
total number of documents.
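As a rough illustration of the weighting and top-N selection described above, the sketch below scores terms per training text. Several details are assumptions, since the slides do not state them: TF is taken as the raw count, the logarithm is natural, and the names `tf_idf_scores` and `top_n_features` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf_scores(documents):
    """documents: list of term lists. Returns one {term: TF-IDF} dict per document."""
    n_docs = len(documents)
    # DF: number of documents that contain each term at least once.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    scores = []
    for doc in documents:
        tf = Counter(doc)  # raw count used as TF (assumption)
        scores.append(
            {term: tf[term] * math.log(n_docs / doc_freq[term]) for term in tf}
        )
    return scores


def top_n_features(score, n):
    """Pick the n highest-weighted terms; ties are broken alphabetically."""
    ranked = sorted(score.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:n]]


# Toy training texts, already converted to 2-gram terms.
docs = [
    ["search engine", "search engine", "artificial intelligence"],
    ["artificial intelligence", "neural network"],
    ["database system", "artificial intelligence"],
]
scores = tf_idf_scores(docs)
print(top_n_features(scores[0], 1))
```

In this toy data "artificial intelligence" occurs in every text, so its weight is zero, which is the penalty for terms spread across training texts.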
Transfer Training Approach
Intuition
If the training texts are representative enough to cover the concept of a
category, then the training set can be obtained from any source that shares
similar concepts or semantics.
Criteria
The two text sources share the same or partially similar categories.
The categories must bear the same concept or meaning.
The training source must be comprehensive enough to cover a category’s concept.
The training source must be available, while the testing source need not be.
This approach is particularly useful when the resources of unseen texts are not
readily available.
Training and Testing Category Model
The training of the category model, CM, can be defined using the $CM_{Build}$ function. For each category,
$Cat$, the function takes in a set of documents, $D_{Cat}$, i.e. the training texts, and maps them to a set of
features, $F_{Cat}$.
\[ CM_{Build}: D_{Cat} \rightarrow F_{Cat} \]
The testing of the category model is defined using the $CM_{Sim}$ function. For each new document, $D_{new}$,
the function maps the document to a set of most relevant categories, $Cat$.
\[ CM_{Sim}: D_{new} \rightarrow Cat \]
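The training and testing functions described above can be pictured as two plain mappings. This is only a sketch of their shape: the names `cm_build` and `cm_sim` are hypothetical, and raw word counts and shared-term overlap stand in here for the TF-IDF feature selection and the similarity scoring used in the slides.

```python
from collections import Counter


def cm_build(training_texts, top_n=5):
    """Training: map each category's documents to a set of feature terms."""
    model = {}
    for category, docs in training_texts.items():
        counts = Counter(word for doc in docs for word in doc.lower().split())
        # Stand-in feature selection: keep the top_n most frequent words.
        model[category] = {term for term, _ in counts.most_common(top_n)}
    return model


def cm_sim(model, new_document, top_k=1):
    """Testing: map a new document to its top_k most relevant categories."""
    words = set(new_document.lower().split())
    # Stand-in scoring: number of feature terms shared with the document.
    overlap = {cat: len(features & words) for cat, features in model.items()}
    ranked = sorted(overlap, key=lambda cat: (-overlap[cat], cat))
    return ranked[:top_k]


# Hypothetical training texts for two categories.
training_texts = {
    "Information Retrieval": ["search engine ranking", "query search index"],
    "Machine Learning": ["neural network training", "network learning model"],
}
model = cm_build(training_texts)
print(cm_sim(model, "a search engine for experts"))
```

Keeping the two steps separate means the model can be built once from one text source and then applied to documents from another.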
Feature Similarity Scoring
The scoring technique is based on the Vector Space Model cosine similarity measure. The feature
set of a category model is viewed as a vector in a vector space, in which each term has its
own axis. The similarity of a category and a document, $Sim_F$, can be calculated by comparing the
deviation angle between the vectors as follows.
\[ Sim_F = \frac{F_{Cat} \cdot F_{D_{new}}}{\lVert F_{Cat} \rVert \, \lVert F_{D_{new}} \rVert} \]
where $F_{Cat}$ is the feature vector of a category and $F_{D_{new}}$
is the feature vector of a new document.
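The cosine similarity score described above can be sketched for sparse feature vectors stored as `{term: weight}` dictionaries. The dictionary representation, the function name `cosine_similarity`, the example weights, and the choice to return 0.0 for an empty vector are assumptions for illustration.

```python
import math


def cosine_similarity(f_cat, f_dnew):
    """Cosine of the angle between two sparse vectors given as {term: weight}."""
    # Dot product over shared terms; a term missing from f_dnew contributes 0.
    dot = sum(weight * f_dnew.get(term, 0.0) for term, weight in f_cat.items())
    norm_cat = math.sqrt(sum(w * w for w in f_cat.values()))
    norm_new = math.sqrt(sum(w * w for w in f_dnew.values()))
    if norm_cat == 0.0 or norm_new == 0.0:
        return 0.0  # convention chosen here for empty vectors (assumption)
    return dot / (norm_cat * norm_new)


# Hypothetical feature weights for one category and one new document.
f_cat = {"search engine": 3.0, "query": 4.0}
f_dnew = {"search engine": 1.0}
print(cosine_similarity(f_cat, f_dnew))
```

A document is then assigned to the categories whose feature vectors give the highest scores.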
Evaluation Settings
Performance Metric
Whether a scientific article is correctly assigned to a category.
Correctness is evaluated by expert judgement.
Training Texts
Title and Abstract are used.
Tasks
Common categories (30 general) vs. common + specific categories (30
general + 12 domain-specific)
Automated vs. manual selection of training texts
Evaluation Results
            Common categories        Common and specific       Common and specific
            + Automated              categories + Automated    categories + Manual
            training texts (%)       training texts (%)        training texts (%)
Expert 1    62.50                    68.75                     81.25
Expert 2    46.67                    46.67                     53.33
Expert 3    33.33                    33.33                     66.67
Expert 4    33.33                    41.67                     41.67
Expert 5    43.75                    37.50                     28.13
(Average)   (43.92)                  (45.59)                   (54.21)
Conclusion
Possibility
A category model can be trained using training texts from one source and
applied to a different source.
Challenge
Selection of training texts, as they can influence the accuracy of the trained
model.
Limitation
Selection of categories, whereby the selected set is too small to cover the
research areas of the domain (e.g. Computer Science).
Thank You
For more of our work, please visit ir.cs.usm.my
Email me at khgan@usm.my

 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 

Category & Training Texts Selection for Scientific Article Categorization in an Expert Search System

  • 1. Category & Training Texts Selection for Scientific Article Categorization in an Expert Search System By Gan Keng Hoon*, Chua San Thai, Khoh Zhuo Yan, Goh Kau Yang School of Computer Sciences, Universiti Sains Malaysia
  • 2. Motivation: Scientific articles are produced as results of research. Organizing scientific articles into subject areas or topics helps in discovery, navigation, etc.
  • 5. Motivation: Takahiro Komamizu, Toshiyuki Amagasa, Hiroyuki Kitagawa (2015), "Facet-value extraction scheme from textual contents in XML data".
  • 6. Scope: Application-oriented research; Expert Search System; DBLP dataset; School of Computer Sciences, USM. Goal: Improving the categorization of scientific articles, for (1) capturing experts' expertise based on their publications and (2) enabling category filtering during search.
  • 7. Existing Approaches: Labelled scientific articles; supervised learning method to train and test; feature selection (Bag of Words, N-gram, POS, Term Frequency, TF-IDF). This Research: Train with labelled scientific-related domain texts; test with scientific articles.
  • 8. Research Justification: Avoid the use of a large number of labelled training texts. Focus on differentiating good sources of training texts. Use a reasonably small number of training texts to build the subject category model.
  • 9. Process of category model construction on scientific article domain.
  • 10. Feature Selection. Feature Term Generation: The n-gram technique is used to generate candidate terms from the training text. E.g. D = "Search engine is an artificial intelligence system." 2-gram words: Array ([0] => Search engine [1] => engine is [2] => is an [3] => an artificial [4] => artificial intelligence [5] => intelligence system). Feature Selection by TF-IDF: Term Frequency-Inverse Document Frequency (TF-IDF) is a common method for keyword weighting: the TF-IDF value of each term is computed, and the terms with the top N TF-IDF values are selected as features. This method penalizes a term when it occurs across many different training texts. The TF-IDF values are computed as $\text{TF-IDF}_{Term_i} = TF_{Term_i} \times \log \frac{N_D}{DF_{Term_i}}$, where $DF_{Term_i}$ is the number of documents containing the term $Term_i$ and $N_D$ is the total number of documents.
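The term generation and TF-IDF selection described on slide 10 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `word_ngrams` and `top_tfidf_features`, the whitespace tokenizer, the use of raw counts as term frequency, and the natural logarithm are all assumptions, since the slides do not specify them.

```python
import math
from collections import Counter


def word_ngrams(text, n=2):
    """Return word n-grams of `text` (assumed: whitespace tokens, edge punctuation stripped)."""
    tokens = [t.strip(".,;:!?\"'") for t in text.split()]
    tokens = [t for t in tokens if t]
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def top_tfidf_features(documents, n=2, top_n=5):
    """Score each n-gram as TF * log(N_D / DF) and return the top_n (term, score) pairs."""
    doc_terms = [word_ngrams(doc, n) for doc in documents]
    n_docs = len(documents)
    # TF: total occurrences across the training texts (an assumption).
    tf = Counter(term for terms in doc_terms for term in terms)
    # DF: number of training texts that contain the term at least once.
    df = Counter(term for terms in doc_terms for term in set(terms))
    scores = {term: tf[term] * math.log(n_docs / df[term]) for term in tf}
    # Highest score first; ties broken alphabetically for a stable order.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_n]
```

Under these assumptions, a term that appears in every training text scores zero, because log(N_D / DF) is log(1). This is the penalty for terms spread across many texts that the slide mentions.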
  • 11. Transfer Training Approach. Intuition: If the training texts are representative enough to cover the concept of a category, then these training sets can be obtained from any source that shares similar concepts or semantics. Criteria: (1) The two text sources share the same or partially similar categories. (2) The categories must bear the same concept or meaning. (3) The training source must be comprehensive enough to cover a category's concept. (4) The training source must be available, even if the testing source is not. This approach is particularly useful when the resources of unseen texts are not readily available.
  • 12. Training and Testing Category Model. The training of a category model, CM, is defined by the function $CM_{Build}$. For each category, $Cat$, the function takes a set of documents, $D_{Cat}$, i.e. the training texts, and maps them to a set of features, $F_{Cat}$: $CM_{Build}: D_{Cat} \rightarrow F_{Cat}$. The testing of a category model is defined by the function $CM_{Sim}$. For each new document, $D_{new}$, the function maps the document to a set of most relevant categories, $Cat$: $CM_{Sim}: D_{new} \rightarrow Cat$. Feature Similarity Scoring: The scoring technique is based on the Vector Space Model cosine similarity measure. The feature sets of the category models are viewed as a set of vectors in a vector space, in which each term has its own axis. The similarity of a category and a document, $Sim_F$, is calculated from the angle between their vectors: $Sim_F = \frac{F_{Cat} \cdot F_{D_{new}}}{\lVert F_{Cat} \rVert \, \lVert F_{D_{new}} \rVert}$, where $F_{Cat}$ is the feature vector of a category and $F_{D_{new}}$ is the feature vector of a new document.
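The cosine similarity scoring on slide 12 can be sketched as below. This is an illustrative reading rather than the authors' code: feature vectors are assumed to be sparse dictionaries mapping a term to its weight, and the names `cosine_similarity` and `rank_categories`, the `top_k` cutoff, and the handling of empty vectors are hypothetical choices.

```python
import math


def cosine_similarity(features_a, features_b):
    """Cosine of the angle between two sparse feature vectors (dicts of term -> weight)."""
    dot = sum(w * features_b.get(term, 0.0) for term, w in features_a.items())
    norm_a = math.sqrt(sum(w * w for w in features_a.values()))
    norm_b = math.sqrt(sum(w * w for w in features_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumed convention: an empty vector is similar to nothing
    return dot / (norm_a * norm_b)


def rank_categories(category_models, doc_features, top_k=3):
    """Map a new document's features to its top_k most similar categories."""
    scored = [
        (category, cosine_similarity(features, doc_features))
        for category, features in category_models.items()
    ]
    # Highest similarity first; ties broken by category name.
    scored.sort(key=lambda kv: (-kv[1], kv[0]))
    return scored[:top_k]
```

Here `category_models` plays the role of the per-category feature sets produced in training, and `rank_categories` plays the role of the testing function that maps a new document to its most relevant categories.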
  • 13. Evaluation Settings. Performance Metric: Whether a scientific article is correctly assigned to a category or not, evaluated by expert judgement. Training Texts: Title and abstract are used. Tasks: (1) Common categories (30 general categories) vs. common + specific categories (30 general + 12 domain-specific categories); (2) automated vs. manual selection of training texts.
  • 14. Evaluation Results
      Columns (all values in %): (A) Common categories + Automated training texts; (B) Common and specific categories + Automated training texts; (C) Common and specific categories + Manual training texts.
      Expert       (A)       (B)       (C)
      Expert 1     62.50     68.75     81.25
      Expert 2     46.67     46.67     53.33
      Expert 3     33.33     33.33     66.67
      Expert 4     33.33     41.67     41.67
      Expert 5     43.75     37.50     28.13
      (Average)    (43.92)   (45.59)   (54.21)
  • 15. Conclusion. Possibility: A category model can be trained using training texts from one source and applied to a different source. Challenge: Selection of training texts, as they can influence the accuracy of the trained model. Limitation: Selection of categories, whereby the selected set is too small to cover the research areas of the domain (e.g. Computer Science).
  • 16. Thank You. For more of our work, please visit ir.cs.usm.my. Email me at khgan@usm.my.