Pradnya Randive et al., Int. Journal of Engineering Research and Applications
ISSN: 2248-9622, Vol. 3, Issue 5, Sep-Oct 2013, pp. 1701-1704

RESEARCH ARTICLE                                                  OPEN ACCESS

Improving Text Clustering Quality by Concept Mining
Pradnya Randive*, Nitin Pise**
*(Department of Computer Engineering, Pune University, India)
** (Department of Computer Engineering, Pune University, India)

ABSTRACT
In text mining, most techniques depend on statistical analysis of terms. Statistical analysis traces important terms within a document only. In contrast, the concept-based mining model analyzes terms at the sentence, document, and corpus levels. The model consists of sentence-based concept analysis, document-based and corpus-based concept analysis, and a concept-based similarity measure. Experimental results show that text clustering quality is enhanced by using the sentence, document, corpus, and combined approaches to concept analysis.
Keywords - Text Mining, Concept-Based Mining, Text Clustering.

I. INTRODUCTION

Text mining is mostly used to discover previously unknown information from text by applying techniques from natural language processing and data mining.
To discover the importance of a term in a document, the term frequency of the term is usually calculated. However, two terms can have the same frequency in a document while one of them contributes more to the meaning than the other; the concept-based mining model is intended to address this. In the proposed model, three measures are evaluated for analyzing concepts at the sentence, document, and corpus levels. A semantic role labeler assigns semantic roles to the terms of each sentence. A term that takes a semantic role in a sentence is known as a concept, and a concept may be either a word or a phrase, depending on the semantic structure of the sentence. When a new document is put into the system, the proposed model discovers concept matches by scanning the new documents and extracting the matching concepts. The similarity measure used for concept analysis at the sentence, document, and corpus levels outperforms similarity measures that depend only on term analysis of the document. The results are evaluated using the F-measure and Entropy. We apply this model to web documents.

II. PREVIOUS WORK

Traditional text mining is based on statistical analysis of terms, and term-frequency-based statistical analysis discovers the important terms within a document only. The concept-based model of Shehata et al. efficiently finds significant concepts within a document based on a new concept-based similarity measure, which allows concept matching and similarity calculations between documents in a robust and accurate way. It improves text clustering quality and surpasses the traditional single-term approach. That work can be extended to improve the accuracy of the document similarity measure, and another way of extending it is web document clustering [2]. Data mining provides methods for handling huge databases under limited resources such as memory and computation time. The bEMADS and gEMADS algorithms are based on the Gaussian mixture model: both summarize the data into subclusters and then generate the Gaussian mixture from them. These two algorithms run several orders of magnitude faster than the standard expectation-maximization algorithm, with little loss of accuracy [3]. Nock and Nielsen study clustering with weights on the instance points, which makes the clustering problem harder, and provide a formalization of how the weights should be modified based on local variations; to get the best results, weighted versions of clustering algorithms such as k-means, fuzzy c-means, Expectation Maximization (EM), and k-harmonic means are considered [4].
Natural language processing plays a role in tasks such as information extraction, question answering, and summarization. Support vector machines improve semantic argument classification performance over earlier classifiers. The task is reformulated as a combined chunking and classification problem for cases where a statistical syntactic parser may not be available. For the classification task, the head word and the predicate were the most salient features, but they may be difficult to estimate because of data sparsity. Regarding the portability of the system to new genres of data, a significant drop in performance was found for a text source different from the one the system was trained on [5].

III. SYSTEM ARCHITECTURE

The techniques described in the previous work are used for document clustering, but only for documents already present on the system. In the proposed system we use web documents and obtain clustered output, as shown in Fig. 1. The concept-based mining model consists of sentence-based concept analysis, document-based concept analysis, corpus-based concept analysis, and a concept-based similarity measure, as depicted in Fig. 1. A web document is given as the input to the proposed model. Each document has well-defined sentence boundaries, and each sentence in the document is labeled automatically based on the PropBank notations. After running the semantic role labeler, the labeled verb argument structures, which are the output of the role labeling task, are captured and analyzed by the concept-based mining model at the sentence, document, and corpus levels. In the concept-based mining model, a labeled term, either a word or a phrase, is considered a concept.


Fig 1. Concept-Based Mining in Text Clustering
The proposed model contains the following modules:
A. Web Document
A web document is given as input to the system. The user can issue any query to the browser. Clean HTML pages are selected by removing extra scripting: web pages contain data such as hyperlinks, images, and scripts, so it is necessary to remove such unwanted content when a page is selected for processing. The HTML code is then transformed into XML code. The next process performed on that document is text pre-processing (data processing).
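The paper does not show code for this cleanup step. A minimal sketch, assuming Python with BeautifulSoup (the actual tools and the HTML-to-XML conversion are not specified in the paper), could look like this:

```python
# Minimal sketch (assumption: Python + BeautifulSoup; the paper does not
# name the tools used for HTML cleanup or for the HTML-to-XML conversion).
from bs4 import BeautifulSoup

def clean_html(html: str) -> str:
    """Strip scripts, styles, and markup, keeping only the page text."""
    soup = BeautifulSoup(html, "html.parser")
    # Remove scripting and styling elements, which carry no document content.
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()
    # Collapse the remaining markup into plain text.
    return " ".join(soup.get_text(separator=" ").split())

# Example: clean_html("<html><body><script>x()</script><p>Hello</p></body></html>")
# returns "Hello".
```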
B. Data Processing
The first step is to separate the sentences of the documents. The terms are then labeled with the help of the PropBank notation. Stop words are removed and the remaining terms are stemmed with the help of the Porter algorithm.
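A minimal sketch of this preprocessing step, assuming Python with NLTK for sentence splitting, stop-word removal, and Porter stemming (the paper names only the Porter algorithm; the rest of the tooling is an assumption), might look like this. The PropBank labeling step is omitted because it requires a trained semantic role labeler.

```python
# Minimal sketch, assuming NLTK (requires nltk.download("punkt") and
# nltk.download("stopwords")). Semantic role labeling is not shown here.
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

_stemmer = PorterStemmer()
_stop = set(stopwords.words("english"))

def preprocess(document: str) -> list[list[str]]:
    """Split a document into sentences; return stemmed, stop-word-free tokens."""
    processed = []
    for sent in nltk.sent_tokenize(document):
        tokens = [w.lower() for w in nltk.word_tokenize(sent) if w.isalpha()]
        processed.append([_stemmer.stem(w) for w in tokens if w not in _stop])
    return processed
```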
C. Concept-Based Analysis
This is the central module of the proposed system. Here the frequencies of the terms are calculated: the conceptual term frequency (ctf), the term frequency (tf), and the document frequency (df).
1) Sentence-Based Concept Analysis
To analyze every concept at the sentence level, a concept-based frequency measure called the conceptual term frequency (ctf) is used.
a) Calculating ctf in sentence s
The ctf is the number of occurrences of concept c in the verb argument structures of sentence s. If concept c appears frequently in the verb argument structures of sentence s, then it plays a principal role in s.
b) Calculating ctf in document d
A concept c can have many ctf values in different sentences of the same document d. Thus, the ctf value of concept c in document d is taken as the average of its sentence-level values:

ctf(c, d) = (ctf1 + ctf2 + ... + ctfsn) / sn

where sn is the total number of sentences that contain concept c in document d.
2) Document-Based Concept Analysis
To analyze concepts at the document level, the term frequency tf of each concept in the original document is calculated. The tf is a local measure on the document level.

3) Corpus-Based Concept Analysis
To analyze concepts at the corpus level, the document frequency df is used. The document frequency df is a global measure: the number of documents that contain a concept. With the help of the concept-based analysis algorithm, ctf, tf, and df are calculated; a sketch of these computations is given below.
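As a rough illustration of these three measures, the following Python sketch computes ctf, tf, and df for concepts that have already been extracted per sentence. The ctf here simply counts occurrences per sentence, which approximates counting inside the labeled verb argument structures; the function and variable names are illustrative, not taken from the paper.

```python
from collections import Counter

def concept_frequencies(doc_sentences, concepts):
    """Sketch of document-level ctf and tf for one document.

    doc_sentences: list of sentences, each a list of concept strings
                   (the output of the labeling/matching step).
    concepts:      iterable of concepts to measure.
    """
    ctf, tf = {}, {}
    for c in concepts:
        # Sentence-level ctf values: occurrences of c in each sentence that
        # contains it (approximates counting inside verb argument structures).
        per_sentence = [sent.count(c) for sent in doc_sentences if c in sent]
        sn = len(per_sentence)
        # Document-level ctf: average of the sentence-level values.
        ctf[c] = sum(per_sentence) / sn if sn else 0.0
        # Term frequency: total occurrences of c in the whole document.
        tf[c] = sum(sent.count(c) for sent in doc_sentences)
    return ctf, tf

def document_frequency(corpus, concepts):
    """df: number of documents in the corpus that contain each concept."""
    wanted = set(concepts)
    df = Counter()
    for doc_sentences in corpus:
        seen = {c for sent in doc_sentences for c in sent if c in wanted}
        df.update(seen)
    return {c: df[c] for c in wanted}
```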
D. Similarity Approach
This module mainly contains three parts: concept-based similarity, Singular Value Decomposition, and combined similarity. Here we obtain the percentage of concepts that match the given web document.
E. Concept-Based Similarity
The concept-based similarity measure depends on matching concepts at the sentence, document, and corpus levels instead of on individual terms. The measure is based on three main aspects. First, the analyzed labeled terms capture the semantic structure of each sentence. Second, the concept frequencies measure the participation of a concept in a sentence as well as in a document. Third, the number of documents in which a concept appears is measured. The concept-based similarity between two documents is then calculated by combining, for each matching concept, its sentence-level ctf weight, its document-level tf weight, and its corpus-level df weight, following the concept-based model of [1], [2].
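The paper's exact similarity and weighted term-frequency formulas are not reproduced here (they appear as figures in the original). As an illustration only, a cosine-style similarity over per-concept weights built from ctf, tf, and df, which simplifies the measure of [1] rather than reproducing it, could be sketched as follows.

```python
import math

def concept_weight(ctf, tf, df, n_docs):
    """Illustrative weight combining the three measures (assumption: a simple
    product with an idf-style factor; not the paper's exact formula)."""
    return ctf * tf * math.log((n_docs + 1) / (df + 1))

def concept_similarity(weights_a, weights_b):
    """Cosine similarity between two documents' concept-weight dictionaries."""
    shared = set(weights_a) & set(weights_b)
    dot = sum(weights_a[c] * weights_b[c] for c in shared)
    norm_a = math.sqrt(sum(w * w for w in weights_a.values()))
    norm_b = math.sqrt(sum(w * w for w in weights_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```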

F. Clustering Techniques
This module uses three basic techniques: single pass, Hierarchical Agglomerative Clustering (HAC), and K-Nearest Neighbor (KNN). With the help of these techniques we determine which cluster has the highest priority; a sketch of the single-pass technique is given below.
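Of the three techniques, single-pass clustering is the simplest to sketch. The version below, with illustrative names and an assumed similarity threshold, assigns each document to the most similar existing cluster or starts a new one; HAC and KNN follow their standard formulations and are not shown.

```python
def single_pass_cluster(docs, similarity, threshold=0.3):
    """Single-pass clustering sketch: assign each document to the most similar
    existing cluster if the similarity exceeds the threshold, otherwise start
    a new cluster.  `similarity(a, b)` returns a score in [0, 1]."""
    clusters = []  # each cluster is a list of documents
    for doc in docs:
        best_idx, best_sim = None, 0.0
        for idx, cluster in enumerate(clusters):
            # Compare against a cluster by the average similarity to its members.
            sim = sum(similarity(doc, member) for member in cluster) / len(cluster)
            if sim > best_sim:
                best_idx, best_sim = idx, sim
        if best_idx is not None and best_sim >= threshold:
            clusters[best_idx].append(doc)
        else:
            clusters.append([doc])
    return clusters
```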
G. Output Cluster
The last module is the output cluster. After applying the clustering techniques we obtain the clustered documents, which helps to identify the main concepts of the web documents.
IV. SYSTEM IMPLEMENTATION

The proposed system model illustrates the flow of implementation shown in Fig. 1. First, a web document is given as input to the system, where the HTML pages are collected and converted to XML. The second module, text processing, separates the sentences, labels the terms, removes stop words, and stems the remaining words. The third module, concept-based analysis, measures the conceptual term frequency (ctf), the term frequency (tf), and the document frequency (df). The next module, concept-based document similarity, finds what percentage of the concepts is similar to the given concept. After this, the clustering techniques, namely single pass, the HAC algorithm, and the KNN algorithm, are applied to the calculated frequencies. Finally we obtain clustered output with matching concepts.

V. RESULT ANALYSIS

To test the effectiveness of concept matching in determining an accurate measure of the similarity between documents, extensive sets of experiments using the concept-based term analysis and similarity measure are conducted. In the data sets, the text is analyzed directly rather than through metadata associated with the text documents. The clustering quality is reported with the F-measure (Fig. 2) and Entropy (Fig. 3) graphs for this system.

Precision(i, j) = Mij / Mj
Recall(i, j) = Mij / Mi
F-Score = 2PR / (P + R)

where Mij is the number of members of class i in cluster j, Mj is the number of members of cluster j, and Mi is the number of members of class i. Entropy is computed from the class distribution within each cluster and averaged over the clusters; the standard definition is sketched below.

Fig 2: Graph for F-Measure
Fig 3: Graph for Entropy
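The following sketch computes the F-score from the counts defined above and uses the standard cluster-entropy definition (the paper's own entropy formula appears as a figure in the original, so the standard definition is assumed here).

```python
import math

def f_measure(Mij, Mi, Mj):
    """F-score for class i and cluster j from the counts defined in the text."""
    precision = Mij / Mj
    recall = Mij / Mi
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

def cluster_entropy(class_counts_per_cluster):
    """Weighted average entropy over clusters (standard definition, assumed).

    class_counts_per_cluster: list of dicts, one per cluster, mapping a
    class label to the number of cluster members carrying that label.
    """
    total = sum(sum(counts.values()) for counts in class_counts_per_cluster)
    entropy = 0.0
    for counts in class_counts_per_cluster:
        size = sum(counts.values())
        # Entropy of one cluster: -sum_i p_ij * log(p_ij), with p_ij = n_i / size.
        e_j = -sum((n / size) * math.log(n / size) for n in counts.values() if n)
        entropy += (size / total) * e_j
    return entropy
```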

VI. CONCLUSION

The concept-based mining model is composed of four components that improve text clustering quality. The first component is the sentence-based concept analysis, which analyzes the semantic structure of each sentence to capture the sentence concepts using the proposed conceptual term frequency (ctf) measure. The second component, document-based concept analysis, analyzes each concept at the document level using the concept-based term frequency (tf). The third component analyzes concepts at the corpus level using the document frequency (df) global measure. The fourth component is the concept-based similarity measure, which allows measuring the importance of each concept with respect to the semantics of the sentence, the topic of the document, and the discrimination among documents in a corpus. The same model can be applied to text classification.

REFERENCES
[1] S. Shehata, F. Karray, and M. S. Kamel, "An Efficient Concept-Based Mining Model for Enhancing Text Clustering," IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, October 2010.
[2] S. Shehata, F. Karray, and M. Kamel, "Enhancing Text Clustering Using Concept-Based Mining Model," Proc. Sixth IEEE Int'l Conf. Data Mining (ICDM), 2006.
[3] M.-L. Wong and K. S. Leung, "Scalable Model-Based Clustering for Large Databases Based on Data Summarization," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 27, no. 11, pp. 1710-1719, Nov. 2005.
[4] R. Nock and F. Nielsen, "On Weighting Clustering," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 28, no. 8, pp. 1223-1235, Aug. 2006.
[5] S. Pradhan, K. Hacioglu, V. Krugler, W. Ward, J. H. Martin, and D. Jurafsky, "Support Vector Learning for Semantic Argument Classification," Machine Learning, vol. 60, nos. 1-3, pp. 11-39, 2005.
[6] S. Pradhan, W. Ward, K. Hacioglu, J. Martin, and D. Jurafsky, "Shallow Semantic Parsing Using Support Vector Machines," Proc. Human Language Technology/North Am. Assoc. for Computational Linguistics (HLT/NAACL), 2004.
[7] S. Pradhan, K. Hacioglu, W. Ward, J. H. Martin, and D. Jurafsky, "Semantic Role Parsing: Adding Semantic Structure to Unstructured Text," Proc. Third IEEE Int'l Conf. Data Mining (ICDM), pp. 629-632, 2003.
[8] P. Mitra, C. Murthy, and S. K. Pal, "Unsupervised Feature Selection Using Feature Similarity," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 24, no. 3, pp. 301-312, Mar. 2002.

