Extractive document summarization - an unsupervised approach

Jonatan Bengtsson*, Christoffer Skeppstedt†, Svetoslav Marinov*

* Findwise AB, † Tickster AB, Gothenburg, Sweden

* {jonatan.bengtsson,svetoslav.marinov}@findwise.com, † christoffer@tickster.com

                                                            Abstract
In this paper we present and evaluate a system for automatic extractive document summarization. We employ three different unsupervised algorithms for sentence ranking: TextRank, K-means clustering and, previously unexplored in this field, one-class SVM. By adding language and domain specific boosting we achieve state-of-the-art performance for English, measured in ROUGE Ngram(1,1) score on the DUC 2002 dataset: 0.4797. In addition, the system can be used for both single and multi-document summarization. We also present results for Swedish, based on a new corpus of featured Wikipedia articles.


1. Introduction
An extractive summarization system tries to identify the most relevant sentences in an input document (aka single document summarization, SDS) or cluster of similar documents (aka multi-document summarization, MDS) and uses these to create a summary (Nenkova and McKeown, 2011). This task can be divided into four subtasks: document processing, sentence ranking, sentence selection and sentence ordering. Section 2. describes all necessary document processing.
   We have chosen to work entirely with unsupervised machine learning algorithms to achieve maximum domain independence. We utilize three algorithms for sentence ranking: TextRank (Mihalcea and Tarau, 2004), K-means clustering (García-Hernández et al., 2008) and One-class Support Vector Machines (oSVM) (Schölkopf et al., 2001). In three of the summarization subtasks, a crucial component is the calculation of sentence similarities. Both the ranking algorithms and the similarity measures are presented in more detail in Section 3.
   Sections 4. and 5. deal with the different customizations of the system to tackle the tasks of SDS and MDS respectively. Here we also describe the subtasks of sentence selection and ordering.
   In Section 6. we present the results from the evaluation on Swedish and English corpora.
2. Document processing
The minimal preprocessing required by an extractive document summarization system is sentence splitting. While this is sufficient to create a baseline, its performance will be suboptimal. Further linguistic processing is central to optimizing the system. The basic requirements are tokenization, stemming and part-of-speech (POS) tagging. Given a language where we have all the basic resources, our system will be able to produce a summary of a document.
   Sentence splitting, tokenization and POS tagging are done using OpenNLP (http://opennlp.apache.org) for both English and Swedish. In addition we have explored several other means of linguistic analysis such as Named Entity Recognition (NER), keyword extraction, dependency parsing and noun phrase (NP) chunking. Finally we can augment the importance of words by calculating their term frequency times inverse document frequency (TF*IDF) score.
   NER for English is performed with the Stanford Named Entity Recognizer (http://nlp.stanford.edu/software), while for Swedish we use OpenNLP. The latter library is also used for NP chunking for English. Dependency parsing is performed by MaltParser (http://maltparser.org).
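As a brief illustration of the TF*IDF score mentioned above, the sketch below uses raw term counts and a logarithmic inverse document frequency; the paper does not spell out its exact weighting scheme, so this particular variant is an assumption on our part.

```python
import math
from collections import Counter

def tfidf(documents):
    """Score each word in each tokenized document by
    tf(w, d) * log(N / df(w)), where df(w) is the number
    of documents containing w."""
    n_docs = len(documents)
    df = Counter(word for doc in documents for word in set(doc))
    return [{word: tf * math.log(n_docs / df[word])
             for word, tf in Counter(doc).items()}
            for doc in documents]
```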
3. Sentence Ranking
A central component in an extractive summarization system is the sentence ranking algorithm. Its role is to assign a real-valued rank to each input sentence or to order the sentences according to their relevance.
3.1 TextRank
TextRank (Mihalcea and Tarau, 2004) models a document as a graph, where nodes correspond to sentences from the document and edges carry a weight describing the similarity between the nodes. Once the graph has been constructed, the nodes are ranked by an iterative algorithm based on Google's PageRank (Brin and Page, 1998).
   The notion of sentence similarity is crucial to TextRank. Mihalcea and Tarau (2004) use the following similarity measure: $\mathrm{Similarity}(S_i, S_j) = \frac{|S_i \cap S_j|}{\log|S_i| + \log|S_j|}$, where $S_i$ is a sentence from the document, $|S_i|$ is its length in words and $|S_i \cap S_j|$ is the word overlap between $S_i$ and $S_j$.
   We have tested several other enhanced approaches, such as cosine, TF*IDF, POS tag and dependency tree based similarity measures.
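For concreteness, here is a minimal sketch of TextRank over tokenized sentences using the overlap similarity above. The damping factor of 0.85 and the fixed iteration count are conventional PageRank choices of ours, not parameters reported in the paper.

```python
import math
from itertools import combinations

def overlap_similarity(s_i, s_j):
    """Similarity(Si, Sj) = |Si ∩ Sj| / (log|Si| + log|Sj|)."""
    overlap = len(set(s_i) & set(s_j))
    denom = math.log(len(s_i)) + math.log(len(s_j))
    return overlap / denom if denom > 0 else 0.0

def textrank(sentences, damping=0.85, iterations=50):
    """Rank tokenized sentences with a PageRank-style iteration
    over the weighted sentence-similarity graph."""
    n = len(sentences)
    w = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        w[i][j] = w[j][i] = overlap_similarity(sentences[i], sentences[j])
    out_sums = [sum(row) or 1.0 for row in w]
    ranks = [1.0] * n
    for _ in range(iterations):
        ranks = [(1 - damping) + damping *
                 sum(w[j][i] * ranks[j] / out_sums[j] for j in range(n))
                 for i in range(n)]
    return ranks
```

The highest-ranked sentences then feed the selection step described in Section 4.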
3.2 K-means clustering
García-Hernández et al. (2008) adapt the well-known K-means clustering algorithm to the task of document summarization. We use the same approach and divide sentences into k clusters, from which we then select the most salient ones. We have tested three different ways of ordering sentences by relevance: position-, centroid- and TextRank-based. The value of k is conditioned on the mean sentence length in a document and the desired summary length.
   Each sentence is converted to a word vector before the clustering begins. The vectors can contain all unique words of the document or a subset based on POS tags, document keywords or named entities.
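A minimal sketch of this route, assuming scikit-learn for the TF*IDF vectorization and clustering. Picking the sentence nearest each centroid corresponds to the centroid-based relevance ordering; the position- and TextRank-based variants would swap out that last step.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def kmeans_summary_indices(sentences, k):
    """Cluster TF*IDF sentence vectors into k clusters and pick
    the sentence closest to each centroid."""
    vectors = TfidfVectorizer().fit_transform(sentences).toarray()
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(vectors)
    chosen = []
    for c in range(k):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(vectors[members] - km.cluster_centers_[c], axis=1)
        chosen.append(int(members[np.argmin(dists)]))
    return sorted(chosen)  # restore original document order
```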
3.3 One-class SVM
oSVM is a previously unexplored approach when it comes to unsupervised, extractive summarization. Similarly to the K-means algorithm, sentences are seen as points in a coordinate system, but the task is to find the outline boundary, i.e. the support vectors that enclose all points. These vectors (or sentences) arguably define the document and are therefore interesting from a summarization point of view.
   For the kernel function we choose the sentence similarity measures (cf. 3.1). Similarly to choosing k in 3.2, the number of support vectors depends on the mean sentence length and the desired summary length.
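The idea can be sketched with scikit-learn's OneClassSVM on a precomputed similarity kernel. Cosine similarity, the paper's best-performing oSVM variant, is a valid (positive semi-definite) kernel; the mapping from the desired summary size to the nu parameter is our own illustrative heuristic, not the paper's exact rule.

```python
import numpy as np
from sklearn.svm import OneClassSVM

def osvm_summary_indices(sentences, similarity, target_sentences):
    """Fit a one-class SVM on a precomputed sentence-similarity kernel;
    the support vectors are the sentences kept for the summary."""
    n = len(sentences)
    gram = np.array([[similarity(a, b) for b in sentences] for a in sentences])
    # nu lower-bounds the fraction of support vectors, so aim it
    # at the desired summary size (illustrative heuristic).
    nu = min(max(target_sentences / n, 1e-3), 1.0)
    model = OneClassSVM(kernel="precomputed", nu=nu).fit(gram)
    return sorted(model.support_.tolist())
```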
4. Single document summarization
For SDS we can use domain specific knowledge in order to boost the sentence rank and thus improve the performance of the system. As an example, in the domain of newspaper articles the sentence position tends to have a significant role, with initial sentences containing the gist of the article. We use an inverse square root function to update the sentence ranks: $\mathrm{Boost}(S_i) = S_i.\mathit{rank} \cdot \left(1 + \frac{1}{\sqrt{S_i.\mathit{pos}}}\right)$, where $S_i.\mathit{rank}$ is the prior value and $S_i.\mathit{pos}$ is the position of the sentence in the document. We see such boosting functions as important steps for domain customization.
   Once the sentences have been ranked, the selection and ordering tasks are relatively straightforward: we take the highest ranked sentences until a word limit is reached and order these according to their original position in the text.
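Both the boost and the select-then-reorder step fit in a few lines. We assume 1-based sentence positions; the 100-word limit matches the DUC 2002 reference summaries.

```python
import math

def boost_by_position(ranks):
    """Boost(Si) = Si.rank * (1 + 1/sqrt(Si.pos)), positions starting at 1."""
    return [r * (1 + 1 / math.sqrt(pos)) for pos, r in enumerate(ranks, start=1)]

def select_summary(sentences, ranks, word_limit=100):
    """Take the highest-ranked sentences until the word limit is
    reached, then restore original document order."""
    by_rank = sorted(range(len(sentences)), key=lambda i: ranks[i], reverse=True)
    chosen, words = [], 0
    for i in by_rank:
        length = len(sentences[i].split())
        if words + length > word_limit:
            break
        chosen.append(i)
        words += length
    return [sentences[i] for i in sorted(chosen)]
```

Applying boost_by_position before select_summary reproduces the newspaper-domain customization described above.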
5. Multi document summarization
When it comes to MDS, two different approaches have been tested. The first is to summarize a document cluster by taking all sentences in it. The other approach is based on the work of Mihalcea and Tarau (2005), who use a two-stage approach where each document is first summarized and then we summarize only the summaries (2-stageSum).
   MDS shares the same ranking algorithms as SDS, coupled with specific sentence selection and ordering. We rely on similarity measures (cf. 3.1) to avoid selecting near-duplicate sentences, and adopt a topic/publication date-based approach for sentence ordering (Bollegala et al., 2006).
6. Evaluation
The system is evaluated on the DUC 2002 corpus, which consists of 567 English news articles in 59 clusters, paired with 100-word summaries. For Swedish we use a corpus of 251 featured Wikipedia articles from 2010, where the introduction is considered to be the summary.
   We rely on the ROUGE toolkit to evaluate the automatically generated summaries and use Ngram(1,1) F1 settings, as these have been shown to closely relate to human ratings (Lin and Hovy, 2003), without stemming and stop word removal. Two kinds of baseline systems are also tested: random selection and leading sentence selection (see Table 1).

                                       English           Swedish
   Algorithm                           SDS      MDS      SDS
   TextRank (TF*IDF, POS)              0.4797   0.2537   0.3593
   K-means (TF*IDF, POS)               0.4680   0.2400   0.3539
   oSVM (cosine)                       0.4343   -        0.3399
   2-stageSum (TextRank-based)         -        0.2561   -
   (Mihalcea and Tarau, 2004)          0.4708   -        -
   (García-Hernández et al., 2008)     0.4791   -        -
   Baseline_lead                       0.4649   0.2317   0.3350
   Baseline_rand                       0.3998   0.2054   0.3293

   Table 1: Results and Comparison
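For orientation, the Ngram(1,1) F1 score amounts to unigram-overlap F1. The official ROUGE toolkit adds many options; this minimal re-implementation only mirrors the settings above (no stemming, no stop word removal).

```python
from collections import Counter

def rouge1_f1(candidate_tokens, reference_tokens):
    """Unigram-overlap F1 in the spirit of ROUGE Ngram(1,1)."""
    overlap = sum((Counter(candidate_tokens) & Counter(reference_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate_tokens)
    recall = overlap / len(reference_tokens)
    return 2 * precision * recall / (precision + recall)
```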
7. Conclusion
In this paper we have presented a system capable of doing both SDS and MDS. By relying on unsupervised machine learning algorithms we achieve domain independence. With relatively little language dependent processing the system can be ported to new languages and domains. We have evaluated three different algorithms for sentence ranking, of which oSVM was previously unexplored in this field. By adding domain knowledge in the form of sentence rank boosting, we receive higher ROUGE scores with the TextRank algorithm than other systems tested on the DUC 2002 dataset. In addition, we have tested the system for Swedish on a new corpus, with promising results.

8. References
Danushka Bollegala, Naoaki Okazaki, and Mitsuru Ishizuka. 2006. A bottom-up approach to sentence ordering for multi-document summarization. In Proceedings of the COLING/ACL, pages 385–392.
Sergey Brin and Lawrence Page. 1998. The anatomy of a large-scale hypertextual web search engine. Comput. Netw. ISDN Syst., 30(1-7):107–117, April.
René Arnulfo García-Hernández, Romyna Montiel, Yulia Ledeneva, Eréndira Rendón, Alexander Gelbukh, and Rafael Cruz. 2008. Text Summarization by Sentence Extraction Using Unsupervised Learning. In Proceedings of the 7th Mexican International Conference on Artificial Intelligence: Advances in Artificial Intelligence, MICAI '08, pages 133–143, Berlin, Heidelberg. Springer-Verlag.
Chin-Yew Lin and Eduard Hovy. 2003. Automatic evaluation of summaries using N-gram co-occurrence statistics. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, volume 1 of NAACL '03, pages 71–78, Stroudsburg, PA, USA. Association for Computational Linguistics.
Rada Mihalcea and Paul Tarau. 2004. TextRank: Bringing Order into Texts. In Conference on Empirical Methods in Natural Language Processing, Barcelona, Spain.
Rada Mihalcea and Paul Tarau. 2005. Multi-document summarization with iterative graph-based algorithms. In Proceedings of the AAAI Spring Symposium on Knowledge Collection from Volunteer Contributors.
Ani Nenkova and Kathleen McKeown. 2011. Automatic Summarization. Foundations and Trends in Information Retrieval, 5(2–3):103–233.
Bernhard Schölkopf, John C. Platt, John C. Shawe-Taylor, Alex J. Smola, and Robert C. Williamson. 2001. Estimating the support of a high-dimensional distribution. Neural Comput., 13(7):1443–1471, July.
