Extractive document summarization - an unsupervised approach

Jonatan Bengtsson*, Christoffer Skeppstedt†, Svetoslav Marinov*
*Findwise AB, †Tickster AB, Gothenburg, Sweden
{jonatan.bengtsson,svetoslav.marinov}@findwise.com, christoffer@tickster.com
Abstract
In this paper we present and evaluate a system for automatic extractive document summarization. We employ three different unsupervised algorithms for sentence ranking: TextRank, K-means clustering and, previously unexplored in this field, one-class SVM. By adding language- and domain-specific boosting we achieve state-of-the-art performance for English, measured in ROUGE Ngram(1,1) score on the DUC 2002 dataset: 0.4797. In addition, the system can be used for both single- and multi-document summarization. We also present results for Swedish, based on a new corpus of featured Wikipedia articles.
1. Introduction
An extractive summarization system tries to identify the most relevant sentences in an input document (aka single document summarization, SDS) or cluster of similar documents (aka multi-document summarization, MDS) and uses these to create a summary (Nenkova and McKeown, 2011). This task can be divided into four subtasks: document processing, sentence ranking, sentence selection and sentence ordering. Section 2 describes all necessary document processing.

We have chosen to work entirely with unsupervised machine learning algorithms to achieve maximum domain independence. We utilize three algorithms for sentence ranking: TextRank (Mihalcea and Tarau, 2004), K-means clustering (García-Hernández et al., 2008) and One-class Support Vector Machines (oSVM) (Schölkopf et al., 2001). In three of the summarization subtasks, a crucial component is the calculation of sentence similarities. Both the ranking algorithms and the similarity measures are presented in more detail in Section 3.

Sections 4 and 5 deal with the different customizations of the system to tackle the tasks of SDS and MDS respectively. Here we also describe the subtasks of sentence selection and ordering.

In Section 6 we present the results from the evaluation on Swedish and English corpora.

2. Document processing
The minimal preprocessing required by an extractive document summarization system is sentence splitting. While this is sufficient to create a baseline, its performance will be suboptimal. Further linguistic processing is central to optimizing the system. The basic requirements are tokenization, stemming and part-of-speech (POS) tagging. Given a language where we have all the basic resources, our system will be able to produce a summary of a document.

Sentence splitting, tokenization and POS tagging are done using OpenNLP (http://opennlp.apache.org) for both English and Swedish. In addition we have explored several other means of linguistic analysis such as Named Entity Recognition (NER), keyword extraction, dependency parsing and noun phrase (NP) chunking. Finally we can augment the importance of words by calculating their term frequency times inverse document frequency (TF*IDF) score.

NER for English is performed with Stanford Named Entity Recognizer (http://nlp.stanford.edu/software) while for Swedish we use OpenNLP. The latter library is also used for NP chunking for English. Dependency parsing is performed by MaltParser (http://maltparser.org).

3. Sentence Ranking
A central component in an extractive summarization system is the sentence ranking algorithm. Its role is to assign a real-value rank to each input sentence or order the sentences according to their relevance.

3.1 TextRank
TextRank (Mihalcea and Tarau, 2004) models a document as a graph, where nodes correspond to sentences from the document, and edges carry a weight describing the similarity between the nodes. Once the graph has been constructed, the nodes are ranked by an iterative algorithm based on Google's PageRank (Brin and Page, 1998).

The notion of sentence similarity is crucial to TextRank. Mihalcea and Tarau (2004) use the following similarity measure: Similarity(S_i, S_j) = |S_i ∩ S_j| / (log|S_i| + log|S_j|), where S_i is a sentence from the document, |S_i| is its length in words and |S_i ∩ S_j| is the word overlap between S_i and S_j.

We have tested several other enhanced approaches, such as cosine, TF*IDF, POS tag and dependency tree based similarity measures.

3.2 K-means clustering
García-Hernández et al. (2008) adapt the well-known K-means clustering algorithm to the task of document summarization. We use the same approach and divide sentences into k clusters, from which we then select the most salient ones. We have tested three different ways for sentence relevance ordering: position-, centroid- and TextRank-based. The value of k is conditioned on the mean sentence length in a document and the desired summary length.

Each sentence is converted to a word vector before the clustering begins. The vectors can contain all unique words of the document or a subset based on POS tags, document keywords or named entities.
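The graph-based ranking of 3.1 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the whitespace tokenization, damping factor d = 0.85 and fixed iteration count are assumptions.

```python
import math

def similarity(si, sj):
    """Overlap similarity of Mihalcea and Tarau (2004):
    |Si ∩ Sj| / (log|Si| + log|Sj|), with sentence length in words."""
    wi, wj = si.lower().split(), sj.lower().split()
    if not wi or not wj:
        return 0.0
    overlap = len(set(wi) & set(wj))
    denom = math.log(len(wi)) + math.log(len(wj))
    return overlap / denom if denom > 0 else 0.0

def textrank(sentences, d=0.85, iterations=50):
    """Rank sentences by weighted PageRank over the similarity graph."""
    n = len(sentences)
    w = [[similarity(sentences[i], sentences[j]) if i != j else 0.0
          for j in range(n)] for i in range(n)]
    out = [sum(row) for row in w]  # total outgoing edge weight per node
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [(1 - d) + d * sum(w[j][i] / out[j] * scores[j]
                                    for j in range(n) if out[j] > 0)
                  for i in range(n)]
    return scores
```

A summary would then keep the highest-scoring sentences; the same pairwise function can also serve as the redundancy measure used later for MDS sentence selection.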
3.3 One-class SVM
oSVM is a previously unexplored approach when it comes to unsupervised, extractive summarization. Similarly to the K-means algorithm, sentences are seen as points in a coordinate system, but the task is to find the outline boundary, i.e. the support vectors that enclose all points. These vectors (or sentences) arguably define the document and are therefore interesting from a summarization point of view.

For the kernel function we choose the sentence similarity measures (cf. 3.1). Similarly to choosing k in 3.2, the number of support vectors is dependent on the mean sentence and desired summary lengths.

                                  English            Swedish
Algorithm                         SDS      MDS       SDS
TextRank (TF*IDF, POS)            0.4797   0.2537    0.3593
K-means (TF*IDF, POS)             0.4680   0.2400    0.3539
oSVM (cosine)                     0.4343   -         0.3399
2-stageSum (TextRank-based)       -        0.2561    -
(Mihalcea and Tarau, 2004)        0.4708   -         -
(García-Hernández et al., 2008)   0.4791   -         -
Baseline_lead                     0.4649   0.2317    0.3350
Baseline_rand                     0.3998   0.2054    0.3293

Table 1: Results and Comparison
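The support-vector selection of 3.3 can be sketched with scikit-learn's OneClassSVM over a precomputed kernel of pairwise sentence similarities. The library choice, the bag-of-words cosine kernel and the nu value (which lower-bounds the fraction of support vectors and so plays the role of the summary-length control) are assumptions for illustration, not the authors' setup.

```python
import math
from collections import Counter

import numpy as np
from sklearn.svm import OneClassSVM

def cosine(si, sj):
    """Cosine similarity over bag-of-words counts."""
    ci, cj = Counter(si.lower().split()), Counter(sj.lower().split())
    dot = sum(ci[w] * cj[w] for w in ci)
    ni = math.sqrt(sum(v * v for v in ci.values()))
    nj = math.sqrt(sum(v * v for v in cj.values()))
    return dot / (ni * nj) if ni > 0 and nj > 0 else 0.0

def osvm_select(sentences, sim=cosine, nu=0.5):
    """Return indices of sentences that become support vectors of a
    one-class SVM trained on the precomputed similarity (kernel) matrix."""
    K = np.array([[sim(a, b) for b in sentences] for a in sentences])
    model = OneClassSVM(kernel="precomputed", nu=nu).fit(K)
    return sorted(int(i) for i in model.support_)  # original document order
```

The caller would then take the sentences at the returned indices until the word limit is reached, mirroring the selection step described for SDS.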
4. Single document summarization
For SDS we can use domain-specific knowledge in order to boost the sentence rank and thus improve the performance of the system. As an example, in the domain of newspaper articles the sentence position tends to have a significant role, with initial sentences containing the gist of the article. We use an inverse square function to update the sentence ranks: Boost(S_i) = S_i.rank * (1 + 1/sqrt(S_i.pos)), where S_i.rank is the prior value and S_i.pos is the position of the sentence in the document. We see such boosting functions as important steps for domain customization.

Once the sentences have been ranked, the selection and ordering tasks are relatively straightforward: we take the highest ranked sentences until a word limit is reached and order these according to their original position in the text.

5. Multi document summarization
When it comes to MDS, two different approaches have been tested. The first one is to summarize a document cluster by taking all sentences in it. The other approach is based on the work of Mihalcea and Tarau (2005), who use a two-stage approach where each document is first summarized and then we summarize only the summaries (2-stageSum).

MDS shares the same ranking algorithms as SDS, coupled with specific sentence selection and ordering. We rely on similarity measures (cf. 3.1) to avoid selecting near-duplicate sentences and adopt a topic/publication-date-based approach for sentence ordering (Bollegala et al., 2006).

6. Evaluation
The system is evaluated on the DUC 2002 corpus, which consists of 567 English news articles in 59 clusters paired with 100-word summaries. For Swedish we use a corpus of 251 featured Wikipedia articles from 2010, where the introduction is considered to be the summary.

We rely on the ROUGE toolkit to evaluate the automatically generated summaries and use Ngram(1,1) F1 settings, as these have been shown to closely relate to human ratings (Lin and Hovy, 2003), without stemming and stop word removal. Two kinds of baseline systems are also tested: random selection and leading sentence selection (see Table 1).

7. Conclusion
In this paper we have presented a system capable of doing both SDS and MDS. By relying on unsupervised machine learning algorithms we achieve domain independence. With relatively little language-dependent processing the system can be ported to new languages and domains. We have evaluated three different algorithms for sentence ranking, of which oSVM is previously unexplored in this field. By adding domain knowledge in the form of sentence rank boosting to the TextRank algorithm we achieve higher ROUGE scores than other systems tested on the DUC 2002 dataset. In addition, we have tested the system for Swedish on a new corpus, with promising results.

8. References
Danushka Bollegala, Naoaki Okazaki, and Mitsuru Ishizuka. 2006. A bottom-up approach to sentence ordering for multi-document summarization. In Proceedings of COLING/ACL, pages 385–392.
Sergey Brin and Lawrence Page. 1998. The anatomy of a large-scale hypertextual web search engine. Comput. Netw. ISDN Syst., 30(1-7):107–117, April.
René Arnulfo García-Hernández, Romyna Montiel, Yulia Ledeneva, Eréndira Rendón, Alexander Gelbukh, and Rafael Cruz. 2008. Text Summarization by Sentence Extraction Using Unsupervised Learning. In Proceedings of the 7th Mexican International Conference on Artificial Intelligence: Advances in Artificial Intelligence, MICAI '08, pages 133–143, Berlin, Heidelberg. Springer-Verlag.
Chin-Yew Lin and Eduard Hovy. 2003. Automatic evaluation of summaries using N-gram co-occurrence statistics. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, volume 1 of NAACL '03, pages 71–78, Stroudsburg, PA, USA. Association for Computational Linguistics.
Rada Mihalcea and Paul Tarau. 2004. TextRank: Bringing Order into Texts. In Conference on Empirical Methods in Natural Language Processing, Barcelona, Spain.
Rada Mihalcea and Paul Tarau. 2005. Multi-document summarization with iterative graph-based algorithms. In Proceedings of the AAAI Spring Symposium on Knowledge Collection from Volunteer Contributors.
Ani Nenkova and Kathleen McKeown. 2011. Automatic Summarization. Foundations and Trends in Information Retrieval, 5(2–3):103–233.
Bernhard Schölkopf, John C. Platt, John Shawe-Taylor, Alex J. Smola, and Robert C. Williamson. 2001. Estimating the support of a high-dimensional distribution. Neural Comput., 13(7):1443–1471, July.