The document discusses summarization techniques, covering extractive and generative summarization: extractive summarization selects key words from the document, while generative summarization composes new sentences for the summary. Simple statistics-based methods are described, including tracking the most frequent words in a document and identifying collocations and word networks. Sentence generation can draw on statistical tables of word frequencies, Bayesian probabilities of words in certain positions, or N-grams and part-of-speech tags. The goal is to efficiently summarize large documents while retaining the important information.
2. Summarization
Given a document (a corpus), summarize it with
words that represent its content
Extractive summarization
key words
Generative summarization
summary sentences
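A toy sketch of the extractive idea (not from the slides): rank each sentence by the document-wide frequency of its words and pick the highest-scoring one.

```python
from collections import Counter

# Hypothetical three-sentence "document".
doc = [
    "information retrieval finds relevant documents",
    "summarization condenses a document",
    "retrieval of information matters",
]

# Word frequencies over the whole document.
freq = Counter(w for s in doc for w in s.split())

# Extract the sentence whose words are most frequent overall.
best = max(doc, key=lambda s: sum(freq[w] for w in s.split()))
print(best)
```

Generative summarization would instead compose a new sentence from such statistics, as sketched in the later slides.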
Information Retrieval – ISD312 Summarization 2
3. Simple statistics
Most frequent words

import nltk
from nltk.book import *
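The frequency count can be sketched without the NLTK book corpora using `collections.Counter`; `nltk.FreqDist` is a `Counter` subclass and offers the same `most_common` interface.

```python
from collections import Counter

# Toy token list; in the slides the tokens come from the nltk.book texts.
tokens = "the cat sat on the mat and the dog sat on the log".split()

freq = Counter(tokens)
print(freq.most_common(3))  # the three most frequent words with their counts
```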
4. import nltk
from nltk.book import *

def kataKunci(df, ambang):
    # df: word-frequency mapping (e.g. an nltk.FreqDist)
    # ambang: relative-frequency threshold in (0, 1]
    maks = max(df.values())  # renamed to avoid shadowing the built-in max()
    for vocab in df:
        if df[vocab] / maks > ambang:
            print(vocab, end=' ')
    print()
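As a minimal usage sketch, here is a hypothetical variant of kataKunci that returns the keywords instead of printing them, called with a plain dict of word counts (an nltk.FreqDist works the same way, since it behaves like a dict):

```python
def kataKunci(df, ambang):
    # Keep every word whose frequency, relative to the peak, exceeds the threshold.
    maks = max(df.values())
    return [w for w in df if df[w] / maks > ambang]

# Toy word counts standing in for a real frequency distribution.
counts = {'information': 9, 'document': 6, 'and': 2, 'the': 1}
print(kataKunci(counts, 0.5))  # words above half the peak frequency
```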
5. Phrases, word groups
Collocations
Word networks within a document
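Collocations can be approximated by counting adjacent word pairs, sketched here with plain `Counter`; NLTK also provides this directly via `nltk.Text.collocations()` and `nltk.collocations.BigramCollocationFinder`.

```python
from collections import Counter

# Toy token list; a recurring adjacent pair is a candidate collocation.
tokens = "new york is a big city and new york never sleeps".split()

bigrams = Counter(zip(tokens, tokens[1:]))
print(bigrams.most_common(1))  # the most frequent adjacent word pair
```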
6. Generating sentences
Simple statistics
Table of word-occurrence statistics
Bayesian statistics
Probability of a word at the beginning of a sentence
Probability of a word following another word
Other methods
N-grams
POS tags
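The "probability of a word following another" idea can be sketched as a bigram table: record which words were observed to follow each word, then generate a sentence by repeatedly sampling a successor. The corpus, start word, and end marker here are all toy assumptions.

```python
import random
from collections import defaultdict

# Toy corpus; '.' acts as the end-of-sentence marker.
corpus = "i like coffee . i like tea . she likes coffee .".split()

# Bigram table: for each word, the words observed to follow it.
# Sampling from this list approximates P(next word | current word).
successors = defaultdict(list)
for a, b in zip(corpus, corpus[1:]):
    successors[a].append(b)

random.seed(0)        # reproducible sampling
word = 'i'            # assumed sentence-start word
sentence = [word]
while word != '.' and len(sentence) < 10:
    word = random.choice(successors[word])
    sentence.append(word)
print(' '.join(sentence))
```

Higher-order N-grams condition on more than one preceding word, and POS tags can constrain which successors are grammatically plausible.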
7. The rapid growth of the Internet has resulted in enormous
amounts of information that has become more difficult to access
efficiently. Internet users require tools to help manage this vast
quantity of information. The primary goal of this research is to
create an efficient and effective tool that is able to summarize
large documents quickly. This research presents a linear time
algorithm for calculating lexical chains which is a method of
capturing the “aboutness” of a document. This method is
compared to previous, less efficient methods of lexical chain
extraction. We also provide alternative methods for extracting
and scoring lexical chains. We show that our method provides
similar results to previous research, but is substantially more
efficient. This efficiency is necessary in Internet search
applications where many large documents may need to be
summarized at once, and where the response time to the end
user is extremely important.
9. import nltk

# Requires the 'punkt' tokenizer models: nltk.download('punkt')
data = 'Sebuah contoh kalimat yang ingin dianalisis menggunakan NLTK'
tokens = nltk.word_tokenize(data)
text = nltk.Text(tokens)