Extracting Article Text from the Web with Maximum Subsequence Segmentation
Authors: Jeff Pasternack, Dan Roth
Publication: WWW 2009
Presenter: Jhih-Ming Chen 1
Outline
- Introduction
- Application
- Problem Description
- Identifying Articles
- Cleaning Extracted Blocks
- Experimental Setup (Data, Tokenization)
- Algorithms (Maximum Subsequence Optimization, Applying Maximum Subsequence)
- Results
- Conclusion 2
Introduction Extracting article text, as with cleaning HTML documents in general, has become much more complex: page layouts contain sidebars, inserts, headers, navigation panels, ads, etc. The most common approach has been to hand-code rules, often as regular expressions. These can be very accurate, but they are highly labor-intensive and easily broken by changes to a page’s structure. This has led to interest in finding a better solution. 3
Application There are a number of applications where article text extraction is either essential or can improve overall results. As devices such as PDAs and cell phones have become commonplace, adapting web pages for small screens has become increasingly important; in the context of article pages, the article text is the primary concern of the user. Once extracted, the text enables applications such as automatically generating RSS feeds from article pages. Eliminating the irrelevant content surrounding the article text is also useful in information retrieval. Query expansion benefits for the same reason, as unrelated terms in the page outside the article can be ignored. 4
Problem Description Given a page, we would like to:
- Decide if the page contains an article.
- Identify a contiguous block of HTML in the page starting with the first word in the article and ending with the last.
- Remove everything other than the article text (and its included markup tags) itself.
As a practical matter, when the first or last word is nested within one or more pairs of tags, those enclosing tags are treated as part of the extracted block. 5
Identifying Articles The first step, determining whether a page contains an article, is a document classification problem. This paper’s evaluation implicitly assumes that such a classifier is provided, since all testing examples contain articles; no such assumption is made in training. A complication arises in determining exactly where an article begins and where it ends. A news article typically follows the pattern “headline -> bylines -> main text -> bylines”. 6
Identifying Articles An extraction may start at:
- the first word of a headline;
- the first word of an authorship byline before the main text (e.g. “By John Doe”) or of an organization byline (e.g. “The Associated Press”);
- if no author is given, the first word of the main text.
An extraction may end at:
- the last word of the main text;
- the last word of an authorship byline or organization byline appearing after the main text;
- the first word of a headline.
Articles may thus have up to six possible extractions. 7
Identifying Articles When experimental results are reported, predicted extractions are compared against: the main text alone (the shortest of the possible extractions; this is what the wrappers that generate the training examples extract), and the “best” extraction, the one that yields the most favorable F1-score for that particular prediction. 8
Cleaning Extracted Blocks After inspecting the evaluation sets, the authors found this could be done using just a few rules with little error. Starting with an extracted block of HTML that starts at the first word and ends at the last word of the main text of the article:
- Remove the contents of any <IFRAME> or <TABLE> tag pair.
- Remove the contents of any <DIV> tag pair that contains a hyperlink (<A>), <IFRAME>, <TABLE>, <IMG>, <EMBED>, <APPLET> or <OBJECT>.
This heuristic works well because the text of most news articles very rarely contains links, and embeds are intended either to show you something (e.g. a picture) or to get you to go somewhere (e.g. a link). Domains outside news will not always be so fortunate. 9
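The two rules above can be roughly sketched as below, assuming the relevant tag pairs are well-formed and not nested inside one another (real pages generally need a proper HTML parser); the function name is illustrative:

```python
import re

def clean_block(html):
    # Rule 1: drop the contents of any <IFRAME> or <TABLE> tag pair.
    html = re.sub(r"(?is)<(iframe|table)\b.*?</\1>", "", html)

    # Rule 2: drop any <DIV> pair whose contents include a link or embed;
    # divs without such contents are kept unchanged.
    def drop_if_embedded(m):
        body = m.group(0)
        if re.search(r"(?i)<(a|iframe|table|img|embed|applet|object)\b", body):
            return ""
        return body

    return re.sub(r"(?is)<div\b.*?</div>", drop_if_embedded, html)

print(clean_block('Story text. <div><a href="#">Related</a></div> More text.'))
# → 'Story text.  More text.'
```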
Experimental Setup Data 24,000 training examples were semi-automatically gathered from twelve news websites. For each of the twelve training domains, a few pages were examined and a wrapper was written to extract the article text; these wrappers were then used to label the pages spidered from each source. 450 evaluation examples were gathered from forty-five news websites (Big 5 and Myriad 40) and tagged by hand. The “Big 5” set: fifty examples were tagged for each of the first five evaluation sources. The “Myriad 40” set consists of the remaining forty sources, chosen randomly from a list of several hundred (obtained from the Yahoo! Directory), and contains an international mix of English-language sites of widely varying size and sophistication. 10
Experimental Setup Tokenization The HTML is broken up into a list of tags, words and symbols, where numbers are considered to be “words”. When tokenizing pages, a few transformations are applied:
- A list of “common” tags was created by examining the training data and selecting all tags that occurred in pages of more than one source. Any tag not in this list is transformed into a generic <UNKNOWN>.
- Porter stemming is then applied to non-numeric words, and all numbers are stemmed to “1”.
- Everything inside <SCRIPT> and <STYLE> tag pairs is discarded. 11
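The transformations can be sketched as below on an already-tokenized stream; COMMON_TAGS is a toy stand-in for the list mined from the training data, and lowercasing stands in for an actual Porter stemmer:

```python
import re

# Toy stand-in for the "common" tag list mined from training data.
COMMON_TAGS = {"<p>", "</p>", "<div>", "</div>", "<a>", "</a>", "<img>"}

def transform(tokens):
    out = []
    for t in tokens:
        if t.startswith("<"):
            # Tags outside the common list become a generic <UNKNOWN>.
            out.append(t if t.lower() in COMMON_TAGS else "<UNKNOWN>")
        elif re.fullmatch(r"\d+", t):
            # All numbers are "stemmed" to the single token "1".
            out.append("1")
        else:
            # Lowercasing stands in for Porter stemming here.
            out.append(t.lower())
    return out

print(transform(["<p>", "Published", "2009", "<blink>", "</p>"]))
# → ['<p>', 'published', '1', '<UNKNOWN>', '</p>']
```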
Algorithms Maximum Subsequence Optimization Given a sequence S = (s1, s2, …, sn), where si ∈ R, a maximum subsequence of S is a contiguous subsequence T = (sa, sa+1, …, sb), 1 ≤ a ≤ b ≤ n, that maximizes the sum of its elements (the subsequence’s value). 12
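A maximum subsequence can be found in linear time with Kadane's algorithm; the sketch below (function name mine) returns the value and the inclusive start/end indices:

```python
def max_subsequence(scores):
    """Return (best_sum, start, end) for the maximum-sum contiguous
    subsequence of `scores` in O(n) time; `end` is inclusive.
    An empty input yields (0, -1, -1)."""
    best_sum, best_start, best_end = float("-inf"), -1, -1
    cur_sum, cur_start = 0, 0
    for i, s in enumerate(scores):
        if cur_sum <= 0:          # a non-positive prefix never helps; restart here
            cur_sum, cur_start = s, i
        else:
            cur_sum += s
        if cur_sum > best_sum:
            best_sum, best_start, best_end = cur_sum, cur_start, i
    if best_start == -1:
        return (0, -1, -1)
    return (best_sum, best_start, best_end)

print(max_subsequence([-2, 1, -3, 4, -1, 2, 1, -5, 4]))  # → (6, 3, 6)
```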
Applying Maximum Subsequence Baseline 1: Longest Cell or <DIV>. Adapted from [17], this baseline selects the table cell containing the most characters. The paper modifies it to also consider the contents of <DIV> tags, since not all articles in the evaluation set were embedded in a table. Baseline 2: Simple MSS. Find a maximum subsequence of a sequence of scores generated with an extremely simple scoring function that assigns a score of -3.25 to every tag and 1 to words and symbols. 13
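Baseline 2 can be sketched as follows; the scoring constants come from the slide, while the token representation (tags start with "<") and the function name are illustrative:

```python
def simple_mss(tokens):
    """Simple MSS baseline: score each token (-3.25 per tag, +1 per word
    or symbol) and return the tokens of the maximum-sum subsequence."""
    scores = [-3.25 if t.startswith("<") else 1.0 for t in tokens]
    # Kadane's algorithm over the score sequence, tracking the best span.
    best, cur, start, span = float("-inf"), 0.0, 0, (0, -1)
    for i, s in enumerate(scores):
        if cur <= 0:
            cur, start = s, i
        else:
            cur += s
        if cur > best:
            best, span = cur, (start, i)
    a, b = span
    return tokens[a:b + 1]

tokens = ["<html>", "<div>", "Menu", "<div>", "Breaking", "news", "story",
          "text", "here", ".", "<div>", "Footer", "<html>"]
print(simple_mss(tokens))
# → ['Breaking', 'news', 'story', 'text', 'here', '.']
```

The tag penalty makes tag-dense navigation regions strongly negative, so the best span tends to cover the tag-sparse article body.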
The Supervised Approach Maximum subsequence segmentation uses local, token-level classifiers to find a score for each token in the document, resulting in a sequence of scores. The classifiers used in the experiments are all Naïve Bayes with only two types of features:
- trigrams: given a list of n tokens U = (u1, u2, …, un), the trigram of token i is <ui, ui+1, ui+2>.
- most-recent-unclosed-tag: when a new opening tag is encountered, it is added to a stack; when a closing tag is found, the stack unwinds until a corresponding opening tag is matched and removed. The most-recent-unclosed-tag is the tag currently at the top of the stack.
Each training token carries an “in” or “out” label depending upon whether it is within the article text or not. 14
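The most-recent-unclosed-tag feature can be sketched with a simple stack, ignoring attributes and self-closing tags (a simplification of real HTML):

```python
def unclosed_tag_features(tokens):
    """For each token, return the tag at the top of the open-tag stack
    (None when no tag is open)."""
    stack, feats = [], []
    for t in tokens:
        if t.startswith("</"):                 # closing tag: unwind the stack
            name = t[2:-1]
            while stack and stack[-1] != name:
                stack.pop()
            if stack:
                stack.pop()
        elif t.startswith("<"):                # opening tag: push it
            stack.append(t[1:-1])
        feats.append(stack[-1] if stack else None)
    return feats

print(unclosed_tag_features(["<div>", "<p>", "Hello", "</p>", "world", "</div>"]))
# → ['div', 'p', 'p', 'div', 'div', None]
```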
The Supervised Approach When classifying new tokens, the model considers an unrecognized trigram t not appearing in the training corpus as equally likely for either label: P(t|in) = P(t|out) = 0.5. The certainty of the classifier’s prediction, the probability p that a token is “in”, is then transformed into a score (p - 0.5). The supervised approach tends to fail when the local classifier mistakenly believes that a sizable set of tokens outside the article are “in”, e.g. lengthy comment sections that, to the classifier, look like article text. 15
The Supervised Approach There are three possible remedies:
- Choose a more expressive feature set or a more sophisticated classifier, with an accompanying increase in complexity.
- Add constraints that exploit our knowledge of the domain. For example, <HR> tags do not appear in the article text, but they may separate that text from other content such as comments; this constraint shows a substantial overall gain in the experimental results.
- Find a way to apply relatively higher scores to the more “important” tokens near the edges of the article text, preventing the maximum subsequence from exceeding these boundaries and erroneously including other nearby positive-valued subsequences. However, which tokens are near the borders varies from source to source. 16
The Semi-Supervised Approach Semi-supervised bootstrapping algorithm:
1. Train a Naïve Bayes local classifier on the provided labeled training examples with trigram and most-recent-unclosed-tag features, as in the supervised approach.
2. Predict extractions for the unlabeled documents U.
3. Iterate (ten iterations):
- Choose a portion of the documents in U with the seemingly most likely correct predicted extractions, and call these L.
- Find “importance weights” for the trigrams of the documents in L.
- Train a new Naïve Bayes local classifier over the documents in L with trigram features.
- Predict new extractions for the documents in U. 17
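The loop above can be sketched structurally as follows; train, predict, score and weigh are hypothetical stand-ins for the real components (Naïve Bayes training, MSS prediction, confidence scoring, trigram importance weighting), and the function shape is illustrative rather than taken from the paper:

```python
def bootstrap(labeled, unlabeled, train, predict, score, weigh,
              rounds=10, keep=0.6):
    """Semi-supervised bootstrapping skeleton: retrain on the most
    confidently extracted unlabeled documents each round."""
    model = train(labeled, weights=None)              # step 1: supervised start
    for _ in range(rounds):                           # step 3: iterate
        preds = {d: predict(model, d) for d in unlabeled}   # step 2 / last step
        # keep the fraction of documents whose extraction looks most reliable
        ranked = sorted(unlabeled, key=lambda d: score(preds[d]), reverse=True)
        L = ranked[:max(1, int(keep * len(ranked)))]
        weights = weigh(L, preds)                     # trigram importance weights
        model = train([(d, preds[d]) for d in L], weights=weights)
    return model
```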
The Semi-Supervised Approach We are then left with three questions: How to estimate the likelihood a prediction is correct. What number of the most likely correct labeled examples to put into L. How to find and apply “importance weights” for the token trigrams. 18
The Semi-Supervised Approach Remembering that documents are of varying lengths, we might try obtaining a [0, 1] score v as the length-normalized (geometric-mean) product of per-token agreement: v = ( ∏i=j..k pi × ∏i<j or i>k (1 - pi) )^(1/n), where: pi is the probability given by the Naïve Bayes predictor that token i is “in”; j is the index of the first token of the predicted article text; k is the index of the last token; and n is the total number of tokens in the document. If Naïve Bayes predicts with certainty (p = 1 or 0) that a token is “in” and this is later corrected after finding the maximum subsequence, v will be 0. 19
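This score can be sketched as below; it is a reconstruction consistent with the slide's description (zero whenever a certain prediction is overturned, normalized by document length), not a verbatim quote of the paper's formula:

```python
def prediction_confidence(p, j, k):
    """Geometric-mean agreement between the local classifier and the final
    extraction. p[i] is P(token i is "in"); tokens j..k (inclusive) were
    extracted as article text."""
    n = len(p)
    # Agreement per token: p[i] if extracted, 1 - p[i] if not.
    agree = [p[i] if j <= i <= k else 1 - p[i] for i in range(n)]
    prod = 1.0
    for a in agree:
        prod *= a
    return prod ** (1 / n)

# Token 0 was predicted "in" with certainty (p = 1.0) but excluded by the
# maximum subsequence, so the whole score collapses to 0.
print(prediction_confidence([1.0, 0.9, 0.9], j=1, k=2))  # → 0.0
```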
The Semi-Supervised Approach This suggests a windowed approach: only tokens within w positions of the predicted boundaries contribute to the score, where w is the window radius. A window radius of 10 was found to work well and is the parameter used in the experiments. Once every prediction is scored, a number of the highest-scoring are selected. In the experiments, only a limited number of unlabeled examples are available (100 from each of the five evaluation sources); 60% (60 from each source) of the seemingly best-labeled examples from each source are chosen for L. 20
The Semi-Supervised Approach “Importance weights” let more important tokens near the borders exert greater influence on the final maximum subsequence found. A weight is calculated directly for each trigram feature based upon where tokens with that feature lie relative to the edges of the article text. The “importance weight” of trigram h, mh, averages, over the set Th of all tokens with trigram h, a term that falls linearly from 1 at a border to 0 at the distance cutoff: mh = (1/|Th|) Σi∈Th max(0, (d - min(|i - j|, |i - k|)) / d), where j and k are the indices of the first and last token in the article text, respectively, and d is the distance cutoff (d = 64). 21
The Semi-Supervised Approach Importance weights vary from 1 (a trigram always appears immediately adjacent to the border between article text and non-article text) to 0 (a trigram is always found d or more tokens away from the border). If pi is the probability of “in” produced by Naïve Bayes for a token i with trigram h, then the score can be calculated as (pi - 0.5)(mh·c + 1), where c ∈ [0, ∞) is a constant; c = 24 in the experiments. 22
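The weighting and the adjusted score can be sketched as follows; the linear falloff with cutoff d matches the slides' description (1 adjacent to a border, 0 at d or beyond), but the exact averaging is a reconstruction, not quoted from the paper:

```python
def importance_weight(positions, j, k, d=64):
    """Average border-proximity weight for one trigram. `positions` holds the
    token indices where the trigram occurs; j and k are the indices of the
    first and last article-text tokens."""
    def w(i):
        dist = min(abs(i - j), abs(i - k))     # distance to the nearer border
        return max(0.0, (d - dist) / d)        # 1 at the border, 0 beyond d
    return sum(w(i) for i in positions) / len(positions)

def token_score(p, m_h, c=24):
    # Classifier probability -> signed score; near-border trigrams are boosted.
    return (p - 0.5) * (m_h * c + 1)

# One occurrence on the border, one 100 tokens away (beyond d = 64).
m = importance_weight([100, 200], j=100, k=400)
print(m)  # → 0.5
print(token_score(0.9, m))
```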
The Semi-Supervised Approach [Diagram: the Naïve Bayes model, trained on labeled examples, labels the unlabeled documents; the best-scored 60% of U form L and are fed back as training examples.] 23
Efficiency Tokenization, feature extraction, local classification, and maximum subsequence optimization may be interleaved so that the supervised method is not only linear time but also single pass. The semi-supervised approach, while still linear time, requires several passes over the documents to be labeled. Both algorithms require only a single pass over the labeled training data, after which it can be discarded. This may be a significant practical advantage since, given the appropriate wrapper, training on a source can occur while spidering, without ever having to persist pages to storage. 24
Results Baseline 1: Longest Cell or <DIV> Table cell and <div> boundaries did not typically encapsulate the article text, and merely selecting the longest of these results in, as one might expect, low precision, but high recall. 25
Results Baseline 2: Simple MSS There is a high degree of variability in performance among the sources tested—an F1-score of 99.5% for nypost.com versus 68.4% for suntimes.com 26
Results Supervised Algorithm Looking at individual sources, we find that freep.com suffers from low precision because of lengthy comment sections that the local classifier mistakes for article text. 27
Results Supervised Algorithm Though in these experiments the number of examples per source is fixed, there is an interesting potential tradeoff between adding easily-obtained examples from an existing source and writing wrappers to use new sources. 28
Results Supervised + <HR> Constraint Algorithm A simple implementation of this idea is to add a constraint that articles cannot contain an <HR> (horizontal rule) tag, truncating any extraction when one is found. 29
Results Semi-Supervised Algorithm The semi-supervised algorithm is more computationally expensive than the supervised methods but yields notably improved performance. Because the algorithm requires a large number of examples from each source, only the Big 5 sources are tested. 30
Conclusion The effectiveness of efficient algorithms utilizing maximum subsequence segmentation to extract the text of articles from HTML documents has been demonstrated. While the semi-supervised results are significantly better than the supervised, constraints may be able to bridge the gap somewhat, and the supervised methods are faster as well as single-pass. While hand-written web scrapers are likely to persist for some time, MSS provides a fast, accurate method for extracting article text from a large number of sources in a variety of domains with relatively little effort, computational or otherwise. 31

Contenu connexe

Tendances

Scalable Discovery Of Hidden Emails From Large Folders
Scalable Discovery Of Hidden Emails From Large FoldersScalable Discovery Of Hidden Emails From Large Folders
Scalable Discovery Of Hidden Emails From Large Foldersfeiwin
 
ODD EVEN BASED BINARY SEARCH
ODD EVEN BASED BINARY SEARCHODD EVEN BASED BINARY SEARCH
ODD EVEN BASED BINARY SEARCHIAEME Publication
 
Textmining Introduction
Textmining IntroductionTextmining Introduction
Textmining Introductionguest0edcaf
 
Cohesive Software Design
Cohesive Software DesignCohesive Software Design
Cohesive Software Designijtsrd
 
Probabilistic Information Retrieval
Probabilistic Information RetrievalProbabilistic Information Retrieval
Probabilistic Information RetrievalHarsh Thakkar
 
Dissertation defense slides on "Semantic Analysis for Improved Multi-document...
Dissertation defense slides on "Semantic Analysis for Improved Multi-document...Dissertation defense slides on "Semantic Analysis for Improved Multi-document...
Dissertation defense slides on "Semantic Analysis for Improved Multi-document...Quinsulon Israel
 
Topic detecton by clustering and text mining
Topic detecton by clustering and text miningTopic detecton by clustering and text mining
Topic detecton by clustering and text miningIRJET Journal
 
K anonymity for crowdsourcing database
K anonymity for crowdsourcing databaseK anonymity for crowdsourcing database
K anonymity for crowdsourcing databaseLeMeniz Infotech
 
Conceptual framework for abstractive text summarization
Conceptual framework for abstractive text summarizationConceptual framework for abstractive text summarization
Conceptual framework for abstractive text summarizationijnlc
 
CS8091_BDA_Unit_III_Content_Based_Recommendation
CS8091_BDA_Unit_III_Content_Based_RecommendationCS8091_BDA_Unit_III_Content_Based_Recommendation
CS8091_BDA_Unit_III_Content_Based_RecommendationPalani Kumar
 
Text summarization
Text summarizationText summarization
Text summarizationkareemhashem
 
Document Summarization
Document SummarizationDocument Summarization
Document SummarizationPratik Kumar
 
Extraction Based automatic summarization
Extraction Based automatic summarizationExtraction Based automatic summarization
Extraction Based automatic summarizationAbdelaziz Al-Rihawi
 
Textual Document Categorization using Bigram Maximum Likelihood and KNN
Textual Document Categorization using Bigram Maximum Likelihood and KNNTextual Document Categorization using Bigram Maximum Likelihood and KNN
Textual Document Categorization using Bigram Maximum Likelihood and KNNRounak Dhaneriya
 

Tendances (20)

Scalable Discovery Of Hidden Emails From Large Folders
Scalable Discovery Of Hidden Emails From Large FoldersScalable Discovery Of Hidden Emails From Large Folders
Scalable Discovery Of Hidden Emails From Large Folders
 
ODD EVEN BASED BINARY SEARCH
ODD EVEN BASED BINARY SEARCHODD EVEN BASED BINARY SEARCH
ODD EVEN BASED BINARY SEARCH
 
Textmining Introduction
Textmining IntroductionTextmining Introduction
Textmining Introduction
 
Cohesive Software Design
Cohesive Software DesignCohesive Software Design
Cohesive Software Design
 
Probabilistic Information Retrieval
Probabilistic Information RetrievalProbabilistic Information Retrieval
Probabilistic Information Retrieval
 
Ir 02
Ir   02Ir   02
Ir 02
 
Ir 03
Ir   03Ir   03
Ir 03
 
Text summarization
Text summarizationText summarization
Text summarization
 
Dissertation defense slides on "Semantic Analysis for Improved Multi-document...
Dissertation defense slides on "Semantic Analysis for Improved Multi-document...Dissertation defense slides on "Semantic Analysis for Improved Multi-document...
Dissertation defense slides on "Semantic Analysis for Improved Multi-document...
 
Ju3517011704
Ju3517011704Ju3517011704
Ju3517011704
 
Cg4201552556
Cg4201552556Cg4201552556
Cg4201552556
 
Topic detecton by clustering and text mining
Topic detecton by clustering and text miningTopic detecton by clustering and text mining
Topic detecton by clustering and text mining
 
K anonymity for crowdsourcing database
K anonymity for crowdsourcing databaseK anonymity for crowdsourcing database
K anonymity for crowdsourcing database
 
Conceptual framework for abstractive text summarization
Conceptual framework for abstractive text summarizationConceptual framework for abstractive text summarization
Conceptual framework for abstractive text summarization
 
CS8091_BDA_Unit_III_Content_Based_Recommendation
CS8091_BDA_Unit_III_Content_Based_RecommendationCS8091_BDA_Unit_III_Content_Based_Recommendation
CS8091_BDA_Unit_III_Content_Based_Recommendation
 
Text summarization
Text summarizationText summarization
Text summarization
 
Document Summarization
Document SummarizationDocument Summarization
Document Summarization
 
Extraction Based automatic summarization
Extraction Based automatic summarizationExtraction Based automatic summarization
Extraction Based automatic summarization
 
Textual Document Categorization using Bigram Maximum Likelihood and KNN
Textual Document Categorization using Bigram Maximum Likelihood and KNNTextual Document Categorization using Bigram Maximum Likelihood and KNN
Textual Document Categorization using Bigram Maximum Likelihood and KNN
 
Text Summarization
Text SummarizationText Summarization
Text Summarization
 

En vedette

いったい何ができる?!福岡県産BaserCMSの基本機能と秘めたポテンシャル
いったい何ができる?!福岡県産BaserCMSの基本機能と秘めたポテンシャルいったい何ができる?!福岡県産BaserCMSの基本機能と秘めたポテンシャル
いったい何ができる?!福岡県産BaserCMSの基本機能と秘めたポテンシャルRyuji Egashira
 
131101 foss4 g_tokyo_grass_shell_presentation
131101 foss4 g_tokyo_grass_shell_presentation131101 foss4 g_tokyo_grass_shell_presentation
131101 foss4 g_tokyo_grass_shell_presentationTakayuki Nuimura
 
Firefox OS in Japan
Firefox OS in JapanFirefox OS in Japan
Firefox OS in Japandynamis
 
Google Web Toolkits
Google Web ToolkitsGoogle Web Toolkits
Google Web ToolkitsYiguang Hu
 
Comments oriented blog summarization by sentence extraction
Comments oriented blog summarization by sentence extractionComments oriented blog summarization by sentence extraction
Comments oriented blog summarization by sentence extractionJhih-Ming Chen
 
Progress Report 090820 2
Progress Report 090820 2Progress Report 090820 2
Progress Report 090820 2Jhih-Ming Chen
 
Adaptive web page content identification
Adaptive web page content identificationAdaptive web page content identification
Adaptive web page content identificationJhih-Ming Chen
 
Content extraction via tag ratios
Content extraction via tag ratiosContent extraction via tag ratios
Content extraction via tag ratiosJhih-Ming Chen
 
Progress Report 20091002
Progress Report 20091002Progress Report 20091002
Progress Report 20091002Jhih-Ming Chen
 
Financial Comic Information Retrieval System
Financial Comic Information Retrieval SystemFinancial Comic Information Retrieval System
Financial Comic Information Retrieval SystemJhih-Ming Chen
 

En vedette (14)

いったい何ができる?!福岡県産BaserCMSの基本機能と秘めたポテンシャル
いったい何ができる?!福岡県産BaserCMSの基本機能と秘めたポテンシャルいったい何ができる?!福岡県産BaserCMSの基本機能と秘めたポテンシャル
いったい何ができる?!福岡県産BaserCMSの基本機能と秘めたポテンシャル
 
Osc2013 nagoya0622moodle
Osc2013 nagoya0622moodleOsc2013 nagoya0622moodle
Osc2013 nagoya0622moodle
 
Fluzz pilulas 25
Fluzz pilulas 25Fluzz pilulas 25
Fluzz pilulas 25
 
131101 foss4 g_tokyo_grass_shell_presentation
131101 foss4 g_tokyo_grass_shell_presentation131101 foss4 g_tokyo_grass_shell_presentation
131101 foss4 g_tokyo_grass_shell_presentation
 
Firefox OS in Japan
Firefox OS in JapanFirefox OS in Japan
Firefox OS in Japan
 
Google Web Toolkits
Google Web ToolkitsGoogle Web Toolkits
Google Web Toolkits
 
Comments oriented blog summarization by sentence extraction
Comments oriented blog summarization by sentence extractionComments oriented blog summarization by sentence extraction
Comments oriented blog summarization by sentence extraction
 
Progress Report 090820 2
Progress Report 090820 2Progress Report 090820 2
Progress Report 090820 2
 
Ghost
GhostGhost
Ghost
 
Adaptive web page content identification
Adaptive web page content identificationAdaptive web page content identification
Adaptive web page content identification
 
Content extraction via tag ratios
Content extraction via tag ratiosContent extraction via tag ratios
Content extraction via tag ratios
 
Progress Report 20091002
Progress Report 20091002Progress Report 20091002
Progress Report 20091002
 
Financial Comic Information Retrieval System
Financial Comic Information Retrieval SystemFinancial Comic Information Retrieval System
Financial Comic Information Retrieval System
 
Shops4b
Shops4bShops4b
Shops4b
 

Similaire à Maximum Subsequence Segmentation for Article Text Extraction

MacAD.UK 2018: Your Code Should Document Itself
MacAD.UK 2018: Your Code Should Document ItselfMacAD.UK 2018: Your Code Should Document Itself
MacAD.UK 2018: Your Code Should Document ItselfBryson Tyrrell
 
SessionTen_CaseStudies
SessionTen_CaseStudiesSessionTen_CaseStudies
SessionTen_CaseStudiesHellen Gakuruh
 
1212 regular meeting
1212 regular meeting1212 regular meeting
1212 regular meetingmarxliouville
 
User_42751212015Module1and2pagestocompetework.pdf.docx
User_42751212015Module1and2pagestocompetework.pdf.docxUser_42751212015Module1and2pagestocompetework.pdf.docx
User_42751212015Module1and2pagestocompetework.pdf.docxdickonsondorris
 
IRJET- Automatic Text Summarization using Text Rank
IRJET- Automatic Text Summarization using Text RankIRJET- Automatic Text Summarization using Text Rank
IRJET- Automatic Text Summarization using Text RankIRJET Journal
 
Topic Mining on disaster data (Robert Monné)
Topic Mining on disaster data (Robert Monné)Topic Mining on disaster data (Robert Monné)
Topic Mining on disaster data (Robert Monné)Robert Monné
 
Article link httpiveybusinessjournal.compublicationmanaging-.docx
Article link httpiveybusinessjournal.compublicationmanaging-.docxArticle link httpiveybusinessjournal.compublicationmanaging-.docx
Article link httpiveybusinessjournal.compublicationmanaging-.docxfredharris32
 
Accurately and Reliably Extracting Data from the Web:
Accurately and Reliably Extracting Data from the Web: Accurately and Reliably Extracting Data from the Web:
Accurately and Reliably Extracting Data from the Web: butest
 
Boilerplate removal and content
Boilerplate removal and contentBoilerplate removal and content
Boilerplate removal and contentIJCSEA Journal
 
Boilerplate Removal and Content Extraction from Dynamic Web Pages
Boilerplate Removal and Content Extraction from Dynamic Web PagesBoilerplate Removal and Content Extraction from Dynamic Web Pages
Boilerplate Removal and Content Extraction from Dynamic Web PagesIJCSEA Journal
 
"the Bund" language. A PEG grammar.
"the Bund" language. A PEG grammar."the Bund" language. A PEG grammar.
"the Bund" language. A PEG grammar.Vladimir Ulogov
 
Extraction of Data Using Comparable Entity Mining
Extraction of Data Using Comparable Entity MiningExtraction of Data Using Comparable Entity Mining
Extraction of Data Using Comparable Entity Miningiosrjce
 
Data structures and algorithms short note (version 14).pd
Data structures and algorithms short note (version 14).pdData structures and algorithms short note (version 14).pd
Data structures and algorithms short note (version 14).pdNimmi Weeraddana
 

Similaire à Maximum Subsequence Segmentation for Article Text Extraction (20)

Python Programming Homework Help
Python Programming Homework HelpPython Programming Homework Help
Python Programming Homework Help
 
MacAD.UK 2018: Your Code Should Document Itself
MacAD.UK 2018: Your Code Should Document ItselfMacAD.UK 2018: Your Code Should Document Itself
MacAD.UK 2018: Your Code Should Document Itself
 
SessionTen_CaseStudies
SessionTen_CaseStudiesSessionTen_CaseStudies
SessionTen_CaseStudies
 
FivaTech
FivaTechFivaTech
FivaTech
 
1212 regular meeting
1212 regular meeting1212 regular meeting
1212 regular meeting
 
User_42751212015Module1and2pagestocompetework.pdf.docx
User_42751212015Module1and2pagestocompetework.pdf.docxUser_42751212015Module1and2pagestocompetework.pdf.docx
User_42751212015Module1and2pagestocompetework.pdf.docx
 
IRJET- Automatic Text Summarization using Text Rank
IRJET- Automatic Text Summarization using Text RankIRJET- Automatic Text Summarization using Text Rank
IRJET- Automatic Text Summarization using Text Rank
 
Topic Mining on disaster data (Robert Monné)
Topic Mining on disaster data (Robert Monné)Topic Mining on disaster data (Robert Monné)
Topic Mining on disaster data (Robert Monné)
 
final
finalfinal
final
 
FinalReport
FinalReportFinalReport
FinalReport
 
Article link httpiveybusinessjournal.compublicationmanaging-.docx
Article link httpiveybusinessjournal.compublicationmanaging-.docxArticle link httpiveybusinessjournal.compublicationmanaging-.docx
Article link httpiveybusinessjournal.compublicationmanaging-.docx
 
Accurately and Reliably Extracting Data from the Web:
Accurately and Reliably Extracting Data from the Web: Accurately and Reliably Extracting Data from the Web:
Accurately and Reliably Extracting Data from the Web:
 
Module 4.pptx
Module 4.pptxModule 4.pptx
Module 4.pptx
 
Boilerplate removal and content
Boilerplate removal and contentBoilerplate removal and content
Boilerplate removal and content
 
Boilerplate Removal and Content Extraction from Dynamic Web Pages
Boilerplate Removal and Content Extraction from Dynamic Web PagesBoilerplate Removal and Content Extraction from Dynamic Web Pages
Boilerplate Removal and Content Extraction from Dynamic Web Pages
 
"the Bund" language. A PEG grammar.
"the Bund" language. A PEG grammar."the Bund" language. A PEG grammar.
"the Bund" language. A PEG grammar.
 
Extraction of Data Using Comparable Entity Mining
Extraction of Data Using Comparable Entity MiningExtraction of Data Using Comparable Entity Mining
Extraction of Data Using Comparable Entity Mining
 
E017252831
E017252831E017252831
E017252831
 
Data structures and algorithms short note (version 14).pd
Data structures and algorithms short note (version 14).pdData structures and algorithms short note (version 14).pd
Data structures and algorithms short note (version 14).pd
 
Tldr
TldrTldr
Tldr
 

Maximum Subsequence Segmentation for Article Text Extraction

  • 1. Extracting Article Text from the Web with Maximum Subsequence Segmentation Author: Jeff Pasternack, Dan Roth Publication: WWW’2009 Presenter: Jhih-Ming Chen 1
  • 2. Outline Introduction Application Problem Description Identifying Articles Cleaning Extracted Blocks Experimental Setup Data Tokenization Algorithms Maximum Subsequence Optimization Applying Maximum Subsequence Results Conclusion 2
  • 3. Introduction Extracting article text, as with cleaning HTML documents ingeneral,has become much more complex. page layout contain sidebars, inserts,headers, navigation panels, ads, etc.. Most commonapproach has been to hand-code rules, often as regularexpressions. these can be very accurate, but are highly laborintensive and easily broken by changes to a page’s structure. Thishas led to interest to finding a better solution. 3
  • 4. Application There are a number of applications where article text extraction is either essential or can improve overall results. As devices such as PDAs and cell-phones have become commonplace, adapting web pages for small screens has become increasing important. In the context of article pages the article text is the primary concern of the user, and, once extracted. such as automatically generating RSS feeds from article pages. Eliminating the irrelevant context surrounding the article text is also useful in information retrieval. Query expansion benefits for the same reason, as unrelated terms in the page outside the article can be ignored. 4
  • 5. Problem Description Given a page, we would like to Decide if the page contains an article. Identify a contiguous block of HTML in the page starting with the first word in the article and ending with the last. Remove everything other than the article text (and its included markup tags) itself. As a practical matter, in the case where the first word or last word is nested within one or more pairs of tags. 5
  • 6. Identifying Articles The first step, determining whether a page contains an article, is a document classification problem. This paper’s evaluation implicitly assumes that such a classifier is provided, since all our testing examples contain articles. No such assumption is made in training. A complication arises in determining exactly where an article begins and where it ends. A news article typically follows the pattern “headline -> bylines -> main text -> bylines”. 6
  • 7. Identifying Articles An extraction may start at the first word of a headline. the first word of an authorship byline before themain text (e.g. “By John Doe”) or the organization byline (e.g.“The Associated Press”). if no author is given, the first word of the main text. An extraction may end at the last word of the main text. the last word of an authorship byline or organization byline appearing after the main text the first word of a headline. Articles may thus have up to six possible extractions. 7
  • 8. Identifying Articles When experimental results are reported, predicted extractions are compared: The main text alone (The shortest of the possible extractions. This is what the wrappers that generate the training examples). and The “best” extraction, the one that yields the most favorable F1-score for that particular prediction. 8
  • 9. Cleaning Extracted Blocks After inspecting the evaluation sets, authors found this could be done using just a few rules with little error. Starting with an extracted block of HTML that starts at thefirst word and ends at the last word of the main text of the article: Remove the contents of any <IFRAME> or <TABLE> tag pair. Remove the contents of any <DIV> tag pair that contains a hyperlink (<A>), <IFRAME>, <TABLE>, <IMG>, <EMBED>, <APPLET> or <OBJECT>. This heuristic works well because the text of most news articles very rarely contains links, and embeds are intended to either show you something (e.g. a picture) or get you to go somewhere (e.g. a link). Outside the news domain will not always be sofortunate. 9
  • 10. Experimental Setup Data 24000 training examples were semi-automatically gathered from twelve news websites. For each of the twelve training domains, a few pages were examined and a wrapper was used to extract the article text. These wrappers were then used to label the pages we had spidered from the source 450 evaluation examples were gathered from forty-five news websites (Big 5 and Myriad 40) and tagged by hand. “Big 5” set—fifty examples were tagged for each of the first five evaluation sources. “Myriad 40” set consists of the remaining forty sources chosen randomly from a list of several hundred (obtained from the Yahoo! Directory) and contains an international mix of English-language sites of widely varying size and sophistication. 10
  • 11. Experimental Setup Tokenization Breaking up the HTML into a list of tags, words and symbols, where numbers are considered to be “words”. When tokenizing pages, a few transformations are applied: A list of “common” tags was created by examining the training data and selecting all tags that occurred in pages of more than one source. Any tag not in this list is transformed into a generic <UNKNOWN>. Porter stemming is then applied to non-numeric words, and all numbers are stemmed to “1”. Everything inside <SCRIPT> and <STYLE> tag pairs is discarded. 11
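A minimal sketch of these token transformations, assuming a hypothetical COMMON_TAGS whitelist (the real list was derived from the training data) and using lowercasing as a stand-in for Porter stemming:

```python
import re

# Hypothetical whitelist; the paper built it from tags seen in pages
# of more than one training source.
COMMON_TAGS = {"<p>", "<div>", "<a>", "<td>", "<table>", "<img>", "<br>"}

def normalize_token(token):
    """Apply the slide's transformations to one token."""
    if token.startswith("<"):
        # Tags outside the common list become a generic placeholder.
        return token if token in COMMON_TAGS else "<UNKNOWN>"
    if re.fullmatch(r"\d+", token):
        return "1"  # all numbers are stemmed to "1"
    return token.lower()  # stand-in for Porter stemming

print([normalize_token(t) for t in ["<p>", "Reuters", "2009", "<marquee>"]])
# → ['<p>', 'reuters', '1', '<UNKNOWN>']
```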
  • 12. Algorithms Maximum Subsequence Optimization Given a sequence S = (s1, s2, …, sn), where si ∈ R, a subsequence of S is a contiguous run T = (sa, sa+1, …, sb) with 1 ≤ a ≤ b ≤ n, and its value is the sum of its elements. A maximum subsequence is one whose value is at least that of every other subsequence. 12
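A maximum subsequence can be found in linear time; the sketch below is a generic implementation of Kadane's algorithm, not code from the paper:

```python
def max_subsequence(scores):
    """Kadane's algorithm: return (value, start, end) of a maximum
    subsequence of `scores` in one linear pass. `end` is inclusive."""
    best_value, best_start, best_end = float("-inf"), 0, -1
    run_value, run_start = 0.0, 0
    for i, s in enumerate(scores):
        if run_value <= 0:          # a non-positive prefix never helps
            run_value, run_start = 0.0, i
        run_value += s
        if run_value > best_value:
            best_value, best_start, best_end = run_value, run_start, i
    return best_value, best_start, best_end

print(max_subsequence([-2, 1, 3, -1, 2, -5, 1]))  # → (5.0, 1, 4)
```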
  • 13. Applying Maximum Subsequence Baseline 1: Longest Cell or <DIV> Adapted from [17], this baseline selects the table cell containing the most characters. This paper modified it to also consider the contents of <DIV> tags, since not all articles in the evaluation set were embedded in a table. Baseline 2: Simple MSS Find a maximum subsequence of a sequence of scores generated with an extremely simple scoring function that assigns a score of -3.25 to every tag and 1 to every word and symbol. 13
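A self-contained sketch of Baseline 2, using the -3.25-per-tag and +1-per-word scores from the slide with a linear-time maximum-subsequence search (the startswith("<") tag test is a simplification for illustration):

```python
def simple_mss(tokens):
    """Baseline 2: score every tag -3.25 and every word/symbol +1,
    then take the maximum-scoring contiguous token range (Kadane)."""
    scores = [-3.25 if t.startswith("<") else 1.0 for t in tokens]
    best, best_range = float("-inf"), (0, 0)
    run, start = 0.0, 0
    for i, s in enumerate(scores):
        if run <= 0:
            run, start = 0.0, i
        run += s
        if run > best:
            best, best_range = run, (start, i)
    lo, hi = best_range
    return tokens[lo:hi + 1]

page = ["<div>", "<a>", "home", "</a>", "</div>",
        "<p>", "the", "quick", "brown", "fox", "</p>"]
print(simple_mss(page))  # → ['the', 'quick', 'brown', 'fox']
```

The -3.25 penalty makes tag-dense navigation regions strongly negative, so the maximum subsequence lands on the tag-sparse article body.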
  • 14. The Supervised Approach Maximum subsequence segmentation uses local, token-level classifiers to find a score for each token in the document, resulting in a sequence of scores. The classifiers used in the experiments are all Naïve Bayes with only two types of features. trigrams: given a list of n tokens U = (u1, u2, …, un), the trigram of token i is <ui, ui+1, ui+2>. most-recent-unclosed-tag: when a new opening tag is encountered, it is added to a stack; when a closing tag is found, the stack unwinds until a corresponding opening tag is matched and removed. The most-recent-unclosed-tag is the tag currently at the top of the stack. Each token carries an “in” or “out” label depending upon whether it is within the article text or not. 14
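A simplified sketch of the two feature types (padding tokens and the exact tag-matching rules are assumptions; real HTML needs more careful handling):

```python
def token_features(tokens):
    """For each token, emit (trigram, most_recent_unclosed_tag) as the
    two feature types used by the Naive Bayes local classifier."""
    features, stack = [], []
    for i, tok in enumerate(tokens):
        # Trigram starting at this token, padded at the end of the page.
        window = tokens[i:i + 3]
        trigram = tuple(window + ["<PAD>"] * (3 - len(window)))
        mrut = stack[-1] if stack else None
        features.append((trigram, mrut))
        if tok.startswith("</"):
            opening = "<" + tok[2:]
            # Unwind until the matching opening tag is removed.
            while stack and stack.pop() != opening:
                pass
        elif tok.startswith("<"):
            stack.append(tok)
    return features

feats = token_features(["<div>", "<p>", "hi", "</p>", "there", "</div>"])
print(feats[2])  # → (('hi', '</p>', 'there'), '<p>')
```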
  • 15. The Supervised Approach When classifying new tokens, the model will consider an unrecognized trigram t not appearing in the training corpus as equally likely for either label: P(t|in) = P(t|out) = 0.5. The certainty of the classifier’s prediction, the probability p that a token is “in”, is then transformed into a score (p - 0.5). The supervised approach tends to fail when the local classifier mistakenly believes that a sizable set of tokens outside the article are “in”, e.g. lengthy comment sections that, to the classifier, look like article text. 15
  • 16. The Supervised Approach There are three possible remedies. Choose a more expressive feature set or a more sophisticated classifier, with an accompanying increase in complexity. Add constraints that exploit our knowledge of the domain: <HR> tags are not in the article text, but they may separate that text from other content such as comments. This shows a substantial overall gain in the experimental results. Find a way to apply relatively higher scores to the more “important” tokens near the edges of the article text, preventing the maximum subsequence from exceeding these boundaries to erroneously include other nearby positive-valued subsequences. However, which tokens are near the borders varies from source to source. 16
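One simple way to realize the <HR> constraint, sketched under the assumption that token scores are already computed: since the extraction may not span an <hr> tag, run the maximum-subsequence search separately on each segment between <hr> tags and keep the best one.

```python
def mss_with_hr_constraint(tokens, scores):
    """Enforce the <HR> constraint by searching each <hr>-free segment
    separately and returning the highest-valued extraction."""
    def kadane(lo, hi):            # best (value, (start, end)) in [lo, hi)
        best, rng = float("-inf"), (lo, lo)
        run, start = 0.0, lo
        for i in range(lo, hi):
            if run <= 0:
                run, start = 0.0, i
            run += scores[i]
            if run > best:
                best, rng = run, (start, i)
        return best, rng

    # Segment boundaries at every <hr> token.
    cuts = [i for i, t in enumerate(tokens) if t.lower().startswith("<hr")]
    bounds = [0] + [c + 1 for c in cuts] + [len(tokens)]
    candidates = [kadane(lo, hi) for lo, hi in zip(bounds, bounds[1:]) if lo < hi]
    _, (s, e) = max(candidates)
    return tokens[s:e + 1]

print(mss_with_hr_constraint(["good", "story", "<hr>", "angry", "comment"],
                             [2, 2, -3.25, 1, 1]))  # → ['good', 'story']
```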
  • 17. The Semi-Supervised Approach Semi-supervised bootstrapping algorithm Train a Naïve Bayes local classifier on the provided labeled training examples with trigram and most-recent-unclosed-tag features, as in the supervised approach. Predict extractions for the unlabeled documents U. Iterate (ten iterations): Choose a portion of the documents in U with the seemingly most likely correct predicted extractions, and call these L. Find “importance weights” for the trigrams of the documents in L. Train a new Naïve Bayes local classifier over the documents in L with trigram features. Predict new extractions for the documents in U. 17
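The loop above can be sketched as a skeleton in which the training, prediction, and confidence-scoring steps are caller-supplied functions (the names below are hypothetical; importance-weight estimation is omitted):

```python
def bootstrap(train_nb, predict, confidence, labeled, unlabeled,
              iterations=10, keep=0.6):
    """Skeleton of the slide's bootstrapping loop.  train_nb trains a
    Naive Bayes local classifier, predict extracts article text from
    one document, and confidence scores how likely a prediction is to
    be correct."""
    model = train_nb(labeled)                                  # step 1
    predictions = {d: predict(model, d) for d in unlabeled}    # step 2
    for _ in range(iterations):                                # step 3
        ranked = sorted(unlabeled, reverse=True,
                        key=lambda d: confidence(model, d, predictions[d]))
        L = ranked[:int(keep * len(ranked))]   # most confident portion
        model = train_nb([(d, predictions[d]) for d in L])
        predictions = {d: predict(model, d) for d in unlabeled}
    return model, predictions

# Toy stand-ins, just to exercise the control flow:
toy_model, toy_preds = bootstrap(
    train_nb=lambda data: len(data),
    predict=lambda m, d: (0, len(d)),
    confidence=lambda m, d, p: len(d),
    labeled=[], unlabeled=["aa", "bbbb", "c"], iterations=2)
print(toy_preds["bbbb"])  # → (0, 4)
```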
  • 18. The Semi-Supervised Approach We are then left with three questions: How to estimate the likelihood a prediction is correct. What number of the most likely correct labeled examples to put into L. How to find and apply “importance weights” for the token trigrams. 18
  • 19. The Semi-Supervised Approach Remembering that documents are of varying lengths, we might try obtaining a [0, 1] score v as the geometric mean, over all n tokens, of the probability the classifier assigned to the label implied by the predicted extraction: v = (∏i=j..k pi · ∏i<j or i>k (1 − pi))^(1/n), where pi is the probability given by the Naïve Bayes predictor that a token is “in”, j is the index of the first token of the predicted article text, k is the index of the last token, and n is the total number of tokens in the document. If Naïve Bayes predicts with certainty (p = 1 or 0) that a token is “in” and this is later corrected after finding the maximum subsequence, v will be 0. 19
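A sketch of one plausible such score, assuming v is the geometric mean over all n tokens of the probability the classifier assigned to the label implied by the extraction (the paper's exact formula may differ):

```python
def prediction_confidence(p, j, k):
    """Confidence v in [0, 1] for a predicted extraction spanning
    tokens j..k (inclusive), given per-token P(in) values in p."""
    product = 1.0
    for i, p_in in enumerate(p):
        # Probability assigned to the label the extraction implies.
        product *= p_in if j <= i <= k else (1.0 - p_in)
    return product ** (1.0 / len(p))

# A confidently wrong token (P(in) = 1.0 outside the extraction)
# drives v all the way to zero:
print(prediction_confidence([0.9, 0.8, 1.0], j=0, k=1))  # → 0.0
```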
  • 20. The Semi-Supervised Approach This suggests a windowed approach: compute v only over the tokens within w tokens of the predicted borders j and k, where w is the window radius. A window radius of 10 was found to work well and is the parameter used in the experiments. Once every prediction is scored, a number of the highest-scoring are selected. In the experiments, only a limited number of unlabeled examples are available (100 from each of the five evaluation sources), so the top 60% (60 from each source) of the seemingly best-labeled examples from each source are chosen for L. 20
  • 21. The Semi-Supervised Approach “Importance weight” More important tokens near the borders can exert greater influence on the final maximum subsequence found. Directly calculate a weight for each trigram feature based upon where tokens with that feature are relative to the edges of the article text. The “importance weight” of trigram h, mh, averages how close each such token lies to the nearer border: mh = (1/|Th|) Σi∈Th max(d − min(|i − j|, |i − k|), 0)/d, where Th is the set of all tokens with trigram h, j and k are the indices of the first and last token in the article text, respectively, and d is the distance cutoff (d = 64). 21
  • 22. The Semi-Supervised Approach Importance weights vary from 1 (a trigram always appears immediately adjacent to the border between article text and non-article text) to 0 (a trigram is always found d or more tokens away from the border). If pi is the probability of “in” produced by Naïve Bayes for a token i with trigram h, then its score can be calculated as (pi − 0.5)(mh · c + 1), where c is a nonnegative real constant; c = 24 in the experiments. 22
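A small sketch of the weighted scoring. The importance weight here is an average closeness-to-border, one plausible reading of the slides (positions, j, k are token indices; d, c are the cutoff and scaling constants from the slides):

```python
def importance_weight(positions, j, k, d=64):
    """m_h for a trigram occurring at `positions`, given article
    borders j and k: 1 when every occurrence sits on a border,
    0 when every occurrence is d or more tokens away."""
    total = 0.0
    for i in positions:
        dist = min(abs(i - j), abs(i - k))
        total += max(d - dist, 0) / d
    return total / len(positions)

def weighted_score(p_in, m_h, c=24.0):
    """Final token score (p - 0.5)(m_h * c + 1) from the slide."""
    return (p_in - 0.5) * (m_h * c + 1)

# One occurrence on the border (weight 1), one far away (weight 0):
m = importance_weight([100, 400], j=100, k=300, d=64)
print(m, weighted_score(0.9, m))
```

Multiplying by (mh · c + 1) rather than mh alone keeps every token's score nonzero, so tokens far from any border still contribute to the subsequence value.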
  • 23. The Semi-Supervised Approach (Diagram: the bootstrapping loop. A Naïve Bayes model trained on labeled training examples predicts extractions for the unlabeled documents U; the most confident 60% of those predictions become the new labeled set L, on which the model is retrained.) 23
  • 24. Efficiency Tokenization, feature extraction, local classification, and maximum subsequence optimization may be interleaved so that the supervised method is not only linear time but also single pass. The semi-supervised approach, while still linear time, requires several passes over the documents to be labeled. Both algorithms require only a single pass over the labeled training data, after which it can be discarded. This may be a significant practical advantage since, given the appropriate wrapper, training on a source can occur while spidering, without ever having to persist pages to storage. 24
  • 25. Results Baseline 1: Longest Cell or <DIV> Table cell and <div> boundaries did not typically encapsulate the article text, and merely selecting the longest of these results in, as one might expect, low precision, but high recall. 25
  • 26. Results Baseline 2: Simple MSS There is a high degree of variability in performance among the sources tested—an F1-score of 99.5% for nypost.com versus 68.4% for suntimes.com 26
  • 27. Results Supervised Algorithm Looking at individual sources, we find that freep.com suffers from low precision because of lengthy comment sections that the local classifier mistakes for article text. 27
  • 28. Results Supervised Algorithm Though in these experiments the number of examples per source is fixed, there is an interesting potential tradeoff between adding easily-obtained examples from an existing source and writing wrappers to use new sources. 28
  • 29. Results Supervised + <HR> Constraint Algorithm A simple implementation of this idea is to add a constraint that articles cannot contain an <HR> (horizontal rule) tag, truncating any extraction when one is found. 29
  • 30. Results Semi-Supervised Algorithm The semi-supervised algorithm is more computationally expensive than the supervised methods but yields notably improved performance. Because the algorithm requires a large number of examples from each source, only the Big 5 sources are tested. 30
  • 31. Conclusion The effectiveness of efficient algorithms utilizing maximum subsequence segmentation to extract the text of articles from HTML documents has been demonstrated. While the semi-supervised results are significantly better than the supervised, constraints may be able to bridge the gap somewhat, and the supervised methods are faster as well as single-pass. While hand-written web scrapers are likely to persist for some time, MSS provides a fast, accurate method for extracting article text from a large number of sources in a variety of domains with relatively little effort, computational or otherwise. 31