Text Statistics & Document Parsing
Lecture 7
Sean A. Golliher
- Only collect URLs on www.montana.edu
  - List the total number of unique URLs
  - No sub-domains (www.cs.montana.edu, etc.)
- Assume the PageRank of the home page is PR7 during the 1st iteration
- Calculate the Google matrix of all sub-URLs
- Use the Google matrix formula
- Deliver a list of sub-URLs and their PageRanks
- Read: http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/pubs/archive/37043.pdf
- TREC: test collections of documents (an evaluation corpus)
- Text REtrieval Conference: http://trec.nist.gov
- Tweets2011: http://trec.nist.gov/data/tweets/
- Twitter corpus tools: https://github.com/lintool/twitter-corpus-tools
- Papers from 2011: http://trec.nist.gov/pubs/trec20/t20.proceedings.html
- Linguistic Data Consortium: http://www.ldc.upenn.edu/
- http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html (Not exactly: $24k)
• Converting documents to index terms (last time)
• Why?
  ◦ Matching the exact string of characters typed by the user is too restrictive
    ○ i.e., it doesn't work very well in terms of effectiveness
  ◦ Not all words are of equal value in a search
  ◦ Sometimes not clear where words begin and end
    ○ Not even clear what a word is in some languages
      ▪ e.g., Chinese, Korean
• Huge variety of words used in text, but many statistical characteristics of word occurrences are predictable
  ◦ e.g., distribution of word counts
• Retrieval models and ranking algorithms depend heavily on statistical properties of words
  ◦ e.g., important words occur often in documents but are not high frequency in the collection
• Distribution of word frequencies is very skewed
  ◦ a few words occur very often; many words hardly ever occur
  ◦ e.g., the two most common words ("the", "of") make up about 10% of all word occurrences in text documents
• Zipf's "law":
  ◦ observation that the rank (r) of a word times its frequency (f) is approximately a constant (k)
    ○ assuming words are ranked in order of decreasing frequency
• r = rank = position in the frequency table, ranked with the most frequent word 1st
• f = % of word occurrences in the corpus
• r = k/f
• In other words, the most frequent word will appear ~2x as often as the 2nd most frequent word.
• Probability of occurrence of a word
  ◦ Pr = f/n
    ○ n is the total number of word occurrences in the text
    ○ f is the frequency of the word
• Zipf's law becomes r * Pr = c
• For English, c ≈ 0.1
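A minimal Python sketch of this check (the file name corpus.txt is a placeholder, not from the lecture): rank words by frequency and print r * Pr for the top entries; on English text the product should hover near 0.1.

```python
# Sketch: check Zipf's law (r * Pr ~ c) on any plain-text file.
import re
from collections import Counter

text = open("corpus.txt", encoding="utf-8").read().lower()
words = re.findall(r"[a-z0-9]+", text)
n = len(words)                                  # total word occurrences

counts = Counter(words)
for r, (word, f) in enumerate(counts.most_common(10), start=1):
    pr = f / n                                  # probability of occurrence, Pr = f/n
    print(f"{word:12s} r={r:3d}  Pr={pr:.5f}  r*Pr={r * pr:.3f}")
```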
Statistics for the AP89 collection:

  Total documents                   84,678
  Total word occurrences        39,749,179
  Vocabulary size                  198,763
  Words occurring > 1000 times       4,169
  Words occurring once              70,064

  Word         Freq.   r         Pr (%)         r * Pr
  assistant    5,095   1,021     .013           0.13
  sewers       100     17,110    2.56 × 10^-4   0.04
  toothbrush   10      51,555    2.56 × 10^-5   0.01
  hazmat       1       166,945   2.56 × 10^-6   0.04
• Note problems at high and low frequencies
• As the corpus grows, so does the vocabulary size
  ◦ Fewer new words when the corpus is already large
• Observed relationship (Heaps' Law):

      v = k · n^β

  where v is the vocabulary size (number of unique words), n is the number of words in the corpus, and k, β are parameters that vary for each corpus (typical values are 10 ≤ k ≤ 100 and β ≈ 0.5)
• Predictions for TREC collections are accurate for large numbers of words
  ◦ e.g., first 10,879,522 words of the AP89 collection scanned
  ◦ prediction is 100,151 unique words
  ◦ actual number is 100,024
• Predictions for small numbers of words (i.e. < 1000) are much worse
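As a sketch, the AP89 prediction above can be reproduced directly from the formula; the parameter values used here (k = 62.95, β = 0.455) are the ones commonly reported for AP89 and are an assumption, not part of the slides:

```python
def heaps(n, k=62.95, beta=0.455):
    """Predicted vocabulary size after n word occurrences (Heaps' law)."""
    return k * n ** beta

print(round(heaps(10_879_522)))   # ~100,151 unique words, vs. 100,024 observed
```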
• Heaps' Law works with very large corpora
  ◦ new words keep occurring even after seeing 30 million words!
• New words come from a variety of sources
    ○ spelling errors, invented words (e.g. product and company names), code, other languages, email addresses, etc.
• Search engines must deal with these large and growing vocabularies
• How many pages contain all of the query terms?
• For the query "a b c":
• P(a ∩ b ∩ c) = P(a) * P(b) * P(c) (joint probability)
• P(a) = f_a / N, and similarly for b and c
• f_abc = N · (f_a/N) · (f_b/N) · (f_c/N) = (f_a · f_b · f_c) / N²
    ○ Since f = N * P
    ○ Assuming that terms occur independently
    ○ f_abc is the estimated size of the result set
    ○ f_a, f_b, f_c are the number of documents that terms a, b, and c occur in
    ○ N is the number of documents in the collection
Collection size (N) is 25,205,179
• Poor estimates because words are not independent
• Better estimates are possible if co-occurrence information is available (see the sketch below)
  ◦ P(c | (a ∩ b))
  ◦ the probability that word c occurs given that both a and b occur
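A minimal sketch of both estimates; N comes from the slide, while the document frequencies below are made-up placeholders, not real counts:

```python
N = 25_205_179   # collection size (N) from the slide

def independent_estimate(fa, fb, fc, n=N):
    # f_abc = N * (fa/N) * (fb/N) * (fc/N) = fa * fb * fc / N^2
    return fa * fb * fc / n**2

def cooccurrence_estimate(f_ab, p_c_given_ab):
    # With co-occurrence data: f_abc = f_ab * P(c | a AND b)
    return f_ab * p_c_given_ab

# Illustrative document frequencies (placeholders):
print(round(independent_estimate(1_000_000, 100_000, 50_000)))
print(round(cooccurrence_estimate(20_000, 0.3)))
```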
• Tokenizing: forming words from a sequence of characters
• Surprisingly complex in English; can be harder in other languages
• Early IR systems:
  ◦ any sequence of alphanumeric characters of length 3 or more
  ◦ terminated by a space or other special character
  ◦ upper-case changed to lower-case
• Example (sketch below):
  ◦ "Bigcorp's 2007 bi-annual report showed profits rose 10%." becomes
  ◦ "bigcorp 2007 annual report showed profits rose"
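A sketch of the early-IR tokenizer just described (alphanumeric runs of length 3 or more, lowercased); it reproduces the Bigcorp example:

```python
import re

def early_ir_tokens(text):
    # Alphanumeric sequences, length >= 3, converted to lower-case.
    return [t.lower() for t in re.findall(r"[A-Za-z0-9]+", text) if len(t) >= 3]

s = "Bigcorp's 2007 bi-annual report showed profits rose 10%."
print(early_ir_tokens(s))
# ['bigcorp', '2007', 'annual', 'report', 'showed', 'profits', 'rose']
```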
• Too simple for search applications or even large-scale experiments
• Why? Too much information lost
  ◦ Small decisions in tokenizing can have major impact on effectiveness of some queries
• Small words can be important in some queries, usually in combinations
    ○ xp, ma, pm, ben e king, el paso, master p, gm, j lo, world war II
• Both hyphenated and non-hyphenated forms of many words are common
  ◦ Sometimes the hyphen is not needed
    ○ e-bay, wal-mart, active-x, cd-rom, t-shirts
  ◦ At other times, hyphens should be considered either as part of the word or as a word separator
    ○ winston-salem, mazda rx-7, e-cards, pre-diabetes, t-mobile, spanish-speaking
• Special characters are an important part of tags, URLs, and code in documents
• Capitalized words can have different meanings from lower-case words
  ◦ Bush, Apple
• Apostrophes can be part of a word, part of a possessive, or just a mistake
  ◦ rosie o'donnell, can't, don't, 80's, 1890's, men's straw hats, master's degree, england's ten largest cities, shriner's
• Numbers can be important, including decimals
  ◦ nokia 3250, top 10 courses, united 93, quicktime 6.5 pro, 92.3 the beat, 288358
• Periods can occur in numbers, abbreviations, URLs, ends of sentences, and other situations
  ◦ I.B.M., Ph.D., cs.umass.edu, F.E.A.R.
• First step is to use a parser to identify the appropriate parts of a document to tokenize
• Defer complex decisions to other components
  ◦ a word is any sequence of alphanumeric characters, terminated by a space or special character, with everything converted to lower-case
  ◦ everything indexed
  ◦ example: 92.3 → 92 3, but search finds documents with 92 and 3 adjacent
  ◦ incorporate some rules to reduce dependence on query transformation components
• Not that different than the simple tokenizing process used in the past
• Examples of rules used with TREC (sketch below):
  ◦ Apostrophes in words ignored
    ○ o'connor → oconnor, bob's → bobs
  ◦ Periods in abbreviations ignored
    ○ I.B.M. → ibm, Ph.D. → ph d
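A hedged sketch of those two rules (not the actual TREC code); the single-letter heuristic for dotted abbreviations is an assumption that happens to reproduce both examples:

```python
def normalize(token):
    """Drop apostrophes, then handle periods in abbreviations."""
    token = token.lower().replace("'", "")        # o'connor -> oconnor, bob's -> bobs
    if "." in token:
        parts = [p for p in token.split(".") if p]
        if all(len(p) == 1 for p in parts):       # all single letters: i.b.m. -> ibm
            return ["".join(parts)]
        return parts                              # otherwise split: ph.d. -> ['ph', 'd']
    return [token]

print(normalize("O'Connor"), normalize("I.B.M."), normalize("Ph.D."))
# ['oconnor'] ['ibm'] ['ph', 'd']
```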
• Function words (determiners, prepositions) have little meaning on their own
• High occurrence frequencies
• Treated as stopwords (i.e. removed)
  ◦ reduce index space, improve response time, improve effectiveness
• Can be important in combinations
  ◦ e.g., "to be or not to be"
• Stopword list can be created from high-frequency words or based on a standard list
• Lists are customized for applications, domains, and even parts of documents
  ◦ e.g., "click" is a good stopword for anchor text
• Best policy is to index all words in documents and make decisions about which words to use at query time (sketch below)
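A toy sketch of that index-everything, filter-at-query-time policy; the stopword list here is a tiny illustrative one, not a standard list:

```python
STOPWORDS = {"the", "of", "to", "a", "and", "in", "is", "or", "not", "be"}

def query_terms(query, keep_stopwords=False):
    """Decide at query time whether stopwords should survive."""
    terms = query.lower().split()
    if keep_stopwords:
        return terms
    return [t for t in terms if t not in STOPWORDS]

print(query_terms("to be or not to be"))                       # [] -- all stopwords!
print(query_terms("to be or not to be", keep_stopwords=True))  # full phrase preserved
```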
• Many morphological variations of words
  ◦ inflectional (plurals, tenses)
  ◦ derivational (making verbs nouns, etc.)
• In most cases, these have the same or very similar meanings
• Stemmers attempt to reduce morphological variations of words to a common stem
  ◦ usually involves removing suffixes
• Can be done at indexing time or as part of query processing (like stopwords)
• A user searching for news about Michael Phelps types "Michael Phelps Swimming"
• News articles about the latest events describe things that already happened, using forms like "swam"
• The search engine needs to match "swimming" to "swam"
• Generally a small but significant effectiveness improvement
• Finding words that are semantically similar
  ◦ can be crucial for some languages
  ◦ e.g., 5-10% improvement for English, up to 50% in Arabic

[Figure: words with the Arabic root ktb]
• Two basic types
  ◦ Dictionary-based: uses lists of related words
  ◦ Algorithmic: uses a program to determine related words
• Algorithmic stemmers (sketch below)
  ◦ suffix-s: remove 's' endings, assuming plural
    ○ e.g., cats → cat, lakes → lake, wiis → wii
    ○ Many false negatives: supplies → supplie
    ○ Some false positives: ups → up
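The suffix-s stemmer is small enough to sketch directly, failure modes included:

```python
def suffix_s(word):
    """Strip a trailing 's' on the assumption it marks a plural."""
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

for w in ["cats", "lakes", "wiis", "supplies", "ups"]:
    print(w, "->", suffix_s(w))
# supplies -> supplie  (false negative: misses 'supply')
# ups -> up            (false positive: wrongly conflated)
```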
• The Porter stemmer: an algorithmic stemmer used in IR experiments since the 70s
• Consists of a series of rules designed to strip the longest possible suffix at each step
• Effective in TREC
• Produces stems, not words
• Makes a number of errors and is difficult to modify
[Figure: Porter stemmer, example step (1 of 5)]
• The Porter2 stemmer addresses some of these issues
• The approach has been used with other languages
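For experimentation, NLTK ships both algorithms (this assumes nltk is installed); SnowballStemmer("english") is Porter2. Note that an irregular form like "swam" is left untouched, which is exactly the Michael Phelps problem above:

```python
from nltk.stem import PorterStemmer
from nltk.stem.snowball import SnowballStemmer

porter = PorterStemmer()                 # classic Porter stemmer
porter2 = SnowballStemmer("english")     # Porter2 / Snowball

for w in ["swimming", "swam", "organization", "generously"]:
    print(w, "->", porter.stem(w), "/", porter2.stem(w))
# 'swimming' reduces to 'swim', but the irregular 'swam' is unchanged:
# suffix-stripping alone cannot conflate them.
```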
• Hybrid algorithmic-dictionary (the Krovetz stemmer)
  ◦ Word checked in dictionary
    ○ If present, either left alone or replaced with "exception"
    ○ If not present, word is checked for suffixes that could be removed
    ○ After removal, dictionary is checked again
• Produces words, not stems
• Comparable effectiveness
• Lower false positive rate, somewhat higher false negative rate
• Many queries are 2-3 word phrases
• Phrases are
  ◦ More precise than single words
    ○ e.g., documents containing "black sea" vs. the two words "black" and "sea"
  ◦ Less ambiguous
    ○ e.g., "big apple" vs. "apple"
• Can be difficult for ranking
    ○ e.g., given the query "fishing supplies", how do we score documents with
      ▪ the exact phrase many times, the exact phrase just once, the individual words in the same sentence, the same paragraph, the whole document, variations on the words?
• Text processing issue: how are phrases recognized?
• Three possible approaches:
  ◦ Identify syntactic phrases using a part-of-speech (POS) tagger
  ◦ Use word n-grams
  ◦ Store word positions in indexes and use proximity operators in queries
• POS taggers use statistical models of text to predict syntactic tags of words (sketch below)
  ◦ Example tags:
    ○ NN (singular noun), NNS (plural noun), VB (verb), VBD (verb, past tense), VBN (verb, past participle), IN (preposition), JJ (adjective), CC (conjunction, e.g., "and", "or"), PRP (pronoun), and MD (modal auxiliary, e.g., "can", "will")
• Phrases can then be defined as simple noun groups, for example
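A sketch of the idea using NLTK's tagger (assumes nltk plus its tokenizer and tagger models are downloaded): tag the words, then take maximal runs of adjectives/nouns as simple noun groups. The example sentence is illustrative:

```python
import nltk

tokens = nltk.word_tokenize("The document parser recognizes important anchor text")
tagged = nltk.pos_tag(tokens)   # e.g. [('The', 'DT'), ('document', 'NN'), ...]

phrase, phrases = [], []
for word, tag in tagged:
    if tag.startswith(("JJ", "NN")):     # adjectives and nouns extend the group
        phrase.append(word)
    else:                                # anything else closes it
        if len(phrase) > 1:
            phrases.append(" ".join(phrase))
        phrase = []
if len(phrase) > 1:
    phrases.append(" ".join(phrase))
print(phrases)   # e.g. ['document parser', 'important anchor text']
```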
• POS tagging is too slow for large collections
• Simpler definition: a phrase is any sequence of n words, known as n-grams
  ◦ bigram: 2-word sequence; trigram: 3-word sequence; unigram: single words
  ◦ N-grams are also used at the character level for applications such as OCR
• N-grams typically formed from overlapping sequences of words
  ◦ i.e. move an n-word "window" one word at a time through the document
  ◦ http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html
• Frequent n-grams are more likely to be meaningful phrases
• N-grams form a Zipf distribution
  ◦ Better fit than words alone
• Could index all n-grams up to a specified length (sketch below)
  ◦ Much faster than POS tagging
  ◦ Uses a lot of storage
    ○ e.g., a document containing 1,000 words would contain 3,990 instances of word n-grams of length 2 ≤ n ≤ 5
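A sliding-window sketch that also checks the 1,000-word arithmetic above:

```python
def ngrams(words, n):
    """Overlapping word n-grams: slide an n-word window one word at a time."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

words = [f"w{i}" for i in range(1000)]            # a synthetic 1,000-word document
total = sum(len(ngrams(words, n)) for n in range(2, 6))
print(total)                                       # 999 + 998 + 997 + 996 = 3990
```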
• Web search engines index n-grams
• Google sample: [figure: sample of Google n-gram counts]
• Most frequent trigram in English is "all rights reserved"
  ◦ In Chinese, "limited liability corporation"
• Some parts of documents are more important than others
• Document parser recognizes structure using markup, such as HTML tags
  ◦ Headers, anchor text, and bolded text are all likely to be important
  ◦ Metadata can also be important
  ◦ Links used for link analysis
• Links are a key component of the Web (PageRank lecture)
• Important for navigation, but also for search (sketch below)
  ◦ e.g., <a href="http://example.com">Example website</a>
  ◦ "Example website" is the anchor text
  ◦ "http://example.com" is the destination link
  ◦ both are used by search engines
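A sketch that pulls (anchor text, destination link) pairs out of HTML with Python's standard-library parser:

```python
from html.parser import HTMLParser

class AnchorExtractor(HTMLParser):
    """Collect (anchor text, href) pairs from <a> tags."""
    def __init__(self):
        super().__init__()
        self.in_a, self.href, self.text, self.anchors = False, None, [], []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_a, self.href, self.text = True, dict(attrs).get("href"), []
    def handle_data(self, data):
        if self.in_a:
            self.text.append(data)
    def handle_endtag(self, tag):
        if tag == "a" and self.in_a:
            self.anchors.append(("".join(self.text).strip(), self.href))
            self.in_a = False

p = AnchorExtractor()
p.feed('<a href="http://example.com">Example website</a>')
print(p.anchors)   # [('Example website', 'http://example.com')]
```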
• Used as a description of the content of the destination page
  ◦ i.e., the collection of anchor text in all links pointing to a page is used as an additional text field
• Anchor text tends to be short, descriptive, and similar to query text
• Retrieval experiments have shown that anchor text has a significant impact on effectiveness for some types of queries
  ◦ i.e., more than PageRank
• Preliminaries (sketch below):
  ◦ 1) Extract links from the source text. You'll also want to extract the URL from each document in a separate file. Now you have all the links (source-destination pairs) and all the source documents.
  ◦ 2) Remove all links from the list that do not connect two documents in the corpus. The easiest way to do this is to sort all links by destination, then compare that against the corpus URL list (also sorted).
  ◦ 3) Create a new file I that contains a (url, pagerank) pair for each URL in the corpus. The initial PageRank value is 1/#D (#D = number of URLs).
• At this point there are two interesting files:
  ◦ [L] links (trimmed to contain only corpus links, sorted by source URL)
  ◦ [I] URL/PageRank pairs, initialized to a constant
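A minimal sketch of steps 2 and 3. The file formats are assumptions (one whitespace-separated source-destination pair per line in the link file, one URL per line in the corpus file), and a set lookup stands in for the sort-and-compare pass described above:

```python
def build_files(links_path, urls_path, trimmed_path, init_path):
    corpus = {line.strip() for line in open(urls_path)}     # all corpus URLs

    # Step 2: keep only links whose endpoints are both in the corpus;
    # write file [L] sorted by source URL.
    pairs = [line.split() for line in open(links_path)]
    with open(trimmed_path, "w") as out:
        for src, dst in sorted(pairs):
            if src in corpus and dst in corpus:
                out.write(f"{src} {dst}\n")

    # Step 3: file [I] with (url, pagerank) pairs, initialized to 1/#D.
    pr0 = 1.0 / len(corpus)
    with open(init_path, "w") as init:
        for url in sorted(corpus):
            init.write(f"{url} {pr0}\n")
```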
Editor's notes

  1. Probability
  2. 50 most popular words in the AP89 collection. Most inaccurate for low- and high-frequency words.
  3. Log-log plot shows problems at low and high frequencies. Mandelbrot proposed an alternative.
  4. As the size of a corpus grows, so does the vocabulary. Invented words, spelling errors, startups, products, etc.
  5. Since f = N * P, we can multiply N by P_a * P_b * P_c to get f_a * f_b * f_c / N².
  6. Why is this a bad estimate? The probability of a word occurring depends on whether another word is already present.
  7. Joint probability =
  8. False positives are pairs of words that are reduced to the same stem but should not be: organization and organ both reduce to organ, so the algorithm does its job, yet the two words are wrongly conflated. False negatives are pairs of words that end up with different stems but should not: European and Europe are reduced to different stems even though they are related.
  9. Should we look for broad match, exact match, or phrase match? Should it appear in the same paragraph or the same document?