SlideShare une entreprise Scribd logo
1  sur  75
Télécharger pour lire hors ligne
Text and Web Mining
Machine Learning and Data Mining (Unit 19)




Prof. Pier Luca Lanzi
References                                2




  Jiawei Han and Micheline Kamber,
  quot;Data Mining: Concepts and
  Techniquesquot;, The Morgan Kaufmann
  Series in Data Management Systems
  (Second Edition)

  Chapter 10, part 2

  Web Mining Course by
  Gregory-Platesky Shapiro available at
  www.kdnuggets.com




                  Prof. Pier Luca Lanzi
Mining Text Data: An Introduction                                        3




                          Data Mining / Knowledge Discovery




 Structured Data            Multimedia                      Free Text                   Hypertext

                                                                                 <a href>Frank Rizzo
                                                        Frank Rizzo bought
HomeLoan (
                                                                                 </a> Bought
                                                      his home from Lake
 Loanee: Frank Rizzo
                                                                                 <a hef>this home</a>
                                                      View Real Estate in
 Lender: MWF
                                                                                 from <a href>Lake
                                                      1992.
 Agency: Lake View
                                                                                 View Real Estate</a>
                                                        He paid $200,000
 Amount: $200,000
                                                                                 In <b>1992</b>.
                                                      under a15-year loan
 Term: 15 years
                       Loans($200K,[map],...)                                    <p>...
                                                      from MW Financial.
)

                              Prof. Pier Luca Lanzi
Bag-of-Tokens Approaches                                         4




           Documents                                           Token Sets


   Four score and seven                                     nation – 5
 years ago our fathers brought                              civil - 1
 forth on this continent, a new                             war – 2
 nation, conceived in Liberty,                  Feature     men – 2
 and dedicated to the                          Extraction   died – 4
 proposition that all men are                               people – 5
 created equal.                                             Liberty – 1
   Now we are engaged in a                                  God – 1
 great civil war, testing                                   …
 whether that nation, or …



           Loses all order-specific information!
                 Severely limits context!
                       Prof. Pier Luca Lanzi
Natural Language Processing                                                 5




               A dog is chasing a boy on the playground                                   Lexical
                                                                                         analysis
              Det   Noun Aux        Verb       Det Noun Prep Det            Noun
                                                                                      (part-of-speech
                                                                                         tagging)
                                                                       Noun Phrase
             Noun Phrase                       Noun Phrase
                            Complex Verb

                                                                    Prep Phrase
Semantic analysis                     Verb Phrase
                                                                            Syntactic analysis
 Dog(d1).                                                                       (Parsing)
 Boy(b1).
                                                      Verb Phrase
 Playground(p1).
 Chasing(d1,b1,p1).
                                           Sentence
         +
Scared(x) if Chasing(_,x,_).
                                                                         A person saying this may
                                                                      be reminding another person to
                                                                            get the dog back…
   Scared(b1)
                                                                           Pragmatic analysis
   Inference
                                                                              (speech act)

(Taken from ChengXiang Zhai, CS 397cxz – Fall 2003)
                                   Prof. Pier Luca Lanzi
General NLP—Too Difficult!                             6



  Word-level ambiguity
     “design” can be a noun or a verb (Ambiguous POS)
     “root” has multiple meanings (Ambiguous sense)
  Syntactic ambiguity
     “natural language processing” (Modification)
     “A man saw a boy with a telescope.” (PP Attachment)
  Anaphora resolution
     “John persuaded Bill to buy a TV for himself.”
     (himself = John or Bill?)
  Presupposition
     “He has quit smoking.” implies that he smoked before.

 Humans rely on context to interpret (when possible).
 This context may extend beyond a given document!

 (Taken from ChengXiang Zhai, CS 397cxz – Fall 2003)
                           Prof. Pier Luca Lanzi
Shallow Linguistics                       7



  English Lexicon
  Part-of-Speech Tagging
  Word Sense Disambiguation
  Phrase Detection / Parsing




                  Prof. Pier Luca Lanzi
WordNet                                                 8



 An extensive lexical network for the English language
 Contains over 138,838 words.
 Several graphs, one for each part-of-speech.
 Synsets (synonym sets), each defining a semantic sense.
 Relationship information (antonym, hyponym, meronym …)
 Downloadable for free (UNIX, Windows)
 Expanding to other languages (Global WordNet Association)
 Funded >$3 million, mainly government (translation interest)
 Founder George Miller, National Medal of Science, 1991.

                           watery          parched



               moist                         dry       arid
                             wet
  synonym

                                                              antonym
                            damp           anhydrous

                   Prof. Pier Luca Lanzi
Part-of-Speech Tagging                                                             9




                       Training data (Annotated text)
    This    sentence     serves      as     an example            of     annotated        text…
    Det        N           V1        P      Det  N                P         V2             N



                                                                   This is a new                  sentence.
                                      POS Tagger
“This is a new sentence.”
                                                                   Det Aux Det Adj                   N



                        Pick the most1likely, ttag tsequence.
                                  p(w ,..., wk 1,..., k )
                                               ⎧ p(t1 | w1 )... p(tk | wk ) p(w1 )... p(wk )
                                               ⎪
               p(w1 ,..., wk , t1 ,..., tk ) = ⎨ k
                                               ⎪∏ p(wi | ti ) p(ti | ti−1 )
                                                                                          Independent assignment
                                               ⎩ =1
                ⎧ p(t1 | w1 )... p(tk | wk ) p(iw1 )... p(wk )                                Most common tag
                ⎪
               =⎨ k
                ⎪∏ p(wi | ti ) p(ti | ti −1 )
                ⎩ i =1                                                               Partial dependency
                                                                                    (HMM)
                              Prof. Pier Luca Lanzi
Word Sense Disambiguation                                 10




                                                        ?
“The difficulties of computational linguistics are rooted in ambiguity.”
                                       N      Aux V        P   N

      Supervised Learning
      Features:
          Neighboring POS tags (N Aux V P N)
          Neighboring words (linguistics are rooted in ambiguity)
          Stemmed form (root)
          Dictionary/Thesaurus entries of neighboring words
          High co-occurrence words (plant, tree, origin,…)
          Other senses of word within discourse
      Algorithms:
          Rule-based Learning (e.g. IG guided)
          Statistical Learning (i.e. Naïve Bayes)
          Unsupervised Learning (i.e. Nearest Neighbor)

(Adapted from ChengXiang Zhai, CS 397cxz – Fall 2003)

                                Prof. Pier Luca Lanzi
Parsing                                                                                 11

(Adapted from ChengXiang Zhai, CS 397cxz – Fall 2003)

Choose most likely parse tree…                                     S                   Probability of this tree=0.000015

                                                          NP                   VP
     Probabilistic CFG
                                                    Det
             S→ NP VP          1.0                           BNP              VP                         PP
             NP → Det BNP      0.3
             NP → BNP                               A
                               0.4                             N       Aux         V        NP      P             NP
                               0.3
             NP→ NP PP
Grammar
                                                                       is chasing                   on
             BNP→ N                                          dog
                               …
             VP → V                                                                        a boy
                                                                                                         the playground
                                               .
             VP → Aux V NP                     .
                                               .
                               …
             VP → VP PP                                                                Probability of this tree=0.000011
                                                                   S
             PP → P NP         1.0
                                                          NP                   VP
             V → chasing       0.01                 Det                                            NP
                                                             BNP        Aux        V
             Aux→ is
                                                                                                             PP
             N → dog          0.003                 A                          chasing NP
                                                               N         is
             N → boy
Lexicon                                                                                                           NP
                                                                                                         P
             N→ playground     …
                                                             dog                            a boy
             Det→ the                                                                                   on
                               …
             Det→ a
             P → on                                                                                      the playground
                                     Prof. Pier Luca Lanzi
Obstacles                                     12



  Ambiguity

              A man saw a boy with a telescope.

  Computational Intensity

                 Imposes a context horizon.

  Text Mining NLP Approach
     Locate promising fragments using fast IR methods
     (bag-of-tokens)
     Only apply slow NLP techniques to promising fragments




                  Prof. Pier Luca Lanzi
Summary: Shallow NLP                            13



  However, shallow NLP techniques are feasible and useful:
  Lexicon – machine understandable linguistic knowledge
     possible senses, definitions, synonyms, antonyms, typeof,
     etc.
  POS Tagging – limit ambiguity (word/POS), entity extraction
  WSD – stem/synonym/hyponym matches (doc and query)
     Query: “Foreign cars”     Document: “I’m selling a 1976
     Jaguar…”
  Parsing – logical view of information (inference?,
  translation?)
     A man saw a boy with a telescope.”
  Even without complete NLP, any additional knowledge
  extracted from text data can only be beneficial.
  Ingenuity will determine the applications.


                   Prof. Pier Luca Lanzi
References for NLP                                 14



  C. D. Manning and H. Schutze, “Foundations of Natural Language
  Processing”, MIT Press, 1999.
  S. Russell and P. Norvig, “Artificial Intelligence: A Modern
  Approach”, Prentice Hall, 1995.
  S. Chakrabarti, “Mining the Web: Statistical Analysis of Hypertext
  and Semi-Structured Data”, Morgan Kaufmann, 2002.
  G. Miller, R. Beckwith, C. FellBaum, D. Gross, K. Miller, and R.
  Tengi. Five papers on WordNet. Princeton University, August 1993.
  C. Zhai, Introduction to NLP, Lecture Notes for CS 397cxz, UIUC,
  Fall 2003.
  M. Hearst, Untangling Text Data Mining, ACL’99, invited paper.
  http://www.sims.berkeley.edu/~hearst/papers/acl99/acl99-
  tdm.html
  R. Sproat, Introduction to Computational Linguistics, LING 306,
  UIUC, Fall 2003.
  A Road Map to Text Mining and Web Mining, University of Texas
  resource page. http://www.cs.utexas.edu/users/pebronia/text-
  mining/
  Computational Linguistics and Text Mining Group, IBM Research,
  http://www.research.ibm.com/dssgrp/


                    Prof. Pier Luca Lanzi
Text Databases and Information Retrieval      15



  Text databases (document databases)
     Large collections of documents from various sources:
     news articles, research papers, books, digital libraries,
     E-mail messages, and Web pages, library database, etc.
     Data stored is usually semi-structured
     Traditional information retrieval techniques become
     inadequate for the increasingly vast amounts of text data

  Information retrieval
     A field developed in parallel with database systems
     Information is organized into (a large number of)
     documents
     Information retrieval problem: locating relevant
     documents based on user input, such as keywords or
     example documents

                  Prof. Pier Luca Lanzi
Information Retrieval (IR)                   16



  Typical Information Retrieval systems
     Online library catalogs
     Online document management systems

  Information retrieval vs. database systems
     Some DB problems are not present in IR, e.g.,
     update, transaction management, complex objects
     Some IR problems are not addressed well in DBMS, e.g.,
     unstructured documents, approximate search using
     keywords and relevance




                  Prof. Pier Luca Lanzi
Basic Measures for Text Retrieval                              17




                Relevant              Relevant &
                                                   Retrieved
                                       Retrieved




                                 All Documents



  Precision: the percentage of retrieved documents that are in fact
  relevant to the query (i.e., “correct” responses)
                         | {Relevant} ∩ {Retrieved } |
             precision =
                                | {Retrieved} |
  Recall: the percentage of documents that are relevant to the query
  and were, in fact, retrieved
                         | {Relevant} ∩ {Retrieved } |
             precision =
                                 | {Relevant} |
                       Prof. Pier Luca Lanzi
Information Retrieval Techniques             18



  Basic Concepts
     A document can be described by a set of representative
     keywords called index terms.
     Different index terms have varying relevance when used
     to describe document contents.
     This effect is captured through the assignment of
     numerical weights to each index term of a document.
     (e.g.: frequency, tf-idf)

  DBMS Analogy
    Index Terms    Attributes
    Weights   Attribute Values




                  Prof. Pier Luca Lanzi
Information Retrieval Techniques          19



  Index Terms (Attribute) Selection:
     Stop list
     Word stem
     Index terms weighting methods

  Terms    Documents Frequency Matrices

  Information Retrieval Models:
     Boolean Model
     Vector Model
     Probabilistic Model




                  Prof. Pier Luca Lanzi
Boolean Model                                20



  Consider that index terms are either present
  or absent in a document
  As a result, the index term weights are assumed
  to be all binaries
  A query is composed of index terms linked by three
  connectives: not, and, and or
      E.g.: car and repair, plane or airplane
  The Boolean model predicts that each document is either
  relevant or non-relevant based on the match of a document
  to the query




                  Prof. Pier Luca Lanzi
Keyword-Based Retrieval                      21



  A document is represented by a string, which can be
  identified by a set of keywords

  Queries may use expressions of keywords
    E.g., car and repair shop, tea or coffee,
    DBMS but not Oracle
    Queries and retrieval should consider synonyms,
    e.g., repair and maintenance

  Major difficulties of the model
    Synonymy: A keyword T does not appear anywhere in the
    document, even though the document is closely related to
    T, e.g., data mining
    Polysemy: The same keyword may mean different things
    in different contexts, e.g., mining

                  Prof. Pier Luca Lanzi
Similarity-Based Retrieval in Text Data      22



  Finds similar documents based on a set of common keywords

  Answer should be based on the degree of relevance based on
  the nearness of the keywords, relative frequency of the
  keywords, etc.

  Basic techniques

  Stop list
     Set of words that are deemed “irrelevant”, even though
     they may appear frequently
     E.g., a, the, of, for, to, with, etc.
     Stop lists may vary when document set varies



                     Prof. Pier Luca Lanzi
Similarity-Based Retrieval in Text Data               23



  Word stem
    Several words are small syntactic variants of each other since
    they share a common word stem
    E.g., drug, drugs, drugged

  A term frequency table
      Each entry frequent_table(i, j) = # of occurrences of the word
      ti in document di
      Usually, the ratio instead of the absolute number of occurrences
      is used

  Similarity metrics: measure the closeness of a document to a query
  (a set of keywords)
      Relative term occurrences
      Cosine distance:
                                         v1 ⋅ v2
                       sim(v1 , v2 ) =
                                       | v1 || v2 |
                    Prof. Pier Luca Lanzi
Indexing Techniques                                24



  Inverted index
     Maintains two hash- or B+-tree indexed tables:
       • document_table: a set of document records <doc_id,
         postings_list>
       • term_table: a set of term records, <term, postings_list>
     Answer query: Find all docs associated with one or a set of
     terms
     + easy to implement
     – do not handle well synonymy and polysemy, and posting lists
     could be too long (storage could be very large)

  Signature file
     Associate a signature with each document
     A signature is a representation of an ordered list of terms that
     describe the document
     Order is obtained by frequency analysis, stemming and stop lists




                    Prof. Pier Luca Lanzi
Vector Space Model                                  25



  Documents and user queries are represented as m-dimensional
  vectors, where m is the total number of index terms in the
  document collection.
  The degree of similarity of the document d with regard to the query
  q is calculated as the correlation between the vectors that represent
  them, using measures such as the Euclidian distance or the cosine
  of the angle between these two vectors.




                    Prof. Pier Luca Lanzi
Latent Semantic Indexing (1)                         26



  Basic idea
     Similar documents have similar word frequencies
     Difficulty: the size of the term frequency matrix is very large
     Use a singular value decomposition (SVD) techniques to reduce
     the size of frequency table
     Retain the K most significant rows of the frequency table
  Method
     Create a term x document weighted frequency matrix A
     SVD construction: A = U * S * V’
     Define K and obtain Uk ,, Sk , and Vk.
     Create query vector q’ .
     Project q’ into the term-document space: Dq = q’ * Uk * Sk-1
     Calculate similarities: cos α = Dq . D / ||Dq|| * ||D||




                    Prof. Pier Luca Lanzi
Latent Semantic Indexing (2)                   27




Weighted Frequency Matrix




  Query Terms:
  - Insulation
  - Joint


                        Prof. Pier Luca Lanzi
Probabilistic Model                             28



  Basic assumption: Given a user query, there is a set of
  documents which contains exactly the relevant documents
  and no other (ideal answer set)
  Querying process as a process of specifying the properties of
  an ideal answer set. Since these properties are not known at
  query time, an initial guess is made
  This initial guess allows the generation of a preliminary
  probabilistic description of the ideal answer set which is used
  to retrieve the first set of documents
  An interaction with the user is then initiated with the purpose
  of improving the probabilistic description of the answer set




                   Prof. Pier Luca Lanzi
Types of Text Data Mining                      29



  Keyword-based association analysis
  Automatic document classification
  Similarity detection
     Cluster documents by a common author
     Cluster documents containing information
     from a common source
  Link analysis: unusual correlation between entities
  Sequence analysis: predicting a recurring event
  Anomaly detection: find information that violates usual
  patterns
  Hypertext analysis
     Patterns in anchors/links
     Anchor text correlations with linked objects



                  Prof. Pier Luca Lanzi
Keyword-Based Association Analysis                  30



  Motivation
    Collect sets of keywords or terms that occur frequently
    together and then find the association or correlation
    relationships among them

  Association Analysis Process
     Preprocess the text data by parsing, stemming,
     removing stop words, etc.
     Evoke association mining algorithms
      • Consider each document as a transaction
      • View a set of keywords in the document as
        a set of items in the transaction
     Term level association mining
      • No need for human effort in tagging documents
      • The number of meaningless results and
        the execution time is greatly reduced



                   Prof. Pier Luca Lanzi
Text Classification (1)                         31



  Motivation
    Automatic classification for the large number of on-line
    text documents (Web pages, e-mails, corporate intranets,
    etc.)

  Classification Process
     Data preprocessing
     Definition of training set and test sets
     Creation of the classification model using the selected
     classification algorithm
     Classification model validation
     Classification of new/unknown text documents

  Text document classification differs
  from the classification of relational data
     Document databases are not structured according to
     attribute-value pairs


                   Prof. Pier Luca Lanzi
Text Classification (2)                   32



  Classification Algorithms:
     Support Vector Machines
     K-Nearest Neighbors
     Naïve Bayes
     Neural Networks
     Decision Trees
     Association rule-based
     Boosting




                  Prof. Pier Luca Lanzi
Document Clustering                                 33



  Motivation
     Automatically group related documents based on their contents
     No predetermined training sets or taxonomies
     Generate a taxonomy at runtime

  Clustering Process
     Data preprocessing:
     remove stop words, stem, feature extraction, lexical analysis,
     etc.
     Hierarchical clustering:
     compute similarities applying clustering algorithms.
     Model-Based clustering (Neural Network Approach):
     clusters are represented by “exemplars”. (e.g.: SOM)




                    Prof. Pier Luca Lanzi
Text Categorization                             34



  Pre-given categories and labeled document examples
  (Categories may form hierarchy)
  Classify new documents
  A standard classification (supervised learning ) problem



                                                     Sports
                          Categorization
                                                     Business
                             System
                                                     Education

                                                …     …
                                                     Science
                                    Sports
                                    Business

                                    Education




                   Prof. Pier Luca Lanzi
Applications                              35



  News article classification
  Automatic email filtering
  Webpage classification
  Word sense disambiguation
  ……




                  Prof. Pier Luca Lanzi
Categorization Methods                                36



  Manual: Typically rule-based
    Does not scale up (labor-intensive, rule inconsistency)
    May be appropriate for special data on a particular
    domain

  Automatic: Typically exploiting machine learning techniques
     Vector space model based
      •   Prototype-based (Rocchio)
      •   K-nearest neighbor (KNN)
      •   Decision-tree (learn rules)
      •   Neural Networks (learn non-linear classifier)
      •   Support Vector Machines (SVM)
     Probabilistic or generative model based
      • Naïve Bayes classifier



                      Prof. Pier Luca Lanzi
Vector Space Model                              37



  Represent a doc by a term vector
    Term: basic concept, e.g., word or phrase
    Each term defines one dimension
    N terms define a N-dimensional space
    Element of vector corresponds to term weight
    E.g., d = (x1,…,xN), xi is “importance” of term I

  New document is assigned to the most likely category based
  on vector similarity




                   Prof. Pier Luca Lanzi
Illustration of the Vector Space Model                      38




                          Starbucks
                                                      C2   Category 2


     Category 3
              C3




                                                             Java
                                            new doc


                                               C1 Category 1
        Microsoft



                    Prof. Pier Luca Lanzi
What VS Model Does Not Specify?                 39



  How to select terms to capture “basic concepts”
    Word stopping
      • e.g. “a”, “the”, “always”, “along”
     Word stemming
      • e.g. “computer”, “computing”, “computerize” => “compute”
    Latent semantic indexing
  How to assign weights
    Not all words are equally important: Some are more
    indicative than others
      • e.g. “algebra” vs. “science”
  How to measure the similarity




                    Prof. Pier Luca Lanzi
How to Assign Weights                           40



  Two-fold heuristics based on frequency

  TF (Term frequency)
     More frequent within a document       more relevant to
     semantics
     E.g., “query” vs. “commercial”

  IDF (Inverse document frequency)
     Less frequent among documents        more discriminative
     E.g. “algebra” vs. “science”




                  Prof. Pier Luca Lanzi
Term Frequency Weighting                     41



  Weighting
    The more frequent, the more relevant to topic
    E.g. “query” vs. “commercial”
    Raw TF= f(t,d): how many times term t appears in doc d

  Normalization
     When document length varies,
     then relative frequency is preferred
     E.g., Maximum frequency normalization




                 Prof. Pier Luca Lanzi
Inverse Document Frequency Weighting         42



  Idea: The less frequent terms among documents are the
  more discriminative

  Formula:




  Given
     n, total number of docs
     k, the number of docs with term t appearing
     (the DF document frequency)




                  Prof. Pier Luca Lanzi
TF-IDF Weighting                               43



  TF-IDF weighting: weight(t, d) = TF(t, d) * IDF(t)
     Frequent within doc   high tf     high weight
     Selective among docs     high idf    high weight

  Recall VS model
     Each selected term represents one dimension
     Each doc is represented by a feature vector
     Its t-term coordinate of document d is the TF-IDF weight
     This is more reasonable

  Just for illustration …
     Many complex and more effective weighting variants exist
     in practice



                   Prof. Pier Luca Lanzi
How to Measure Similarity?                  44



  Given two document



  Similarity definition
     dot product




     normalized dot product (or cosine)




                    Prof. Pier Luca Lanzi
Illustrative Example                                                            45




          text
doc1     mining     Sim(newdoc,doc1)=4.8*2.4+4.5*4.5
         search
         engine
                    Sim(newdoc,doc2)=2.4*2.4
          text
                                                                  To whom is newdoc
                                                                  more similar?
         travel
          text
                    Sim(newdoc,doc3)=0
          map
doc2     travel

                               text   mining   travel   map search engine govern president congress
                    IDF(faked) 2.4      4.5      2.8    3.3   2.1    5.4     2.2    3.2      4.3

       government   doc1       2(4.8) 1(4.5)                     1(2.1)   1(5.4)
        president   doc2       1(2.4 )          2 (5.6) 1(3.3)
        congress
doc3                doc3                                                           1 (2.2) 1(3.2)   1(4.3)

                    newdoc     1(2.4) 1(4.5)
       ……
                             Prof. Pier Luca Lanzi
Vector Space Model-Based Classifiers          46



  What do we have so far?
    A feature space with similarity measure
    This is a classic supervised learning problem
    Search for an approximation to classification hyper plane

  VS model based classifiers
     K-NN
     Decision tree based
     Neural networks
     Support vector machine




                  Prof. Pier Luca Lanzi
Probabilistic Model                            47



  Category C is modeled as a probability distribution
  of predefined random events
  Random events model the process
  of generating documents
  Therefore, how likely a document d belongs
  to category C is measured through the probability
  for category C to generate d.




                  Prof. Pier Luca Lanzi
Quick Revisit of Bayes’ Rule                                 48



  Category Hypothesis space: H = {C1 , …, Cn}
  One document: D


                               P ( D | Ci ) P(Ci )
                  P (Ci | D) =
                                     P( D)

  As we want to pick the most likely category C*, we can drop
  p(D)
          Posterior probability of Ci


C* = arg max C P(C | D) = arg max C P ( D | C ) P(C )
                                            Document model for category C


                    Prof. Pier Luca Lanzi
Probabilistic Model: Multi-Bernoulli                                                              49



   Event: word presence or absence

   D = (x1, …, x|V|)
   xi =1 for presence of word wi
   xi =0 for absence

   Parameters
      {p(wi=1|C), p(wi=0|C)}
      p(wi=1|C)+ p(wi=0|C)=1
                               |V |                          |V |                          |V |
 p( D = ( x1 ,..., x|V | ) | C )= ∏ p( wi = xi | C ) =     ∏                              ∏
                                                                        p(wi = 1| C )                 p( wi = 0 | C )
                              i =1                        i =1, xi =1                   i =1, xi =0




                                  Prof. Pier Luca Lanzi
Probabilistic Model: Multinomial                                       50



  Event: word selection/sampling

  D = (n1, …, n|V|)
  ni: frequency of word wi
  n=n1,+…+ n|V|

  Parameters
     {p(wi|C)}
     p(w1|C)+… p(w|v||C) = 1

                                                  ⎛              ⎞ |V |
                                                       n
                                                    n1 ... n|V | ⎠ ∏
   p ( D = ( n1 , ..., n|v | ) | C )= p ( n | C ) ⎜                     p ( w i | C ) ni
                                                                 ⎟
                                                  ⎝                i =1




                           Prof. Pier Luca Lanzi
Parameter Estimation                                                                      51




                                                Category prior
 Training examples:
                                                                                     | E (Ci ) |
                                                            p (Ci ) =
                        E(C2)
  E(C1)                                                                          k

                                                                             ∑ | E (C ) |     j
                                                                                 j =1

       C1                   C2
                                                Multi-Bernoulli Doc model
            Ck                                                    ∑ δ(w ,d)+0.5                          ⎧1 if wj occursin d
                                                                             j
                                                                d∈E(Ci )
                                                                                             δ(wj ,d) = ⎨
                                              p(wj =1| Ci ) =
                                                                      | E(Ci )| +1                       ⎩0 otherwise
                       E(Ck)
                                                Multinomial doc model
Vocabulary: V = {w1, …, w|V|}
                                                                   ∑ c(w ,d)+1   j
                                                                  d∈E(Ci )
                                              p(wj | Ci ) = |V|                                    c(wj ,d) =counts of wj in d
                                                            ∑ ∑ c(w ,d)+|V |     m
                                                            m=1 d∈E(Ci )



                      Prof. Pier Luca Lanzi
Classification of New Document                                                                   52




            Multi-Bernoulli                                                      Multinomial
 d = (x1,..., x|V| ) x ∈{0,1                                  d = (n1,..., n|V| ) | d |= n = n1 +... + n|V|
                            }
                                                              C* = argmaxC P(D | C)P(C)
 C* = argmaxC P(D | C)P(C)
                                                                                           |V|
                                                                  = argmaxC p(n | C)∏p(wi | C)ni P(C)
                  |V|
    = argmaxC ∏p(wi = xi | C)P(C)                                                          i=1
                  i=1                                                                                         |V|
                                                                  = argmaxC log p(n | C) + log p(C) + ∑ni log p(wi | C)
                                |V|
    = argmaxC log p(C) + ∑log p(wi = xi | C)                                                                  i=1
                                                                                                 |V|
                                                                  ≈ argmaxC log p(C) + ∑ni log p(wi | C)
                                i=1
                                                                                                 i=1




                                      Prof. Pier Luca Lanzi
Categorization Methods                      53



  Vector space model
     K-NN
     Decision tree
     Neural network
     Support vector machine

  Probabilistic model
     Naïve Bayes classifier

  Many, many others and variants exist
    e.g. Bim, Nb, Ind, Swap-1, LLSF, Widrow-Hoff, Rocchio,
    Gis-W, … …




                   Prof. Pier Luca Lanzi
Evaluation (1)                              54



  Effectiveness measure
     Classic: Precision & Recall




      • Precision

      • Recall




                    Prof. Pier Luca Lanzi
Evaluation (2)                                       55



  Benchmarks
    Classic: Reuters collection
      • A set of newswire stories classified under categories related
        to economics.
  Effectiveness
     Difficulties of strict comparison
      • different parameter setting
      • different “split” (or selection) between training and testing
      • various optimizations … …
     However widely recognizable
      • Best: Boosting-based committee classifier & SVM
      • Worst: Naïve Bayes classifier
     Need to consider other factors, especially efficiency




                    Prof. Pier Luca Lanzi
Summary: Text Categorization                   56



  Wide application domain
  Comparable effectiveness to professionals
     Manual Text Classification is not 100% and unlikely to
     improve substantially
     Automatic Text Classification is growing at a steady pace
  Prospects and extensions
     Very noisy text, such as text from O.C.R.
     Speech transcripts




                  Prof. Pier Luca Lanzi
Research Problems in Text Mining               57



  Google: what is the next step?
  How to find the pages that match approximately the
  sophisticated documents, with incorporation of user-profiles
  or preferences?
  Look back of Google: inverted indicies
  Construction of indicies for the sophisticated documents,
  with incorporation of user-profiles or preferences
  Similarity search of such pages using such indicies




                  Prof. Pier Luca Lanzi
References for Text Mining                     58



  Fabrizio Sebastiani, “Machine Learning in Automated Text
  Categorization”, ACM Computing Surveys, Vol. 34, No.1,
  March 2002
  Soumen Chakrabarti, “Data mining for hypertext: A tutorial
  survey”, ACM SIGKDD Explorations, 2000.
  Cleverdon, “Optimizing convenient online accesss to
  bibliographic databases”, Information Survey, Use4, 1, 37-
  47, 1984
  Yiming Yang, “An evaluation of statistical approaches to text
  categorization”, Journal of Information Retrieval, 1:67-88,
  1999.
  Yiming Yang and Xin Liu “A re-examination of text
  categorization methods”. Proceedings of ACM SIGIR
  Conference on Research and Development in Information
  Retrieval (SIGIR'99, pp 42--49), 1999.


                   Prof. Pier Luca Lanzi
World Wide Web: a brief history           59



  Who invented the wheel is unknown
  Who invented the World-Wide Web ?

  (Sir) Tim Berners-Lee
  in 1989, while working at CERN,
  invented the World Wide Web,
  including URL scheme, HTML, and in
  1990 wrote the first server and the
  first browser
  Mosaic browser developed by Marc
  Andreessen and Eric Bina at NCSA
  (National Center for Supercomputing
  Applications) in 1993; helped rapid
  web spread
  Mosaic was basis for Netscape …


                  Prof. Pier Luca Lanzi
What is Web Mining?                           60




   Discovering interesting and useful information
             from Web content and usage

  Examples
     Web search, e.g. Google, Yahoo, MSN, Ask, …
     Specialized search: e.g. Froogle (comparison shopping),
     job ads (Flipdog)
     eCommerce
     Recommendations (Netflix, Amazon, etc.)
     Improving conversion rate: next best product to offer
     Advertising, e.g. Google Adsense
     Fraud detection: click fraud detection, …
     Improving Web site design and performance


                  Prof. Pier Luca Lanzi
How does it differ from “classical”                               61
Data Mining?

  The web is not a relation
     Textual information and linkage structure
  Usage data is huge and growing rapidly
     Google’s usage logs are bigger than their web crawl
     Data generated per day is comparable to largest
     conventional data warehouses
  Ability to react in real-time to usage patterns
     No human in the loop




                                          Reproduced from Ullman & Rajaraman with permission
                  Prof. Pier Luca Lanzi
How big is the Web?                                               62



  The number of pages is technically, infinite
     Because of dynamically generated content
     Lots of duplication (30-40%)

  Best estimate of “unique” static HTML pages comes from
  search engine claims
     Google = 8 billion
     Yahoo = 20 billion
     Lots of marketing hype




                                          Reproduced from Ullman & Rajaraman with permission
                  Prof. Pier Luca Lanzi
Netcraft Survey:                                 63
76,184,000 web sites (Feb 2006)

     Netcraft survey




 http://news.netcraft.com/archives/web_server_survey.html
                   Prof. Pier Luca Lanzi
Mining the World-Wide Web                    64



  The WWW is huge, widely distributed, global information
  service center for
     Information services: news, advertisements, consumer
     information, financial management, education,
     government, e-commerce, etc.
     Hyper-link information
     Access and usage information
  WWW provides rich sources for data mining
  Challenges
     Too huge for effective data warehousing and data mining
     Too complex and heterogeneous: no standards and
     structure




                  Prof. Pier Luca Lanzi
Web Mining: A more challenging task          65



  Searches for
     Web access patterns
     Web structures
     Regularity and dynamics of Web contents
  Problems
     The “abundance” problem
     Limited coverage of the Web: hidden Web sources,
     majority of data in DBMS
     Limited query interface based on keyword-oriented search
     Limited customization to individual users




                  Prof. Pier Luca Lanzi
Web Mining Taxonomy                                                    66




                                      Web Mining




            Web Content               Web Structure             Web Usage
              Mining                     Mining                  Mining




      Web Page                                     General Access      Customized
                      Search Result
    Content Mining                                Pattern Tracking    Usage Tracking
                         Mining




                          Prof. Pier Luca Lanzi
Mining the World-Wide Web                                                       67




                                                Web Mining

                      Web Content
                        Mining
                                                     Web Structure
                                                                         Web Usage
                                                        Mining
                                                                          Mining

        Web Page Content Mining

Web Page Summarization
                                                                      General Access     Customized
                                                    Search Result
WebLog , WebOQL …:                                                   Pattern Tracking   Usage Tracking
                                                       Mining
Web Structuring query languages;
Can identify information within given web
pages
Ahoy!:Uses heuristics to distinguish personal
home pages from other web pages
ShopBot: Looks for product prices within web
pages




                                  Prof. Pier Luca Lanzi
Mining the World-Wide Web                                                    68




                                           Web Mining



                 Web Content
                                               Web Structure
                   Mining
                                                                   Web Usage
                                                  Mining
                                                                    Mining
  Web Page
Content Mining

                       Search Result Mining                     General Access     Customized
                                                               Pattern Tracking   Usage Tracking

                 Search Engine Result
                 Summarization
                 Clustering Search Result :
                 Categorizes documents using
                 phrases in titles and snippets



                               Prof. Pier Luca Lanzi
Mining the World-Wide Web                                                 69




                                             Web Mining



      Web Content                                                                     Web Usage
        Mining                                                                         Mining
                                       Web Structure Mining
                        Using Links
                        PageRank
                        CLEVER
                        Use interconnections between web pages to give
                                                                                   General Access
        Search Result   weight to pages.
                                                                                  Pattern Tracking
           Mining

                        Using Generalization
  Web Page                                                                                Customized
                        MLDB, VWV
Content Mining                                                                           Usage Tracking
                        Uses a multi-level database representation of the
                        Web. Counters (popularity) and link lists are used
                        for capturing structure.



                                 Prof. Pier Luca Lanzi
Mining the World-Wide Web                                     70




                                       Web Mining




        Web Content        Web Structure               Web Usage
          Mining              Mining                    Mining




                                General Access Pattern Tracking
  Web Page                                                           Customized
Content Mining                                                      Usage Tracking

                           Web Log Mining
           Search Result   Uses KDD techniques to understand
              Mining
                           general access patterns and trends.
                           Can shed light on better structure and
                           grouping of resource providers.



                           Prof. Pier Luca Lanzi
Mining the World-Wide Web                                                71




                                         Web Mining




        Web Content         Web Structure                        Web Usage
          Mining               Mining                             Mining




                                               Customized Usage Tracking
  Web Page                  General Access
Content Mining             Pattern Tracking
                                               Adaptive Sites
                                               Analyzes access patterns of each user at a
           Search Result
                                               time. Web site restructures itself automatically
              Mining
                                               by learning from user access patterns.




                             Prof. Pier Luca Lanzi
Web Usage Mining                              72



  Mining Web log records to discover user access patterns of
  Web pages
  Applications
     Target potential customers for electronic commerce
     Enhance the quality and delivery of Internet information
     services to the end user
     Improve Web server system performance
     Identify potential prime advertisement locations
  Web logs provide rich information about Web dynamics
     Typical Web log entry includes the URL requested, the IP
     address from which the request originated, and a
     timestamp




                  Prof. Pier Luca Lanzi
Web Usage Mining                                 73



      Understanding is a pre-requisite to improvement
           1 Google, but 70,000,000+ web sites

  What Applications?
    Simple and Basic
     • Monitor performance, bandwidth usage
     • Catch errors (404 errors- pages not found)
     • Improve web site design (shortcuts for frequent paths,
       remove links not used, etc)
     •…
    Advanced and Business Critical
     • eCommerce: improve conversion, sales, profit
     • Fraud detection: click stream fraud, …
     •…



                  Prof. Pier Luca Lanzi
Techniques for Web usage mining                74



  Construct multidimensional view on the Weblog database
    Perform multidimensional OLAP analysis to find the top N
    users, top N accessed Web pages, most frequently
    accessed time periods, etc.

  Perform data mining on Weblog records
     Find association patterns, sequential patterns, and trends
     of Web accessing
     May need additional information,e.g., user browsing
     sequences of the Web pages in the Web server buffer

  Conduct studies to
    Analyze system performance, improve system design by
    Web caching, Web page prefetching, and Web page
    swapping



                   Prof. Pier Luca Lanzi
References                                                       75



  Deng Cai, Shipeng Yu, Ji-Rong Wen and Wei-Ying Ma, “Extracting Content Structure for
  Web Pages based on Visual Representation”, The Fifth Asia Pacific Web Conference,
  2003.
  Deng Cai, Shipeng Yu, Ji-Rong Wen and Wei-Ying Ma, “VIPS: a Vision-based Page
  Segmentation Algorithm”, Microsoft Technical Report (MSR-TR-2003-79), 2003.
  Shipeng Yu, Deng Cai, Ji-Rong Wen and Wei-Ying Ma, “Improving Pseudo-Relevance
  Feedback in Web Information Retrieval Using Web Page Segmentation”, 12th
  International World Wide Web Conference (WWW2003), May 2003.
  Ruihua Song, Haifeng Liu, Ji-Rong Wen and Wei-Ying Ma, “Learning Block Importance
  Models for Web Pages”, 13th International World Wide Web Conference (WWW2004),
  May 2004.
  Deng Cai, Shipeng Yu, Ji-Rong Wen and Wei-Ying Ma, “Block-based Web Search”,
  SIGIR 2004, July 2004 .
  Deng Cai, Xiaofei He, Ji-Rong Wen and Wei-Ying Ma, “Block-Level Link Analysis”, SIGIR
  2004, July 2004 .
  Deng Cai, Xiaofei He, Wei-Ying Ma, Ji-Rong Wen and Hong-Jiang Zhang, “Organizing
  WWW Images Based on The Analysis of Page Layout and Web Link Structure”, The
  IEEE International Conference on Multimedia and EXPO (ICME'2004) , June 2004
  Deng Cai, Xiaofei He, Zhiwei Li, Wei-Ying Ma and Ji-Rong Wen, “Hierarchical Clustering
  of WWW Image Search Results Using Visual, Textual and Link Analysis”,12th ACM
  International Conference on Multimedia, Oct. 2004 .




                         Prof. Pier Luca Lanzi

Contenu connexe

Tendances

Multimedia Database
Multimedia Database Multimedia Database
Multimedia Database Avnish Patel
 
Multimedia Data Mining using Deep Learning
Multimedia Data Mining using Deep LearningMultimedia Data Mining using Deep Learning
Multimedia Data Mining using Deep LearningBhagyashree Barde
 
Mongodb - NoSql Database
Mongodb - NoSql DatabaseMongodb - NoSql Database
Mongodb - NoSql DatabasePrashant Gupta
 
Codd's rules
Codd's rulesCodd's rules
Codd's rulesMohd Arif
 
Network and System Administration chapter 2
Network and System Administration chapter 2Network and System Administration chapter 2
Network and System Administration chapter 2IgguuMuude
 
Bsc cs ii-dbms- u-iii-data modeling using e.r. model (entity relationship model)
Bsc cs ii-dbms- u-iii-data modeling using e.r. model (entity relationship model)Bsc cs ii-dbms- u-iii-data modeling using e.r. model (entity relationship model)
Bsc cs ii-dbms- u-iii-data modeling using e.r. model (entity relationship model)Rai University
 
Query Decomposition and data localization
Query Decomposition and data localization Query Decomposition and data localization
Query Decomposition and data localization Hafiz faiz
 
Lecture 08 distributed dbms
Lecture 08 distributed dbmsLecture 08 distributed dbms
Lecture 08 distributed dbmsemailharmeet
 
ADBMS Object and Object Relational Databases
ADBMS  Object  and Object Relational Databases ADBMS  Object  and Object Relational Databases
ADBMS Object and Object Relational Databases Jayanthi Kannan MK
 
Query processing and optimization (updated)
Query processing and optimization (updated)Query processing and optimization (updated)
Query processing and optimization (updated)Ravinder Kamboj
 
directory structure and file system mounting
directory structure and file system mountingdirectory structure and file system mounting
directory structure and file system mountingrajshreemuthiah
 
Active directory
Active directory Active directory
Active directory deshvikas
 
Dynamic storage allocation techniques in Compiler design
Dynamic storage allocation techniques in Compiler designDynamic storage allocation techniques in Compiler design
Dynamic storage allocation techniques in Compiler designkunjan shah
 
Query processing-and-optimization
Query processing-and-optimizationQuery processing-and-optimization
Query processing-and-optimizationWBUTTUTORIALS
 
Data base security & integrity
Data base security &  integrityData base security &  integrity
Data base security & integrityPooja Dixit
 
Database Security & Encryption
Database Security & EncryptionDatabase Security & Encryption
Database Security & EncryptionTech Sanhita
 
Data warehousing labs maunal
Data warehousing labs maunalData warehousing labs maunal
Data warehousing labs maunalEducation
 

Tendances (20)

Multimedia Database
Multimedia Database Multimedia Database
Multimedia Database
 
Distributed DBMS - Unit 3 - Distributed DBMS Architecture
Distributed DBMS - Unit 3 - Distributed DBMS ArchitectureDistributed DBMS - Unit 3 - Distributed DBMS Architecture
Distributed DBMS - Unit 3 - Distributed DBMS Architecture
 
Multimedia Data Mining using Deep Learning
Multimedia Data Mining using Deep LearningMultimedia Data Mining using Deep Learning
Multimedia Data Mining using Deep Learning
 
Mongodb - NoSql Database
Mongodb - NoSql DatabaseMongodb - NoSql Database
Mongodb - NoSql Database
 
Codd's rules
Codd's rulesCodd's rules
Codd's rules
 
Network and System Administration chapter 2
Network and System Administration chapter 2Network and System Administration chapter 2
Network and System Administration chapter 2
 
Ordbms
OrdbmsOrdbms
Ordbms
 
Bsc cs ii-dbms- u-iii-data modeling using e.r. model (entity relationship model)
Bsc cs ii-dbms- u-iii-data modeling using e.r. model (entity relationship model)Bsc cs ii-dbms- u-iii-data modeling using e.r. model (entity relationship model)
Bsc cs ii-dbms- u-iii-data modeling using e.r. model (entity relationship model)
 
Query Decomposition and data localization
Query Decomposition and data localization Query Decomposition and data localization
Query Decomposition and data localization
 
Lecture 08 distributed dbms
Lecture 08 distributed dbmsLecture 08 distributed dbms
Lecture 08 distributed dbms
 
ADBMS Object and Object Relational Databases
ADBMS  Object  and Object Relational Databases ADBMS  Object  and Object Relational Databases
ADBMS Object and Object Relational Databases
 
Query processing and optimization (updated)
Query processing and optimization (updated)Query processing and optimization (updated)
Query processing and optimization (updated)
 
directory structure and file system mounting
directory structure and file system mountingdirectory structure and file system mounting
directory structure and file system mounting
 
Active directory
Active directory Active directory
Active directory
 
Dynamic storage allocation techniques in Compiler design
Dynamic storage allocation techniques in Compiler designDynamic storage allocation techniques in Compiler design
Dynamic storage allocation techniques in Compiler design
 
Query processing-and-optimization
Query processing-and-optimizationQuery processing-and-optimization
Query processing-and-optimization
 
Deductive databases
Deductive databasesDeductive databases
Deductive databases
 
Data base security & integrity
Data base security &  integrityData base security &  integrity
Data base security & integrity
 
Database Security & Encryption
Database Security & EncryptionDatabase Security & Encryption
Database Security & Encryption
 
Data warehousing labs maunal
Data warehousing labs maunalData warehousing labs maunal
Data warehousing labs maunal
 

En vedette

Text mining presentation in Data mining Area
Text mining presentation in Data mining AreaText mining presentation in Data mining Area
Text mining presentation in Data mining AreaMahamudHasanCSE
 
Data Mining: an Introduction
Data Mining: an IntroductionData Mining: an Introduction
Data Mining: an IntroductionAli Abbasi
 
NumPy and SciPy for Data Mining and Data Analysis Including iPython, SciKits,...
NumPy and SciPy for Data Mining and Data Analysis Including iPython, SciKits,...NumPy and SciPy for Data Mining and Data Analysis Including iPython, SciKits,...
NumPy and SciPy for Data Mining and Data Analysis Including iPython, SciKits,...Ryan Rosario
 
Knowledge Discovery and Data Mining
Knowledge Discovery and Data MiningKnowledge Discovery and Data Mining
Knowledge Discovery and Data MiningAmritanshu Mehra
 
Data warehousing and data mining
Data warehousing and data miningData warehousing and data mining
Data warehousing and data miningSnehali Chake
 
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & KamberChapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & Kambererror007
 
Data Mining and Data Warehousing
Data Mining and Data WarehousingData Mining and Data Warehousing
Data Mining and Data WarehousingAmdocs
 
introduction to data mining tutorial
introduction to data mining tutorial introduction to data mining tutorial
introduction to data mining tutorial Salah Amean
 
Data Mining and Business Intelligence Tools
Data Mining and Business Intelligence ToolsData Mining and Business Intelligence Tools
Data Mining and Business Intelligence ToolsMotaz Saad
 
Data Mining: Application and trends in data mining
Data Mining: Application and trends in data miningData Mining: Application and trends in data mining
Data Mining: Application and trends in data miningDataminingTools Inc
 
Ch 1 Intro to Data Mining
Ch 1 Intro to Data MiningCh 1 Intro to Data Mining
Ch 1 Intro to Data MiningSushil Kulkarni
 
Data mining slides
Data mining slidesData mining slides
Data mining slidessmj
 

En vedette (19)

Text mining presentation in Data mining Area
Text mining presentation in Data mining AreaText mining presentation in Data mining Area
Text mining presentation in Data mining Area
 
Data Mining: an Introduction
Data Mining: an IntroductionData Mining: an Introduction
Data Mining: an Introduction
 
Big Data v Data Mining
Big Data v Data MiningBig Data v Data Mining
Big Data v Data Mining
 
Analytics and Data Mining Industry Overview
Analytics and Data Mining Industry OverviewAnalytics and Data Mining Industry Overview
Analytics and Data Mining Industry Overview
 
Data Mining Overview
Data Mining OverviewData Mining Overview
Data Mining Overview
 
Data mining and_big_data_web
Data mining and_big_data_webData mining and_big_data_web
Data mining and_big_data_web
 
NumPy and SciPy for Data Mining and Data Analysis Including iPython, SciKits,...
NumPy and SciPy for Data Mining and Data Analysis Including iPython, SciKits,...NumPy and SciPy for Data Mining and Data Analysis Including iPython, SciKits,...
NumPy and SciPy for Data Mining and Data Analysis Including iPython, SciKits,...
 
Lecture 01 Data Mining
Lecture 01 Data MiningLecture 01 Data Mining
Lecture 01 Data Mining
 
Knowledge Discovery and Data Mining
Knowledge Discovery and Data MiningKnowledge Discovery and Data Mining
Knowledge Discovery and Data Mining
 
Data warehousing and data mining
Data warehousing and data miningData warehousing and data mining
Data warehousing and data mining
 
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & KamberChapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
 
Data Mining and Data Warehousing
Data Mining and Data WarehousingData Mining and Data Warehousing
Data Mining and Data Warehousing
 
introduction to data mining tutorial
introduction to data mining tutorial introduction to data mining tutorial
introduction to data mining tutorial
 
Data Mining and Business Intelligence Tools
Data Mining and Business Intelligence ToolsData Mining and Business Intelligence Tools
Data Mining and Business Intelligence Tools
 
Data Mining: Application and trends in data mining
Data Mining: Application and trends in data miningData Mining: Application and trends in data mining
Data Mining: Application and trends in data mining
 
Ch 1 Intro to Data Mining
Ch 1 Intro to Data MiningCh 1 Intro to Data Mining
Ch 1 Intro to Data Mining
 
DATA WAREHOUSING AND DATA MINING
DATA WAREHOUSING AND DATA MININGDATA WAREHOUSING AND DATA MINING
DATA WAREHOUSING AND DATA MINING
 
Data mining
Data miningData mining
Data mining
 
Data mining slides
Data mining slidesData mining slides
Data mining slides
 

Plus de Pier Luca Lanzi

11 Settembre 2021 - Giocare con i Videogiochi
11 Settembre 2021 - Giocare con i Videogiochi11 Settembre 2021 - Giocare con i Videogiochi
11 Settembre 2021 - Giocare con i VideogiochiPier Luca Lanzi
 
Breve Viaggio al Centro dei Videogiochi
Breve Viaggio al Centro dei VideogiochiBreve Viaggio al Centro dei Videogiochi
Breve Viaggio al Centro dei VideogiochiPier Luca Lanzi
 
Global Game Jam 19 @ POLIMI - Morning Welcome
Global Game Jam 19 @ POLIMI - Morning WelcomeGlobal Game Jam 19 @ POLIMI - Morning Welcome
Global Game Jam 19 @ POLIMI - Morning WelcomePier Luca Lanzi
 
Data Driven Game Design @ Campus Party 2018
Data Driven Game Design @ Campus Party 2018Data Driven Game Design @ Campus Party 2018
Data Driven Game Design @ Campus Party 2018Pier Luca Lanzi
 
GGJ18 al Politecnico di Milano - Presentazione che precede la presentazione d...
GGJ18 al Politecnico di Milano - Presentazione che precede la presentazione d...GGJ18 al Politecnico di Milano - Presentazione che precede la presentazione d...
GGJ18 al Politecnico di Milano - Presentazione che precede la presentazione d...Pier Luca Lanzi
 
GGJ18 al Politecnico di Milano - Presentazione di apertura
GGJ18 al Politecnico di Milano - Presentazione di aperturaGGJ18 al Politecnico di Milano - Presentazione di apertura
GGJ18 al Politecnico di Milano - Presentazione di aperturaPier Luca Lanzi
 
Presentation for UNITECH event - January 8, 2018
Presentation for UNITECH event - January 8, 2018Presentation for UNITECH event - January 8, 2018
Presentation for UNITECH event - January 8, 2018Pier Luca Lanzi
 
DMTM Lecture 20 Data preparation
DMTM Lecture 20 Data preparationDMTM Lecture 20 Data preparation
DMTM Lecture 20 Data preparationPier Luca Lanzi
 
DMTM Lecture 19 Data exploration
DMTM Lecture 19 Data explorationDMTM Lecture 19 Data exploration
DMTM Lecture 19 Data explorationPier Luca Lanzi
 
DMTM Lecture 18 Graph mining
DMTM Lecture 18 Graph miningDMTM Lecture 18 Graph mining
DMTM Lecture 18 Graph miningPier Luca Lanzi
 
DMTM Lecture 17 Text mining
DMTM Lecture 17 Text miningDMTM Lecture 17 Text mining
DMTM Lecture 17 Text miningPier Luca Lanzi
 
DMTM Lecture 16 Association rules
DMTM Lecture 16 Association rulesDMTM Lecture 16 Association rules
DMTM Lecture 16 Association rulesPier Luca Lanzi
 
DMTM Lecture 15 Clustering evaluation
DMTM Lecture 15 Clustering evaluationDMTM Lecture 15 Clustering evaluation
DMTM Lecture 15 Clustering evaluationPier Luca Lanzi
 
DMTM Lecture 14 Density based clustering
DMTM Lecture 14 Density based clusteringDMTM Lecture 14 Density based clustering
DMTM Lecture 14 Density based clusteringPier Luca Lanzi
 
DMTM Lecture 13 Representative based clustering
DMTM Lecture 13 Representative based clusteringDMTM Lecture 13 Representative based clustering
DMTM Lecture 13 Representative based clusteringPier Luca Lanzi
 
DMTM Lecture 12 Hierarchical clustering
DMTM Lecture 12 Hierarchical clusteringDMTM Lecture 12 Hierarchical clustering
DMTM Lecture 12 Hierarchical clusteringPier Luca Lanzi
 
DMTM Lecture 11 Clustering
DMTM Lecture 11 ClusteringDMTM Lecture 11 Clustering
DMTM Lecture 11 ClusteringPier Luca Lanzi
 
DMTM Lecture 10 Classification ensembles
DMTM Lecture 10 Classification ensemblesDMTM Lecture 10 Classification ensembles
DMTM Lecture 10 Classification ensemblesPier Luca Lanzi
 
DMTM Lecture 09 Other classificationmethods
DMTM Lecture 09 Other classificationmethodsDMTM Lecture 09 Other classificationmethods
DMTM Lecture 09 Other classificationmethodsPier Luca Lanzi
 
DMTM Lecture 08 Classification rules
DMTM Lecture 08 Classification rulesDMTM Lecture 08 Classification rules
DMTM Lecture 08 Classification rulesPier Luca Lanzi
 

Plus de Pier Luca Lanzi (20)

11 Settembre 2021 - Giocare con i Videogiochi
11 Settembre 2021 - Giocare con i Videogiochi11 Settembre 2021 - Giocare con i Videogiochi
11 Settembre 2021 - Giocare con i Videogiochi
 
Breve Viaggio al Centro dei Videogiochi
Breve Viaggio al Centro dei VideogiochiBreve Viaggio al Centro dei Videogiochi
Breve Viaggio al Centro dei Videogiochi
 
Global Game Jam 19 @ POLIMI - Morning Welcome
Global Game Jam 19 @ POLIMI - Morning WelcomeGlobal Game Jam 19 @ POLIMI - Morning Welcome
Global Game Jam 19 @ POLIMI - Morning Welcome
 
Data Driven Game Design @ Campus Party 2018
Data Driven Game Design @ Campus Party 2018Data Driven Game Design @ Campus Party 2018
Data Driven Game Design @ Campus Party 2018
 
GGJ18 al Politecnico di Milano - Presentazione che precede la presentazione d...
GGJ18 al Politecnico di Milano - Presentazione che precede la presentazione d...GGJ18 al Politecnico di Milano - Presentazione che precede la presentazione d...
GGJ18 al Politecnico di Milano - Presentazione che precede la presentazione d...
 
GGJ18 al Politecnico di Milano - Presentazione di apertura
GGJ18 al Politecnico di Milano - Presentazione di aperturaGGJ18 al Politecnico di Milano - Presentazione di apertura
GGJ18 al Politecnico di Milano - Presentazione di apertura
 
Presentation for UNITECH event - January 8, 2018
Presentation for UNITECH event - January 8, 2018Presentation for UNITECH event - January 8, 2018
Presentation for UNITECH event - January 8, 2018
 
DMTM Lecture 20 Data preparation
DMTM Lecture 20 Data preparationDMTM Lecture 20 Data preparation
DMTM Lecture 20 Data preparation
 
DMTM Lecture 19 Data exploration
DMTM Lecture 19 Data explorationDMTM Lecture 19 Data exploration
DMTM Lecture 19 Data exploration
 
DMTM Lecture 18 Graph mining
DMTM Lecture 18 Graph miningDMTM Lecture 18 Graph mining
DMTM Lecture 18 Graph mining
 
DMTM Lecture 17 Text mining
DMTM Lecture 17 Text miningDMTM Lecture 17 Text mining
DMTM Lecture 17 Text mining
 
DMTM Lecture 16 Association rules
DMTM Lecture 16 Association rulesDMTM Lecture 16 Association rules
DMTM Lecture 16 Association rules
 
DMTM Lecture 15 Clustering evaluation
DMTM Lecture 15 Clustering evaluationDMTM Lecture 15 Clustering evaluation
DMTM Lecture 15 Clustering evaluation
 
DMTM Lecture 14 Density based clustering
DMTM Lecture 14 Density based clusteringDMTM Lecture 14 Density based clustering
DMTM Lecture 14 Density based clustering
 
DMTM Lecture 13 Representative based clustering
DMTM Lecture 13 Representative based clusteringDMTM Lecture 13 Representative based clustering
DMTM Lecture 13 Representative based clustering
 
DMTM Lecture 12 Hierarchical clustering
DMTM Lecture 12 Hierarchical clusteringDMTM Lecture 12 Hierarchical clustering
DMTM Lecture 12 Hierarchical clustering
 
DMTM Lecture 11 Clustering
DMTM Lecture 11 ClusteringDMTM Lecture 11 Clustering
DMTM Lecture 11 Clustering
 
DMTM Lecture 10 Classification ensembles
DMTM Lecture 10 Classification ensemblesDMTM Lecture 10 Classification ensembles
DMTM Lecture 10 Classification ensembles
 
DMTM Lecture 09 Other classificationmethods
DMTM Lecture 09 Other classificationmethodsDMTM Lecture 09 Other classificationmethods
DMTM Lecture 09 Other classificationmethods
 
DMTM Lecture 08 Classification rules
DMTM Lecture 08 Classification rulesDMTM Lecture 08 Classification rules
DMTM Lecture 08 Classification rules
 

Dernier

Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfSeasiaInfotech2
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 

Dernier (20)

DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdf
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 

Machine Learning and Data Mining: 19 Mining Text And Web Data

  • 1. Text and Web Mining Machine Learning and Data Mining (Unit 19) Prof. Pier Luca Lanzi
  • 2. References 2 Jiawei Han and Micheline Kamber, quot;Data Mining: Concepts and Techniquesquot;, The Morgan Kaufmann Series in Data Management Systems (Second Edition) Chapter 10, part 2 Web Mining Course by Gregory-Platesky Shapiro available at www.kdnuggets.com Prof. Pier Luca Lanzi
  • 3. Mining Text Data: An Introduction 3 Data Mining / Knowledge Discovery Structured Data Multimedia Free Text Hypertext <a href>Frank Rizzo Frank Rizzo bought HomeLoan ( </a> Bought his home from Lake Loanee: Frank Rizzo <a hef>this home</a> View Real Estate in Lender: MWF from <a href>Lake 1992. Agency: Lake View View Real Estate</a> He paid $200,000 Amount: $200,000 In <b>1992</b>. under a15-year loan Term: 15 years Loans($200K,[map],...) <p>... from MW Financial. ) Prof. Pier Luca Lanzi
  • 4. Bag-of-Tokens Approaches 4 Documents Token Sets Four score and seven nation – 5 years ago our fathers brought civil - 1 forth on this continent, a new war – 2 nation, conceived in Liberty, Feature men – 2 and dedicated to the Extraction died – 4 proposition that all men are people – 5 created equal. Liberty – 1 Now we are engaged in a God – 1 great civil war, testing … whether that nation, or … Loses all order-specific information! Severely limits context! Prof. Pier Luca Lanzi
  • 5. Natural Language Processing 5 A dog is chasing a boy on the playground Lexical analysis Det Noun Aux Verb Det Noun Prep Det Noun (part-of-speech tagging) Noun Phrase Noun Phrase Noun Phrase Complex Verb Prep Phrase Semantic analysis Verb Phrase Syntactic analysis Dog(d1). (Parsing) Boy(b1). Verb Phrase Playground(p1). Chasing(d1,b1,p1). Sentence + Scared(x) if Chasing(_,x,_). A person saying this may be reminding another person to get the dog back… Scared(b1) Pragmatic analysis Inference (speech act) (Taken from ChengXiang Zhai, CS 397cxz – Fall 2003) Prof. Pier Luca Lanzi
  • 6. General NLP—Too Difficult! 6 Word-level ambiguity “design” can be a noun or a verb (Ambiguous POS) “root” has multiple meanings (Ambiguous sense) Syntactic ambiguity “natural language processing” (Modification) “A man saw a boy with a telescope.” (PP Attachment) Anaphora resolution “John persuaded Bill to buy a TV for himself.” (himself = John or Bill?) Presupposition “He has quit smoking.” implies that he smoked before. Humans rely on context to interpret (when possible). This context may extend beyond a given document! (Taken from ChengXiang Zhai, CS 397cxz – Fall 2003) Prof. Pier Luca Lanzi
  • 7. Shallow Linguistics 7 English Lexicon Part-of-Speech Tagging Word Sense Disambiguation Phrase Detection / Parsing Prof. Pier Luca Lanzi
  • 8. WordNet 8 An extensive lexical network for the English language Contains over 138,838 words. Several graphs, one for each part-of-speech. Synsets (synonym sets), each defining a semantic sense. Relationship information (antonym, hyponym, meronym …) Downloadable for free (UNIX, Windows) Expanding to other languages (Global WordNet Association) Funded >$3 million, mainly government (translation interest) Founder George Miller, National Medal of Science, 1991. watery parched moist dry arid wet synonym antonym damp anhydrous Prof. Pier Luca Lanzi
  • 9. Part-of-Speech Tagging 9 Training data (Annotated text) This sentence serves as an example of annotated text… Det N V1 P Det N P V2 N This is a new sentence. POS Tagger “This is a new sentence.” Det Aux Det Adj N Pick the most1likely, ttag tsequence. p(w ,..., wk 1,..., k ) ⎧ p(t1 | w1 )... p(tk | wk ) p(w1 )... p(wk ) ⎪ p(w1 ,..., wk , t1 ,..., tk ) = ⎨ k ⎪∏ p(wi | ti ) p(ti | ti−1 ) Independent assignment ⎩ =1 ⎧ p(t1 | w1 )... p(tk | wk ) p(iw1 )... p(wk ) Most common tag ⎪ =⎨ k ⎪∏ p(wi | ti ) p(ti | ti −1 ) ⎩ i =1 Partial dependency (HMM) Prof. Pier Luca Lanzi
  • 10. Word Sense Disambiguation 10 ? “The difficulties of computational linguistics are rooted in ambiguity.” N Aux V P N Supervised Learning Features: Neighboring POS tags (N Aux V P N) Neighboring words (linguistics are rooted in ambiguity) Stemmed form (root) Dictionary/Thesaurus entries of neighboring words High co-occurrence words (plant, tree, origin,…) Other senses of word within discourse Algorithms: Rule-based Learning (e.g. IG guided) Statistical Learning (i.e. Naïve Bayes) Unsupervised Learning (i.e. Nearest Neighbor) (Adapted from ChengXiang Zhai, CS 397cxz – Fall 2003) Prof. Pier Luca Lanzi
  • 11. Parsing 11 (Adapted from ChengXiang Zhai, CS 397cxz – Fall 2003) Choose most likely parse tree… S Probability of this tree=0.000015 NP VP Probabilistic CFG Det S→ NP VP 1.0 BNP VP PP NP → Det BNP 0.3 NP → BNP A 0.4 N Aux V NP P NP 0.3 NP→ NP PP Grammar is chasing on BNP→ N dog … VP → V a boy the playground . VP → Aux V NP . . … VP → VP PP Probability of this tree=0.000011 S PP → P NP 1.0 NP VP V → chasing 0.01 Det NP BNP Aux V Aux→ is PP N → dog 0.003 A chasing NP N is N → boy Lexicon NP P N→ playground … dog a boy Det→ the on … Det→ a P → on the playground Prof. Pier Luca Lanzi
  • 12. Obstacles 12 Ambiguity A man saw a boy with a telescope. Computational Intensity Imposes a context horizon. Text Mining NLP Approach Locate promising fragments using fast IR methods (bag-of-tokens) Only apply slow NLP techniques to promising fragments Prof. Pier Luca Lanzi
  • 13. Summary: Shallow NLP 13 However, shallow NLP techniques are feasible and useful: Lexicon – machine understandable linguistic knowledge possible senses, definitions, synonyms, antonyms, typeof, etc. POS Tagging – limit ambiguity (word/POS), entity extraction WSD – stem/synonym/hyponym matches (doc and query) Query: “Foreign cars” Document: “I’m selling a 1976 Jaguar…” Parsing – logical view of information (inference?, translation?) A man saw a boy with a telescope.” Even without complete NLP, any additional knowledge extracted from text data can only be beneficial. Ingenuity will determine the applications. Prof. Pier Luca Lanzi
  • 14. References for NLP 14 C. D. Manning and H. Schutze, “Foundations of Natural Language Processing”, MIT Press, 1999. S. Russell and P. Norvig, “Artificial Intelligence: A Modern Approach”, Prentice Hall, 1995. S. Chakrabarti, “Mining the Web: Statistical Analysis of Hypertext and Semi-Structured Data”, Morgan Kaufmann, 2002. G. Miller, R. Beckwith, C. FellBaum, D. Gross, K. Miller, and R. Tengi. Five papers on WordNet. Princeton University, August 1993. C. Zhai, Introduction to NLP, Lecture Notes for CS 397cxz, UIUC, Fall 2003. M. Hearst, Untangling Text Data Mining, ACL’99, invited paper. http://www.sims.berkeley.edu/~hearst/papers/acl99/acl99- tdm.html R. Sproat, Introduction to Computational Linguistics, LING 306, UIUC, Fall 2003. A Road Map to Text Mining and Web Mining, University of Texas resource page. http://www.cs.utexas.edu/users/pebronia/text- mining/ Computational Linguistics and Text Mining Group, IBM Research, http://www.research.ibm.com/dssgrp/ Prof. Pier Luca Lanzi
  • 15. Text Databases and Information Retrieval 15 Text databases (document databases) Large collections of documents from various sources: news articles, research papers, books, digital libraries, E-mail messages, and Web pages, library database, etc. Data stored is usually semi-structured Traditional information retrieval techniques become inadequate for the increasingly vast amounts of text data Information retrieval A field developed in parallel with database systems Information is organized into (a large number of) documents Information retrieval problem: locating relevant documents based on user input, such as keywords or example documents Prof. Pier Luca Lanzi
  • 16. Information Retrieval (IR) 16 Typical Information Retrieval systems Online library catalogs Online document management systems Information retrieval vs. database systems Some DB problems are not present in IR, e.g., update, transaction management, complex objects Some IR problems are not addressed well in DBMS, e.g., unstructured documents, approximate search using keywords and relevance Prof. Pier Luca Lanzi
  • 17. Basic Measures for Text Retrieval 17 Relevant Relevant & Retrieved Retrieved All Documents Precision: the percentage of retrieved documents that are in fact relevant to the query (i.e., “correct” responses) | {Relevant} ∩ {Retrieved } | precision = | {Retrieved} | Recall: the percentage of documents that are relevant to the query and were, in fact, retrieved | {Relevant} ∩ {Retrieved } | precision = | {Relevant} | Prof. Pier Luca Lanzi
  • 18. Information Retrieval Techniques 18 Basic Concepts A document can be described by a set of representative keywords called index terms. Different index terms have varying relevance when used to describe document contents. This effect is captured through the assignment of numerical weights to each index term of a document. (e.g.: frequency, tf-idf) DBMS Analogy Index Terms Attributes Weights Attribute Values Prof. Pier Luca Lanzi
  • 19. Information Retrieval Techniques 19 Index Terms (Attribute) Selection: Stop list Word stem Index terms weighting methods Terms Documents Frequency Matrices Information Retrieval Models: Boolean Model Vector Model Probabilistic Model Prof. Pier Luca Lanzi
  • 20. Boolean Model 20 Consider that index terms are either present or absent in a document As a result, the index term weights are assumed to be all binaries A query is composed of index terms linked by three connectives: not, and, and or E.g.: car and repair, plane or airplane The Boolean model predicts that each document is either relevant or non-relevant based on the match of a document to the query Prof. Pier Luca Lanzi
  • 21. Keyword-Based Retrieval 21 A document is represented by a string, which can be identified by a set of keywords Queries may use expressions of keywords E.g., car and repair shop, tea or coffee, DBMS but not Oracle Queries and retrieval should consider synonyms, e.g., repair and maintenance Major difficulties of the model Synonymy: A keyword T does not appear anywhere in the document, even though the document is closely related to T, e.g., data mining Polysemy: The same keyword may mean different things in different contexts, e.g., mining Prof. Pier Luca Lanzi
  • 22. Similarity-Based Retrieval in Text Data 22 Finds similar documents based on a set of common keywords Answer should be based on the degree of relevance based on the nearness of the keywords, relative frequency of the keywords, etc. Basic techniques Stop list Set of words that are deemed “irrelevant”, even though they may appear frequently E.g., a, the, of, for, to, with, etc. Stop lists may vary when document set varies Prof. Pier Luca Lanzi
  • 23. Similarity-Based Retrieval in Text Data 23 Word stem Several words are small syntactic variants of each other since they share a common word stem E.g., drug, drugs, drugged A term frequency table Each entry frequent_table(i, j) = # of occurrences of the word ti in document di Usually, the ratio instead of the absolute number of occurrences is used Similarity metrics: measure the closeness of a document to a query (a set of keywords) Relative term occurrences Cosine distance: v1 ⋅ v2 sim(v1 , v2 ) = | v1 || v2 | Prof. Pier Luca Lanzi
  • 24. Indexing Techniques 24 Inverted index Maintains two hash- or B+-tree indexed tables: • document_table: a set of document records <doc_id, postings_list> • term_table: a set of term records, <term, postings_list> Answer query: Find all docs associated with one or a set of terms + easy to implement – do not handle well synonymy and polysemy, and posting lists could be too long (storage could be very large) Signature file Associate a signature with each document A signature is a representation of an ordered list of terms that describe the document Order is obtained by frequency analysis, stemming and stop lists Prof. Pier Luca Lanzi
  • 25. Vector Space Model 25 Documents and user queries are represented as m-dimensional vectors, where m is the total number of index terms in the document collection. The degree of similarity of the document d with regard to the query q is calculated as the correlation between the vectors that represent them, using measures such as the Euclidian distance or the cosine of the angle between these two vectors. Prof. Pier Luca Lanzi
  • 26. Latent Semantic Indexing (1) 26 Basic idea Similar documents have similar word frequencies Difficulty: the size of the term frequency matrix is very large Use a singular value decomposition (SVD) techniques to reduce the size of frequency table Retain the K most significant rows of the frequency table Method Create a term x document weighted frequency matrix A SVD construction: A = U * S * V’ Define K and obtain Uk ,, Sk , and Vk. Create query vector q’ . Project q’ into the term-document space: Dq = q’ * Uk * Sk-1 Calculate similarities: cos α = Dq . D / ||Dq|| * ||D|| Prof. Pier Luca Lanzi
  • 27. Latent Semantic Indexing (2) 27 Weighted Frequency Matrix Query Terms: - Insulation - Joint Prof. Pier Luca Lanzi
  • 28. Probabilistic Model 28 Basic assumption: Given a user query, there is a set of documents which contains exactly the relevant documents and no other (ideal answer set) Querying process as a process of specifying the properties of an ideal answer set. Since these properties are not known at query time, an initial guess is made This initial guess allows the generation of a preliminary probabilistic description of the ideal answer set which is used to retrieve the first set of documents An interaction with the user is then initiated with the purpose of improving the probabilistic description of the answer set Prof. Pier Luca Lanzi
  • 29. Types of Text Data Mining 29 Keyword-based association analysis Automatic document classification Similarity detection Cluster documents by a common author Cluster documents containing information from a common source Link analysis: unusual correlation between entities Sequence analysis: predicting a recurring event Anomaly detection: find information that violates usual patterns Hypertext analysis Patterns in anchors/links Anchor text correlations with linked objects Prof. Pier Luca Lanzi
  • 30. Keyword-Based Association Analysis 30 Motivation Collect sets of keywords or terms that occur frequently together and then find the association or correlation relationships among them Association Analysis Process Preprocess the text data by parsing, stemming, removing stop words, etc. Evoke association mining algorithms • Consider each document as a transaction • View a set of keywords in the document as a set of items in the transaction Term level association mining • No need for human effort in tagging documents • The number of meaningless results and the execution time is greatly reduced Prof. Pier Luca Lanzi
  • 31. Text Classification (1) 31 Motivation Automatic classification for the large number of on-line text documents (Web pages, e-mails, corporate intranets, etc.) Classification Process Data preprocessing Definition of training set and test sets Creation of the classification model using the selected classification algorithm Classification model validation Classification of new/unknown text documents Text document classification differs from the classification of relational data Document databases are not structured according to attribute-value pairs Prof. Pier Luca Lanzi
  • 32. Text Classification (2) 32 Classification Algorithms: Support Vector Machines K-Nearest Neighbors Naïve Bayes Neural Networks Decision Trees Association rule-based Boosting Prof. Pier Luca Lanzi
  • 33. Document Clustering 33 Motivation Automatically group related documents based on their contents No predetermined training sets or taxonomies Generate a taxonomy at runtime Clustering Process Data preprocessing: remove stop words, stem, feature extraction, lexical analysis, etc. Hierarchical clustering: compute similarities applying clustering algorithms. Model-Based clustering (Neural Network Approach): clusters are represented by “exemplars”. (e.g.: SOM) Prof. Pier Luca Lanzi
  • 34. Text Categorization 34 Pre-given categories and labeled document examples (Categories may form hierarchy) Classify new documents A standard classification (supervised learning ) problem Sports Categorization Business System Education … … Science Sports Business Education Prof. Pier Luca Lanzi
  • 35. Applications 35 News article classification Automatic email filtering Webpage classification Word sense disambiguation …… Prof. Pier Luca Lanzi
  • 36. Categorization Methods 36 Manual: Typically rule-based Does not scale up (labor-intensive, rule inconsistency) May be appropriate for special data on a particular domain Automatic: Typically exploiting machine learning techniques Vector space model based • Prototype-based (Rocchio) • K-nearest neighbor (KNN) • Decision-tree (learn rules) • Neural Networks (learn non-linear classifier) • Support Vector Machines (SVM) Probabilistic or generative model based • Naïve Bayes classifier Prof. Pier Luca Lanzi
  • 37. Vector Space Model 37 Represent a doc by a term vector Term: basic concept, e.g., word or phrase Each term defines one dimension N terms define a N-dimensional space Element of vector corresponds to term weight E.g., d = (x1,…,xN), xi is “importance” of term I New document is assigned to the most likely category based on vector similarity Prof. Pier Luca Lanzi
  • 38. Illustration of the Vector Space Model 38 Starbucks C2 Category 2 Category 3 C3 Java new doc C1 Category 1 Microsoft Prof. Pier Luca Lanzi
  • 39. What VS Model Does Not Specify? 39 How to select terms to capture “basic concepts” Word stopping • e.g. “a”, “the”, “always”, “along” Word stemming • e.g. “computer”, “computing”, “computerize” => “compute” Latent semantic indexing How to assign weights Not all words are equally important: Some are more indicative than others • e.g. “algebra” vs. “science” How to measure the similarity Prof. Pier Luca Lanzi
  • 40. How to Assign Weights 40 Two-fold heuristics based on frequency TF (Term frequency) More frequent within a document more relevant to semantics E.g., “query” vs. “commercial” IDF (Inverse document frequency) Less frequent among documents more discriminative E.g. “algebra” vs. “science” Prof. Pier Luca Lanzi
  • 41. Term Frequency Weighting 41 Weighting The more frequent, the more relevant to topic E.g. “query” vs. “commercial” Raw TF= f(t,d): how many times term t appears in doc d Normalization When document length varies, then relative frequency is preferred E.g., Maximum frequency normalization Prof. Pier Luca Lanzi
  • 42. Inverse Document Frequency Weighting 42 Idea: The less frequent terms among documents are the more discriminative Formula: Given n, total number of docs k, the number of docs with term t appearing (the DF document frequency) Prof. Pier Luca Lanzi
  • 43. TF-IDF Weighting 43 TF-IDF weighting: weight(t, d) = TF(t, d) * IDF(t) Frequent within doc high tf high weight Selective among docs high idf high weight Recall VS model Each selected term represents one dimension Each doc is represented by a feature vector Its t-term coordinate of document d is the TF-IDF weight This is more reasonable Just for illustration … Many complex and more effective weighting variants exist in practice Prof. Pier Luca Lanzi
  • 44. How to Measure Similarity? 44 Given two document Similarity definition dot product normalized dot product (or cosine) Prof. Pier Luca Lanzi
  • 45. Illustrative Example 45 text doc1 mining Sim(newdoc,doc1)=4.8*2.4+4.5*4.5 search engine Sim(newdoc,doc2)=2.4*2.4 text To whom is newdoc more similar? travel text Sim(newdoc,doc3)=0 map doc2 travel text mining travel map search engine govern president congress IDF(faked) 2.4 4.5 2.8 3.3 2.1 5.4 2.2 3.2 4.3 government doc1 2(4.8) 1(4.5) 1(2.1) 1(5.4) president doc2 1(2.4 ) 2 (5.6) 1(3.3) congress doc3 doc3 1 (2.2) 1(3.2) 1(4.3) newdoc 1(2.4) 1(4.5) …… Prof. Pier Luca Lanzi
  • 46. Vector Space Model-Based Classifiers 46 What do we have so far? A feature space with similarity measure This is a classic supervised learning problem Search for an approximation to classification hyper plane VS model based classifiers K-NN Decision tree based Neural networks Support vector machine Prof. Pier Luca Lanzi
  • 47. Probabilistic Model 47 Category C is modeled as a probability distribution of predefined random events Random events model the process of generating documents Therefore, how likely a document d belongs to category C is measured through the probability for category C to generate d. Prof. Pier Luca Lanzi
  • 48. Quick Revisit of Bayes’ Rule 48 Category Hypothesis space: H = {C1 , …, Cn} One document: D P ( D | Ci ) P(Ci ) P (Ci | D) = P( D) As we want to pick the most likely category C*, we can drop p(D) Posterior probability of Ci C* = arg max C P(C | D) = arg max C P ( D | C ) P(C ) Document model for category C Prof. Pier Luca Lanzi
  • 49. Probabilistic Model: Multi-Bernoulli 49 Event: word presence or absence D = (x1, …, x|V|) xi =1 for presence of word wi xi =0 for absence Parameters {p(wi=1|C), p(wi=0|C)} p(wi=1|C)+ p(wi=0|C)=1 |V | |V | |V | p( D = ( x1 ,..., x|V | ) | C )= ∏ p( wi = xi | C ) = ∏ ∏ p(wi = 1| C ) p( wi = 0 | C ) i =1 i =1, xi =1 i =1, xi =0 Prof. Pier Luca Lanzi
  • 50. Probabilistic Model: Multinomial 50 Event: word selection/sampling D = (n1, …, n|V|) ni: frequency of word wi n=n1,+…+ n|V| Parameters {p(wi|C)} p(w1|C)+… p(w|v||C) = 1 ⎛ ⎞ |V | n n1 ... n|V | ⎠ ∏ p ( D = ( n1 , ..., n|v | ) | C )= p ( n | C ) ⎜ p ( w i | C ) ni ⎟ ⎝ i =1 Prof. Pier Luca Lanzi
  • 51. Parameter Estimation 51 Category prior Training examples: | E (Ci ) | p (Ci ) = E(C2) E(C1) k ∑ | E (C ) | j j =1 C1 C2 Multi-Bernoulli Doc model Ck ∑ δ(w ,d)+0.5 ⎧1 if wj occursin d j d∈E(Ci ) δ(wj ,d) = ⎨ p(wj =1| Ci ) = | E(Ci )| +1 ⎩0 otherwise E(Ck) Multinomial doc model Vocabulary: V = {w1, …, w|V|} ∑ c(w ,d)+1 j d∈E(Ci ) p(wj | Ci ) = |V| c(wj ,d) =counts of wj in d ∑ ∑ c(w ,d)+|V | m m=1 d∈E(Ci ) Prof. Pier Luca Lanzi
  • 52. Classification of New Document 52 Multi-Bernoulli Multinomial d = (x1,..., x|V| ) x ∈{0,1 d = (n1,..., n|V| ) | d |= n = n1 +... + n|V| } C* = argmaxC P(D | C)P(C) C* = argmaxC P(D | C)P(C) |V| = argmaxC p(n | C)∏p(wi | C)ni P(C) |V| = argmaxC ∏p(wi = xi | C)P(C) i=1 i=1 |V| = argmaxC log p(n | C) + log p(C) + ∑ni log p(wi | C) |V| = argmaxC log p(C) + ∑log p(wi = xi | C) i=1 |V| ≈ argmaxC log p(C) + ∑ni log p(wi | C) i=1 i=1 Prof. Pier Luca Lanzi
  • 53. Categorization Methods 53 Vector space model K-NN Decision tree Neural network Support vector machine Probabilistic model Naïve Bayes classifier Many, many others and variants exist e.g. Bim, Nb, Ind, Swap-1, LLSF, Widrow-Hoff, Rocchio, Gis-W, … … Prof. Pier Luca Lanzi
  • 54. Evaluation (1) 54 Effectiveness measure Classic: Precision & Recall • Precision • Recall Prof. Pier Luca Lanzi
  • 55. Evaluation (2) 55 Benchmarks Classic: Reuters collection • A set of newswire stories classified under categories related to economics. Effectiveness Difficulties of strict comparison • different parameter setting • different “split” (or selection) between training and testing • various optimizations … … However widely recognizable • Best: Boosting-based committee classifier & SVM • Worst: Naïve Bayes classifier Need to consider other factors, especially efficiency Prof. Pier Luca Lanzi
  • 56. Summary: Text Categorization 56 Wide application domain Comparable effectiveness to professionals Manual Text Classification is not 100% and unlikely to improve substantially Automatic Text Classification is growing at a steady pace Prospects and extensions Very noisy text, such as text from O.C.R. Speech transcripts Prof. Pier Luca Lanzi
  • 57. Research Problems in Text Mining 57 Google: what is the next step? How to find the pages that match approximately the sophisticated documents, with incorporation of user-profiles or preferences? Look back of Google: inverted indicies Construction of indicies for the sophisticated documents, with incorporation of user-profiles or preferences Similarity search of such pages using such indicies Prof. Pier Luca Lanzi
  • 58. References for Text Mining 58 Fabrizio Sebastiani, “Machine Learning in Automated Text Categorization”, ACM Computing Surveys, Vol. 34, No.1, March 2002 Soumen Chakrabarti, “Data mining for hypertext: A tutorial survey”, ACM SIGKDD Explorations, 2000. Cleverdon, “Optimizing convenient online accesss to bibliographic databases”, Information Survey, Use4, 1, 37- 47, 1984 Yiming Yang, “An evaluation of statistical approaches to text categorization”, Journal of Information Retrieval, 1:67-88, 1999. Yiming Yang and Xin Liu “A re-examination of text categorization methods”. Proceedings of ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'99, pp 42--49), 1999. Prof. Pier Luca Lanzi
  • 59. World Wide Web: a brief history 59 Who invented the wheel is unknown Who invented the World-Wide Web ? (Sir) Tim Berners-Lee in 1989, while working at CERN, invented the World Wide Web, including URL scheme, HTML, and in 1990 wrote the first server and the first browser Mosaic browser developed by Marc Andreessen and Eric Bina at NCSA (National Center for Supercomputing Applications) in 1993; helped rapid web spread Mosaic was basis for Netscape … Prof. Pier Luca Lanzi
  • 60. What is Web Mining? 60 Discovering interesting and useful information from Web content and usage Examples Web search, e.g. Google, Yahoo, MSN, Ask, … Specialized search: e.g. Froogle (comparison shopping), job ads (Flipdog) eCommerce Recommendations (Netflix, Amazon, etc.) Improving conversion rate: next best product to offer Advertising, e.g. Google Adsense Fraud detection: click fraud detection, … Improving Web site design and performance Prof. Pier Luca Lanzi
  • 61. How does it differ from “classical” 61 Data Mining? The web is not a relation Textual information and linkage structure Usage data is huge and growing rapidly Google’s usage logs are bigger than their web crawl Data generated per day is comparable to largest conventional data warehouses Ability to react in real-time to usage patterns No human in the loop Reproduced from Ullman & Rajaraman with permission Prof. Pier Luca Lanzi
  • 62. How big is the Web? 62 The number of pages is technically, infinite Because of dynamically generated content Lots of duplication (30-40%) Best estimate of “unique” static HTML pages comes from search engine claims Google = 8 billion Yahoo = 20 billion Lots of marketing hype Reproduced from Ullman & Rajaraman with permission Prof. Pier Luca Lanzi
  • 63. Netcraft Survey: 63 76,184,000 web sites (Feb 2006) Netcraft survey http://news.netcraft.com/archives/web_server_survey.html Prof. Pier Luca Lanzi
  • 64. Mining the World-Wide Web 64 The WWW is huge, widely distributed, global information service center for Information services: news, advertisements, consumer information, financial management, education, government, e-commerce, etc. Hyper-link information Access and usage information WWW provides rich sources for data mining Challenges Too huge for effective data warehousing and data mining Too complex and heterogeneous: no standards and structure Prof. Pier Luca Lanzi
  • 65. Web Mining: A more challenging task 65 Searches for Web access patterns Web structures Regularity and dynamics of Web contents Problems The “abundance” problem Limited coverage of the Web: hidden Web sources, majority of data in DBMS Limited query interface based on keyword-oriented search Limited customization to individual users Prof. Pier Luca Lanzi
  • 66. Web Mining Taxonomy 66 Web Mining Web Content Web Structure Web Usage Mining Mining Mining Web Page General Access Customized Search Result Content Mining Pattern Tracking Usage Tracking Mining Prof. Pier Luca Lanzi
  • 67. Mining the World-Wide Web 67 Web Mining Web Content Mining Web Structure Web Usage Mining Mining Web Page Content Mining Web Page Summarization General Access Customized Search Result WebLog , WebOQL …: Pattern Tracking Usage Tracking Mining Web Structuring query languages; Can identify information within given web pages Ahoy!:Uses heuristics to distinguish personal home pages from other web pages ShopBot: Looks for product prices within web pages Prof. Pier Luca Lanzi
  • 68. Mining the World-Wide Web 68 Web Mining Web Content Web Structure Mining Web Usage Mining Mining Web Page Content Mining Search Result Mining General Access Customized Pattern Tracking Usage Tracking Search Engine Result Summarization Clustering Search Result : Categorizes documents using phrases in titles and snippets Prof. Pier Luca Lanzi
  • 69. Mining the World-Wide Web 69 Web Mining Web Content Web Usage Mining Mining Web Structure Mining Using Links PageRank CLEVER Use interconnections between web pages to give General Access Search Result weight to pages. Pattern Tracking Mining Using Generalization Web Page Customized MLDB, VWV Content Mining Usage Tracking Uses a multi-level database representation of the Web. Counters (popularity) and link lists are used for capturing structure. Prof. Pier Luca Lanzi
  • 70. Mining the World-Wide Web 70 Web Mining Web Content Web Structure Web Usage Mining Mining Mining General Access Pattern Tracking Web Page Customized Content Mining Usage Tracking Web Log Mining Search Result Uses KDD techniques to understand Mining general access patterns and trends. Can shed light on better structure and grouping of resource providers. Prof. Pier Luca Lanzi
  • 71. Mining the World-Wide Web 71 Web Mining Web Content Web Structure Web Usage Mining Mining Mining Customized Usage Tracking Web Page General Access Content Mining Pattern Tracking Adaptive Sites Analyzes access patterns of each user at a Search Result time. Web site restructures itself automatically Mining by learning from user access patterns. Prof. Pier Luca Lanzi
  • 72. Web Usage Mining 72 Mining Web log records to discover user access patterns of Web pages Applications Target potential customers for electronic commerce Enhance the quality and delivery of Internet information services to the end user Improve Web server system performance Identify potential prime advertisement locations Web logs provide rich information about Web dynamics Typical Web log entry includes the URL requested, the IP address from which the request originated, and a timestamp Prof. Pier Luca Lanzi
  • 73. Web Usage Mining 73 Understanding is a pre-requisite to improvement 1 Google, but 70,000,000+ web sites What Applications? Simple and Basic • Monitor performance, bandwidth usage • Catch errors (404 errors- pages not found) • Improve web site design (shortcuts for frequent paths, remove links not used, etc) •… Advanced and Business Critical • eCommerce: improve conversion, sales, profit • Fraud detection: click stream fraud, … •… Prof. Pier Luca Lanzi
  • 74. Techniques for Web usage mining 74 Construct multidimensional view on the Weblog database Perform multidimensional OLAP analysis to find the top N users, top N accessed Web pages, most frequently accessed time periods, etc. Perform data mining on Weblog records Find association patterns, sequential patterns, and trends of Web accessing May need additional information,e.g., user browsing sequences of the Web pages in the Web server buffer Conduct studies to Analyze system performance, improve system design by Web caching, Web page prefetching, and Web page swapping Prof. Pier Luca Lanzi
  • 75. References 75 Deng Cai, Shipeng Yu, Ji-Rong Wen and Wei-Ying Ma, “Extracting Content Structure for Web Pages based on Visual Representation”, The Fifth Asia Pacific Web Conference, 2003. Deng Cai, Shipeng Yu, Ji-Rong Wen and Wei-Ying Ma, “VIPS: a Vision-based Page Segmentation Algorithm”, Microsoft Technical Report (MSR-TR-2003-79), 2003. Shipeng Yu, Deng Cai, Ji-Rong Wen and Wei-Ying Ma, “Improving Pseudo-Relevance Feedback in Web Information Retrieval Using Web Page Segmentation”, 12th International World Wide Web Conference (WWW2003), May 2003. Ruihua Song, Haifeng Liu, Ji-Rong Wen and Wei-Ying Ma, “Learning Block Importance Models for Web Pages”, 13th International World Wide Web Conference (WWW2004), May 2004. Deng Cai, Shipeng Yu, Ji-Rong Wen and Wei-Ying Ma, “Block-based Web Search”, SIGIR 2004, July 2004 . Deng Cai, Xiaofei He, Ji-Rong Wen and Wei-Ying Ma, “Block-Level Link Analysis”, SIGIR 2004, July 2004 . Deng Cai, Xiaofei He, Wei-Ying Ma, Ji-Rong Wen and Hong-Jiang Zhang, “Organizing WWW Images Based on The Analysis of Page Layout and Web Link Structure”, The IEEE International Conference on Multimedia and EXPO (ICME'2004) , June 2004 Deng Cai, Xiaofei He, Zhiwei Li, Wei-Ying Ma and Ji-Rong Wen, “Hierarchical Clustering of WWW Image Search Results Using Visual, Textual and Link Analysis”,12th ACM International Conference on Multimedia, Oct. 2004 . Prof. Pier Luca Lanzi