Search Engine Technology (1)

   Prof. Dragomir R. Radev
   radev@cs.columbia.edu
SET FALL 2009

2.Introduction
Examples of search engines
• Conventional (library catalog).
  Search by keyword, title, author, etc.
• Text-based (Lexis-Nexis, Google, Yahoo!).
  Search by keywords. Limited search using queries in natural language.
• Multimedia (QBIC, WebSeek, SaFe)
  Search by visual appearance (shapes, colors,… ).
• Question answering systems (Ask, NSIR, Answerbus)
  Search in (restricted) natural language
• Clustering systems (Vivísimo, Clusty)
• Research systems (Lemur, Nutch)
What does it take to build a search engine?
•   Decide what to index
•   Collect it
•   Index it (efficiently)
•   Keep the index up to date
•   Provide user-friendly query facilities
What else?
• Understand the structure of the web for
  efficient crawling
• Understand user information needs
• Preprocess text and other unstructured
  data
• Cluster data
• Classify data
• Evaluate performance
Goals of the course
•   Understand how search engines work
•   Understand the limits of existing search technology
•   Learn to appreciate the sheer size of the Web
•   Learn to write code for text indexing and retrieval
•   Learn about the state of the art in IR research
•   Learn to analyze textual and semi-structured data sets
•   Learn to appreciate the diversity of texts on the Web
•   Learn to evaluate information retrieval
•   Learn about standardized document collections
•   Learn about text similarity measures
•   Learn about semantic dimensionality reduction
•   Learn about the idiosyncrasies of hyperlinked document collections
•   Learn about web crawling
•   Learn to use existing software
•   Understand the dynamics of the Web by building appropriate mathematical models
•   Build working systems that assist users in finding useful information on the Web
Course logistics
• Thursdays 6:10-8:00
• Office hours: TBA
• URL: http://www.cs.columbia.edu/~cs6998

• Instructor: Dragomir Radev
• Email: radev@cs.columbia.edu
• TA:
  – Yves Petinot (ypetinot@cs.columbia.edu)
  – Kaushal Lahankar (knl2102@columbia.edu)
Course outline
• Classic document retrieval: storing,
  indexing, retrieval.
• Web retrieval: crawling, query processing.
• Text and web mining: classification,
  clustering.
• Network analysis: random graph models,
  centrality, diameter and clustering
  coefficient.
Syllabus
•   Introduction.
•   Queries and Documents. Models of Information retrieval. The
    Boolean model. The Vector model.
•   Document preprocessing. Tokenization. Stemming. The Porter
    algorithm. Storing, indexing and searching text. Inverted indexes.
•   Word distributions. The Zipf distribution. The Benford distribution.
    Heaps' law. TF*IDF. Vector space similarity and ranking.
•   Retrieval evaluation. Precision and Recall. F-measure. Reference
    collections. The TREC conferences.
•   Automated indexing/labeling. Compression and coding. Optimal
    codes.
•   String matching. Approximate matching.
•   Query expansion. Relevance feedback.
•   Text classification. Naive Bayes. Feature selection. Decision
    trees.
Syllabus
•   Linear classifiers. k-nearest neighbors. Perceptron. Kernel
    methods. Maximum-margin classifiers. Support vector machines.
    Semi-supervised learning.
•   Lexical semantics and WordNet.
•   Latent semantic indexing. Singular value decomposition.
•   Vector space clustering. k-means clustering. EM clustering.
•   Random graph models. Properties of random graphs: clustering
    coefficient, betweenness, diameter, giant connected component,
    degree distribution.
•   Social network analysis. Small worlds and scale-free networks.
    Power law distributions. Centrality.
•   Graph-based methods. Harmonic functions. Random walks.
•   PageRank. Hubs and authorities. Bipartite graphs. HITS.
•   Models of the Web.
Syllabus
•   Crawling the web. Webometrics. Measuring the size of the web. The Bow-tie
    method.
•   Hypertext retrieval. Web-based IR. Document closures. Focused crawling.
•   Question answering
•   Burstiness. Self-triggerability
•   Information extraction
•   Adversarial IR. Human behavior on the web.
•   Text summarization

POSSIBLE TOPICS

•   Discovering communities, spectral clustering
•   Semi-supervised retrieval
•   Natural language processing. XML retrieval. Text tiling. Human behavior on the
    web.
Readings
• required: Introduction to Information Retrieval by Manning,
  Schuetze, and Raghavan (
  http://www-csli.stanford.edu/~schuetze/information-re
  ), freely available, hard copy for sale
• optional: Modeling the Internet and the Web:
  Probabilistic Methods and Algorithms by Pierre
  Baldi, Paolo Frasconi, Padhraic Smyth, Wiley,
  2003, ISBN: 0-470-84906-1 (
  http://ibook.ics.uci.edu).
• papers from SIGIR, WWW and journals (to be
  announced in class).
Prerequisites
• Linear algebra: vectors and matrices.
• Calculus: finding extrema of functions.
• Probabilities: random variables, discrete and
  continuous distributions, Bayes' theorem.
• Programming: experience with at least one web-
  aware programming language such as Perl
  (highly recommended) or Java in a UNIX
  environment.
• Required CS account
Course requirements
• Three assignments (30%)
  – Some of them will be in Perl. The rest can be done in
    any appropriate language. All will involve some data
    analysis and evaluation
• Final project (30%)
  – Research paper or software system.
• Class participation (10%)
• Final exam (30%)
Final project format
• Research paper - using the SIGIR format.
  Students will be in charge of problem
  formulation, literature survey, hypothesis
  formulation, experimental design,
  implementation, and possibly submission to a
  conference like SIGIR or WWW.
• Software system - develop a working system or
  API. Students will be responsible for identifying a
  niche problem, implementing a solution and deploying it,
  either on the Web or as an open-source
  downloadable tool. The system can be either
  stand alone or an extension to an existing one.
Project ideas
•   Build a question answering system.
•   Build a language identification system.
•   Social network analysis from the Web.
•   Participate in the Netflix challenge.
•   Query log analysis.
•   Build models of Web evolution.
•   Information diffusion in blogs or web.
•   Author-topic models of web pages.
•   Using the web for machine translation.
•   Building evolving models of web documents.
•   News recommendation system.
•   Compress the text of Wikipedia (losslessly).
•   Spelling correction using query logs.
•   Automatic query expansion.
List of projects from the past
•   Document Closures for Indexing
•   Tibet - Table Structure Recognition Library
•   Ruby Blog Memetracker
•   Sentence decomposition for more accurate information retrieval
•   Extracting Social Networks from LiveJournal
•   Google Suggest Programming Project (Java Swing Client and Lucene Bac
•   Leveraging Social Networks for Organizing and Browsing Shared Photogr
•   Media Bias and the Political Blogosphere
•   Measuring Similarity between search queries
•   Extracting Social Networks and Information about the people within them
•   LSI + dependency trees
Available corpora
•   Netflix challenge                          •   Timebank
•   AOL query logs                             •   Wikipedia
•   Blogs                                      •   wt2g/wt10g/wt100g
•   Bio papers                                 •   dotgov
•   AAN                                        •   RTE
•   Email                                      •   Paraphrases
•   Generifs                                   •   GENIA
•   Web pages                                  •   Generifs
•   Political science corpus                   •   Hansards
•   VAST                                       •   IMDB
•   del.icio.us                                •   MTA/MTC
•   SMS                                        •   nie
•   News data: aquaint, tdt, nantc, reuters,   •   cnnsumm
    setimes, trec, tipster                     •   Poliblog
•   Europarl multilingual                      •   Sentiment
•   US congressional data                      •   xml
•   DMOZ                                       •   epinions
•   Pubmedcentral                              •   Enron
•   DUC/TAC
Related courses elsewhere
• Stanford (Chris Manning, Prabhakar Raghavan, and
  Hinrich Schuetze)
• Cornell (Jon Kleinberg)
• CMU (Yiming Yang and Jamie Callan)
• UMass (James Allan)
• UTexas (Ray Mooney)
• Illinois (Chengxiang Zhai)
• Johns Hopkins (David Yarowsky)
•   For a long list of courses related to Search Engines, Natural Language
    Processing, Machine Learning look here:

    http://tangra.si.umich.edu/clair/clair/courses.html
2. Models of Information retrieval
   The Vector model
   The Boolean model
The web is really large
• 100 B pages
• Dynamically generated content
• New pages get added all the time
• Technorati has 50M+ blogs
• The size of the blogosphere doubles every
  6 months
• Yahoo deals with 12TB of data per day
  (according to Ron Brachman)
Sample queries (from Excite)
In what year did baseball become an offical sport?
play station codes . com
birth control and depression
government
"WorkAbility I"+conference
kitchen appliances
where can I find a chines rosewood
tiger electronics
58 Plymouth Fury
How does the character Seyavash in Ferdowsi's Shahnameh exhibit characteristics of a
hero?
emeril Lagasse
Hubble
M.S Subalaksmi
running
Fun things to do with search engines
• Googlewhack
• Reduce document set size to 1
• Find query that will bring given URL in the
  top 10
Key Terms Used in IR

• QUERY: a representation of what the user is looking
  for - can be a list of words or a phrase.
• DOCUMENT: an information entity that the user
  wants to retrieve
• COLLECTION: a set of documents
• INDEX: a representation of information that makes
  querying easier
• TERM: word or concept that appears in a document
  or a query
Mappings and abstractions

[Figure: diagram relating Reality to Data and Information need to Query. From Robert Korfhage’s book.]
Documents
• Not just printed paper
• Can be records, pages, sites, images,
  people, movies
• Document encoding (Unicode)
• Document representation
• Document preprocessing
Sample query sessions (from AOL)
• toley spies grames
  tolley spies games
  totally spies games
• tajmahal restaurant brooklyn ny
  taj mahal restaurant brooklyn ny
  taj mahal restaurant brooklyn ny 11209
• do you love me like you say
  do you love me like you say lyrics
  do you love me like you say lyrics marvin gaye
            M: /data4/corpora/AOL-user-ct-collection
Characteristics of user queries
• Sessions: users revisit their queries.
• Very short queries: typically 2 words long.
• A large number of typos.
• A small number of popular queries. A long
  tail of infrequent ones.
• Almost no use of advanced query
  operators with the exception of double
  quotes
Queries as documents
• Advantages:
  – Mathematically easier to manage


• Problems:
  – Different lengths
  – Syntactic differences
  – Repetitions of words (or lack thereof)
Document representations
• Term-document matrix (m x n)
• Document-document matrix (n x n)
• Typical example in a medium-sized
  collection: 3,000,000 documents (n) with
  50,000 terms (m)
• Typical example on the Web:
  n=30,000,000,000, m=1,000,000
• Boolean vs. integer-valued matrices
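
To make the two matrix flavors concrete, here is a minimal sketch that builds a term-document matrix for a tiny collection. The three documents are hypothetical (they are not from the slides), and plain Python lists stand in for whatever matrix type a real system would use.

```python
from collections import Counter

# Hypothetical toy collection, used only for illustration.
docs = ["cleveland ohio", "ohio ohio michigan", "michigan"]

# Rows are terms (m of them), columns are documents (n of them).
terms = sorted({t for d in docs for t in d.split()})
counts = [Counter(d.split()) for d in docs]

# Integer-valued matrix: entry [i][j] is how often term i occurs in document j.
tf_matrix = [[c[t] for c in counts] for t in terms]

# Boolean matrix: entry [i][j] is 1 if term i occurs in document j at all.
bool_matrix = [[int(v > 0) for v in row] for row in tf_matrix]

for term, row in zip(terms, tf_matrix):
    print(f"{term:10s} {row}")
```

The Boolean version only records presence, so the repeated "ohio" in the second document is visible in the integer-valued matrix but not in the Boolean one.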
Storage issues
• Imagine a medium-sized collection with
  n=3,000,000 and m=50,000
• How large a term-document matrix will be
  needed?
• Is there any way to do better? Any
  heuristic?
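
One way to approach the question is to do the arithmetic. The sketch below assumes a Boolean matrix stored at one bit per cell, and then compares it with storing only the non-zero entries. The figures of 200 distinct terms per document and 4 bytes per stored entry are assumptions chosen for illustration, not numbers from the slides.

```python
n_docs = 3_000_000   # n, from the slide
n_terms = 50_000     # m, from the slide

# Dense Boolean matrix: one bit per (term, document) cell.
cells = n_docs * n_terms
dense_bytes = cells // 8
dense_gb = dense_bytes / 1e9

# Sparse alternative: store only the cells that are 1.
# ASSUMPTION: about 200 distinct terms per document, 4 bytes per stored entry.
avg_distinct_terms = 200
bytes_per_entry = 4
sparse_bytes = n_docs * avg_distinct_terms * bytes_per_entry
sparse_gb = sparse_bytes / 1e9

print(f"dense:  {cells:,} cells, about {dense_gb} GB at 1 bit per cell")
print(f"sparse: about {sparse_gb} GB under the stated assumptions")
```

Under these assumptions almost all cells are zero, which is the observation that motivates storing postings instead of full incidence vectors.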
Inverted index
• Instead of an incidence vector, use a
  posting table
• CLEVELAND: D1, D2, D6
• OHIO: D1, D5, D6, D7
• Use linked lists to be able to insert new
  document postings in order and to remove
  existing postings.
• Keep everything sorted! This gives you a
  logarithmic improvement in access.
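
A minimal sketch of a posting table follows. It uses sorted Python lists rather than the linked lists mentioned above, and the document texts are hypothetical, chosen so that the postings match the CLEVELAND and OHIO example. The helper names `build_index` and `add_posting` are made up for this illustration.

```python
import bisect
from collections import defaultdict

def build_index(docs):
    """Map each term to a sorted list of IDs of the documents containing it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):                 # visit IDs in order, so lists stay sorted
        for term in set(docs[doc_id].upper().split()):
            index[term].append(doc_id)
    return index

def add_posting(index, term, doc_id):
    """Insert doc_id into the term's postings, keeping them sorted and unique."""
    postings = index[term]
    pos = bisect.bisect_left(postings, doc_id)  # binary search for the insert position
    if pos == len(postings) or postings[pos] != doc_id:
        postings.insert(pos, doc_id)

# Hypothetical documents D1, D2, D5, D6, D7.
docs = {1: "cleveland ohio", 2: "cleveland", 5: "ohio",
        6: "cleveland ohio", 7: "ohio"}
index = build_index(docs)
print(index["CLEVELAND"], index["OHIO"])
add_posting(index, "OHIO", 3)
print(index["OHIO"])
```

Keeping each list sorted is what allows the binary search in `add_posting`; a Python list still pays a linear cost for the insertion itself, which is one reason the slide suggests linked lists.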
Basic operations on inverted indexes
• Conjunction (AND) – iterative merge of the
  two postings: O(x+y)
• Disjunction (OR) – very similar
• Negation (NOT) – can we still do it in
  O(x+y)?
  – Example: MICHIGAN AND NOT OHIO
  – Example: MICHIGAN OR NOT OHIO
• Recursive operations
• Optimization: start with the smallest sets
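
The iterative merge can be sketched as below, assuming each postings list is a sorted Python list of document IDs. The function names are illustrative. Both functions advance through the two lists once, so the work is proportional to x+y.

```python
def intersect(a, b):
    """A AND B: walk both sorted lists in step, keeping IDs present in both."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def and_not(a, b):
    """A AND NOT B: keep IDs from a that do not occur in b."""
    out = []
    j = 0
    for doc_id in a:
        while j < len(b) and b[j] < doc_id:
            j += 1                      # skip entries of b smaller than doc_id
        if j == len(b) or b[j] != doc_id:
            out.append(doc_id)
    return out

cleveland = [1, 2, 6]     # postings from the inverted index slide
ohio = [1, 5, 6, 7]
print(intersect(cleveland, ohio))
print(and_not(cleveland, ohio))
```

A query of the form A OR NOT B is different in kind: NOT B alone is the complement of a postings list, so its size depends on the whole collection rather than on x and y.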
Major IR models
•   Boolean
•   Vector
•   Probabilistic
•   Language modeling
•   Fuzzy retrieval
•   Latent semantic indexing
The Boolean model
Venn diagrams
[Figure: Venn diagram with regions x, w, y, z and document sets D1, D2]
Boolean queries
• Operators: AND, OR, NOT, parentheses
• Example:
  – CLEVELAND AND NOT OHIO
  – (MICHIGAN AND INDIANA) OR (TEXAS AND
    OKLAHOMA)
• Ambiguous uses of AND and OR in
  human language
  – Exclusive vs. inclusive OR
  – Restrictive operator: AND or OR?
Canonical forms of queries
• De Morgan’s Laws:

      NOT (A AND B) = (NOT A) OR (NOT B)

      NOT (A OR B) = (NOT A) AND (NOT B)


• Normal forms
  – Conjunctive normal form (CNF)
  – Disjunctive normal form (DNF)
  – Reference librarians prefer CNF - why?
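
De Morgan's laws involve only two variables, so they can be checked exhaustively. The short sketch below does that by brute force over all four truth assignments; the function name is made up for this illustration.

```python
from itertools import product

def de_morgan_holds():
    """Check both laws for every combination of truth values of A and B."""
    for a, b in product([False, True], repeat=2):
        if (not (a and b)) != ((not a) or (not b)):
            return False
        if (not (a or b)) != ((not a) and (not b)):
            return False
    return True

print(de_morgan_holds())
```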
Evaluating Boolean queries
• Incidence vectors:
  – CLEVELAND: 1100010
  – OHIO: 1000111
• Examples:
  – CLEVELAND AND OHIO
  – CLEVELAND AND NOT OHIO
  – CLEVELAND OR OHIO
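
With incidence vectors, each Boolean operator becomes a bitwise operation. The sketch below stores the two vectors from this slide as Python integers and assumes the leftmost bit stands for the first document; the helper `bits` and the variable names are illustrative.

```python
N_DOCS = 7
MASK = (1 << N_DOCS) - 1          # keeps results to 7 bits after a NOT

cleveland = int("1100010", 2)
ohio = int("1000111", 2)

def bits(x):
    """Format an incidence vector as a 7-character bit string."""
    return format(x & MASK, f"0{N_DOCS}b")

result_and = bits(cleveland & ohio)       # CLEVELAND AND OHIO
result_and_not = bits(cleveland & ~ohio)  # CLEVELAND AND NOT OHIO
result_or = bits(cleveland | ohio)        # CLEVELAND OR OHIO

print(result_and, result_and_not, result_or)
```

The mask matters for NOT: Python integers are unbounded, so `~ohio` is negative until it is cut back to the seven bits that correspond to real documents.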
Exercise
• D1 = “computer information retrieval”
• D2 = “computer retrieval”
• D3 = “information”
• D4 = “computer information”


• Q1 = “information AND retrieval”
• Q2 = “information AND NOT computer”
Exercise
                0
                1                                             Swift
                2                               Shakespeare
                3                               Shakespeare   Swift
                4                     Milton
                5                     Milton                  Swift
                6                     Milton    Shakespeare
                7                     Milton    Shakespeare   Swift
                8        Chaucer
                9        Chaucer                              Swift
                10       Chaucer                Shakespeare
                11       Chaucer                Shakespeare   Swift
                12       Chaucer      Milton
                13       Chaucer      Milton                  Swift
                14       Chaucer      Milton    Shakespeare
                15       Chaucer      Milton    Shakespeare   Swift

((chaucer OR milton) AND (NOT swift)) OR ((NOT chaucer) AND (swift OR shakespeare))
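
A hand-derived answer to this exercise can be checked mechanically. The sketch below assumes the row number is read as a 4-bit binary code (Chaucer = 8, Milton = 4, Shakespeare = 2, Swift = 1), which matches the table above, and evaluates the query on every row. The function names are illustrative.

```python
AUTHORS = ("chaucer", "milton", "shakespeare", "swift")

def row_authors(row):
    """Return the set of authors present in a row, reading the row number as 4 bits."""
    return {name for k, name in enumerate(AUTHORS) if row & (8 >> k)}

def matches(row):
    """Evaluate the query from the exercise on one row."""
    present = row_authors(row)
    chaucer, milton, shakespeare, swift = (name in present for name in AUTHORS)
    return ((chaucer or milton) and not swift) or \
           ((not chaucer) and (swift or shakespeare))

matching_rows = [row for row in range(16) if matches(row)]
print(matching_rows)
```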
How to deal with?
• Multi-word phrases?
• Document ranking?
The Vector model
[Figure: Doc 1, Doc 2 and Doc 3 drawn as vectors in a space whose axes are Term 1, Term 2 and Term 3]
Vector queries

• Each document is represented as a vector
• Inefficient representation
• Dimensional compatibility
     W1   W2   W3   W4   W5   W6   W7   W8   W9   W10

     C1   C2   C3   C4   C5   C6   C7   C8   C9   C10
The matching process
• Document space
• Matching is done between a document
  and a query (or between two documents)
• Distance vs. similarity measures.
• Euclidean distance, Manhattan distance,
  Word overlap, Jaccard coefficient, etc.
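
The two distance measures named above can be sketched as follows, assuming documents and queries are equal-length lists of term weights. The example vectors are hypothetical. Unlike a similarity, a distance is 0 for identical vectors and grows as they move apart.

```python
import math

def euclidean(x, y):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

# Hypothetical two-term vectors.
p, q = [0, 0], [3, 4]
print(euclidean(p, q), manhattan(p, q))
```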
Miscellaneous similarity measures
• The Cosine measure (normalized dot product)

$$ \sigma(D,Q) = \frac{X \cdot Y}{|X|\,|Y|} = \frac{\sum_i d_i q_i}{\sqrt{\sum_i d_i^2}\,\sqrt{\sum_i q_i^2}} $$

• The Jaccard coefficient
$$ \sigma(D,Q) = \frac{|X \cap Y|}{|X \cup Y|} $$
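
Both measures are short to write down in code. The sketch below assumes vectors are equal-length lists of term weights for the cosine, and that documents are treated as sets of terms for the Jaccard coefficient. The inputs are hypothetical, and both functions assume non-empty, non-zero inputs.

```python
import math

def cosine(x, y):
    """Normalized dot product of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def jaccard(x, y):
    """Size of the intersection over size of the union of two term sets."""
    x, y = set(x), set(y)
    return len(x & y) / len(x | y)

# Hypothetical inputs.
print(cosine([1, 0, 1], [1, 1, 0]))
print(jaccard({"computer", "information", "retrieval"}, {"computer", "retrieval"}))
```

The cosine depends only on the angle between the vectors, so scaling a vector by a positive constant leaves its score unchanged, whereas Euclidean and Manhattan distances are sensitive to vector length.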
Exercise
• Compute the cosine scores σ (D1,D2) and
  σ (D1,D3) for the documents: D1 = <1,3>,
  D2 = <100,300> and D3 = <3,1>
• Compute the corresponding Euclidean
  distances, Manhattan distances, and
  Jaccard coefficients.
Readings
• (1): MRS1, MRS2, MRS5 (Zipf)
• (2): MRS7, MRS8

  • 10. Goals of the course • Understand how search engines work • Understand the limits of existing search technology • Learn to appreciate the sheer size of the Web • Learn to write code for text indexing and retrieval • Learn about the state of the art in IR research • Learn to analyze textual and semi-structured data sets • Learn to appreciate the diversity of texts on the Web • Learn to evaluate information retrieval • Learn about standardized document collections • Learn about text similarity measures • Learn about semantic dimensionality reduction • Learn about the idiosyncrasies of hyperlinked document collections • Learn about web crawling • Learn to use existing software • Understand the dynamics of the Web by building appropriate mathematical models • Build working systems that assist users in finding useful information on the Web
  • 11. Course logistics • Thursdays 6:10-8:00 • Office hours: TBA • URL: http://www.cs.columbia.edu/~cs6998 • Instructor: Dragomir Radev • Email: radev@cs.columbia.edu • TA: – Yves Petinot (ypetinot@cs.columbia.edu) – Kaushal Lahankar (knl2102@columbia.edu)
  • 12. Course outline • Classic document retrieval: storing, indexing, retrieval. • Web retrieval: crawling, query processing. • Text and web mining: classification, clustering. • Network analysis: random graph models, centrality, diameter and clustering coefficient.
  • 13. Syllabus • Introduction. • Queries and Documents. Models of Information retrieval. The Boolean model. The Vector model. • Document preprocessing. Tokenization. Stemming. The Porter algorithm. Storing, indexing and searching text. Inverted indexes. • Word distributions. The Zipf distribution. The Benford distribution. Heaps' law. TF*IDF. Vector space similarity and ranking. • Retrieval evaluation. Precision and Recall. F-measure. Reference collections. The TREC conferences. • Automated indexing/labeling. Compression and coding. Optimal codes. • String matching. Approximate matching. • Query expansion. Relevance feedback. • Text classification. Naive Bayes. Feature selection. Decision trees.
  • 14. Syllabus • Linear classifiers. k-nearest neighbors. Perceptron. Kernel methods. Maximum-margin classifiers. Support vector machines. Semi-supervised learning. • Lexical semantics and Wordnet. • Latent semantic indexing. Singular value decomposition. • Vector space clustering. k-means clustering. EM clustering. • Random graph models. Properties of random graphs: clustering coefficient, betweenness, diameter, giant connected component, degree distribution. • Social network analysis. Small worlds and scale-free networks. Power law distributions. Centrality. • Graph-based methods. Harmonic functions. Random walks. • PageRank. Hubs and authorities. Bipartite graphs. HITS. • Models of the Web.
  • 15. Syllabus • Crawling the web. Webometrics. Measuring the size of the web. The Bow-tie method. • Hypertext retrieval. Web-based IR. Document closures. Focused crawling. • Question answering • Burstiness. Self-triggerability • Information extraction • Adversarial IR. Human behavior on the web. • Text summarization POSSIBLE TOPICS • Discovering communities, spectral clustering • Semi-supervised retrieval • Natural language processing. XML retrieval. Text tiling. Human behavior on the web.
  • 16. Readings • required: Information Retrieval by Manning, Schuetze, and Raghavan ( http://www-csli.stanford.edu/~schuetze/information-re ), freely available, hard copy for sale • optional: Modeling the Internet and the Web: Probabilistic Methods and Algorithms by Pierre Baldi, Paolo Frasconi, Padhraic Smyth, Wiley, 2003, ISBN: 0-470-84906-1 ( http://ibook.ics.uci.edu). • papers from SIGIR, WWW and journals (to be announced in class).
  • 17. Prerequisites • Linear algebra: vectors and matrices. • Calculus: Finding extrema of functions. • Probabilities: random variables, discrete and continuous distributions, Bayes theorem. • Programming: experience with at least one web-aware programming language such as Perl (highly recommended) or Java in a UNIX environment. • Required CS account
  • 18. Course requirements • Three assignments (30%) – Some of them will be in Perl. The rest can be done in any appropriate language. All will involve some data analysis and evaluation • Final project (30%) – Research paper or software system. • Class participation (10%) • Final exam (30%)
  • 19. Final project format • Research paper - using the SIGIR format. Students will be in charge of problem formulation, literature survey, hypothesis formulation, experimental design, implementation, and possibly submission to a conference like SIGIR or WWW. • Software system - develop a working system or API. Students will be responsible for identifying a niche problem, implementing it and deploying it, either on the Web or as an open-source downloadable tool. The system can be either stand alone or an extension to an existing one.
  • 20. Project ideas • Build a question answering system. • Build a language identification system. • Social network analysis from the Web. • Participate in the Netflix challenge. • Query log analysis. • Build models of Web evolution. • Information diffusion in blogs or web. • Author-topic models of web pages. • Using the web for machine translation. • Building evolving models of web documents. • News recommendation system. • Compress the text of Wikipedia (losslessly). • Spelling correction using query logs. • Automatic query expansion.
  • 21. List of projects from the past • Document Closures for Indexing • Tibet - Table Structure Recognition Library • Ruby Blog Memetracker • Sentence decomposition for more accurate information retrieval • Extracting Social Networks from LiveJournal • Google Suggest Programming Project (Java Swing Client and Lucene Bac • Leveraging Social Networks for Organizing and Browsing Shared Photogr • Media Bias and the Political Blogosphere • Measuring Similarity between search queries • Extracting Social Networks and Information about the people within them • LSI + dependency trees
  • 22. Available corpora • Netflix challenge • Timebank • AOL query logs • Wikipedia • Blogs • wt2g/wt10g/wt100g • Bio papers • dotgov • AAN • RTE • Email • Paraphrases • Generifs • GENIA • Web pages • Political science corpus • Hansards • VAST • IMDB • del.icio.us • MTA/MTC • SMS • nie • News data: aquaint, tdt, nantc, reuters, setimes, trec, tipster • cnnsumm • Poliblog • Europarl multilingual • Sentiment • US congressional data • xml • DMOZ • epinions • Pubmedcentral • Enron • DUC/TAC
  • 23. Related courses elsewhere • Stanford (Chris Manning, Prabhakar Raghavan, and Hinrich Schuetze) • Cornell (Jon Kleinberg) • CMU (Yiming Yang and Jamie Callan) • UMass (James Allan) • UTexas (Ray Mooney) • Illinois (Chengxiang Zhai) • Johns Hopkins (David Yarowsky) • For a long list of courses related to Search Engines, Natural Language Processing, Machine Learning look here: http://tangra.si.umich.edu/clair/clair/courses.html
  • 24. SET FALL 2009 … 2. Models of Information retrieval The Vector model The Boolean model … …
  • 25. The web is really large • 100 B pages • Dynamically generated content • New pages get added all the time • Technorati has 50M+ blogs • The size of the blogosphere doubles every 6 months • Yahoo deals with 12TB of data per day (according to Ron Brachman)
  • 26. Sample queries (from Excite) In what year did baseball become an offical sport? play station codes . com birth control and depression government "WorkAbility I"+conference kitchen appliances where can I find a chines rosewood tiger electronics 58 Plymouth Fury How does the character Seyavash in Ferdowsi's Shahnameh exhibit characteristics of a hero? emeril Lagasse Hubble M.S Subalaksmi running
  • 27. Fun things to do with search engines • Googlewhack • Reduce document set size to 1 • Find query that will bring given URL in the top 10
  • 28. Key Terms Used in IR • QUERY: a representation of what the user is looking for - can be a list of words or a phrase. • DOCUMENT: an information entity that the user wants to retrieve • COLLECTION: a set of documents • INDEX: a representation of information that makes querying easier • TERM: word or concept that appears in a document or a query
  • 29. Mappings and abstractions • Reality → Data • Information need → Query • From Robert Korfhage’s book
  • 30. Documents • Not just printed paper • Can be records, pages, sites, images, people, movies • Document encoding (Unicode) • Document representation • Document preprocessing
  • 31. Sample query sessions (from AOL) • toley spies grames → tolley spies games → totally spies games • tajmahal restaurant brooklyn ny → taj mahal restaurant brooklyn ny → taj mahal restaurant brooklyn ny 11209 • do you love me like you say → do you love me like you say lyrics → do you love me like you say lyrics marvin gaye M: /data4/corpora/AOL-user-ct-collection
  • 32. Characteristics of user queries • Sessions: users revisit their queries. • Very short queries: typically 2 words long. • A large number of typos. • A small number of popular queries. A long tail of infrequent ones. • Almost no use of advanced query operators with the exception of double quotes
  • 33. Queries as documents • Advantages: – Mathematically easier to manage • Problems: – Different lengths – Syntactic differences – Repetitions of words (or lack thereof)
  • 34. Document representations • Term-document matrix (m x n) • Document-document matrix (n x n) • Typical example in a medium-sized collection: 3,000,000 documents (n) with 50,000 terms (m) • Typical example on the Web: n=30,000,000,000, m=1,000,000 • Boolean vs. integer-valued matrices
  • 35. Storage issues • Imagine a medium-sized collection with n=3,000,000 and m=50,000 • How large a term-document matrix will be needed? • Is there any way to do better? Any heuristic?
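A rough back-of-the-envelope estimate for the storage question above can be sketched as follows. The values of n and m come from the slide; the cell widths and the average of 200 distinct terms per document are illustrative assumptions, not figures from the course.

```python
# Dense term-document matrix size for the medium-sized collection on the slide.
n_docs = 3_000_000   # n, from the slide
n_terms = 50_000     # m, from the slide

cells = n_docs * n_terms  # one cell per (term, document) pair

def dense_size_gb(bits_per_cell: int) -> float:
    """Size of the dense matrix in gigabytes (1 GB = 10**9 bytes)."""
    return cells * bits_per_cell / 8 / 1e9

boolean_gb = dense_size_gb(1)    # Boolean matrix, one bit per cell
integer_gb = dense_size_gb(32)   # integer-valued matrix, assuming 32-bit counts

# Sparse alternative: store only the non-zero cells.
# ASSUMPTION: about 200 distinct terms per document, 4 bytes per stored doc ID.
avg_terms_per_doc = 200
postings = n_docs * avg_terms_per_doc
sparse_gb = postings * 4 / 1e9

print(f"cells:          {cells:,}")
print(f"dense boolean:  {boolean_gb:.2f} GB")
print(f"dense integer:  {integer_gb:.2f} GB")
print(f"sparse doc IDs: {sparse_gb:.2f} GB (under the stated assumption)")
```

Under these assumptions almost every cell of the dense matrix is zero, which is one way to motivate storing only the non-zero entries.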
  • 36. Inverted index • Instead of an incidence vector, use a posting table • CLEVELAND: D1, D2, D6 • OHIO: D1, D5, D6, D7 • Use linked lists to be able to insert new document postings in order and to remove existing postings. • Keep everything sorted! This gives you a logarithmic improvement in access.
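The posting table described above can be sketched in a few lines of Python. This is a minimal illustration, not the course's implementation: the toy documents are hypothetical, chosen only so that the postings match the CLEVELAND and OHIO example, and sorted Python lists stand in for the linked lists the slide mentions.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to a sorted list of the IDs of documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        # Naive tokenization: lowercase and split on whitespace.
        for term in text.lower().split():
            postings[term].add(doc_id)
    # Keep every posting list sorted so later merges can run in linear time.
    return {term: sorted(ids) for term, ids in postings.items()}

# Hypothetical toy collection (document texts are made up for illustration).
docs = {
    1: "Cleveland Ohio",
    2: "Cleveland Browns",
    5: "Ohio river",
    6: "Cleveland Ohio weather",
    7: "Ohio state",
}

index = build_inverted_index(docs)
print(index["cleveland"])  # posting list for CLEVELAND
print(index["ohio"])       # posting list for OHIO
```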
  • 37. Basic operations on inverted indexes • Conjunction (AND) – iterative merge of the two postings: O(x+y) • Disjunction (OR) – very similar • Negation (NOT) – can we still do it in O(x+y)? – Example: MICHIGAN AND NOT OHIO – Example: MICHIGAN OR NOT OHIO • Recursive operations • Optimization: start with the smallest sets
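The iterative merge for AND, and a variant for AND NOT, can be sketched as below. Both walk two sorted posting lists once, so they take O(x+y) steps. The function names and the sample postings are illustrative assumptions.

```python
def intersect(a, b):
    """a AND b: merge two sorted posting lists in O(len(a) + len(b))."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def and_not(a, b):
    """a AND NOT b: keep the entries of a that are absent from b."""
    i = j = 0
    out = []
    while i < len(a):
        if j == len(b) or a[i] < b[j]:
            out.append(a[i])   # a[i] cannot occur later in b
            i += 1
        elif a[i] == b[j]:
            i += 1             # present in both lists: drop it
            j += 1
        else:
            j += 1             # advance b until it catches up with a[i]
    return out

# Sample postings (sorted document IDs), chosen for illustration.
cleveland = [1, 2, 6]
ohio = [1, 5, 6, 7]

print(intersect(cleveland, ohio))  # documents with both terms
print(and_not(cleveland, ohio))    # documents with CLEVELAND but not OHIO
```

A query of the form x OR NOT y is different in kind: NOT y on its own refers to every document outside y's posting list, so its cost appears to depend on the size of the whole collection rather than on x+y.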
  • 38. Major IR models • Boolean • Vector • Probabilistic • Language modeling • Fuzzy retrieval • Latent semantic indexing
  • 39. The Boolean model • Venn diagrams [figure: Venn diagram with labels x, w, y, z, D1, D2]
  • 40. Boolean queries • Operators: AND, OR, NOT, parentheses • Example: – CLEVELAND AND NOT OHIO – (MICHIGAN AND INDIANA) OR (TEXAS AND OKLAHOMA) • Ambiguous uses of AND and OR in human language – Exclusive vs. inclusive OR – Restrictive operator: AND or OR?
  • 41. Canonical forms of queries • De Morgan’s Laws: NOT (A AND B) = (NOT A) OR (NOT B) NOT (A OR B) = (NOT A) AND (NOT B) • Normal forms – Conjunctive normal form (CNF) – Disjunctive normal form (DNF) – Reference librarians prefer CNF - why?
  • 42. Evaluating Boolean queries • Incidence vectors: – CLEVELAND: 1100010 – OHIO: 1000111 • Examples: – CLEVELAND AND OHIO – CLEVELAND AND NOT OHIO – CLEVELAND OR OHIO
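With incidence vectors, the three example queries reduce to bitwise operations. The sketch below uses the two vectors from the slide and assumes the leftmost bit stands for document D1; the helper names are illustrative.

```python
N_DOCS = 7                   # length of the incidence vectors on the slide
MASK = (1 << N_DOCS) - 1     # keeps NOT results within N_DOCS bits

cleveland = 0b1100010
ohio = 0b1000111

def bits(vector):
    """Render an incidence vector as a fixed-width bit string."""
    return format(vector, f"0{N_DOCS}b")

def docs_of(vector):
    """List document numbers whose bit is set (leftmost bit = D1, an assumption)."""
    return [i + 1 for i, bit in enumerate(bits(vector)) if bit == "1"]

and_result = cleveland & ohio              # CLEVELAND AND OHIO
and_not_result = cleveland & ~ohio & MASK  # CLEVELAND AND NOT OHIO
or_result = cleveland | ohio               # CLEVELAND OR OHIO

for name, value in [("AND", and_result),
                    ("AND NOT", and_not_result),
                    ("OR", or_result)]:
    print(f"{name:8s}{bits(value)}  {docs_of(value)}")
```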
  • 43. Exercise • D1 = “computer information retrieval” • D2 = “computer retrieval” • D3 = “information” • D4 = “computer information” • Q1 = “information AND retrieval” • Q2 = “information AND NOT computer”
  • 44. Exercise • Documents and their terms: 0: (none); 1: Swift; 2: Shakespeare; 3: Shakespeare, Swift; 4: Milton; 5: Milton, Swift; 6: Milton, Shakespeare; 7: Milton, Shakespeare, Swift; 8: Chaucer; 9: Chaucer, Swift; 10: Chaucer, Shakespeare; 11: Chaucer, Shakespeare, Swift; 12: Chaucer, Milton; 13: Chaucer, Milton, Swift; 14: Chaucer, Milton, Shakespeare; 15: Chaucer, Milton, Shakespeare, Swift • Query: ((chaucer OR milton) AND (NOT swift)) OR ((NOT chaucer) AND (swift OR shakespeare))
  • 45. How to deal with? • Multi-word phrases? • Document ranking?
  • 46. The Vector model [figure: Doc 1, Doc 2 and Doc 3 plotted in a space with axes Term 1, Term 2 and Term 3]
  • 47. Vector queries • Each document is represented as a vector • Non-efficient representation • Dimensional compatibility [table: W1 … W10 aligned with C1 … C10]
  • 48. The matching process • Document space • Matching is done between a document and a query (or between two documents) • Distance vs. similarity measures. • Euclidean distance, Manhattan distance, Word overlap, Jaccard coefficient, etc.
  • 49. Miscellaneous similarity measures • The Cosine measure (normalized dot product): $\sigma(D,Q) = \frac{X \cdot Y}{|X|\,|Y|} = \frac{\sum_i d_i q_i}{\sqrt{\sum_i d_i^2}\,\sqrt{\sum_i q_i^2}}$ • The Jaccard coefficient: $\sigma(D,Q) = \frac{|X \cap Y|}{|X \cup Y|}$
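The two measures above can be sketched in Python as follows. This is a minimal illustration: vectors are plain lists of term weights, Jaccard is computed over sets of terms, and the example inputs are made up. Both functions assume non-empty, non-zero inputs.

```python
import math

def cosine(x, y):
    """Cosine measure: dot product divided by the product of vector lengths."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)   # assumes neither vector is all zeros

def jaccard(x, y):
    """Jaccard coefficient: size of the intersection over size of the union."""
    x, y = set(x), set(y)
    return len(x & y) / len(x | y)   # assumes at least one set is non-empty

# Illustrative inputs (not taken from the slides' exercise).
print(cosine([1, 0], [1, 1]))   # vectors 45 degrees apart
print(cosine([3, 4], [3, 4]))   # identical vectors
print(jaccard({"computer", "information", "retrieval"},
              {"computer", "retrieval"}))
```

Cosine ignores vector length and compares direction only, whereas Jaccard ignores term weights and compares only which terms are present.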
  • 50. Exercise • Compute the cosine scores σ (D1,D2) and σ (D1,D3) for the documents: D1 = <1,3>, D2 = <100,300> and D3 = <3,1> • Compute the corresponding Euclidean distances, Manhattan distances, and Jaccard coefficients.
  • 51. Readings • (1): MRS1, MRS2, MRS5 (Zipf) • (2): MRS7, MRS8