IMPLEMENTATION OF
INFORMATION RETRIEVAL
  SYSTEMS VIA RDBMS
Relational Database: Definitions

• Relational database: a set of relations.
• Relation: made up of two parts:
    • Instance: a table, with rows and columns.
      #rows = cardinality, #fields = degree (arity).
    • Schema: specifies the name of the relation, plus the name and type of
      each column.
      E.g., Students(sid: string, name: string, login: string,
            age: integer, gpa: real).
• A relation can be thought of as a set of rows or tuples (i.e.,
  all rows are distinct).
Example Instance of Students Relation


        sid     name      login            age   gpa
       53666    Jones jones@cs             18    3.4
       53688    Smith smith@eecs           18    3.2
       53650    Smith smith@math           19    3.8

Cardinality = 3, degree = 5, all rows distinct
Relational Query Languages

• A major strength of the relational model: it supports
  simple, powerful querying of data.
• Queries can be written intuitively, and the DBMS is
  responsible for efficient evaluation.
The SQL Query Language

• Developed by IBM (System R) in the 1970s
• A standard is needed, since SQL is used by many vendors
• Standards:
    • SQL-86
    • SQL-89 (minor revision)
    • SQL-92 (major revision, current standard)
    • SQL-99 (major extensions)
The SQL Query Language

• To find all 18-year-old students, we can write:

  SELECT *
  FROM Students S
  WHERE S.age = 18

  Result:

   sid     name    login        age   gpa
  53666    Jones   jones@cs     18    3.4
  53688    Smith   smith@eecs   18    3.2

• To find just names and logins, replace the first line:
  SELECT S.name, S.login
Querying Multiple Relations

    Instance of Enrolled:

     sid     cid           grade
    53831    Carnatic101   C
    53831    Reggae203     B
    53650    Topology112   A
    53666    History105    B

    SELECT S.name, E.cid
    FROM Students S, Enrolled E
    WHERE S.sid = E.sid AND E.grade = 'A'

    Result:

    S.name   E.cid
    Smith    Topology112
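
The join above can be tried out with a minimal sketch that uses Python's built-in sqlite3 module and the example instances from the slides. The column types (TEXT, INTEGER, REAL) and the in-memory database are assumptions made for illustration; the slides do not prescribe a DBMS.

```python
import sqlite3

# In-memory database holding the two example relations from the slides.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Students (sid TEXT, name TEXT, login TEXT, age INTEGER, gpa REAL);
    CREATE TABLE Enrolled (sid TEXT, cid TEXT, grade TEXT);
    INSERT INTO Students VALUES
        ('53666', 'Jones', 'jones@cs',   18, 3.4),
        ('53688', 'Smith', 'smith@eecs', 18, 3.2),
        ('53650', 'Smith', 'smith@math', 19, 3.8);
    INSERT INTO Enrolled VALUES
        ('53831', 'Carnatic101', 'C'),
        ('53831', 'Reggae203',   'B'),
        ('53650', 'Topology112', 'A'),
        ('53666', 'History105',  'B');
""")

# Names of students together with the courses in which they earned an A.
join_rows = conn.execute("""
    SELECT S.name, E.cid
    FROM Students S, Enrolled E
    WHERE S.sid = E.sid AND E.grade = 'A'
""").fetchall()
print(join_rows)  # [('Smith', 'Topology112')]
```

Student 53831 appears in Enrolled but not in Students, so those rows never take part in the join.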
Creating Relations in SQL

• Creates the Students relation. Observe that the type (domain) of each
  field is specified, and enforced by the DBMS whenever tuples are added
  or modified.

  CREATE TABLE Students
      (sid   CHAR(20),
       name  CHAR(20),
       login CHAR(10),
       age   INTEGER,
       gpa   REAL)

• As another example, the Enrolled table holds information about courses
  that students take.

  CREATE TABLE Enrolled
      (sid   CHAR(20),
       cid   CHAR(20),
       grade CHAR(2))
Combining Separate Systems

  Use an IR system and an RDBMS that are independent of each other.
  Divide the query into two parts:
      Structured part for the RDBMS
      Unstructured (text) part for the IR system
  Combine the results from the IR system and the RDBMS
  Good: lets each vendor develop its own system
  Bad: hurts data integrity, recovery, portability, and performance
User Defined Operators

  Allow users to extend SQL by adding their own functions
  Some vendors have used this approach (such as the IBM DB2 Text Extender)
  Lynch and Stonebraker defined "user-defined operators" to
  implement information retrieval in 1988

      -- Retrieves documents that contain Term1, Term2, and Term3
      SELECT Doc_Id
      FROM Doc
      WHERE SEARCH-TERM(Text, Term1, Term2, Term3)

      -- Retrieves documents that contain Term1, Term2, and Term3
      -- within a window of 5 terms
      SELECT Doc_Id
      FROM Doc
      WHERE PROXIMITY(Text, 5, Term1, Term2, Term3)
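
As a rough illustration of the user-defined operator idea, the sketch below registers two Python functions with SQLite through sqlite3's create_function. This is not the Lynch and Stonebraker design or the DB2 Text Extender. The operator is spelled SEARCH_TERM because a hyphen is not legal in a SQL function name, the text column is called Body, and the whitespace tokenization and sample documents are assumptions.

```python
import sqlite3

def search_term(text, *terms):
    """Return 1 if every term occurs as a whole word in text, else 0."""
    words = set(text.lower().split())
    return int(all(t.lower() in words for t in terms))

def proximity(text, width, *terms):
    """Return 1 if some window of `width` consecutive words holds every term."""
    words = text.lower().split()
    wanted = {t.lower() for t in terms}
    for start in range(len(words)):
        if wanted <= set(words[start:start + width]):
            return 1
    return 0

conn = sqlite3.connect(":memory:")
# narg = -1 lets each operator accept a variable number of terms.
conn.create_function("SEARCH_TERM", -1, search_term)
conn.create_function("PROXIMITY", -1, proximity)
conn.execute("CREATE TABLE Doc (Doc_Id INTEGER, Body TEXT)")
conn.executemany("INSERT INTO Doc VALUES (?, ?)", [
    (1, "the vice president met the press"),
    (2, "the president praised his vice chancellor at length today"),
    (3, "weather report for the weekend"),
])

contains_both = [r[0] for r in conn.execute(
    "SELECT Doc_Id FROM Doc "
    "WHERE SEARCH_TERM(Body, 'vice', 'president') ORDER BY Doc_Id")]
adjacent = [r[0] for r in conn.execute(
    "SELECT Doc_Id FROM Doc "
    "WHERE PROXIMITY(Body, 2, 'vice', 'president') ORDER BY Doc_Id")]
print(contains_both)  # [1, 2]
print(adjacent)       # [1]
```

Each call scans the full text of a row, so an operator written this way gets no help from an index.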
Non-First Normal Form Approaches

  Capture many-to-many relationships as sets via nested
  relations
  Ad-hoc queries are hard to implement
  No standard yet
Using RDBMS for IR

  Benefits:
      Recovery
      Performance
      Data migration
      Concurrency Control
      Access control mechanism
      Logical and physical data independence
Using RDBMS for IR

  Example: a bibliography that includes both structured and
  unstructured information
      DIRECTORY (name, institution): affiliation of the author
      AUTHOR (name, DocId): authorship information
      INDEX (term, DocId): terms that are used to index a document
Using RDBMS for IR

   Preprocessing
       • SGML, a standard for defining the parts of a document, can be
         used as a starting point

 <DOC>
 <DOCNO> WSJ834234234 </DOCNO>
  <HL> How to make students suffer in IR Course </HL>
 <DD> 03/23/87</DD>
 <DATELINE> Sabanci, Turkey </DATELINE>
 <TEXT>
 Crawler HW, Inverted Index, Querying
 </TEXT>
 </DOC>
Using RDBMS for IR
   Preprocessing
       • SGML, a standard for defining the parts of a document, can be
         used as a starting point
       • Use a parser together with a hash function to identify terms
       • Use a STOP_TERM table to look up stop words
       • Produce three output tables
           • INDEX (DocId, Term, TermFrequency): models the inverted index
           • DOC (DocId, DocName, PubDate, DateLine): document metadata
           • TERM (Term, Idf): stores the weight of each term

 -- Construct the TERM table; N is the total number of documents.
 -- COUNT(*) is the number of documents that contain Term, because
 -- INDEX holds one row per (DocId, Term) pair.
 INSERT INTO TERM
 SELECT Term, LOG(N / COUNT(*))
 FROM INDEX
 GROUP BY Term
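
A minimal sketch of the TERM construction in SQLite follows. Several details are assumptions: the table is named DOC_INDEX because INDEX is a reserved word in SQLite, N is computed from the index itself, the natural logarithm is registered from Python under the name ln, and the division is forced to floating point so that N / COUNT(*) is not truncated by integer division. The sample postings are invented.

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
# Register a natural-log function; not every SQLite build ships math functions.
conn.create_function("ln", 1, math.log)
conn.executescript("""
    CREATE TABLE DOC_INDEX (DocId INTEGER, Term TEXT, TermFrequency INTEGER);
    CREATE TABLE TERM (Term TEXT, Idf REAL);
    INSERT INTO DOC_INDEX VALUES
        (1, 'index', 2), (1, 'query', 1),
        (2, 'index', 1),
        (3, 'index', 1), (3, 'crawler', 4),
        (4, 'query', 1);
""")

# N: total number of documents (assumed here to be the documents in the index).
n_docs = conn.execute("SELECT COUNT(DISTINCT DocId) FROM DOC_INDEX").fetchone()[0]

# One row per (DocId, Term), so COUNT(*) per term is its document frequency.
conn.execute("""
    INSERT INTO TERM
    SELECT Term, ln(? * 1.0 / COUNT(*))
    FROM DOC_INDEX
    GROUP BY Term
""", (n_docs,))

idf = dict(conn.execute("SELECT Term, Idf FROM TERM"))
for term, weight in sorted(idf.items()):
    print(term, round(weight, 3))
```

The rarest term ('crawler', found in one document out of four) receives the largest weight, and the most common one ('index') the smallest.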
Using RDBMS for IR

• An offset can be stored with each term occurrence in order to answer
  proximity queries. For example, a query for "Vice President" should match
  only documents in which the two terms occur next to each other.

 INDEX_PROX (DocId, Term, OffSet)

 -- Construct the INDEX table from INDEX_PROX: the term frequency is the
 -- number of offsets recorded for each (DocId, Term) pair
 INSERT INTO INDEX
 SELECT DocId, Term, COUNT(*)
 FROM INDEX_PROX
 GROUP BY DocId, Term
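
The relationship between the two tables can be sketched as follows, assuming whitespace tokenization and two invented documents. The names DOC_INDEX (for INDEX) and Pos (for OffSet) are substitutions chosen to stay clear of SQLite keywords.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE INDEX_PROX (DocId INTEGER, Term TEXT, Pos INTEGER);
    CREATE TABLE DOC_INDEX (DocId INTEGER, Term TEXT, TermFrequency INTEGER);
""")

# One INDEX_PROX row per term occurrence, with its word offset in the document.
docs = {1: "vice president meets vice chancellor", 2: "president speaks"}
for doc_id, text in docs.items():
    conn.executemany(
        "INSERT INTO INDEX_PROX VALUES (?, ?, ?)",
        [(doc_id, term, pos) for pos, term in enumerate(text.split())],
    )

# Collapse occurrences into one row per (DocId, Term) with its frequency.
conn.execute("""
    INSERT INTO DOC_INDEX
    SELECT DocId, Term, COUNT(*)
    FROM INDEX_PROX
    GROUP BY DocId, Term
""")

index_rows = conn.execute(
    "SELECT DocId, Term, TermFrequency FROM DOC_INDEX ORDER BY DocId, Term"
).fetchall()
for row in index_rows:
    print(row)
```

In document 1 the term 'vice' occurs at offsets 0 and 3, so it ends up as a single DOC_INDEX row with frequency 2.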
Using RDBMS for IR

   A query can be modeled as a relation as well, which is useful when
   the query is itself a long document
       QUERY (Term, TermFreq)

   Ex: "Find all news documents written on 03/03/2005 about
   Sabanci University"
       Structured data (here, the date) is matched against the structured fields
       Terms are matched using the inverted index

SELECT DISTINCT d.DocId
FROM DOC d, INDEX i
WHERE i.Term IN ('Sabanci', 'University') AND d.PubDate = '03/03/2005'
      AND d.DocId = i.DocId
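
A small sketch of this mixed structured and text query, run against invented rows. The table name DOC_INDEX stands in for INDEX (a reserved word in SQLite), and dates are stored as plain strings, as on the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE DOC (DocId INTEGER, DocName TEXT, PubDate TEXT, DateLine TEXT);
    CREATE TABLE DOC_INDEX (DocId INTEGER, Term TEXT, TermFrequency INTEGER);
    INSERT INTO DOC VALUES
        (1, 'news-a', '03/03/2005', 'Istanbul'),
        (2, 'news-b', '03/03/2005', 'Ankara'),
        (3, 'news-c', '04/03/2005', 'Istanbul');
    INSERT INTO DOC_INDEX VALUES
        (1, 'Sabanci', 2), (1, 'University', 1),
        (2, 'football', 3),
        (3, 'Sabanci', 1);
""")

base_query = """
    SELECT {select} d.DocId
    FROM DOC d, DOC_INDEX i
    WHERE i.Term IN ('Sabanci', 'University')
      AND d.PubDate = '03/03/2005'
      AND d.DocId = i.DocId
    ORDER BY d.DocId
"""
# Without DISTINCT, a document is returned once per matching term.
with_duplicates = [r[0] for r in conn.execute(base_query.format(select=""))]
distinct_ids = [r[0] for r in conn.execute(base_query.format(select="DISTINCT"))]
print(with_duplicates)  # [1, 1]
print(distinct_ids)     # [1]
```

Document 3 mentions 'Sabanci' but is filtered out by the date, and document 2 has the right date but none of the terms. Note that IN matches documents containing either term, so this is an OR over the query terms.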
Using RDBMS for IR

    Boolean Queries: consist of terms combined with Boolean operators
    (AND, OR, and NOT)
    For a single inputTerm: retrieve the texts of the documents that
    contain that term

SELECT d.Text
FROM DOC d
WHERE d.DocId IN
     (SELECT DISTINCT (i.DocId)
      FROM INDEX i
      WHERE i.Term = inputTerm)


Note that the text part of a document can be stored as a BLOB or a CLOB
(Binary or Character Large Object)
Using RDBMS for IR

   Boolean Queries that contain OR

SELECT DISTINCT (i.DocId)
FROM INDEX i
WHERE i.Term = inputTerm1 OR
      i.Term = inputTerm2 OR
      ...
      i.Term = inputTermn
Using RDBMS for IR

     Boolean Queries that contain AND

SELECT DISTINCT (i.DocId)
FROM INDEX i
WHERE i.Term = inputTerm1 AND
      i.Term = inputTerm2 AND
      ...
      i.Term = inputTermn

??
Using RDBMS for IR

   Boolean Queries that contain AND (Previous Answer Was
   Wrong)
   Each INDEX row holds a single term, so no row can equal several
   different terms at once. Join one copy of INDEX per term instead:

SELECT DISTINCT (i1.DocId)
FROM INDEX i1, INDEX i2, INDEX i3, ... INDEX in
WHERE i1.Term = inputTerm1 AND
      i2.Term = inputTerm2 AND
      ...
      in.Term = inputTermn AND
      i1.DocId = i2.DocId AND
      i2.DocId = i3.DocId AND
      ...
      in-1.DocId = in.DocId

Alternatively, use INTERSECT.
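
The difference between the single-table AND and the intersection variant can be seen in a short sketch. The helper docs_with_all, the table name DOC_INDEX (INDEX is a reserved word in SQLite), and the sample postings are assumptions for illustration; the helper expects a non-empty list of terms.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE DOC_INDEX (DocId INTEGER, Term TEXT, TermFrequency INTEGER);
    INSERT INTO DOC_INDEX VALUES
        (1, 'vice', 1), (1, 'president', 2),
        (2, 'president', 1),
        (3, 'vice', 3), (3, 'president', 1), (3, 'senate', 1);
""")

def docs_with_all(conn, terms):
    """AND query: intersect the per-term lists of document ids."""
    one_term = "SELECT DocId FROM DOC_INDEX WHERE Term = ?"
    sql = " INTERSECT ".join([one_term] * len(terms)) + " ORDER BY DocId"
    return [row[0] for row in conn.execute(sql, terms)]

# A single row has one Term value, so this condition can never hold.
naive = conn.execute(
    "SELECT DISTINCT DocId FROM DOC_INDEX WHERE Term = ? AND Term = ?",
    ("vice", "president"),
).fetchall()

both = docs_with_all(conn, ["vice", "president"])
print(naive)  # []
print(both)   # [1, 3]
```

Building the statement from one subquery per term keeps the terms as bound parameters, but the statement still grows with the number of query terms.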
Using RDBMS for IR

  Boolean Queries that contain AND
  Commercial DBMSs are not able to process more than a fixed number
  of joins.
  Solution: store the query terms in the QUERY relation and count matches


   SELECT i.DocId
   FROM INDEX i, QUERY q
   WHERE i.Term = q.Term
   GROUP BY i.DocId
   HAVING COUNT(i.Term) = (SELECT COUNT(*) FROM QUERY)

   Works only when INDEX holds a single row per (DocId, Term) pair,
   together with the term's frequency, and no proximity is recorded.
Using RDBMS for IR

  Boolean Queries that contain AND
  Commercial DBMSs are not able to process more than a fixed number
  of joins.
  Solution for terms appearing more than once in the INDEX


   SELECT i.DocId
   FROM INDEX i, QUERY q
   WHERE i.Term = q.Term
   GROUP BY i.DocId
   HAVING COUNT(DISTINCT(i.Term)) = (SELECT COUNT(*) FROM QUERY)

   This is slower, since DISTINCT requires a sort for duplicate elimination.
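
The two counting variants can be compared on an index that stores one row per occurrence. In this sketch the tables INDEX_PROX (with Pos in place of OffSet) and QUERY_TERMS (in place of QUERY), the helper and_query, and the sample rows are all assumptions chosen to expose the difference.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE INDEX_PROX (DocId INTEGER, Term TEXT, Pos INTEGER);
    CREATE TABLE QUERY_TERMS (Term TEXT);
    INSERT INTO QUERY_TERMS VALUES ('vice'), ('president');
    INSERT INTO INDEX_PROX VALUES
        (1, 'vice', 0), (1, 'president', 1),
        (2, 'president', 0), (2, 'president', 5),
        (3, 'vice', 0), (3, 'vice', 3), (3, 'president', 4);
""")

def and_query(count_expr):
    """Run the GROUP BY / HAVING form of AND with the given count expression."""
    sql = f"""
        SELECT i.DocId
        FROM INDEX_PROX i, QUERY_TERMS q
        WHERE i.Term = q.Term
        GROUP BY i.DocId
        HAVING {count_expr} = (SELECT COUNT(*) FROM QUERY_TERMS)
        ORDER BY i.DocId
    """
    return [r[0] for r in conn.execute(sql)]

plain_count = and_query("COUNT(i.Term)")
distinct_count = and_query("COUNT(DISTINCT i.Term)")
print(plain_count)     # [1, 2]
print(distinct_count)  # [1, 3]
```

With repeated occurrences the plain count wrongly accepts document 2 (two hits on one term) and misses document 3 (three matching rows for two terms); counting distinct terms returns the documents that contain both. Both variants assume the query relation itself lists each term once.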
Using RDBMS for IR

  Boolean Queries that contain AND
  Commercial DBMSs are not able to process more than a fixed number
  of joins.
  Implementation of TAND (Threshold AND) is also simple: retrieve the
  documents that match more than k of the query terms


   SELECT i.DocId
   FROM INDEX i, QUERY q
   WHERE i.Term = q.Term
   GROUP BY i.DocId
   HAVING COUNT(DISTINCT(i.Term)) > k
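
A short sketch of TAND with the threshold passed as a bound parameter. It follows the strict comparison (> k) used on the slide; the helper tand, the table names DOC_INDEX and QUERY_TERMS (standing in for INDEX and QUERY), and the data are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE DOC_INDEX (DocId INTEGER, Term TEXT, TermFrequency INTEGER);
    CREATE TABLE QUERY_TERMS (Term TEXT);
    INSERT INTO QUERY_TERMS VALUES ('vice'), ('president'), ('senate');
    INSERT INTO DOC_INDEX VALUES
        (1, 'vice', 1), (1, 'president', 2),
        (2, 'president', 1),
        (3, 'vice', 3), (3, 'president', 1), (3, 'senate', 1);
""")

def tand(conn, k):
    """Documents that contain more than k distinct query terms."""
    sql = """
        SELECT i.DocId
        FROM DOC_INDEX i, QUERY_TERMS q
        WHERE i.Term = q.Term
        GROUP BY i.DocId
        HAVING COUNT(DISTINCT i.Term) > ?
        ORDER BY i.DocId
    """
    return [r[0] for r in conn.execute(sql, (k,))]

print(tand(conn, 0))  # [1, 2, 3]
print(tand(conn, 1))  # [1, 3]
print(tand(conn, 2))  # [3]
```

With k = 0 the query behaves like OR, and with k one less than the number of query terms it behaves like AND.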
Using RDBMS for IR

  Proximity Queries for terms within a specific window width


 SELECT a.DocId
 FROM INDEX_PROX a, INDEX_PROX b
 WHERE a.Term IN (SELECT q.Term FROM QUERY q) AND
        b.Term IN (SELECT q.Term FROM QUERY q) AND
        a.DocId = b.DocId AND
        (a.offset - b.offset) BETWEEN 0 AND (width - 1)
 GROUP BY a.DocId, b.DocId, a.Term, a.offset
 HAVING COUNT(DISTINCT(b.Term)) = (SELECT COUNT(*) FROM QUERY)
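
The proximity query can be exercised with the sketch below. The window width is passed as a parameter, DISTINCT is added so that a document with several qualifying windows is listed once, and the names Pos (for OffSet), QUERY_TERMS (for QUERY) and within_window, along with the two sample documents, are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE INDEX_PROX (DocId INTEGER, Term TEXT, Pos INTEGER);
    CREATE TABLE QUERY_TERMS (Term TEXT);
    INSERT INTO QUERY_TERMS VALUES ('vice'), ('president');
""")
docs = {
    1: "the vice president spoke",
    2: "president of the board thanked the vice chair",
}
for doc_id, text in docs.items():
    conn.executemany(
        "INSERT INTO INDEX_PROX VALUES (?, ?, ?)",
        [(doc_id, term, pos) for pos, term in enumerate(text.split())],
    )

def within_window(conn, width):
    """Documents where all query terms fall inside some window of `width` words."""
    sql = """
        SELECT DISTINCT a.DocId
        FROM INDEX_PROX a, INDEX_PROX b
        WHERE a.Term IN (SELECT q.Term FROM QUERY_TERMS q)
          AND b.Term IN (SELECT q.Term FROM QUERY_TERMS q)
          AND a.DocId = b.DocId
          AND (a.Pos - b.Pos) BETWEEN 0 AND (? - 1)
        GROUP BY a.DocId, a.Term, a.Pos
        HAVING COUNT(DISTINCT b.Term) = (SELECT COUNT(*) FROM QUERY_TERMS)
        ORDER BY a.DocId
    """
    return [r[0] for r in conn.execute(sql, (width,))]

print(within_window(conn, 2))  # [1]
print(within_window(conn, 7))  # [1, 2]
```

Each group is one occurrence a of a query term, treated as the right edge of the window. The group qualifies when the occurrences b at or before it, within the window, cover every query term.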
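Using RDBMS for IR

  Calculating Relevance: sum, over the matching terms, the product of
  the query weight (tf * idf) and the document weight (tf * idf)

   SELECT i.DocId, SUM(q.TermFreq * t.Idf * i.TermFrequency * t.Idf)
   FROM QUERY q, INDEX i, TERM t
   WHERE q.Term = t.Term AND i.Term = t.Term
   GROUP BY i.DocId
   ORDER BY 2 DESC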
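
A worked instance of the relevance computation is sketched below. The Idf values are made up so that the arithmetic is easy to follow, and the table names DOC_INDEX and QUERY_TERMS stand in for INDEX and QUERY.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE DOC_INDEX (DocId INTEGER, Term TEXT, TermFrequency INTEGER);
    CREATE TABLE TERM (Term TEXT, Idf REAL);
    CREATE TABLE QUERY_TERMS (Term TEXT, TermFreq INTEGER);
    INSERT INTO TERM VALUES ('vice', 2.0), ('president', 1.0);
    INSERT INTO QUERY_TERMS VALUES ('vice', 1), ('president', 1);
    INSERT INTO DOC_INDEX VALUES
        (1, 'vice', 1), (1, 'president', 2),
        (2, 'president', 3),
        (3, 'vice', 2);
""")

# Score = sum over shared terms of (query tf * idf) * (document tf * idf).
ranking = conn.execute("""
    SELECT i.DocId,
           SUM(q.TermFreq * t.Idf * i.TermFrequency * t.Idf) AS Score
    FROM QUERY_TERMS q, DOC_INDEX i, TERM t
    WHERE q.Term = t.Term AND i.Term = t.Term
    GROUP BY i.DocId
    ORDER BY Score DESC
""").fetchall()
print(ranking)  # [(3, 8.0), (1, 6.0), (2, 3.0)]
```

Document 3 ranks first even though it contains only one query term: 'vice' carries the higher Idf, and that weight enters the product twice (1 * 2.0 * 2 * 2.0 = 8.0).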

2005 fall cs523_lecture_4
