2. Relational Database: Definitions
Relational database: a set of relations
Relation: made up of 2 parts:
Instance : a table, with rows and columns.
#Rows = cardinality, #fields = degree / arity.
Schema : specifies name of relation, plus name and type of
each column.
E.G. Students(sid: string, name: string, login: string,
age: integer, gpa: real).
Can think of a relation as a set of rows or tuples (i.e.,
all rows are distinct).
3. Example Instance of Students Relation
sid name login age gpa
53666 Jones jones@cs 18 3.4
53688 Smith smith@eecs 18 3.2
53650 Smith smith@math 19 3.8
Cardinality = 3, degree = 5, all rows distinct
4. Relational Query Languages
A major strength of the relational model: supports
simple, powerful querying of data.
Queries can be written intuitively, and the DBMS is
responsible for efficient evaluation.
5. The SQL Query Language
Developed by IBM (system R) in the 1970s
Need for a standard since it is used by many vendors
Standards:
SQL-86
SQL-89 (minor revision)
SQL-92 (major revision, current standard)
SQL-99 (major extensions)
6. The SQL Query Language
To find all 18 year old students, we can write:
SELECT * sid name login age gpa
FROM Students S 53666 Jones jones@cs 18 3.4
WHERE S.age=18 53688 Smith smith@ee 18 3.2
•To find just names and logins, replace the first line:
SELECT S.name, S.login
7. Querying Multiple Relations
sid cid grade
53831 Carnatic101 C
53831 Reggae203 B
53650 Topology112 A
53666 History105 B
SELECT S.name, E.cid
FROM Students S, Enrolled E
WHERE S.sid=E.sid AND E.grade=“A”
S.name E.cid
Smith Topology112
8. Creating Relations in SQL
Creates the Students relation. Observe
that the type (domain) of each field CREATE TABLE Students
is specified, and enforced by the DBMS (sid: CHAR(20),
whenever tuples are added or modified. name: CHAR(20),
As another example, the Enrolled table login: CHAR(10),
holds information about courses that
students take. age: INTEGER,
gpa: REAL)
CREATE TABLE Enrolled
(sid: CHAR(20),
cid: CHAR(20),
grade: CHAR(2))
9. Combining Separate Systems
Use an IR and RDBMS systems which are
independent.
Divide the query into two:
Structured part for the RDBMS
Unstructured (text) part for the IR
Combine the results from IR and RDBMS
Good for letting each vendor develop its own system
Bad for data integrity, recovery, portability, and
performance
10. User Defined Operators
Allow users to modify SQL by adding their own functions
Some vendors used this approach (such as IBM DB2 text
extender)
Lynch and Stonebreaker defined “user defined operators” to
implement information retrieval in 1988
//Retrieves documents that contain term1, term2, term3
SELECT Doc_Id
FROM Doc
WHERE SEARCH-TERM(Text, Term1, Term 2, Term3)
//Retrieves documents that contain term1, term2, term3
// within a window of 5 terms
SELECT Doc_Id
FROM Doc
WHERE PROXIMITY(Text,5, Term1, Term 2, Term3)
11. Non-First Normal Form Approaches
Capture the many-to-many relationships into sets via nested
relations
Hard to implement ad-hoc queries
No standard yet
12. Using RDBMS for IR
Benefits:
Recovery
Performance
Data migration
Concurrency Control
Access control mechanism
Logical and physical data independence
13. Using RDBMS for IR
Example: A bibliography that includes both structured and
unstructured information
DIRECTORY (name, institution) : affiliation of the author
AUTHOR(name,DocId) :authorship information
INDEX (name, DocId) :terms that are used to index a document
14. Using RDBMS for IR
Preprocessing
SGML can be used as a starting point which is a standard for
defining parts of documents
<DOC>
<DOCNO> WSJ834234234 </DOCNO>
<HL> How to make students suffer in IR Course </HL>
<DD> 03/23/87</DD>
<DATELINE> Sabanci, Turkey </DATELINE>
<TEXT>
Crawler HW, Inverted Index, Querying
</TEXT>
</DOC>
15. Using RDBMS for IR
Preprocessing
SGML can be used as a starting point which is a standard for
defining parts of documents
Use a parser together with a hash function to identify terms
Use STOP_TERM table for referencing stop words
Produce three output tables
INDEX (DocId, Term, TermFrequency) : Models the inverted index
DOC (DocId, DocName, PubDate, DateLine) : Document metadata
TERM (Term, Idf) : stored the weights of each term
//Construct TERM table, N is the total number of documents
INSERT INTO TERM
SELECT Term,log(N/Count(*))
FROM INDEX
GROUP BY Term
16. Using RDBMS for IR
An offset can be added together with the term to be able to answer proximity
queries. For example “Vice President” should occur together in the same
document for relevant documents etc.
INDEX_PROX (DocId, Term, OffSet)
//Construct TERM table, N is the total number of documents
INSERT INTO INDEX
SELECT DocId, Term, COUNT(*)
FROM INDEX_PROX
GROUP BY DocId, Term
17. Using RDBMS for IR
Query can be modeled as a relation as well when it is a long
document
QUERY(Term,TermFreq)
Ex: “Find all news documents written on 03/03/2005 about
Sabanci University
Data will be extracted from the structured fields
Terms will be extracted using the inverted index
SELECT d.DocId
FROM DOC d, INDEX i
WHERE i.Term IN (“Sabanci”, “University”) AND d.PubDate = “03/03/2005”
AND d.DocId = i.DocId
18. Using RDBMS for IR
Boolean Queries: Consists of terms with boolean operators
(AND, OR, and NOT)
For a single inputTerm: retrieve the document texts that contain
that term
SELECT d.Text
FROM DOC d,
WHERE d.DocId IN
(SELECT DISTINCT (i.DocId)
FROM INDEX i
WHERE i.Term = inputTerm)
Note that we can store the text part of a document using BLOB or CLOG (
Binary or Character Large Object)
19. Using RDBMS for IR
Boolean Queries that contain OR
SELECT DISTINCT (i.DocId)
FROM INDEX i
WHERE i.Term = inputTerm1 OR
i.Term = inputTerm2 OR
…..
i.Term = inputTermn OR
20. Using RDBMS for IR
Boolean Queries that contain AND
SELECT DISTINCT (i.DocId)
FROM INDEX i
WHERE i.Term = inputTerm1 AND
i.Term = inputTerm2 AND
…..
i.Term = inputTermn AND
??
21. Using RDBMS for IR
Boolean Queries that contain AND (Previous Answer Was
Wrong)
SELECT DISTINCT (i.DocId)
FROM INDEX i1, INDEX i2, INDEX i3, …. INDEX in
WHERE i1.Term = inputTerm1 AND
i2.Term = inputTerm2 AND
…..
in.Term = inputTermn AND
i1.DocID = i2.DocId AND
i2.DocID = i3.DocId AND
…
in-1 = in.DocID
OR YOU CAN USE INTERSECTION
22. Using RDBMS for IR
Boolean Queries that contain AND
Commercial DBMSs are not able to process more than a fixed number
of joins.
Solution
SELECT i.DocId
FROM INDEX i, Query q
WHERE i.Term = q.term
GROUP BY i.DocId
HAVING COUNT(i.Term) = (SELECT COUNT(*) FROM QUERY)
Works only when the INDEX contains only one occurrence of a given term
Together with its frequency. No Proximity is recorded.
23. Using RDBMS for IR
Boolean Queries that contain AND
Commercial DBMSs are not able to process more than a fixed number
of joins.
Solution for terms appearing more than once in the INDEX
SELECT i.DocId
FROM INDEX i, Query q
WHERE i.Term = q.term
GROUP BY i.DocId
HAVING COUNT(DISTINCT(i.Term)) = (SELECT COUNT(*) FROM QUERY)
This is slower since DISTINC requires a sort for duplicate elimination.
24. Using RDBMS for IR
Boolean Queries that contain AND
Commercial DBMSs are not able to process more than a fixed number
of joins.
Implementation of TAND (Threshold AND) is also simple
SELECT i.DocId
FROM INDEX i, Query q
WHERE i.Term = q.term
GROUP BY i.DocId
HAVING COUNT(DISTINCT(i.Term)) > k
25. Using RDBMS for IR
Proximity Queries for terms within a specific window width
SELECT a.DocId
FROM INDEX_PROX a, INDEX_PROX b
WHERE a.Term IN (SELECT q.Term FROM QUERY q) AND
b.Term IN (SELECT q.Term FROM QUERY q) AND
a.DocId = b.DocId AND
(a.offset –b.offset) BETWEEN 0 AND (width-1)
GROUP BY a.DocId, b.DocId, a.Term, a.offset
HAVING COUNT(DISTINCT(b.Term)) = SELECT (COUNT(*) FROM QUERY)
26. Using RDBMS for IR
Calculating Relevance
SELECT i.DocId, SUM(q.tf*t.idf*t.tf*t.idf)
FROM QUERY q, INDEX i, TERM t
WHERE q.Term = t.term AND i.Term = t.Term
GROUP BY i.DocId
ORDER BY 2 DESC