2. Contents
Introduction
IDF Similarity
QF Similarity
Breaking Ties
Implementation
ITA Algorithm
Conclusion
3. Introduction
Databases use a Boolean query model
E.g. Select * WHERE MFR_Country = “Germany”
AND Type = “Sports” AND MFR =
“Volkswagen”
Problems in Database
Empty Answers
Too selective query leading to Null Result Set
Many Answers
General query leading to too many results
4. Introduction
Ranking of Database Query Results using IR
techniques.
Applying TF-IDF concept to database that is
based on the frequency of the attribute values.
Need to extend the TF-IDF to Numerical Domains
IDF Similarity is discussed in paper
Collecting WORKLOAD and using it for ranking.
QF Similarity, leveraging Workload Information
5. Introduction
Many Answers Problem is solved using Top-K
Query Processing
Index-based Threshold Algorithm (ITA)
developed exploiting IDF/QF Similarity.
6. IDF Similarity
What is TF-IDF Technique?
Given a set of documents and a query,
documents are ranked based on TF and IDF of
the words of the document.
Adapting IDF concept to Database
containing only categorical Attributes
t = <t1, …, tm>: values of Attribute A
n: number of tuples in the database
7. IDF Similarity
For all the values of t:
Frequency F(t) is defined as no. of tuples having
Attribute A = t
IDF is calculated as:
IDF(t) = log(n/F(t))
For pair of values u and v in Attribute A domain
S(u,v) = IDF (u) if u=v otherwise 0
For a tuple T and query Q over all the attributes
(A1, …, Am):
$$SIM(T,Q) = \sum_{k=1}^{m} S_k(t_k, q_k)$$
8. IDF Similarity
Example:
CAR_ID  MODEL     MFR          MFR_Country  Type
1       SLR       Mercedes     Germany      Sports
2       A6        Audi         Germany      Executive
3       R8        Audi         Germany      Sports
4       Gallardo  Lamborghini  Italy        Sports
Query Q: Select * WHERE MFR_Country =
“Germany” AND Type = “Sports” AND MFR =
“Volkswagen”
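The IDF similarity for this example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the natural logarithm is used (the slides do not name a base), and the names `idf_tables`, `idf_sim`, `rows` and `query` are made up for this sketch.

```python
import math
from collections import Counter

# Example table from the slide (CAR_ID is the position in the list + 1).
rows = [
    {"MODEL": "SLR", "MFR": "Mercedes", "MFR_Country": "Germany", "Type": "Sports"},
    {"MODEL": "A6", "MFR": "Audi", "MFR_Country": "Germany", "Type": "Executive"},
    {"MODEL": "R8", "MFR": "Audi", "MFR_Country": "Germany", "Type": "Sports"},
    {"MODEL": "Gallardo", "MFR": "Lamborghini", "MFR_Country": "Italy", "Type": "Sports"},
]
query = {"MFR_Country": "Germany", "Type": "Sports", "MFR": "Volkswagen"}

def idf_tables(rows, attrs):
    """IDF(t) = log(n / F(t)) for every value t of every listed attribute."""
    n = len(rows)
    return {
        a: {v: math.log(n / f) for v, f in Counter(r[a] for r in rows).items()}
        for a in attrs
    }

def idf_sim(row, query, idf):
    """SIM(T, Q) = sum over attributes of S(t, q), with S = IDF(q) if t == q else 0."""
    return sum(idf[a][q] for a, q in query.items() if row[a] == q)

idf = idf_tables(rows, query)
scores = [idf_sim(r, query, idf) for r in rows]
for car_id, s in enumerate(scores, start=1):
    print(car_id, round(s, 4))
```

The Boolean query returns no rows, because no car matches the MFR condition, yet the similarity still ranks cars 1 and 3 (German sports cars) above cars 2 and 4, which match only one condition each.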
10. IDF Similarity
Consider a Numeric Attribute in DB e.g. PRICE
Simple solution: discretize the data into ranges
Consider two ranges: (0, 50) and (51, 100)
Problem: values 49 and 52 are considered completely dissimilar, although they are close.
Frequency of a numeric value t of an attribute is defined as the
sum of contributions to t from every value ti in the database:
$$F(t) = \sum_{i=1}^{n} e^{-\frac{1}{2}\left(\frac{t_i - t}{h}\right)^2}$$
IDF(t) = log(n/F(t)), where h = bandwidth parameter
S(t,q) = density at t of a Gaussian distribution centered at q, scaled by IDF(q):
$$S(t,q) = e^{-\frac{1}{2}\left(\frac{t - q}{h}\right)^2} \, IDF(q)$$
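A minimal Python sketch of the numeric case follows. It assumes the similarity is a Gaussian kernel in (t - q)/h multiplied by IDF(q), as the description "density at t of a Gaussian distribution centered at q" suggests. The price list, the bandwidth h = 10 and the function names are hypothetical values chosen only for illustration.

```python
import math

def numeric_freq(values, t, h):
    """F(t): sum of Gaussian contributions to t from every database value ti."""
    return sum(math.exp(-0.5 * ((ti - t) / h) ** 2) for ti in values)

def numeric_idf(values, t, h):
    """IDF(t) = log(n / F(t)); never negative, because each contribution is <= 1."""
    return math.log(len(values) / numeric_freq(values, t, h))

def numeric_sim(values, t, q, h):
    """S(t, q): Gaussian kernel centered at q, scaled by IDF(q)."""
    return math.exp(-0.5 * ((t - q) / h) ** 2) * numeric_idf(values, q, h)

prices = [10, 20, 49, 52, 90]  # hypothetical PRICE column
h = 10                         # hypothetical bandwidth
q = 50                         # query condition PRICE = 50

for t in (49, 52, 90):
    print(t, round(numeric_sim(prices, t, q, h), 4))
```

Unlike the discretized ranges, 49 and 52 receive almost the same similarity to the query value 50, while 90 is close to zero.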
11. IDF Similarity
Consider the following query:
Select * WHERE MFR_Country IN (“Germany”, “Italy”,
“Japan”)
$$SIM(T,Q) = \sum_{k=1}^{m} \max_{q \in Q_k} S_k(t_k, q)$$
where Q_k is the set of values listed for attribute A_k.
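For an IN condition, the per-attribute similarity is the best match over the listed values. The sketch below assumes a single-attribute query on the country column of the earlier example table, so SIM reduces to one max term. It also assumes the natural logarithm, and the names `s` and `in_clause_sim` are illustrative.

```python
import math
from collections import Counter

# MFR_Country column of the example table, in CAR_ID order.
countries = ["Germany", "Germany", "Germany", "Italy"]
n = len(countries)
idf = {v: math.log(n / f) for v, f in Counter(countries).items()}

def s(t, q):
    """S(t, q) = IDF(q) if t == q else 0. IDF is only looked up for values that occur."""
    return idf[q] if t == q else 0.0

def in_clause_sim(t, allowed):
    """max over q in Q_k of S(t, q) for one attribute."""
    return max(s(t, q) for q in allowed)

allowed = ["Germany", "Italy", "Japan"]
scores = [in_clause_sim(t, allowed) for t in countries]
print([round(x, 4) for x in scores])
```

The Italian car scores highest because Italy is the rarer value in the table; "Japan" never matches any tuple and so contributes nothing.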
12. QF Similarity
Problems with IDF:
In a realtor database, more homes were built in
recent years such as 2007 and 2008 than in
1980 and 1981. Thus recent years
have small IDF. Yet newer homes are in higher
demand.
In a bookstore DB, demand for an author is due
to factors other than the number of books he or she has written.
13. QF Similarity
WORKLOAD: Past Queries
Importance of attribute values is determined
by frequency of their occurrence in workload.
As in the example above, queries requesting
homes built in 2010 are more frequent than
queries for the year 1981
14. QF Similarity
For categorical data
RQF(q) = raw frequency of occurrence of value q of
attribute A in query strings of workload
RQFMax = raw frequency of most frequently occurring
value in workload
Query frequency QF(q) = RQF(q)/RQFMax
S(t, q) = QF(q) if q = t, otherwise 0
QF resembles TF
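The QF definition can be sketched as follows. The workload values are the MFR values that appear in the "Breaking Ties with QF" example later in the deck; the function names are assumptions made for this sketch.

```python
from collections import Counter

def query_frequency(workload_values):
    """QF(q) = RQF(q) / RQFMax for every value q seen in the workload."""
    rqf = Counter(workload_values)   # raw frequency of each value
    rqf_max = max(rqf.values())      # frequency of the most frequent value
    return {v: f / rqf_max for v, f in rqf.items()}

def qf_sim(t, q, qf):
    """S(t, q) = QF(q) if q == t else 0; values absent from the workload get 0."""
    return qf.get(q, 0.0) if t == q else 0.0

mfr_workload = ["Audi", "Audi", "Lamborghini", "Mercedes", "Lamborghini", "Audi"]
qf_mfr = query_frequency(mfr_workload)
print(qf_mfr)
```

Audi is requested most often, so QF(Audi) = 1, while QF(Mercedes) = 1/3.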
16. QF Similarity
Similarity between pairs of different categorical
attribute values can also be derived from workload
E.g. to find S(Audi, Mercedes)
The similarity coefficient between t and q in this case is
defined as the Jaccard coefficient scaled by the QF factor:
S(t,q) = J(W(t),W(q)) * QF(q)
J(A,B) = |A ∩ B| / |A ∪ B|
W(t) = subset of queries in workload W in which
categorical value t occurs in an IN clause
For t = q this reduces to S(q,q) = QF(q), as in the exact-match case.
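A small Python sketch of this workload-based similarity is given below. It assumes the Jaccard coefficient is multiplied by QF(q), so that S(q, q) reduces to QF(q) as in the exact-match definition. The workload of IN-clause value sets is entirely hypothetical, as are the function names.

```python
from collections import Counter

# Hypothetical workload: the set of MFR values in the IN clause of each past query.
workload = [
    {"Audi", "Mercedes"},
    {"Audi", "Mercedes", "BMW"},
    {"Audi"},
    {"Lamborghini"},
]

def queries_with(value, workload):
    """W(t): indexes of the workload queries whose IN clause contains the value."""
    return {i for i, vals in enumerate(workload) if value in vals}

def jaccard(a, b):
    """J(A, B) = |A intersect B| / |A union B|, defined as 0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def query_frequency(workload):
    """QF(q) = RQF(q) / RQFMax over all values mentioned in the workload."""
    rqf = Counter(v for vals in workload for v in vals)
    rqf_max = max(rqf.values())
    return {v: f / rqf_max for v, f in rqf.items()}

def workload_sim(t, q, workload, qf):
    """S(t, q) = J(W(t), W(q)) * QF(q)."""
    return jaccard(queries_with(t, workload), queries_with(q, workload)) * qf.get(q, 0.0)

qf = query_frequency(workload)
print(workload_sim("Audi", "Mercedes", workload, qf))
```

Audi and Mercedes are often requested together, so they come out as similar, whereas Audi and Lamborghini never co-occur and get similarity 0.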
17. QF-IDF
For QF-IDF Similarity
S(t,q)=QF(q) *IDF(q) when t=q otherwise 0
18. BREAKING TIES
If SIM(T1, Q) = SIM(T2, Q),
which tuple should be ranked higher?
QF and IDF partition the database into classes of equally ranked tuples
CAR_ID  MODEL     MFR          MFR_Country  Type
1       SLR       Mercedes     Germany      Sports
2       A6        Audi         Germany      Executive
3       R8        Audi         Germany      Sports
4       Gallardo  Lamborghini  Italy        Sports
Q: SELECT * WHERE Type = “Sports” AND MFR_Country
= “Germany”
19. Breaking Ties with QF
Determine weights of missing attribute values that
reflect their “global importance” using workload.
$$\text{Global Imp} = \sum_{k} \log(QF(t_k))$$
where t_k is the tuple's value for the k-th missing attribute
Missing Attributes for Q: MFR and Model
20. Breaking Ties with QF
Considering Workload with following values of MFR and
Model
MFR{Audi, Audi, Lamborghini, Mercedes, Lamborghini, Audi}
Model{R8, A6, Gallardo, SLR, Gallardo, A6}
QF(SLR) = 1/2 = 0.5
QF(Mercedes) = 1/3 = 0.33
For tuple 1 (SLR, Mercedes, Germany, Sports):
Global Imp = log(0.5) + log(0.33)
NEGATIVE VALUES of Global Imp ??
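The tie between tuples 1 and 3 can be worked through in Python using the workload values above. This is a sketch with assumptions: the natural logarithm is used, every missing-attribute value is assumed to occur in the workload (otherwise log(0) would be undefined), and the function names are illustrative.

```python
import math
from collections import Counter

def query_frequency(values):
    """QF(q) = RQF(q) / RQFMax."""
    rqf = Counter(values)
    rqf_max = max(rqf.values())
    return {v: f / rqf_max for v, f in rqf.items()}

# Workload values from the slide.
qf = {
    "MFR": query_frequency(["Audi", "Audi", "Lamborghini", "Mercedes", "Lamborghini", "Audi"]),
    "MODEL": query_frequency(["R8", "A6", "Gallardo", "SLR", "Gallardo", "A6"]),
}

def global_importance(row, missing_attrs, qf):
    """Sum of log(QF(t_k)) over the attributes not mentioned in the query."""
    return sum(math.log(qf[a][row[a]]) for a in missing_attrs)

# Tuples 1 and 3 tie for Q: Type = "Sports" AND MFR_Country = "Germany".
tied = {
    1: {"MODEL": "SLR", "MFR": "Mercedes"},
    3: {"MODEL": "R8", "MFR": "Audi"},
}
scores = {tid: global_importance(r, ["MFR", "MODEL"], qf) for tid, r in tied.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(scores, ranked)
```

Because QF is at most 1, every score is zero or negative, but the ordering is still usable: tuple 3 (Audi R8) has the larger value and is ranked above tuple 1.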
21. Breaking Ties with IDF
Option 1: tuples with large IDF (occurring infrequently) of
missing attributes are ranked higher
Drawback: cars which are not popular are ranked higher
Option 2: tuples with small IDF of missing attributes
are ranked higher
Drawback: cars having a moonroof, which is a desirable but
infrequent feature, will be ranked lower.
23. Implementation
Pre Processing Component
Compute and store a representation of similarity
function(QF-IDF, QF, IDF) in auxiliary database
tables
24. Implementation
Query Processing Component
Job: Retrieving Top-K results from Database
ITA Algorithm: Use of Fagin’s Threshold Algorithm
and Similarity function
Sorted Access: Along any attribute Ak, TIDs of tuples
are retrieved.
Random Access: entire tuple corresponding to a TID
is retrieved.
25. ITA Algorithm
Initialize Top-K Buffer to empty
Repeat
  For each k = 1 to p
    TID = index of the next tuple retrieved from the ordered list Lk
    T = complete tuple retrieved for TID
    Compute value of ranking function for T
    If rank of T is higher than the rank of the lowest-ranking tuple in
    Top-K Buffer, then update Top-K Buffer
    If stopping condition has been reached then exit
  End For
Until all indexes of the tuples have been seen.
26. ITA Algorithm
Stopping Condition
Hypothetical tuple: the current values a1, …, ap
for A1, …, Ap, corresponding to the index seeks on
L1, …, Lp, and qp+1, …, qm for the remaining
columns, taken from the query directly.
Termination: stop when the similarity of the hypothetical tuple
to the query is less than that of the tuple in the Top-K buffer
with the least similarity.
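The loop and its stopping condition can be sketched in Python as below. This is a simplified illustration and not the authors' implementation. It assumes categorical attributes with IDF similarity, builds one in-memory ordered list per query attribute (so p = m), checks the stopping condition once per round instead of after every tuple, and also stops on ties, which is the usual threshold-algorithm check. All names are hypothetical.

```python
import heapq
import math
from collections import Counter

def ita_top_k(rows, query, k):
    """rows: {tid: {attr: value}}; query: {attr: value}. Returns [(score, tid), ...]."""
    n = len(rows)
    idf = {a: {v: math.log(n / f) for v, f in Counter(r[a] for r in rows.values()).items()}
           for a in query}

    def s(attr, t):
        # S(t, q) = IDF(q) if t == q else 0
        q = query[attr]
        return idf[attr].get(q, 0.0) if t == q else 0.0

    def score(row):
        return sum(s(a, row[a]) for a in query)

    # Sorted access: TIDs per attribute, ordered by decreasing per-attribute similarity.
    lists = {a: sorted(rows, key=lambda tid, a=a: -s(a, rows[tid][a])) for a in query}
    seen, top = set(), []  # top is a min-heap of (score, tid)
    for pos in range(n):
        current = {}
        for a in query:
            tid = lists[a][pos]
            current[a] = rows[tid][a]
            if tid in seen:
                continue
            seen.add(tid)
            sc = score(rows[tid])  # random access to the complete tuple
            if len(top) < k:
                heapq.heappush(top, (sc, tid))
            elif sc > top[0][0]:
                heapq.heapreplace(top, (sc, tid))
        # Hypothetical tuple: best score any unseen tuple could still reach.
        threshold = sum(s(a, current[a]) for a in query)
        if len(top) == k and threshold <= top[0][0]:
            break
    return sorted(top, key=lambda x: (-x[0], x[1]))

cars = {
    1: {"MFR_Country": "Germany", "Type": "Sports"},
    2: {"MFR_Country": "Germany", "Type": "Executive"},
    3: {"MFR_Country": "Germany", "Type": "Sports"},
    4: {"MFR_Country": "Italy", "Type": "Sports"},
}
result = ita_top_k(cars, {"Type": "Sports", "MFR_Country": "Germany"}, k=2)
print(result)
```

On the example table, the loop stops after two rounds of sorted access, once both buffer entries score at least as high as the hypothetical tuple.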
27. ITA for Numeric columns
Consider a query that has a condition Ak = qk for a
numeric column Ak.
Two index scans are performed on Ak.
The first retrieves TIDs with values > qk in increasing order.
The second retrieves TIDs with values < qk in decreasing order.
We then pick TIDs from the merged stream.
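The merged stream can be sketched with two cursors moving outward from qk. The sketch assumes the index is an in-memory list of (value, TID) pairs sorted by value, and that values equal to qk are served by the upward scan; both are assumptions made for illustration, as is the function name.

```python
import bisect

def numeric_sorted_access(index, q):
    """Yield TIDs in order of increasing |value - q| by merging two index scans.

    index: list of (value, tid) pairs sorted by value.
    """
    values = [v for v, _ in index]
    hi = bisect.bisect_left(values, q)  # upward scan starts at the first value >= q
    lo = hi - 1                         # downward scan starts at the last value < q
    while lo >= 0 or hi < len(index):
        # Take from the upward scan if the downward scan is exhausted,
        # or if its next value is at least as close to q.
        take_hi = lo < 0 or (hi < len(index) and index[hi][0] - q <= q - index[lo][0])
        if take_hi:
            yield index[hi][1]
            hi += 1
        else:
            yield index[lo][1]
            lo -= 1

price_index = [(10, "a"), (20, "b"), (49, "c"), (52, "d"), (90, "e")]  # hypothetical
order = list(numeric_sorted_access(price_index, 50))
print(order)
```

Because the Gaussian similarity decreases with distance from qk, visiting TIDs by increasing distance gives sorted access in decreasing similarity order.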
28. Conclusion
Automated Ranking Infrastructure for SQL
databases.
Extended TF-IDF based techniques from
Information retrieval to numeric and mixed
data.
Implementation of Ranking function that
exploited Fagin’s TA