VCHUNK JOIN
An Efficient Algorithm for Edit Similarity Joins
Vignesh Babu, M.R.J.
Computer Science and Engineering
K.L.N. College of Information Technology,
Sivagangai, India
vigneshbabumrj@gmail.com
Thanush Kodi, N. and Vijay Koushik, S.
Computer Science and Engineering
K.L.N. College of Information Technology,
Sivagangai, India
nthanushkodi@gmail.com
svijaykoushik@gmail.com
Abstract— Similarity join is an important operation that underlies many applications such as data integration, record linkage, and pattern recognition. We introduce a new algorithm for similarity joins with edit distance constraints. Existing methods extract overlapping grams from strings and consider as candidates only strings that share a certain number of grams. We instead propose extracting non-overlapping substrings, or chunks, from strings. The chunking scheme is based on a tail-restricted chunk boundary dictionary (CBD). Our approach integrates existing filters for calculating similarity with several new filters unique to the chunk-based method. A greedy algorithm automatically selects a good chunking scheme for a given dataset. Experimental results show that our method occupies less space and computes the similarity values faster.
Keywords—chunk boundary dictionary; edit similarity joins; approximate string matching; prefix filtering; q-gram; edit distance
I. INTRODUCTION
This project designs an efficient algorithm that reduces the space requirements of edit similarity joins, using a rank-based method to reduce the overlap among filtered data from the dataset during the data integration process.
A similarity join finds pairs of objects from two datasets such that each pair's similarity value is no less than a given threshold. Early research in this area focused on similarities defined in the Euclidean space; current research has spread to various similarity (or distance) functions. We consider similarity joins with an edit distance constraint of no more than a constant threshold. The major technical difficulty lies in the fact that the edit distance metric is complex and costly to compute. In order to scale to large datasets, all existing algorithms adopt the filter-and-verify paradigm: they first eliminate non-promising string pairs with relatively inexpensive computation and then perform verification by the edit distance computation.
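The filter-and-verify paradigm can be sketched as follows; this is a hedged illustration, not the paper's actual implementation, and the signature and edit distance functions are passed in as parameters since each concrete algorithm supplies its own:

```python
# Minimal sketch of the filter-and-verify paradigm. `signatures` maps a
# string to a set of signatures (e.g. its q-grams or chunks), and
# `edit_distance` is any edit distance routine; tau is the threshold.
def filter_and_verify(strings_r, strings_s, signatures, edit_distance, tau):
    results = []
    for r in strings_r:
        sig_r = signatures(r)
        for s in strings_s:
            # Filter step: a cheap signature-overlap test prunes most pairs.
            if sig_r & signatures(s):
                # Verify step: the expensive edit distance computation.
                if edit_distance(r, s) <= tau:
                    results.append((r, s))
    return results
```

With 2-gram signatures and a threshold of 1, for instance, a pair like ("abc", "abd") survives both stages, while pairs sharing no signature are pruned before any edit distance is computed.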
Gram-based methods extract all substrings of a certain length (q-grams), or substrings according to a dictionary (VGRAMs), for each string. Consecutive grams therefore overlap and are highly redundant. This results in a large index size and incurs high query processing costs when the index cannot be entirely accommodated in main memory; the total size of the q-grams can be six times larger than the text strings themselves.
Our observation is that the main redundancy of the q-gram based approach lies in its overlapping grams. We instead divide each string into several disjoint substrings, or chunks, and only need to process and index these chunks, which improves the space usage. The chunk-based method relies on a robust chunking scheme, with salient properties for each string, based on a tail-restricted Chunk Boundary Dictionary (CBD), and targets edit distances of no more than a constant threshold. By applying prefix filtering and location-based mismatch filtering, the number of signatures generated by our algorithm is guaranteed to be no more than that of the q-gram-based method for any q >= 2. Another feature of the chunk-based method is that it uses the existing filters together with several new filters unique to chunk-based methods, including rank, chunk number, and virtual CBD filtering. We also consider the problem of selecting a good chunking scheme for a given dataset.
II. PROBLEM DEFINITION
A. Existing System
Previous similarity joins with edit distance constraints extract overlapping grams from strings. A q-gram is a contiguous substring of length q of a given string; we then find the positions at which the strings match. If two grams have the same content and their positions are within the edit distance threshold, similarity is calculated through length, count, position, and prefix filtering of the data from the dataset. The index of the q-gram method is very large, the index of the fixed-length chunk-based method is also huge, and the chunks of the variable-length gram (VGRAM) dictionary-based method are hard to predict, which complicates implementing that approach.
B. Proposed System
Our proposed system performs edit similarity joins based on extracting non-overlapping substrings, or chunks, from strings. A class of chunk-based methods built on the notion of a tail-restricted chunk boundary dictionary uses several existing filters and new powerful filtering techniques to reduce the size of the index. Our proposed algorithm performs length, position, prefix, content-based, and location-based filtering, which helps reduce the amount of space required by calculating similarity only for the data that pass the filters. We call the resulting algorithm VChunkJoin-NoLC. We further introduce the rank, chunk number, and virtual CBD filters to improve efficiency. A pair satisfying an edit distance below the threshold value is reported together with its similarity value.
III. INITIAL OPERATIONS
To perform a similarity join on data strings, the datasets must first be gathered and parsed. The parsed data is then added to the database.
A. Dataset Gathering
The first step is to gather datasets for the similarity
calculation. It was decided to use XML datasets for this
purpose. XML data sets are generic in nature and can be used
by software components of different platforms.
A dataset (or data set) is a collection of data.
Most commonly a dataset corresponds to the contents of a
single database table, or a single statistical data matrix, where
every column of the table represents a particular variable, and
each row corresponds to a given member of the dataset in
question. The dataset lists values for each of the variables, such
as height and weight of an object, for each member of the
dataset. Each value is known as a datum. The dataset may
comprise data for one or more members, corresponding to the
number of rows.
The term dataset may also be used more loosely, to refer to
the data in a collection of closely related tables, corresponding
to a particular experiment or event.
Here the term dataset refers to a collection of data in an XML file. It was decided to use the IMDB, DBLP, TREC, UNIREF, and ENRON datasets for this algorithm, as they usually contain a large number of similar values.
B. Xml Parsing
XML parsing or Pull parsing treats the document as a
series of items which are read in sequence using the Iterator
design pattern. This allows for writing of recursive-descent
parsers in which the structure of the code performing the
parsing mirrors the structure of the XML being parsed, and
intermediate parsed results can be used and accessed as
local variables within the methods performing the parsing,
or passed down (as method parameters) into lower-level
methods, or returned (as method return values) to higher-
level methods.
Here we use XML pull parsing to extract the data from the chosen datasets. The extracted data are then inserted into the database using SQL queries. At least two datasets are needed for the similarity calculation.
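As a concrete illustration, pull parsing as described above can be done with Python's standard-library `xml.etree.ElementTree.iterparse`; the `<title>` element name below is an illustrative assumption, not the actual schema of the datasets used:

```python
# Sketch of pull-style XML parsing: elements are consumed one at a time
# as the parser emits them, so the whole document never needs to be in
# memory at once.
import xml.etree.ElementTree as ET
from io import StringIO

def extract_texts(xml_source, tag):
    """Collect the text of every element with the given tag."""
    texts = []
    for event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag == tag:
            texts.append(elem.text)
            elem.clear()  # discard processed elements to keep memory low
    return texts

# Hypothetical two-record document for demonstration.
titles = extract_texts(StringIO("<db><title>A</title><title>B</title></db>"), "title")
# titles == ["A", "B"]
```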
IV. DATA FILTERING
The next step is to filter out the data that are likely to be similar. Candidate pairs can be identified by several methods, the most common being prefix filtering. On analysis, however, the prefix filtering method produces overlapping data as its filtered result, and this overlapped data occupies a large amount of memory. To overcome this problem, we use a rank-based filtering method to reduce the overlap.
A. Prefix Filtering
The prefix filtering method filters documents that have fields containing terms with a specified prefix. It is similar to a phrase query, except that it acts as a filter, and it can be placed within queries that accept a filter.
This approach is applied to the two datasets taken before. The first dataset is considered the prefix and the second dataset the suffix. For the filtering, the length of the prefix string is compared with that of the suffix string. If the prefix length matches the suffix length, the pair is filtered out and placed in a separate set.
Consider two datasets P and Q on which prefix filtering is to be applied. The filtered elements are placed in the set F such that
F = {x | length(P(m)) = length(Q(n))}
Where,
x – the filtered element.
length(P(m)) – the length of string m of set P.
length(Q(n)) – the length of string n of set Q.
The prefix filtering method was fast, but it yielded a large number of overlapping data which occupied too much space in memory.
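The length-matching condition above can be sketched as follows; grouping the second dataset by length is an implementation choice of this sketch, not something the paper prescribes:

```python
# Sketch of the length-matching prefix filter: keep (m, n) pairs from
# datasets P and Q whose string lengths are equal.
def prefix_filter(P, Q):
    by_length = {}
    for q in Q:
        by_length.setdefault(len(q), []).append(q)  # index Q by length
    # Pair each string in P with the Q strings of the same length.
    return [(p, q) for p in P for q in by_length.get(len(p), [])]
```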
B. Rank Based Filtering
The existing prefix filtering method occupied a large amount of memory, so we needed a new approach to reduce the space occupied by the filtered data. We therefore introduced a rank-based method to reduce the overlapping data.
In the rank-based approach we consider only the overlapped data; the other filtered elements are kept as they are. We apply a rank to the overlapped data in both datasets. A pair of elements whose rank difference is at least the chosen threshold is taken for further processing, and the rest are ignored. Rank calculation involves certain combination operations on the string.
F = {x | |rank(P(m)) − rank(Q(n))| ≥ τ}
Where,
τ – the threshold value.
rank(P(m)) – the rank of string m of dataset P.
rank(Q(n)) – the rank of string n of dataset Q.
By this approach most of the overlapping data are ignored
for the next process.
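The rank filter above can be sketched as follows; since the paper only states that rank computation involves combination operations, the rank function is left as a parameter in this sketch:

```python
# Sketch of the rank-based filter: among the overlapped pairs, keep only
# those whose rank difference reaches the threshold tau.
def rank_filter(overlapped_pairs, rank, tau):
    return [(p, q) for p, q in overlapped_pairs
            if abs(rank(p) - rank(q)) >= tau]
```

For illustration, string length can stand in for the rank function: with τ = 2, the pair ("aa", "abcd") is kept while ("ab", "abc") is ignored.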
V. SIMILARITY CALCULATION
The final step is to calculate the similarity between the filtered strings. The similarity calculation involves the following steps:
1. Edit distance calculation.
2. Chunking the strings.
3. Similarity calculation.
A. Edit Distance Calculation
In this module we calculate the edit distance between the matched text strings of the two datasets. The edit distance is computed by dynamic programming, which yields the minimum number of operations (insert, delete, substitute, transform) required to turn one string into the other. If two strings have the same length and the same character at every position, the edit distance is zero; otherwise it is the number of operations needed to transform one string into the other.
D(i, j) = min of:
D(i−1, j) + 1 (deletion),
D(i, j−1) + 1 (insertion),
D(i−1, j−1) if s_i = t_j (match),
D(i−1, j−1) + 1 if s_i ≠ t_j (substitution).
Where,
D(i, j) – the edit distance between the first i characters of s and the first j characters of t; D(i−1, j−1) is the leading diagonal element in the matrix.
s_i – the i-th character of string s.
t_j – the j-th character of string t.
The various operations are:
Insertion.
Deletion.
Substitution.
Transformation.
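The dynamic-programming recurrence described above translates directly into the following sketch, which fills the full matrix D (transformations beyond insert, delete, and substitute are not handled here):

```python
# Sketch of the dynamic-programming edit distance: D[i][j] holds the
# distance between the first i characters of s and the first j of t.
def edit_distance(s, t):
    m, n = len(s), len(t)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i              # i deletions turn s[:i] into ""
    for j in range(n + 1):
        D[0][j] = j              # j insertions turn "" into t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1   # match vs substitute
            D[i][j] = min(D[i - 1][j] + 1,            # deletion
                          D[i][j - 1] + 1,            # insertion
                          D[i - 1][j - 1] + cost)     # diagonal step
    return D[m][n]
```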
B. String Chunking
Chunks are the basic units needed to compute the similarity value between the two selected similar data from the datasets. Chunks are disjoint substrings of the given string. For the TREC dataset, a book-oriented dataset used in this paper, the CBD selection is done with the character set {a, e, g, l, s, v}. We split each string into chunks according to this set and take the number of chunks in the string as an input condition for the similarity calculation; it should be less than the constant threshold value.
Example:
Consider the string 'Adolescence'. According to the chunking rule above, the string is chunked as follows (underlined in the original):
A dol e s cence
Thus the string 'Adolescence' has been chunked into 5 parts.
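A simplified sketch of the chunking step, under the assumption that each CBD entry is a single character from the set above that closes a chunk; the actual tail-restricted CBD may contain longer entries, which would explain why the paper's example keeps 'cence' as one chunk:

```python
# Simplified chunking sketch: a chunk ends right after a CBD character
# (compared case-insensitively), and any trailing characters form the
# final chunk.
def chunk(string, cbd=frozenset("aeglsv")):
    chunks, current = [], ""
    for ch in string:
        current += ch
        if ch.lower() in cbd:   # a CBD character closes the chunk
            chunks.append(current)
            current = ""
    if current:                 # leftover tail becomes the last chunk
        chunks.append(current)
    return chunks
```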
C. Similarity Calculation
The similarity calculation produces the final result of this algorithm. We compute the similarity value between the two datasets, and it depends on the edit distance value. The similarity is calculated only for pairs whose edit distance is less than the threshold and whose rank difference between the two sets is within the threshold value.
The similarity value is calculated by subtracting the edit distance from the length of the longer string and dividing by that same length:
sim(s, t) = (max(|s|, |t|) − ed(s, t)) / max(|s|, |t|)
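The normalisation above can be sketched directly; the edit distance value is assumed to have been computed beforehand:

```python
# Sketch of the normalised similarity: the larger the edit distance
# relative to the longer string, the lower the similarity.
def similarity(s, t, edit_distance_value):
    longer = max(len(s), len(t))
    return (longer - edit_distance_value) / longer
```

For example, "kitten" and "sitting" with edit distance 3 have similarity (7 − 3) / 7 ≈ 0.571, while identical strings have similarity 1.0.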
VI. EXPERIMENTS
We use the following five categories of algorithms in
the experiment, all of which are implemented as in-memory
algorithms, with all their inputs loaded into memory before
running:
Fixed-length gram based: Ed-Join [8] and Winnowing [16].
VGRAM based: VGram-Join [11], [12].
Tree based: Bed-tree [21] and Trie-Join [22].
Enumeration based: PartEnum [7] and NGPP [23].
Vchunk based: VChunkJoin and VChunkJoin-NoLC.
VChunkJoin is our proposed algorithm equipped with all the
filterings and optimizations. When comparing with the
VGRAM-based method, we remove location-based and
content-based filterings, and name the resulting algorithm
VChunkJoin-NoLC.
VII. CONCLUSION
We have introduced a novel approach to process edit similarity joins efficiently, based on partitioning strings into non-overlapping chunks. We have designed an efficient edit similarity join algorithm that incorporates all existing filters, occupies less space than the existing filtering methods, and is more powerful than the existing algorithms at calculating the similarity value between two sets of strings. The existing q-gram and VGRAM approaches occupy a larger amount of index space and generate unnecessary candidate strings; our proposed algorithm VChunkJoin overcomes this when performing the similarity calculation. The CBD selection algorithm helps reduce the index size of the strings, and dynamic programming is used to compute the edit distance between the two datasets.
REFERENCES
[1] Arasu, A. Ganti, V. and Kaushik, R. “Efficient exact set-similarity
joins,” in VLDB, 2006.
[2] Behm, A. Ji, S. Li, S. and Lu, J. “Space-constrained gram-based
indexing for efficient approximate string search,” in ICDE, 2009.
[3] Elmagarmid, A. K., Ipeirotis, P. G. and Verykios, V. S. “Duplicate
record detection: A survey,” TKDE, vol. 19, no. 1, pp. 1–16, 2007.
[4] Li, C. Wang, B. and Yang, X. “VGRAM: Improving performance of
approximate queries on string collections using variable-length grams,”
in VLDB, 2007.
[5] Manber, U. “Finding similar files in a large file system,” in USENIX
Winter, 1994, pp. 1–10.
[6] Ramasamy, K., Patel, J. M., Naughton, J. F. and Kaushik, R. “Set
containment joins: The good, the bad and the ugly,” in VLDB, 2000, pp.
351–362.
[7] Sarawagi, S. and Kirpal, A. “Efficient set joins on similarity predicates,”
in SIGMOD, 2004.
[8] Wang, J. Feng, J. and Li, G. “Trie-join: Efficient trie-based string
similarity joins with edit,” in VLDB, 2010.
[9] Xiao, C. Wang, W. Lin, X. and Yu, J. X. U. “Efficient similarity joins
for near duplicate detection,” in WWW, 2008.
[10] Xiao, C. Wang, W. and Lin, X. “Ed-Join: an efficient algorithm for
similarity joins with edit distance constraints,” PVLDB, vol. 1, no. 1, pp.
933–944, 2008.
[11] Yang, X. Wang, B. and Li, C. “Cost-based variable-length-gram
selection for string collections to support approximate queries
efficiently,” in SIGMOD Conference, 2008, pp. 353–364.