International Association of Scientific Innovation and Research (IASIR) 
(An Association Unifying the Sciences, Engineering, and Applied Research) 
International Journal of Emerging Technologies in Computational 
and Applied Sciences (IJETCAS) 
www.iasir.net 
IJETCAS 14-624; © 2014, IJETCAS All Rights Reserved Page 286 
ISSN (Print): 2279-0047 
ISSN (Online): 2279-0055 
A Survey on String Similarity Matching Search Techniques 
S. Balan¹, Dr. P. Ponmuthuramalingam²
¹Ph.D. Research Scholar, ²Associate Professor & Head,
Department of Computer Science, Government Arts College (Autonomous), Coimbatore, Tamilnadu, INDIA. 
Abstract: String similarity matching search is mainly used to find text that is present in documents. Although many tools are available in the modern world, people still struggle to find information correctly because of the huge amount of information stored on the World Wide Web. The field of information retrieval was born in the 1950s, and in 1957 H.P. Luhn proposed the basic idea of searching text with a computer. String matching must also cope with errors: in online searching, for example, users face various problems and irrelevant results. The goal of this survey is to present an overview of string similarity matching and to compare different algorithms in order to identify which perform better when searching text. The problem appears in many areas; one of the most demanding is information retrieval, where the task is to find relevant information in a text collection and string matching is an important tool.
Keywords: Information retrieval, String Matching, Similarity Search, Approximate String Match 
I. Introduction 
In recent years the problem has received growing attention in the information retrieval and computational biology communities. The information retrieval problem can be approached from different views. A string is a sequence of characters over a finite alphabet. Similarity search returns a list of data items that are similar to an input query. In search engines such as Google or Yahoo, search is based on document similarity and query similarity. Document similarity is the overall similarity of an entire document to the given query. Query similarity suggests related query strings during searching and is based on machine learning [Thomas Bocek, et al., 2007]. The Text Retrieval Conference, or TREC [Harman 1993], was first held in 1992; it is sponsored by the US government and aims to encourage research in information retrieval from large text collections.
Under TREC, many old techniques were modified and many new techniques were developed for retrieval over large text collections. Algorithms developed in information retrieval were the first to be employed for searching the World Wide Web, from 1996 to 1998. Various models and implementations of information retrieval systems were available early on. A Boolean system lets users specify their information need as a combination of AND, OR and NOT operators. Such systems often fail to produce the relevant information. Several models have been proposed for this process; the three most used are the vector space model, the probabilistic models, and the inference network model [Amit Singhal 2001]. In the vector space model, a text is represented by a vector of terms [Gerard Salton, 1975]. Terms are typically words or phrases. Any text can be represented by a vector in a high-dimensional space. A term that belongs to the text gets a non-zero value in the vector. Most vector-based systems use positive term values, and a numeric score is assigned to a document for a query. In 1960 Maron and Kuhns proposed a probabilistic model; such models are based on the general principle that documents in a collection should be ranked by decreasing probability of their relevance to a query [Amit Singhal 2001]. Estimating these probabilities is the key part of the model. The inference network model treats document retrieval as an inference process in an inference network [Van Rijsbergen 1979]; most techniques can be implemented under this model. Similarity search is important for time-sensitive applications. The increasing amount of electronic information available on the web makes it necessary to improve data quality and to find all information relevant to the user request. Providing similarity search over a large dictionary may be too slow for many applications.
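The vector space ranking described above can be illustrated with a minimal sketch. It assumes whitespace tokenization and raw term counts as term weights; both are simplifying assumptions, and the function names are hypothetical rather than taken from any cited system.

```python
import math
from collections import Counter


def term_vector(text):
    """Represent a text as a sparse vector of raw term frequencies."""
    return Counter(text.lower().split())


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, documents):
    """Return (score, document) pairs sorted by decreasing similarity."""
    q = term_vector(query)
    scored = [(cosine_similarity(q, term_vector(d)), d) for d in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


docs = [
    "string matching in text collections",
    "probabilistic models of relevance",
    "approximate string matching",
]
for score, doc in rank("string matching", docs):
    print(f"{score:.3f}  {doc}")
```

Shorter documents that share the same query terms score higher here because the cosine normalizes by vector length; practical systems usually replace raw counts with a term weighting scheme.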
Various methods exist for fast similarity search. For example, search performance has been compared on an English dictionary and a randomly generated dictionary for dynamic programming, a keyword tree, neighborhood generation and n-grams with index lookup [Amit Chandel, 2006]. Extracting information from structured and unstructured text is a challenging problem in many applications such as data warehousing, web data integration and bio-informatics.
For example, to identify book authors in HTML pages, text strings are matched against known book authors and the accuracy of the string extraction is measured [Amit Chandel, 2006]. This paper is organized into four sections. Section 1 introduces information retrieval and string similarity search, Section 2 contains the literature survey, Section 3 contains an analysis of string similarity search techniques, and Section 4 presents the conclusion; references are listed in the last section.
II. Literature Survey 
The Aho-Corasick algorithm constructs a finite state pattern matching machine from the keywords and then uses it to process the text string in a single pass. It improved the speed of a library bibliographic search program by a factor of 5 to 10. The main purpose of this technique is to allow a bibliographer to find in a citation index all titles satisfying some Boolean function of keywords and phrases. A pattern matching machine m is a program that takes the text string s as input and produces as output the locations in s at which the keywords y appear as substrings. The machine consists of a set of states, each represented by a number. Its behavior is determined by three functions: a goto function go, a failure function fa and an output function out [Alfred V. Aho, et al., 1975].
S. Balan et al., International Journal of Emerging Technologies in Computational and Applied Sciences, 9(3), June-August, 2014, pp. 286-288
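A compact sketch of such a pattern matching machine is given here to make the roles of the three functions concrete. The dictionary-based state table and the names `build_machine` and `search` are assumptions made for illustration; the cited paper describes the functions abstractly and this is not its original code.

```python
from collections import deque


def build_machine(keywords):
    """Build the goto, failure and output functions for a set of keywords."""
    goto = [{}]      # goto[state][char] -> next state; state 0 is the start
    out = [set()]    # out[state] -> keywords recognised on entering state
    for word in keywords:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                out.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].add(word)

    fail = [0] * len(goto)
    queue = deque(goto[0].values())   # depth-1 states fail to state 0
    while queue:                      # breadth-first over the keyword trie
        r = queue.popleft()
        for ch, s in goto[r].items():
            queue.append(s)
            f = fail[r]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[s] = goto[f].get(ch, 0)
            out[s] |= out[fail[s]]    # inherit keywords ending at the fail state
    return goto, fail, out


def search(text, keywords):
    """Return sorted (start_index, keyword) pairs found in one pass over text."""
    goto, fail, out = build_machine(keywords)
    state = 0
    hits = []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in out[state]:
            hits.append((i - len(word) + 1, word))
    return sorted(hits)


print(search("ushers", ["he", "she", "his", "hers"]))
```

The failure function lets the machine fall back to the longest proper suffix that is still a keyword prefix, so the text is never rescanned and overlapping keywords are all reported.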
Edit distance [Levenshtein V.I, 1966] is the minimum number of operations required to transform one string into another, where an operation is a deletion, an insertion or a replacement of a character. Navarro's NR-grep [Navarro.G, 2000] is an exhaustive online similarity search algorithm. NR stands for non-deterministic reverse pattern matching. It uses bit-parallelism together with forward and backward searching. N-grams are created by sliding a window of length n over the data and noting the content and position of each window. An extension of this approach for large text collections uses cosine similarity [Koudas, et al., 2004], a global measure that compares strings represented as vectors of their n-gram frequencies.
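Two of the notions above, edit distance and positional n-grams, are small enough to sketch directly. The following is a textbook dynamic programming formulation with unit costs for all three operations; the function names are hypothetical and the code is not taken from any of the cited tools.

```python
def edit_distance(a, b):
    """Minimum number of deletions, insertions and replacements turning a into b."""
    prev = list(range(len(b) + 1))      # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]                      # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # replace ca by cb, or match
        prev = curr
    return prev[-1]


def positional_ngrams(s, n):
    """Slide a window of length n over s, recording position and content."""
    return [(i, s[i:i + n]) for i in range(len(s) - n + 1)]


print(edit_distance("kitten", "sitting"))
print(positional_ngrams("grep", 2))
```

Only two rows of the dynamic programming matrix are kept, so memory use is proportional to the length of the second string.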
Approximate similarity search based on hashing hashes the points from the database so that the probability of collision is much higher for objects that are close to one another than for objects that are far apart. It is an alternative to methods based on hierarchical tree decomposition when the number of dimensions is large. Related topics include locality-sensitive hashing, the analysis of locality-sensitive hashing, and nearest neighbor search. Approximate string matching is about finding a pattern in a text where one or both of them have suffered some kind of undesirable corruption. Existing indexing schemes are classified by data structure into suffix trees, suffix arrays, q-grams and q-samples. Search approaches are classified into two kinds, namely partitioning into exact searching and intermediate partitioning, based on text and patterns [Kaushik Chakrabarti, et al., 2008].
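The hashing idea can be sketched for bit vectors under Hamming distance, where each hash key is the projection of a vector onto a random sample of bit positions. The class below is a simplified illustration: the parameter names, the defaults and the fixed seed are assumptions, and it is not the indexing structure of any cited paper.

```python
import random
from collections import defaultdict


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


class HammingLSH:
    """Bit-sampling locality-sensitive hash index for equal-length bit vectors."""

    def __init__(self, dim, bits_per_key=8, num_tables=4, seed=0):
        rng = random.Random(seed)
        # Each table keys a vector by its bits at a random sample of positions.
        self.samples = [rng.sample(range(dim), bits_per_key)
                        for _ in range(num_tables)]
        self.tables = [defaultdict(list) for _ in range(num_tables)]

    def _keys(self, vector):
        return [tuple(vector[i] for i in positions) for positions in self.samples]

    def add(self, label, vector):
        for table, key in zip(self.tables, self._keys(vector)):
            table[key].append((label, vector))

    def query(self, vector):
        """Return (distance, label) of the closest colliding item, or None."""
        candidates = {}
        for table, key in zip(self.tables, self._keys(vector)):
            for label, stored in table.get(key, []):
                candidates[label] = stored
        if not candidates:
            return None
        return min((hamming(vector, stored), label)
                   for label, stored in candidates.items())


index = HammingLSH(dim=16)
index.add("a", [0, 1] * 8)
index.add("b", [1, 1, 0, 0] * 4)
print(index.query([0, 1] * 8))
```

Vectors that differ in few positions agree on a sampled key with high probability, so only colliding items need an exact distance computation; using several tables reduces the chance of missing a close neighbor.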
Topics covered in other surveys on string similarity matching include Hamming distance, reversals, block distance, q-gram distance, allowing swaps, approximate searching in multidimensional texts and in graphs, multi-pattern approximate matching, non-standard algorithms such as approximate or parallel algorithms, and indexed searching. String matching is commonly divided into multiple string matching, extended string matching, regular expression matching and approximate matching. Approximate matching draws on a range of algorithms for finding the similarity of a given string, such as dynamic programming algorithms, computing edit distance, text searching, improving the average case, other algorithms based on dynamic programming, algorithms based on automata, bit-parallel algorithms, parallelizing the NFA, parallelizing the DP matrix, algorithms for fast filtering of the text, partitioning into k + 1 pieces, approximate BNDM, other filtration algorithms, multi-pattern approximate searching, a hashing-based algorithm for one error, and searching for extended strings and regular expressions.
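Among the dynamic programming algorithms listed above, text searching differs from plain edit distance only in that a match may begin at any text position. A minimal sketch of that column-wise formulation follows; the function name and the convention of reporting exclusive end positions are assumptions made for illustration.

```python
def approximate_find(pattern, text, k):
    """Return end positions j such that some substring text[i:j] is within
    edit distance k of pattern (j is exclusive)."""
    m = len(pattern)
    # column[i] = best edit distance between pattern[:i] and any substring
    # of the text ending at the current position
    column = list(range(m + 1))
    ends = []
    for j, ch in enumerate(text, start=1):
        prev_diag = column[0]   # value of column[i - 1] from the previous step
        column[0] = 0           # a match may start anywhere in the text
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == ch else 1
            new = min(column[i] + 1,       # text character is extra
                      column[i - 1] + 1,   # pattern character is missing
                      prev_diag + cost)    # match or replacement
            prev_diag = column[i]
            column[i] = new
        if column[m] <= k:
            ends.append(j)
    return ends


print(approximate_find("survey", "a surgery on strings", 2))
```

The running time is proportional to the product of pattern and text lengths, which is why the filtering and bit-parallel methods named above are used to skip or speed up most of this work.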
III. Analysis of String Similarity Matching Techniques 
| S. No. | Author Name | Title | Methods | Advantages | Disadvantages |
|---|---|---|---|---|---|
| 1 | Alfred V. Aho and Margaret J. Corasick | Efficient String Matching: An Aid to Bibliographic Search | Pattern matching algorithm; Construction of goto, output and failure functions; Time complexity of algorithms | Locates keywords in a text string; Directed graph begins at state 0; Time complexity is large | Substrings may overlap with one another; Partially computed output function; Failure function stored in a one-dimensional array |
| 2 | Arvind Arasu, Venkatesh Ganti, et al. | Efficient Exact Set-Similarity Joins | Threshold-based SSJoin; Hamming SSJoin; Jaccard SSJoin | Threshold parameter is high; Vector representation between two sets; Similarity value is 0 or 1 | Different similarity sets; Dimensions differ; Common elements |
| 3 | Thomas Bocek, Burkhard Stiller, et al. | Fast Similarity Search in Large Dictionaries | Edit distance; NR-grep; N-grams and cosine similarity | Minimum operations required to transform one string into another; Reverse pattern matching; Offline approach | Dictionary size is low; Avoids number of searching words in NR-grep method; Similarity is shared |
| 4 | Kaushik Chakrabarti, Dong Xin, et al. | An Efficient Filter for Approximate Membership Checking | Pruning condition; Filtering by ISH; Weighted signatures | Three similarity measures are identified; Substring search is quick; Weighted signatures are in decreasing order | Lower bound value is not identified; String similarity is less; Different number of signatures |
| 5 | Amit Chandel, P.C. Nagesh, et al. | Efficient Batch Top-k for Dictionary-based Entity Recognition | Batch Top-K; Simple Top-K; Segmented algorithm | Finding the top-k scores; Decreasing IDF values; A token of the sub-query is strong or weak | Increasing run time for threshold values; Upper bound scoreless is removed; Existing tight features are not unique |
| 6 | Aristides Gionis, Piotr Indyk, et al. | Similarity Search in High Dimensions via Hashing | Locality Sensitive Hashing; Color histograms; Texture features | Better run time; Dependence on data size; To measure the performance | Value is small and there is resort needed; One index is not sufficient; Compare with SR-tree is low |
| 7 | Daniel Karch, Dennis Luxen, et al. | Improved Fast Similarity Search in Dictionaries | Preprocessing space; Preprocessing time; Query performance | String split parameter based on query time; Ten times faster; Maximum distance calculated | Speed is low; Does not store any information; Query time and search space size are average |
| 8 | Amit Singhal | Modern Information Retrieval: A Brief Overview | Vector space model; Probabilistic model; Inference network model | Calculated using term weighting; Relevance feedback based on user queries; Retrieval effectiveness | Boolean systems are less effective; Poor stemming; Style of phrase generation is not critical |
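The set-similarity joins listed for Arasu et al. rest on a threshold predicate over a set similarity such as Jaccard. The brute-force sketch below illustrates only that predicate; it makes no attempt to reproduce the filtering used in the cited work, and the whitespace tokenizer and function names are assumptions.

```python
def tokens(s):
    """Split a string into a set of lowercase word tokens."""
    return set(s.lower().split())


def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similarity_join(left, right, threshold):
    """Naive all-pairs join: keep pairs whose Jaccard similarity meets the threshold."""
    return [(x, y) for x in left for y in right
            if jaccard(tokens(x), tokens(y)) >= threshold]


pairs = similarity_join(
    ["fast similarity search", "string matching"],
    ["similarity search in dictionaries", "approximate string matching"],
    threshold=0.4,
)
print(pairs)
```

Comparing every pair is quadratic in the input size, which is the cost that practical join algorithms are designed to avoid.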
IV. Conclusion 
This survey focused on various algorithms for string similarity matching based on search techniques. Some of the algorithms address set similarity, where the similarity value is 0 or 1. This indicates that the earlier algorithms match more than necessary in many cases. The performance of the algorithms was analyzed and presented in tabular form. The survey also covered information retrieval and search engines on the World Wide Web. To improve the quality of word similarity search, the next step is to refine exact similarity using the semantic relationships of a word. This further reduces the time and size required for a large database.
V. References 
[1]. Alfred V. Aho and Margaret J. Corasick (Bell Laboratories), Efficient String Matching: An Aid to Bibliographic Search, Communications of the ACM, Vol. 18, No. 6, June 1975.
[2]. Amit Chandel, P.C. Nagesh, Sunita Sarawagi, Efficient Batch Top-k for Dictionary-based Entity Recognition, Proc. 22nd International Conference on Data Engineering, pp. 28, 2006.
[3]. Amit Singhal, Modern Information Retrieval: A Brief Overview, IEEE Computer Society Technical Committee on Data Engineering, pp. 1-9, 2001.
[4]. Aristides Gionis, Piotr Indyk, Rajeev Motwani, Similarity Search in High Dimensions via Hashing, Proceedings of the 25th VLDB Conference, Edinburgh, Scotland, pp. 518, 1999.
[5]. Arvind Arasu, Venkatesh Ganti, Raghav Kaushik, Efficient Exact Set-Similarity Joins, VLDB '06, September 12-15, 2006, Seoul, Korea, VLDB Endowment, ACM 1-59593-385-9/06/09.
[6]. Daniel Karch, Dennis Luxen, Peter Sanders, Improved Fast Similarity Search in Dictionaries, presented at the 17th Symposium on String Processing and Information Retrieval, 2010.
[7]. Gerard Salton, A. Wong, and C. S. Yang, A Vector Space Model for Automatic Indexing, Communications of the ACM, 18(11):613–620, November 1975.
[8]. Harman D.K., Overview of the First Text REtrieval Conference (TREC-1), in Proceedings of the First Text REtrieval Conference (TREC-1), pp. 1–20, NIST Special Publication 500-207, March 1993.
[9]. Kaushik Chakrabarti, Dong Xin, et al., An Efficient Filter for Approximate Membership Checking, SIGMOD '08, June 9–12, 2008, Vancouver, BC, Canada, ACM 9781605581026/08/06.
[10]. Koudas N., Marathe A., Srivastava D., Flexible String Matching Against Large Databases in Practice, in VLDB, pp. 1078–1086, 2004.
[11]. Levenshtein V.I., Binary codes capable of correcting deletions, insertions and reversals, Sov. Phys. Dokl., 10:707–710, 1966.
[12]. Navarro G., NR-grep: A Fast and Flexible Pattern Matching Tool, Technical Report TR/DCC-2000-3, University of Chile, Departamento de Ciencias de la Computación, Santiago, 2000, http://www.dcc.uchile.cl/gnavarro.
[13]. Thomas Bocek, Burkhard Stiller, et al., Fast Similarity Search in Large Dictionaries, University of Zurich, Department of Informatics (IFI), Binzmühlestrasse 14, CH-8050 Zürich, Switzerland, 2007.
[14]. Van Rijsbergen C.J., Information Retrieval, Butterworths, London, 1979.

 
Russian Call Girls in Andheri Airport Mumbai WhatsApp 9167673311 💞 Full Nigh...
Russian Call Girls in Andheri Airport Mumbai WhatsApp  9167673311 💞 Full Nigh...Russian Call Girls in Andheri Airport Mumbai WhatsApp  9167673311 💞 Full Nigh...
Russian Call Girls in Andheri Airport Mumbai WhatsApp 9167673311 💞 Full Nigh...Pooja Nehwal
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsTechSoup
 
Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..Disha Kariya
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...EduSkills OECD
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxGaneshChakor2
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformChameera Dedduwage
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfsanyamsingh5019
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdfQucHHunhnh
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdfSoniaTolstoy
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationnomboosow
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactdawncurless
 

Dernier (20)

Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global Impact
 
The byproduct of sericulture in different industries.pptx
The byproduct of sericulture in different industries.pptxThe byproduct of sericulture in different industries.pptx
The byproduct of sericulture in different industries.pptx
 
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activity
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
 
social pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajansocial pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajan
 
Russian Call Girls in Andheri Airport Mumbai WhatsApp 9167673311 💞 Full Nigh...
Russian Call Girls in Andheri Airport Mumbai WhatsApp  9167673311 💞 Full Nigh...Russian Call Girls in Andheri Airport Mumbai WhatsApp  9167673311 💞 Full Nigh...
Russian Call Girls in Andheri Airport Mumbai WhatsApp 9167673311 💞 Full Nigh...
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The Basics
 
Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptx
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy Reform
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdf
 
Advance Mobile Application Development class 07
Advance Mobile Application Development class 07Advance Mobile Application Development class 07
Advance Mobile Application Development class 07
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
 
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communication
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impact
 

Ijetcas14 624

International Association of Scientific Innovation and Research (IASIR)
(An Association Unifying the Sciences, Engineering, and Applied Research)
International Journal of Emerging Technologies in Computational
and Applied Sciences (IJETCAS)
www.iasir.net
IJETCAS 14-624; © 2014, IJETCAS All Rights Reserved Page 286
ISSN (Print): 2279-0047
ISSN (Online): 2279-0055
A Survey on String Similarity Matching Search Techniques
S.Balan1, Dr. P.Ponmuthuramalingam2
1Ph.D. Research Scholar, 2Associate Professor & Head,
Department of Computer Science, Government Arts College (Autonomous), Coimbatore, Tamilnadu, INDIA.
Abstract: String similarity matching search is mainly used to find text that is present in documents. Over the years many facilities have become available, yet people still struggle to find the right information because of the huge amount of information stored on the World Wide Web. The field of information retrieval was born in the 1950s, and in 1957 H.P. Luhn proposed the basic idea of searching text with a computer. String matching must also cope with errors: in online searching, for example, users face various problems and irrelevant results. The goal of this survey is to present an overview of string similarity matching and to compare different algorithms in order to identify those that perform better at searching text. The problem appears in many areas; one of the most demanding is information retrieval, where string matching is an important tool for finding relevant information in a text collection.
Keywords: Information retrieval, String Matching, Similarity Search, Approximate String Match
I. Introduction
In recent years the problem has attracted growing interest in the information retrieval and computational biology communities. The information retrieval problem can be addressed from different viewpoints. A string is a sequence of characters over a finite alphabet.
Similarity search returns a list of data items similar to an input query. In search engines such as Google or Yahoo, search is based on document similarity and query similarity. Document similarity is the overall similarity of an entire document to the given query. Query similarity suggests query strings during searching and is based on machine learning [Thomas Bocek, et al., 2007]. The first Text REtrieval Conference (TREC), held in 1992 [Harman, 1993] and sponsored by the US government, aimed to encourage research in information retrieval from large text collections. As a result, many old techniques were modified and many new techniques were devised for retrieval over large text collections. Algorithms developed in information retrieval were the first to be used for searching the World Wide Web, from 1996 to 1998.

Various models and implementations of information retrieval systems have been available since the early days. A Boolean system lets users specify their information need as a combination of AND, OR and NOT operators, but such systems often fail to return the relevant information. Several models have been proposed for this process; the three most widely used are the vector space model, the probabilistic models and the inference network model [Amit Singhal, 2001]. In the vector space model, text is represented by a vector of terms [Gerard Salton, 1975]. Terms are typically words or phrases, so any text can be represented by a vector in a high-dimensional space. A term that belongs to the text receives a non-zero value in the vector. Most vector-based systems use positive term values, which are used to assign a numeric score to a document for a query. In 1960 Maron and Kuhns proposed ranking by probability of relevance, and many probabilistic models have followed; they are based on the general principle that documents in a collection should be ranked by decreasing probability of their relevance to a query [Amit Singhal, 2001]. Estimating these probabilities is the key part of the model. The inference network model treats document retrieval as an inference process in an inference network.
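The vector space model described above can be illustrated with a minimal sketch. It assumes raw term counts as vector values and cosine similarity as the score; the function names (term_vector, cosine_similarity, rank) are hypothetical and are not taken from the surveyed papers, which typically use more elaborate term weighting.

```python
import math
from collections import Counter


def term_vector(text):
    """Map a text to a sparse vector of raw term counts (terms = lowercase words)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term vectors; 0.0 if either is empty."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, documents):
    """Return (score, document) pairs sorted by decreasing similarity to the query."""
    q = term_vector(query)
    scored = [(cosine_similarity(q, term_vector(d)), d) for d in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

Calling rank("string matching", documents) on a small list of document strings places the documents that share the most terms with the query first; documents with no terms in common receive a score of 0.0.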
Most techniques are implemented under the inference network model [Van Rijsbergen, 1979]. Similarity search is important for time-sensitive applications. The amount of electronic information available on the web keeps increasing, and similarity search is needed to improve data quality or to find all information relevant to a user request. A similarity search whose cost grows with the dictionary size may be too slow for many applications. Various methods exist for fast similarity search; for example, search performance has been compared on an English dictionary and a randomly generated dictionary for dynamic programming, a keyword tree, neighborhood generation and n-grams with index lookup [Amit Chandel, 2006]. Extracting structured entities from unstructured text is a challenging problem in many applications such as data warehousing, web data integration and bioinformatics. For example, to identify book authors in HTML pages, text strings are matched against known book authors and the accuracy of the extraction is measured [Amit Chandel, 2006].

This paper is organized into four sections. Section 1 contains the introduction to information retrieval and string similarity search, Section 2 contains the literature survey, Section 3 contains the analysis of string similarity search techniques, and Section 4 presents the conclusion; the references are listed in the last section.

S. Balan et al., International Journal of Emerging Technologies in Computational and Applied Sciences, 9(3), June-August, 2014, pp. 286-288

II. Literature Survey
The algorithm of Aho and Corasick constructs a finite state pattern matching machine from the keywords and uses it to process the text string in a single pass. It improved the speed of a library bibliographic search program by a factor of 5 to 10. The main purpose of the technique is to allow a bibliographer to find, in a citation index, all titles satisfying some Boolean function of keywords and phrases. A pattern matching machine m is a program that takes as input the text string s and produces as output the locations in s at which the keywords y appear as substrings. The machine consists of a set of states, each represented by a number. Its behavior is determined by three functions: a goto function go, a failure function fa and an output function out [Alfred V. Aho, et al., 1975].

Edit distance [Levenshtein V.I, 1966] is the minimum number of operations required to transform one string into another, where an operation is a deletion, an insertion or a replacement of a character. Navarro's NR-grep [Navarro.G, 2000] is an exhaustive online similarity search algorithm. NR stands for non-deterministic reverse pattern matching. It uses bit-parallelism and forward and backward searching. An n-gram is created by sliding a window of length n over the data and noting the content and position of each window. An extension of this approach for large text collections uses cosine similarity [Koudas, et al., 2004], a global measure that compares strings represented as vectors of n-gram frequencies.

Approximate similarity search based on hashing hashes the points of the database so that the probability of collision is much higher for objects that are close to one another than for those that are far apart [Aristides Gionis, et al., 1999]. It is an alternative to methods based on hierarchical tree decomposition, which scale poorly to a large number of dimensions. The work covers locality-sensitive hashing, the analysis of locality-sensitive hashing and nearest neighbor search. Approximate string matching is about finding a pattern in a text where one or both of them have suffered some kind of undesirable corruption.
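As an illustration of two of the measures surveyed above, the following minimal sketch computes the edit distance by dynamic programming and extracts positional n-grams with a sliding window. The function names edit_distance and positional_ngrams are hypothetical, and the code is a plain textbook formulation rather than the optimized variants used in the cited work.

```python
def edit_distance(source, target):
    """Minimum number of insertions, deletions and replacements turning source into target."""
    # previous[j] holds the distance between the processed prefix of source and target[:j].
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # delete s_char
                current[j - 1] + 1,      # insert t_char
                previous[j - 1] + cost,  # replace (or keep when equal)
            ))
        previous = current
    return previous[-1]


def positional_ngrams(text, n):
    """Slide a window of length n over text, recording each window and its start position."""
    return [(text[i:i + n], i) for i in range(len(text) - n + 1)]
```

For instance, edit_distance("kitten", "sitting") evaluates to 3 (two replacements and one insertion), and positional_ngrams("search", 3) yields the four windows "sea", "ear", "arc" and "rch" with start positions 0 to 3. Keeping only two rows of the dynamic programming matrix limits memory use to the length of the target string.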
In the context of data structures, existing indexing schemes are classified into suffix trees, suffix arrays, Q-grams and Q-samples. Search approaches are classified into two kinds, namely partitioning into exact searching and intermediate partitioning, based on text and patterns [Kaushik Chakrabarti, et al., 2008]. Other surveys on string similarity matching cover Hamming distance, reversals, block distance, Q-gram distance, allowing swaps, approximate searching in multidimensional texts and in graphs, multi-pattern approximate matching, non-standard algorithms such as approximate or parallel algorithms, and indexed searching. String matching comes in several types, namely multiple string matching, extended string matching, regular expression matching and approximate matching. Approximate matching includes various algorithms for finding the similarity of a given string, such as dynamic programming algorithms, computing edit distance, text searching, improving the average case, other algorithms based on dynamic programming, algorithms based on automata, bit-parallel algorithms, parallelizing the NFA, parallelizing the DP matrix, algorithms for fast filtering of the text, partitioning into k + 1 pieces, approximate BNDM, other filtration algorithms, multi-pattern approximate searching, a hashing-based algorithm for one error, and searching for extended strings and regular expressions.

III. Analysis of String Similarity Matching Techniques
Sno | Author Name | Title | Methods | Advantages | Disadvantages
1 | Alfred V. Aho and Margaret J. Corasick | Efficient String Matching: An Aid to Bibliographic Search | Pattern matching algorithm; Construction of goto, output and failure functions; Time complexity of algorithms | Locates keyword in a text string; Directed graph begins at the state 0; Time complexity is large | Substrings may overlap with one another; Partially computed output function; Failure function stored in one-dimensional array
2 | Arvind Arasu, Venkatesh Ganti, et al. | Efficient Exact-Set Similarity Joins | Threshold-based SSJoin; Hamming SSJoin; Jaccard SSJoin | Threshold parameter is high; Vector representation between two sets; Similarity value is 0 or 1 | Different similarity sets; Dimensions differ; Common elements
3 | Thomas Bocek, Burkhard Stiller, et al. | Fast Similarity Search in Large Dictionaries | Edit distance; NR-grep; N-grams and cosine similarity | Minimum operations required to transform one string into another; Reverse pattern matching; Offline approach | Dictionary size is low; Avoids number of searching words in NR-grep method; Similarity is shared
4 | Kaushik Chakrabarti, Dong Xin, et al. | An Efficient Filter for Approximate Membership Checking | Pruning condition; Filtering by ISH; Weighted signatures | Three similarity measures are identified; Substring search is quick; Weighted signature is in decreasing order | Lower bound value is not identified; String similarity is less; Different number of signatures
5 | Amit Chandel, P.C. Nagesh, et al. | Efficient Batch Top-k for Dictionary-based Entity Recognition | Batch Top-K; Simple Top-K; Segmented algorithm | Finding the top-k scores; Decreasing IDF values; A token of the sub-query is strong or weak | Increasing run time for threshold values; Upper bound scoreless is removed; Existing tight features are not unique
6 | Aristides Gionis, Piotr Indyk, et al. | Similarity Search in High Dimensions via Hashing | Locality-sensitive hashing; Color histograms; Texture features | Better run time; Dependence on data size; To measure the performance | Value is small and there is resort needed; One index is not sufficient; Compare with SR-tree is low
7 | Daniel Karch, Dennis Luxen, et al. | Improved Fast Similarity Search in Dictionaries | Preprocessing space; Preprocessing time; Query performance | String split parameter based on query time; Ten times faster; Maximum distance calculated | Speed is low; Does not store any information; Query time and search space size are average
8 | Amit Singhal | Modern Information Retrieval: A Brief Overview | Vector space model; Probabilistic model; Inference network model | Calculated using term weighting; Relevance feedback based on user queries; Retrieval effectiveness | Boolean systems are less effective; Poor stemming; Style of phrase generation is not critical

IV. Conclusion
This survey focuses on various algorithms for string similarity matching based on search techniques. Some of the set similarity algorithms produce a similarity value of 0 or 1, which indicates that the earlier algorithms report matches in many more cases. The performance of the algorithms is analyzed and presented in tabular form. The survey also covers information retrieval and search engines on the World Wide Web.
To improve the quality of word similarity search, the exact similarity can next be refined based on the semantic relationships of a word. This would further reduce the search time for a large database.

V. References
[1]. Alfred V. Aho and Margaret J. Corasick, Efficient String Matching: An Aid to Bibliographic Search, Communications of the ACM, Vol. 18, No. 6, June 1975.
[2]. Amit Chandel, P.C. Nagesh, Sunita Sarawagi, Efficient Batch Top-k for Dictionary-based Entity Recognition, Proc. 22nd International Conference on Data Engineering, pp. 28, 2006.
[3]. Amit Singhal, Modern Information Retrieval: A Brief Overview, IEEE Computer Society Technical Committee on Data Engineering, pp. 1-9, 2001.
[4]. Aristides Gionis, Piotr Indyk, Rajeev Motwani, Similarity Search in High Dimensions via Hashing, Proceedings of the 25th VLDB Conference, Edinburgh, Scotland, pp. 518, 1999.
[5]. Arvind Arasu, Venkatesh Ganti, Raghav Kaushik, Efficient Exact-Set Similarity Joins, VLDB '06, September 12-15, 2006, Seoul, Korea, VLDB Endowment, ACM 1-59593-385-9/06/09.
[6]. Daniel Karch, Dennis Luxen, Peter Sanders, Improved Fast Similarity Search in Dictionaries, presented at the 17th Symposium on String Processing and Information Retrieval, 2010.
[7]. Gerard Salton, A. Wong, and C. S. Yang, A vector space model for information retrieval, Communications of the ACM, 18(11):613–620, November 1975.
[8]. Harman D.K, Overview of the first Text Retrieval Conference (TREC-1), in Proceedings of the First Text REtrieval Conference (TREC-1), pages 1–20, NIST Special Publication 500-207, March 1993.
[9]. Kaushik Chakrabarti, Dong Xin, et al., An Efficient Filter for Approximate Membership Checking, SIGMOD '08, June 9–12, 2008, Vancouver, BC, Canada, 2008 ACM 9781605581026/08/06.
[10]. Koudas N., A. Marathe, D. Srivastava, Flexible String Matching Against Large Databases in Practice, in VLDB, pages 1078–1086, 2004.
[11]. Levenshtein V.I, Binary codes capable of correcting deletions, insertions and reversals, Sov. Phys. Dokl., 10:707–710, 1966.
[12]. Navarro.G, NR-grep: A Fast and Flexible Pattern Matching Tool, Technical Report TR/DCC-2000-3, University of Chile, Departamento de Ciencias de la Computación, Santiago, 2000, http://www.dcc.uchile.cl/gnavarro.
[13]. Thomas Bocek, Burkhard Stiller, et al., Fast Similarity Search in Large Dictionaries, University of Zurich, Department of Informatics (IFI), Binzmühlestrasse 14, CH-8050 Zürich, Switzerland, 2007.
[14]. Van Rijsbergen C.J, Information Retrieval, Butterworths, London, 1979.