Abstract:
In this age of global interconnectivity, the Internet and electronic communication media have
become essential. A number of applications are available for utilizing the resources on the
Internet, and among them search engines are the most frequently used. A search engine enables
us to locate required information on the web across different web databases and repositories.
Although the Internet can be called a huge repository of information, most of this information
is unevenly distributed and is available in both structured and unstructured formats. Such
diverse formats pose a major obstacle for existing search techniques. Addressing this obstacle
is the foremost challenge in improving the relevance of search results to the user query.
Two major contributions are proposed for optimizing the performance of existing search
techniques:
1. Construction of named schema matching and use of schema structures
2. A strategy to narrow down the search space so that only a limited set of relevant
documents is listed
The proposed schema matching techniques identify meaningful objects and essential features
of data in both kinds of formats. This reduces the effort users spend on obtaining relevant
data in the results. Two different approaches, one for structured and one for unstructured
data sources, are therefore implemented using the schema matching technique. Processing
unstructured data requires a Wrapper Generation process, which obtains a common data format
from different data sources. To extract the data, this process also implements a query engine
which estimates the relevance of data from the target sources. Finally, the named entities are
used to prepare mappings between semantically equivalent attributes, which transform data from
the source to the target data source during data retrieval.
The proposed techniques are implemented as interactive simulations that work with more than
one data source at the same time. The performance of the system is then measured in terms of
precision, recall and F-measure. The experimental results show that the approach is effective
and accurate for the estimated parameters and that it also improves the time and space
complexity of information retrieval systems.
1. Introduction
1.1 Motivation
The World Wide Web (WWW) is an ocean of information that is multiplying at a
rapid rate. Over the last couple of years it has turned into an enormous platform for billions
of people [1]. It is a platform for buying and selling; for teaching and learning; and for
uploading and downloading an array of information, facts and data from all over the world. It
has become a hub for performing transactions over web platforms such as eBay (www.ebay.com),
Amazon (www.amazon.com) and Future Shop (www.futureshop.ca), which increasingly utilize
advanced technologies from schema matching, the semantic web and web services. Ever since the
WWW came into existence, one question has arisen in researchers' minds: “How can one quickly
and accurately find the information one is looking for on the Internet?”
From a broader perspective, information finding is part of the learning process through which
humans enlarge their knowledge and intelligence [4, 7]. A huge amount of raw data and links is
available in web databases. Raw data cannot itself respond to any query, but information
mined from raw data can provide adequate responses to queries such as when, where,
what, and who. Many smart tools (such as directories, search engines, and web portals) are
available for information finding, and they have been continuously improved and successfully
deployed. Still, researchers continue to look for novel, more intelligent and faster ways to
search for information.
On the Internet, a huge amount of web data is available to users. This web data can be
classified into the following classes:
1. Content data: the useful information in web pages, found alongside unrelated content
(e.g. text, images, audio, etc.).
2. Structure data: the hyperlink structure of the web, used as an (additional) source of
information.
3. Usage data: data about users and their exploration of a web site. It includes IP
addresses, dates, times, navigated URLs, and others.
On the web, content data is available in structured and unstructured formats:
unstructured data resides as free text in HTML pages, and structured data resides in
databases and knowledge bases. Unstructured data is easily accessed as human-readable
text in a browser, while structured data is hidden behind web forms, web services, and custom
database APIs. To provide relevant information to users, we need to structure this
unstructured data.
To find data that is available on the web as unstructured text, IR (Information Retrieval) and
IE (Information Extraction) techniques are used. Information Extraction extracts targeted
information, i.e. events, entities or relationships, from unstructured data sources.
Information Extraction has been used successfully in new organizations and in domain-specific
areas. Web-based information extraction has primarily focused on utilizing structured and
semi-structured text (e.g., [57, 5, 105]).
The search engine, on the other hand, is one of the IR tools for exploring the large amount of
information in web data sources. It is designed for information discovery on the WWW, inside a
closed or group network, or on a personal computer. Although it helps in information retrieval,
some issues still remain to be fixed. An existing search system is implemented with three
different modules.
Fig 1 shows the architecture of the existing search system. First, the user enters a query in
the query interface, which lets the user express requirements in the form of an input query
and submit it for searching the web database. In the search methodology, the system interprets
the input query and then performs the search operation on the available data. The generated
search results are sorted or ranked so that the most relevant outcomes are presented to the end
user. Sometimes, however, the system also returns a few irrelevant results, which may be caused
by an insufficient query or by the semantic gap between the query keywords and the database
knowledge.
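The three stages described above (query interface, search methodology, ranking) can be sketched in a few lines of Python. This is only an illustrative sketch under simple assumptions, not the architecture of any particular search engine: the function names, the keyword-count scoring and the sample documents are all hypothetical.

```python
from collections import Counter


def parse_query(raw_query):
    """Query interface (assumed): turn the raw input into lowercase keywords."""
    return raw_query.lower().split()


def search(keywords, documents):
    """Search methodology (assumed): score documents by keyword occurrences."""
    scores = {}
    for doc_id, text in documents.items():
        counts = Counter(text.lower().split())
        score = sum(counts[keyword] for keyword in keywords)
        if score > 0:
            scores[doc_id] = score
    return scores


def rank(scores):
    """Ranking: order document ids by descending score, ties broken by id."""
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


def run_search(raw_query, documents):
    return rank(search(parse_query(raw_query), documents))


# Hypothetical mini collection used only for illustration.
documents = {
    "d1": "harry potter and the goblet of fire",
    "d2": "a potter shapes clay on a wheel",
    "d3": "cooking recipes for beginners",
}
print(run_search("Harry Potter", documents))
```

With this purely keyword-based scoring, the document about a clay potter is also returned for the query "Harry Potter", which is a small instance of the semantic gap between query keywords and the stored data.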
Fig 1: Existing System Architecture of Search Engine
(components: User query, Query Interface, Search Methodology, DB, File System, Web, Ranking Result)
Search engines have become very popular and useful for searching data in recent years, but
users face many problems when data is not retrieved in accurate form. The search result
contains many web pages or bulky data, so users spend unnecessary time finding accurate
content in the available results. Surveys indicate that almost 25% of web searchers are
unable to find useful results in the first set of data returned [6]. These problems fall into
two broad categories:
(1) Textual or syntactic issues. Syntactic problems relate to the structure of the query
rather than to its meaning. They concern the input query placed for search, such as the
query representation and the keywords used. Suppose a user fires a query on the web and an
accurate result is not obtained because that particular query is technically not related
to the data on the Internet. The basic reason is that the user does not know the
structure of the data or the keywords associated with it.
(2) Semantic issues. Semantic problems relate to the meaning of data. They occur when there
is a discrepancy in the meaning, interpretation or use of the keywords that represent the
actual meaning of the required data. This phenomenon is also known as semantic deviation.
As the deviation increases, the probability of error in searching also increases, so users
try to minimize it to get accurate results. To minimize semantic deviation, researchers
focus on the following two approaches:
• To design intelligent tools that can accept queries from users, analyze the meaning of a
query, and behave like a human in resolving it.
• To develop a way of organizing data such that the significance of the data is provided
to the user explicitly.
Researchers continue to look for novel methods that make information search more intelligent
and faster. We follow the first approach, developing an intelligent tool that minimizes
semantic deviation and tries to find accurate results.
In the hidden web, it is very difficult to find the exact data object in web sources. Many
researchers agree on one point: the major obstacle in semantic integration is the schema
matching problem. In this setting, the web contains two different schemas, and each schema
contains instance data (data objects) [14]. Instance data is transformed from the source to
the target when schema matching techniques are applied. In the schema matching process, the
system takes two input schemas, each consisting of a set of entities (e.g., tables, XML
elements, classes, properties, rules, predicates), and outputs the relationships (called a
mapping) between these entities. Matching techniques are important in many applications, such
as ontology integration, data integration, and data warehousing. These applications can be
distinguished by the data models they use, which are analyzed and matched either
manually or semi-automatically.
As figure 2 shows, the information can be divided into two classes: input and output. The
input schemas provide information such as element names, data types, descriptions, constraints
and so on. This information characterizes the content and semantics of the schema
elements. The match operation produces an output called the match result or mapping. A
mapping is defined as a set of mapping elements, each of which specifies that certain
elements of one input schema correspond to certain elements of the other.
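To make the input and output of the match operation concrete, the following is a minimal sketch of a name-based matcher. It is not the matcher proposed in this work: the schemas, the function names, the use of `difflib.SequenceMatcher` as the similarity measure and the threshold of 0.6 are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """String similarity in [0, 1] between two element names."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_schemas(source, target, threshold=0.6):
    """Return a mapping as a list of (source_element, target_element, score).

    Each schema is a dict of element name -> data type. Only elements with
    the same data type are compared (an assumed, simple constraint check).
    """
    mapping = []
    for s_name, s_type in source.items():
        best = None
        for t_name, t_type in target.items():
            if s_type != t_type:
                continue
            score = name_similarity(s_name, t_name)
            if score >= threshold and (best is None or score > best[2]):
                best = (s_name, t_name, score)
        if best is not None:
            mapping.append(best)
    return mapping


# Hypothetical input schemas.
source_schema = {"title": "string", "price": "float", "author_name": "string"}
target_schema = {"book_title": "string", "cost": "float", "author": "string"}

for s_name, t_name, score in match_schemas(source_schema, target_schema):
    print(f"{s_name} -> {t_name} ({score:.2f})")
```

In this example the pair price/cost is not found, because the two names share almost no characters although they mean the same thing. Purely name-based matching therefore misses semantically equivalent elements, which is the kind of gap that instance data can help to close.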
Ontology and schema matching is a classical domain of research, and several approaches and
tools are available, some of them automatic and some semi-automatic, but
these methods do not provide satisfactory results. Therefore, a new, more sophisticated
approach is required for automatically matching the instance data of
applications.
Problems arise due to semantic heterogeneity, i.e. dissimilarity in the meaning of
schema elements. From the available literature we observe three major points about Web
databases. First, improper queries often cause search failure or return no results. Second,
when a proper query that returns a result page is submitted through the input elements of
a Web database, its keywords are very likely to reappear in the corresponding attributes of
the returned results. For example, when we submit the query “Harry
Potter” through the “Title” element, the three returned book instances all contain the query
keywords (i.e., “Harry Potter”) in their Title attribute. Third, there is an underlying target
schema for related Web databases in the same domain (proposed and verified in [3, 4]).
However, most existing systems, including those that use auxiliary information [3, 4],
iMAP [9], LSD [13], corpus-based schema matching [10], SCROL [12], CUPID [11], COMA [1] and
COMA++ [2], produce similarity scores between schema elements, which results in discovering
only simple (one-to-one) matches. Such results solve the schema matching problem only
partially.
Fig 2: Schema Matching (Input, Schema Matching, Output)
To solve the problem completely, the matching system should discover complex
matches as well as simple ones. Little work has addressed the discovery of complex
matches [3, 4], because finding complex matches is considerably harder than
discovering simple ones. All of these are schema matching techniques that
address the issues above in different ways, bridging the semantic
gap between the user query and the database knowledge. Instance-based schema matching is a
more efficient method of schema matching, which enhances the search outcome and provides more
accurate results [1].
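The idea behind instance-based matching can be illustrated with the observation that query keywords tend to reappear in the corresponding attribute of the returned instances. The sketch below is a simplified illustration under that assumption, not the matcher used in this work; the function name, the attribute names and the sample records are hypothetical.

```python
def match_input_to_attribute(query, records):
    """Return the result attribute that most often contains the query text.

    `records` is a list of dicts (attribute name -> value) returned for a
    probe query submitted through one input element. Returns None when the
    query text does not reappear in any attribute.
    """
    needle = query.lower()
    hits = {}
    for record in records:
        for attribute, value in record.items():
            if needle in str(value).lower():
                hits[attribute] = hits.get(attribute, 0) + 1
    if not hits:
        return None
    return max(hits, key=hits.get)


# Hypothetical instances returned after submitting "Harry Potter"
# through the "Title" input element of a book database.
returned_books = [
    {"book_name": "Harry Potter and the Chamber of Secrets",
     "writer": "J. K. Rowling", "notes": ""},
    {"book_name": "Harry Potter and the Goblet of Fire",
     "writer": "J. K. Rowling", "notes": "Fourth Harry Potter novel"},
    {"book_name": "Harry Potter and the Half-Blood Prince",
     "writer": "J. K. Rowling", "notes": ""},
]

print(match_input_to_attribute("Harry Potter", returned_books))
```

Here the input element "Title" would be matched to the result attribute `book_name`, even though the two names differ, because all three instances carry the query keywords in that attribute.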
This work presents data search over unstructured and structured databases.
The proposed approach describes how structured and unstructured data is
processed by instance-based schema matching. It includes components such as
Wrapper Generation, Query Engine and Schema Mapping. The implementation of the
system is organized into two major modules. The first is the query interface, in which
qualified input elements are located by element identification. After query submission, the
result set is collected from heterogeneous formats.
During the search process, wrapper generation [8] supports the collection of heterogeneous
information from web pages and converts it into a general model that can easily be recognized
as a common schema format. This common format is used as input to the query engine for the
query optimization process. The query engine implements instance-based matchers, which include
five components: Similarity Matcher, Tokenizer, Formal Ontology, Instance Recognition
Process and Annotation Generation Process.
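As a rough illustration of the wrapper step, the sketch below converts two heterogeneous sources (an HTML table and a JSON document) into one common record format and then tokenizes a record for later matching. It is a sketch under stated assumptions and not the wrapper generator of [8]: the class and function names, the assumption that the first table row is a header, and the sample inputs are all hypothetical.

```python
import json
from html.parser import HTMLParser


class TableWrapper(HTMLParser):
    """Collects the cell text of every <tr> row of an HTML table."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._in_cell:
            self._in_cell = False
            self._row[-1] = self._row[-1].strip()
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def wrap_html(html):
    """Unstructured source: first table row is assumed to be the header."""
    parser = TableWrapper()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


def wrap_json(text):
    """Structured source: a JSON array of objects."""
    return [dict(item) for item in json.loads(text)]


def tokenize(record):
    """Tokenizer (assumed behavior): sorted lowercase words of all values."""
    return sorted({tok for value in record.values()
                   for tok in str(value).lower().split()})


html_source = ("<table><tr><th>title</th><th>price</th></tr>"
               "<tr><td>Harry Potter</td><td>10</td></tr></table>")
json_source = '[{"title": "Dune", "price": "12"}]'

common = wrap_html(html_source) + wrap_json(json_source)
print(common)
print(tokenize(common[0]))
```

Both sources end up as lists of attribute/value records, so the later stages only need to handle one format.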
Through all these operations, search results with semantic meaning are preserved and
meaningless information is eliminated. The combined outcome of the query engine is then
reconciled through various mapping processes. After the mapping process, accurate search
results are reported according to the end user's query.