1. A Novel and Efficient Approach for Near Duplicate Page Detection in Web Crawling
VIPIN K P (08103066), S7 CSE A
Guided by: Mr. Aneesh M Haneef, Asst. Professor, Department of CSE, MESCE
2. Presentation Outline
Introduction
What are near duplicates
Drawbacks of near duplicate pages
What is a Web crawler
Simplified Crawl Architecture
Near duplicate detection
Advantages
Conclusion
References
1/2/2012 2
3. Introduction
Search engines are the main gateways for accessing information on the Web.
A search engine operates in the following order:
Web crawling
Indexing
Searching
Web crawling is the process that creates the indexed repository used by the search engine.
The enormous number of documents on the Web poses huge challenges to search engines, making their results less relevant to users.
4. Introduction cont'd…
Web search engines face additional problems due to near duplicate web pages.
It is an important requirement for search engines to provide users with relevant results without duplication.
Near duplicate page detection is a challenging problem.
5. What are near duplicates?
Near duplicates are not "exact duplicates" but files with minute differences.
They differ slightly in advertisements, counters, timestamps, etc.
Most web sites also contain boilerplate code.
6. What are near duplicates?
http://shop.asus.co.uk/shop/gb/en-gb/home.aspx
7. What are near duplicates?
http://shop.asus.es/shop/gb/en-gb/home.aspx
8. Drawbacks of Near Duplicate Web Pages
Waste network bandwidth
Increase storage cost
Affect the quality of search indexes
Increase the load on the remote host serving such pages
Affect customer satisfaction
9. Web Crawler
A Web crawler is a computer program that browses
the World Wide Web in an orderly fashion.
Other terms for Web crawlers are ants, automatic indexers, bots, Web spiders and Web robots.
Search engines use web crawlers to create a copy of all the visited pages for later processing: the search engine indexes the downloaded pages to provide fast searches.
This indexed database is then used for the search process.
A crawler may examine a URL to check whether it ends with certain characters such as .html, .htm, .asp, .aspx, .php, .jsp, .jspx or a slash.
Some crawlers may also avoid requesting any resources that have a "?" in them.
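The URL checks described above can be sketched as follows. This is a minimal illustration, not code from the presentation; the function name and the exact suffix list are assumptions based on the slide.

```python
# Hypothetical sketch of the URL filtering a crawler may apply (per the slide):
# fetch only URLs ending in known static-page extensions or a slash, and skip
# URLs containing "?" (often dynamically generated resources).
CRAWLABLE_SUFFIXES = (".html", ".htm", ".asp", ".aspx",
                      ".php", ".jsp", ".jspx", "/")

def should_crawl(url: str) -> bool:
    """Return True if the URL looks like a crawlable static page."""
    if "?" in url:  # some crawlers avoid query-string resources
        return False
    return url.lower().endswith(CRAWLABLE_SUFFIXES)
```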
10. Simplified Crawl Architecture
[Diagram: the crawler traverses links on the Web and fetches HTML documents one at a time. Each newly crawled document is checked against the entire index for near duplicates: near duplicates are discarded (trash); all other documents are inserted into the index.]
11. Near Duplicate Detection
The steps involved in this approach are:
Web document parsing
Stemming algorithm
Keyword representation
Similarity score calculation
12. Near Duplicate Detection cont'd…
Web Document Parsing:
• It may be as simple as URL extraction or as complex as removing the HTML tags and JavaScript from a web page.
• Stop Word Removal
Remove commonly used words such as 'an', 'and', 'the', 'to', 'with', 'by', 'for', etc. This helps to reduce the size of the indexing file.
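The parsing and stop-word-removal steps above can be sketched as below. This is an assumed minimal implementation, not from the slides; the function name and the exact stop-word set are illustrative.

```python
import re

# Illustrative stop-word set based on the slide's examples.
STOP_WORDS = {"an", "and", "the", "to", "with", "by", "for", "a", "of", "in"}

def parse_page(html: str) -> list[str]:
    """Strip scripts and HTML tags, then drop stop words (a rough sketch)."""
    text = re.sub(r"<script.*?</script>", " ", html, flags=re.S | re.I)  # drop scripts
    text = re.sub(r"<[^>]+>", " ", text)                                 # drop tags
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]
```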
13. Near Duplicate Detection cont'd…
Stemming Algorithm:
• Stemming is the process of reducing derived words to their stem, base or root form (generally a written word form).
• The relation between a query and a document is determined by the number and frequency of the terms they have in common.
• Affix removal algorithms remove suffixes and/or prefixes from terms, leaving a stem.
e.g., "connect", "connected" and "connecting" are all condensed to "connect".
14. Near Duplicate Detection cont'd…
Stemming Algorithm cont'd…
• The prefix removal algorithm removes: anti, bi, co, contra, de, di, des, en, inter, intra, mini, multi, pre, pro
• The suffix removal algorithm removes: ly, ness, ioc, iez, able, ance, ary, ce, y, dom, ee, eer, ence, ory, o
• The derivations are converted to their stems, which are related to the originals in both form and semantics.
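A toy suffix-removal stemmer in the spirit of the slide can be sketched as below. The suffix list is a readable subset of the slide's, extended with "ing" and "ed" (an assumption, needed for the slide's own connect/connected/connecting example); real systems would use an established algorithm such as Porter's or Lovins'.

```python
# Subset of the slide's suffixes, plus "ing"/"ed" (assumed additions) so the
# slide's example words stem correctly.
SUFFIXES = ("ness", "able", "ance", "ence", "eer", "dom",
            "ary", "ory", "ing", "ed", "ly", "ee", "ce", "y")

def stem(word: str) -> str:
    """Strip the longest matching suffix, keeping a stem of >= 4 letters."""
    for s in sorted(SUFFIXES, key=len, reverse=True):  # longest suffix first
        if word.endswith(s) and len(word) - len(s) >= 4:
            return word[: -len(s)]
    return word
```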
15. Near Duplicate Detection cont'd…
Keyword Representation:
• Keywords and their counts in each crawled page are the result of stemming.
• Keywords are sorted in descending order of their counts.
• Keywords with the highest counts, called prime keywords, are stored in one table; the remaining keywords are indexed and stored in another table.
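The keyword-representation step can be sketched as follows. The split point between prime and remaining keywords (top 10 here) is an assumption; the slides do not specify it.

```python
from collections import Counter

def keyword_tables(keywords: list[str], n_prime: int = 10):
    """Count keywords, sort by descending count, split into prime/remaining."""
    counts = Counter(keywords)
    ranked = sorted(counts.items(), key=lambda kc: kc[1], reverse=True)
    prime = dict(ranked[:n_prime])   # highest-count "prime" keywords
    rest = dict(ranked[n_prime:])    # remaining keywords, indexed separately
    return prime, rest
```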
16. Near Duplicate Detection cont'd…
Similarity Score Calculation:
• If the prime keywords of the new web page do not match the prime keywords of any page in the table, the new page is added to the repository.
• If all the keywords of both pages are the same, the new page is a duplicate.
• If the prime keywords of both pages are the same, the similarity score (SSM) is calculated as follows.
17. Near Duplicate Detection cont'd…
T1 (a web page in the repository): keywords K1, K2, …, Kn with counts C1, C2, …, Cn
T2 (the new web page): keywords K1, K2, …, Kn with counts C1, C2, …, Cn
If a keyword ki is present in both tables, then
a = Δ[ki]T1 (the count of ki in T1)
b = Δ[ki]T2 (the count of ki in T2)
and the per-keyword score is
SDC = log(count(a)/count(b)) × Abs(1 + (a − b))
18. Near Duplicate Detection cont'd…
• If keywords are present in T1 but not in T2, and the number of such keywords is NT1, then
SDT1 = log(count(a)) × Abs(1 + |T2|)
• If keywords are present in T2 but not in T1, and the number of such keywords is NT2, then
SDT2 = log(count(b)) × Abs(1 + |T1|)
• The similarity score of one page against another is calculated by
SSM = ( Σ(i=1..|NC|) SDC + Σ(i=1..|NT1|) SDT1 + Σ(i=1..|NT2|) SDT2 ) / N
where N = (|T1| + |T2|)/2
19. Near Duplicate Detection cont'd…
• Web documents with a similarity score greater than a predefined threshold are considered near duplicates.
• These near-duplicate pages are not added to the search engine's repository.
20. Advantages
• Save the network bandwidth
• Reduce storage cost of search engines
• Improve the quality of search index
21. Conclusion
• The proposed method addresses difficulties of information retrieval from the web.
• The approach detects near duplicate web pages efficiently, based on the keywords extracted from the pages.
• It reduces the memory space needed for web repositories.
• Near duplicate detection improves search engine quality.
22. References
• Brin, S., Davis, J. and Garcia-Molina, H. (1995) "Copy Detection Mechanisms for Digital Documents", Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD 1995), ACM Press.
• Pandey, S. and Olston, C. (2005) "User-centric Web Crawling", Proceedings of the 14th International Conference on World Wide Web, pp. 401–411.
• Xiao, C., Wang, W., Lin, X. and Xu Yu, J. (2008) "Efficient Similarity Joins for Near Duplicate Detection", Proceedings of the 17th International Conference on World Wide Web, pp. 131–140.
• Lovins, J.B. (1968) "Development of a Stemming Algorithm", Mechanical Translation and Computational Linguistics.