A Novel And Efficient Approach
For Near Duplicate Page
Detection In Web Crawling

VIPIN KP
08103066
S7 CSE A

Guided by: Mr. Aneesh M Haneef
Asst. Professor, Department of CSE, MESCE

Presentation Outline
   Introduction
   What are near duplicates
   Drawbacks of near duplicate pages
   What is a Web crawler
   Simplified Crawl Architecture
   Near duplicate detection
   Advantages
   Conclusion
   Reference

                         1/2/2012       2
Introduction
• Search engines are the main gateways to information on the web.
• A search engine operates in the following order:
   1. Web crawling
   2. Indexing
   3. Searching
• Web crawling is the process that creates the indexed repository used by search engines.
• The sheer volume of documents on the web poses a major challenge to search engines and makes their results less relevant to the user.

Introduction cont'd…
• Web search engines face additional problems due to near-duplicate web pages.
• An important requirement for search engines is to provide users with relevant results without duplication.
• Near-duplicate page detection is a challenging problem.

What are near duplicates?
• Near duplicates are not "exact duplicates" but files with minute differences.
• They differ slightly in advertisements, counters, timestamps, etc.
• Most web sites contain boilerplate code.

What are near duplicates? (example)
[Screenshots of two near-duplicate pages:]
   http://shop.asus.co.uk/shop/gb/en-gb/home.aspx
   http://shop.asus.es/shop/gb/en-gb/home.aspx

Drawbacks of Near Duplicate Web Pages
• Waste network bandwidth
• Increase storage cost
• Degrade the quality of search indexes
• Increase the load on the remote host that serves such web pages
• Lower customer satisfaction

Web Crawler
• A Web crawler is a computer program that browses the World Wide Web in an orderly fashion.
• Other terms for Web crawlers are ants, automatic indexers, bots, Web spiders and Web robots.
• Search engines use Web crawlers to create a copy of all visited pages; the search engine later indexes the downloaded pages to provide fast searches.
• This indexed database is then used for the searching process.
• A crawler may examine a URL and request it only if it ends with certain characters, such as .html, .htm, .asp, .aspx, .php, .jsp, .jspx or a slash.
• Some crawlers may also avoid requesting any resources that have a "?" in them.

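A minimal sketch of the URL checks described on this slide. The function name `should_request` and the `skip_query_urls` flag are illustrative assumptions, not part of the original presentation; real crawlers apply many more rules.

```python
from urllib.parse import urlparse

# Suffixes listed on the slide; a trailing slash is accepted as well.
PAGE_SUFFIXES = (".html", ".htm", ".asp", ".aspx", ".php", ".jsp", ".jspx", "/")


def should_request(url: str, skip_query_urls: bool = True) -> bool:
    """Return True if a crawler using the slide's heuristics would fetch `url`."""
    # Some crawlers avoid any resource with a "?" in it.
    if skip_query_urls and "?" in url:
        return False
    # An empty path (bare host name) is treated as "/" (assumption).
    path = urlparse(url).path or "/"
    return path.lower().endswith(PAGE_SUFFIXES)
```

For example, `should_request("http://shop.asus.co.uk/shop/gb/en-gb/home.aspx")` accepts the page, while a URL ending in `.png` or containing a query string is skipped.
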
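Simplified Crawl Architecture
[Diagram: the crawler traverses links on the Web and fetches one HTML document at a time. Each newly crawled document is compared against the entire Web index in a "Near-duplicate?" check. A near-duplicate goes to trash; otherwise the document is inserted into the Web index.]
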
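The flow in the diagram can be sketched as a small crawl loop. This is only an illustration under stated assumptions: `fetch`, `extract_links` and `is_near_duplicate` are hypothetical caller-supplied functions, and following links from discarded pages is a choice made here, since the diagram does not say.

```python
from collections import deque


def crawl(seed, fetch, extract_links, is_near_duplicate):
    """Breadth-first crawl that keeps only documents that are not near-duplicates."""
    index = {}            # url -> document kept in the Web index
    trash = []            # urls discarded as near-duplicates
    seen = {seed}
    queue = deque([seed])
    while queue:
        url = queue.popleft()
        doc = fetch(url)                      # one newly crawled document
        if doc is None:
            continue
        if is_near_duplicate(doc, index.values()):
            trash.append(url)                 # near-duplicate: discard
        else:
            index[url] = doc                  # otherwise insert into the index
        for link in extract_links(doc):       # traverse links
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return index, trash
```

Passing the three helpers as arguments keeps the loop independent of how pages are downloaded or compared, so it can be tried on an in-memory dictionary of pages.
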
Near Duplicate Detection
The steps involved in this approach are:
• Web document parsing
• Stemming algorithm
• Keyword representation
• Similarity score calculation

Near Duplicate Detection cont'd…
Web Document Parsing:

• Parsing may be as simple as URL extraction or as complex as removing the HTML tags and JavaScript from a web page.

• Stop word removal: remove commonly used words such as 'an', 'and', 'the', 'to', 'with', 'by', 'for', etc. This helps to reduce the size of the indexing file.

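A minimal sketch of the parsing step, assuming Python's standard `html.parser` is an acceptable stand-in for whatever parser the authors used. The helper names and the letters-only tokenization are illustrative assumptions; the stop-word set contains only the words named on the slide.

```python
import re
from html.parser import HTMLParser

STOP_WORDS = {"an", "and", "the", "to", "with", "by", "for"}  # from the slide


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def parse_page(html: str) -> list[str]:
    """Strip tags and scripts, lowercase, tokenize, and drop stop words."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    words = re.findall(r"[a-z]+", " ".join(parser.chunks).lower())
    return [w for w in words if w not in STOP_WORDS]
```
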
Near Duplicate Detection cont'd…
Stemming Algorithm:

• Stemming is the process of reducing derived words to their stem, base or root form, generally a written word form.
• The relation between a query and a document is determined by the number and frequency of the terms they have in common.
• Affix removal algorithms remove suffixes and/or prefixes from terms, leaving a stem.
   e.g. "connect", "connected" and "connecting" are all condensed to "connect".

Near Duplicate Detection cont'd…
Stemming Algorithm cont'd…
• The prefix removal algorithm removes:
   anti, bi, co, contra, de, di, des, en, inter, intra, mini, multi, pre, pro

• The suffix removal algorithm removes:
   ly, ness, ioc, iez, able, ance, ary, ce, y, dom, ee, eer, ence, ory, o

• The derivations are converted to their stems, which are related to the original in both form and semantics.

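A naive affix-removal sketch using the lists above. Several things here are assumptions rather than the authors' method: the suffix list appears cut off in the source, so the trailing "o" is left out and "ed" and "ing" are added (as `EXTRA_SUFFIXES`) only to reproduce the "connect" example; the longest-match rule and the `min_stem` guard are also illustrative. Prefix stripping is off by default because blind removal over-stems (for instance, "co" would be stripped from "connect").

```python
PREFIXES = ("anti", "bi", "co", "contra", "de", "di", "des", "en",
            "inter", "intra", "mini", "multi", "pre", "pro")
SUFFIXES = ("ly", "ness", "ioc", "iez", "able", "ance", "ary", "ce",
            "y", "dom", "ee", "eer", "ence", "ory")
EXTRA_SUFFIXES = ("ed", "ing")  # assumption: not in the (truncated) slide list


def strip_affixes(word, prefixes=(), suffixes=SUFFIXES + EXTRA_SUFFIXES, min_stem=3):
    """Remove at most one prefix and one suffix, trying longer affixes first."""
    stem = word.lower()
    for p in sorted(prefixes, key=len, reverse=True):
        if stem.startswith(p) and len(stem) - len(p) >= min_stem:
            stem = stem[len(p):]
            break
    for s in sorted(suffixes, key=len, reverse=True):
        if stem.endswith(s) and len(stem) - len(s) >= min_stem:
            stem = stem[:-len(s)]
            break
    return stem
```

With these settings, "connect", "connected" and "connecting" all map to "connect", and `strip_affixes("multimedia", prefixes=PREFIXES)` gives "media".
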
Near Duplicate Detection cont'd…
Keyword Representation:

• Stemming yields the keywords of each crawled page together with their counts.

• The keywords are sorted in descending order of count.

• The keywords with the highest counts are called prime keywords and are stored in one table; the remaining keywords are indexed and stored in another table.

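A short sketch of the keyword representation. The slides do not say how many keywords count as prime, so `prime_count` is an assumed parameter, and the alphabetical tie-break is added only to make the ordering deterministic.

```python
from collections import Counter


def keyword_tables(stems, prime_count=3):
    """Return (prime_table, remaining_table) as lists of (keyword, count) pairs."""
    counts = Counter(stems)
    # Descending by count; ties broken alphabetically (assumption).
    ranked = sorted(counts.items(), key=lambda kc: (-kc[1], kc[0]))
    return ranked[:prime_count], ranked[prime_count:]
```
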
Near Duplicate Detection cont'd…
Similarity score calculation:
• If the prime keywords of the new web page do not match the prime keywords of any page in the table, the new page is added to the repository.

• If all the keywords of both pages are the same, the new page is a duplicate.

• If the prime keywords of both pages are the same, the similarity score (SSM) is calculated as follows.

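The three cases above can be sketched as a pre-check that runs before any score is computed. The slide leaves two points open, so this sketch makes assumptions: prime keywords "match" when the two keyword sets are equal, and "all keywords are the same" means identical keywords with identical counts. The function name and return labels are hypothetical.

```python
def precheck(new_prime, new_all, old_prime, old_all):
    """Each argument is a dict mapping keyword -> count for one page."""
    if set(new_prime) != set(old_prime):
        return "new"         # prime keywords differ: add the page to the repository
    if new_all == old_all:
        return "duplicate"   # every keyword and count agrees
    return "score"           # prime keywords agree: compute the similarity score
```
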
Near Duplicate Detection cont'd…

Table of a web page in the repository, containing keywords and counts:

   Keyword   K1   K2   …   Kn
   Count     C1   C2   …   Cn

Table of the new web page, containing keywords and counts:

   Keyword   K1   K2   …   Kn
   Count     C1   C2   …   Cn

• If a keyword K_i is present in both tables (T1 and T2), let

$$a = \Delta[K_i]_{T_1}, \qquad b = \Delta[K_i]_{T_2}$$

and use the formula

$$S_{DC} = \log\!\left(\frac{\mathrm{count}(a)}{\mathrm{count}(b)}\right) \cdot \mathrm{Abs}\bigl(1 + (a - b)\bigr)$$

Near Duplicate Detection cont'd…
• If a keyword is present in T1 but not in T2, and the number of such keywords is N_{T1}, then

$$S_{DT_1} = \log(\mathrm{count}(a)) \cdot \mathrm{Abs}(1 + |T_2|)$$

• If a keyword is present in T2 but not in T1, and the number of such keywords is N_{T2}, then

$$S_{DT_2} = \log(\mathrm{count}(b)) \cdot \mathrm{Abs}(1 + |T_1|)$$

• The similarity score of one page against another is calculated as

$$SSM = \frac{\sum_{i=1}^{|N_C|} S_{DC} + \sum_{i=1}^{|N_{T_1}|} S_{DT_1} + \sum_{i=1}^{|N_{T_2}|} S_{DT_2}}{N}, \qquad N = \frac{|T_1| + |T_2|}{2}$$

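A sketch of the score computation under one reading of the formulas. The slides do not define Δ or the log base, so the following are assumptions: Δ[K]T is the 1-based position of keyword K in table T (tables sorted by descending count), count(a) and count(b) are that keyword's counts in T1 and T2, and the logarithm is natural. The function name `ssm` is illustrative.

```python
import math


def ssm(t1, t2):
    """t1, t2: lists of (keyword, count) pairs, sorted by descending count."""
    if not t1 and not t2:
        raise ValueError("both keyword tables are empty")
    pos1 = {k: i for i, (k, _) in enumerate(t1, start=1)}  # assumed meaning of Delta
    pos2 = {k: i for i, (k, _) in enumerate(t2, start=1)}
    c1, c2 = dict(t1), dict(t2)
    total = 0.0
    for k in c1.keys() | c2.keys():
        if k in c1 and k in c2:      # keyword in both tables: S_DC
            a, b = pos1[k], pos2[k]
            total += math.log(c1[k] / c2[k]) * abs(1 + (a - b))
        elif k in c1:                # only in T1: S_DT1
            total += math.log(c1[k]) * abs(1 + len(t2))
        else:                        # only in T2: S_DT2
            total += math.log(c2[k]) * abs(1 + len(t1))
    n = (len(t1) + len(t2)) / 2
    return total / n
```

As a worked case, with T1 = [("web", 4), ("crawl", 2)] and T2 = [("web", 2), ("crawl", 2), ("ad", 1)], only "web" contributes (log(4/2) · 1), so the score is log(2) / 2.5.
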
Near Duplicate Detection cont'd…
• Web documents with a similarity score greater than a predefined threshold are considered near duplicates.

• These near-duplicate pages are not added to the search engine's repository.

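The threshold rule, as stated on this slide, can be written in a few lines. The threshold value itself is not given in the presentation, so it is left as a parameter; `should_store` is a hypothetical helper name.

```python
def should_store(scores, threshold):
    """scores: similarity scores of the new page against the repository pages
    it was compared with. The page is stored only if no score exceeds the
    threshold, i.e. it is not a near duplicate of any of them."""
    return not any(score > threshold for score in scores)
```
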
Advantages
• Saves network bandwidth

• Reduces the storage cost of search engines

• Improves the quality of the search index

Conclusion
• The proposed method addresses the difficulties of information retrieval from the web.

• The approach detects near-duplicate web pages efficiently, based on the keywords extracted from the web pages.

• It reduces the memory space needed for web repositories.

• Near-duplicate detection increases search engine quality.

Reference
• Brin, S., Davis, J. and Garcia-Molina, H. (1995) "Copy detection mechanisms for digital documents", Proceedings of the Special Interest Group on Management of Data (SIGMOD 1995), ACM Press.

• Pandey, S. and Olston, C. (2005) "User-centric Web crawling", Proceedings of the 14th International Conference on World Wide Web, pp. 401 - 41

• Xiao, C., Wang, W., Lin, X. and Xu Yu, J. (2008) "Efficient Similarity Joins for Near Duplicate Detection", Proceedings of the 17th International Conference on World Wide Web, pp. 131-140.

• Lovins, J.B. (1968) "Development of a stemming algorithm", Mechanical Translation and Computational Linguistics.

Questions

Thank you
