SlideShare a Scribd company logo
1 of 5
Download to read offline
ISSN: 2278 – 1323
                                          International Journal of Advanced Research in Computer Engineering & Technology
                                                                                              Volume 1, Issue 4, June 2012



           Realizing Peer-to-Peer and Distributed Web
                             Crawler
                                  Anup A Garje, Prof. Bhavesh Patel, Dr. B. B. Meshram


                                                                        implemented as a queue). URLs from the frontier are
   Abstract—The tremendous growth of the World Wide Web                  recursively visited according to a set of policies.
has made tools such as search engines and information retrieval                    We present several issues to take into account when
systems have become essential. In this dissertation, we propose          crawling the Web. They lead to the fact that at design time of
a fully distributed, peer-to-peer architecture for web crawling.         a crawl, its intention needs to be fixed. The intention is
The main goal behind the development of such a system is to              defined by the goal that a specific crawl is targeted at; this can
provide an alternative but efficient, easily implementable and a
decentralized system for crawling, indexing, caching and
                                                                         differ in terms of crawl length, crawl intervals, crawl scope,
querying web pages. The main function of a webcrawler is to              etc. A major issue is that the Web is not static, but rather
recursively visit web pages, extract all URLs form the page,             dynamic and thus changes on the timescale of days, hours,
parse the page for keywords and visit the extracted URLs                 minutes. There are billions of documents available on the
recursively. We propose an architecture that can be easily               Web and crawling all data and furthermore maintaining a
implemeneted on a local (campus) network and which follows a             good freshness of the data becomes almost impossible. To
fully distributed, peer-to-peer architecture. The architecture           always keep the crawled data up-to-date we would need to
specifications, implementation details, requirements to be met           continuously crawl the Web, revisiting all pages we have
and analysis of such a system is discussed.                              once crawled. Whether we want to do this depends on the
                                                                         earlier mentioned crawling intention. Such an intention can
                                                                         for example be that we want to cover a preferably big part of
    Index Terms— Peer to peer, Distributed,Crawling,indexing
                                                                         the Web, crawl the Web for news on one topic or monitor one
                                                                         specific Web site for changes.
                                                                                    We discuss two different crawling strategies that
                        I. INTRODUCTION
                                                                         are related to the purpose of a crawl: incremental and
           Web crawlers download large quantities of data and            snapshot crawling. The strategies can be identified by
browse documents by passing from one hypertext link to                   different frontier growth behavior.
another. A web crawler (also known as a web spider or web                          In a snapshot strategy the crawler visits a URL only
robot) is a program or automated script which browses the                once; if the same URL is discovered again it is considered as
World Wide Web in a methodical, automated manner. This                   duplicate and discarded. Using this strategy the frontier is
process is called web crawling or spidering. Many sites, in              extended continuously with only new URLs and a crawl can
particular search engines, use spidering as a means of                   spread quite fast. This strategy is optimal if you want to e.g.
providing up-to-date data.                                               cover an either big or specific part of the Web once, or in
           Web crawlers are mainly used to create a copy of all          regular intervals. The incremental crawling strategy is
the visited pages for later processing by a search engine that           optimal for recurring continuous crawls with a limited scope;
will index the downloaded pages to provide fast searches.                when an already visited URL is rediscovered it is not rejected
Crawlers can also be used for automating maintenance tasks               but instead put into the frontier again. Using this strategy the
on a website, such as checking links or validating HTML                  frontier queues will never empty and a crawl could go on for
code. Also, crawlers can be used to gather specific types of             an indefinite long time. This strategy is optimal for
information from Web pages, such as harvesting e-mail                    monitoring a specific part of the Web for changes. Following
addresses (usually for spam). A web crawler is one type of               the crawling research field and relevant literature, we
bot, or software agent. In general, it starts with a list of URLs        distinguish not only between crawling strategies but as well
to visit, called the seeds. As the crawler visits these URLs, it         between crawler types. They are nevertheless related as
identifies all the hyperlinks in the page and adds them to the           different crawling strategies are used for different crawler
list of URLs to visit, called the crawl frontier (usually                types, which thus are related to the specific intentions we
                                                                         pursue when crawling the Web. While the crawling strategies
                                                                         are defined using the frontier growth behavior, the crawler
                                                                         types are based upon the scope of a crawl. They include types
    Anup A. Garje, Department of Computer Technology, Veetmata Jijabai   as broad, focused, topical or continuous crawling.
Technilogical Institute,Matunga, Mumbai.India                                      The two most important types of web crawling are;
    anupg.007@gmail.com                                                  broad and focused crawling. Broad (or universal) crawls
    Prof. Bhavesh Patel Department of Computer Technology, Veetmata
Jijabai Technilogical Institute,Matunga, Mumbai. India                   can be described as large crawls with a high bandwidth usage
    bh_patelin@yahoo.co.in                                               where the crawler fetches a large number of Web sites and
                                                                         goes as well into a high depth on each crawled site. This
    Dr. B. B. Meshram Head of Dept. of Computer Technology, Veermata     crawl type fits to the intention of crawling a large part of the
Jijabai      Technological   Institute,Matunga    Mumbai.       India
bbmeshram@vjti.org.in                                                    Web, if not even the whole Web. Not only the amount of
                                                                         collected Web data is important, but as well the completeness

                                                                                                                                      353
                                                  All Rights Reserved © 2012 IJARCET
ISSN: 2278 – 1323
                                       International Journal of Advanced Research in Computer Engineering & Technology
                                                                                           Volume 1, Issue 4, June 2012

of coverage of single Web sites. Focused (or topical) crawls                 3.   The crawler extracts links from the downloaded
on the other side are characterized by the fact that a number                     document.
of criteria are defined that limits the scope of a crawl (e.g. by
limiting the URLs to be visited to certain domains); the                     4.   Based on given rules the crawler decides
crawler fetches similar pages topic-wise. This crawl type is                      whether it wants to permanently store the
used with the intention to collect pages from a specific                          downloaded documents, index them, generate
domain, category, topic or similar.                                               metadata etc...

                                                                             5.   The crawler feeds the extracted links to the
                                                                                  frontier queue.
              II. CRAWLING - AN OVERVIEW
   In the following section we will introduce both the Web
crawler as such and some commonly known crawling                              The above steps are executed for all URLs that are
strategies that can be applied to them. A Web crawler, also         crawled by the Web crawler. Although a crawler has only one
called a robot or spider, is a software program that starts with    frontier, the frontier has multiple queues that are filled with
a set of URIs, fetches the documents (e.g. HTML pages,              URLs. Queues can be built based on different schemes: e.g.
service descriptions, images, audio files, etc.) available at       one queue per host. Additionally the queues can be ranked
those URLs, extract the URLs, i.e. links, from the documents        within the frontier which makes then that certain queues are
fetched in the previous step and start over the process             served earlier by the frontier than others. A similar issue is as
previously described. That is it automatically downloads            well the ranking of the URLs within the single queues.
Web pages and follows links in the pages, this way moving           During the setup of a crawl, it must be decided what URLs
from one Webpage to another.                                        get what priorities and get thus removed either early or late
                                                                    from a queue to be processed further. If a frontier is not set
                                                                    any limit and if the crawler disposes over unlimited hardware
                                                                    resources, it may grow indefinitely. This can be avoided by
                                                                    limiting the growth of the frontier, either by, e.g., restricting
                                                                    the number of pages the crawler may download from a
                                                                    domain, or by restricting the number of overall visited Web
                                                                    sites, what would at the same time limit the scope of the
                                                                    crawl. Whatever frontier strategy is chosen, the crawler
                                                                    proceeds in the same way with the URLs it gets from the
                                                                    frontier.




                                                                      III. SYSTEM DESIGN AND IMPLEMENTATION
                                                                       Architectural Details:
                                                                       Our main goal is to realize a fully distributed, Peer-to-Peer
                                                                    web crawler framework and highlight the features,
                                                                    advantages and credibility of such a system. Our system,
        Fig :General architecture of a Web Crawler                  named JADE, follows a fully decentralized distributed
             We will now shortly describe the basic steps that      architecture. A fully decentralized architecture means that
a crawler is executing. What it basically does is executing         there will be no central server or control entity, all the
different specific steps in a sequential way. The crawler starts    different components are considered to be of equal status (i.e.
by taking a set of seed pages, i.e. the URLs (Uniform               peers). The system uses an overlay network, which could be a
Resource Locator) which it starts with. It uses the URLs to         local network for peer-to-peer communication and an
build its frontier, i.e. the list (that is a queue) of unvisited    underlay network which is the network form which
URLs of the crawler. In the scope of one crawl this frontier is     information is crawled and indexed.
dynamic as it is extended by the URLs extracted from already           The overlay system provides a fully equipped framework
visited pages. The edge of a frontier will be limited by the        for peer-to-peer communication. The basic requirements
number of URLs found in all downloaded documents (and by            from such a network are an efficient communication
politeness restrictions that are followed for different servers).   platform, an environment for distributed data management
So once a URL is taken from the frontier queue it traverses         and retrieval, a fault tolerance and self-administering and
the following steps:                                                peer managing network.
          1. The crawler scheduler checks whether this page            Our system manly comprises of peer-entities, which form
              is intended to be fetched, i.e. whether there are
                                                                    the atomic units of the system and can be used as standalone
              no rules or policies that exclude this URL.
                                                                    or in a network. The following diagram shows the structural
         2.   The document the URL points to is fetched by          components of a single peer-entity;
              the multithreaded downloader.



                                                                                                                                 354
                                               All Rights Reserved © 2012 IJARCET
ISSN: 2278 – 1323
                                      International Journal of Advanced Research in Computer Engineering & Technology
                                                                                          Volume 1, Issue 4, June 2012

                                                                  are instantly searchable (in contrast to batch-processing of
                                                                  other search engine software).
                                                                  THE INDEXER
                                                                            The page indexing is done by the creation of a
                                                                  'reverse word index' (RWI): every page is parsed, the words
                                                                  are extracted and for every word a database table is
                                                                  maintained. The database tables are held in a file-based
                                                                  hash-table, so accessing a word index is extremely fast,
                                                                  resulting in an extremely fast search. The RWIs are hashed
                                                                  form in the database which leads to that the information not
                                                                  stored in plaintext and therefore the security of the index
                                                                  holder rises, since it is not possible to conclude who created
                                                                  the data. At the end of every crawling procedure the index is
                                                                  distributed over the peers participating in the P2P network.
                                                                  Only index entries in the form of URLs are stored, no other
                                                                  caching is performed. Jade implements its index structure as
                                                                  a distributed hash table (DHT): a hash function is applied to
                                                                  every index entry; the entry is then distributed to the
                                                                  appropriate peer. The resulting index contains no information
                                                                  about the origin of the keywords stored in it. Moreover,
                                                                  shared filters offer customized protection against undesirable
                                                                  content.
                                                                  THE DATABASE
                                                                            The database stores all indexed data provided by the
                      Fig : A peer entity                         indexer and the P2P, which is added to the network. They are
                                                                  each peer the data to fit his DHT and through the assigned
   It consists of following components;                           migration index of the directory. The structure of the
                                                                  database will be balanced, binary search tree (AVL tree)
       Crawler
                                                                  formed to a logarithmic search time on the number of
       Indexer                                                   elements in the tree. AVL, The name derives from the
       Database component                                        inventors and Adelson-Velsky Landis by whom this data
   If you are using Word, use either the Microsoft Equation       structure for balanced data distribution was developed in
Editor or the MathType add-on (http://www.mathtype.com)           1962. The AVL property ensures maximum performance in
for equations in your paper (Insert | Object | Create New |       terms of algorithmic order.
Microsoft Equation or MathType Equation). “Float over                       Tree nodes can be dynamically allocated and
text” should not be selected.                                     de-allocated and an unused-node list is maintained. For the
                                                                  PLASMA search algorithm, an ordered access to search
                                                                  results are necessary, therefore we needed an indexing
               IV. SYSTEM INTERNALS                               mechanism which stores the index in an ordered way. The
                                                                  database supports such access, and the resulting database
THE CRAWLER
                                                                  tables are stored as a single file. It is completely
          As mentioned earlier, the main goal of our system
                                                                  self-organizing and does not need any set-up or maintenance
was to implement a fully distributed, P2P web crawler.
                                                                  tasks that must be done by an administrator. Any database
Traditionally, crawling process consisted of recursively
                                                                  may grow to an unthinkable number of records: with one
requesting for a webpage, extracting the links from that page,
                                                                  billion records a database request needs a theoretical
and then requesting the pages from the extracted links. Each
                                                                  maximum number of only 44 comparisons.
page is parsed, indexed for keywords or other parameters and
                                                                            We have implemented Kelondro database
then links fro the page are extracted. The crawler then calls
                                                                  subsystem for realizing the above mentioned requirements
the extracted pages and thus the process continues.
                                                                  and features. The Kelondro database, is an open source AVL
          Apart from the above mentioned crawling method,
                                                                  based database structure, which provides all he necessary
another method exists, known as the proxy method. By using
                                                                  schema, functions and methods for inserting, querying,
a web proxy, that allows users to access pages from the web
                                                                  modifying a AVL tree based database.
through the proxy, we could index and parse the pages that
pass through the proxy. Thus only the pages visited by the
user will be indexed and parsed. Thus the user unknowingly
contributes in the indexing of the pages. The local caching of
the visited pages improves the access time of the system.
Advanced filtering can also be performed easily on the local
cache of visited pages.
          Our system runs a large number of processes which
operate on data stacks and data queues that are filled during a
web crawl and indexing process. The proposed system does a
real-time indexing, that means all pages that pass the crawler


                                                                                                                            355
                                             All Rights Reserved © 2012 IJARCET
ISSN: 2278 – 1323
                                      International Journal of Advanced Research in Computer Engineering & Technology
                                                                                          Volume 1, Issue 4, June 2012

                V. INFORMATION FLOW                               plasmaIndex - Objects are the true reverse words indexes. In
                                                                  plasmaIndex the plasmaIndexEntry - objects are stored in a
                                                                  kelondroTree; an indexed file in the file system.
                                                                  8 SEARCH/ QUERY FLOW




          Fig : Crawler Information Flow Diagram

   The above diagram shows the flow of information in the
crawler system. The HTML file from a crawled URL is
loaded onto the httpProxyServelet module. This module is             Fig: The Search/ Query Information Flow Diagram.
the proxy process that runs in the background.                       The above diagram shows the flow of a user query or a
   The file is then transferred to htpProxyCache module,          keyword search in the system.
which provides the proxy cache and where the processing of           The keyword or the query entered by the user is passed
the file is delayed until the proxy is idle. The cache entry is   onto the httpdFileServelet process, which accepts the
passed on to the plasmaSwitchboard module. This is the core       information and passes it to the plasmaSwitchBoard module.
module that forms the central part of the system..                   The query is validated and checked for consistency before
   There the URL is stored into plasmaLURL where the URL          passing it to plasmaSearch module, which is the search
is stored under a specific hash. The URL's from the content       function on the index. In plasmaSearch, the
are stripped off, stored in plasmaLURL with a 'wrong' date        plasmaSearchResult object is generated by simultaneous
(the date of the URL's are not known at this time, only after     enumeration of URL hashes in the reverse word indexes
fetching) and stacked with plasmaCrawlerTextStack.                plasmaIndex. The result page is then generated from this
   The content is read and splitted into rated words in           plasmaSearchResult object
plasmaCondenser. The splitted words are then integrated into
the index with plasmaSearch. In plasmaSearch the words are                       VI. SYSTEM ANALYSIS
indexed by reversing the relation between URL and words:          In this section we shall discuss certain security aspects,
one URL points to many words, the words within the                software structure, advantages, disadvantages and future
document at the URL. After reversing, one word points to          scope of the proposed system.
many URL's, all the URL's where the word occurrs. One                 SECURITY & PRIVACY ASPECTS
single     word->URL-hash        relation     is   stored    in   The system largely
plasmaIndexEntry. A set of plasmaIndexEntries is a reverse        Sharing the index to other users may arise privacy concerns.
word index. This reverse word index is stored temporarly in       The following properties were decide and implemented to
plasmaIndexCache.                                                 take care of security & privacy concerns;
   In plasmaIndexCache the single plasmaIndexEntry'ies are        Private Index and The local word index does not only
collected and stored into a plasmaIndex – entry. These            Index Movement contain information that a peer created by

                                                                                                                          356
                                             All Rights Reserved © 2012 IJARCET
ISSN: 2278 – 1323
                                    International Journal of Advanced Research in Computer Engineering & Technology
                                                                                        Volume 1, Issue 4, June 2012

                 surfing the internet, but also entries from                         VII. CONCLUSION
                 other peers. Word index files travel along               Thus, one can note that as the size and usage of the
                 the proxy peers to form a distributed hash      WWW increases, the use of a good Information retrieval
                 table. Therefore nobody can argue that          system comprising of indexers and crawlers becomes trivial.
                 information that is provided by this peer       The current web information retrieval systems provide too
                 was also retrieved by this peer and             much of censorship and restrictive policies. There is a need
                 therefore by the peer’s personal use of the     for a distributed, free and a collective view of the task of
                 internet. In fact it is very unlikely that      information retrieval and web page indexing and caching.
                 information that can be found on a peer         The system proposed aims to provide these views and design
                 was created by the peer itself, since the       goals.
                 search process targets only peers where it
                 is likely because of the movement of the                  The use of a censorship-free policy avoids all the
                 index to form the distributed hash table.       restrictions provided by current systems and enables full
                 During a test phase, all word indexes on a      coverage of the WWW as well as of “hidden web” or the
                 peer will be accessible. The future             “deep web”. This is not possible using existing systems. Also
                 production release will constraint searches     the use of Distributed Hash Tables (DHTs) and key based
                 to indexes entries on the peer that have        routing provides a solid framework for a distributed
                 been created by other peers, which will         peer-to-peer network architecture. The proxy provides an
                 ensure complete browsing privacy.               added functionality of caching web pages visited by the user,
Word Index       The words that are stored in the client’s       which is performed in the background.
Storage and      local word index are stored using a word
Content          hash. That means that not any word is                    We believe that the system proposed by us will
Responsibility   stored, but only the word hash. You             prove to be a complete implementation of a fully distributed,
                 cannot find any word that is indexed as         peer-to-peer architecture for web crawling, to be used on
                 clear text. You can also not re-translate the   small networks or campus networks. We believe that we have
                 word hashes into the original word. This        clearly stated the requirements, implementation details and
                 means that you don't know actually which        usage advantages regarding such a system and highlighted its
                 words are stored in your system. The            purpose.
                 positive effect is, that you cannot be
                 responsible for the words that are stored in                        VIII. REFERENCES
                 your peer. But if you want to deny storage          1.   Kobayashi, M. and Takeda, K. (2000). "Information
                 of specific words, you can put them into                 retrieval on the web". ACM Computing Surveys
                 the      'bluelist'    (in      the      file            (ACM Press) 32 (2): 144–173.
                 httpProxy.bluelist). No word that is in the         2.   Boldi, Paolo; Bruno Codenotti, Massimo Santini,
                 bluelist can be stored, searched or even                 Sebastiano Vigna (2004). "UbiCrawler: a scalable
                 viewed through the proxy.                                fully distributed Web crawler". Software: Practice
Peer             Information that is passed from one peer                 and Experience.
Communication    to another is encoded. That means that no           3.   Heydon, Allan; Najork, Marc (1999-06-26)
Encryption       information like search words, indexed                   Mercator: A Scalable, Extensible Web Crawler.
                 URL's or URL descriptions is transported
                                                                          http://www.cindoc.csic.es/cybermetrics/pdf/68.pdf.
                 in clear text. Network sniffers cannot see
                 the content that is exchanged. We also              4.   Brin, S. and Page, L. (1998). The anatomy of a
                 implemented an encryption method,                        large-scale hypertextual Web search engine.
                 where a temporary key, created by the               5.   Zeinalipour-Yazti, D. and Dikaiakos, M. D. (2002).
                 requesting peer is used to encrypt the                   Design and implementation of a distributed crawler
                 response (not yet active in test release, but            and filtering processor. In Proceedings of the Fifth
                 non-ascii/base64 - encoding is in place).                Next Generation Information Technologies and
Access           The proxy contains a two-stage access                    Systems (NGITS), volume 2382 of Lecture Notes in
Restrictions     control: IP filter check and an                          Computer Science, pages 58–74, Caesarea, Israel.
                 account/password gateway that can be                     Springer.
                 configured to access the proxy. The
                                                                     6.   Ali Ghodsi. Distributed k-ary System: Algorithms
                 default setting denies access to your proxy
                 from the internet, but allowes usage from                for Distributed Hash Tables. KTH-Royal Institute of
                 the intranet. The proxy and it's security                Technology, 2006.
                 settings can be configured using the                7. Goyal, Vikram (2003), Using the Jakarta Commons,
                 built-in web server for service pages; the               Part                                          I,
                 access to this service pages itself can also             http://www.onjava.com/pub/a/onjava/2003/06/25/c
                 be restricted again by using an IP filter and            ommons.html
                 an account/password combination.




                                                                                                                          357
                                            All Rights Reserved © 2012 IJARCET

More Related Content

What's hot

supporting privacy protection in personalized web search
supporting privacy protection in personalized web searchsupporting privacy protection in personalized web search
supporting privacy protection in personalized web searchswathi78
 
Proactive Approach to Estimate the Re-crawl Period for Resource Minimization ...
Proactive Approach to Estimate the Re-crawl Period for Resource Minimization ...Proactive Approach to Estimate the Re-crawl Period for Resource Minimization ...
Proactive Approach to Estimate the Re-crawl Period for Resource Minimization ...IJCSIS Research Publications
 
Open Corpus Adaptive Hypermedia
Open Corpus Adaptive HypermediaOpen Corpus Adaptive Hypermedia
Open Corpus Adaptive HypermediaPeter Brusilovsky
 
A machine learning approach to web page filtering using ...
A machine learning approach to web page filtering using ...A machine learning approach to web page filtering using ...
A machine learning approach to web page filtering using ...butest
 
Web Mining
Web Mining Web Mining
Web Mining guestb73ec6
 

What's hot (6)

supporting privacy protection in personalized web search
supporting privacy protection in personalized web searchsupporting privacy protection in personalized web search
supporting privacy protection in personalized web search
 
Proactive Approach to Estimate the Re-crawl Period for Resource Minimization ...
Proactive Approach to Estimate the Re-crawl Period for Resource Minimization ...Proactive Approach to Estimate the Re-crawl Period for Resource Minimization ...
Proactive Approach to Estimate the Re-crawl Period for Resource Minimization ...
 
Open Corpus Adaptive Hypermedia
Open Corpus Adaptive HypermediaOpen Corpus Adaptive Hypermedia
Open Corpus Adaptive Hypermedia
 
A machine learning approach to web page filtering using ...
A machine learning approach to web page filtering using ...A machine learning approach to web page filtering using ...
A machine learning approach to web page filtering using ...
 
Web Mining
Web Mining Web Mining
Web Mining
 
Seminar on crawler
Seminar on crawlerSeminar on crawler
Seminar on crawler
 

Viewers also liked

IChresemo Technologies
IChresemo TechnologiesIChresemo Technologies
IChresemo TechnologiesChinna Chresemo
 
NoSQL
NoSQLNoSQL
NoSQLdbulic
 
TLA_ fuer_Drittsemester
TLA_ fuer_DrittsemesterTLA_ fuer_Drittsemester
TLA_ fuer_Drittsemestersimondschweitzer
 
Google Updates 2014 - Three Birds and a Bear
Google Updates 2014 - Three Birds and a BearGoogle Updates 2014 - Three Birds and a Bear
Google Updates 2014 - Three Birds and a BearLead Generation Websites
 
Fachartikel "Die RĂŒckkehr der Telefonie", Call Center Scout, 10/2013
Fachartikel "Die RĂŒckkehr der Telefonie", Call Center Scout, 10/2013Fachartikel "Die RĂŒckkehr der Telefonie", Call Center Scout, 10/2013
Fachartikel "Die RĂŒckkehr der Telefonie", Call Center Scout, 10/2013Anja Bonelli
 
Backbone js in action
Backbone js in actionBackbone js in action
Backbone js in actionUsha Guduri
 

Viewers also liked (7)

IChresemo Technologies
IChresemo TechnologiesIChresemo Technologies
IChresemo Technologies
 
NoSQL
NoSQLNoSQL
NoSQL
 
Backbonejs
BackbonejsBackbonejs
Backbonejs
 
TLA_ fuer_Drittsemester
TLA_ fuer_DrittsemesterTLA_ fuer_Drittsemester
TLA_ fuer_Drittsemester
 
Google Updates 2014 - Three Birds and a Bear
Google Updates 2014 - Three Birds and a BearGoogle Updates 2014 - Three Birds and a Bear
Google Updates 2014 - Three Birds and a Bear
 
Fachartikel "Die RĂŒckkehr der Telefonie", Call Center Scout, 10/2013
Fachartikel "Die RĂŒckkehr der Telefonie", Call Center Scout, 10/2013Fachartikel "Die RĂŒckkehr der Telefonie", Call Center Scout, 10/2013
Fachartikel "Die RĂŒckkehr der Telefonie", Call Center Scout, 10/2013
 
Backbone js in action
Backbone js in actionBackbone js in action
Backbone js in action
 

Similar to 353 357

[LvDuit//Lab] Crawling the web
[LvDuit//Lab] Crawling the web[LvDuit//Lab] Crawling the web
[LvDuit//Lab] Crawling the webVan-Duyet Le
 
Smart Crawler Automation with RMI
Smart Crawler Automation with RMISmart Crawler Automation with RMI
Smart Crawler Automation with RMIIRJET Journal
 
Intelligent Web Crawling (WI-IAT 2013 Tutorial)
Intelligent Web Crawling (WI-IAT 2013 Tutorial)Intelligent Web Crawling (WI-IAT 2013 Tutorial)
Intelligent Web Crawling (WI-IAT 2013 Tutorial)Denis Shestakov
 
AN EXTENDED MODEL FOR EFFECTIVE MIGRATING PARALLEL WEB CRAWLING WITH DOMAIN S...
AN EXTENDED MODEL FOR EFFECTIVE MIGRATING PARALLEL WEB CRAWLING WITH DOMAIN S...AN EXTENDED MODEL FOR EFFECTIVE MIGRATING PARALLEL WEB CRAWLING WITH DOMAIN S...
AN EXTENDED MODEL FOR EFFECTIVE MIGRATING PARALLEL WEB CRAWLING WITH DOMAIN S...ijwscjournal
 
AN EXTENDED MODEL FOR EFFECTIVE MIGRATING PARALLEL WEB CRAWLING WITH DOMAIN S...
AN EXTENDED MODEL FOR EFFECTIVE MIGRATING PARALLEL WEB CRAWLING WITH DOMAIN S...AN EXTENDED MODEL FOR EFFECTIVE MIGRATING PARALLEL WEB CRAWLING WITH DOMAIN S...
AN EXTENDED MODEL FOR EFFECTIVE MIGRATING PARALLEL WEB CRAWLING WITH DOMAIN S...ijwscjournal
 
E3602042044
E3602042044E3602042044
E3602042044ijceronline
 
Focused web crawling using named entity recognition for narrow domains
Focused web crawling using named entity recognition for narrow domainsFocused web crawling using named entity recognition for narrow domains
Focused web crawling using named entity recognition for narrow domainseSAT Journals
 
A machine learning approach to web page filtering using ...
A machine learning approach to web page filtering using ...A machine learning approach to web page filtering using ...
A machine learning approach to web page filtering using ...butest
 
Sekhon final 1_ppt
Sekhon final 1_pptSekhon final 1_ppt
Sekhon final 1_pptManant Sweet
 
Pdd crawler a focused web
Pdd crawler  a focused webPdd crawler  a focused web
Pdd crawler a focused webcsandit
 
SPEEDING UP THE WEB CRAWLING PROCESS ON A MULTI-CORE PROCESSOR USING VIRTUALI...
SPEEDING UP THE WEB CRAWLING PROCESS ON A MULTI-CORE PROCESSOR USING VIRTUALI...SPEEDING UP THE WEB CRAWLING PROCESS ON A MULTI-CORE PROCESSOR USING VIRTUALI...
SPEEDING UP THE WEB CRAWLING PROCESS ON A MULTI-CORE PROCESSOR USING VIRTUALI...ijwscjournal
 
The Research on Related Technologies of Web Crawler
The Research on Related Technologies of Web CrawlerThe Research on Related Technologies of Web Crawler
The Research on Related Technologies of Web CrawlerIRJESJOURNAL
 
Web Crawling Using Location Aware Technique
Web Crawling Using Location Aware TechniqueWeb Crawling Using Location Aware Technique
Web Crawling Using Location Aware Techniqueijsrd.com
 
IRJET-Deep Web Crawling Efficiently using Dynamic Focused Web Crawler
IRJET-Deep Web Crawling Efficiently using Dynamic Focused Web CrawlerIRJET-Deep Web Crawling Efficiently using Dynamic Focused Web Crawler
IRJET-Deep Web Crawling Efficiently using Dynamic Focused Web CrawlerIRJET Journal
 
A Novel Interface to a Web Crawler using VB.NET Technology
A Novel Interface to a Web Crawler using VB.NET TechnologyA Novel Interface to a Web Crawler using VB.NET Technology
A Novel Interface to a Web Crawler using VB.NET TechnologyIOSR Journals
 

Similar to 353 357 (20)

407 409
407 409407 409
407 409
 
[LvDuit//Lab] Crawling the web
[LvDuit//Lab] Crawling the web[LvDuit//Lab] Crawling the web
[LvDuit//Lab] Crawling the web
 
Smart Crawler Automation with RMI
Smart Crawler Automation with RMISmart Crawler Automation with RMI
Smart Crawler Automation with RMI
 
Intelligent Web Crawling (WI-IAT 2013 Tutorial)
Intelligent Web Crawling (WI-IAT 2013 Tutorial)Intelligent Web Crawling (WI-IAT 2013 Tutorial)
Intelligent Web Crawling (WI-IAT 2013 Tutorial)
 
AN EXTENDED MODEL FOR EFFECTIVE MIGRATING PARALLEL WEB CRAWLING WITH DOMAIN S...
AN EXTENDED MODEL FOR EFFECTIVE MIGRATING PARALLEL WEB CRAWLING WITH DOMAIN S...AN EXTENDED MODEL FOR EFFECTIVE MIGRATING PARALLEL WEB CRAWLING WITH DOMAIN S...
AN EXTENDED MODEL FOR EFFECTIVE MIGRATING PARALLEL WEB CRAWLING WITH DOMAIN S...
 
AN EXTENDED MODEL FOR EFFECTIVE MIGRATING PARALLEL WEB CRAWLING WITH DOMAIN S...
AN EXTENDED MODEL FOR EFFECTIVE MIGRATING PARALLEL WEB CRAWLING WITH DOMAIN S...AN EXTENDED MODEL FOR EFFECTIVE MIGRATING PARALLEL WEB CRAWLING WITH DOMAIN S...
AN EXTENDED MODEL FOR EFFECTIVE MIGRATING PARALLEL WEB CRAWLING WITH DOMAIN S...
 
E3602042044
E3602042044E3602042044
E3602042044
 
Faster and resourceful multi core web crawling
Faster and resourceful multi core web crawlingFaster and resourceful multi core web crawling
Faster and resourceful multi core web crawling
 
Focused web crawling using named entity recognition for narrow domains
Focused web crawling using named entity recognition for narrow domainsFocused web crawling using named entity recognition for narrow domains
Focused web crawling using named entity recognition for narrow domains
 
A machine learning approach to web page filtering using ...
A machine learning approach to web page filtering using ...A machine learning approach to web page filtering using ...
A machine learning approach to web page filtering using ...
 
Sekhon final 1_ppt
Sekhon final 1_pptSekhon final 1_ppt
Sekhon final 1_ppt
 
Web crawler
Web crawlerWeb crawler
Web crawler
 
webcrawler.pptx
webcrawler.pptxwebcrawler.pptx
webcrawler.pptx
 
Pdd crawler a focused web
Pdd crawler  a focused webPdd crawler  a focused web
Pdd crawler a focused web
 
SPEEDING UP THE WEB CRAWLING PROCESS ON A MULTI-CORE PROCESSOR USING VIRTUALI...
SPEEDING UP THE WEB CRAWLING PROCESS ON A MULTI-CORE PROCESSOR USING VIRTUALI...SPEEDING UP THE WEB CRAWLING PROCESS ON A MULTI-CORE PROCESSOR USING VIRTUALI...
SPEEDING UP THE WEB CRAWLING PROCESS ON A MULTI-CORE PROCESSOR USING VIRTUALI...
 
The Research on Related Technologies of Web Crawler
The Research on Related Technologies of Web CrawlerThe Research on Related Technologies of Web Crawler
The Research on Related Technologies of Web Crawler
 
Web Crawling Using Location Aware Technique
Web Crawling Using Location Aware TechniqueWeb Crawling Using Location Aware Technique
Web Crawling Using Location Aware Technique
 
IRJET-Deep Web Crawling Efficiently using Dynamic Focused Web Crawler
IRJET-Deep Web Crawling Efficiently using Dynamic Focused Web CrawlerIRJET-Deep Web Crawling Efficiently using Dynamic Focused Web Crawler
IRJET-Deep Web Crawling Efficiently using Dynamic Focused Web Crawler
 
F43033234
F43033234F43033234
F43033234
 
A Novel Interface to a Web Crawler using VB.NET Technology
A Novel Interface to a Web Crawler using VB.NET TechnologyA Novel Interface to a Web Crawler using VB.NET Technology
A Novel Interface to a Web Crawler using VB.NET Technology
 

More from Editor IJARCET

Electrically small antennas: The art of miniaturization
Electrically small antennas: The art of miniaturizationElectrically small antennas: The art of miniaturization
Electrically small antennas: The art of miniaturizationEditor IJARCET
 
Volume 2-issue-6-2205-2207
Volume 2-issue-6-2205-2207Volume 2-issue-6-2205-2207
Volume 2-issue-6-2205-2207Editor IJARCET
 
Volume 2-issue-6-2195-2199
Volume 2-issue-6-2195-2199Volume 2-issue-6-2195-2199
Volume 2-issue-6-2195-2199Editor IJARCET
 
Volume 2-issue-6-2200-2204
Volume 2-issue-6-2200-2204Volume 2-issue-6-2200-2204
Volume 2-issue-6-2200-2204Editor IJARCET
 
Volume 2-issue-6-2190-2194
Volume 2-issue-6-2190-2194Volume 2-issue-6-2190-2194
Volume 2-issue-6-2190-2194Editor IJARCET
 
Volume 2-issue-6-2186-2189
Volume 2-issue-6-2186-2189Volume 2-issue-6-2186-2189
Volume 2-issue-6-2186-2189Editor IJARCET
 
Volume 2-issue-6-2177-2185
Volume 2-issue-6-2177-2185Volume 2-issue-6-2177-2185
Volume 2-issue-6-2177-2185Editor IJARCET
 
Volume 2-issue-6-2173-2176
Volume 2-issue-6-2173-2176Volume 2-issue-6-2173-2176
Volume 2-issue-6-2173-2176Editor IJARCET
 
Volume 2-issue-6-2165-2172
Volume 2-issue-6-2165-2172Volume 2-issue-6-2165-2172
Volume 2-issue-6-2165-2172Editor IJARCET
 
Volume 2-issue-6-2159-2164
Volume 2-issue-6-2159-2164Volume 2-issue-6-2159-2164
Volume 2-issue-6-2159-2164Editor IJARCET
 
Volume 2-issue-6-2155-2158
Volume 2-issue-6-2155-2158Volume 2-issue-6-2155-2158
Volume 2-issue-6-2155-2158Editor IJARCET
 
Volume 2-issue-6-2148-2154
Volume 2-issue-6-2148-2154Volume 2-issue-6-2148-2154
Volume 2-issue-6-2148-2154Editor IJARCET
 
Volume 2-issue-6-2143-2147
Volume 2-issue-6-2143-2147Volume 2-issue-6-2143-2147
Volume 2-issue-6-2143-2147Editor IJARCET
 
Volume 2-issue-6-2119-2124
Volume 2-issue-6-2119-2124Volume 2-issue-6-2119-2124
Volume 2-issue-6-2119-2124Editor IJARCET
 
Volume 2-issue-6-2139-2142
Volume 2-issue-6-2139-2142Volume 2-issue-6-2139-2142
Volume 2-issue-6-2139-2142Editor IJARCET
 
Volume 2-issue-6-2130-2138
Volume 2-issue-6-2130-2138Volume 2-issue-6-2130-2138
Volume 2-issue-6-2130-2138Editor IJARCET
 
Volume 2-issue-6-2125-2129
Volume 2-issue-6-2125-2129Volume 2-issue-6-2125-2129
Volume 2-issue-6-2125-2129Editor IJARCET
 
Volume 2-issue-6-2114-2118
Volume 2-issue-6-2114-2118Volume 2-issue-6-2114-2118
Volume 2-issue-6-2114-2118Editor IJARCET
 
Volume 2-issue-6-2108-2113
Volume 2-issue-6-2108-2113Volume 2-issue-6-2108-2113
Volume 2-issue-6-2108-2113Editor IJARCET
 
Volume 2-issue-6-2102-2107
Volume 2-issue-6-2102-2107Volume 2-issue-6-2102-2107
Volume 2-issue-6-2102-2107Editor IJARCET
 

More from Editor IJARCET (20)

Electrically small antennas: The art of miniaturization
Electrically small antennas: The art of miniaturizationElectrically small antennas: The art of miniaturization
Electrically small antennas: The art of miniaturization
 
Volume 2-issue-6-2205-2207
Volume 2-issue-6-2205-2207Volume 2-issue-6-2205-2207
Volume 2-issue-6-2205-2207
 
Volume 2-issue-6-2195-2199
Volume 2-issue-6-2195-2199Volume 2-issue-6-2195-2199
Volume 2-issue-6-2195-2199
 
Volume 2-issue-6-2200-2204
Volume 2-issue-6-2200-2204Volume 2-issue-6-2200-2204
Volume 2-issue-6-2200-2204
 
Volume 2-issue-6-2190-2194
Volume 2-issue-6-2190-2194Volume 2-issue-6-2190-2194
Volume 2-issue-6-2190-2194
 
Volume 2-issue-6-2186-2189
Volume 2-issue-6-2186-2189Volume 2-issue-6-2186-2189
Volume 2-issue-6-2186-2189
 
Volume 2-issue-6-2177-2185
Volume 2-issue-6-2177-2185Volume 2-issue-6-2177-2185
Volume 2-issue-6-2177-2185
 
Volume 2-issue-6-2173-2176
Volume 2-issue-6-2173-2176Volume 2-issue-6-2173-2176
Volume 2-issue-6-2173-2176
 
Volume 2-issue-6-2165-2172
Volume 2-issue-6-2165-2172Volume 2-issue-6-2165-2172
Volume 2-issue-6-2165-2172
 
Volume 2-issue-6-2159-2164
Volume 2-issue-6-2159-2164Volume 2-issue-6-2159-2164
Volume 2-issue-6-2159-2164
 
Volume 2-issue-6-2155-2158
Volume 2-issue-6-2155-2158Volume 2-issue-6-2155-2158
Volume 2-issue-6-2155-2158
 
Volume 2-issue-6-2148-2154
Volume 2-issue-6-2148-2154Volume 2-issue-6-2148-2154
Volume 2-issue-6-2148-2154
 
Volume 2-issue-6-2143-2147
Volume 2-issue-6-2143-2147Volume 2-issue-6-2143-2147
Volume 2-issue-6-2143-2147
 
Volume 2-issue-6-2119-2124
Volume 2-issue-6-2119-2124Volume 2-issue-6-2119-2124
Volume 2-issue-6-2119-2124
 
Volume 2-issue-6-2139-2142
Volume 2-issue-6-2139-2142Volume 2-issue-6-2139-2142
Volume 2-issue-6-2139-2142
 
Volume 2-issue-6-2130-2138
Volume 2-issue-6-2130-2138Volume 2-issue-6-2130-2138
Volume 2-issue-6-2130-2138
 
Volume 2-issue-6-2125-2129
Volume 2-issue-6-2125-2129Volume 2-issue-6-2125-2129
Volume 2-issue-6-2125-2129
 
Volume 2-issue-6-2114-2118
Volume 2-issue-6-2114-2118Volume 2-issue-6-2114-2118
Volume 2-issue-6-2114-2118
 
Volume 2-issue-6-2108-2113
Volume 2-issue-6-2108-2113Volume 2-issue-6-2108-2113
Volume 2-issue-6-2108-2113
 
Volume 2-issue-6-2102-2107
Volume 2-issue-6-2102-2107Volume 2-issue-6-2102-2107
Volume 2-issue-6-2102-2107
 

Recently uploaded

From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 

Recently uploaded (20)

From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 

353 357

  • 1. ISSN: 2278 – 1323 International Journal of Advanced Research in Computer Engineering & Technology Volume 1, Issue 4, June 2012 Realizing Peer-to-Peer and Distributed Web Crawler Anup A Garje, Prof. Bhavesh Patel, Dr. B. B. Meshram  implemented as a queue). URLs from the frontier are Abstract—The tremendous growth of the World Wide Web recursively visited according to a set of policies. has made tools such as search engines and information retrieval We present several issues to take into account when systems have become essential. In this dissertation, we propose crawling the Web. They lead to the fact that at design time of a fully distributed, peer-to-peer architecture for web crawling. a crawl, its intention needs to be fixed. The intention is The main goal behind the development of such a system is to defined by the goal that a specific crawl is targeted at; this can provide an alternative but efficient, easily implementable and a decentralized system for crawling, indexing, caching and differ in terms of crawl length, crawl intervals, crawl scope, querying web pages. The main function of a webcrawler is to etc. A major issue is that the Web is not static, but rather recursively visit web pages, extract all URLs form the page, dynamic and thus changes on the timescale of days, hours, parse the page for keywords and visit the extracted URLs minutes. There are billions of documents available on the recursively. We propose an architecture that can be easily Web and crawling all data and furthermore maintaining a implemeneted on a local (campus) network and which follows a good freshness of the data becomes almost impossible. To fully distributed, peer-to-peer architecture. The architecture always keep the crawled data up-to-date we would need to specifications, implementation details, requirements to be met continuously crawl the Web, revisiting all pages we have and analysis of such a system is discussed. once crawled. Whether we want to do this depends on the earlier mentioned crawling intention. Such an intention can for example be that we want to cover a preferably big part of Index Terms— Peer to peer, Distributed,Crawling,indexing the Web, crawl the Web for news on one topic or monitor one specific Web site for changes. We discuss two different crawling strategies that I. INTRODUCTION are related to the purpose of a crawl: incremental and Web crawlers download large quantities of data and snapshot crawling. The strategies can be identified by browse documents by passing from one hypertext link to different frontier growth behavior. another. A web crawler (also known as a web spider or web In a snapshot strategy the crawler visits a URL only robot) is a program or automated script which browses the once; if the same URL is discovered again it is considered as World Wide Web in a methodical, automated manner. This duplicate and discarded. Using this strategy the frontier is process is called web crawling or spidering. Many sites, in extended continuously with only new URLs and a crawl can particular search engines, use spidering as a means of spread quite fast. This strategy is optimal if you want to e.g. providing up-to-date data. cover an either big or specific part of the Web once, or in Web crawlers are mainly used to create a copy of all regular intervals. The incremental crawling strategy is the visited pages for later processing by a search engine that optimal for recurring continuous crawls with a limited scope; will index the downloaded pages to provide fast searches. when an already visited URL is rediscovered it is not rejected Crawlers can also be used for automating maintenance tasks but instead put into the frontier again. Using this strategy the on a website, such as checking links or validating HTML frontier queues will never empty and a crawl could go on for code. Also, crawlers can be used to gather specific types of an indefinite long time. This strategy is optimal for information from Web pages, such as harvesting e-mail monitoring a specific part of the Web for changes. Following addresses (usually for spam). A web crawler is one type of the crawling research field and relevant literature, we bot, or software agent. In general, it starts with a list of URLs distinguish not only between crawling strategies but as well to visit, called the seeds. As the crawler visits these URLs, it between crawler types. They are nevertheless related as identifies all the hyperlinks in the page and adds them to the different crawling strategies are used for different crawler list of URLs to visit, called the crawl frontier (usually types, which thus are related to the specific intentions we pursue when crawling the Web. While the crawling strategies are defined using the frontier growth behavior, the crawler types are based upon the scope of a crawl. They include types Anup A. Garje, Department of Computer Technology, Veetmata Jijabai as broad, focused, topical or continuous crawling. Technilogical Institute,Matunga, Mumbai.India The two most important types of web crawling are; anupg.007@gmail.com broad and focused crawling. Broad (or universal) crawls Prof. Bhavesh Patel Department of Computer Technology, Veetmata Jijabai Technilogical Institute,Matunga, Mumbai. India can be described as large crawls with a high bandwidth usage bh_patelin@yahoo.co.in where the crawler fetches a large number of Web sites and goes as well into a high depth on each crawled site. This Dr. B. B. Meshram Head of Dept. of Computer Technology, Veermata crawl type fits to the intention of crawling a large part of the Jijabai Technological Institute,Matunga Mumbai. India bbmeshram@vjti.org.in Web, if not even the whole Web. Not only the amount of collected Web data is important, but as well the completeness 353 All Rights Reserved © 2012 IJARCET
  • 2. ISSN: 2278 – 1323 International Journal of Advanced Research in Computer Engineering & Technology Volume 1, Issue 4, June 2012 of coverage of single Web sites. Focused (or topical) crawls 3. The crawler extracts links from the downloaded on the other side are characterized by the fact that a number document. of criteria are defined that limits the scope of a crawl (e.g. by limiting the URLs to be visited to certain domains); the 4. Based on given rules the crawler decides crawler fetches similar pages topic-wise. This crawl type is whether it wants to permanently store the used with the intention to collect pages from a specific downloaded documents, index them, generate domain, category, topic or similar. metadata etc... 5. The crawler feeds the extracted links to the frontier queue. II. CRAWLING - AN OVERVIEW In the following section we will introduce both the Web crawler as such and some commonly known crawling The above steps are executed for all URLs that are strategies that can be applied to them. A Web crawler, also crawled by the Web crawler. Although a crawler has only one called a robot or spider, is a software program that starts with frontier, the frontier has multiple queues that are filled with a set of URIs, fetches the documents (e.g. HTML pages, URLs. Queues can be built based on different schemes: e.g. service descriptions, images, audio files, etc.) available at one queue per host. Additionally the queues can be ranked those URLs, extract the URLs, i.e. links, from the documents within the frontier which makes then that certain queues are fetched in the previous step and start over the process served earlier by the frontier than others. A similar issue is as previously described. That is it automatically downloads well the ranking of the URLs within the single queues. Web pages and follows links in the pages, this way moving During the setup of a crawl, it must be decided what URLs from one Webpage to another. get what priorities and get thus removed either early or late from a queue to be processed further. If a frontier is not set any limit and if the crawler disposes over unlimited hardware resources, it may grow indefinitely. This can be avoided by limiting the growth of the frontier, either by, e.g., restricting the number of pages the crawler may download from a domain, or by restricting the number of overall visited Web sites, what would at the same time limit the scope of the crawl. Whatever frontier strategy is chosen, the crawler proceeds in the same way with the URLs it gets from the frontier. III. SYSTEM DESIGN AND IMPLEMENTATION Architectural Details: Our main goal is to realize a fully distributed, Peer-to-Peer web crawler framework and highlight the features, advantages and credibility of such a system. Our system, Fig :General architecture of a Web Crawler named JADE, follows a fully decentralized distributed We will now shortly describe the basic steps that architecture. A fully decentralized architecture means that a crawler is executing. What it basically does is executing there will be no central server or control entity, all the different specific steps in a sequential way. The crawler starts different components are considered to be of equal status (i.e. by taking a set of seed pages, i.e. the URLs (Uniform peers). The system uses an overlay network, which could be a Resource Locator) which it starts with. It uses the URLs to local network for peer-to-peer communication and an build its frontier, i.e. the list (that is a queue) of unvisited underlay network which is the network form which URLs of the crawler. In the scope of one crawl this frontier is information is crawled and indexed. dynamic as it is extended by the URLs extracted from already The overlay system provides a fully equipped framework visited pages. The edge of a frontier will be limited by the for peer-to-peer communication. The basic requirements number of URLs found in all downloaded documents (and by from such a network are an efficient communication politeness restrictions that are followed for different servers). platform, an environment for distributed data management So once a URL is taken from the frontier queue it traverses and retrieval, a fault tolerance and self-administering and the following steps: peer managing network. 1. The crawler scheduler checks whether this page Our system manly comprises of peer-entities, which form is intended to be fetched, i.e. whether there are the atomic units of the system and can be used as standalone no rules or policies that exclude this URL. or in a network. The following diagram shows the structural 2. The document the URL points to is fetched by components of a single peer-entity; the multithreaded downloader. 354 All Rights Reserved © 2012 IJARCET
  • 3. ISSN: 2278 – 1323 International Journal of Advanced Research in Computer Engineering & Technology Volume 1, Issue 4, June 2012 are instantly searchable (in contrast to batch-processing of other search engine software). THE INDEXER The page indexing is done by the creation of a 'reverse word index' (RWI): every page is parsed, the words are extracted and for every word a database table is maintained. The database tables are held in a file-based hash-table, so accessing a word index is extremely fast, resulting in an extremely fast search. The RWIs are hashed form in the database which leads to that the information not stored in plaintext and therefore the security of the index holder rises, since it is not possible to conclude who created the data. At the end of every crawling procedure the index is distributed over the peers participating in the P2P network. Only index entries in the form of URLs are stored, no other caching is performed. Jade implements its index structure as a distributed hash table (DHT): a hash function is applied to every index entry; the entry is then distributed to the appropriate peer. The resulting index contains no information about the origin of the keywords stored in it. Moreover, shared filters offer customized protection against undesirable content. THE DATABASE The database stores all indexed data provided by the Fig : A peer entity indexer and the P2P, which is added to the network. They are each peer the data to fit his DHT and through the assigned It consists of following components; migration index of the directory. The structure of the database will be balanced, binary search tree (AVL tree)  Crawler formed to a logarithmic search time on the number of  Indexer elements in the tree. AVL, The name derives from the  Database component inventors and Adelson-Velsky Landis by whom this data If you are using Word, use either the Microsoft Equation structure for balanced data distribution was developed in Editor or the MathType add-on (http://www.mathtype.com) 1962. The AVL property ensures maximum performance in for equations in your paper (Insert | Object | Create New | terms of algorithmic order. Microsoft Equation or MathType Equation). “Float over Tree nodes can be dynamically allocated and text” should not be selected. de-allocated and an unused-node list is maintained. For the PLASMA search algorithm, an ordered access to search results are necessary, therefore we needed an indexing IV. SYSTEM INTERNALS mechanism which stores the index in an ordered way. The database supports such access, and the resulting database THE CRAWLER tables are stored as a single file. It is completely As mentioned earlier, the main goal of our system self-organizing and does not need any set-up or maintenance was to implement a fully distributed, P2P web crawler. tasks that must be done by an administrator. Any database Traditionally, crawling process consisted of recursively may grow to an unthinkable number of records: with one requesting for a webpage, extracting the links from that page, billion records a database request needs a theoretical and then requesting the pages from the extracted links. Each maximum number of only 44 comparisons. page is parsed, indexed for keywords or other parameters and We have implemented Kelondro database then links fro the page are extracted. The crawler then calls subsystem for realizing the above mentioned requirements the extracted pages and thus the process continues. and features. The Kelondro database, is an open source AVL Apart from the above mentioned crawling method, based database structure, which provides all he necessary another method exists, known as the proxy method. By using schema, functions and methods for inserting, querying, a web proxy, that allows users to access pages from the web modifying a AVL tree based database. through the proxy, we could index and parse the pages that pass through the proxy. Thus only the pages visited by the user will be indexed and parsed. Thus the user unknowingly contributes in the indexing of the pages. The local caching of the visited pages improves the access time of the system. Advanced filtering can also be performed easily on the local cache of visited pages. Our system runs a large number of processes which operate on data stacks and data queues that are filled during a web crawl and indexing process. The proposed system does a real-time indexing, that means all pages that pass the crawler 355 All Rights Reserved © 2012 IJARCET
  • 4. ISSN: 2278 – 1323 International Journal of Advanced Research in Computer Engineering & Technology Volume 1, Issue 4, June 2012 V. INFORMATION FLOW plasmaIndex - Objects are the true reverse words indexes. In plasmaIndex the plasmaIndexEntry - objects are stored in a kelondroTree; an indexed file in the file system. 8 SEARCH/ QUERY FLOW Fig : Crawler Information Flow Diagram The above diagram shows the flow of information in the crawler system. The HTML file from a crawled URL is loaded onto the httpProxyServelet module. This module is Fig: The Search/ Query Information Flow Diagram. the proxy process that runs in the background. The above diagram shows the flow of a user query or a The file is then transferred to htpProxyCache module, keyword search in the system. which provides the proxy cache and where the processing of The keyword or the query entered by the user is passed the file is delayed until the proxy is idle. The cache entry is onto the httpdFileServelet process, which accepts the passed on to the plasmaSwitchboard module. This is the core information and passes it to the plasmaSwitchBoard module. module that forms the central part of the system.. The query is validated and checked for consistency before There the URL is stored into plasmaLURL where the URL passing it to plasmaSearch module, which is the search is stored under a specific hash. The URL's from the content function on the index. In plasmaSearch, the are stripped off, stored in plasmaLURL with a 'wrong' date plasmaSearchResult object is generated by simultaneous (the date of the URL's are not known at this time, only after enumeration of URL hashes in the reverse word indexes fetching) and stacked with plasmaCrawlerTextStack. plasmaIndex. The result page is then generated from this The content is read and splitted into rated words in plasmaSearchResult object plasmaCondenser. The splitted words are then integrated into the index with plasmaSearch. In plasmaSearch the words are VI. SYSTEM ANALYSIS indexed by reversing the relation between URL and words: In this section we shall discuss certain security aspects, one URL points to many words, the words within the software structure, advantages, disadvantages and future document at the URL. After reversing, one word points to scope of the proposed system. many URL's, all the URL's where the word occurrs. One SECURITY & PRIVACY ASPECTS single word->URL-hash relation is stored in The system largely plasmaIndexEntry. A set of plasmaIndexEntries is a reverse Sharing the index to other users may arise privacy concerns. word index. This reverse word index is stored temporarly in The following properties were decide and implemented to plasmaIndexCache. take care of security & privacy concerns; In plasmaIndexCache the single plasmaIndexEntry'ies are Private Index and The local word index does not only collected and stored into a plasmaIndex – entry. These Index Movement contain information that a peer created by 356 All Rights Reserved © 2012 IJARCET
  • 5. ISSN: 2278 – 1323 International Journal of Advanced Research in Computer Engineering & Technology Volume 1, Issue 4, June 2012 surfing the internet, but also entries from VII. CONCLUSION other peers. Word index files travel along Thus, one can note that as the size and usage of the the proxy peers to form a distributed hash WWW increases, the use of a good Information retrieval table. Therefore nobody can argue that system comprising of indexers and crawlers becomes trivial. information that is provided by this peer The current web information retrieval systems provide too was also retrieved by this peer and much of censorship and restrictive policies. There is a need therefore by the peer’s personal use of the for a distributed, free and a collective view of the task of internet. In fact it is very unlikely that information retrieval and web page indexing and caching. information that can be found on a peer The system proposed aims to provide these views and design was created by the peer itself, since the goals. search process targets only peers where it is likely because of the movement of the The use of a censorship-free policy avoids all the index to form the distributed hash table. restrictions provided by current systems and enables full During a test phase, all word indexes on a coverage of the WWW as well as of “hidden web” or the peer will be accessible. The future “deep web”. This is not possible using existing systems. Also production release will constraint searches the use of Distributed Hash Tables (DHTs) and key based to indexes entries on the peer that have routing provides a solid framework for a distributed been created by other peers, which will peer-to-peer network architecture. The proxy provides an ensure complete browsing privacy. added functionality of caching web pages visited by the user, Word Index The words that are stored in the client’s which is performed in the background. Storage and local word index are stored using a word Content hash. That means that not any word is We believe that the system proposed by us will Responsibility stored, but only the word hash. You prove to be a complete implementation of a fully distributed, cannot find any word that is indexed as peer-to-peer architecture for web crawling, to be used on clear text. You can also not re-translate the small networks or campus networks. We believe that we have word hashes into the original word. This clearly stated the requirements, implementation details and means that you don't know actually which usage advantages regarding such a system and highlighted its words are stored in your system. The purpose. positive effect is, that you cannot be responsible for the words that are stored in VIII. REFERENCES your peer. But if you want to deny storage 1. Kobayashi, M. and Takeda, K. (2000). "Information of specific words, you can put them into retrieval on the web". ACM Computing Surveys the 'bluelist' (in the file (ACM Press) 32 (2): 144–173. httpProxy.bluelist). No word that is in the 2. Boldi, Paolo; Bruno Codenotti, Massimo Santini, bluelist can be stored, searched or even Sebastiano Vigna (2004). "UbiCrawler: a scalable viewed through the proxy. fully distributed Web crawler". Software: Practice Peer Information that is passed from one peer and Experience. Communication to another is encoded. That means that no 3. Heydon, Allan; Najork, Marc (1999-06-26) Encryption information like search words, indexed Mercator: A Scalable, Extensible Web Crawler. URL's or URL descriptions is transported http://www.cindoc.csic.es/cybermetrics/pdf/68.pdf. in clear text. Network sniffers cannot see the content that is exchanged. We also 4. Brin, S. and Page, L. (1998). The anatomy of a implemented an encryption method, large-scale hypertextual Web search engine. where a temporary key, created by the 5. Zeinalipour-Yazti, D. and Dikaiakos, M. D. (2002). requesting peer is used to encrypt the Design and implementation of a distributed crawler response (not yet active in test release, but and filtering processor. In Proceedings of the Fifth non-ascii/base64 - encoding is in place). Next Generation Information Technologies and Access The proxy contains a two-stage access Systems (NGITS), volume 2382 of Lecture Notes in Restrictions control: IP filter check and an Computer Science, pages 58–74, Caesarea, Israel. account/password gateway that can be Springer. configured to access the proxy. The 6. Ali Ghodsi. Distributed k-ary System: Algorithms default setting denies access to your proxy from the internet, but allowes usage from for Distributed Hash Tables. KTH-Royal Institute of the intranet. The proxy and it's security Technology, 2006. settings can be configured using the 7. Goyal, Vikram (2003), Using the Jakarta Commons, built-in web server for service pages; the Part I, access to this service pages itself can also http://www.onjava.com/pub/a/onjava/2003/06/25/c be restricted again by using an IP filter and ommons.html an account/password combination. 357 All Rights Reserved © 2012 IJARCET