SlideShare a Scribd company logo
1 of 32
Download to read offline
Outline                           WIRE Project                   Web Crawler               Conclusions




            WIRE: an Open Source Web Information
                    Retrieval Environment

                           Carlos Castillo and Ricardo Baeza-Yates
                                            Center for Web Research
                                             http://www.cwr.cl/
                                          CS Dept., University of Chile


                                              OSWIR 2005
                                           Compiegne, France
                                           September 19, 2005

Carlos Castillo and Ricardo Baeza-Yates                                        Center for Web Research
WIRE: an Open Source Web Information Retrieval Environment                        http://www.cwr.cl/
Outline                           WIRE Project               Web Crawler               Conclusions




          1 WIRE Project



          2 Web Crawler



          3 Conclusions




Carlos Castillo and Ricardo Baeza-Yates                                    Center for Web Research
WIRE: an Open Source Web Information Retrieval Environment                    http://www.cwr.cl/
Outline                           WIRE Project               Web Crawler               Conclusions



Motivation


          Study subsets of the Web (1-50 million pages)
            V We want high performance
            V We want to keep as much data as possible
            V We want to study scheduling algorithms
            X wget is not enough
            X Large-scale crawlers were not publicly available




Carlos Castillo and Ricardo Baeza-Yates                                    Center for Web Research
WIRE: an Open Source Web Information Retrieval Environment                    http://www.cwr.cl/
Outline                           WIRE Project               Web Crawler               Conclusions



Motivation


          Study subsets of the Web (1-50 million pages)
            V We want high performance
            V We want to keep as much data as possible
            V We want to study scheduling algorithms
            X wget is not enough
            X Large-scale crawlers were not publicly available




Carlos Castillo and Ricardo Baeza-Yates                                    Center for Web Research
WIRE: an Open Source Web Information Retrieval Environment                    http://www.cwr.cl/
Outline                           WIRE Project               Web Crawler               Conclusions



Motivation


          Study subsets of the Web (1-50 million pages)
            V We want high performance
            V We want to keep as much data as possible
            V We want to study scheduling algorithms
            X wget is not enough
            X Large-scale crawlers were not publicly available




Carlos Castillo and Ricardo Baeza-Yates                                    Center for Web Research
WIRE: an Open Source Web Information Retrieval Environment                    http://www.cwr.cl/
Outline                           WIRE Project               Web Crawler               Conclusions



Motivation


          Study subsets of the Web (1-50 million pages)
            V We want high performance
            V We want to keep as much data as possible
            V We want to study scheduling algorithms
            X wget is not enough
            X Large-scale crawlers were not publicly available




Carlos Castillo and Ricardo Baeza-Yates                                    Center for Web Research
WIRE: an Open Source Web Information Retrieval Environment                    http://www.cwr.cl/
Outline                           WIRE Project               Web Crawler               Conclusions



Motivation


          Study subsets of the Web (1-50 million pages)
            V We want high performance
            V We want to keep as much data as possible
            V We want to study scheduling algorithms
            X wget is not enough
            X Large-scale crawlers were not publicly available




Carlos Castillo and Ricardo Baeza-Yates                                    Center for Web Research
WIRE: an Open Source Web Information Retrieval Environment                    http://www.cwr.cl/
Outline                           WIRE Project               Web Crawler               Conclusions



Motivation


          Study subsets of the Web (1-50 million pages)
            V We want high performance
            V We want to keep as much data as possible
            V We want to study scheduling algorithms
            X wget is not enough
            X Large-scale crawlers were not publicly available




Carlos Castillo and Ricardo Baeza-Yates                                    Center for Web Research
WIRE: an Open Source Web Information Retrieval Environment                    http://www.cwr.cl/
Outline                           WIRE Project                     Web Crawler                      Conclusions



General Architecture

                                                             XML Index           XML Search
                      Focused Crawling




                                                                                  Text Search
                                                             Text Index
                  Crawling                Collection
                                                              Statistics



                  Importing                                   Extracting



                              Clustering         Classification



Carlos Castillo and Ricardo Baeza-Yates                                                 Center for Web Research
WIRE: an Open Source Web Information Retrieval Environment                                 http://www.cwr.cl/
Outline                           WIRE Project               Web Crawler               Conclusions



Characteristics



          b Roughly 25,000 lines of open-source C/C++ code
          L Asynchronous DNS and HTTP requests, small memory
            and processing requirements (except during the analysis)
          V Highly configurable: rate of download, parser parameters,
            scheduling policy, etc.




Carlos Castillo and Ricardo Baeza-Yates                                    Center for Web Research
WIRE: an Open Source Web Information Retrieval Environment                    http://www.cwr.cl/
Outline                           WIRE Project               Web Crawler               Conclusions



Characteristics



          b Roughly 25,000 lines of open-source C/C++ code
          L Asynchronous DNS and HTTP requests, small memory
            and processing requirements (except during the analysis)
          V Highly configurable: rate of download, parser parameters,
            scheduling policy, etc.




Carlos Castillo and Ricardo Baeza-Yates                                    Center for Web Research
WIRE: an Open Source Web Information Retrieval Environment                    http://www.cwr.cl/
Outline                           WIRE Project               Web Crawler               Conclusions



Characteristics



          b Roughly 25,000 lines of open-source C/C++ code
          L Asynchronous DNS and HTTP requests, small memory
            and processing requirements (except during the analysis)
          V Highly configurable: rate of download, parser parameters,
            scheduling policy, etc.




Carlos Castillo and Ricardo Baeza-Yates                                    Center for Web Research
WIRE: an Open Source Web Information Retrieval Environment                    http://www.cwr.cl/
Outline                           WIRE Project                        Web Crawler                      Conclusions



Web Crawler
                                                       Manager
                                                 Page score calculations
                                                 Long-term scheduling




                       Seeder                                                    Harvester
                                                       Collection
                    Link resolving                                          Short-term scheduling
                   Robots exclusions                                          Network transfers




                                                      Gatherer
                                                       Parsing
                                                    Link extraction


Carlos Castillo and Ricardo Baeza-Yates                                                    Center for Web Research
WIRE: an Open Source Web Information Retrieval Environment                                    http://www.cwr.cl/
Outline                           WIRE Project                    Web Crawler                     Conclusions



Scheduling


                                                 Future      Current
                                                                           =    Profit
                                                 Value        Value



                                                }
                      quality             0.4
             P1       freshness           0.1                              = Profit: 0.36
                                                    0.4       0.04
                      visited?            1



                                                }
                      quality             0.7
             P2       freshness           0.9                              = Profit: 0.07
                                                              0.63
                                                    0.7
                      visited?            1



                                                }
                      quality             0.6
                      freshness           -                               = Profit: 0.6
             P3                                     0.6       0
                      visited?            0

Carlos Castillo and Ricardo Baeza-Yates                                              Center for Web Research
WIRE: an Open Source Web Information Retrieval Environment                               http://www.cwr.cl/
Outline                            WIRE Project                         Web Crawler                            Conclusions



Downloading pages


                                                                  World Wide Web




          Web sites           S1          S2          S3          S4          S5          S6          S7
                                   P1,1        P2,1        P3,1        P4,1        P5,1        P6,1        P7,1
                                   P1,2        P2,2        P3,2        P4,2        P5,2        P6,2        P7,2
                                   P1,3        P2,3                    P4,3        P5,3        P6,2        P7,3
          Web pages
                                   P1,4        P2,4                    P4,4        P5,4                    P7,4
                                               P2,5                    P4,5                                P7,5
                                               P2,6
Carlos Castillo and Ricardo Baeza-Yates                                                           Center for Web Research
WIRE: an Open Source Web Information Retrieval Environment                                            http://www.cwr.cl/
Outline                           WIRE Project                   Web Crawler                   Conclusions



Storing contents
                                Document

                                                         1         hash(       )
                                                 Content seen?

                                      2



                                                   3
                                                             Disk Storage


                                     Free space list

Carlos Castillo and Ricardo Baeza-Yates                                            Center for Web Research
WIRE: an Open Source Web Information Retrieval Environment                            http://www.cwr.cl/
Outline                           WIRE Project                   Web Crawler                       Conclusions



URL parsing

                                     http://host.domain.com/dir/file.html
                            1

                                                                    3
                h1('host.domain.com')


                                                                   h2('235 dir/file.html')




                host.domain.com 235
                                                 2
                                                             235 path/file.html 9421
                                                                                      4
                            SITE-ID = 235; DOC-ID = 9421

Carlos Castillo and Ricardo Baeza-Yates                                              Center for Web Research
WIRE: an Open Source Web Information Retrieval Environment                                http://www.cwr.cl/
Outline                           WIRE Project               Web Crawler               Conclusions



Practical problems


          Z  The devil is in the details
          §  Varying quality of service
          §  Wrong DNS records, temporary DNS failures
          §  HTTP responses without headers, with wrong headers,
             dates
           § HTML parsing has to be very tolerant
           § Duplicate pages, session-ids, etc.




Carlos Castillo and Ricardo Baeza-Yates                                    Center for Web Research
WIRE: an Open Source Web Information Retrieval Environment                    http://www.cwr.cl/
Outline                           WIRE Project               Web Crawler               Conclusions



Practical problems


          Z  The devil is in the details
          §  Varying quality of service
          §  Wrong DNS records, temporary DNS failures
          §  HTTP responses without headers, with wrong headers,
             dates
           § HTML parsing has to be very tolerant
           § Duplicate pages, session-ids, etc.




Carlos Castillo and Ricardo Baeza-Yates                                    Center for Web Research
WIRE: an Open Source Web Information Retrieval Environment                    http://www.cwr.cl/
Outline                           WIRE Project               Web Crawler               Conclusions



Practical problems


          Z  The devil is in the details
          §  Varying quality of service
          §  Wrong DNS records, temporary DNS failures
          §  HTTP responses without headers, with wrong headers,
             dates
           § HTML parsing has to be very tolerant
           § Duplicate pages, session-ids, etc.




Carlos Castillo and Ricardo Baeza-Yates                                    Center for Web Research
WIRE: an Open Source Web Information Retrieval Environment                    http://www.cwr.cl/
Outline                           WIRE Project               Web Crawler               Conclusions



Practical problems


          Z  The devil is in the details
          §  Varying quality of service
          §  Wrong DNS records, temporary DNS failures
          §  HTTP responses without headers, with wrong headers,
             dates
           § HTML parsing has to be very tolerant
           § Duplicate pages, session-ids, etc.




Carlos Castillo and Ricardo Baeza-Yates                                    Center for Web Research
WIRE: an Open Source Web Information Retrieval Environment                    http://www.cwr.cl/
Outline                           WIRE Project               Web Crawler               Conclusions



Practical problems


          Z  The devil is in the details
          §  Varying quality of service
          §  Wrong DNS records, temporary DNS failures
          §  HTTP responses without headers, with wrong headers,
             dates
           § HTML parsing has to be very tolerant
           § Duplicate pages, session-ids, etc.




Carlos Castillo and Ricardo Baeza-Yates                                    Center for Web Research
WIRE: an Open Source Web Information Retrieval Environment                    http://www.cwr.cl/
Outline                           WIRE Project               Web Crawler               Conclusions



Practical problems


          Z  The devil is in the details
          §  Varying quality of service
          §  Wrong DNS records, temporary DNS failures
          §  HTTP responses without headers, with wrong headers,
             dates
           § HTML parsing has to be very tolerant
           § Duplicate pages, session-ids, etc.




Carlos Castillo and Ricardo Baeza-Yates                                    Center for Web Research
WIRE: an Open Source Web Information Retrieval Environment                    http://www.cwr.cl/
Outline                           WIRE Project               Web Crawler               Conclusions



Data analysis


          b Includes link analysis and extraction of statistics (data is
            exported as .csv files)
          b Reports are generated using LTEXand gnuplot
                                          A

          b Report about documents: histograms of size, in- and
            out-degree, link scores, page depth, HTTP responses,
            age, media types, etc.
          b Report about sites: degree distribution in the hostgraph,
            maximum depth, pages per site, link structure, etc.



Carlos Castillo and Ricardo Baeza-Yates                                    Center for Web Research
WIRE: an Open Source Web Information Retrieval Environment                    http://www.cwr.cl/
Outline                           WIRE Project               Web Crawler               Conclusions



Data analysis


          b Includes link analysis and extraction of statistics (data is
            exported as .csv files)
          b Reports are generated using LTEXand gnuplot
                                          A

          b Report about documents: histograms of size, in- and
            out-degree, link scores, page depth, HTTP responses,
            age, media types, etc.
          b Report about sites: degree distribution in the hostgraph,
            maximum depth, pages per site, link structure, etc.



Carlos Castillo and Ricardo Baeza-Yates                                    Center for Web Research
WIRE: an Open Source Web Information Retrieval Environment                    http://www.cwr.cl/
Outline                           WIRE Project               Web Crawler               Conclusions



Data analysis


          b Includes link analysis and extraction of statistics (data is
            exported as .csv files)
          b Reports are generated using LTEXand gnuplot
                                          A

          b Report about documents: histograms of size, in- and
            out-degree, link scores, page depth, HTTP responses,
            age, media types, etc.
          b Report about sites: degree distribution in the hostgraph,
            maximum depth, pages per site, link structure, etc.



Carlos Castillo and Ricardo Baeza-Yates                                    Center for Web Research
WIRE: an Open Source Web Information Retrieval Environment                    http://www.cwr.cl/
Outline                           WIRE Project               Web Crawler               Conclusions



Data analysis


          b Includes link analysis and extraction of statistics (data is
            exported as .csv files)
          b Reports are generated using LTEXand gnuplot
                                          A

          b Report about documents: histograms of size, in- and
            out-degree, link scores, page depth, HTTP responses,
            age, media types, etc.
          b Report about sites: degree distribution in the hostgraph,
            maximum depth, pages per site, link structure, etc.



Carlos Castillo and Ricardo Baeza-Yates                                    Center for Web Research
WIRE: an Open Source Web Information Retrieval Environment                    http://www.cwr.cl/
Outline                           WIRE Project                Web Crawler               Conclusions



Conclusions


           V A tool for Web characterization studies
           V Can be extended for other purposes
           V Code and documentation available at
             http://www.cwr.cl/projects/

                                                 Thank you.




Carlos Castillo and Ricardo Baeza-Yates                                     Center for Web Research
WIRE: an Open Source Web Information Retrieval Environment                     http://www.cwr.cl/
Outline                           WIRE Project                Web Crawler               Conclusions



Conclusions


           V A tool for Web characterization studies
           V Can be extended for other purposes
           V Code and documentation available at
             http://www.cwr.cl/projects/

                                                 Thank you.




Carlos Castillo and Ricardo Baeza-Yates                                     Center for Web Research
WIRE: an Open Source Web Information Retrieval Environment                     http://www.cwr.cl/
Outline                           WIRE Project                Web Crawler               Conclusions



Conclusions


           V A tool for Web characterization studies
           V Can be extended for other purposes
           V Code and documentation available at
             http://www.cwr.cl/projects/

                                                 Thank you.




Carlos Castillo and Ricardo Baeza-Yates                                     Center for Web Research
WIRE: an Open Source Web Information Retrieval Environment                     http://www.cwr.cl/
Outline                           WIRE Project                Web Crawler               Conclusions



Conclusions


           V A tool for Web characterization studies
           V Can be extended for other purposes
           V Code and documentation available at
             http://www.cwr.cl/projects/

                                                 Thank you.




Carlos Castillo and Ricardo Baeza-Yates                                     Center for Web Research
WIRE: an Open Source Web Information Retrieval Environment                     http://www.cwr.cl/
Outline                           WIRE Project                Web Crawler               Conclusions



Conclusions


           V A tool for Web characterization studies
           V Can be extended for other purposes
           V Code and documentation available at
             http://www.cwr.cl/projects/

                                                 Thank you.




Carlos Castillo and Ricardo Baeza-Yates                                     Center for Web Research
WIRE: an Open Source Web Information Retrieval Environment                     http://www.cwr.cl/

More Related Content

Similar to WIRE, an open-source information retrieval environment (OSWIR 2005 Compiegne)

Ceph used in Cancer Research at OICR
Ceph used in Cancer Research at OICRCeph used in Cancer Research at OICR
Ceph used in Cancer Research at OICRCeph Community
 
Time -Travel on the Internet
Time -Travel on the InternetTime -Travel on the Internet
Time -Travel on the InternetIRJET Journal
 
Invincea: Reasoning in Incident Response in Tapio
Invincea: Reasoning in Incident Response in TapioInvincea: Reasoning in Incident Response in Tapio
Invincea: Reasoning in Incident Response in TapioInvincea, Inc.
 
DataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedIn
DataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedInDataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedIn
DataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedInHakka Labs
 
Technologie Proche: Imagining the Archival Systems of Tomorrow With the Tools...
Technologie Proche: Imagining the Archival Systems of Tomorrow With the Tools...Technologie Proche: Imagining the Archival Systems of Tomorrow With the Tools...
Technologie Proche: Imagining the Archival Systems of Tomorrow With the Tools...Artefactual Systems - AtoM
 
Accelerating your Research with Microsoft Azure (June 2015)
Accelerating your Research with Microsoft Azure (June 2015)Accelerating your Research with Microsoft Azure (June 2015)
Accelerating your Research with Microsoft Azure (June 2015)Microsoft Azure for Research
 
GENI Engineering Conference -- Ian Foster
GENI Engineering Conference -- Ian FosterGENI Engineering Conference -- Ian Foster
GENI Engineering Conference -- Ian FosterIan Foster
 
The Web of data and web data commons
The Web of data and web data commonsThe Web of data and web data commons
The Web of data and web data commonsJesse Wang
 
Devoxx 2010 | LAB : ReST in Java
Devoxx 2010 | LAB : ReST in JavaDevoxx 2010 | LAB : ReST in Java
Devoxx 2010 | LAB : ReST in JavaNGDATA
 
Web archiving challenges and opportunities
Web archiving challenges and opportunitiesWeb archiving challenges and opportunities
Web archiving challenges and opportunitiesAhmed AlSum
 
Intro to Machine Learning with H2O and AWS
Intro to Machine Learning with H2O and AWSIntro to Machine Learning with H2O and AWS
Intro to Machine Learning with H2O and AWSSri Ambati
 
Browser-Based Digital Preservation
Browser-Based Digital PreservationBrowser-Based Digital Preservation
Browser-Based Digital PreservationMat Kelly
 
Building Satori: Web Data Extraction On Hadoop
Building Satori: Web Data Extraction On HadoopBuilding Satori: Web Data Extraction On Hadoop
Building Satori: Web Data Extraction On HadoopNikolai Avteniev
 
Case Study for Ego-centric Citation Network
Case Study for Ego-centric Citation NetworkCase Study for Ego-centric Citation Network
Case Study for Ego-centric Citation NetworkMike Taylor
 
Using e-infrastructures for biodiversity conservation - Gianpaolo Coro (CNR)
Using e-infrastructures for biodiversity conservation - Gianpaolo Coro (CNR)Using e-infrastructures for biodiversity conservation - Gianpaolo Coro (CNR)
Using e-infrastructures for biodiversity conservation - Gianpaolo Coro (CNR)Blue BRIDGE
 
Descriptive Standards and Applications in Memory Institutions
Descriptive Standards and Applications in Memory InstitutionsDescriptive Standards and Applications in Memory Institutions
Descriptive Standards and Applications in Memory InstitutionsE. Murphy
 
Architectures for Data Commons (XLDB 15 Lightning Talk)
Architectures for Data Commons (XLDB 15 Lightning Talk)Architectures for Data Commons (XLDB 15 Lightning Talk)
Architectures for Data Commons (XLDB 15 Lightning Talk)Robert Grossman
 
2012 09 aos-workshop-johanneskeizer
2012 09 aos-workshop-johanneskeizer2012 09 aos-workshop-johanneskeizer
2012 09 aos-workshop-johanneskeizerJohannes Keizer
 

Similar to WIRE, an open-source information retrieval environment (OSWIR 2005 Compiegne) (20)

Ceph used in Cancer Research at OICR
Ceph used in Cancer Research at OICRCeph used in Cancer Research at OICR
Ceph used in Cancer Research at OICR
 
Time -Travel on the Internet
Time -Travel on the InternetTime -Travel on the Internet
Time -Travel on the Internet
 
Invincea: Reasoning in Incident Response in Tapio
Invincea: Reasoning in Incident Response in TapioInvincea: Reasoning in Incident Response in Tapio
Invincea: Reasoning in Incident Response in Tapio
 
DataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedIn
DataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedInDataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedIn
DataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedIn
 
Technologie Proche: Imagining the Archival Systems of Tomorrow With the Tools...
Technologie Proche: Imagining the Archival Systems of Tomorrow With the Tools...Technologie Proche: Imagining the Archival Systems of Tomorrow With the Tools...
Technologie Proche: Imagining the Archival Systems of Tomorrow With the Tools...
 
Accelerating your Research with Microsoft Azure (June 2015)
Accelerating your Research with Microsoft Azure (June 2015)Accelerating your Research with Microsoft Azure (June 2015)
Accelerating your Research with Microsoft Azure (June 2015)
 
GENI Engineering Conference -- Ian Foster
GENI Engineering Conference -- Ian FosterGENI Engineering Conference -- Ian Foster
GENI Engineering Conference -- Ian Foster
 
The Web of data and web data commons
The Web of data and web data commonsThe Web of data and web data commons
The Web of data and web data commons
 
Devoxx 2010 | LAB : ReST in Java
Devoxx 2010 | LAB : ReST in JavaDevoxx 2010 | LAB : ReST in Java
Devoxx 2010 | LAB : ReST in Java
 
Web archiving challenges and opportunities
Web archiving challenges and opportunitiesWeb archiving challenges and opportunities
Web archiving challenges and opportunities
 
Intro to Machine Learning with H2O and AWS
Intro to Machine Learning with H2O and AWSIntro to Machine Learning with H2O and AWS
Intro to Machine Learning with H2O and AWS
 
Towards a Web of Services
Towards a Web of ServicesTowards a Web of Services
Towards a Web of Services
 
Browser-Based Digital Preservation
Browser-Based Digital PreservationBrowser-Based Digital Preservation
Browser-Based Digital Preservation
 
Building Satori: Web Data Extraction On Hadoop
Building Satori: Web Data Extraction On HadoopBuilding Satori: Web Data Extraction On Hadoop
Building Satori: Web Data Extraction On Hadoop
 
Case Study for Ego-centric Citation Network
Case Study for Ego-centric Citation NetworkCase Study for Ego-centric Citation Network
Case Study for Ego-centric Citation Network
 
Using e-infrastructures for biodiversity conservation - Gianpaolo Coro (CNR)
Using e-infrastructures for biodiversity conservation - Gianpaolo Coro (CNR)Using e-infrastructures for biodiversity conservation - Gianpaolo Coro (CNR)
Using e-infrastructures for biodiversity conservation - Gianpaolo Coro (CNR)
 
Publishing Linked Data from RDB
Publishing Linked Data from RDBPublishing Linked Data from RDB
Publishing Linked Data from RDB
 
Descriptive Standards and Applications in Memory Institutions
Descriptive Standards and Applications in Memory InstitutionsDescriptive Standards and Applications in Memory Institutions
Descriptive Standards and Applications in Memory Institutions
 
Architectures for Data Commons (XLDB 15 Lightning Talk)
Architectures for Data Commons (XLDB 15 Lightning Talk)Architectures for Data Commons (XLDB 15 Lightning Talk)
Architectures for Data Commons (XLDB 15 Lightning Talk)
 
2012 09 aos-workshop-johanneskeizer
2012 09 aos-workshop-johanneskeizer2012 09 aos-workshop-johanneskeizer
2012 09 aos-workshop-johanneskeizer
 

More from Carlos Castillo (ChaTo)

Finding High Quality Content in Social Media
Finding High Quality Content in Social MediaFinding High Quality Content in Social Media
Finding High Quality Content in Social MediaCarlos Castillo (ChaTo)
 
Socia Media and Digital Volunteering in Disaster Management @ DSEM 2017
Socia Media and Digital Volunteering in Disaster Management @ DSEM 2017Socia Media and Digital Volunteering in Disaster Management @ DSEM 2017
Socia Media and Digital Volunteering in Disaster Management @ DSEM 2017Carlos Castillo (ChaTo)
 
Detecting Algorithmic Bias (keynote at DIR 2016)
Detecting Algorithmic Bias (keynote at DIR 2016)Detecting Algorithmic Bias (keynote at DIR 2016)
Detecting Algorithmic Bias (keynote at DIR 2016)Carlos Castillo (ChaTo)
 

More from Carlos Castillo (ChaTo) (20)

Finding High Quality Content in Social Media
Finding High Quality Content in Social MediaFinding High Quality Content in Social Media
Finding High Quality Content in Social Media
 
When no clicks are good news
When no clicks are good newsWhen no clicks are good news
When no clicks are good news
 
Socia Media and Digital Volunteering in Disaster Management @ DSEM 2017
Socia Media and Digital Volunteering in Disaster Management @ DSEM 2017Socia Media and Digital Volunteering in Disaster Management @ DSEM 2017
Socia Media and Digital Volunteering in Disaster Management @ DSEM 2017
 
Detecting Algorithmic Bias (keynote at DIR 2016)
Detecting Algorithmic Bias (keynote at DIR 2016)Detecting Algorithmic Bias (keynote at DIR 2016)
Detecting Algorithmic Bias (keynote at DIR 2016)
 
Discrimination Discovery
Discrimination DiscoveryDiscrimination Discovery
Discrimination Discovery
 
Fairness-Aware Data Mining
Fairness-Aware Data MiningFairness-Aware Data Mining
Fairness-Aware Data Mining
 
Big Crisis Data for ISPC
Big Crisis Data for ISPCBig Crisis Data for ISPC
Big Crisis Data for ISPC
 
Databeers: Big Crisis Data
Databeers: Big Crisis DataDatabeers: Big Crisis Data
Databeers: Big Crisis Data
 
Observational studies in social media
Observational studies in social mediaObservational studies in social media
Observational studies in social media
 
Natural experiments
Natural experimentsNatural experiments
Natural experiments
 
Content-based link prediction
Content-based link predictionContent-based link prediction
Content-based link prediction
 
Link prediction
Link predictionLink prediction
Link prediction
 
Recommender Systems
Recommender SystemsRecommender Systems
Recommender Systems
 
Graph Partitioning and Spectral Methods
Graph Partitioning and Spectral MethodsGraph Partitioning and Spectral Methods
Graph Partitioning and Spectral Methods
 
Finding Dense Subgraphs
Finding Dense SubgraphsFinding Dense Subgraphs
Finding Dense Subgraphs
 
Graph Evolution Models
Graph Evolution ModelsGraph Evolution Models
Graph Evolution Models
 
Link-Based Ranking
Link-Based RankingLink-Based Ranking
Link-Based Ranking
 
Text Indexing / Inverted Indices
Text Indexing / Inverted IndicesText Indexing / Inverted Indices
Text Indexing / Inverted Indices
 
Indexing
IndexingIndexing
Indexing
 
Text Summarization
Text SummarizationText Summarization
Text Summarization
 

Recently uploaded

04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 

Recently uploaded (20)

04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 

WIRE, an open-source information retrieval environment (OSWIR 2005 Compiegne)

  • 1. Outline WIRE Project Web Crawler Conclusions WIRE: an Open Source Web Information Retrieval Environment Carlos Castillo and Ricardo Baeza-Yates Center for Web Research http://www.cwr.cl/ CS Dept., University of Chile OSWIR 2005 Compiegne, France September 19, 2005 Carlos Castillo and Ricardo Baeza-Yates Center for Web Research WIRE: an Open Source Web Information Retrieval Environment http://www.cwr.cl/
  • 2. Outline WIRE Project Web Crawler Conclusions 1 WIRE Project 2 Web Crawler 3 Conclusions Carlos Castillo and Ricardo Baeza-Yates Center for Web Research WIRE: an Open Source Web Information Retrieval Environment http://www.cwr.cl/
  • 3. Outline WIRE Project Web Crawler Conclusions Motivation Study subsets of the Web (1-50 million pages) V We want high performance V We want to keep as much data as possible V We want to study scheduling algorithms X wget is not enough X Large-scale crawlers were not publicly available Carlos Castillo and Ricardo Baeza-Yates Center for Web Research WIRE: an Open Source Web Information Retrieval Environment http://www.cwr.cl/
  • 4. Outline WIRE Project Web Crawler Conclusions Motivation Study subsets of the Web (1-50 million pages) V We want high performance V We want to keep as much data as possible V We want to study scheduling algorithms X wget is not enough X Large-scale crawlers were not publicly available Carlos Castillo and Ricardo Baeza-Yates Center for Web Research WIRE: an Open Source Web Information Retrieval Environment http://www.cwr.cl/
  • 5. Outline WIRE Project Web Crawler Conclusions Motivation Study subsets of the Web (1-50 million pages) V We want high performance V We want to keep as much data as possible V We want to study scheduling algorithms X wget is not enough X Large-scale crawlers were not publicly available Carlos Castillo and Ricardo Baeza-Yates Center for Web Research WIRE: an Open Source Web Information Retrieval Environment http://www.cwr.cl/
  • 6. Outline WIRE Project Web Crawler Conclusions Motivation Study subsets of the Web (1-50 million pages) V We want high performance V We want to keep as much data as possible V We want to study scheduling algorithms X wget is not enough X Large-scale crawlers were not publicly available Carlos Castillo and Ricardo Baeza-Yates Center for Web Research WIRE: an Open Source Web Information Retrieval Environment http://www.cwr.cl/
  • 7. Outline WIRE Project Web Crawler Conclusions Motivation Study subsets of the Web (1-50 million pages) V We want high performance V We want to keep as much data as possible V We want to study scheduling algorithms X wget is not enough X Large-scale crawlers were not publicly available Carlos Castillo and Ricardo Baeza-Yates Center for Web Research WIRE: an Open Source Web Information Retrieval Environment http://www.cwr.cl/
  • 8. Outline WIRE Project Web Crawler Conclusions Motivation Study subsets of the Web (1-50 million pages) V We want high performance V We want to keep as much data as possible V We want to study scheduling algorithms X wget is not enough X Large-scale crawlers were not publicly available Carlos Castillo and Ricardo Baeza-Yates Center for Web Research WIRE: an Open Source Web Information Retrieval Environment http://www.cwr.cl/
  • 9. Outline WIRE Project Web Crawler Conclusions General Architecture XML Index XML Search Focused Crawling Text Search Text Index Crawling Collection Statistics Importing Extracting Clustering Classification Carlos Castillo and Ricardo Baeza-Yates Center for Web Research WIRE: an Open Source Web Information Retrieval Environment http://www.cwr.cl/
  • 10. Outline WIRE Project Web Crawler Conclusions Characteristics b Roughly 25,000 lines of open-source C/C++ code L Asynchronous DNS and HTTP requests, small memory and processing requirements (except during the analysis) V Highly configurable: rate of download, parser parameters, scheduling policy, etc. Carlos Castillo and Ricardo Baeza-Yates Center for Web Research WIRE: an Open Source Web Information Retrieval Environment http://www.cwr.cl/
  • 11. Outline WIRE Project Web Crawler Conclusions Characteristics b Roughly 25,000 lines of open-source C/C++ code L Asynchronous DNS and HTTP requests, small memory and processing requirements (except during the analysis) V Highly configurable: rate of download, parser parameters, scheduling policy, etc. Carlos Castillo and Ricardo Baeza-Yates Center for Web Research WIRE: an Open Source Web Information Retrieval Environment http://www.cwr.cl/
  • 12. Outline WIRE Project Web Crawler Conclusions Characteristics b Roughly 25,000 lines of open-source C/C++ code L Asynchronous DNS and HTTP requests, small memory and processing requirements (except during the analysis) V Highly configurable: rate of download, parser parameters, scheduling policy, etc. Carlos Castillo and Ricardo Baeza-Yates Center for Web Research WIRE: an Open Source Web Information Retrieval Environment http://www.cwr.cl/
  • 13. Outline WIRE Project Web Crawler Conclusions Web Crawler Manager Page score calculations Long-term scheduling Seeder Harvester Collection Link resolving Short-term scheduling Robots exclusions Network transfers Gatherer Parsing Link extraction Carlos Castillo and Ricardo Baeza-Yates Center for Web Research WIRE: an Open Source Web Information Retrieval Environment http://www.cwr.cl/
  • 14. Outline WIRE Project Web Crawler Conclusions Scheduling Future Current = Profit Value Value } quality 0.4 P1 freshness 0.1 = Profit: 0.36 0.4 0.04 visited? 1 } quality 0.7 P2 freshness 0.9 = Profit: 0.07 0.63 0.7 visited? 1 } quality 0.6 freshness - = Profit: 0.6 P3 0.6 0 visited? 0 Carlos Castillo and Ricardo Baeza-Yates Center for Web Research WIRE: an Open Source Web Information Retrieval Environment http://www.cwr.cl/
  • 15. Outline WIRE Project Web Crawler Conclusions Downloading pages World Wide Web Web sites S1 S2 S3 S4 S5 S6 S7 P1,1 P2,1 P3,1 P4,1 P5,1 P6,1 P7,1 P1,2 P2,2 P3,2 P4,2 P5,2 P6,2 P7,2 P1,3 P2,3 P4,3 P5,3 P6,2 P7,3 Web pages P1,4 P2,4 P4,4 P5,4 P7,4 P2,5 P4,5 P7,5 P2,6 Carlos Castillo and Ricardo Baeza-Yates Center for Web Research WIRE: an Open Source Web Information Retrieval Environment http://www.cwr.cl/
  • 16. Outline WIRE Project Web Crawler Conclusions Storing contents Document 1 hash( ) Content seen? 2 3 Disk Storage Free space list Carlos Castillo and Ricardo Baeza-Yates Center for Web Research WIRE: an Open Source Web Information Retrieval Environment http://www.cwr.cl/
  • 17. Outline WIRE Project Web Crawler Conclusions URL parsing http://host.domain.com/dir/file.html 1 3 h1('host.domain.com') h2('235 dir/file.html') host.domain.com 235 2 235 path/file.html 9421 4 SITE-ID = 235; DOC-ID = 9421 Carlos Castillo and Ricardo Baeza-Yates Center for Web Research WIRE: an Open Source Web Information Retrieval Environment http://www.cwr.cl/
  • 18. Outline WIRE Project Web Crawler Conclusions Practical problems Z The devil is in the details § Varying quality of service § Wrong DNS records, temporary DNS failures § HTTP responses without headers, with wrong headers, dates § HTML parsing has to be very tolerant § Duplicate pages, session-ids, etc. Carlos Castillo and Ricardo Baeza-Yates Center for Web Research WIRE: an Open Source Web Information Retrieval Environment http://www.cwr.cl/
  • 19. Outline WIRE Project Web Crawler Conclusions Practical problems Z The devil is in the details § Varying quality of service § Wrong DNS records, temporary DNS failures § HTTP responses without headers, with wrong headers, dates § HTML parsing has to be very tolerant § Duplicate pages, session-ids, etc. Carlos Castillo and Ricardo Baeza-Yates Center for Web Research WIRE: an Open Source Web Information Retrieval Environment http://www.cwr.cl/
  • 20. Outline WIRE Project Web Crawler Conclusions Practical problems Z The devil is in the details § Varying quality of service § Wrong DNS records, temporary DNS failures § HTTP responses without headers, with wrong headers, dates § HTML parsing has to be very tolerant § Duplicate pages, session-ids, etc. Carlos Castillo and Ricardo Baeza-Yates Center for Web Research WIRE: an Open Source Web Information Retrieval Environment http://www.cwr.cl/
  • 21. Outline WIRE Project Web Crawler Conclusions Practical problems Z The devil is in the details § Varying quality of service § Wrong DNS records, temporary DNS failures § HTTP responses without headers, with wrong headers, dates § HTML parsing has to be very tolerant § Duplicate pages, session-ids, etc. Carlos Castillo and Ricardo Baeza-Yates Center for Web Research WIRE: an Open Source Web Information Retrieval Environment http://www.cwr.cl/
  • 22. Outline WIRE Project Web Crawler Conclusions Practical problems Z The devil is in the details § Varying quality of service § Wrong DNS records, temporary DNS failures § HTTP responses without headers, with wrong headers, dates § HTML parsing has to be very tolerant § Duplicate pages, session-ids, etc. Carlos Castillo and Ricardo Baeza-Yates Center for Web Research WIRE: an Open Source Web Information Retrieval Environment http://www.cwr.cl/
  • 23. Outline WIRE Project Web Crawler Conclusions Practical problems Z The devil is in the details § Varying quality of service § Wrong DNS records, temporary DNS failures § HTTP responses without headers, with wrong headers, dates § HTML parsing has to be very tolerant § Duplicate pages, session-ids, etc. Carlos Castillo and Ricardo Baeza-Yates Center for Web Research WIRE: an Open Source Web Information Retrieval Environment http://www.cwr.cl/
  • 24. Outline WIRE Project Web Crawler Conclusions Data analysis b Includes link analysis and extraction of statistics (data is exported as .csv files) b Reports are generated using LTEXand gnuplot A b Report about documents: histograms of size, in- and out-degree, link scores, page depth, HTTP responses, age, media types, etc. b Report about sites: degree distribution in the hostgraph, maximum depth, pages per site, link structure, etc. Carlos Castillo and Ricardo Baeza-Yates Center for Web Research WIRE: an Open Source Web Information Retrieval Environment http://www.cwr.cl/
  • 25. Outline WIRE Project Web Crawler Conclusions Data analysis b Includes link analysis and extraction of statistics (data is exported as .csv files) b Reports are generated using LTEXand gnuplot A b Report about documents: histograms of size, in- and out-degree, link scores, page depth, HTTP responses, age, media types, etc. b Report about sites: degree distribution in the hostgraph, maximum depth, pages per site, link structure, etc. Carlos Castillo and Ricardo Baeza-Yates Center for Web Research WIRE: an Open Source Web Information Retrieval Environment http://www.cwr.cl/
  • 26. Outline WIRE Project Web Crawler Conclusions Data analysis b Includes link analysis and extraction of statistics (data is exported as .csv files) b Reports are generated using LTEXand gnuplot A b Report about documents: histograms of size, in- and out-degree, link scores, page depth, HTTP responses, age, media types, etc. b Report about sites: degree distribution in the hostgraph, maximum depth, pages per site, link structure, etc. Carlos Castillo and Ricardo Baeza-Yates Center for Web Research WIRE: an Open Source Web Information Retrieval Environment http://www.cwr.cl/
  • 27. Outline WIRE Project Web Crawler Conclusions Data analysis b Includes link analysis and extraction of statistics (data is exported as .csv files) b Reports are generated using LTEXand gnuplot A b Report about documents: histograms of size, in- and out-degree, link scores, page depth, HTTP responses, age, media types, etc. b Report about sites: degree distribution in the hostgraph, maximum depth, pages per site, link structure, etc. Carlos Castillo and Ricardo Baeza-Yates Center for Web Research WIRE: an Open Source Web Information Retrieval Environment http://www.cwr.cl/
  • 28. Outline WIRE Project Web Crawler Conclusions Conclusions V A tool for Web characterization studies V Can be extended for other purposes V Code and documentation available at http://www.cwr.cl/projects/ Thank you. Carlos Castillo and Ricardo Baeza-Yates Center for Web Research WIRE: an Open Source Web Information Retrieval Environment http://www.cwr.cl/
  • 29. Outline WIRE Project Web Crawler Conclusions Conclusions V A tool for Web characterization studies V Can be extended for other purposes V Code and documentation available at http://www.cwr.cl/projects/ Thank you. Carlos Castillo and Ricardo Baeza-Yates Center for Web Research WIRE: an Open Source Web Information Retrieval Environment http://www.cwr.cl/
  • 30. Outline WIRE Project Web Crawler Conclusions Conclusions V A tool for Web characterization studies V Can be extended for other purposes V Code and documentation available at http://www.cwr.cl/projects/ Thank you. Carlos Castillo and Ricardo Baeza-Yates Center for Web Research WIRE: an Open Source Web Information Retrieval Environment http://www.cwr.cl/
  • 31. Outline WIRE Project Web Crawler Conclusions Conclusions V A tool for Web characterization studies V Can be extended for other purposes V Code and documentation available at http://www.cwr.cl/projects/ Thank you. Carlos Castillo and Ricardo Baeza-Yates Center for Web Research WIRE: an Open Source Web Information Retrieval Environment http://www.cwr.cl/
  • 32. Outline WIRE Project Web Crawler Conclusions Conclusions V A tool for Web characterization studies V Can be extended for other purposes V Code and documentation available at http://www.cwr.cl/projects/ Thank you. Carlos Castillo and Ricardo Baeza-Yates Center for Web Research WIRE: an Open Source Web Information Retrieval Environment http://www.cwr.cl/