SlideShare une entreprise Scribd logo
1  sur  46
Télécharger pour lire hors ligne
When RSS Fails:
Web Scraping with HTTP
              Matthew Turland
            Senior Consultant
            Blue Parabola LLC
             February 27, 2009
What is Web Scraping?




A 2 Step Process
Its Goal: Data
Obtain It
Transform It
Automate It
Step 1: Retrieval
The Client
The Server
The Request
The Response
Or In Your Case
Step #2: Analysis
Locate Desired Data
Extract It
Use It
So To Recap
2 Step Process

Step 1:
                       GET /some/resource
Retrieval              ...


                       HTTP/1.1 200 OK
                                            Resource
                       ...
                                            with data
                                            you want

                                            Usable
              Raw
Step 2:
                                             data
            resource
Analysis
How Is It Different?

Consuming Web Services
Web service data formats     Web scraping data formats




Data Mining
Focus in data mining         Focus in web scraping
What Is It Used For?


System
integration


               Crawlers
               and indexers


                              Integration
                              testing
Disadvantages
One small change to markup...
... may break your application.
Or in modern terms...
Reverse Engineering Required
Multiple Requests
No Nice Neat Data Package
Quite the Opposite, In Fact
Know enough HTTP to...

Use one like this:

To do this:
Know enough HTTP to...

 Learn to use and troubleshoot one like this:



PEAR::HTTP_Client         pecl_http     Zend_Http_Client


                    Or roll your own!

              Filesystem + Streams          cURL
Let's GET Started

                          request line
                                         protocol version in
  method        URI address for the
                                         use by the client
or operation    desired resource

  GET /wiki/Main_Page HTTP/1.1

  Host: en.wikipedia.org
header name              header value
                    header
more headers follow...
URI vs URL

                        URI

         1. Uniquely identifies a resource
URL
         2. Indicates how to locate a resource

         3. Does both and is thus human-usable.



 More info in RFC 3986 Sections 1.1.3 and 1.2.2
Warning about GET

                                GET

In principle:
quot;Let's do this by the book.quot;


                                      GET
In reality:
quot;'Safe operation'? Whatever.quot;
Query Strings

                                                       Value
          Ampersands to separate
                                         Parameter
          parameter name-value pairs.

                              URL
                                          Query String
http://en.wikipedia.org/w/index.php? title=Query_string&action=edit


Question mark to separate
                                Equal signs to separate parameter
the resource address and
                                names and respective values
query string
URL Encoding
                              Also called percent encoding.
Parameter      Value
first          this is a field
second         is it clear enough (already)?
Query String
first=this+is+a+field&second=is+it+clear+%28already%29%3F

parse_str, urlencode, urldecode: Handy PHP URL functions
$_SERVER['QUERY_STRING'] / http_build_query($_GET)
More info on URL encoding in RFC 3986 Section 2.1
POST Requests
Most Common
                                        POST
HTTP Operations                                  /w/index.php
1. GET
2. POST
...

                                                 /new/resource
GET /some/resource HTTP/1.1
                                                      -or-
Header: Value
...
                                               /updated/resource
              POST /some/resource HTTP/1.1
none          Header: Value

              request body
POST Request Example
                                       Content type for data
                                       submitted via HTML form
 Blank line separates                  (multipart/form-data for file uploads)
 request headers and body

 POST /w/index.php?title=Wikipedia:Sandbox HTTP/1.1
 Content­Type: application/x­www­form­urlencoded

 wpStarttime=20080719022313&wpEdittime=20080719022100
 ...

                     Note: Most browsers have a query string length limit.
                     Lowest known common denominator: IE7
                     strlen(entire URL) <= 2,048 bytes.
Request body         This limit is not standardized. It applies
... look familiar?   to query strings, but not request bodies.
HEAD Request
Same as GET with two exceptions:
HEAD /wiki/Main_Page HTTP/1.1
Host: en.wikipedia.org
                                       ?
     1 HEAD vs GET

HTTP/1.1 200 OK
Header: Value

                                       Sometimes headers
                                         are all you want
      2 No response body
                                        Headers
                                        Body
Responses
                          Response
Lowest protocol version                 Response
                          status code
required to process the                 status description
response

                       Status line
     HTTP/1.0 200 OK
     Server: Apache
     X­Powered­By: PHP/5.2.5
     ...
                                   Same header format as
                                   requests, but different
     [body]                        headers are used
                                   (see RFC 2616 Section 14)
Response Status Codes

1xx Informational
  Request received, continuing
  process.
2xx Success
  Request received, understood,
  and accepted.
3xx Redirection
  Client must take additional action to complete the request.
4xx Client Error
  Request is malformed or could not be fulfilled.
5xx Server Error
  Request was valid, but the server failed to process it.

                                  See RFC 2616 Section 10 for more info.
Headers
   Set-Cookie             See RFC 2109 or RFC 2965
                          for more info.
     Cookie



    Location             Watch out for
                         infinite loops!



 Last-Modified                   ETag
                    OR
If-Modified-Since          If-None-Match
              304 Not Modified
More Headers
              WWW-Authenticate
                                           See RFC 2617
                 Authorization
                                           for more info.
            200 OK / 403 Forbidden


                                       Some servers perform
             User-Agent
                                       user agent sniffing


Some clients perform
                                     User-Agent:
user agent spoofing
Best Practices
Simulate User Behavior
Minimize Requests
Batch Jobs, Non-Peak Hours
Questions?
 No heckling... OK, maybe just a little.
 I generally blog about my experiences with web scraping
  and PHP at http://ishouldbecoding.com.
  </shameless_plug>

                  Thanks for coming!

Contenu connexe

Tendances

Php file upload, cookies & session
Php file upload, cookies & sessionPhp file upload, cookies & session
Php file upload, cookies & sessionJamshid Hashimi
 
05 File Handling Upload Mysql
05 File Handling Upload Mysql05 File Handling Upload Mysql
05 File Handling Upload MysqlGeshan Manandhar
 
Php Simple Xml
Php Simple XmlPhp Simple Xml
Php Simple Xmlmussawir20
 
PHP And Web Services: Perfect Partners
PHP And Web Services: Perfect PartnersPHP And Web Services: Perfect Partners
PHP And Web Services: Perfect PartnersLorna Mitchell
 
Session Server - Maintaing State between several Servers
Session Server - Maintaing State between several ServersSession Server - Maintaing State between several Servers
Session Server - Maintaing State between several ServersStephan Schmidt
 
RESTful SOA - 中科院暑期讲座
RESTful SOA - 中科院暑期讲座RESTful SOA - 中科院暑期讲座
RESTful SOA - 中科院暑期讲座Li Yi
 
Open Source Package PHP & MySQL
Open Source Package PHP & MySQLOpen Source Package PHP & MySQL
Open Source Package PHP & MySQLkalaisai
 
Modern Web Development with Perl
Modern Web Development with PerlModern Web Development with Perl
Modern Web Development with PerlDave Cross
 
Intro to php
Intro to phpIntro to php
Intro to phpSp Singh
 
Making Java REST with JAX-RS 2.0
Making Java REST with JAX-RS 2.0Making Java REST with JAX-RS 2.0
Making Java REST with JAX-RS 2.0Dmytro Chyzhykov
 
XML and Web Services with PHP5 and PEAR
XML and Web Services with PHP5 and PEARXML and Web Services with PHP5 and PEAR
XML and Web Services with PHP5 and PEARStephan Schmidt
 
Cakefest 2010: API Development
Cakefest 2010: API DevelopmentCakefest 2010: API Development
Cakefest 2010: API DevelopmentAndrew Curioso
 
Go OO! - Real-life Design Patterns in PHP 5
Go OO! - Real-life Design Patterns in PHP 5Go OO! - Real-life Design Patterns in PHP 5
Go OO! - Real-life Design Patterns in PHP 5Stephan Schmidt
 

Tendances (20)

Php file upload, cookies & session
Php file upload, cookies & sessionPhp file upload, cookies & session
Php file upload, cookies & session
 
05 File Handling Upload Mysql
05 File Handling Upload Mysql05 File Handling Upload Mysql
05 File Handling Upload Mysql
 
Advanced Json
Advanced JsonAdvanced Json
Advanced Json
 
Php Simple Xml
Php Simple XmlPhp Simple Xml
Php Simple Xml
 
PHP And Web Services: Perfect Partners
PHP And Web Services: Perfect PartnersPHP And Web Services: Perfect Partners
PHP And Web Services: Perfect Partners
 
Session Server - Maintaing State between several Servers
Session Server - Maintaing State between several ServersSession Server - Maintaing State between several Servers
Session Server - Maintaing State between several Servers
 
RESTful SOA - 中科院暑期讲座
RESTful SOA - 中科院暑期讲座RESTful SOA - 中科院暑期讲座
RESTful SOA - 中科院暑期讲座
 
Php
PhpPhp
Php
 
Open Source Package PHP & MySQL
Open Source Package PHP & MySQLOpen Source Package PHP & MySQL
Open Source Package PHP & MySQL
 
Modern Web Development with Perl
Modern Web Development with PerlModern Web Development with Perl
Modern Web Development with Perl
 
Intro to php
Intro to phpIntro to php
Intro to php
 
Intro to PHP
Intro to PHPIntro to PHP
Intro to PHP
 
Copy of cgi
Copy of cgiCopy of cgi
Copy of cgi
 
Making Java REST with JAX-RS 2.0
Making Java REST with JAX-RS 2.0Making Java REST with JAX-RS 2.0
Making Java REST with JAX-RS 2.0
 
XML and Web Services with PHP5 and PEAR
XML and Web Services with PHP5 and PEARXML and Web Services with PHP5 and PEAR
XML and Web Services with PHP5 and PEAR
 
PHP POWERPOINT SLIDES
PHP POWERPOINT SLIDESPHP POWERPOINT SLIDES
PHP POWERPOINT SLIDES
 
Cakefest 2010: API Development
Cakefest 2010: API DevelopmentCakefest 2010: API Development
Cakefest 2010: API Development
 
Go OO! - Real-life Design Patterns in PHP 5
Go OO! - Real-life Design Patterns in PHP 5Go OO! - Real-life Design Patterns in PHP 5
Go OO! - Real-life Design Patterns in PHP 5
 
PEAR For The Masses
PEAR For The MassesPEAR For The Masses
PEAR For The Masses
 
Develop webservice in PHP
Develop webservice in PHPDevelop webservice in PHP
Develop webservice in PHP
 

En vedette

Eclipsecon09 Introduction To Groovy
Eclipsecon09 Introduction To GroovyEclipsecon09 Introduction To Groovy
Eclipsecon09 Introduction To GroovyAndres Almiray
 
Instalação do novo módulo Moip gratuito para Magento
Instalação do novo módulo Moip gratuito para MagentoInstalação do novo módulo Moip gratuito para Magento
Instalação do novo módulo Moip gratuito para MagentoMoip
 
SimpleXML In PHP 5
SimpleXML In PHP 5SimpleXML In PHP 5
SimpleXML In PHP 5Ron Pringle
 
Almost Scraping: Web Scraping without Programming
Almost Scraping: Web Scraping without ProgrammingAlmost Scraping: Web Scraping without Programming
Almost Scraping: Web Scraping without ProgrammingMichelle Minkoff
 
Http Parameter Pollution, a new category of web attacks
Http Parameter Pollution, a new category of web attacksHttp Parameter Pollution, a new category of web attacks
Http Parameter Pollution, a new category of web attacksStefano Di Paola
 
Web Scraping With Python
Web Scraping With PythonWeb Scraping With Python
Web Scraping With PythonRobert Dempsey
 
Scraping data from the web and documents
Scraping data from the web and documentsScraping data from the web and documents
Scraping data from the web and documentsTommy Tavenner
 
JamNeo news aggregator
JamNeo news aggregatorJamNeo news aggregator
JamNeo news aggregatorJamNeo
 
Web Scraping with Python
Web Scraping with PythonWeb Scraping with Python
Web Scraping with PythonPaul Schreiber
 
interpolation
interpolationinterpolation
interpolation8laddu8
 
Web Services PHP Tutorial
Web Services PHP TutorialWeb Services PHP Tutorial
Web Services PHP TutorialLorna Mitchell
 
Web scraping in python
Web scraping in python Web scraping in python
Web scraping in python Viren Rajput
 
Social Media Mining - Chapter 3 (Network Measures)
Social Media Mining - Chapter 3 (Network Measures)Social Media Mining - Chapter 3 (Network Measures)
Social Media Mining - Chapter 3 (Network Measures)SocialMediaMining
 
Web scraping in python
Web scraping in pythonWeb scraping in python
Web scraping in pythonSaurav Tomar
 

En vedette (19)

Eclipsecon09 Introduction To Groovy
Eclipsecon09 Introduction To GroovyEclipsecon09 Introduction To Groovy
Eclipsecon09 Introduction To Groovy
 
Instalação do novo módulo Moip gratuito para Magento
Instalação do novo módulo Moip gratuito para MagentoInstalação do novo módulo Moip gratuito para Magento
Instalação do novo módulo Moip gratuito para Magento
 
Span and Div tags in HTML
Span and Div tags in HTMLSpan and Div tags in HTML
Span and Div tags in HTML
 
Php Rss
Php RssPhp Rss
Php Rss
 
SimpleXML In PHP 5
SimpleXML In PHP 5SimpleXML In PHP 5
SimpleXML In PHP 5
 
Web::Scraper
Web::ScraperWeb::Scraper
Web::Scraper
 
Almost Scraping: Web Scraping without Programming
Almost Scraping: Web Scraping without ProgrammingAlmost Scraping: Web Scraping without Programming
Almost Scraping: Web Scraping without Programming
 
Http Parameter Pollution, a new category of web attacks
Http Parameter Pollution, a new category of web attacksHttp Parameter Pollution, a new category of web attacks
Http Parameter Pollution, a new category of web attacks
 
Web Scraping With Python
Web Scraping With PythonWeb Scraping With Python
Web Scraping With Python
 
Scraping data from the web and documents
Scraping data from the web and documentsScraping data from the web and documents
Scraping data from the web and documents
 
JamNeo news aggregator
JamNeo news aggregatorJamNeo news aggregator
JamNeo news aggregator
 
Web Scraping with Python
Web Scraping with PythonWeb Scraping with Python
Web Scraping with Python
 
Scraping the web with python
Scraping the web with pythonScraping the web with python
Scraping the web with python
 
interpolation
interpolationinterpolation
interpolation
 
Web Services PHP Tutorial
Web Services PHP TutorialWeb Services PHP Tutorial
Web Services PHP Tutorial
 
Web scraping in python
Web scraping in python Web scraping in python
Web scraping in python
 
Social Media Mining - Chapter 3 (Network Measures)
Social Media Mining - Chapter 3 (Network Measures)Social Media Mining - Chapter 3 (Network Measures)
Social Media Mining - Chapter 3 (Network Measures)
 
Web scraping in python
Web scraping in pythonWeb scraping in python
Web scraping in python
 
Web scraping com python
Web scraping com pythonWeb scraping com python
Web scraping com python
 

Similaire à When RSS Fails: Web Scraping with HTTP

RESTful for opentravel.org by HP
RESTful for opentravel.org by HPRESTful for opentravel.org by HP
RESTful for opentravel.org by HPRoni Schuetz
 
OpenTravel Advisory Forum 2012 REST XML Resources
OpenTravel Advisory Forum 2012 REST XML ResourcesOpenTravel Advisory Forum 2012 REST XML Resources
OpenTravel Advisory Forum 2012 REST XML ResourcesOpenTravel Alliance
 
So you think you know REST - DPC11
So you think you know REST - DPC11So you think you know REST - DPC11
So you think you know REST - DPC11Evert Pot
 
HTTP Request and Response Structure
HTTP Request and Response StructureHTTP Request and Response Structure
HTTP Request and Response StructureBhagyashreeGajera1
 
Rest presentation
Rest  presentationRest  presentation
Rest presentationsrividhyau
 
REST in ( a mobile ) peace @ WHYMCA 05-21-2011
REST in ( a mobile ) peace @ WHYMCA 05-21-2011REST in ( a mobile ) peace @ WHYMCA 05-21-2011
REST in ( a mobile ) peace @ WHYMCA 05-21-2011Alessandro Nadalin
 
Service approach for development Rest API in Symfony2
Service approach for development Rest API in Symfony2Service approach for development Rest API in Symfony2
Service approach for development Rest API in Symfony2Sumy PHP User Grpoup
 
REST-API introduction for developers
REST-API introduction for developersREST-API introduction for developers
REST-API introduction for developersPatrick Savalle
 
Resource-Oriented Web Services
Resource-Oriented Web ServicesResource-Oriented Web Services
Resource-Oriented Web ServicesBradley Holt
 
Web Fundamentals
Web FundamentalsWeb Fundamentals
Web Fundamentalsarunv
 
KMUTNB - Internet Programming 2/7
KMUTNB - Internet Programming 2/7KMUTNB - Internet Programming 2/7
KMUTNB - Internet Programming 2/7phuphax
 
2014 database - course 1 - www introduction
2014 database - course 1 - www introduction2014 database - course 1 - www introduction
2014 database - course 1 - www introductionHung-yu Lin
 
Java EE 8: What Servlet 4 and HTTP2 Mean
Java EE 8: What Servlet 4 and HTTP2 MeanJava EE 8: What Servlet 4 and HTTP2 Mean
Java EE 8: What Servlet 4 and HTTP2 MeanAlex Theedom
 
RESTful services
RESTful servicesRESTful services
RESTful servicesgouthamrv
 

Similaire à When RSS Fails: Web Scraping with HTTP (20)

RESTful for opentravel.org by HP
RESTful for opentravel.org by HPRESTful for opentravel.org by HP
RESTful for opentravel.org by HP
 
OpenTravel Advisory Forum 2012 REST XML Resources
OpenTravel Advisory Forum 2012 REST XML ResourcesOpenTravel Advisory Forum 2012 REST XML Resources
OpenTravel Advisory Forum 2012 REST XML Resources
 
So you think you know REST - DPC11
So you think you know REST - DPC11So you think you know REST - DPC11
So you think you know REST - DPC11
 
HTTP Request and Response Structure
HTTP Request and Response StructureHTTP Request and Response Structure
HTTP Request and Response Structure
 
Rest presentation
Rest  presentationRest  presentation
Rest presentation
 
REST in ( a mobile ) peace @ WHYMCA 05-21-2011
REST in ( a mobile ) peace @ WHYMCA 05-21-2011REST in ( a mobile ) peace @ WHYMCA 05-21-2011
REST in ( a mobile ) peace @ WHYMCA 05-21-2011
 
Service approach for development Rest API in Symfony2
Service approach for development Rest API in Symfony2Service approach for development Rest API in Symfony2
Service approach for development Rest API in Symfony2
 
REST-API introduction for developers
REST-API introduction for developersREST-API introduction for developers
REST-API introduction for developers
 
Doing REST Right
Doing REST RightDoing REST Right
Doing REST Right
 
Resource-Oriented Web Services
Resource-Oriented Web ServicesResource-Oriented Web Services
Resource-Oriented Web Services
 
Web Fundamentals
Web FundamentalsWeb Fundamentals
Web Fundamentals
 
A Look at OData
A Look at ODataA Look at OData
A Look at OData
 
Web Services Tutorial
Web Services TutorialWeb Services Tutorial
Web Services Tutorial
 
Web services tutorial
Web services tutorialWeb services tutorial
Web services tutorial
 
KMUTNB - Internet Programming 2/7
KMUTNB - Internet Programming 2/7KMUTNB - Internet Programming 2/7
KMUTNB - Internet Programming 2/7
 
2014 database - course 1 - www introduction
2014 database - course 1 - www introduction2014 database - course 1 - www introduction
2014 database - course 1 - www introduction
 
Java EE 8: What Servlet 4 and HTTP2 Mean
Java EE 8: What Servlet 4 and HTTP2 MeanJava EE 8: What Servlet 4 and HTTP2 Mean
Java EE 8: What Servlet 4 and HTTP2 Mean
 
Troubleshooting.pptx
Troubleshooting.pptxTroubleshooting.pptx
Troubleshooting.pptx
 
RESTful services
RESTful servicesRESTful services
RESTful services
 
WebApp #3 : API
WebApp #3 : APIWebApp #3 : API
WebApp #3 : API
 

Plus de Matthew Turland

New SPL Features in PHP 5.3
New SPL Features in PHP 5.3New SPL Features in PHP 5.3
New SPL Features in PHP 5.3Matthew Turland
 
New SPL Features in PHP 5.3 (TEK-X)
New SPL Features in PHP 5.3 (TEK-X)New SPL Features in PHP 5.3 (TEK-X)
New SPL Features in PHP 5.3 (TEK-X)Matthew Turland
 
Open Source Networking with Vyatta
Open Source Networking with VyattaOpen Source Networking with Vyatta
Open Source Networking with VyattaMatthew Turland
 
Open Source Content Management Systems
Open Source Content Management SystemsOpen Source Content Management Systems
Open Source Content Management SystemsMatthew Turland
 
PHP Basics for Designers
PHP Basics for DesignersPHP Basics for Designers
PHP Basics for DesignersMatthew Turland
 
Creating Web Services with Zend Framework - Matthew Turland
Creating Web Services with Zend Framework - Matthew TurlandCreating Web Services with Zend Framework - Matthew Turland
Creating Web Services with Zend Framework - Matthew TurlandMatthew Turland
 
The OpenSolaris Operating System and Sun xVM VirtualBox - Blake Deville
The OpenSolaris Operating System and Sun xVM VirtualBox - Blake DevilleThe OpenSolaris Operating System and Sun xVM VirtualBox - Blake Deville
The OpenSolaris Operating System and Sun xVM VirtualBox - Blake DevilleMatthew Turland
 
Utilizing the Xen Hypervisor in business practice - Bryan Fusilier
Utilizing the Xen Hypervisor in business practice - Bryan FusilierUtilizing the Xen Hypervisor in business practice - Bryan Fusilier
Utilizing the Xen Hypervisor in business practice - Bryan FusilierMatthew Turland
 
The Ruby Programming Language - Ryan Farnell
The Ruby Programming Language - Ryan FarnellThe Ruby Programming Language - Ryan Farnell
The Ruby Programming Language - Ryan FarnellMatthew Turland
 
PDQ Programming Languages plus an overview of Alice - Frank Ducrest
PDQ Programming Languages plus an overview of Alice - Frank DucrestPDQ Programming Languages plus an overview of Alice - Frank Ducrest
PDQ Programming Languages plus an overview of Alice - Frank DucrestMatthew Turland
 
Getting Involved in Open Source - Matthew Turland
Getting Involved in Open Source - Matthew TurlandGetting Involved in Open Source - Matthew Turland
Getting Involved in Open Source - Matthew TurlandMatthew Turland
 

Plus de Matthew Turland (13)

New SPL Features in PHP 5.3
New SPL Features in PHP 5.3New SPL Features in PHP 5.3
New SPL Features in PHP 5.3
 
New SPL Features in PHP 5.3 (TEK-X)
New SPL Features in PHP 5.3 (TEK-X)New SPL Features in PHP 5.3 (TEK-X)
New SPL Features in PHP 5.3 (TEK-X)
 
Sinatra
SinatraSinatra
Sinatra
 
Web Scraping with PHP
Web Scraping with PHPWeb Scraping with PHP
Web Scraping with PHP
 
Open Source Networking with Vyatta
Open Source Networking with VyattaOpen Source Networking with Vyatta
Open Source Networking with Vyatta
 
Open Source Content Management Systems
Open Source Content Management SystemsOpen Source Content Management Systems
Open Source Content Management Systems
 
PHP Basics for Designers
PHP Basics for DesignersPHP Basics for Designers
PHP Basics for Designers
 
Creating Web Services with Zend Framework - Matthew Turland
Creating Web Services with Zend Framework - Matthew TurlandCreating Web Services with Zend Framework - Matthew Turland
Creating Web Services with Zend Framework - Matthew Turland
 
The OpenSolaris Operating System and Sun xVM VirtualBox - Blake Deville
The OpenSolaris Operating System and Sun xVM VirtualBox - Blake DevilleThe OpenSolaris Operating System and Sun xVM VirtualBox - Blake Deville
The OpenSolaris Operating System and Sun xVM VirtualBox - Blake Deville
 
Utilizing the Xen Hypervisor in business practice - Bryan Fusilier
Utilizing the Xen Hypervisor in business practice - Bryan FusilierUtilizing the Xen Hypervisor in business practice - Bryan Fusilier
Utilizing the Xen Hypervisor in business practice - Bryan Fusilier
 
The Ruby Programming Language - Ryan Farnell
The Ruby Programming Language - Ryan FarnellThe Ruby Programming Language - Ryan Farnell
The Ruby Programming Language - Ryan Farnell
 
PDQ Programming Languages plus an overview of Alice - Frank Ducrest
PDQ Programming Languages plus an overview of Alice - Frank DucrestPDQ Programming Languages plus an overview of Alice - Frank Ducrest
PDQ Programming Languages plus an overview of Alice - Frank Ducrest
 
Getting Involved in Open Source - Matthew Turland
Getting Involved in Open Source - Matthew TurlandGetting Involved in Open Source - Matthew Turland
Getting Involved in Open Source - Matthew Turland
 

Dernier

Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilV3cube
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsRoshan Dwivedi
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 

Dernier (20)

Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 

When RSS Fails: Web Scraping with HTTP

  • 1. When RSS Fails: Web Scraping with HTTP Matthew Turland Senior Consultant Blue Parabola LLC February 27, 2009
  • 2. What is Web Scraping? A 2 Step Process
  • 12. Or In Your Case
  • 17. So To Recap 2 Step Process Step 1: GET /some/resource Retrieval ... HTTP/1.1 200 OK Resource ... with data you want Usable Raw Step 2: data resource Analysis
  • 18. How Is It Different? Consuming Web Services Web service data formats Web scraping data formats Data Mining Focus in data mining Focus in web scraping
  • 19. What Is It Used For? System integration Crawlers and indexers Integration testing
  • 21. One small change to markup...
  • 22. ... may break your application.
  • 23. Or in modern terms...
  • 26. No Nice Neat Data Package
  • 28. Know enough HTTP to... Use one like this: To do this:
  • 29. Know enough HTTP to... Learn to use and troubleshoot one like this: PEAR::HTTP_Client pecl_http Zend_Http_Client Or roll your own! Filesystem + Streams cURL
  • 30. Let's GET Started request line protocol version in method URI address for the use by the client or operation desired resource GET /wiki/Main_Page HTTP/1.1 Host: en.wikipedia.org header name header value header more headers follow...
  • 31. URI vs URL URI 1. Uniquely identifies a resource URL 2. Indicates how to locate a resource 3. Does both and is thus human-usable. More info in RFC 3986 Sections 1.1.3 and 1.2.2
  • 32. Warning about GET GET In principle: quot;Let's do this by the book.quot; GET In reality: quot;'Safe operation'? Whatever.quot;
  • 33. Query Strings Value Ampersands to separate Parameter parameter name-value pairs. URL Query String http://en.wikipedia.org/w/index.php? title=Query_string&action=edit Question mark to separate Equal signs to separate parameter the resource address and names and respective values query string
  • 34. URL Encoding Also called percent encoding. Parameter Value first this is a field second is it clear enough (already)? Query String first=this+is+a+field&second=is+it+clear+%28already%29%3F parse_str, urlencode, urldecode: Handy PHP URL functions $_SERVER['QUERY_STRING'] / http_build_query($_GET) More info on URL encoding in RFC 3986 Section 2.1
  • 35. POST Requests Most Common POST HTTP Operations /w/index.php 1. GET 2. POST ... /new/resource GET /some/resource HTTP/1.1 -or- Header: Value ... /updated/resource POST /some/resource HTTP/1.1 none Header: Value request body
  • 36. POST Request Example Content type for data submitted via HTML form Blank line separates (multipart/form-data for file uploads) request headers and body POST /w/index.php?title=Wikipedia:Sandbox HTTP/1.1 Content­Type: application/x­www­form­urlencoded wpStarttime=20080719022313&wpEdittime=20080719022100 ... Note: Most browsers have a query string length limit. Lowest known common denominator: IE7 strlen(entire URL) <= 2,048 bytes. Request body This limit is not standardized. It applies ... look familiar? to query strings, but not request bodies.
  • 37. HEAD Request Same as GET with two exceptions: HEAD /wiki/Main_Page HTTP/1.1 Host: en.wikipedia.org ? 1 HEAD vs GET HTTP/1.1 200 OK Header: Value Sometimes headers are all you want 2 No response body Headers Body
  • 38. Responses Response Lowest protocol version Response status code required to process the status description response Status line HTTP/1.0 200 OK Server: Apache X­Powered­By: PHP/5.2.5 ... Same header format as requests, but different [body] headers are used (see RFC 2616 Section 14)
  • 39. Response Status Codes 1xx Informational Request received, continuing process. 2xx Success Request received, understood, and accepted. 3xx Redirection Client must take additional action to complete the request. 4xx Client Error Request is malformed or could not be fulfilled. 5xx Server Error Request was valid, but the server failed to process it. See RFC 2616 Section 10 for more info.
  • 40. Headers Set-Cookie See RFC 2109 or RFC 2965 for more info. Cookie Location Watch out for infinite loops! Last-Modified ETag OR If-Modified-Since If-None-Match 304 Not Modified
  • 41. More Headers WWW-Authenticate See RFC 2617 Authorization for more info. 200 OK / 403 Forbidden Some servers perform User-Agent user agent sniffing Some clients perform User-Agent: user agent spoofing
  • 46. Questions?  No heckling... OK, maybe just a little.  I generally blog about my experiences with web scraping and PHP at http://ishouldbecoding.com. </shameless_plug> Thanks for coming!