Parsing & Crawling libs
WE DON'T USE
1 / 13
Beautiful Soup
built on top of lxml and html5lib
higher-level commands
handles encoding itself
example :
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc)
soup.title
# <title>The Dormouse's story</title>
soup.title.name
# u'title'
soup.title.string
# u'The Dormouse's story'
soup.title.parent.name
# u'head'
soup.p
# <p class="title"><b>The Dormouse's story</b></p>
soup.p['class']
# u'title'
soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.find(id="link3")
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
2 / 13
Beautiful Soup - Y U no use me ?
yet a new kind of soup
gotta go a step lower
crappy acronym
3 / 13
html5lib
implements the WHATWG HTML5 specification.
will inject tbody and such
is actually usable directly in lxml, we could use it
4 / 13
html5lib - Y U no use me ?
Y would I ? uuhh
5 / 13
Scrapy
write rules
built-in handling of compression, cache, cookies, authentication, user-agent spoofing, robots.txt handling, statistics, crawl depth restriction, etc.
extendable : middlewares, extensions, and pipelines
Web management console for monitoring and controlling your bot
Telnet console for low-level access to the Scrapy process
from scrapy.item import Item, Field

class TorrentItem(Item):
    url = Field()
    name = Field()
    description = Field()
    size = Field()

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector

class MininovaSpider(CrawlSpider):
    name = 'mininova'
    allowed_domains = ['mininova.org']
    start_urls = ['http://www.mininova.org/today']
    # Follow links whose path matches /tor/<digits>; parse each with parse_torrent
    rules = [Rule(SgmlLinkExtractor(allow=[r'/tor/\d+']), 'parse_torrent')]

    def parse_torrent(self, response):
        sel = Selector(response)
        torrent = TorrentItem()
        torrent['url'] = response.url
        torrent['name'] = sel.xpath("//h1/text()").extract()
        torrent['description'] = sel.xpath("//div[@id='description']").extract()
        torrent['size'] = sel.xpath("//div[@id='info-left']/p[2]/text()[2]").extract()
        return torrent
6 / 13
Scrapy Shell
scrapy shell "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x1e16b50>
[s]   item       {}
[s]   request    <GET http://scrapy.org>
[s]   response   <200 http://scrapy.org>
[s]   sel        <Selector xpath=None data=u'<html>\n <head>\n <meta charset="utf-8'>
[s]   settings   <CrawlerSettings module=None>
[s]   spider     <Spider 'default' at 0x20c6f50>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser
In [1]: sel.xpath('//title')
Out[1]: [<Selector xpath='//title' data=u'<title>Open Directory - Computers: Progr'>]
7 / 13
Scrapy Settings
CONCURRENT_ITEMS
CONCURRENT_REQUESTS
CONCURRENT_REQUESTS_PER_DOMAIN
CONCURRENT_REQUESTS_PER_IP
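As a rough illustration of how these four knobs fit together, here is a minimal sketch of a project's settings.py. Scrapy reads module-level upper-case names from that file; the numeric values below are illustrative assumptions, not figures from the talk.

```python
# settings.py (sketch). Values are illustrative assumptions only.

# Maximum number of items processed in parallel (per response)
# by the item pipelines.
CONCURRENT_ITEMS = 100

# Global cap on requests the downloader performs at the same time.
CONCURRENT_REQUESTS = 16

# Cap on simultaneous requests to any single domain.
CONCURRENT_REQUESTS_PER_DOMAIN = 8

# Cap on simultaneous requests to any single IP.
# 0 means "no per-IP limit", so the per-domain cap applies instead.
CONCURRENT_REQUESTS_PER_IP = 0
```

Keeping the per-domain cap below the global cap lets one slow site saturate only part of the downloader.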
8 / 13
Scrapy - Y U no use me ?
I want to !
how to integrate the Scrapy daemon with MRQ ?
have to implement a proxy-rotating middleware
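The proxy rotation mentioned above could look roughly like the sketch below. It relies on two things Scrapy does provide: downloader middlewares may define process_request(request, spider), and the built-in HTTP proxy handling honours request.meta['proxy']. Everything else is an assumption made for illustration: the class name RotatingProxyMiddleware, the round-robin policy, the example proxy URLs, and the FakeRequest stand-in that lets the snippet run without Scrapy installed.

```python
import itertools


class RotatingProxyMiddleware(object):
    """Hypothetical downloader middleware: hands out proxies round-robin."""

    def __init__(self, proxies):
        if not proxies:
            raise ValueError("at least one proxy URL is required")
        # cycle() repeats the list endlessly, giving a round-robin rotation.
        self._pool = itertools.cycle(proxies)

    def process_request(self, request, spider):
        # Tag the outgoing request with the next proxy in the rotation.
        request.meta['proxy'] = next(self._pool)
        # Returning None tells Scrapy to continue processing the request.
        return None


class FakeRequest(object):
    """Stand-in for a Scrapy request (assumption: only .url and .meta matter)."""

    def __init__(self, url):
        self.url = url
        self.meta = {}


# Placeholder proxy addresses, not real services.
PROXIES = ["http://proxy-a.example:8080", "http://proxy-b.example:8080"]

middleware = RotatingProxyMiddleware(PROXIES)
seen = []
for url in ["http://example.com/1", "http://example.com/2", "http://example.com/3"]:
    req = FakeRequest(url)
    middleware.process_request(req, spider=None)
    seen.append(req.meta['proxy'])
```

In a real project the class would be registered through the DOWNLOADER_MIDDLEWARES setting, and the proxy list would come from configuration rather than a hard-coded constant. Health checks and banning of dead proxies are left out of this sketch.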
9 / 13
Scrapy - Scrapinghub
10 / 13
Did I miss something ?
mechanize, twill => shitty deprecated crawling modules
I forgot their names => black-box paid services
11 / 13
Did I miss something ?
GET LARGE
12 / 13
Adrien Di Pasquale
16/05/2014
13 / 13
