SlideShare une entreprise Scribd logo
1  sur  26
Télécharger pour lire hors ligne
Web Scraping in
Python with Scrapy
Kota Kato
@orangain
2015-09-08, 鮨会
Who am I?
• Kota Kato
• @orangain
• Software Engineer
• Interested in automation such as Jenkins,
Chef, Docker etc.
Definition: Web Scraping
• Web scraping (web harvesting or web data
extraction) is a computer software technique
of extracting information from websites.
Web scraping - Wikipedia, the free encyclopedia

https://en.wikipedia.org/wiki/Web_scraping
eBook-1
• Cross-store search engine for ebooks.
• Retrieve ebook data from 9 ebook stores.
http://ebook-1.com/
QB Meter
• Visualize crowdedness
of QB HOUSE, 10
minutes barbershop.
• Retrieve crowdedness
from QB HOUSE's
Web site every 5
minutes.
http://qbmeter.capybala.com/
Prototype of
Glance
• Prototype of simple news
app like newspaper.
• Retrieve news from NHK
NEWS WEB 4 times per a
day.
Pokedos
• Web app to find nearest
bus stops to see the
arrival information of
buses.
• Retrieve location of the
all bus stops in Kyoto-
city.
http://bus.capybala.com/
Why Web Scraping?
• For Web Developer:
• Develop mash-up application.
• For Data Analyst:
• Retrieve data to analyze.
• For Everybody:
• Automate operation of web sites.
Why Use Python?
• Easy to use
• Powerful libraries, especially Scrapy
• Seamlessness between data processing and
developing application
Web Scraping in Python
• Combination of lightweight libraries:
• Retrieving: Requests
• Scraping: lxml, Beautiful Soup
• Full stack framework:
• Scrapy Today's topic
Scrapy
Scrapy
• Fast, simple and extensible Web scraping
framework in Python
• Currently compatible only with Python 2.7
• In-progress Python 3 support
• Maintained by Scrapinghub
• BSD License
http://scrapy.org/
Why Use Scrapy?
• Annoying stuffs in crawling and scraping are
done by Scrapy.
Extracting
Links
Throttling Concurrency
robots.txt and
<meta> Tags
XML Sitemaps
Filtering
Duplicated
URLs
Retry on Error Job Control
Getting Started with Scrapy
$ pip install scrapy
$ cat > myspider.py <<EOF
import scrapy
class BlogSpider(scrapy.Spider):
name = 'blogspider'
start_urls = ['http://blog.scrapinghub.com']
def parse(self, response):
for url in response.css('ul li a::attr("href")').re(r'.*/dddd/dd/$'):
yield scrapy.Request(response.urljoin(url), self.parse_titles)
def parse_titles(self, response):
for post_title in response.css('div.entries > ul > li a::text').extract():
yield {'title': post_title}
EOF
$ scrapy runspider myspider.py
http://scrapy.org/Requirements: Python 2.7, libxml2 and libxslt
Let's Collect Sushi Images
Create a Scrapy Project
$ scrapy startproject sushibot
$ tree sushibot/
sushibot/
!"" scrapy.cfg
#"" sushibot
!"" __init__.py
!"" items.py
!"" pipelines.py
!"" settings.py
#"" spiders
#"" __init__.py
2 directories, 6 files
Generate a Spider
$ cd sushibot
$ scrapy genspider sushi api.flickr.com
$ cat sushibot/spiders/sushi.py
# -*- coding: utf-8 -*-
import scrapy
class SushiSpider(scrapy.Spider):
name = "sushi"
allowed_domains = ["api.flickr.com"]
start_urls = (
'http://www.api.flickr.com/',
)
def parse(self, response):
pass
Flickr API to Search Photos
$ curl 'https://api.flickr.com/services/rest/?
method=flickr.photos.search&api_key=******&text=sushi&sort=relevance
' > photos.xml
$ cat photos.xml
<?xml version="1.0" encoding="utf-8" ?>
<rsp stat="ok">
<photos page="1" pages="871" perpage="100" total="87088">
<photo id="4794344495" owner="38553162@N00" secret="d907790937"
server="4093" farm="5" title="Sushi!" ispublic="1" isfriend="0"
isfamily="0" />
<photo id="8486536177" owner="78779574@N00" secret="f77b824ebb"
server="8382" farm="9" title="Best Salmon Sushi" ispublic="1"
isfriend="0" isfamily="0" />
...
https://www.flickr.com/services/api/flickr.photos.search.html
Construct Photo's URL
<photo id="4794344495" owner="38553162@N00" secret="d907790937"
server="4093" farm="5" title="Sushi!" ispublic="1" isfriend="0"
isfamily="0" />
https://farm{farm-id}.staticflickr.com/{server-id}/{id}_{secret}
_[mstzb].jpg
https://farm5.staticflickr.com/4093/4794344495_d907790937_b.jpg
https://www.flickr.com/services/api/misc.urls.html
Photo element:
Photo's URL template:
Result:
spider/sushi.py (Modified)
# -*- coding: utf-8 -*-
import os
import scrapy
from sushibot.items import SushibotItem
class SushiSpider(scrapy.Spider):
name = "sushi"
allowed_domains = ["api.flickr.com", "staticflickr.com"]
start_urls = (
'https://api.flickr.com/services/rest/?method=flickr.photos.search&api_key=' +
os.environ['FLICKR_KEY'] + '&text=sushi&sort=relevance',
)
def parse(self, response):
for photo in response.css('photo'):
yield scrapy.Request(photo_url(photo), self.handle_image)
def handle_image(self, response):
return SushibotItem(url=response.url, body=response.body)
def photo_url(photo):
return 'https://farm{farm}.staticflickr.com/{server}/{id}_{secret}_{size}.jpg'.format(
farm=photo.xpath('@farm').extract_first(),
server=photo.xpath('@server').extract_first(),
id=photo.xpath('@id').extract_first(),
secret=photo.xpath('@secret').extract_first(),
size='b',
)
Scrapy's Architecture
http://doc.scrapy.org/en/1.0/topics/architecture.html
items.py
# -*- coding: utf-8 -*-
from pprint import pformat
import scrapy
class SushibotItem(scrapy.Item):
url = scrapy.Field()
body = scrapy.Field()
def __str__(self):
return pformat({
'url': self['url'],
'body': self['body'][:10] + '...',
})
pipelines.py
# -*- coding: utf-8 -*-
import os
class SaveImagePipeline(object):
def process_item(self, item, spider):
output_dir = 'images'
if not os.path.exists(output_dir):
os.makedirs(output_dir)
filename = item['url'].split('/')[-1]
with open(os.path.join(output_dir, filename), 'wb') as f:
f.write(item['body'])
return item
settings.py
• Appended settings:
# Crawl responsibly by identifying yourself (and your website) on the
user-agent
USER_AGENT = 'sushibot (+orangain@gmail.com)'
# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/
settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 1
# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
'sushibot.pipelines.SaveImagePipeline': 300,
}
Run Spider
$ FLICKR_KEY=********** scrapy crawl sushi
NOTE: Provide Flickr's API key with environment variables.
Thank you!
• Web scraping has power to propose
improvement.
• Source code is available at

https://github.com/orangain/sushibot
@orangain

Contenu connexe

Tendances

LogStash - Yes, logging can be awesome
LogStash - Yes, logging can be awesomeLogStash - Yes, logging can be awesome
LogStash - Yes, logging can be awesomeJames Turnbull
 
elasticsearch basics workshop
elasticsearch basics workshopelasticsearch basics workshop
elasticsearch basics workshopMathieu Elie
 
Debugging and Testing ES Systems
Debugging and Testing ES SystemsDebugging and Testing ES Systems
Debugging and Testing ES SystemsChris Birchall
 
Got Logs? Get Answers with Elasticsearch ELK - PuppetConf 2014
Got Logs? Get Answers with Elasticsearch ELK - PuppetConf 2014Got Logs? Get Answers with Elasticsearch ELK - PuppetConf 2014
Got Logs? Get Answers with Elasticsearch ELK - PuppetConf 2014Puppet
 
Bo0oM - There's Nothing so Permanent as Temporary (PHDays IV, 2014)
Bo0oM - There's Nothing so Permanent as Temporary (PHDays IV, 2014)Bo0oM - There's Nothing so Permanent as Temporary (PHDays IV, 2014)
Bo0oM - There's Nothing so Permanent as Temporary (PHDays IV, 2014)Дмитрий Бумов
 
Luigi Presentation at OSCON 2013
Luigi Presentation at OSCON 2013Luigi Presentation at OSCON 2013
Luigi Presentation at OSCON 2013Erik Bernhardsson
 
Approach to find critical vulnerabilities
Approach to find critical vulnerabilitiesApproach to find critical vulnerabilities
Approach to find critical vulnerabilitiesAshish Kunwar
 
Spatial MongoDB, Node.JS, and Express - server-side JS for your application
Spatial MongoDB, Node.JS, and Express - server-side JS for your applicationSpatial MongoDB, Node.JS, and Express - server-side JS for your application
Spatial MongoDB, Node.JS, and Express - server-side JS for your applicationSteven Pousty
 
Django and Mongoengine
Django and MongoengineDjango and Mongoengine
Django and Mongoengineaustinpublic
 
Code4 lib 20141129 python
Code4 lib 20141129 pythonCode4 lib 20141129 python
Code4 lib 20141129 pythontdsmithCapU
 
Logstash-Elasticsearch-Kibana
Logstash-Elasticsearch-KibanaLogstash-Elasticsearch-Kibana
Logstash-Elasticsearch-Kibanadknx01
 
Luigi presentation NYC Data Science
Luigi presentation NYC Data ScienceLuigi presentation NYC Data Science
Luigi presentation NYC Data ScienceErik Bernhardsson
 
Building an API with Django and Django REST Framework
Building an API with Django and Django REST FrameworkBuilding an API with Django and Django REST Framework
Building an API with Django and Django REST FrameworkChristopher Foresman
 
CouchDB Day NYC 2017: Introduction to CouchDB 2.0
CouchDB Day NYC 2017: Introduction to CouchDB 2.0CouchDB Day NYC 2017: Introduction to CouchDB 2.0
CouchDB Day NYC 2017: Introduction to CouchDB 2.0IBM Cloud Data Services
 
Painless Data Storage with MongoDB & Go
Painless Data Storage with MongoDB & Go Painless Data Storage with MongoDB & Go
Painless Data Storage with MongoDB & Go Steven Francia
 
성장을 좋아하는 사람이, 성장하고 싶은 사람에게
성장을 좋아하는 사람이, 성장하고 싶은 사람에게성장을 좋아하는 사람이, 성장하고 싶은 사람에게
성장을 좋아하는 사람이, 성장하고 싶은 사람에게Seongyun Byeon
 

Tendances (20)

Analyse Yourself
Analyse YourselfAnalyse Yourself
Analyse Yourself
 
LogStash - Yes, logging can be awesome
LogStash - Yes, logging can be awesomeLogStash - Yes, logging can be awesome
LogStash - Yes, logging can be awesome
 
elasticsearch basics workshop
elasticsearch basics workshopelasticsearch basics workshop
elasticsearch basics workshop
 
Debugging and Testing ES Systems
Debugging and Testing ES SystemsDebugging and Testing ES Systems
Debugging and Testing ES Systems
 
Got Logs? Get Answers with Elasticsearch ELK - PuppetConf 2014
Got Logs? Get Answers with Elasticsearch ELK - PuppetConf 2014Got Logs? Get Answers with Elasticsearch ELK - PuppetConf 2014
Got Logs? Get Answers with Elasticsearch ELK - PuppetConf 2014
 
Bo0oM - There's Nothing so Permanent as Temporary (PHDays IV, 2014)
Bo0oM - There's Nothing so Permanent as Temporary (PHDays IV, 2014)Bo0oM - There's Nothing so Permanent as Temporary (PHDays IV, 2014)
Bo0oM - There's Nothing so Permanent as Temporary (PHDays IV, 2014)
 
ElasticSearch
ElasticSearchElasticSearch
ElasticSearch
 
Luigi Presentation at OSCON 2013
Luigi Presentation at OSCON 2013Luigi Presentation at OSCON 2013
Luigi Presentation at OSCON 2013
 
Approach to find critical vulnerabilities
Approach to find critical vulnerabilitiesApproach to find critical vulnerabilities
Approach to find critical vulnerabilities
 
Spatial MongoDB, Node.JS, and Express - server-side JS for your application
Spatial MongoDB, Node.JS, and Express - server-side JS for your applicationSpatial MongoDB, Node.JS, and Express - server-side JS for your application
Spatial MongoDB, Node.JS, and Express - server-side JS for your application
 
Web::Scraper
Web::ScraperWeb::Scraper
Web::Scraper
 
Django and Mongoengine
Django and MongoengineDjango and Mongoengine
Django and Mongoengine
 
Code4 lib 20141129 python
Code4 lib 20141129 pythonCode4 lib 20141129 python
Code4 lib 20141129 python
 
Logstash-Elasticsearch-Kibana
Logstash-Elasticsearch-KibanaLogstash-Elasticsearch-Kibana
Logstash-Elasticsearch-Kibana
 
Luigi presentation NYC Data Science
Luigi presentation NYC Data ScienceLuigi presentation NYC Data Science
Luigi presentation NYC Data Science
 
Building an API with Django and Django REST Framework
Building an API with Django and Django REST FrameworkBuilding an API with Django and Django REST Framework
Building an API with Django and Django REST Framework
 
CouchDB Day NYC 2017: Introduction to CouchDB 2.0
CouchDB Day NYC 2017: Introduction to CouchDB 2.0CouchDB Day NYC 2017: Introduction to CouchDB 2.0
CouchDB Day NYC 2017: Introduction to CouchDB 2.0
 
Django Mongodb Engine
Django Mongodb EngineDjango Mongodb Engine
Django Mongodb Engine
 
Painless Data Storage with MongoDB & Go
Painless Data Storage with MongoDB & Go Painless Data Storage with MongoDB & Go
Painless Data Storage with MongoDB & Go
 
성장을 좋아하는 사람이, 성장하고 싶은 사람에게
성장을 좋아하는 사람이, 성장하고 싶은 사람에게성장을 좋아하는 사람이, 성장하고 싶은 사람에게
성장을 좋아하는 사람이, 성장하고 싶은 사람에게
 

En vedette

Downloading the internet with Python + Scrapy
Downloading the internet with Python + ScrapyDownloading the internet with Python + Scrapy
Downloading the internet with Python + ScrapyErin Shellman
 
How to Scrap Any Website's content using ScrapyTutorial of How to scrape (cra...
How to Scrap Any Website's content using ScrapyTutorial of How to scrape (cra...How to Scrap Any Website's content using ScrapyTutorial of How to scrape (cra...
How to Scrap Any Website's content using ScrapyTutorial of How to scrape (cra...Anton
 
Crawling the web for fun and profit
Crawling the web for fun and profitCrawling the web for fun and profit
Crawling the web for fun and profitFederico Feroldi
 
Web crawler - Scrapy
Web crawler - Scrapy Web crawler - Scrapy
Web crawler - Scrapy yafish
 
Java Script Based Client Server Webapps 2
Java Script Based Client Server Webapps 2Java Script Based Client Server Webapps 2
Java Script Based Client Server Webapps 2kriszyp
 
快快樂樂學 Scrapy
快快樂樂學 Scrapy快快樂樂學 Scrapy
快快樂樂學 Scrapyrecast203
 
Web Scraping is BS
Web Scraping is BSWeb Scraping is BS
Web Scraping is BSJohn D
 
Python beautiful soup - bs4
Python beautiful soup - bs4Python beautiful soup - bs4
Python beautiful soup - bs4Eueung Mulyana
 
Collecting web information with open source tools
Collecting web information with open source toolsCollecting web information with open source tools
Collecting web information with open source toolsSammy Fung
 
Web Scraping Technologies
Web Scraping TechnologiesWeb Scraping Technologies
Web Scraping TechnologiesKrishna Sunuwar
 
Web Scraping With Python
Web Scraping With PythonWeb Scraping With Python
Web Scraping With PythonRobert Dempsey
 
Parse The Web Using Python+Beautiful Soup
Parse The Web Using Python+Beautiful SoupParse The Web Using Python+Beautiful Soup
Parse The Web Using Python+Beautiful SoupJim Chang
 
XPath for web scraping
XPath for web scrapingXPath for web scraping
XPath for web scrapingScrapinghub
 
Web Crawling Modeling with Scrapy Models #TDC2014
Web Crawling Modeling with Scrapy Models #TDC2014Web Crawling Modeling with Scrapy Models #TDC2014
Web Crawling Modeling with Scrapy Models #TDC2014Bruno Rocha
 
Web Scraping with Python
Web Scraping with PythonWeb Scraping with Python
Web Scraping with PythonPaul Schreiber
 
Python, web scraping and content management: Scrapy and Django
Python, web scraping and content management: Scrapy and DjangoPython, web scraping and content management: Scrapy and Django
Python, web scraping and content management: Scrapy and DjangoSammy Fung
 
電腦不只會幫你選土豆,還會幫你選新聞
電腦不只會幫你選土豆,還會幫你選新聞電腦不只會幫你選土豆,還會幫你選新聞
電腦不只會幫你選土豆,還會幫你選新聞Andy Dai
 

En vedette (20)

Downloading the internet with Python + Scrapy
Downloading the internet with Python + ScrapyDownloading the internet with Python + Scrapy
Downloading the internet with Python + Scrapy
 
Scraping the web with python
Scraping the web with pythonScraping the web with python
Scraping the web with python
 
Scrapy
ScrapyScrapy
Scrapy
 
How to Scrap Any Website's content using ScrapyTutorial of How to scrape (cra...
How to Scrap Any Website's content using ScrapyTutorial of How to scrape (cra...How to Scrap Any Website's content using ScrapyTutorial of How to scrape (cra...
How to Scrap Any Website's content using ScrapyTutorial of How to scrape (cra...
 
Crawling the web for fun and profit
Crawling the web for fun and profitCrawling the web for fun and profit
Crawling the web for fun and profit
 
Web crawler - Scrapy
Web crawler - Scrapy Web crawler - Scrapy
Web crawler - Scrapy
 
Java Script Based Client Server Webapps 2
Java Script Based Client Server Webapps 2Java Script Based Client Server Webapps 2
Java Script Based Client Server Webapps 2
 
快快樂樂學 Scrapy
快快樂樂學 Scrapy快快樂樂學 Scrapy
快快樂樂學 Scrapy
 
Web Scraping is BS
Web Scraping is BSWeb Scraping is BS
Web Scraping is BS
 
Python beautiful soup - bs4
Python beautiful soup - bs4Python beautiful soup - bs4
Python beautiful soup - bs4
 
Collecting web information with open source tools
Collecting web information with open source toolsCollecting web information with open source tools
Collecting web information with open source tools
 
Web Scraping Technologies
Web Scraping TechnologiesWeb Scraping Technologies
Web Scraping Technologies
 
Web Scraping With Python
Web Scraping With PythonWeb Scraping With Python
Web Scraping With Python
 
Parse The Web Using Python+Beautiful Soup
Parse The Web Using Python+Beautiful SoupParse The Web Using Python+Beautiful Soup
Parse The Web Using Python+Beautiful Soup
 
XPath for web scraping
XPath for web scrapingXPath for web scraping
XPath for web scraping
 
Web Crawling Modeling with Scrapy Models #TDC2014
Web Crawling Modeling with Scrapy Models #TDC2014Web Crawling Modeling with Scrapy Models #TDC2014
Web Crawling Modeling with Scrapy Models #TDC2014
 
Pydata-Python tools for webscraping
Pydata-Python tools for webscrapingPydata-Python tools for webscraping
Pydata-Python tools for webscraping
 
Web Scraping with Python
Web Scraping with PythonWeb Scraping with Python
Web Scraping with Python
 
Python, web scraping and content management: Scrapy and Django
Python, web scraping and content management: Scrapy and DjangoPython, web scraping and content management: Scrapy and Django
Python, web scraping and content management: Scrapy and Django
 
電腦不只會幫你選土豆,還會幫你選新聞
電腦不只會幫你選土豆,還會幫你選新聞電腦不只會幫你選土豆,還會幫你選新聞
電腦不只會幫你選土豆,還會幫你選新聞
 

Similaire à Web Scraping in Python with Scrapy: Collect Sushi Images

Crab - A Python Framework for Building Recommendation Systems
Crab - A Python Framework for Building Recommendation SystemsCrab - A Python Framework for Building Recommendation Systems
Crab - A Python Framework for Building Recommendation SystemsMarcel Caraciolo
 
Great Tools Heavily Used In Japan, You Don't Know.
Great Tools Heavily Used In Japan, You Don't Know.Great Tools Heavily Used In Japan, You Don't Know.
Great Tools Heavily Used In Japan, You Don't Know.Junichi Ishida
 
Python と Docker で mypy Playground を開発した話
Python と Docker で mypy Playground を開発した話Python と Docker で mypy Playground を開発した話
Python と Docker で mypy Playground を開発した話Yusuke Miyazaki
 
夜宴24期《这段时间》
夜宴24期《这段时间》夜宴24期《这段时间》
夜宴24期《这段时间》Koubei Banquet
 
Introduction to Crab - Python Framework for Building Recommender Systems
Introduction to Crab - Python Framework for Building Recommender SystemsIntroduction to Crab - Python Framework for Building Recommender Systems
Introduction to Crab - Python Framework for Building Recommender SystemsMarcel Caraciolo
 
或るWebサービス開発のこれから - "オープンWebサービス"という妄想 -
或るWebサービス開発のこれから - "オープンWebサービス"という妄想 -或るWebサービス開発のこれから - "オープンWebサービス"という妄想 -
或るWebサービス開発のこれから - "オープンWebサービス"という妄想 -Kei Shiratsuchi
 
Eating Fruit - Combining Robots & Apps
Eating Fruit - Combining Robots & AppsEating Fruit - Combining Robots & Apps
Eating Fruit - Combining Robots & AppsRobotGrrl
 
オブジェクトストレージの詳解とクラウドサービスを活かすスケーラブルなシステム開発
オブジェクトストレージの詳解とクラウドサービスを活かすスケーラブルなシステム開発オブジェクトストレージの詳解とクラウドサービスを活かすスケーラブルなシステム開発
オブジェクトストレージの詳解とクラウドサービスを活かすスケーラブルなシステム開発IIJ
 
Jenkins-Koji plugin presentation on Python & Ruby devel group @ Brno
Jenkins-Koji plugin presentation on Python & Ruby devel group @ BrnoJenkins-Koji plugin presentation on Python & Ruby devel group @ Brno
Jenkins-Koji plugin presentation on Python & Ruby devel group @ BrnoVaclav Tunka
 
Scraping Scripting Hacking
Scraping Scripting HackingScraping Scripting Hacking
Scraping Scripting HackingMike Ellis
 
Puppet Camp New York 2014: Streamlining Puppet Development Workflow
Puppet Camp New York 2014: Streamlining Puppet Development Workflow Puppet Camp New York 2014: Streamlining Puppet Development Workflow
Puppet Camp New York 2014: Streamlining Puppet Development Workflow Puppet
 
Steamlining your puppet development workflow
Steamlining your puppet development workflowSteamlining your puppet development workflow
Steamlining your puppet development workflowTomas Doran
 
20120524 english lt2_pythontoolsfortesting
20120524 english lt2_pythontoolsfortesting20120524 english lt2_pythontoolsfortesting
20120524 english lt2_pythontoolsfortestingKazuhiro Oinuma
 
Practical White Hat Hacker Training - Passive Information Gathering(OSINT)
Practical White Hat Hacker Training -  Passive Information Gathering(OSINT)Practical White Hat Hacker Training -  Passive Information Gathering(OSINT)
Practical White Hat Hacker Training - Passive Information Gathering(OSINT)PRISMA CSI
 
Kinect Workshop Part 1/2
Kinect Workshop Part 1/2Kinect Workshop Part 1/2
Kinect Workshop Part 1/2Seiya Konno
 
通过 Ktor 迅速打造以 Kotlin 为核心的后端服务应用
通过 Ktor 迅速打造以 Kotlin 为核心的后端服务应用通过 Ktor 迅速打造以 Kotlin 为核心的后端服务应用
通过 Ktor 迅速打造以 Kotlin 为核心的后端服务应用Shengyou Fan
 
CoffeeScript: The Good Parts
CoffeeScript: The Good PartsCoffeeScript: The Good Parts
CoffeeScript: The Good PartsC4Media
 
Package Management via Spack on SJTU π Supercomputer
Package Management via Spack on SJTU π SupercomputerPackage Management via Spack on SJTU π Supercomputer
Package Management via Spack on SJTU π SupercomputerJianwen Wei
 

Similaire à Web Scraping in Python with Scrapy: Collect Sushi Images (20)

Crab - A Python Framework for Building Recommendation Systems
Crab - A Python Framework for Building Recommendation SystemsCrab - A Python Framework for Building Recommendation Systems
Crab - A Python Framework for Building Recommendation Systems
 
儲かるドキュメント
儲かるドキュメント儲かるドキュメント
儲かるドキュメント
 
Great Tools Heavily Used In Japan, You Don't Know.
Great Tools Heavily Used In Japan, You Don't Know.Great Tools Heavily Used In Japan, You Don't Know.
Great Tools Heavily Used In Japan, You Don't Know.
 
Python と Docker で mypy Playground を開発した話
Python と Docker で mypy Playground を開発した話Python と Docker で mypy Playground を開発した話
Python と Docker で mypy Playground を開発した話
 
夜宴24期《这段时间》
夜宴24期《这段时间》夜宴24期《这段时间》
夜宴24期《这段时间》
 
Introduction to Crab - Python Framework for Building Recommender Systems
Introduction to Crab - Python Framework for Building Recommender SystemsIntroduction to Crab - Python Framework for Building Recommender Systems
Introduction to Crab - Python Framework for Building Recommender Systems
 
或るWebサービス開発のこれから - "オープンWebサービス"という妄想 -
或るWebサービス開発のこれから - "オープンWebサービス"という妄想 -或るWebサービス開発のこれから - "オープンWebサービス"という妄想 -
或るWebサービス開発のこれから - "オープンWebサービス"という妄想 -
 
Eating Fruit - Combining Robots & Apps
Eating Fruit - Combining Robots & AppsEating Fruit - Combining Robots & Apps
Eating Fruit - Combining Robots & Apps
 
オブジェクトストレージの詳解とクラウドサービスを活かすスケーラブルなシステム開発
オブジェクトストレージの詳解とクラウドサービスを活かすスケーラブルなシステム開発オブジェクトストレージの詳解とクラウドサービスを活かすスケーラブルなシステム開発
オブジェクトストレージの詳解とクラウドサービスを活かすスケーラブルなシステム開発
 
Jenkins-Koji plugin presentation on Python & Ruby devel group @ Brno
Jenkins-Koji plugin presentation on Python & Ruby devel group @ BrnoJenkins-Koji plugin presentation on Python & Ruby devel group @ Brno
Jenkins-Koji plugin presentation on Python & Ruby devel group @ Brno
 
Scraping Scripting Hacking
Scraping Scripting HackingScraping Scripting Hacking
Scraping Scripting Hacking
 
Puppet Camp New York 2014: Streamlining Puppet Development Workflow
Puppet Camp New York 2014: Streamlining Puppet Development Workflow Puppet Camp New York 2014: Streamlining Puppet Development Workflow
Puppet Camp New York 2014: Streamlining Puppet Development Workflow
 
Steamlining your puppet development workflow
Steamlining your puppet development workflowSteamlining your puppet development workflow
Steamlining your puppet development workflow
 
20120524 english lt2_pythontoolsfortesting
20120524 english lt2_pythontoolsfortesting20120524 english lt2_pythontoolsfortesting
20120524 english lt2_pythontoolsfortesting
 
Practical White Hat Hacker Training - Passive Information Gathering(OSINT)
Practical White Hat Hacker Training -  Passive Information Gathering(OSINT)Practical White Hat Hacker Training -  Passive Information Gathering(OSINT)
Practical White Hat Hacker Training - Passive Information Gathering(OSINT)
 
Kinect Workshop Part 1/2
Kinect Workshop Part 1/2Kinect Workshop Part 1/2
Kinect Workshop Part 1/2
 
通过 Ktor 迅速打造以 Kotlin 为核心的后端服务应用
通过 Ktor 迅速打造以 Kotlin 为核心的后端服务应用通过 Ktor 迅速打造以 Kotlin 为核心的后端服务应用
通过 Ktor 迅速打造以 Kotlin 为核心的后端服务应用
 
Web Scrapping Using Python
Web Scrapping Using PythonWeb Scrapping Using Python
Web Scrapping Using Python
 
CoffeeScript: The Good Parts
CoffeeScript: The Good PartsCoffeeScript: The Good Parts
CoffeeScript: The Good Parts
 
Package Management via Spack on SJTU π Supercomputer
Package Management via Spack on SJTU π SupercomputerPackage Management via Spack on SJTU π Supercomputer
Package Management via Spack on SJTU π Supercomputer
 

Dernier

From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 

Dernier (20)

From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 

Web Scraping in Python with Scrapy: Collect Sushi Images

  • 1. Web Scraping in Python with Scrapy Kota Kato @orangain 2015-09-08, 鮨会
  • 2. Who am I? • Kota Kato • @orangain • Software Engineer • Interested in automation such as Jenkins, Chef, Docker etc.
  • 3. Definition: Web Scraping • Web scraping (web harvesting or web data extraction) is a computer software technique of extracting information from websites. Web scraping - Wikipedia, the free encyclopedia
 https://en.wikipedia.org/wiki/Web_scraping
  • 4. eBook-1 • Cross-store search engine for ebooks. • Retrieve ebook data from 9 ebook stores. http://ebook-1.com/
  • 5. QB Meter • Visualize crowdedness of QB HOUSE, 10 minutes barbershop. • Retrieve crowdedness from QB HOUSE's Web site every 5 minutes. http://qbmeter.capybala.com/
  • 6. Prototype of Glance • Prototype of simple news app like newspaper. • Retrieve news from NHK NEWS WEB 4 times per a day.
  • 7. Pokedos • Web app to find nearest bus stops to see the arrival information of buses. • Retrieve location of the all bus stops in Kyoto- city. http://bus.capybala.com/
  • 8. Why Web Scraping? • For Web Developer: • Develop mash-up application. • For Data Analyst: • Retrieve data to analyze. • For Everybody: • Automate operation of web sites.
  • 9. Why Use Python? • Easy to use • Powerful libraries, especially Scrapy • Seamlessness between data processing and developing application
  • 10. Web Scraping in Python • Combination of lightweight libraries: • Retrieving: Requests • Scraping: lxml, Beautiful Soup • Full stack framework: • Scrapy Today's topic
  • 12. Scrapy • Fast, simple and extensible Web scraping framework in Python • Currently compatible only with Python 2.7 • In-progress Python 3 support • Maintained by Scrapinghub • BSD License http://scrapy.org/
  • 13. Why Use Scrapy? • Annoying stuffs in crawling and scraping are done by Scrapy. Extracting Links Throttling Concurrency robots.txt and <meta> Tags XML Sitemaps Filtering Duplicated URLs Retry on Error Job Control
  • 14. Getting Started with Scrapy $ pip install scrapy $ cat > myspider.py <<EOF import scrapy class BlogSpider(scrapy.Spider): name = 'blogspider' start_urls = ['http://blog.scrapinghub.com'] def parse(self, response): for url in response.css('ul li a::attr("href")').re(r'.*/dddd/dd/$'): yield scrapy.Request(response.urljoin(url), self.parse_titles) def parse_titles(self, response): for post_title in response.css('div.entries > ul > li a::text').extract(): yield {'title': post_title} EOF $ scrapy runspider myspider.py http://scrapy.org/Requirements: Python 2.7, libxml2 and libxslt
  • 16. Create a Scrapy Project $ scrapy startproject sushibot $ tree sushibot/ sushibot/ !"" scrapy.cfg #"" sushibot !"" __init__.py !"" items.py !"" pipelines.py !"" settings.py #"" spiders #"" __init__.py 2 directories, 6 files
  • 17. Generate a Spider $ cd sushibot $ scrapy genspider sushi api.flickr.com $ cat sushibot/spiders/sushi.py # -*- coding: utf-8 -*- import scrapy class SushiSpider(scrapy.Spider): name = "sushi" allowed_domains = ["api.flickr.com"] start_urls = ( 'http://www.api.flickr.com/', ) def parse(self, response): pass
  • 18. Flickr API to Search Photos $ curl 'https://api.flickr.com/services/rest/? method=flickr.photos.search&api_key=******&text=sushi&sort=relevance ' > photos.xml $ cat photos.xml <?xml version="1.0" encoding="utf-8" ?> <rsp stat="ok"> <photos page="1" pages="871" perpage="100" total="87088"> <photo id="4794344495" owner="38553162@N00" secret="d907790937" server="4093" farm="5" title="Sushi!" ispublic="1" isfriend="0" isfamily="0" /> <photo id="8486536177" owner="78779574@N00" secret="f77b824ebb" server="8382" farm="9" title="Best Salmon Sushi" ispublic="1" isfriend="0" isfamily="0" /> ... https://www.flickr.com/services/api/flickr.photos.search.html
  • 19. Construct Photo's URL <photo id="4794344495" owner="38553162@N00" secret="d907790937" server="4093" farm="5" title="Sushi!" ispublic="1" isfriend="0" isfamily="0" /> https://farm{farm-id}.staticflickr.com/{server-id}/{id}_{secret} _[mstzb].jpg https://farm5.staticflickr.com/4093/4794344495_d907790937_b.jpg https://www.flickr.com/services/api/misc.urls.html Photo element: Photo's URL template: Result:
  • 20. spider/sushi.py (Modified) # -*- coding: utf-8 -*- import os import scrapy from sushibot.items import SushibotItem class SushiSpider(scrapy.Spider): name = "sushi" allowed_domains = ["api.flickr.com", "staticflickr.com"] start_urls = ( 'https://api.flickr.com/services/rest/?method=flickr.photos.search&api_key=' + os.environ['FLICKR_KEY'] + '&text=sushi&sort=relevance', ) def parse(self, response): for photo in response.css('photo'): yield scrapy.Request(photo_url(photo), self.handle_image) def handle_image(self, response): return SushibotItem(url=response.url, body=response.body) def photo_url(photo): return 'https://farm{farm}.staticflickr.com/{server}/{id}_{secret}_{size}.jpg'.format( farm=photo.xpath('@farm').extract_first(), server=photo.xpath('@server').extract_first(), id=photo.xpath('@id').extract_first(), secret=photo.xpath('@secret').extract_first(), size='b', )
  • 22. items.py # -*- coding: utf-8 -*- from pprint import pformat import scrapy class SushibotItem(scrapy.Item): url = scrapy.Field() body = scrapy.Field() def __str__(self): return pformat({ 'url': self['url'], 'body': self['body'][:10] + '...', })
  • 23. pipelines.py # -*- coding: utf-8 -*- import os class SaveImagePipeline(object): def process_item(self, item, spider): output_dir = 'images' if not os.path.exists(output_dir): os.makedirs(output_dir) filename = item['url'].split('/')[-1] with open(os.path.join(output_dir, filename), 'wb') as f: f.write(item['body']) return item
  • 24. settings.py • Appended settings: # Crawl responsibly by identifying yourself (and your website) on the user-agent USER_AGENT = 'sushibot (+orangain@gmail.com)' # Configure a delay for requests for the same website (default: 0) # See http://scrapy.readthedocs.org/en/latest/topics/ settings.html#download-delay # See also autothrottle settings and docs DOWNLOAD_DELAY = 1 # Configure item pipelines # See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html ITEM_PIPELINES = { 'sushibot.pipelines.SaveImagePipeline': 300, }
  • 25. Run Spider $ FLICKR_KEY=********** scrapy crawl sushi NOTE: Provide Flickr's API key with environment variables.
  • 26. Thank you! • Web scraping has power to propose improvement. • Source code is available at
 https://github.com/orangain/sushibot @orangain