Web Scraping in Python with Scrapy: Collect Sushi Images

Web Scraping in Python with Scrapy
Kota Kato
@orangain
2015-09-08, 鮨会

Who am I?
• Kota Kato
• @orangain
• Software Engineer
• Interested in automation tools such as Jenkins, Chef, and Docker

Definition: Web Scraping
• Web scraping (web harvesting or web data extraction) is a computer software technique of extracting information from websites.
Web scraping - Wikipedia, the free encyclopedia
https://en.wikipedia.org/wiki/Web_scraping

eBook-1
• Cross-store search engine for ebooks.
• Retrieve ebook data from 9 ebook stores.
http://ebook-1.com/

QB Meter
• Visualizes how crowded QB HOUSE (a 10-minute barbershop) is.
• Retrieves crowdedness data from QB HOUSE's website every 5 minutes.
http://qbmeter.capybala.com/

Prototype of Glance
• Prototype of a simple, newspaper-like news app.
• Retrieves news from NHK NEWS WEB 4 times a day.

Pokedos
• Web app to find the nearest bus stops and see bus arrival information.
• Retrieves the locations of all bus stops in Kyoto City.
http://bus.capybala.com/

Why Web Scraping?
• For web developers:
  • Develop mash-up applications.
• For data analysts:
  • Retrieve data to analyze.
• For everybody:
  • Automate operations on web sites.

Why Use Python?
• Easy to use
• Powerful libraries, especially Scrapy
• Seamless transition between data processing and application development

Web Scraping in Python
• Combination of lightweight libraries:
  • Retrieving: Requests
  • Scraping: lxml, Beautiful Soup
• Full-stack framework:
  • Scrapy (today's topic)

Scrapy
• Fast, simple and extensible Web scraping framework in Python
• Currently compatible only with Python 2.7
  • Python 3 support is in progress
• Maintained by Scrapinghub
• BSD License
http://scrapy.org/

Why Use Scrapy?
• Scrapy takes care of the tedious parts of crawling and scraping:
  • Extracting links
  • Throttling
  • Concurrency
  • robots.txt and <meta> tags
  • XML sitemaps
  • Filtering duplicated URLs
  • Retry on error
  • Job control
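Several of these features are switched on or tuned through project settings. The fragment below is a minimal sketch of what such a settings.py excerpt might look like. The setting names are Scrapy's own, but every value, including the JOBDIR path, is an illustrative assumption and not taken from the sushibot project.

```python
# Hypothetical excerpt of a project's settings.py; all values are illustrative.

# robots.txt: obey the rules each site publishes.
ROBOTSTXT_OBEY = True

# Concurrency: cap simultaneous requests overall and per domain.
CONCURRENT_REQUESTS = 8
CONCURRENT_REQUESTS_PER_DOMAIN = 2

# Throttling: a fixed delay between requests, plus the AutoThrottle extension.
DOWNLOAD_DELAY = 1
AUTOTHROTTLE_ENABLED = True

# Retry on error: retry a failed request a limited number of times.
RETRY_ENABLED = True
RETRY_TIMES = 2

# Job control: persist crawl state so a crawl can be paused and resumed.
JOBDIR = 'crawls/sushi-1'  # assumed path, chosen for this sketch
```

Duplicate URL filtering and link extraction need no setting here, since Scrapy applies its duplicate filter to requests by default.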

Getting Started with Scrapy
$ pip install scrapy
$ cat > myspider.py <<EOF
import scrapy

class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['http://blog.scrapinghub.com']

    def parse(self, response):
        # Follow links whose URL ends in /YYYY/MM/ (monthly archive pages).
        for url in response.css('ul li a::attr("href")').re(r'.*/\d\d\d\d/\d\d/$'):
            yield scrapy.Request(response.urljoin(url), self.parse_titles)

    def parse_titles(self, response):
        # Yield one item per post title found on the archive page.
        for post_title in response.css('div.entries > ul > li a::text').extract():
            yield {'title': post_title}
EOF
$ scrapy runspider myspider.py
Requirements: Python 2.7, libxml2 and libxslt
http://scrapy.org/
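The link filter in parse can be tried out without running a crawl. The sketch below assumes the pattern is meant to match URLs ending in a four-digit year and a two-digit month, and it approximates the selector's .re() call with the standard re module. The helper name is_archive_link and the example URLs are hypothetical.

```python
import re

# Assumed intent of the spider's pattern: URLs ending in /YYYY/MM/.
ARCHIVE_LINK = re.compile(r'.*/\d\d\d\d/\d\d/$')


def is_archive_link(url):
    """Return True if the URL ends with a four-digit year and a two-digit month."""
    return ARCHIVE_LINK.match(url) is not None


# Hypothetical example URLs, not taken from the real blog.
candidates = [
    'http://blog.example.com/2015/09/',
    'http://blog.example.com/about/',
    'http://blog.example.com/2015/09/some-post',
]
archive_links = [url for url in candidates if is_archive_link(url)]
print(archive_links)
```

Only the first candidate passes, because the pattern is anchored at the end of the URL.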

Create a Scrapy Project
$ scrapy startproject sushibot
$ tree sushibot/
sushibot/
├── scrapy.cfg
└── sushibot
    ├── __init__.py
    ├── items.py
    ├── pipelines.py
    ├── settings.py
    └── spiders
        └── __init__.py

2 directories, 6 files

Generate a Spider
$ cd sushibot
$ scrapy genspider sushi api.flickr.com
$ cat sushibot/spiders/sushi.py
# -*- coding: utf-8 -*-
import scrapy


class SushiSpider(scrapy.Spider):
    name = "sushi"
    allowed_domains = ["api.flickr.com"]
    start_urls = (
        'http://www.api.flickr.com/',
    )

    def parse(self, response):
        pass

Flickr API to Search Photos
$ curl 'https://api.flickr.com/services/rest/?method=flickr.photos.search&api_key=******&text=sushi&sort=relevance' > photos.xml
$ cat photos.xml
<?xml version="1.0" encoding="utf-8" ?>
<rsp stat="ok">
<photos page="1" pages="871" perpage="100" total="87088">
  <photo id="4794344495" owner="38553162@N00" secret="d907790937" server="4093" farm="5" title="Sushi!" ispublic="1" isfriend="0" isfamily="0" />
  <photo id="8486536177" owner="78779574@N00" secret="f77b824ebb" server="8382" farm="9" title="Best Salmon Sushi" ispublic="1" isfriend="0" isfamily="0" />
  ...
https://www.flickr.com/services/api/flickr.photos.search.html
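The same request and response can be handled with the Python standard library alone. This is a minimal sketch, not part of sushibot: search_url and parse_search_response are hypothetical helper names, and SAMPLE_RESPONSE is a trimmed copy of the output above with the closing tags added so that it parses.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

API_ENDPOINT = 'https://api.flickr.com/services/rest/'


def search_url(api_key, text):
    """Build a flickr.photos.search URL with the same parameters as the curl call."""
    params = [
        ('method', 'flickr.photos.search'),
        ('api_key', api_key),
        ('text', text),
        ('sort', 'relevance'),
    ]
    return API_ENDPOINT + '?' + urlencode(params)


def parse_search_response(xml_bytes):
    """Return (number of result pages, list of photo titles) from a response."""
    root = ET.fromstring(xml_bytes)
    if root.get('stat') != 'ok':
        raise ValueError('Flickr API returned stat=%r' % root.get('stat'))
    photos = root.find('photos')
    titles = [photo.get('title') for photo in photos.findall('photo')]
    return int(photos.get('pages')), titles


# Trimmed sample: only the two photos shown on the slide.
SAMPLE_RESPONSE = b"""<?xml version="1.0" encoding="utf-8" ?>
<rsp stat="ok">
<photos page="1" pages="871" perpage="100" total="87088">
  <photo id="4794344495" owner="38553162@N00" secret="d907790937"
         server="4093" farm="5" title="Sushi!" ispublic="1" isfriend="0"
         isfamily="0" />
  <photo id="8486536177" owner="78779574@N00" secret="f77b824ebb"
         server="8382" farm="9" title="Best Salmon Sushi" ispublic="1"
         isfriend="0" isfamily="0" />
</photos>
</rsp>"""

pages, titles = parse_search_response(SAMPLE_RESPONSE)
print(search_url('YOUR_API_KEY', 'sushi'))
print(pages, titles)
```

Building the query with urlencode avoids hand-escaping the search text if it ever contains spaces or non-ASCII characters.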

Construct Photo's URL
Photo element:
<photo id="4794344495" owner="38553162@N00" secret="d907790937" server="4093" farm="5" title="Sushi!" ispublic="1" isfriend="0" isfamily="0" />
Photo's URL template:
https://farm{farm-id}.staticflickr.com/{server-id}/{id}_{secret}_[mstzb].jpg
Result:
https://farm5.staticflickr.com/4093/4794344495_d907790937_b.jpg
https://www.flickr.com/services/api/misc.urls.html
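The substitution from photo element to URL can be checked in isolation. The following is a minimal sketch that uses the standard library's ElementTree in place of Scrapy selectors. The standalone photo_url signature with a size argument is an assumption made for this example.

```python
import xml.etree.ElementTree as ET

# Template from the slide, with the size suffix made a named field.
URL_TEMPLATE = 'https://farm{farm}.staticflickr.com/{server}/{id}_{secret}_{size}.jpg'


def photo_url(photo_xml, size='b'):
    """Fill the URL template from the attributes of one <photo> element."""
    photo = ET.fromstring(photo_xml)
    return URL_TEMPLATE.format(
        farm=photo.get('farm'),
        server=photo.get('server'),
        id=photo.get('id'),
        secret=photo.get('secret'),
        size=size,  # one of the suffixes m, s, t, z, b from the template
    )


# The photo element shown on the slide.
PHOTO = ('<photo id="4794344495" owner="38553162@N00" secret="d907790937" '
         'server="4093" farm="5" title="Sushi!" ispublic="1" isfriend="0" '
         'isfamily="0" />')

print(photo_url(PHOTO))
```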

spiders/sushi.py (Modified)
# -*- coding: utf-8 -*-
import os

import scrapy

from sushibot.items import SushibotItem


class SushiSpider(scrapy.Spider):
    name = "sushi"
    allowed_domains = ["api.flickr.com", "staticflickr.com"]
    start_urls = (
        'https://api.flickr.com/services/rest/?method=flickr.photos.search&api_key=' +
        os.environ['FLICKR_KEY'] + '&text=sushi&sort=relevance',
    )

    def parse(self, response):
        # Request the image file for every <photo> element in the search result.
        for photo in response.css('photo'):
            yield scrapy.Request(photo_url(photo), self.handle_image)

    def handle_image(self, response):
        # Wrap the downloaded image in an item for the pipeline to save.
        return SushibotItem(url=response.url, body=response.body)


def photo_url(photo):
    # Build the image URL from the attributes of a <photo> element.
    return 'https://farm{farm}.staticflickr.com/{server}/{id}_{secret}_{size}.jpg'.format(
        farm=photo.xpath('@farm').extract_first(),
        server=photo.xpath('@server').extract_first(),
        id=photo.xpath('@id').extract_first(),
        secret=photo.xpath('@secret').extract_first(),
        size='b',
    )

Scrapy's Architecture
http://doc.scrapy.org/en/1.0/topics/architecture.html

items.py
# -*- coding: utf-8 -*-
from pprint import pformat

import scrapy


class SushibotItem(scrapy.Item):
    url = scrapy.Field()
    body = scrapy.Field()

    def __str__(self):
        # Truncate the image body so that log output stays readable.
        return pformat({
            'url': self['url'],
            'body': self['body'][:10] + '...',
        })

pipelines.py
# -*- coding: utf-8 -*-
import os


class SaveImagePipeline(object):

    def process_item(self, item, spider):
        output_dir = 'images'
        if not os.path.exists(output_dir):
            os.makedirs(output_dir)

        # Use the last path segment of the URL as the file name.
        filename = item['url'].split('/')[-1]
        with open(os.path.join(output_dir, filename), 'wb') as f:
            f.write(item['body'])

        return item
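A pipeline is a plain class, so its logic can be exercised outside a crawl. The sketch below is a variant of the pipeline above, not the author's code: the output_dir constructor argument is an assumption added so the example can write to a temporary directory, and a plain dict stands in for SushibotItem.

```python
import os
import tempfile


class SaveImagePipeline(object):
    """Variant of the slide's pipeline with a configurable output directory."""

    def __init__(self, output_dir='images'):
        # Hypothetical argument; the default keeps the original behavior.
        self.output_dir = output_dir

    def process_item(self, item, spider):
        if not os.path.exists(self.output_dir):
            os.makedirs(self.output_dir)

        # Use the last path segment of the URL as the file name.
        filename = item['url'].split('/')[-1]
        with open(os.path.join(self.output_dir, filename), 'wb') as f:
            f.write(item['body'])

        return item


# Exercise the pipeline with a dict in place of a SushibotItem.
output_dir = os.path.join(tempfile.mkdtemp(), 'images')
item = {
    'url': 'https://farm5.staticflickr.com/4093/4794344495_d907790937_b.jpg',
    'body': b'fake image bytes',  # placeholder content, not a real JPEG
}
result = SaveImagePipeline(output_dir).process_item(item, spider=None)
saved_path = os.path.join(output_dir, '4794344495_d907790937_b.jpg')
print(os.path.exists(saved_path))
```

Returning the item from process_item matters because Scrapy passes the returned value on to any later pipeline stage.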

settings.py
• Appended settings:
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'sushibot (+orangain@gmail.com)'

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 1

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'sushibot.pipelines.SaveImagePipeline': 300,
}

Run Spider
$ FLICKR_KEY=********** scrapy crawl sushi
NOTE: Provide the Flickr API key via the FLICKR_KEY environment variable.
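If FLICKR_KEY is not set, a direct os.environ['FLICKR_KEY'] lookup fails with a bare KeyError. One possible way to give a clearer message is sketched below. The helper flickr_key is hypothetical and is not part of the sushibot source.

```python
import os


def flickr_key(environ=os.environ):
    """Return the Flickr API key, or raise an error that explains how to set it."""
    key = environ.get('FLICKR_KEY')
    if not key:
        raise RuntimeError(
            'FLICKR_KEY is not set; run as: FLICKR_KEY=<your key> scrapy crawl sushi'
        )
    return key


# Passing a dict instead of os.environ keeps the example independent of the shell.
print(flickr_key({'FLICKR_KEY': 'example-key'}))
```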

Thank you!
• Web scraping has the power to suggest improvements.
• Source code is available at https://github.com/orangain/sushibot
@orangain
