Crawling the web for fun and profit

Federico Feroldi

“A Web crawler is a computer
program that browses the World
Wide Web in a methodical,
automated manner.”
Wikipedia
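
To make the definition concrete, here is a minimal sketch of such a program, using only the Python standard library. It is an illustration under assumptions, not code from the talk: the names `LinkParser`, `extract_links` and `crawl` are made up for this example, and the `fetch` argument is a hypothetical callable that returns a page's HTML (or `None` on failure), so the crawl logic can be tried without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(base_url, html):
    """Return absolute, fragment-free URLs linked from the page."""
    parser = LinkParser()
    parser.feed(html)
    return [urldefrag(urljoin(base_url, href)).url for href in parser.links]


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl. `fetch(url)` returns HTML text or None."""
    queue = deque([start_url])   # URLs waiting to be visited
    seen = {start_url}           # URLs already queued, to avoid loops
    pages = {}                   # url -> html for every page fetched
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        for link in extract_links(url, html):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

Passing a dictionary's `get` method as `fetch` is enough to try it on an in-memory "site". A crawler pointed at real sites would also need to honor robots.txt and limit its request rate, which this sketch leaves out.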


Search engines only show you
what their crawlers can catch


The deep web contains a
lot of valuable information

e-commerce, finance, transportation, yellow pages, medicine,
government, opinions, real estate, personal, intranets, social

Dig deeper with
your own crawler

Information
=
Competitive
Advantage

Back up historical
data: web sites, blogs

Social network analysis: find
influencers and interests
based on “social circles”
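
One simple way to approximate "influencers" from crawled data is to count followers per account (the in-degree of the follow graph). The sketch below assumes the crawl produced `(follower, followed)` pairs. The function name `top_influencers` and the sample account names are hypothetical, and follower count is only one of many possible influence measures.

```python
from collections import Counter


def top_influencers(follows, n=3):
    """Rank accounts by follower count (in-degree) in a follow graph.

    `follows` is an iterable of (follower, followed) pairs.
    """
    unique_edges = set(follows)  # ignore duplicate edges from re-crawls
    in_degree = Counter(followed for _, followed in unique_edges)
    # Sort by follower count (descending), then by name for stable output.
    ranked = sorted(in_degree.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:n]


edges = [
    ("ann", "bob"), ("cat", "bob"), ("dan", "bob"),
    ("ann", "cat"), ("bob", "cat"),
    ("ann", "bob"),  # duplicate, counted once
]
print(top_influencers(edges, n=2))
```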

Sentiment analysis: find
what people say about
your brand or product
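
As a rough illustration of the idea, the sketch below scores crawled posts that mention a brand by counting words from two tiny hand-made lists. Everything here is an assumption for demonstration: the word lists, the names `sentiment_score` and `mentions`, and the brand "Acme" are invented, and real sentiment analysis needs far richer models than word counting.

```python
import re

# Toy word lists, assumed for illustration only.
POSITIVE = {"love", "great", "good", "fast", "recommend"}
NEGATIVE = {"hate", "bad", "slow", "broken", "terrible"}


def sentiment_score(text):
    """Count positive words minus negative words in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum((w in POSITIVE) - (w in NEGATIVE) for w in words)


def mentions(posts, brand):
    """Return (post, score) for every post that mentions the brand."""
    brand = brand.lower()
    return [(p, sentiment_score(p)) for p in posts if brand in p.lower()]


posts = ["I love Acme, great product", "Acme is slow and broken", "Nice weather"]
for post, score in mentions(posts, "Acme"):
    print(score, post)
```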

Personal data and
online reputation

Do It Yourself


Anybody can build
a search engine

Scrapy architecture
[Diagram. Components: Scrapy Engine, Scheduler, Downloader, Internet, Spider, Item pipeline, Data. Arrow labels: Requests, Responses, Items.]
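
The data flow between the components named on this slide (scheduler, downloader, spider, item pipeline, with an engine in the middle) can be imitated in a few lines of plain Python. This is a toy model written for illustration and is not Scrapy's API: `run_engine`, the `("request", ...)` and `("item", ...)` tuples, and the callables passed in are all assumptions of this sketch.

```python
from collections import deque


def run_engine(start_urls, download, spider, pipeline):
    """Toy engine loop: scheduler -> downloader -> spider -> pipeline.

    download(url)         -> response (any object)
    spider(url, response) -> iterable of ("request", url) or ("item", data)
    pipeline(item)        -> stores or processes one scraped item
    """
    scheduler = deque(start_urls)   # queue of pending requests
    seen = set(start_urls)          # avoid scheduling a URL twice
    while scheduler:
        url = scheduler.popleft()               # next request from the scheduler
        response = download(url)                # downloader fetches the page
        for kind, value in spider(url, response):   # spider parses the response
            if kind == "request" and value not in seen:
                seen.add(value)
                scheduler.append(value)         # new requests go back to the scheduler
            elif kind == "item":
                pipeline(value)                 # items go to the item pipeline
```

The point of the split is that each part can be swapped independently: the spider only knows how to turn a response into items and follow-up requests, while queuing, fetching and storage live elsewhere.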

Twitter social graph crawler
with Scrapy in 150 LOC

The Web is much bigger
than what you can search
with Google

Thank you

federico@cloudify.me

twitter.com/cloudify
