Ruby Robots: Guide to Web Scraping and Automation

Ruby
Robots

Daniel Cukier
@danicuki
http://www.flickr.com/photos/flysi/183272970

Relatives

• spiders
• crawlers
• bots

http://www.flickr.com/photos/nhankamer/5016628611

require 'anemone'

Anemone.crawl(url) do |anemone|
  anemone.on_every_page do |page|
    puts page.url
  end
end

http://www.cantora.mus.br/
http://www.cantora.mus.br/fotos
http://www.cantora.mus.br/?locale=en
http://www.cantora.mus.br/?locale=pt-BR
http://www.cantora.mus.br/musicas
http://www.cantora.mus.br/videos
http://www.cantora.mus.br/agenda
http://www.cantora.mus.br/novidades
http://www.cantora.mus.br/musicas/baixar
http://www.cantora.mus.br/visitors/baixar
http://www.cantora.mus.br/social
http://www.cantora.mus.br/fotos?locale=pt-BR
http://www.cantora.mus.br/musicas?locale=en
http://www.cantora.mus.br/fotos?locale=en

XPath
<html>
...
<div class="bla">
<a>legal</a>
</div>
...
</html>

html_doc = Nokogiri::HTML(html)
info = html_doc.xpath("//div[@class='bla']/a")
info.text
=> "legal"

XPath
<table id="super">
  <tr>
    <td>L1C1</td>
    <td>L1C2</td>
  </tr>
  <tr>
    <td>L2C1</td>
    <td>L2C2</td>
  </tr>
  <tr>
    <td>L3C1</td>
    <td>L3C2</td>
  </tr>
</table>

>> html_doc = Nokogiri::HTML(html)
>> info = html_doc.xpath("//table[@id='super']/tr")
>> info.size
=> 3
>> info[0].xpath("td").size
=> 2
>> info[2].xpath("td")[1].text
=> "L3C2"

GET

http://www.flickr.com/photos/amortize/766738216

http://www.flickr.com/photos/abbeychristine/223898960

Good bot

/robots.txt

User-agent: *
Disallow:
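
A good bot reads /robots.txt before it crawls. The sketch below is one minimal way to honor the rules for "User-agent: *" using only prefix matching. The method names (disallowed_prefixes, allowed?) and the sample policies are made up for illustration, and this is not a complete robots.txt parser.

```ruby
# Collect the Disallow prefixes that apply to the given user agent.
# Simplified: handles only plain "User-agent" and "Disallow" lines.
def disallowed_prefixes(robots_txt, agent = "*")
  prefixes = []
  applies = false
  robots_txt.each_line do |line|
    line = line.sub(/#.*/, "").strip          # drop comments and whitespace
    next if line.empty?
    field, value = line.split(":", 2).map(&:strip)
    next if value.nil?                        # ignore lines without a colon
    case field.downcase
    when "user-agent"
      applies = (value == agent)
    when "disallow"
      # An empty Disallow value means nothing is disallowed.
      prefixes << value if applies && !value.empty?
    end
  end
  prefixes
end

# A path is allowed when no Disallow prefix matches its beginning.
def allowed?(robots_txt, path)
  disallowed_prefixes(robots_txt).none? { |prefix| path.start_with?(prefix) }
end

open_policy   = "User-agent: *\nDisallow:\n"    # the policy shown on the slide
closed_policy = "User-agent: *\nDisallow: /\n"  # hypothetical stricter policy

puts allowed?(open_policy, "/fotos")
puts allowed?(closed_policy, "/fotos")
```

With the slide's policy (an empty Disallow) every path is allowed. The stricter sample blocks everything.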

http://www.flickr.com/photos/temily/5645585162

http://www.flickr.com/photos/nephelim/5632618462

>> body = RestClient.get(url)
>> json = JSON.parse(body)
>> content = json["Content"]
>> content.size
=> 16
AHA!!!
http://.../artistas?maxRowsList=1600&filter=Recentes
>> body = RestClient.get(url)
>> json = JSON.parse(body)
>> content = json["Content"]
>> content.size
=> 1600

http://.../artistas?maxRowsList=1600000&filter=Recentes
>> content.size
=> 9154

Bingo!!!

>> json["Content"].map {|c| c["ProfileUrl"]}
=> ["caravella", "tomleite", "jeffersontavares", "rodrigoaraujo",
"jorgemendes", "bossapunk", "daviddepiro", "freetools", "ironia",
"tiagorosa", "outprofile", "lucianokoscky",
"bandateatraldecarona", "tlounge", "almanaque", "razzyoficial",
"cretinosecanalhas", "cincorios", "ninoantunes", "caiocorsalette",
"alinedelima", "thelio", "grupodomdesamba", "ladoz",
"alexandrepontes", "poeiradgua", "betimalu", "leonardobessa",
"kamaross", "marcusdocavaco", "atividadeinformal", "angelkeys",
"locojohn", "forcamusic", "tiaguinhoabreu", "marcelonegrao",
"jstonemghiphop", "uniaoglobal", "bandaefex", "severarock",
"manitu", "sasso", "kakka", "xsopretty", "belepoke", "caixaazul",
"wknd", "bandastarven", "bleiamusic", "3porcentoaocubo",
"lucianoterra", "hipnoia", "influencianegra", "bandaursamaior",
"mariafreitas", "jessejames", "vagnerrockxe", "stageo3",
"lemoneight", "innocence", "dinda", "marcelocapela",
"paulocamoeseoslusiadas", "magnussrock", "bandatheburk",
"mercantes", "bandaturnerock", "flaviasaolli", "tonysagga",
"thiagoponde", "centeio", "grupodeubranco", "bocadeleao",
"eusoueliascardan", "notoriaoficial", "planomasterrock", "rofgod",
"dreemonphc", "chicobrant", "osz", "bandalightspeed",
"cavernadenarnia", "sergiobenevenuto", "viniciusdeoliveira", ...]
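
The trick above, raising maxRowsList and then mapping over "Content", can be sketched with the standard library alone. The host is elided in the slides, so BASE below is a placeholder, and listing_url, profile_urls and the sample payload are illustrative names and data, not the site's real API.

```ruby
require 'json'
require 'uri'

# Placeholder base URL; the real host is not shown in the slides.
BASE = "http://example.com/artistas"

# Build the listing URL with the page-size parameter seen in the slides.
def listing_url(max_rows, filter = "Recentes")
  "#{BASE}?#{URI.encode_www_form(maxRowsList: max_rows, filter: filter)}"
end

# Pull every "ProfileUrl" out of the "Content" array of a response body.
def profile_urls(body)
  JSON.parse(body).fetch("Content", []).map { |c| c["ProfileUrl"] }
end

# Tiny hand-written payload standing in for a real response.
sample = '{"Content":[{"ProfileUrl":"caravella"},{"ProfileUrl":"tomleite"}]}'

puts listing_url(1600)
p profile_urls(sample)
```

In a real session the body would come from RestClient.get(listing_url(1600)), as on the slides.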

>> html = RestClient.get("http://.../robomacaco")
>> html_doc = Nokogiri::HTML(html)
>> info = html_doc.xpath("//span[@class='name']")
>> info.text
=> "robo-macaco@hotmail.com
RIO DE JANEIRO - RJ - Brasil
21 9675-0199

cookies

cookies = {}
c = "s_nr=12954999; s_v19=12978609471; ... __utmc=206845458"
cook = c.split(";").map {|i| i.strip.split("=")}
cook.each {|u| cookies[u[0]] = u[1]}

RestClient.get(url, :cookies => cookies)
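
A slightly more defensive version of the cookie parsing above might look like this. It splits each pair on the first "=" only, so values that themselves contain "=" survive intact. parse_cookies is a hypothetical helper name, and the cookie string is a shortened version of the one on the slide.

```ruby
# Parse a "name=value; name=value" cookie string into a Hash.
def parse_cookies(header)
  header.split(";").each_with_object({}) do |pair, cookies|
    name, value = pair.strip.split("=", 2)   # split on the first "=" only
    next if name.nil? || name.empty?         # skip empty fragments
    cookies[name] = value.to_s
  end
end

c = "s_nr=12954999; s_v19=12978609471; __utmc=206845458"
p parse_cookies(c)
```

The resulting hash can be passed as :cookies => parse_cookies(c) in the RestClient.get call shown above.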

http://www.ip-adress.com/proxy_list

>> response = RestClient.get(url)
>> html_doc = Nokogiri::HTML(response)
>> table = html_doc.xpath("//table[@class='proxylist']")
>> lines = table.children
>> lines.shift # drop the header row

IP
>> lines[1].text
=> "208.52.144.55 document.write(":"+i+r+i+r)
anonymous proxy server-2 minutes ago United States"

JAVASCRIPT
=
RUBY

http://www.flickr.com/photos/drics/4266471776/

>> lines[1].text
=> "208.52.144.55 document.write(":"+i+r+i+r) anonymous
proxy server-2 minutes ago United States"

>> server = lines[1].text.split[0]
=> "208.52.144.55"

>> digits = lines[1].text.split(")")[0].split("+")
=> ["208.52.144.55document.write(":"", "i", "r", "i", "r"]
>> digits.shift
>> digits
=> ["i", "r", "i", "r"]
>> port = digits.map {|c| eval(c)}.join("")
=> "8080"
Voilà
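
The eval call above only works if i and r already exist as Ruby variables, and it runs scraped text as code. A safer sketch looks each name up in a table instead. decode_port is a hypothetical helper, and the values of i and r are assumptions inferred from the "8080" result; the slides do not show where they are defined.

```ruby
# Decode the port hidden in a document.write(":"+i+r+i+r) call by looking
# each variable up in a Hash rather than calling eval on scraped text.
def decode_port(cell_text, vars)
  expr = cell_text[/document\.write\((.*?)\)/, 1]   # text inside the call
  return nil if expr.nil?
  names = expr.split("+").drop(1)                   # skip the ":" literal
  names.map { |name| vars.fetch(name.strip) }.join
end

# Assumed values: the slides only show that i+r+i+r yields "8080".
vars = { "i" => "8", "r" => "0" }
text = '208.52.144.55 document.write(":"+i+r+i+r) anonymous proxy server'

puts decode_port(text, vars)
```

Using fetch means an unknown variable name raises KeyError instead of silently producing a wrong port.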

RestClient.proxy = "http://#{server}:#{port}"

mechanize
require 'mechanize'

agent = Mechanize.new
site = "http://www.cantora.mus.br"
page = agent.get("#{site}/baixar")
form = page.forms.first # first form on the page
form['visitor[name]'] = 'daniel'
form['visitor[email]'] = "danicuki@gmail.com"
page = agent.submit(form)
# keep only the links that point at tracks
tracks = page.links.select { |l| l.href =~ /track/ }
tracks.each do |t|
  file = agent.get("#{site}#{t.href}")
  file.save
end
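
Concatenating the site and the href works when every href starts with "/". As a sketch, URI.join from the standard library also resolves relative hrefs. track_urls and the sample hrefs below are hypothetical, since the slides do not show the real link targets.

```ruby
require 'uri'

# Keep only hrefs that look like track links and make them absolute.
def track_urls(site, hrefs)
  hrefs.compact                                   # links may have no href
       .select { |href| href =~ /track/ }
       .map { |href| URI.join(site, href).to_s }
end

site  = "http://www.cantora.mus.br"
hrefs = ["/tracks/1", "/fotos", nil]              # hypothetical hrefs
puts track_urls(site, hrefs)
```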

protection techniques

javascript
text as image
captcha
don’t be naive

captcha
prove you are not a robot

YES you can!

3 steps

1. Download Image
2. Filter image
3. Run OCR software

scaling

http://www.flickr.com/photos/liquene/3330714590

clouds

$ knife ec2 server create

In this crazy programmer's life
All kinds of situations come my way
Suddenly a client, a blunt proposal
To grab information from a site
You're crazy, I don't do that kind of crime
If you want, I have some friends down south
Do it for me and I'll pay you with this cool jewel

I'll give you a ruby
So you can steal
With your robot

Want to build a robot?
Just use Ruby
Just use Ruby
To build a robot

http://www.flickr.com/photos/jobafunky/5572503988

Thank you

Daniel Cukier
@danicuki
