SlideShare a Scribd company logo
1 of 20
Web Scraping in Ruby*
 * for fun and profit
2 of the 6 W’s


Who?
Why?
Why Ruby?
Fun                       5.times { print
                   “I like scraping in Ruby” }
OOP
Pretty
Closures
Flexible
Interactive mode
Why Ruby?

Community
Culture of testing
Rails is a web app
  test culture + web app = web testing
Lots of libraries!
Why Scrape the Web?
Information
Research
Testing
  acceptance & integration
  performance / load
Standard API: HTTP + text
Why not?
IANAL
Legal Concerns
Copyright
  Online != Public Domain
  Fair Use
License
  AUP, EULA, TOS, TOU ...
  Example: www.dexknows.com
Trespass to chattel
Objective


Get data. Have fun. Be nice.
Request
GET /
Host: www.google.com




Response
200 OK
<html>
  <head>
     <title>My page</title>
  </head>
  <body>
     <p>da body</p>
  </body>
</html>
HTTP Status Codes

200 level - Success on client and server
300 level - Redirection - client is supposed to do
something else
400 level - Client error
500 level - Server error
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
 "http://www.w3.org/TR/html4/loose.dtd">

<html lang="en">
<head>

  <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

  <title>All about Joe</title>
</head>
<body>
  <div id=”header”> | <span class=”nav-link”>home</span> | </div>
  <div id=”content”>
    <div id=”sidebar”>I’m a paragraph</div>
    <div id=”body”>
       <h1>I’m a header</h1>
       <p class=”bio”>
         <img src=”/img/mugshot.png”>
         <span class=”name”>Joe Blow</span>
       </p>
       <p>Info about Joe</p>
    </div>
  </div>
  <div id=”footer”>&copy; 2009 Creative Commons</div>
</body>
</html>
CSS & XPath Selectors
Way to target data inside a document
CSS
  “head title”
  "div#body * span.name"
XPath
  /head/title
  //div[@id='body']/*/span[@class='name']
Selecting XPath Nodes

      title           selects all nodes in the doc


     /html             selects from the root node


     //title      selects all nodes below current node

     [@src]       selects nodes with an attribute and
[@class=’name’]            optionally a value
Browser Tools

Firefox
  QuarkRuby’s version of Firebug
    click to get CSS & XPath Selectors
  Firefinder extension to Firebug
    query doc with CSS Selectors
Interaction                                          Parsing

 net/http                                     regex

                 Watir family

                                hpricot & nokogiri

              Mechanize

                    webrat

                    scrubyt
Scrubyt
http://github.com/scrubber/scrubyt_examples/blob/master/google.rb


require 'rubygems'
require 'scrubyt'
 
google_data = Scrubyt::Extractor.define do
  fetch 'http://www.google.com/search?hl=en&q=ruby'
  
  link_title "//a[@class='l']", :write_text => true do
    link_url
  end
end
 
p google_data.to_hash
Watir (Safariwatir)
require 'safariwatir'

browser = Watir::Safari.new
browser.goto("http://google.com")
browser.text_field(:name, "q").set("safariwatir")
browser.button(:name, "btnI").click
puts "FAILURE" unless
browser.contains_text("software")
Webrat Scraper
   require 'webrat_scraper'

   class MyScraper < WebratScraper
     def initialize
       @url = "http://www.google.com"
       super
     end

    def first_result_for(search_term)
      visit @url

      fill_in "q", :with => search_term
      click_button

      first_link = (doc/"li.g a.l").first

       {:text => first_link.inner_text,
         :url => first_link.attributes[“href”].to_s }
     end
   end

   m = MyScraper.new
   result = m.first_result_for("webrat-mechanize")
   puts result.inspect
Resources
Ruby http://ruby-lang.org

HTTP Status Codes http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html

CSS Selectors http://www.w3schools.com/Css/css_syntax.asp

XPath http://www.w3schools.com/XPath/xpath_syntax.asp

Mechanize http://mechanize.rubyforge.org/mechanize/

Nokogiri http://nokogiri.rubyforge.org/nokogiri/

Hpricot http://github.com/whymirror/hpricot

Webrat http://wiki.github.com/brynary/webrat

Scrubyt http://scrubyt.org/

Webrat Scraper http://github.com/jtzemp/webrat-scraper

TourBus http://github.com/dbrady/tourbus
Advanced Topics
Distributed Scraping
Anonymization
Security
  Captcha
  XSRF & CSRF protections
Load and Performance Testing

More Related Content

What's hot

Simple Web Apps With Sinatra
Simple Web Apps With SinatraSimple Web Apps With Sinatra
Simple Web Apps With Sinatraa_l
 
Web scraping 101 with goutte
Web scraping 101 with goutteWeb scraping 101 with goutte
Web scraping 101 with goutteJoshua Copeland
 
Basics of Front End Web Dev PowerPoint
Basics of Front End Web Dev PowerPointBasics of Front End Web Dev PowerPoint
Basics of Front End Web Dev PowerPointSahil Gandhi
 
The Need for Speed - SMX Sydney 2013
The Need for Speed - SMX Sydney 2013The Need for Speed - SMX Sydney 2013
The Need for Speed - SMX Sydney 2013Bastian Grimm
 
Seo Bootcamp for Small Buisinesses
 Seo Bootcamp for Small Buisinesses Seo Bootcamp for Small Buisinesses
Seo Bootcamp for Small BuisinessesCharlie Kalech
 
디자인 패턴과 YUI를 이용해 Rich UI 빠르게 구현하기
디자인 패턴과 YUI를 이용해 Rich UI 빠르게 구현하기디자인 패턴과 YUI를 이용해 Rich UI 빠르게 구현하기
디자인 패턴과 YUI를 이용해 Rich UI 빠르게 구현하기Jinho Jung
 
Creating HTML Pages
Creating HTML PagesCreating HTML Pages
Creating HTML PagesMike Crabb
 
On-page SEO for Drupal
On-page SEO for DrupalOn-page SEO for Drupal
On-page SEO for DrupalSvilen Sabev
 
Introduction to jQuery Mobile - Web Deliver for All
Introduction to jQuery Mobile - Web Deliver for AllIntroduction to jQuery Mobile - Web Deliver for All
Introduction to jQuery Mobile - Web Deliver for AllMarc Grabanski
 
Findability Bliss Through Web Standards
Findability Bliss Through Web StandardsFindability Bliss Through Web Standards
Findability Bliss Through Web StandardsAarron Walter
 
Progressive Downloads and Rendering - take #2
Progressive Downloads and Rendering - take #2Progressive Downloads and Rendering - take #2
Progressive Downloads and Rendering - take #2Stoyan Stefanov
 
High Performance Social Plugins
High Performance Social PluginsHigh Performance Social Plugins
High Performance Social PluginsStoyan Stefanov
 
JavaServer Pages
JavaServer Pages JavaServer Pages
JavaServer Pages profbnk
 
HTML5 and the web of tomorrow!
HTML5  and the  web of tomorrow!HTML5  and the  web of tomorrow!
HTML5 and the web of tomorrow!Christian Heilmann
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDBMoshe Kaplan
 
Html5的应用与推行
Html5的应用与推行Html5的应用与推行
Html5的应用与推行Sofish Lin
 
Get Django, Get Hired - An opinionated guide to getting the best job, for the...
Get Django, Get Hired - An opinionated guide to getting the best job, for the...Get Django, Get Hired - An opinionated guide to getting the best job, for the...
Get Django, Get Hired - An opinionated guide to getting the best job, for the...Marcel Chastain
 

What's hot (20)

Simple Web Apps With Sinatra
Simple Web Apps With SinatraSimple Web Apps With Sinatra
Simple Web Apps With Sinatra
 
Web scraping 101 with goutte
Web scraping 101 with goutteWeb scraping 101 with goutte
Web scraping 101 with goutte
 
Web Components Revolution
Web Components RevolutionWeb Components Revolution
Web Components Revolution
 
Basics of Front End Web Dev PowerPoint
Basics of Front End Web Dev PowerPointBasics of Front End Web Dev PowerPoint
Basics of Front End Web Dev PowerPoint
 
The Need for Speed - SMX Sydney 2013
The Need for Speed - SMX Sydney 2013The Need for Speed - SMX Sydney 2013
The Need for Speed - SMX Sydney 2013
 
Seo Bootcamp for Small Buisinesses
 Seo Bootcamp for Small Buisinesses Seo Bootcamp for Small Buisinesses
Seo Bootcamp for Small Buisinesses
 
디자인 패턴과 YUI를 이용해 Rich UI 빠르게 구현하기
디자인 패턴과 YUI를 이용해 Rich UI 빠르게 구현하기디자인 패턴과 YUI를 이용해 Rich UI 빠르게 구현하기
디자인 패턴과 YUI를 이용해 Rich UI 빠르게 구현하기
 
Creating HTML Pages
Creating HTML PagesCreating HTML Pages
Creating HTML Pages
 
On-page SEO for Drupal
On-page SEO for DrupalOn-page SEO for Drupal
On-page SEO for Drupal
 
Introduction to jQuery Mobile - Web Deliver for All
Introduction to jQuery Mobile - Web Deliver for AllIntroduction to jQuery Mobile - Web Deliver for All
Introduction to jQuery Mobile - Web Deliver for All
 
Findability Bliss Through Web Standards
Findability Bliss Through Web StandardsFindability Bliss Through Web Standards
Findability Bliss Through Web Standards
 
Please dont touch-3.6-jsday
Please dont touch-3.6-jsdayPlease dont touch-3.6-jsday
Please dont touch-3.6-jsday
 
Progressive Downloads and Rendering - take #2
Progressive Downloads and Rendering - take #2Progressive Downloads and Rendering - take #2
Progressive Downloads and Rendering - take #2
 
High Performance Social Plugins
High Performance Social PluginsHigh Performance Social Plugins
High Performance Social Plugins
 
JavaServer Pages
JavaServer Pages JavaServer Pages
JavaServer Pages
 
HTML5 and the web of tomorrow!
HTML5  and the  web of tomorrow!HTML5  and the  web of tomorrow!
HTML5 and the web of tomorrow!
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDB
 
HTML 5 & CSS 3
HTML 5 & CSS 3HTML 5 & CSS 3
HTML 5 & CSS 3
 
Html5的应用与推行
Html5的应用与推行Html5的应用与推行
Html5的应用与推行
 
Get Django, Get Hired - An opinionated guide to getting the best job, for the...
Get Django, Get Hired - An opinionated guide to getting the best job, for the...Get Django, Get Hired - An opinionated guide to getting the best job, for the...
Get Django, Get Hired - An opinionated guide to getting the best job, for the...
 

Viewers also liked

さらに仕事に使うRuby
さらに仕事に使うRubyさらに仕事に使うRuby
さらに仕事に使うRubyKentaro Goto
 
仕事で使うRuby
仕事で使うRuby仕事で使うRuby
仕事で使うRubyKentaro Goto
 
Hpricot GURU-SP por Jonas Alves
Hpricot GURU-SP por Jonas AlvesHpricot GURU-SP por Jonas Alves
Hpricot GURU-SP por Jonas AlvesJonas Alves
 
もっと仕事で使うRuby
もっと仕事で使うRubyもっと仕事で使うRuby
もっと仕事で使うRubyKentaro Goto
 
The Outcome Economy
The Outcome EconomyThe Outcome Economy
The Outcome EconomyHelge Tennø
 

Viewers also liked (8)

Web scraping
Web scrapingWeb scraping
Web scraping
 
さらに仕事に使うRuby
さらに仕事に使うRubyさらに仕事に使うRuby
さらに仕事に使うRuby
 
仕事で使うRuby
仕事で使うRuby仕事で使うRuby
仕事で使うRuby
 
Hpricot GURU-SP por Jonas Alves
Hpricot GURU-SP por Jonas AlvesHpricot GURU-SP por Jonas Alves
Hpricot GURU-SP por Jonas Alves
 
20世紀Ruby
20世紀Ruby20世紀Ruby
20世紀Ruby
 
HTML Parsing With Hpricot
HTML Parsing With HpricotHTML Parsing With Hpricot
HTML Parsing With Hpricot
 
もっと仕事で使うRuby
もっと仕事で使うRubyもっと仕事で使うRuby
もっと仕事で使うRuby
 
The Outcome Economy
The Outcome EconomyThe Outcome Economy
The Outcome Economy
 

Similar to Web Scraping In Ruby Utosc 2009.Key

It is not HTML5. but ... / HTML5ではないサイトからHTML5を考える
It is not HTML5. but ... / HTML5ではないサイトからHTML5を考えるIt is not HTML5. but ... / HTML5ではないサイトからHTML5を考える
It is not HTML5. but ... / HTML5ではないサイトからHTML5を考えるSadaaki HIRAI
 
Frontend for developers
Frontend for developersFrontend for developers
Frontend for developersHernan Mammana
 
Great+Seo+Cheatsheet
Great+Seo+CheatsheetGreat+Seo+Cheatsheet
Great+Seo+Cheatsheetjeetututeja
 
The Big Picture and How to Get Started
The Big Picture and How to Get StartedThe Big Picture and How to Get Started
The Big Picture and How to Get Startedguest1af57e
 
Seo cheat sheet_2-2013
Seo cheat sheet_2-2013Seo cheat sheet_2-2013
Seo cheat sheet_2-2013ekkarthik
 
Seo cheat sheet_2-2013
Seo cheat sheet_2-2013Seo cheat sheet_2-2013
Seo cheat sheet_2-2013vijay patil
 
An Introduction to Tornado
An Introduction to TornadoAn Introduction to Tornado
An Introduction to TornadoGavin Roy
 
The Django Web Application Framework
The Django Web Application FrameworkThe Django Web Application Framework
The Django Web Application FrameworkSimon Willison
 
Using and scaling Rack and Rack-based middleware
Using and scaling Rack and Rack-based middlewareUsing and scaling Rack and Rack-based middleware
Using and scaling Rack and Rack-based middlewareAlona Mekhovova
 
BP-6 Repository Customization Best Practices
BP-6 Repository Customization Best PracticesBP-6 Repository Customization Best Practices
BP-6 Repository Customization Best PracticesAlfresco Software
 
Html5 drupal7 with mandakini kumari(1)
Html5 drupal7 with mandakini kumari(1)Html5 drupal7 with mandakini kumari(1)
Html5 drupal7 with mandakini kumari(1)Mandakini Kumari
 
The Django Web Application Framework 2
The Django Web Application Framework 2The Django Web Application Framework 2
The Django Web Application Framework 2fishwarter
 
The Django Web Application Framework 2
The Django Web Application Framework 2The Django Web Application Framework 2
The Django Web Application Framework 2fishwarter
 
JS Lab`16. Владимир Воевидка: "Как работает браузер"
JS Lab`16. Владимир Воевидка: "Как работает браузер"JS Lab`16. Владимир Воевидка: "Как работает браузер"
JS Lab`16. Владимир Воевидка: "Как работает браузер"GeeksLab Odessa
 
Advanced Web Scraping or How To Make Internet Your Database #seoplus2018
Advanced Web Scraping or How To Make Internet Your Database #seoplus2018Advanced Web Scraping or How To Make Internet Your Database #seoplus2018
Advanced Web Scraping or How To Make Internet Your Database #seoplus2018Esteve Castells
 

Similar to Web Scraping In Ruby Utosc 2009.Key (20)

BrightonSEO
BrightonSEOBrightonSEO
BrightonSEO
 
It is not HTML5. but ... / HTML5ではないサイトからHTML5を考える
It is not HTML5. but ... / HTML5ではないサイトからHTML5を考えるIt is not HTML5. but ... / HTML5ではないサイトからHTML5を考える
It is not HTML5. but ... / HTML5ではないサイトからHTML5を考える
 
Frontend for developers
Frontend for developersFrontend for developers
Frontend for developers
 
Great+Seo+Cheatsheet
Great+Seo+CheatsheetGreat+Seo+Cheatsheet
Great+Seo+Cheatsheet
 
HTML5
HTML5HTML5
HTML5
 
The Devil and HTML5
The Devil and HTML5The Devil and HTML5
The Devil and HTML5
 
Scrapy workshop
Scrapy workshopScrapy workshop
Scrapy workshop
 
The Big Picture and How to Get Started
The Big Picture and How to Get StartedThe Big Picture and How to Get Started
The Big Picture and How to Get Started
 
Seo cheat sheet_2-2013
Seo cheat sheet_2-2013Seo cheat sheet_2-2013
Seo cheat sheet_2-2013
 
Seo cheat sheet_2-2013
Seo cheat sheet_2-2013Seo cheat sheet_2-2013
Seo cheat sheet_2-2013
 
An Introduction to Tornado
An Introduction to TornadoAn Introduction to Tornado
An Introduction to Tornado
 
The Django Web Application Framework
The Django Web Application FrameworkThe Django Web Application Framework
The Django Web Application Framework
 
Using and scaling Rack and Rack-based middleware
Using and scaling Rack and Rack-based middlewareUsing and scaling Rack and Rack-based middleware
Using and scaling Rack and Rack-based middleware
 
BP-6 Repository Customization Best Practices
BP-6 Repository Customization Best PracticesBP-6 Repository Customization Best Practices
BP-6 Repository Customization Best Practices
 
Html5 drupal7 with mandakini kumari(1)
Html5 drupal7 with mandakini kumari(1)Html5 drupal7 with mandakini kumari(1)
Html5 drupal7 with mandakini kumari(1)
 
The Django Web Application Framework 2
The Django Web Application Framework 2The Django Web Application Framework 2
The Django Web Application Framework 2
 
The Django Web Application Framework 2
The Django Web Application Framework 2The Django Web Application Framework 2
The Django Web Application Framework 2
 
JS Lab`16. Владимир Воевидка: "Как работает браузер"
JS Lab`16. Владимир Воевидка: "Как работает браузер"JS Lab`16. Владимир Воевидка: "Как работает браузер"
JS Lab`16. Владимир Воевидка: "Как работает браузер"
 
Advanced Web Scraping or How To Make Internet Your Database #seoplus2018
Advanced Web Scraping or How To Make Internet Your Database #seoplus2018Advanced Web Scraping or How To Make Internet Your Database #seoplus2018
Advanced Web Scraping or How To Make Internet Your Database #seoplus2018
 
HTML 5 - Overview
HTML 5 - OverviewHTML 5 - Overview
HTML 5 - Overview
 

Recently uploaded

08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 

Recently uploaded (20)

08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 

Web Scraping In Ruby Utosc 2009.Key

  • 1. Web Scraping in Ruby* * for fun and profit
  • 2. 2 of the 6 W’s Who? Why?
  • 3. Why Ruby? Fun 5.times { print “I like scraping in Ruby” } OOP Pretty Closures Flexible Interactive mode
  • 4. Why Ruby? Community Culture of testing Rails is a web app test culture + web app = web testing Lots of libraries!
  • 5. Why Scrape the Web? Information Research Testing acceptance & integration performance / load Standard API: HTTP + text Why not?
  • 7. Legal Concerns Copyright Online != Public Domain Fair Use License AUP, EULA, TOS, TOU ... Example: www.dexknows.com Trespass to chattel
  • 8. Objective Get data. Have fun. Be nice.
  • 9. Request GET / Host: www.google.com Response 200 OK <html> <head> <title>My page</title> </head> <body> <p>da body</p> </body> </html>
  • 10. HTTP Status Codes 200 level - Success on client and server 300 level - Redirection - client is supposed to do something else 400 level - Client error 500 level - Server error
  • 11. <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"> <html lang="en"> <head> <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> <title>All about Joe</title> </head> <body> <div id=”header”> | <span class=”nav-link”>home</span> | </div> <div id=”content”> <div id=”sidebar”>I’m a paragraph</div> <div id=”body”> <h1>I’m a header</h1> <p class=”bio”> <img src=”/img/mugshot.png”> <span class=”name”>Joe Blow</span> </p> <p>Info about Joe</p> </div> </div> <div id=”footer”>&copy; 2009 Creative Commons</div> </body> </html>
  • 12. CSS & XPath Selectors Way to target data inside a document CSS “head title” "div#body * span.name" XPath /head/title //div[@id='body']/*/span[@class='name']
  • 13. Selecting XPath Nodes title selects all nodes in the doc /html selects from the root node //title selects all nodes below current node [@src] selects nodes with an attribute and [@class=’name’] optionally a value
  • 14. Browser Tools Firefox QuarkRuby’s version of Firebug click to get CSS & XPath Selectors Firefinder extension to Firebug query doc with CSS Selectors
  • 15. Interaction Parsing net/http regex Watir family hpricot & nokogiri Mechanize webrat scrubyt
  • 16. Scrubyt http://github.com/scrubber/scrubyt_examples/blob/master/google.rb require 'rubygems' require 'scrubyt'   google_data = Scrubyt::Extractor.define do   fetch 'http://www.google.com/search?hl=en&q=ruby'      link_title "//a[@class='l']", :write_text => true do     link_url   end end   p google_data.to_hash
  • 17. Watir (Safariwatir) require 'safariwatir' browser = Watir::Safari.new browser.goto("http://google.com") browser.text_field(:name, "q").set("safariwatir") browser.button(:name, "btnI").click puts "FAILURE" unless browser.contains_text("software")
  • 18. Webrat Scraper require 'webrat_scraper' class MyScraper < WebratScraper def initialize @url = "http://www.google.com" super end def first_result_for(search_term) visit @url fill_in "q", :with => search_term click_button first_link = (doc/"li.g a.l").first {:text => first_link.inner_text, :url => first_link.attributes[“href”].to_s } end end m = MyScraper.new result = m.first_result_for("webrat-mechanize") puts result.inspect
  • 19. Resources Ruby http://ruby-lang.org HTTP Status Codes http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html CSS Selectors http://www.w3schools.com/Css/css_syntax.asp XPath http://www.w3schools.com/XPath/xpath_syntax.asp Mechanize http://mechanize.rubyforge.org/mechanize/ Nokogiri http://nokogiri.rubyforge.org/nokogiri/ Hpricot http://github.com/whymirror/hpricot Webrat http://wiki.github.com/brynary/webrat Scrubyt http://scrubyt.org/ Webrat Scraper http://github.com/jtzemp/webrat-scraper TourBus http://github.com/dbrady/tourbus
  • 20. Advanced Topics Distributed Scraping Anonymization Security Captcha XSRF & CSRF protections Load and Performance Testing

Editor's Notes

  1. Civil Law, Contract Law &amp; Tort Law Copyright is civil law with a lot of case law &amp; precedent TOS, AUP, etc. are contract law Tresspass to chattel tort law - have to &amp;#x2018;damage&amp;#x2019; Dex knows: look at TOU for automated scraping, robots.txt and the sitemap