SlideShare une entreprise Scribd logo
1  sur  13
Télécharger pour lire hors ligne
Scraping recalcitrant web sites
              with Python & Selenium
                     Roger Barnes




SyPy July 2012
Some sites suck
Some sites suck - "for your own good"




For security reasons, each button is
an image, dynamically generated by
a hash wrapped in a mess of
javascript, randomly placed
...but they work in a web browser!




  Let's use the web browser to scrape them
Enter Selenium



      Selenium automates browsers

                 That's it
Selenium can...
●   navigate (windows, frames, links)
●   find elements and parse attributes
●   interact and trigger events (click, type, ...)
●   capture screenshots
●   run javascript
●   let the browser take care of the hard stuff
    (cookies, javascript, sessions, profiles,
    DOM)

Comes with various components and bindings
                         ... including python
General Recipe
Ingredients:
● firefox (or chrome)
● firebug (or chrome dev tools)
● Selenium IDE
    ○ record a session, write less code
●   python and its batteries
●   python-selenium
●   xvfb and pyvirtualdisplay (optional)
●   other libraries to taste
    ○ eg image manipulation, database access, DOM
      parsing, OCR
General Recipe
Method:
● Install requirements (apt-get, pip etc)
   ○ sudo apt-get install xvfb firefox
   ○ pip install selenium pyvirtualdisplay
● Start up Firefox and Selenium IDE
● Record a "test" run through site
   ○ Add in some assertions along the way
● Export test as Python script
● Hack from there
   ○ Loops
   ○ Image/data extraction
   ○ Wrangling data into a database
Example from Selenium IDE
class Ingdirect2(unittest.TestCase):
    def setUp(self):
        self.driver = webdriver.Firefox()
        self.driver.implicitly_wait( 30)
        self.base_url = "https://www.ingdirect.com.au"
        self.verificationErrors = []

   def test_ingdirect2(self):
       driver = self.driver
                                                           But what about
       driver.get( self.base_url + "/client/index.aspx")
                                                           that dang
       driver.switch_to_frame( 'body') # Had to add this keypad? ...
       driver.find_element_by_id( "txtCIF").clear()
       driver.find_element_by_id( "txtCIF").send_keys( "12345678")
       driver.find_element_by_id( "objKeypad_B1").click()
       driver.find_element_by_id( "objKeypad_B2").click()
       driver.find_element_by_id( "objKeypad_B3").click()
       driver.find_element_by_id( "objKeypad_B4").click()
       driver.find_element_by_id( "btnLogin").click()
       self.assertTrue( self.is_element_present(By.ID, "ctl2_lblBalance"))
PIL saves the day
# Get screenshot for extraction of button images
screenshot = driver.get_screenshot_as_base64()
im = Image.open(StringIO.StringIO(base64.decodestring(screenshot)))

table = driver.find_element_by_xpath( '//*[@id="objKeypad_divShowAll"]/table')
all_buttons = table.find_elements_by_tag_name( "input")

# Determine md5sum of each button by cropping based on element positions
for button in all_buttons:
    button_image = im.crop(getcropbox(button))
    hexid = hashlib.md5(button_image.tostring()).hexdigest()
    button_mapping[hexid] = button.get_attribute( "id")


# Now we know which button is which ( based on previous lookup), enter the PIN
for char in self.pin:
    driver.find_element_by_id(button_mapping[hex_mapping[char]]).click()

driver.find_element_by_id( "btnLogin").click()

# We're in!!!11one
But why do all this?
It's my data!                                  ... and I'll graph if i want to




       * Actual results may vary. Graph indicates open inodes, not high-roller gambling problem
That's all folks
Slides
● http://bit.ly/scrapium

Code
● https://gist.github.com/3015852

Me
● https://twitter.com/mindsocket
● https://github.com/mindsocket
● roger@mindsocket.com.au

Contenu connexe

Tendances

Selenium Automation Using Ruby
Selenium Automation Using RubySelenium Automation Using Ruby
Selenium Automation Using RubyKumari Warsha Goel
 
Selenium webdriver
Selenium webdriverSelenium webdriver
Selenium webdriversean_todd
 
Automated Testing With Watir
Automated Testing With WatirAutomated Testing With Watir
Automated Testing With WatirTimothy Fisher
 
Detecting headless browsers
Detecting headless browsersDetecting headless browsers
Detecting headless browsersSergey Shekyan
 
Session on Selenium 4 : What’s coming our way? by Hitesh Prajapati
Session on Selenium 4 : What’s coming our way? by Hitesh PrajapatiSession on Selenium 4 : What’s coming our way? by Hitesh Prajapati
Session on Selenium 4 : What’s coming our way? by Hitesh PrajapatiAgile Testing Alliance
 
Introduction to Protractor
Introduction to ProtractorIntroduction to Protractor
Introduction to ProtractorJie-Wei Wu
 
Jenkins and Groovy
Jenkins and GroovyJenkins and Groovy
Jenkins and GroovyKiyotaka Oku
 
Introduction To Ruby Watir (Web Application Testing In Ruby)
Introduction To Ruby Watir (Web Application Testing In Ruby)Introduction To Ruby Watir (Web Application Testing In Ruby)
Introduction To Ruby Watir (Web Application Testing In Ruby)Mindfire Solutions
 
High Performance JavaScript 2011
High Performance JavaScript 2011High Performance JavaScript 2011
High Performance JavaScript 2011Nicholas Zakas
 
An introduction to PhantomJS: A headless browser for automation test.
An introduction to PhantomJS: A headless browser for automation test.An introduction to PhantomJS: A headless browser for automation test.
An introduction to PhantomJS: A headless browser for automation test.BugRaptors
 
Protractor framework – how to make stable e2e tests for Angular applications
Protractor framework – how to make stable e2e tests for Angular applicationsProtractor framework – how to make stable e2e tests for Angular applications
Protractor framework – how to make stable e2e tests for Angular applicationsLudmila Nesvitiy
 
Session on Launching Selenium Grid and Running tests using docker compose and...
Session on Launching Selenium Grid and Running tests using docker compose and...Session on Launching Selenium Grid and Running tests using docker compose and...
Session on Launching Selenium Grid and Running tests using docker compose and...Agile Testing Alliance
 
Workshop: Functional testing made easy with PHPUnit & Selenium (phpCE Poland,...
Workshop: Functional testing made easy with PHPUnit & Selenium (phpCE Poland,...Workshop: Functional testing made easy with PHPUnit & Selenium (phpCE Poland,...
Workshop: Functional testing made easy with PHPUnit & Selenium (phpCE Poland,...Ondřej Machulda
 
Integrační testy - Selenium
Integrační testy - SeleniumIntegrační testy - Selenium
Integrační testy - SeleniumKeyup
 
Protractor Tutorial Quality in Agile 2015
Protractor Tutorial Quality in Agile 2015Protractor Tutorial Quality in Agile 2015
Protractor Tutorial Quality in Agile 2015Andrew Eisenberg
 
淺談 Groovy 與 AWS 雲端應用開發整合
淺談 Groovy 與 AWS 雲端應用開發整合淺談 Groovy 與 AWS 雲端應用開發整合
淺談 Groovy 與 AWS 雲端應用開發整合Kyle Lin
 

Tendances (20)

Web driver training
Web driver trainingWeb driver training
Web driver training
 
Selenium Automation Using Ruby
Selenium Automation Using RubySelenium Automation Using Ruby
Selenium Automation Using Ruby
 
Selenium webdriver
Selenium webdriverSelenium webdriver
Selenium webdriver
 
Automated Testing With Watir
Automated Testing With WatirAutomated Testing With Watir
Automated Testing With Watir
 
Detecting headless browsers
Detecting headless browsersDetecting headless browsers
Detecting headless browsers
 
Webdriver.io
Webdriver.io Webdriver.io
Webdriver.io
 
Session on Selenium 4 : What’s coming our way? by Hitesh Prajapati
Session on Selenium 4 : What’s coming our way? by Hitesh PrajapatiSession on Selenium 4 : What’s coming our way? by Hitesh Prajapati
Session on Selenium 4 : What’s coming our way? by Hitesh Prajapati
 
Introduction to Protractor
Introduction to ProtractorIntroduction to Protractor
Introduction to Protractor
 
Jenkins and Groovy
Jenkins and GroovyJenkins and Groovy
Jenkins and Groovy
 
Zombiejs
ZombiejsZombiejs
Zombiejs
 
Introduction To Ruby Watir (Web Application Testing In Ruby)
Introduction To Ruby Watir (Web Application Testing In Ruby)Introduction To Ruby Watir (Web Application Testing In Ruby)
Introduction To Ruby Watir (Web Application Testing In Ruby)
 
High Performance JavaScript 2011
High Performance JavaScript 2011High Performance JavaScript 2011
High Performance JavaScript 2011
 
Code ceptioninstallation
Code ceptioninstallationCode ceptioninstallation
Code ceptioninstallation
 
An introduction to PhantomJS: A headless browser for automation test.
An introduction to PhantomJS: A headless browser for automation test.An introduction to PhantomJS: A headless browser for automation test.
An introduction to PhantomJS: A headless browser for automation test.
 
Protractor framework – how to make stable e2e tests for Angular applications
Protractor framework – how to make stable e2e tests for Angular applicationsProtractor framework – how to make stable e2e tests for Angular applications
Protractor framework – how to make stable e2e tests for Angular applications
 
Session on Launching Selenium Grid and Running tests using docker compose and...
Session on Launching Selenium Grid and Running tests using docker compose and...Session on Launching Selenium Grid and Running tests using docker compose and...
Session on Launching Selenium Grid and Running tests using docker compose and...
 
Workshop: Functional testing made easy with PHPUnit & Selenium (phpCE Poland,...
Workshop: Functional testing made easy with PHPUnit & Selenium (phpCE Poland,...Workshop: Functional testing made easy with PHPUnit & Selenium (phpCE Poland,...
Workshop: Functional testing made easy with PHPUnit & Selenium (phpCE Poland,...
 
Integrační testy - Selenium
Integrační testy - SeleniumIntegrační testy - Selenium
Integrační testy - Selenium
 
Protractor Tutorial Quality in Agile 2015
Protractor Tutorial Quality in Agile 2015Protractor Tutorial Quality in Agile 2015
Protractor Tutorial Quality in Agile 2015
 
淺談 Groovy 與 AWS 雲端應用開發整合
淺談 Groovy 與 AWS 雲端應用開發整合淺談 Groovy 與 AWS 雲端應用開發整合
淺談 Groovy 與 AWS 雲端應用開發整合
 

Similaire à Scraping recalcitrant web sites with Python & Selenium

Web UI test automation instruments
Web UI test automation instrumentsWeb UI test automation instruments
Web UI test automation instrumentsArtem Nagornyi
 
Testing web application with Python
Testing web application with PythonTesting web application with Python
Testing web application with PythonJachym Cepicky
 
iOS Automation Primitives
iOS Automation PrimitivesiOS Automation Primitives
iOS Automation PrimitivesSynack
 
Owasp orlando, april 13, 2016
Owasp orlando, april 13, 2016Owasp orlando, april 13, 2016
Owasp orlando, april 13, 2016Mikhail Sosonkin
 
How to execute Automation Testing using Selenium
How to execute Automation Testing using SeleniumHow to execute Automation Testing using Selenium
How to execute Automation Testing using Seleniumvaluebound
 
Top100summit 谷歌-scott-improve your automated web application testing
Top100summit  谷歌-scott-improve your automated web application testingTop100summit  谷歌-scott-improve your automated web application testing
Top100summit 谷歌-scott-improve your automated web application testingdrewz lin
 
Automating Django Functional Tests Using Selenium on Cloud
Automating Django Functional Tests Using Selenium on CloudAutomating Django Functional Tests Using Selenium on Cloud
Automating Django Functional Tests Using Selenium on CloudJonghyun Park
 
Javascript Everywhere
Javascript EverywhereJavascript Everywhere
Javascript EverywherePascal Rettig
 
UI Testing Best Practices - An Expected Journey
UI Testing Best Practices - An Expected JourneyUI Testing Best Practices - An Expected Journey
UI Testing Best Practices - An Expected JourneyOren Farhi
 
Server Side JavaScript - You ain't seen nothing yet
Server Side JavaScript - You ain't seen nothing yetServer Side JavaScript - You ain't seen nothing yet
Server Side JavaScript - You ain't seen nothing yetTom Croucher
 
Introduction to jQuery
Introduction to jQueryIntroduction to jQuery
Introduction to jQueryAlek Davis
 
Writing automation tests with python selenium behave pageobjects
Writing automation tests with python selenium behave pageobjectsWriting automation tests with python selenium behave pageobjects
Writing automation tests with python selenium behave pageobjectsLeticia Rss
 
OpenCms Days 2014 - User Generated Content in OpenCms 9.5
OpenCms Days 2014 - User Generated Content in OpenCms 9.5OpenCms Days 2014 - User Generated Content in OpenCms 9.5
OpenCms Days 2014 - User Generated Content in OpenCms 9.5Alkacon Software GmbH & Co. KG
 
Testing ASP.NET - Progressive.NET
Testing ASP.NET - Progressive.NETTesting ASP.NET - Progressive.NET
Testing ASP.NET - Progressive.NETBen Hall
 
JSMVCOMFG - To sternly look at JavaScript MVC and Templating Frameworks
JSMVCOMFG - To sternly look at JavaScript MVC and Templating FrameworksJSMVCOMFG - To sternly look at JavaScript MVC and Templating Frameworks
JSMVCOMFG - To sternly look at JavaScript MVC and Templating FrameworksMario Heiderich
 
StHack 2014 - Mario "@0x6D6172696F" Heiderich - JSMVCOMFG
StHack 2014 - Mario "@0x6D6172696F" Heiderich - JSMVCOMFGStHack 2014 - Mario "@0x6D6172696F" Heiderich - JSMVCOMFG
StHack 2014 - Mario "@0x6D6172696F" Heiderich - JSMVCOMFGStHack
 
Automation - web testing with selenium
Automation - web testing with seleniumAutomation - web testing with selenium
Automation - web testing with seleniumTzirla Rozental
 
Catch a spider monkey
Catch a spider monkeyCatch a spider monkey
Catch a spider monkeyChengHui Weng
 

Similaire à Scraping recalcitrant web sites with Python & Selenium (20)

Web UI test automation instruments
Web UI test automation instrumentsWeb UI test automation instruments
Web UI test automation instruments
 
Testing web application with Python
Testing web application with PythonTesting web application with Python
Testing web application with Python
 
iOS Automation Primitives
iOS Automation PrimitivesiOS Automation Primitives
iOS Automation Primitives
 
Owasp orlando, april 13, 2016
Owasp orlando, april 13, 2016Owasp orlando, april 13, 2016
Owasp orlando, april 13, 2016
 
How to execute Automation Testing using Selenium
How to execute Automation Testing using SeleniumHow to execute Automation Testing using Selenium
How to execute Automation Testing using Selenium
 
Top100summit 谷歌-scott-improve your automated web application testing
Top100summit  谷歌-scott-improve your automated web application testingTop100summit  谷歌-scott-improve your automated web application testing
Top100summit 谷歌-scott-improve your automated web application testing
 
Automating Django Functional Tests Using Selenium on Cloud
Automating Django Functional Tests Using Selenium on CloudAutomating Django Functional Tests Using Selenium on Cloud
Automating Django Functional Tests Using Selenium on Cloud
 
Nightwatch 101 - Salvador Molina
Nightwatch 101 - Salvador MolinaNightwatch 101 - Salvador Molina
Nightwatch 101 - Salvador Molina
 
Javascript Everywhere
Javascript EverywhereJavascript Everywhere
Javascript Everywhere
 
UI Testing Best Practices - An Expected Journey
UI Testing Best Practices - An Expected JourneyUI Testing Best Practices - An Expected Journey
UI Testing Best Practices - An Expected Journey
 
Server Side JavaScript - You ain't seen nothing yet
Server Side JavaScript - You ain't seen nothing yetServer Side JavaScript - You ain't seen nothing yet
Server Side JavaScript - You ain't seen nothing yet
 
End-to-end testing with geb
End-to-end testing with gebEnd-to-end testing with geb
End-to-end testing with geb
 
Introduction to jQuery
Introduction to jQueryIntroduction to jQuery
Introduction to jQuery
 
Writing automation tests with python selenium behave pageobjects
Writing automation tests with python selenium behave pageobjectsWriting automation tests with python selenium behave pageobjects
Writing automation tests with python selenium behave pageobjects
 
OpenCms Days 2014 - User Generated Content in OpenCms 9.5
OpenCms Days 2014 - User Generated Content in OpenCms 9.5OpenCms Days 2014 - User Generated Content in OpenCms 9.5
OpenCms Days 2014 - User Generated Content in OpenCms 9.5
 
Testing ASP.NET - Progressive.NET
Testing ASP.NET - Progressive.NETTesting ASP.NET - Progressive.NET
Testing ASP.NET - Progressive.NET
 
JSMVCOMFG - To sternly look at JavaScript MVC and Templating Frameworks
JSMVCOMFG - To sternly look at JavaScript MVC and Templating FrameworksJSMVCOMFG - To sternly look at JavaScript MVC and Templating Frameworks
JSMVCOMFG - To sternly look at JavaScript MVC and Templating Frameworks
 
StHack 2014 - Mario "@0x6D6172696F" Heiderich - JSMVCOMFG
StHack 2014 - Mario "@0x6D6172696F" Heiderich - JSMVCOMFGStHack 2014 - Mario "@0x6D6172696F" Heiderich - JSMVCOMFG
StHack 2014 - Mario "@0x6D6172696F" Heiderich - JSMVCOMFG
 
Automation - web testing with selenium
Automation - web testing with seleniumAutomation - web testing with selenium
Automation - web testing with selenium
 
Catch a spider monkey
Catch a spider monkeyCatch a spider monkey
Catch a spider monkey
 

Plus de Roger Barnes

The life of a web request - techniques for measuring and improving Django app...
The life of a web request - techniques for measuring and improving Django app...The life of a web request - techniques for measuring and improving Django app...
The life of a web request - techniques for measuring and improving Django app...Roger Barnes
 
Building data flows with Celery and SQLAlchemy
Building data flows with Celery and SQLAlchemyBuilding data flows with Celery and SQLAlchemy
Building data flows with Celery and SQLAlchemyRoger Barnes
 
Introduction to SQL Alchemy - SyPy June 2013
Introduction to SQL Alchemy - SyPy June 2013Introduction to SQL Alchemy - SyPy June 2013
Introduction to SQL Alchemy - SyPy June 2013Roger Barnes
 
Poker, packets, pipes and Python
Poker, packets, pipes and PythonPoker, packets, pipes and Python
Poker, packets, pipes and PythonRoger Barnes
 
Towards Continuous Deployment with Django
Towards Continuous Deployment with DjangoTowards Continuous Deployment with Django
Towards Continuous Deployment with DjangoRoger Barnes
 
Intro to Pinax: Kickstarting Your Django Apps
Intro to Pinax: Kickstarting Your Django AppsIntro to Pinax: Kickstarting Your Django Apps
Intro to Pinax: Kickstarting Your Django AppsRoger Barnes
 

Plus de Roger Barnes (6)

The life of a web request - techniques for measuring and improving Django app...
The life of a web request - techniques for measuring and improving Django app...The life of a web request - techniques for measuring and improving Django app...
The life of a web request - techniques for measuring and improving Django app...
 
Building data flows with Celery and SQLAlchemy
Building data flows with Celery and SQLAlchemyBuilding data flows with Celery and SQLAlchemy
Building data flows with Celery and SQLAlchemy
 
Introduction to SQL Alchemy - SyPy June 2013
Introduction to SQL Alchemy - SyPy June 2013Introduction to SQL Alchemy - SyPy June 2013
Introduction to SQL Alchemy - SyPy June 2013
 
Poker, packets, pipes and Python
Poker, packets, pipes and PythonPoker, packets, pipes and Python
Poker, packets, pipes and Python
 
Towards Continuous Deployment with Django
Towards Continuous Deployment with DjangoTowards Continuous Deployment with Django
Towards Continuous Deployment with Django
 
Intro to Pinax: Kickstarting Your Django Apps
Intro to Pinax: Kickstarting Your Django AppsIntro to Pinax: Kickstarting Your Django Apps
Intro to Pinax: Kickstarting Your Django Apps
 

Dernier

Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...apidays
 
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelNavi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelDeepika Singh
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu SubbuApidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbuapidays
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxRustici Software
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024The Digital Insurer
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024The Digital Insurer
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsNanddeep Nachan
 
A Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusA Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusZilliz
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 

Dernier (20)

Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
 
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelNavi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu SubbuApidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
A Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusA Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source Milvus
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 

Scraping recalcitrant web sites with Python & Selenium

  • 1. Scraping recalcitrant web sites with Python & Selenium Roger Barnes SyPy July 2012
  • 3. Some sites suck - "for your own good" For security reasons, each button is an image, dynamically generated by a hash wrapped in a mess of javascript, randomly placed
  • 4. ...but they work in a web browser! Let's use the web browser to scrape them
  • 5. Enter Selenium Selenium automates browsers That's it
  • 6. Selenium can... ● navigate (windows, frames, links) ● find elements and parse attributes ● interact and trigger events (click, type, ...) ● capture screenshots ● run javascript ● let the browser take care of the hard stuff (cookies, javascript, sessions, profiles, DOM) Comes with various components and bindings ... including python
  • 7. General Recipe Ingredients: ● firefox (or chrome) ● firebug (or chrome dev tools) ● Selenium IDE ○ record a session, write less code ● python and its batteries ● python-selenium ● xvfb and pyvirtualdisplay (optional) ● other libraries to taste ○ eg image manipulation, database access, DOM parsing, OCR
  • 8. General Recipe Method: ● Install requirements (apt-get, pip etc) ○ sudo apt-get install xvfb firefox ○ pip install selenium pyvirtualdisplay ● Start up Firefox and Selenium IDE ● Record a "test" run through site ○ Add in some assertions along the way ● Export test as Python script ● Hack from there ○ Loops ○ Image/data extraction ○ Wrangling data into a database
  • 9.
  • 10. Example from Selenium IDE class Ingdirect2(unittest.TestCase): def setUp(self): self.driver = webdriver.Firefox() self.driver.implicitly_wait( 30) self.base_url = "https://www.ingdirect.com.au" self.verificationErrors = [] def test_ingdirect2(self): driver = self.driver But what about driver.get( self.base_url + "/client/index.aspx") that dang driver.switch_to_frame( 'body') # Had to add this keypad? ... driver.find_element_by_id( "txtCIF").clear() driver.find_element_by_id( "txtCIF").send_keys( "12345678") driver.find_element_by_id( "objKeypad_B1").click() driver.find_element_by_id( "objKeypad_B2").click() driver.find_element_by_id( "objKeypad_B3").click() driver.find_element_by_id( "objKeypad_B4").click() driver.find_element_by_id( "btnLogin").click() self.assertTrue( self.is_element_present(By.ID, "ctl2_lblBalance"))
  • 11. PIL saves the day # Get screenshot for extraction of button images screenshot = driver.get_screenshot_as_base64() im = Image.open(StringIO.StringIO(base64.decodestring(screenshot))) table = driver.find_element_by_xpath( '//*[@id="objKeypad_divShowAll"]/table') all_buttons = table.find_elements_by_tag_name( "input") # Determine md5sum of each button by cropping based on element positions for button in all_buttons: button_image = im.crop(getcropbox(button)) hexid = hashlib.md5(button_image.tostring()).hexdigest() button_mapping[hexid] = button.get_attribute( "id") # Now we know which button is which ( based on previous lookup), enter the PIN for char in self.pin: driver.find_element_by_id(button_mapping[hex_mapping[char]]).click() driver.find_element_by_id( "btnLogin").click() # We're in!!!11one
  • 12. But why do all this? It's my data! ... and I'll graph if i want to * Actual results may vary. Graph indicates open inodes, not high-roller gambling problem
  • 13. That's all folks Slides ● http://bit.ly/scrapium Code ● https://gist.github.com/3015852 Me ● https://twitter.com/mindsocket ● https://github.com/mindsocket ● roger@mindsocket.com.au