Introduction to Scraping in
            Python


By :-
   
        Mayank Jain (firesofmay@gmail.com)
   
        Gaurav Jain (grvmjain@gmail.com)

                   Code is available at
        https://github.com/firesofmay/Null-Pune-Intro-to-Scraping-Talk-March-2012
Overview of the "Presentation"

    What is Scraping?

    So what is this HTTP?

    Tools of Trade

    User Agents

    Firebug

    Using BeautifulSoup and Regular Expressions

    Using Google Translator to post on Facebook in
    Hindi

    Shodan

    Robots.txt
What is Scraping?

    Web scraping/Web harvesting/Web data
    extraction is a computer software
    technique of extracting information from
    websites.
So what is this HTTP thing?

    If you go to this page -
    http://en.wikipedia.org/wiki/Python_%28programming_language%29


    To view the HTTP requests being made,
    we use a Firefox plugin called
    LiveHTTPHeaders
----------Request From Client to Server----------
GET /wiki/Python_(programming_language) HTTP/1.1
Host: en.wikipedia.org
User-Agent: Mozilla/5.0 (X11; Linux i686; rv:7.0.1) Gecko/20100101 Firefox/7.0.1
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip, deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Connection: keep-alive
Referer: http://en.wikipedia.org/wiki/Python
Cookie: clicktracking-session=QgVKVqIpsfsgsgszgvwBCASkSOdw2O;
  mediaWiki.user.bucket:ext.articleFeedback-tracking=8%3Aignore;
  mediaWiki.user.bucket:ext.articleFeedback-options=8%3Ashow
----------End of Request From Client to Server----------
----------Response From Server to Client----------

    HTTP/1.0 200 OK

    Date: Mon, 10 Oct 2011 12:44:46 GMT

    Server: Apache

    X-Content-Type-Options: nosniff

    Cache-Control: private, s-maxage=0, max-age=0, must-revalidate

    Content-Language: en

    Vary: Accept-Encoding,Cookie

    Last-Modified: Sun, 09 Oct 2011 05:01:32 GMT

    Content-Encoding: gzip

    Content-Length: 47407

    Content-Type: text/html; charset=UTF-8

    Age: 10932

    X-Cache: HIT from sq66.wikimedia.org, MISS from sq65.wikimedia.org

    X-Cache-Lookup: HIT from sq66.wikimedia.org:3128, MISS from
    sq65.wikimedia.org:80

    Connection: keep-alive

    ----------End of Response From Server to Client----------
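    To make the layout of the captured request concrete, here is a minimal sketch that splits a raw HTTP request into its request line and headers. The function name parse_request and the shortened sample request are illustrative assumptions, not part of the talk's code, and the sketch uses Python 3 syntax.

```python
def parse_request(raw):
    """Split a raw HTTP request into (method, path, version, headers)."""
    lines = raw.strip().splitlines()
    # The first line is the request line: METHOD PATH VERSION
    method, path, version = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        if not line.strip():
            break  # a blank line ends the header section
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return method, path, version, headers


# Shortened copy of the request captured with LiveHTTPHeaders
raw_request = """GET /wiki/Python_(programming_language) HTTP/1.1
Host: en.wikipedia.org
Accept-Encoding: gzip, deflate
Connection: keep-alive
"""

method, path, version, headers = parse_request(raw_request)
print(method, path, version)
print(headers["Host"])
```

    The response has the same shape: a status line (HTTP/1.0 200 OK) followed by "Name: value" header lines.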
Tools of Trade

    Linux is preferred (installation commands are for
    Ubuntu)

    Dreampie IDE (For Quick Prototyping)
        
            $ sudo apt-get install dreampie

    Python 2.x (Preferably 2.6+)

    pip installer for Python packages
        
            $ sudo apt-get install python-pip

    Python requests: HTTP for Humans
        
            $ pip install requests

    Python re library for regular expressions
    (built in)

    LiveHTTPHeaders Firefox Plugin
        
            https://addons.mozilla.org/en-US/firefox/addon/live-http-headers/

    Firebug Firefox Plugin
        
            https://addons.mozilla.org/en-US/firefox/addon/firebug/?src=search

    User Agent Switcher Firefox Plugin
        
            https://addons.mozilla.org/en-US/firefox/addon/user-agent-switcher/?src=search

    BeautifulSoup Python Library
        
            http://www.crummy.com/software/BeautifulSoup/#Download
Fetching HTML Page (fetch.py)
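import requests

# URL of the page to download ("%28" and "%29" are the encoded parentheses)
url = 'http://en.wikipedia.org/wiki/Python_%28programming_language%29'

# .content holds the raw response body as bytes
data = requests.get(url).content

# Write in binary mode so the bytes are saved unchanged
with open("debug.html", 'wb') as f:
    f.write(data)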


#To Run

    $ python fetch.py
Why Does User Agent Matter?

    When a software agent operates in a
    network protocol, it often identifies itself,
    its application type, operating system,
    software vendor, or software revision, by
    submitting a characteristic identification
    string to its operating peer.

    In HTTP, SIP, and SMTP/NNTP protocols,
    this identification is transmitted in a
    header field User-Agent. Bots, such as
    Web crawlers, often also include a URL
    and/or e-mail address so that the
    Webmaster can contact the operator of
    the bot.
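    As a small illustration of the header described above, the sketch below attaches a User-Agent string to a request using Python 3's standard urllib. The bot name and contact URL are made-up placeholders, and nothing is sent over the network unless the last commented line is enabled.

```python
import urllib.request


def build_request(url, user_agent):
    """Return a Request object that carries the given User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


# Hypothetical bot identity with a contact URL, as the slide describes
bot_ua = "ExampleBot/0.1 (+http://example.com/bot-info)"
req = build_request("http://en.wikipedia.org/wiki/Python", bot_ua)

# urllib stores header names capitalized as "User-agent"
print(req.get_header("User-agent"))

# To actually send it: urllib.request.urlopen(req).read()
```

    With the requests library used elsewhere in these slides, the equivalent is to pass headers={'User-Agent': ...} to requests.get.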
Demo of How Sites Behave
Differently With Different UAs - I
  
      https://addons.mozilla.org/en-US/firefox/addon/user-agent-switcher/
  
      Visit the above site with UA (User Agent)
      as Firefox
Demo of How Sites Behave
Differently With Different UAs - I
  
      https://addons.mozilla.org/en-US/firefox/addon/user-agent-switcher/
  
      Now visit the above site with UA as IE
  
      To switch your User Agent, use the User Agent
      Switcher add-on.
  
      Notice the new banner asking you to
      install Firefox even though you are using
      Firefox (based on the user agent you
      selected).
Demo of How Sites Behave
Differently With Different UAs - II
 
     https://developers.facebook.com/docs/reference/api/permissions/
 
     Now visit the above site with UA as IE
         
             Asked for Login? But I don't want to
             Login!!!
 
     Let's try a Google bot as UA
         
             Yayyy!!
 
     Let's try a blank UA
         
             Yayy Again! :D
Inspecting Elements with
               Firebug

    We want to fetch the given sale price
    (19.99)


    Go to this link -
    http://www.payless.com/store/product/detail.jsp?catId=cat10243&subCatId=cat10243&skuId=091151050&productId=68423&lotId=091151&category=


    Right click on $19.99 > Inspect Element
    with Firebug
Demo Payless_Parser.py

    Run the code

    $ python Payless_Parser.py

    Price of this item is 19.99

    Modify the url variable to -
    http://www.payless.com/store/product/detail.jsp?catId=cat10088&subCatId=cat10243&skuId=094079050&productId=70984&lotId=094079&category=&catdisplayName=Womens

    Why does this work? Try to understand.
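    One plausible reason the same script keeps working is that product pages share the same markup around the price, so only the value changes. The sketch below shows that idea with a regular expression. The HTML fragment, the sale-price class name and the extract_price function are hypothetical stand-ins; this is not the actual Payless_Parser.py, and the real markup has to be read off the page with Firebug.

```python
import re

# Hypothetical fragment; the real page markup must be checked with Firebug
html = '<div class="product"><span class="sale-price">$19.99</span></div>'


def extract_price(page):
    """Return the first dollar amount inside a sale-price element, or None."""
    match = re.search(r'class="sale-price"[^>]*>\s*\$(\d+\.\d{2})', page)
    return match.group(1) if match else None


print("Price of this item is", extract_price(html))
```

    Returning None when nothing matches makes it easy to notice when the site changes its layout.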
How about Extracting all the
Permissions from this page?
Demo Extract_Facebook_Permissions.py

    URL to extract from:
    https://developers.facebook.com/docs/reference/api/permissions/

    Check the next slide for the expected output
    and how to run the code

    $ python Extract_Facebook_Permissions.py

    ['user_about_me', 'friends_about_me', 'about', 'user_activities', 'friends_activities',
    'activities', 'user_birthday', 'friends_birthday', 'birthday', 'user_checkins',
    'friends_checkins', 'user_education_history', 'friends_education_history',
    'education', 'user_events', 'friends_events', 'events', 'user_groups',
    'friends_groups', 'groups', 'user_hometown', 'friends_hometown', 'hometown',
    'user_interests', 'friends_interests', 'interests', 'user_likes', 'friends_likes', 'likes',
    'user_location', 'friends_location', 'location', 'user_notes', 'friends_notes', 'notes',
    'user_photos', 'friends_photos', 'user_questions', 'friends_questions',
    'user_relationships', 'friends_relationships', 'user_relationship_details',
    'friends_relationship_details', 'user_religion_politics', 'friends_religion_politics',
    'user_status', 'friends_status', 'user_videos', 'friends_videos', 'user_website',
    'friends_website', 'user_work_history', 'friends_work_history', 'work', 'email',
    'email', 'read_friendlists', 'read_insights', 'read_mailbox', 'read_requests',
    'read_stream', 'xmpp_login', 'ads_management', 'create_event',
    'manage_friendlists', 'manage_notifications', 'user_online_presence',
    'friends_online_presence', 'publish_checkins', 'publish_stream', 'publish_stream',
    'rsvp_event']
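    A list like the one above can be collected by walking the page and keeping the text of the elements that hold permission names. The sketch below does this with Python 3's standard html.parser instead of BeautifulSoup. It assumes, purely for illustration, that each name sits inside a <code> element; the sample table is invented and the real page layout has to be inspected first.

```python
from html.parser import HTMLParser


class CodeTextCollector(HTMLParser):
    """Collect the text found inside <code> elements."""

    def __init__(self):
        super().__init__()
        self.in_code = False
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "code":
            self.in_code = True

    def handle_endtag(self, tag):
        if tag == "code":
            self.in_code = False

    def handle_data(self, data):
        # Keep only non-empty text that appears inside a <code> element
        if self.in_code and data.strip():
            self.items.append(data.strip())


# Hypothetical table rows; the real page layout has to be inspected first
sample = """
<table>
  <tr><td><code>user_about_me</code></td><td><code>friends_about_me</code></td></tr>
  <tr><td><code>user_activities</code></td><td><code>friends_activities</code></td></tr>
</table>
"""

collector = CodeTextCollector()
collector.feed(sample)
print(collector.items)
```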
How about writing our version
  of Google Translate API?

    Important: Google Translate API v2 is
    now available as a paid service only,
    and the number of requests your
    application can make per day is limited. As
    of December 1, 2011, Google Translate
    API v1 is no longer available; it was
    officially deprecated on May 26, 2011.
    These decisions were made due to the
    substantial economic burden caused by
    extensive abuse. For website translations,
    we encourage you to use the Google
    Website Translator gadget.
Let's understand how it works
        in the background.

    Use LiveHTTPHeaders to understand this

    Important parameters that are passed

    sl = en (Source Language = English)

    tl = hi (Target Language = Hindi)

    text = hello world


    http://translate.google.com/?sl=en&tl=hi&text=hello+world#
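    The three parameters can be assembled into that URL programmatically. This is a minimal sketch using Python 3's urllib.parse; the helper name translate_url is an assumption made for illustration and is not taken from the talk's code.

```python
from urllib.parse import urlencode


def translate_url(source_lang, target_lang, text):
    """Build a translate.google.com URL from the sl, tl and text parameters."""
    query = urlencode([("sl", source_lang), ("tl", target_lang), ("text", text)])
    return "http://translate.google.com/?" + query


print(translate_url("en", "hi", "hello world"))
```

    urlencode takes care of escaping, so the space in "hello world" becomes "+" just as in the URL seen in the browser.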
How about we post this
converted text to our Facebook
           wall? :)

    fbconsole
       
           Facebook Python API
       
           Simplifies things
       
           Very easy to install
       
           https://github.com/facebook/fbconsole
       
           $ sudo pip install fbconsole


    We'll use the permissions we extracted in
    this script :)
Demo
Google_Translator_With_FB_API.py
$ python Google_Translator_With_FB_API.py
Language to Convert from : en
Language to Convert to : hi
Text to Convert : wow
Converted Text : वाह


    Check your Facebook wall :)
Translated Text Posted on my
       Facebook Wall
What is Shodan?

    Web search engines, such as Google and
    Bing, are great for finding websites. But
    what if you're interested in finding
    computers running a certain piece of
    software (such as Apache)? Or if you want
    to know which version of Microsoft IIS is
    the most popular? Or you want to see how
    many anonymous FTP servers there are?
    Maybe a new vulnerability came out and
    you want to see how many hosts it could
    infect? Traditional web search engines
    don't let you answer those questions.
What is Shodan?

    SHODAN is a search engine that lets you
    find specific computers (routers, servers,
    etc.) using a variety of filters.

    Public port scan directory or a search
    engine of banners.
Scraping Shodan Data Preview

    http://www.shodanhq.com/

    A Python API is available -
    http://docs.shodanhq.com/

    But you have to pay to get the advanced
    features. :-/

    By default, the following search filters for
    Shodan are disabled: net, country, before,
    after. To unlock those filters buy the
    Unlocked API Add-On. No subscription
    required!

    http://www.shodanhq.com/data/addons
Demo shodanparser_New.py
$ python shodanparser_New.py
Query : country:IN HTTP/1.0 200 OK
3
98.146.42.77       United States
178.33.70.221      France
96.217.60.25       United States
115.133.223.66     Malaysia
218.250.60.122     Hong Kong
180.177.12.132     Taiwan
178.63.104.140     Germany
76.85.55.178       United States
67.159.200.99      United States
75.188.142.2       United States
robots.txt

    The Robot Exclusion Standard, also
    known as the Robots Exclusion Protocol
    or robots.txt protocol, is a convention to
    prevent cooperating web crawlers and
    other web robots from accessing all or part
    of a website which is otherwise publicly
    viewable. Robots are often used by
    search engines to categorize and archive
    web sites, or by webmasters to proofread
    source code. The standard is different
    from, but can be used in conjunction with,
    Sitemaps, a robot inclusion standard for
    websites.
robots.txt

    Despite the use of the terms "allow" and
    "disallow", the protocol is purely advisory.
    It relies on the cooperation of the web
    robot, so that marking an area of a site out
    of bounds with robots.txt does not
    guarantee exclusion of all web robots. In
    particular, malicious web robots are
    unlikely to honor robots.txt
facebook.com/robots.txt
User-agent: Googlebot
Disallow: /ac.php
Disallow: /ae.php
Disallow: /album.php
Disallow: /ap.php
Disallow: /autologin.php
Disallow: /checkpoint/
…............
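    A scraper can check these rules before fetching a page. The sketch below feeds the lines quoted above into Python 3's standard urllib.robotparser and asks whether two paths may be fetched. Only the quoted excerpt is used, so the answers say nothing about the rest of the real file, and about.php is just an example path that does not appear in the excerpt.

```python
from urllib.robotparser import RobotFileParser

# The rules quoted on the slide (the real file is much longer)
rules = """User-agent: Googlebot
Disallow: /ac.php
Disallow: /ae.php
Disallow: /album.php
Disallow: /ap.php
Disallow: /autologin.php
Disallow: /checkpoint/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# A path listed under Disallow, and one that is not listed in the excerpt
album_ok = parser.can_fetch("Googlebot", "http://www.facebook.com/album.php")
about_ok = parser.can_fetch("Googlebot", "http://www.facebook.com/about.php")
print(album_ok, about_ok)
```

    In a real scraper, set_url() and read() on the same class download the live robots.txt instead of a pasted excerpt.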
Conclusion

    Scraping has many use cases.

    Most useful to write your own API if the
    website does not provide one or has
    limitations.

    Very useful in combining existing APIs with
    websites that do not provide APIs

    Be careful about how hard you hit a server.

    Follow robots.txt or ask for permission.
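    One simple way to go easy on a server is to pause between requests. The sketch below is an illustration under stated assumptions: fetch_politely, fake_fetch and the delay values are made-up names and numbers, and the stand-in fetcher keeps the example off the network.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=1.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # wait before every request except the first
        results.append(fetch(url))
    return results


# Stand-in fetcher so the sketch runs without touching the network
def fake_fetch(url):
    return "fetched " + url


pages = fetch_politely(["http://example.com/a", "http://example.com/b"],
                       fake_fetch, delay_seconds=0.01)
print(pages)
```

    Passing the real download function (for example one built on requests.get) in place of fake_fetch keeps the throttling logic separate from the fetching code.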
References

    Advanced Scraping Video -

           http://pyvideo.org/video/609/web-scraping-reliably-and-efficiently-pull-data

    Google Python Class Intermediate

           http://code.google.com/edu/languages/google-python-class/set-up.html

           http://www.youtube.com/watch?v=tKTZoB2Vjuk&feature=plcp&context=C42cb319VDvjVQa1PpcFMzwqYlYKVxDoyEu1ISDDTjmz370vY8Xg4%3D

    Python Absolute Beginner

           http://www.youtube.com/watch?v=4Mf0h3HphEA&feature=channel_video_title

    Siddhant Sanyam's PyCon 11 Slides

           https://github.com/siddhant3s/PyCon11-Talk/tree/master/talk1_webscrapping

    http://firesofmay.blogspot.in/2011/10/http-web-scrapping-and-python-part-1.html
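# Python 2 with BeautifulSoup 3, matching the tools listed earlier
from BeautifulSoup import BeautifulSoup
import requests

# Ask Google Translate for the Hindi version of the closing message
url = 'http://translate.google.com/?sl=en&tl=hi&text=Thank+you+Any+Questions?'

# convertEntities turns HTML entities in the page into Unicode characters
soup = BeautifulSoup(requests.get(url).content,
                     convertEntities=BeautifulSoup.HTML_ENTITIES)

# The translated text sits in the span with id "result_box"
result = soup.find('div', {'id': 'gt-res-content'}).find('span', {'id': 'result_box'})
print result.text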
Executing...
शुक्रिया

कोई प्रश्न?

Introduction to python scrapping

  • 1. Introduction to Scraping in Python By :-  Mayank Jain (firesofmay@gmail.com)  Gaurav Jain (grvmjain@gmail.com) Code is available at https://github.com/firesofmay/Null-Pune- Intro-to-Scraping-Talk-March-2012
  • 2. Overview of the ”Presentation”  What is Scraping?  So what is this HTTP?  Tools of Trade  User Agents  Firebug  Using BeautfulSoup and Regular Expressions  Using Google Translator to post on Facebook in hindi  Shodan  Robots.txt
  • 3. What is Scraping?  Web scraping/Web harvesting/Web data extraction is a computer software technique of extracting information from websites.
  • 4. So what is this HTTP thing?  If you goto this page - http://en.wikipedia.org/wiki/Python_%28programming_language%29  To view the HTTP Requests being made we use a firefox Pluging called as LiveHTTPHeaders
  • 5. ----------Request From Client to Server---------- GET /wiki/Python_(programming_language) HTTP/1.1 Host: en.wikipedia.org User-Agent: Mozilla/5.0 (X11; Linux i686; rv:7.0.1) Gecko/20100101 Firefox/7.0.1 Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8 Accept-Language: en-us,en;q=0.5 Accept-Encoding: gzip, deflate Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7 Connection: keep-alive Referer: http://en.wikipedia.org/wiki/Python Cookie: clicktracking-session=QgVKVqIpsfsgsgszgvwBCASkSOdw2O; mediaWiki.user.bucket:ext.articleFeedback-tracking=8%3Aignore; mediaWiki.user.bucket:ext.articleFeedback-options=8%3Ashow ----------End of Request From Client to Server----------
  • 6. ----------Response From Server to Client----------  HTTP/1.0 200 OK  Date: Mon, 10 Oct 2011 12:44:46 GMT  Server: Apache  X-Content-Type-Options: nosniff  Cache-Control: private, s-maxage=0, max-age=0, must-revalidate  Content-Language: en  Vary: Accept-Encoding,Cookie  Last-Modified: Sun, 09 Oct 2011 05:01:32 GMT  Content-Encoding: gzip  Content-Length: 47407  Content-Type: text/html; charset=UTF-8  Age: 10932  X-Cache: HIT from sq66.wikimedia.org, MISS from sq65.wikimedia.org  X-Cache-Lookup: HIT from sq66.wikimedia.org:3128, MISS from sq65.wikimedia.org:80  Connection: keep-alive  ----------End of Response From Server to Client----------
  • 7. Tools of Trade  Linux OS is prefered (Installations Command for Ubuntu Distro)  Dreampie IDE (For Quick Prototyping)  $ sudo apt-get install dreampie  Python 2.x (Preferably 2.6+)  pip installter for python packages  $ sudo apt-get install python-pip  Python requests: HTTP for Humans  $ pip install requests  Python re Library for regular Expressions (Inbuilt)
  • 8. LiveHTTPHeader Firefox Plugin  https://addons.mozilla.org/en-US/firefox/ addon/live-http-headers/  Firebug Firefox Plugin  https://addons.mozilla.org/en-US/firefox/ addon/firebug/?src=search  User Agent Switcher Firefox Plugin  https://addons.mozilla.org/en-US/firefox/ addon/user-agent-switcher/?src=search  BeautifulSoup Python Library  http://www.crummy.com/software/Beautif ulSoup/#Download
  • 9. Fetching HTML Page (fetch.py) import requests url = 'http://en.wikipedia.org/wiki/Python_ %28programming_language%29' data = requests.get(url).content f = open("debug.html", 'w') f.write(data) f.close() #To Run  $ python fetch.py
  • 10. Why Does User Agent Matter?  When software agent operates in a network protocol, it often identifies itself, its application type, operating system, software vendor, or software revision, by submitting a characteristic identification string to its operating peer.  In HTTP, SIP, and SMTP/NNTP protocols, this identification is transmitted in a header field User-Agent. Bots, such as Web crawlers, often also include a URL and/or e-mail address so that the Webmaster can contact the operator of the bot.
  • 11. Demo of How Sites Behave Differently With Different UAs - I  https://addons.mozilla.org/en- US/firefox/addon/user-agent-switcher/  Visit the above site with UA (User Agent) as firefox
  • 12.
  • 13. Demo of How Sites Behave Differently With Different UAs - I  https://addons.mozilla.org/en- US/firefox/addon/user-agent-switcher/  Now visit the above site with UA as IE  To switch your User Agent Use User Agent Switcher Addon.  Notice the new banner, asking you to install firefox even though you are using firefox (based on your user agent selected).
  • 15. Demo of How Sites Behave Differently With Different UAs - II  https://developers.facebook.com/docs/reference/api/permissions/  Now visit the above site with the UA set to IE  Asked for Login? But I don't want to Login!!!  Let's try a Google bot as UA  Yayyy!!  Let's try a blank UA  Yayy Again! :D
  • 17. Inspecting Elements with Firebug  We want to fetch the given sale price (19.99)  Go to this link - http://www.payless.com/store/product/detail.jsp?catId=cat10243&subCatId=cat10243&skuId=091151050&productId=68423&lotId=091151&category=  Right-click on $19.99 > Inspect Element with Firebug
  • 19. Demo Payless_Parser.py  Run the code  $ python Payless_Parser.py  Price of this item is 19.99  Modify the url variable to - http://www.payless.com/store/product/detail.jsp?catId=cat10088&subCatId=cat10243&skuId=094079050&productId=70984&lotId=094079&category=&catdisplayName=Womens  Why does this work? Try to understand.
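The source of Payless_Parser.py is not shown on the slides, so the following is only a sketch of the general idea, assuming Python 3 and the built-in re library. The HTML snippet is hypothetical; the real page's tags and class names have to be read off with Firebug. A pattern like this would keep working on a second product URL only as long as both pages render the price in the same form.

```python
import re

# Hypothetical markup standing in for the part of the page that Firebug
# highlights; the real element names may differ.
SAMPLE_HTML = '<div class="price"><span class="sale">Sale $19.99</span></div>'

# A dollar sign followed by digits, a dot, and exactly two decimals.
PRICE_RE = re.compile(r"\$(\d+\.\d{2})")


def extract_price(html):
    """Return the first dollar amount found in html as a string, or None."""
    match = PRICE_RE.search(html)
    return match.group(1) if match else None


print("Price of this item is", extract_price(SAMPLE_HTML))
```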
  • 20. How about Extracting all the Permissions from this page?
  • 21. Demo Extract_Facebook_Permissions.py  URL to extract from: https://developers.facebook.com/docs/reference/api/permissions/  Check the next slide for the expected output and how to run the code
  • 22. $ python Extract_Facebook_Permissions.py  ['user_about_me', 'friends_about_me', 'about', 'user_activities', 'friends_activities', 'activities', 'user_birthday', 'friends_birthday', 'birthday', 'user_checkins', 'friends_checkins', 'user_education_history', 'friends_education_history', 'education', 'user_events', 'friends_events', 'events', 'user_groups', 'friends_groups', 'groups', 'user_hometown', 'friends_hometown', 'hometown', 'user_interests', 'friends_interests', 'interests', 'user_likes', 'friends_likes', 'likes', 'user_location', 'friends_location', 'location', 'user_notes', 'friends_notes', 'notes', 'user_photos', 'friends_photos', 'user_questions', 'friends_questions', 'user_relationships', 'friends_relationships', 'user_relationship_details', 'friends_relationship_details', 'user_religion_politics', 'friends_religion_politics', 'user_status', 'friends_status', 'user_videos', 'friends_videos', 'user_website', 'friends_website', 'user_work_history', 'friends_work_history', 'work', 'email', 'email', 'read_friendlists', 'read_insights', 'read_mailbox', 'read_requests', 'read_stream', 'xmpp_login', 'ads_management', 'create_event', 'manage_friendlists', 'manage_notifications', 'user_online_presence', 'friends_online_presence', 'publish_checkins', 'publish_stream', 'publish_stream', 'rsvp_event']
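The script itself is not reproduced on the slides. As a rough illustration of how a list like the one above could be collected, here is a sketch that assumes Python 3 and uses the standard library's html.parser in place of BeautifulSoup. The sample markup, and the assumption that each permission name sits in its own <code> element, are hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical markup: assumes every permission name is wrapped in <code>.
SAMPLE_HTML = """
<table>
  <tr><td><code>user_about_me</code></td><td><code>friends_about_me</code></td></tr>
  <tr><td><code>email</code></td><td>No friends equivalent</td></tr>
</table>
"""


class CodeTextCollector(HTMLParser):
    """Collect the text content of every <code> element."""

    def __init__(self):
        super().__init__()
        self.in_code = False
        self.names = []

    def handle_starttag(self, tag, attrs):
        if tag == "code":
            self.in_code = True

    def handle_endtag(self, tag):
        if tag == "code":
            self.in_code = False

    def handle_data(self, data):
        # Only keep non-empty text that appears inside a <code> element.
        if self.in_code and data.strip():
            self.names.append(data.strip())


def extract_permissions(html):
    """Return the permission names found in html, in document order."""
    parser = CodeTextCollector()
    parser.feed(html)
    return parser.names


print(extract_permissions(SAMPLE_HTML))
```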
  • 23. How about writing our version of Google Translate API?  Important: Google Translate API v2 is now available as a paid service only, and the number of requests your application can make per day is limited. As of December 1, 2011, Google Translate API v1 is no longer available; it was officially deprecated on May 26, 2011. These decisions were made due to the substantial economic burden caused by extensive abuse. For website translations, we encourage you to use the Google Website Translator gadget.
  • 24. Let's understand how it works in the background.  Use LiveHTTPHeaders to understand this  Important parameters that are passed  sl = en (Source Language = English)  tl = hi (Target Language = Hindi)  text = hello world  http://translate.google.com/?sl=en&tl=hi&text=hello+world#
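The three parameters above can be assembled into the query string programmatically. This is a small sketch assuming Python 3; the function name build_translate_url is made up for illustration, and the trailing # from the slide's URL is left out.

```python
from urllib.parse import urlencode

BASE_URL = "http://translate.google.com/"


def build_translate_url(text, source="en", target="hi"):
    """Build the translate URL from sl (source), tl (target) and text."""
    # A list of pairs keeps the parameters in the same order as on the slide.
    params = [("sl", source), ("tl", target), ("text", text)]
    # urlencode escapes special characters and turns spaces into '+'.
    return BASE_URL + "?" + urlencode(params)


print(build_translate_url("hello world"))
```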
  • 25. How about we post this converted text to our facebook wall? :)  fbconsole  Facebook Python API  Simplifies things  Very easy to install  https://github.com/facebook/fbconsole  $ sudo pip install fbconsole  We'll use the permissions we extracted in this script :)
  • 26. Demo Google_Translator_With_FB_API.py
      $ python Google_Translator_With_FB_API.py
      Language to Convert from : en
      Language to Convert to : hi
      Text to Convert : wow
      Converted Text : वाह
      Check your Facebook wall :)
  • 27. Translated Text Posted on my Facebook Wall
  • 28. What is Shodan?  Web search engines, such as Google and Bing, are great for finding websites. But what if you're interested in finding computers running a certain piece of software (such as Apache)? Or if you want to know which version of Microsoft IIS is the most popular? Or you want to see how many anonymous FTP servers there are? Maybe a new vulnerability came out and you want to see how many hosts it could infect? Traditional web search engines don't let you answer those questions.
  • 29. What is Shodan?  SHODAN is a search engine that lets you find specific computers (routers, servers, etc.) using a variety of filters.  Public port scan directory or a search engine of banners.
  • 30. Scraping Shodan Data Preview  http://www.shodanhq.com/  A Python API is available - http://docs.shodanhq.com/  But you have to pay for the advanced features. :-/  By default, the following search filters for Shodan are disabled: net, country, before, after. To unlock those filters buy the Unlocked API Add-On. No subscription required!  http://www.shodanhq.com/data/addons
  • 31. Demo shodanparser_New.py
      $ python shodanparser_New.py
      Query : country:IN
      HTTP/1.0 200 OK 3
      98.146.42.77 United States
      178.33.70.221 France
      96.217.60.25 United States
      115.133.223.66 Malaysia
      218.250.60.122 Hong Kong
      180.177.12.132 Taiwan
      178.63.104.140 Germany
      76.85.55.178 United States
      67.159.200.99 United States
      75.188.142.2 United States
  • 32. robots.txt  The Robot Exclusion Standard, also known as the Robots Exclusion Protocol or robots.txt protocol, is a convention to prevent cooperating web crawlers and other web robots from accessing all or part of a website which is otherwise publicly viewable. Robots are often used by search engines to categorize and archive web sites, or by webmasters to proofread source code. The standard is different from, but can be used in conjunction with, Sitemaps, a robot inclusion standard for websites.
  • 33. robots.txt  Despite the use of the terms "allow" and "disallow", the protocol is purely advisory. It relies on the cooperation of the web robot, so that marking an area of a site out of bounds with robots.txt does not guarantee exclusion of all web robots. In particular, malicious web robots are unlikely to honor robots.txt
  • 34. facebook.com/robots.txt User-agent: Googlebot Disallow: /ac.php Disallow: /ae.php Disallow: /album.php Disallow: /ap.php Disallow: /autologin.php Disallow: /checkpoint/ …............
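A scraper can check these rules before fetching a page. The sketch below assumes Python 3 and uses urllib.robotparser from the standard library. It parses only the excerpt shown on the slide (the real file is longer), so its answers reflect that excerpt alone.

```python
from urllib.robotparser import RobotFileParser

# Only the rules visible on the slide; the real robots.txt continues.
ROBOTS_EXCERPT = """\
User-agent: Googlebot
Disallow: /ac.php
Disallow: /ae.php
Disallow: /album.php
Disallow: /ap.php
Disallow: /autologin.php
Disallow: /checkpoint/
"""


def make_parser(robots_text):
    """Return a RobotFileParser loaded from robots.txt text (no network)."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


rules = make_parser(ROBOTS_EXCERPT)

# Ask whether a given user agent may fetch a given URL.
print(rules.can_fetch("Googlebot", "http://www.facebook.com/album.php"))
print(rules.can_fetch("Googlebot", "http://www.facebook.com/about"))
```

To check a live site instead, RobotFileParser also offers set_url() and read(), which download the file before can_fetch() is called.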
  • 35. Conclusion  Scraping has many use cases.  It is most useful for writing your own API when a website does not provide one or its API has limitations.  It is also very useful for combining existing APIs with websites that do not provide APIs.  Be careful how hard you hit a server.  Follow robots.txt or ask for permission.
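One simple way to avoid hitting a server too hard is to enforce a minimum pause between requests. The class below is a hypothetical helper written for illustration (it is not part of the talk's code), assuming Python 3. The clock and sleep functions are injectable so the logic can be tried without actually waiting.

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between requests
        self.clock = clock
        self.sleep = sleep
        self.last_call = None

    def wait(self):
        """Sleep just long enough to respect min_interval; return seconds slept."""
        delay = 0.0
        if self.last_call is not None:
            elapsed = self.clock() - self.last_call
            delay = max(0.0, self.min_interval - elapsed)
        if delay:
            self.sleep(delay)
        self.last_call = self.clock()
        return delay


# Typical use (fetch is a placeholder for whatever download call you use):
#   throttle = Throttle(1.0)
#   for url in urls:
#       throttle.wait()
#       fetch(url)
```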
  • 36. References  Advanced Scraping Video -  http://pyvideo.org/video/609/web-scraping-reliably-and-efficiently-pull-data  Google Python Class Intermediate  http://code.google.com/edu/languages/google-python-class/set-up.html  http://www.youtube.com/watch?v=tKTZoB2Vjuk&feature=plcp&context=C42cb319VDvjVQa1PpcFMzwqYlYKVxDoyEu1ISDDTjmz370vY8Xg4%3D
  • 37. References  Python Absolute Beginner  http://www.youtube.com/watch?v=4Mf0h3HphEA&feature=channel_video_title  Siddhant Sanyam's PyCon 11 Slides  https://github.com/siddhant3s/PyCon11-Talk/tree/master/talk1_webscrapping
  • 38. References  http://firesofmay.blogspot.in/2011/10/http-web-scrapping-and-python-part-1.html
  • 39.
      from BeautifulSoup import BeautifulSoup
      import requests

      url = 'http://translate.google.com/?sl=en&tl=hi&text=Thank+you+Any+Questions?'
      # Parse the response and decode HTML entities into real characters
      soup = BeautifulSoup(requests.get(url).content, convertEntities=BeautifulSoup.HTML_ENTITIES)
      # The translated text sits in the span with id "result_box"
      print soup.find('div', {'id' : 'gt-res-content'}).find('span', {'id':'result_box'}).text