WEB SCRAPING
WITH PYTHON
10/2019
Applied Analytics Club
Set Up
• Google Chrome is needed to follow along with this tutorial.
• *Install the Selector Gadget Extension for Chrome as well.*
• If you haven’t already done so, download and install the Anaconda
Python 3 version at:
• https://www.anaconda.com/distribution
• Next, use Terminal or Command Prompt to enter the
following, one by one:
• pip install bs4
• pip install selenium
• pip install requests
• Download all workshop materials at bit.ly/2Mmi6vH
• In case of errors, raise your hand and we will come around. If you have
successfully completed the install, please assist others.
Contents
■ Define Scraping
■ Python Basic Components (Data Types, Functions, Containers, For Loops)
■ Applications:
– Beautiful Soup
■ Demonstration with Follow Up
■ Practice Exercise
– Selenium
■ Demonstration with Follow Up
■ Practice Exercise
■ Things to keep in mind when scraping (robots.txt)
■ Challenge Introduction
■ Q & A
Web Scraping
■ Used for extracting data from websites
■ Automates the process of gathering data
which is typically only accessible via a web
browser
■ Each website is naturally different,
therefore each requires a slightly modified
approach while scraping
■ Not everything can be scraped
Python Basics: Data Types
■ Int e.g. 2,3,4
■ Float e.g. 2.0,3.4, 4.3
■ String e.g. “scraping ftw!”, ”John Doe”
■ Boolean e.g. True, False
■ Others (complex, bytes, etc.)
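The built-in type() function is a quick way to check which data type a value has. The sketch below uses made-up variable names and values purely for illustration.

```python
# Illustrative values; the variable names are invented for this sketch.
count = 3                 # int
rating = 8.7              # float
title = "scraping ftw!"   # str
is_done = False           # bool

for value in (count, rating, title, is_done):
    print(type(value).__name__, value)

# Text pulled out of a web page is a string, so convert it before doing maths.
price_text = "51.77"
price = float(price_text)
print(price + 1)
```

The conversion step matters when scraping, because numbers on a page are read as text first.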
Python Basics: Functions
■ Functions are defined with “def” in the following format:
– def function1(parameter1, parameter2):
      answer = parameter1 + parameter2
      return answer
■ There are two ways to call functions:
1. function1()
   – E.g. type(5) # <class 'int'>
2. object.function1()
   – E.g. "python".upper() # 'PYTHON'
– Which form applies depends on whether the function stands alone or is a
method attached to an object (examples to come later)
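Both calling styles can be tried side by side. In this minimal sketch the function name add is an assumption chosen for illustration; the other calls are Python built-ins.

```python
def add(parameter1, parameter2):
    """Return the sum of the two arguments."""
    answer = parameter1 + parameter2
    return answer

total = add(2, 3)          # calling a standalone function
kind = type(5)             # calling a built-in function
shout = "python".upper()   # calling a method that belongs to a string object

print(total, kind, shout)
```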
Python Basics: Lists
■ A type of data container used to store multiple values at the same time
■ Mutable (Can be changed)
■ Comparable to R’s vector
– E.g. list1 = [0,1,2,3,4]
■ Can contain items of varying data types
– E.g. list2 = [6, 'harry', True, 1.0]
■ Indexing starts at 0
– E.g. list2[0] # 6
■ A list can be nested in another list
– E.g. [1 , [98,109], 6, 7]
■ Call the “append” method to add an item to the end of a list
– E.g. list1.append(5)
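The list operations above can be run together as a short sketch. The names first, last, nested and inner are added here for illustration.

```python
list1 = [0, 1, 2, 3, 4]
list2 = [6, "harry", True, 1.0]

first = list2[0]     # indexing starts at 0
last = list2[-1]     # negative indexes count from the end

list1.append(5)      # lists are mutable, so append changes list1 in place

nested = [1, [98, 109], 6, 7]
inner = nested[1][0]  # index into the inner list with a second pair of brackets

print(first, last, list1, inner)
```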
Python Basics: Dictionaries
■ Collection of key-value pairs
■ Very similar to JSON objects
■ Mutable
■ E.g. dict1 = {'r': 4, 'w': 9, 't': 5}
■ Indexed with keys
– E.g. dict1['r'] # 4
■ Keys are unique
■ Values can be lists or other nested dictionaries
■ A dictionary can also be nested into a list e.g. [{3:4,5:6}, 6,7]
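A short sketch of the dictionary behaviour described above. The movie record is hypothetical and only shows how nested values are reached.

```python
dict1 = {"r": 4, "w": 9, "t": 5}

value = dict1["r"]        # look up a value by its key
dict1["r"] = 10           # keys are unique, so this overwrites the old value
dict1["x"] = 1            # assigning to a new key adds a pair
missing = dict1.get("z")  # get() returns None instead of raising KeyError

# Hypothetical record: values may be lists or other dictionaries.
movie = {"title": "Example Movie", "genres": ["drama"], "ratings": {"site": 9.0}}
site_rating = movie["ratings"]["site"]

print(value, dict1, missing, site_rating)
```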
Python Basics: For Loops
■ Used for iterating over an iterable (a list, a tuple, a dictionary, a set, or a
string)
■ E.g.
– cities_list = ["hong kong", "new york", "miami"]
– for item in cities_list:
      print(item)
# hong kong
# new york
# miami
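When scraping, a loop usually collects results into a list or dictionary rather than only printing them. A minimal sketch of that pattern, where the names capitalised and lengths are assumptions for illustration:

```python
cities_list = ["hong kong", "new york", "miami"]

capitalised = []
lengths = {}
for item in cities_list:
    capitalised.append(item.title())  # build a list of cleaned-up values
    lengths[item] = len(item)         # build a dictionary keyed by city

# Looping over a dictionary's items yields (key, value) pairs.
for city, length in lengths.items():
    print(city, length)
```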
Beautiful Soup
■ Switch to Jupyter Notebook
– Open Anaconda
– Launch Jupyter Notebook
– Go to IMDb’s top 250 drama movies:
■ https://www.imdb.com/search/title?genres=drama&groups=top_250&sort=user_rating,desc
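The workshop notebook is not reproduced here, so the following is only a rough sketch of the usual Beautiful Soup pattern: parse HTML, find the repeating blocks, and pull text out of each. The HTML string, tag names and class names are hypothetical stand-ins and do not reflect IMDb's actual markup.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a downloaded page; real sites differ.
html = """
<div class="movie"><h3><a href="/title/1">First Film</a></h3><span class="rating">9.2</span></div>
<div class="movie"><h3><a href="/title/2">Second Film</a></h3><span class="rating">9.0</span></div>
"""

soup = BeautifulSoup(html, "html.parser")

movies = []
for block in soup.find_all("div", class_="movie"):
    title = block.h3.a.get_text(strip=True)                       # text of the link in the heading
    rating = float(block.find("span", class_="rating").get_text())  # convert the text to a number
    movies.append({"title": title, "rating": rating})

print(movies)
```

For a live page, the HTML string would typically come from requests.get(url).text, and the selectors would be chosen by inspecting the page, for example with the Selector Gadget extension.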
Selenium
■ Download the Chrome web driver (ChromeDriver) from
– http://chromedriver.chromium.org/downloads
■ Place the driver in your working directory
■ Continue with Jupyter Notebook
Scraping Ethics
■ Be respectful of websites’ permissions
■ View the website’s robots.txt file to learn which areas of the site are allowed
or disallowed for scraping
– You can access this file by replacing sitename.com in the following:
www.[sitename.com]/robots.txt
– E.g. IMDb’s robots.txt can be found at https://www.imdb.com/robots.txt
– You can also use https://canicrawl.com/ to check if a website allows
scraping
■ Don’t overload website servers by sending too many requests. Use the
“time.sleep(xx)” function to delay requests.
– This also reduces the risk of your IP address being banned
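One way to apply the time.sleep advice is to pause between consecutive requests inside the loop. This is a minimal sketch: fetch_politely and its parameters are hypothetical names, and a stand-in fetcher is used so the example runs without network access.

```python
import time

def fetch_politely(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)  # wait before every request after the first
        results.append(fetch(url))
    return results

# Stand-in fetcher (an assumption for this sketch) that returns a label only.
pages = fetch_politely(
    ["http://books.toscrape.com/", "http://toscrape.com/"],
    fetch=lambda url: f"fetched {url}",
    delay_seconds=0.1,
)
print(pages)
```

In a real scraper, fetch could be something like lambda url: requests.get(url).text, with a delay of a second or more.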
Interpreting the robots.txt file
■ All pages of the website can be
scraped if you see the following:
– User-agent: *
– Disallow:
■ None of the pages of the website can
be scraped if you see the following:
– User-agent: *
– Disallow: /
■ Example from IMDb’s robots.txt
– The sub-directories listed after
“Disallow:” in that file may not be
scraped
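These rules can also be checked in code with Python's standard-library urllib.robotparser. The sketch below parses robots.txt text directly; the helper can_scrape, the example.com URLs and the /private/ rule are hypothetical and used only for illustration.

```python
from urllib.robotparser import RobotFileParser

def can_scrape(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text allows user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

allow_all = "User-agent: *\nDisallow:"
block_all = "User-agent: *\nDisallow: /"
block_some = "User-agent: *\nDisallow: /private/"  # hypothetical rule

print(can_scrape(allow_all, "https://example.com/any/page"))
print(can_scrape(block_all, "https://example.com/any/page"))
print(can_scrape(block_some, "https://example.com/private/page"))
print(can_scrape(block_some, "https://example.com/public/page"))
```

For a live site, RobotFileParser also offers set_url() and read() to download the file before calling can_fetch().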
Take-home Challenge
■ Scrape a fictional book store: http://books.toscrape.com/
■ Use what you have learned to efficiently scrape the following data for
Travel, Poetry, Art, Humor and Academic books:
– Book Title
– Product Description
– Price (excl. tax)
– Number of Reviews
■ Store all of the data in a single Pandas DataFrame
■ The most efficient scraper will be awarded a prize
■ The submission deadline is one week from today: 4/18/2019, 11:59 pm
Resources
■ https://github.com/devkosal/scraping_tutorial
– All code provided in this lecture can be found here
■ http://toscrape.com/
– Great sample websites for beginner to intermediate scraping practice
■ https://www.edx.org/course/introduction-to-computer-science-and-programming-using-python-0
– Introduction to Computer Science using Python
– Highly recommended course on learning Python and CS from scratch
■ https://www.promptcloud.com/blog/how-to-read-and-respect-robots-file/
– Further reading on interpreting robots.txt
■ https://canicrawl.com/
– Check scraping permissions for any website
