2. Set Up
• Google Chrome is needed to follow along with this tutorial.
• Install the Selector Gadget extension for Chrome as well.
• If you haven't already, download and install the Anaconda Python 3 version at:
• https://www.anaconda.com/distribution
• Next, use Terminal or Command Prompt to enter the following, one by one:
• pip install bs4
• pip install selenium
• pip install requests
• Download all workshop materials @ bit.ly/2Mmi6vH
• In case of errors, raise your hand and we will come around. For those who have successfully completed the install, please assist others.
3. Contents
■ Define Scraping
■ Python Basic Components (Data Types, Functions, Containers, For Loops)
■ Applications:
– Beautiful Soup
■ Demonstration with Follow Up
■ Practice Exercise
– Selenium
■ Demonstration with Follow Up
■ Practice Exercise
■ Things to keep in mind when scraping (robots.txt)
■ Challenge Introduction
■ Q & A
4. Web Scraping
■ Used for extracting data from websites
■ Automates the process of gathering data which is typically only accessible via a web browser
■ Each website is naturally different, therefore each requires a slightly modified approach while scraping
■ Not everything can be scraped
5. Python Basics: Data Types
■ Int e.g. 2, 3, 4
■ Float e.g. 2.0, 3.4, 4.3
■ String e.g. "scraping ftw!", "John Doe"
■ Boolean True, False
■ Others (Complex, Unicode etc.)
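A quick illustrative sketch of these built-in types (the variable names are made up for the example):

```python
count = 3             # int
price = 4.3           # float
name = "John Doe"     # str
is_ready = True       # bool

# type() reports the type of any value
print(type(count), type(price), type(name), type(is_ready))
# <class 'int'> <class 'float'> <class 'str'> <class 'bool'>
```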
6. Python Basics: Functions
■ Functions start with "def" with the following format
– def function1(parameter1, parameter2):
      answer = parameter1 + parameter2
      return answer
■ There are two ways to call functions:
1. function1()
   E.g. type(5) # int
2. object.function1()
   E.g. "python".upper() # "PYTHON"
– Used under different circumstances (examples to come later)
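A minimal runnable sketch of the pattern above (add_numbers is just an illustrative name):

```python
# Defining a function with "def"
def add_numbers(parameter1, parameter2):
    answer = parameter1 + parameter2
    return answer

# Call style 1: plain function call
print(add_numbers(2, 3))   # 5
print(type(5))             # <class 'int'>

# Call style 2: method call on an object
print("python".upper())    # PYTHON
```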
7. Python Basics: Lists
■ Type of data container which is used to store multiple items at the same time
■ Mutable (can be changed)
■ Comparable to R's vector
– E.g. list1 = [0, 1, 2, 3, 4]
■ Can contain items of varying data types
– E.g. list2 = [6, 'harry', True, 1.0]
■ Indexing starts with 0
– E.g. list2[0] # 6
■ A list can be nested in another list
– E.g. [1, [98, 109], 6, 7]
■ Call the "append" function to add an item to a list
– E.g. list1.append(5)
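The examples from this slide, collected into one runnable sketch:

```python
list1 = [0, 1, 2, 3, 4]
list2 = [6, 'harry', True, 1.0]   # mixed data types are fine

print(list2[0])           # 6 (indexing starts at 0)
nested = [1, [98, 109], 6, 7]
print(nested[1][0])       # 98 (index into the inner list)

list1.append(5)           # list1 is now [0, 1, 2, 3, 4, 5]
```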
8. Python Basics: Dictionaries
■ Collection of key-value pairs
■ Very similar to JSON objects
■ Mutable
■ E.g. dict1 = {'r': 4, 'w': 9, 't': 5}
■ Indexed with keys
– E.g. dict1['r']
■ Keys are unique
■ Values can be lists or other nested dictionaries
■ A dictionary can also be nested into a list e.g. [{3: 4, 5: 6}, 6, 7]
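The same ideas as a short sketch:

```python
dict1 = {'r': 4, 'w': 9, 't': 5}
print(dict1['r'])                  # 4, indexed by key rather than position

dict1['scores'] = [1, 2, 3]        # values can be lists...
movie = {'info': {'title': 'Up', 'year': 2009}}   # ...or nested dictionaries
mixed = [{3: 4, 5: 6}, 6, 7]       # a dictionary nested inside a list
```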
9. Python Basics: For Loops
■ Used for iterating over a sequence (a list, a tuple, a dictionary, a set, or a string)
■ E.g.
– cities_list = ["hong kong", "new york", "miami"]
– for item in cities_list:
      print(item)
# hong kong
# new york
# miami
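The loop from the slide, plus a dictionary loop for comparison (a small illustrative addition):

```python
cities_list = ["hong kong", "new york", "miami"]
for item in cities_list:
    print(item)            # hong kong / new york / miami

dict1 = {'r': 4, 'w': 9}
for key in dict1:          # iterating over a dictionary yields its keys
    print(key, dict1[key])
```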
10. Beautiful Soup
■ Switch to Jupyter Notebook
– Open Anaconda
– Launch Jupyter Notebook
– Go to IMDB's Top 250 drama movies:
■ https://www.imdb.com/search/title?genres=drama&groups=top_250&sort=user_rating,desc
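The live demo lives in the workshop notebook; the sketch below is only an approximation of it using requests and Beautiful Soup. The CSS selector is an assumption about IMDB's markup at the time and may need adjusting if the page changes.

```python
import requests
from bs4 import BeautifulSoup

url = ("https://www.imdb.com/search/title"
       "?genres=drama&groups=top_250&sort=user_rating,desc")
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

# Assumed markup: each result title sits in <h3 class="lister-item-header"><a>...</a></h3>
for link in soup.select("h3.lister-item-header a"):
    print(link.get_text(strip=True))
```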
11. Selenium
■ Download the chrome web driver from
– http://chromedriver.chromium.org/downloads
■ Place the driver in your working directory
■ Continue with Jupyter Notebook
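Again, the demo itself is in the notebook; below is a minimal sketch of driving Chrome and handing the rendered HTML to Beautiful Soup. It assumes the Selenium 3.x-style API in use at the time (newer Selenium versions configure the driver path through a Service object instead).

```python
import time
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome("./chromedriver")   # chromedriver placed in the working directory
driver.get("https://www.imdb.com/chart/top")  # any JavaScript-heavy page works here
time.sleep(2)                                 # give the page a moment to render

soup = BeautifulSoup(driver.page_source, "html.parser")
print(soup.title.get_text())

driver.quit()
```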
12. Scraping Ethics
■ Be respectful of websites' permissions
■ View the website's robots.txt file to learn which areas of the site are allowed or disallowed from scraping
– You can access this file by replacing sitename.com in the following: www.[sitename.com]/robots.txt
– E.g. IMDB's robots.txt can be found at https://www.imdb.com/robots.txt
– You can also use https://canicrawl.com/ to check if a website allows scraping
■ Don't overload website servers by sending too many requests. Use the "time.sleep(xx)" function to delay requests.
– This will also prevent your IP address from being banned
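A small sketch of a polite delay between requests (the two-second pause and the practice-site URLs are just example values):

```python
import time
import requests

urls = ["http://books.toscrape.com/catalogue/page-1.html",
        "http://books.toscrape.com/catalogue/page-2.html"]

for url in urls:
    response = requests.get(url)
    print(url, response.status_code)
    time.sleep(2)   # pause before the next request
```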
13. Interpreting the robots.txt file
■ All pages of the website can be scraped if you see the following:
– User-agent: *
– Disallow:
■ None of the pages of the website can be scraped if you see the following:
– User-agent: *
– Disallow: /
■ Example from IMDB
– The sub-directories mentioned here are disallowed from being scraped
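Instead of reading the file by hand, Python's built-in urllib.robotparser can answer the same question; a small sketch (the True/False answers depend on IMDB's current rules):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://www.imdb.com/robots.txt")
rp.read()

# can_fetch(user_agent, url) applies the Allow/Disallow rules for you
print(rp.can_fetch("*", "https://www.imdb.com/chart/top"))
print(rp.can_fetch("*", "https://www.imdb.com/ap/signin"))
```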
14. Take-home Challenge
■ Scrape a fictional book store: http://books.toscrape.com/
■ Use what you have learned to efficiently scrape the following data for Travel, Poetry, Art, Humor and Academic books:
– Book Title
– Product Description
– Price (excl. tax)
– Number of Reviews
■ Store all of the data in a single Pandas DataFrame
■ The most efficient scraper will be awarded with a prize
■ Deadline for submissions is one week from today, 4/18/2019 11:59pm
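A hypothetical starting skeleton for storing the results (not the reference solution; scraping each category page is left to you):

```python
import pandas as pd

records = []
# ...for each book in the Travel, Poetry, Art, Humor and Academic
# categories, scrape the fields and append one dictionary per book:
records.append({
    "title": "Example Book",        # placeholder values only
    "description": "...",
    "price_excl_tax": "£10.00",
    "num_reviews": 0,
})

df = pd.DataFrame(records)
print(df.head())
```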
15. Resources
■ https://github.com/devkosal/scraping_tutorial
– All code provided in this lecture can be found here
■ http://toscrape.com/
– Great sample websites to perform beginner to intermediate scraping on
■ https://www.edx.org/course/introduction-to-computer-science-and-programming-using-python-0
– Introduction to Computer Science using Python
– Highly recommended course on learning Python and CS from scratch
■ https://www.promptcloud.com/blog/how-to-read-and-respect-robots-file/
– Further reading on interpreting robots.txt
■ https://canicrawl.com/
– Check scraping permissions for any website