Intro to Data Scraping
PRESENTED BY
DAVID SELASSIE OPOKU
@sdopoku
13 July 2015
Outline
1. Target audience
2. What is and Why Data Scraping?
3. Use cases
4. Basic steps & Best practices
5. Tools
6. Reference Resources
Target Audience
This should be useful to ...
● Non-tech-savvy data journalists
● Advanced data journalists
● Web developers & data publishers
● School of Data fellows
● Open Data enthusiasts
What is & Why Data Scraping?
Data Scraping: what is it?
scrape [ verb ˈskrāp ]
: to remove from a surface by usually repeated strokes of an edged instrument
: to collect by or as if by scraping —often used with up or together <scrape up the
price of a ticket>
- Merriam Webster
“The transformation of unstructured data on the web, typically in HTML format, into
structured data that can be stored and analyzed in a central local database or
spreadsheet.”
- Wikipedia (web scraping)
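
To make the definition above concrete, here is a minimal sketch of turning an HTML table into structured rows. It uses only Python's standard-library html.parser so that it runs without extra installs. The class name TableScraper, the helper scrape_table and the sample table are all invented for illustration and do not come from the slides.

```python
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collect the text of every table cell, grouped by row (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # row currently being read, or None
        self._cell = None  # text pieces of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def scrape_table(html):
    """Return the table cells found in an HTML string as a list of rows."""
    parser = TableScraper()
    parser.feed(html)
    parser.close()
    return parser.rows


# Invented sample page, used only to demonstrate the idea.
SAMPLE_HTML = """
<table>
  <tr><th>Region</th><th>Schools</th></tr>
  <tr><td>Ashanti</td><td> 120 </td></tr>
  <tr><td>Volta</td><td>85</td></tr>
</table>
"""

rows = scrape_table(SAMPLE_HTML)
```

After this runs, rows holds one list per table row, which is the "structured data" the definition refers to. A dedicated library such as Beautiful Soup would typically need less code for the same job.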
When should you scrape data?
● PDF data
● HTML data
Machine-readable data
Example
Use Cases
Cases when you can scrape
● Create a dataset for a data workshop
● Create a database for a data-driven app
● Create a data visualisation for a story
Best Practices
Best Practices For Scrapers
1. Scraping is not scary!
a. Use existing tools
2. Use a modern and friendly browser
a. Chrome, Firefox, Opera, Safari
b. Avoid Internet Explorer
3. Map out the process
a. Where does scraping fit in?
Best Practices For Data Publishers
1. Have a consistent structure
a. Websites
b. PDFs
2. Always think about your data end users
a. Before, during & after publishing
Steps
1. Map out the process/pipeline for your data project
2. Identify your data source (website, PDF, API?)
3. Decide on storage format for your scraped data
a. CSV file, spreadsheet, Google Docs
b. Database
4. Select scraping tool
5. Verify and Clean data
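
Steps 3 and 5 above can be sketched in a few lines of Python, assuming the scraped data is already a list of rows. The function names clean_rows, verify_rows and to_csv and the sample rows are hypothetical and chosen only for this illustration. The check shown (every row has as many cells as the header) is one simple form of verification, not the only one.

```python
import csv
import io


def clean_rows(rows):
    """Strip whitespace from cells and drop rows that are entirely empty."""
    cleaned = []
    for row in rows:
        cells = [cell.strip() for cell in row]
        if any(cells):
            cleaned.append(cells)
    return cleaned


def verify_rows(rows):
    """Return True when every row has as many cells as the first (header) row."""
    if not rows:
        return False
    width = len(rows[0])
    return all(len(row) == width for row in rows)


def to_csv(rows):
    """Serialise rows to CSV text; write this string to a file to store it."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Invented scraped rows with typical problems: stray spaces and an empty row.
scraped = [
    ["Region", "Schools"],
    [" Ashanti ", "120"],
    ["", "  "],
    ["Volta", "85 "],
]

cleaned = clean_rows(scraped)
csv_text = to_csv(cleaned) if verify_rows(cleaned) else ""
```

The resulting CSV text can be saved to a file or pasted into a spreadsheet, which matches the storage options listed in step 3.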
Tools
Tools: Web Browsers
Tools: Scraping Apps
1. Point and click
a. Scraper Google Chrome extension
b. ScraperWiki (Classic version)
c. Import.io, Kimono Labs, Webscraper.io
d. Tabula (PDF)
2. Programming (Python libraries)
a. Beautiful Soup
b. Pattern (PDF and HTML)
c. Scrapy
Tools: Storage & Sharing
1. Google Spreadsheets
2. GitHub
3. Datahub.io
Resources - Readings and Tools
1. Five data scraping tools for would-be data journalists
2. Making data on the web useful: scraping
3. Liberating HTML Data Tables
4. Beautiful Soup
5. Pattern
6. Scrapy
7. Datahub
8. Import.io
9. Kimono
10. Webscraper.io
11. Tabula
