Intro to Data Scraping
Presented by David Selassie Opoku
@sdopoku
13 July 2015
Outline
1. Target audience
2. What is and Why Data Scraping?
3. Use cases
4. Basic steps & Best practices
5. Tools
6. Reference Resources
Target Audience
This should be useful to ...
● Non-tech-savvy data journalists
● Advanced data journalists
● Web developers & data publishers
● School of Data fellows
● Open Data enthusiasts
What is & Why Data Scraping?
Data Scraping: what is it?
scrape [ verb ˈskrāp ]
: to remove from a surface by usually repeated strokes of an edged instrument
: to collect by or as if by scraping, often used with up or together <scrape up the price of a ticket>
- Merriam-Webster
“The transformation of unstructured data on the web, typically in HTML format, into structured data that can be stored and analyzed in a central local database or spreadsheet.”
- Wikipedia (web scraping)
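As a rough illustration of that definition, the sketch below turns a small HTML table into a list of rows using only Python's standard library. The sample HTML, the class name TableScraper and the column names are made up for illustration; a real page would need its own handling.

```python
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # row currently being read, or None
        self._cell = None  # text pieces of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical sample page, not taken from any real website.
SAMPLE_HTML = """
<table>
  <tr><th>Country</th><th>Population</th></tr>
  <tr><td>Ghana</td><td>27000000</td></tr>
  <tr><td>Togo</td><td>7000000</td></tr>
</table>
"""

scraper = TableScraper()
scraper.feed(SAMPLE_HTML)
rows = scraper.rows
print(rows)
```

The unstructured markup ends up as a header row plus data rows, which is the shape a spreadsheet or database table expects.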
When should you scrape data?
● PDF data
● HTML data
Machine-readable data
Example
Use Cases
Cases when you can scrape
● Create a dataset for a data workshop
● Create a database for a data-driven app
● Create a data visualisation for a story
Best Practices
Best Practices For Scrapers
1. Scraping is not scary!
a. Use existing tools
2. Use a modern and friendly browser
a. Chrome, Firefox, Opera, Safari
b. Avoid Internet Explorer
3. Map out the process
a. Where does scraping fit in?
Best Practices For Data Publishers
1. Have a consistent structure
a. Websites
b. PDFs
2. Always think about your data end users
a. Before, during & after publishing
Steps
1. Map out the process/pipeline for your data project
2. Identify your data source (website, PDF, API?)
3. Decide on storage format for your scraped data
a. CSV file, spreadsheet, Google Docs
b. Database
4. Select scraping tool
5. Verify and clean data
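Steps 3 and 5 can be sketched in a few lines of Python. This is a minimal sketch, assuming the scraped data is already a list of rows with a header first; the function names clean_rows and to_csv and the sample rows are hypothetical, and real cleaning rules depend on the dataset.

```python
import csv
import io


def clean_rows(rows):
    """Strip whitespace, then drop incomplete rows and exact duplicates."""
    header, *body = rows
    cleaned, seen = [], set()
    for row in body:
        row = [cell.strip() for cell in row]
        # Verify: every row needs one non-empty value per column.
        if len(row) != len(header) or not all(row):
            continue
        key = tuple(row)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(row)
    return [[h.strip() for h in header]] + cleaned


def to_csv(rows):
    """Serialise rows as CSV text (write this to a file to store it)."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Hypothetical scraped rows with typical problems: stray spaces,
# a duplicate and a row with a missing value.
scraped = [
    ["Country", "Population"],
    [" Ghana ", "27000000"],
    ["Ghana", "27000000"],
    ["Togo", ""],
    ["Togo", "7000000"],
]

csv_text = to_csv(clean_rows(scraped))
print(csv_text)
```

CSV is used here because it opens directly in a spreadsheet; the same cleaned rows could equally be loaded into a database.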
Tools
Tools: Web Browsers
Tools: Scraping Apps
1. Point and click
a. Scraper Google Chrome extension
b. ScraperWiki (Classic version)
c. Import.io, Kimono Labs, Webscraper.io
d. Tabula (PDF)
2. Programming (Python libraries)
a. Beautiful Soup
b. Pattern (PDF and HTML)
c. Scrapy
Tools: Storage & Sharing
1. Google Spreadsheets
2. GitHub
3. Datahub.io
Resources - Readings and Tools
1. Five data scraping tools for would-be data journalists
2. Making data on the web useful: scraping
3. Liberating HTML Data Tables
4. BeautifulSoup
5. Pattern
6. Scrapy
7. Datahub
8. Import.io
9. Kimono
10. Webscraper.io
11. Tabula

Skillshare - Introduction to Data Scraping
