Basic Web Scraping
Coding for Communicators – Dr. Cindy Royal, CU-Boulder
Scraping is a process by which you can extract data from an HTML page or
PDF into a CSV or another format, so you can work with it in Excel or another
spreadsheet program and use it in your visualizations. Sometimes you can just copy
and paste data from an HTML table into a spreadsheet, and all will be fine.
Other times, not. In those cases, you need a tool or program that lets
you identify and scrape the data you want. There are numerous ways
to do this, some that involve programming and some that don't.
Scraping with Google Spreadsheet
If copying and pasting directly into a spreadsheet doesn't work, you can try
using Google Spreadsheet functions to scrape the data. Open a new Google
Spreadsheet. We are going to scrape the data from the Texas Music Office
site. The URL http://governor.state.tx.us/music/musicians/talent/talent/
goes to the first page of the directory, listing the bands that start with A.
In the first cell of your spreadsheet, type the function:
=ImportHtml("http://governor.state.tx.us/music/musicians/talent/talent/", "table", 1)
• The first argument is the URL.
• The second argument tells it to look for a table (the other element allowed
here is "list").
• The third argument is the index of the table (if there are multiple tables on
the page).
You may have to look at the HTML to find the table you are trying to get the
data from, or find it through trial and error by changing the third argument.
Give it a couple of seconds and you should see the data directly from the table
in your spreadsheet. Easy!
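If you are comfortable with a little code, the same three arguments map onto a
short Python sketch as well. This is only an illustration, not part of the
spreadsheet workflow; it assumes the third-party pandas and lxml libraries are
installed, and note that pandas counts tables from 0 where ImportHtml counts from 1:

# Sketch: a Python equivalent of =ImportHtml(url, "table", 1).
# Assumes: pip install pandas lxml
import pandas as pd

url = "http://governor.state.tx.us/music/musicians/talent/talent/"
tables = pd.read_html(url)        # parses every <table> on the page into a list
tables[0].to_csv("talent.csv", index=False)  # index 0 = first table; "talent.csv" is a hypothetical output name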
Google Spreadsheets has a few other functions that can be helpful in
scraping data.
• ImportFeed will scrape from an RSS feed. Try:
=ImportFeed("http://news.google.com/news?pz=1&cf=all&ned=us&hl=en&topic=t&output=rss")
Find any RSS feed by looking for the RSS icon. This link pulls current
items from Google News.
• If your data is already in CSV format, you can save a step by bringing it
into your spreadsheet with ImportData. This will also scrape the
data directly from the site, so it will pull from the most recent version
of the file. Try:
=ImportData("http://www.census.gov/popest/data/national/totals/2014/files/NST_EST2014_ALLDATA.csv")
Of course, you can also just open the CSV in the spreadsheet program
and follow the instructions for importing it. (Code-based equivalents of both
functions are sketched below.)
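As promised above, here is a Python sketch of rough equivalents for ImportFeed
and ImportData. It is a sketch only: it assumes the third-party feedparser and
pandas libraries and reuses the URLs from the examples above.

# Sketch: an ImportFeed equivalent - parse an RSS feed into entries.
# Assumes: pip install feedparser pandas
import feedparser
import pandas as pd

feed = feedparser.parse(
    "http://news.google.com/news?pz=1&cf=all&ned=us&hl=en&topic=t&output=rss")
for entry in feed.entries[:5]:
    print(entry.title, entry.link)   # headline and link for the first five items

# Sketch: an ImportData equivalent - read a remote CSV straight into a table.
df = pd.read_csv("http://www.census.gov/popest/data/national/totals/2014/files/NST_EST2014_ALLDATA.csv")
print(df.head())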
Chrome Scraper Extension
https://chrome.google.com/extensions/detail/mbigbapnjcgaffohmbkdlecaccepngjd –
a free extension for Chrome. Select content on a page, then use the context
menu to choose Scrape Similar. You can export the results to a Google Doc.
Download and install the Chrome Scraper extension. Go to this page:
http://en.wikipedia.org/wiki/List_of_Academy_Award-winning_films
It includes a list of Academy Award-winning films in a table. Select the first
row of the table, Ctrl-click, and choose Scrape Similar. You should see the
entire table. You can easily export it to Google Docs with the button.
Notice the XPath description. This is the code the scraper used to find the
table.
//div[4]/table[1]/tbody/tr[td]
It found the first table in the fourth div and extracted elements in the tds.
This method did not capture the links. To get those, try right-clicking just
the first item in the table (unselect the entire row first). This should find
all the links. Take a look at the difference in the XPath code:
//td/i/a
This technique finds all the links within <i> tags in the tds.
You can learn more about XPath syntax at
http://www.w3schools.com/xpath/xpath_syntax.asp.
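If you want to experiment with XPath expressions outside the extension, you can
evaluate them against the same page in Python. A minimal sketch, assuming the
third-party requests and lxml libraries:

# Sketch: evaluate the //td/i/a expression from the example above.
# Assumes: pip install requests lxml
import requests
from lxml import html

page = requests.get(
    "http://en.wikipedia.org/wiki/List_of_Academy_Award-winning_films")
tree = html.fromstring(page.content)

# Every link inside an <i> tag inside a table cell
for link in tree.xpath("//td/i/a"):
    print(link.text_content(), link.get("href"))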
Import.io
Import.io offers Web-based scraping and API creation. This is a great site: you
can input a URL and get the resulting data. You can download the data, or you
can create an API that allows you to access it live in an application.
Using the Import.io App
Tutorial by Becky Larson
Import.io (https://www.import.io/) is a great tool for extracting data from
a website. The Web version of import.io is very powerful: simply enter a
URL and let it find the data. But if you have more advanced requirements,
like scraping data from more than one page throughout a site, you can
use the Import.io desktop application.
1. Download the app from the website for your platform -
https://www.import.io/download/download-info/
2. Open the import.io desktop app
3. Click New
a. Choose which option you want: the regular "magic" setting, the
extractor, or the crawler. We'll use the crawler.
4. Navigate to a page - Import.io will open a new tab, the top half of
which will look sort of like a normal browser. It will tell you to open a
page from which you’d like to extract data.
a. Open the page you’d like to go to in a regular browser and copy
and paste the URL into the address bar in the tab import.io has
opened.
i. Which page you use will depend on what you want the
crawler to do – see step 7 below and decide what kind of
data you need before choosing your URL.
5. Click “I’m there” in the lower right corner of the crawler window once
the page has opened.
6. Click “Detect optimal settings” - Import.io will try to detect optimal
settings to display your page, which means it may try to turn off
JavaScript or other advanced functions.
a. If pieces of your page you were hoping to capture or see do not
load, click the option saying so in the lower right crawler window
– import.io will turn those functions back on.
7. Tell import.io whether the page you're on gives data for one thing or for
many.
a. In their example on YouTube, the creator is crawling a clothing
site and is using product pages for specific clothing items to train
the crawler. In this example, the page is giving data on one item
(one piece of clothing).
b. In my example of crawling the SXSW Schedule, I am using
pages for specific panels, so same thing: each page has data for
ONE thing. It might be lots of data, but it’s for one panel or one
item.
i. If I only needed data that was available on the main
schedule page, I could crawl from there and choose the
“many items” option, but I want all the specific data each
panel page gives me.
ii. If you were on a page that looked more like a table or list
(like the main SXSW schedule page) and included data for
many different items, you would choose that option.
Instructions from here will follow the "one item" format.
8. Yay! Start training!
a. Essentially we are teaching the crawler what we want from this
type of page, and eventually it will have learned what we need
and will crawl the entire website to gather data from all pages
matching that description.
9. Click “add column”, give the column a name and tell the crawler what
kind of data this is – this is your first data point.
i. In my SXSW example, my first column will be the panel
name. That’s important data to have right off the bat. This
column is just “text”.
10. Once you have a column created, highlight what piece of data on
the page you want to be associated with that column (in my example,
I would highlight the panel name where it appears on the page) and
click “Train”.
a. Once you hit train, the value from the page will appear under the
column name in the lower left crawler window.
11. Columns with multiple data points
a. Sometimes a column may have multiple pieces of data
associated with it – in the SXSW example, each panel may have
up to four speakers. You could create multiple
columns – speaker 1, speaker 2, etc. – and have some without
data, or you can gather all that information in one column.
b. To have multiple entries, highlight the first piece of data and
train the column, then highlight the second piece and train,
and so on until you have each piece. They should all be similar
types of data (like all the presenter names); don't try to gather
different types in one column.
12. Continue to create columns, highlight the associated data and
train those columns until you have everything you need on the page.
13. At this point, click “I have what I need”
a. Import.io will prompt you to go to another page.
14. Navigate through import.io to another page like the one you just
entered (another panel page or product page from the clothing site
example). Don't just copy and paste another page's URL from your
browser. Letting the crawler follow your navigation helps it
understand how to move through the site to the other pages it needs.
15. Once you’re there, hit “I’m there” and begin the process again.
a. After the first page, much of your data will import automatically.
It's important to check through all your columns to make sure
the right data has been selected. If it hasn't, or if the selection is
blank, simply click on the column, highlight the right data from
the page, and click Train.
16. Keep adding pages until you have a minimum of 5.
17. Click “Done training”
18. You’ll now have the option to upload the crawler to import.io –
go ahead and click to do so.
19. You have some advanced options for running your crawler –
what kinds of URLs you want it to look for, etc. – but for right now just
go ahead and click Run.
a. This will take a while, depending on your Internet connection.
b. Watch as the crawler finds this information on all pages on the
site!
c. When it is finished, you will have the option to download the
data as a CSV or JSON file.
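Once the export finishes, the file is ready for any spreadsheet or code-based
workflow. A minimal sketch for loading it in Python, assuming the pandas
library and a hypothetical export file name:

# Sketch: load an import.io CSV export for further analysis.
# Assumes: pip install pandas
import pandas as pd

# "crawl_results.csv" is a hypothetical name - use your actual export file.
results = pd.read_csv("crawl_results.csv")
print(results.head())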
More Scraping Tools and Resources
There are many other tools that can be used effectively to scrape data from
Web pages. Here are a few additional resources:
• OutWit Hub – a Firefox extension and desktop program that can
provide some advanced scraping capabilities
• ProPublica's Scraping for Journalism: A Guide For Collecting Data –
http://www.propublica.org/nerds/item/doc-dollars-guides-collecting-the-data
• Getting to Grips with ScraperWiki For Those Who Don't Code –
http://datamineruk.wordpress.com/2011/07/21/getting-to-grips-with-scraperwiki-for-those-who-dont-code/
• Web Scraping for Non-Programmers by Michelle Minkoff –
http://michelleminkoff.com/outwit-needlebase-hands-on-lab/