Basic Web Scraping
Coding for Communicators – Dr. Cindy Royal, CU-Boulder
Scraping is a process by which you can extract data from an HTML page or
PDF into a CSV or another format, so you can work with it in Excel or another
spreadsheet program and use it in your visualizations. Sometimes you can just copy
and paste data from an HTML table into a spreadsheet, and all will be fine.
Other times, not. In those cases, you need a tool or program that lets
you identify and scrape the data you want. There are numerous ways
to do this, some that involve programming and some that don't.
Scraping with Google Spreadsheet
If copying and pasting directly into a spreadsheet doesn't work, you can try
using Google Spreadsheet functions to scrape the data. Open a new Google
Spreadsheet. We are going to scrape the data from the Texas Music Office
site. The URL http://governor.state.tx.us/music/musicians/talent/talent/
goes to the first page of the directory, listing the bands that start with A.
In the first cell of your spreadsheet, type the function:
=ImportHtml("http://governor.state.tx.us/music/musicians/talent/talent/", "table", 1)
• The first argument is the URL.
• The second argument tells it to look for a table (the other element allowed
here is "list").
• The third argument is the index of the table (if there are multiple tables on
the page).
You may have to look at the HTML to find the table you are trying to get the
data from, or find it through trial and error by changing the third argument.
Give it a couple of seconds and you should see the data directly from the table
in your spreadsheet. Easy!
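If you are comfortable with a little code, the same three arguments map onto a
short Python sketch as well. This is only an illustration, not part of the
spreadsheet workflow; it assumes the third-party pandas and lxml libraries are
installed, and note that pandas counts tables from 0 where ImportHtml counts from 1:

# Sketch: a Python equivalent of =ImportHtml(url, "table", 1).
# Assumes: pip install pandas lxml
import pandas as pd

url = "http://governor.state.tx.us/music/musicians/talent/talent/"
tables = pd.read_html(url)        # parses every <table> on the page into a list
tables[0].to_csv("talent.csv", index=False)  # index 0 = first table; "talent.csv" is a hypothetical output name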
Google Spreadsheets has a few other functions that can be helpful in
scraping data.
• ImportFeed will scrape from an RSS feed. Try:
=ImportFeed("http://news.google.com/news?pz=1&cf=all&ned=us&hl=en&topic=t&output=rss")
Find any RSS feed by looking for the RSS icon. This link pulls current
items from Google News.
• If your data is already in CSV format, you can save a step by bringing it
into your spreadsheet with ImportData. This will also scrape the
data directly from the site, so it will pull from the most recent version
of the file. Try:
=ImportData("http://www.census.gov/popest/data/national/totals/2014/files/NST_EST2014_ALLDATA.csv")
Of course, you can also just open the CSV in the spreadsheet program
and follow the instructions for importing it. (Code-based equivalents of both
functions are sketched below.)
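As promised above, here is a Python sketch of rough equivalents for ImportFeed
and ImportData. It is a sketch only: it assumes the third-party feedparser and
pandas libraries and reuses the URLs from the examples above.

# Sketch: an ImportFeed equivalent - parse an RSS feed into entries.
# Assumes: pip install feedparser pandas
import feedparser
import pandas as pd

feed = feedparser.parse(
    "http://news.google.com/news?pz=1&cf=all&ned=us&hl=en&topic=t&output=rss")
for entry in feed.entries[:5]:
    print(entry.title, entry.link)   # headline and link for the first five items

# Sketch: an ImportData equivalent - read a remote CSV straight into a table.
df = pd.read_csv("http://www.census.gov/popest/data/national/totals/2014/files/NST_EST2014_ALLDATA.csv")
print(df.head())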
Chrome Scraper Extension
https://chrome.google.com/extensions/detail/mbigbapnjcgaffohmbkdlecaccepngjd –
a free extension for Chrome. Select content on a page, then use the context
menu to choose Scrape Similar. You can export the results to a Google Doc.
Download and install the Chrome Scraper extension. Go to this page:
http://en.wikipedia.org/wiki/List_of_Academy_Award-winning_films
It includes a list of Academy Award-winning films in a table. Select the first
row of the table, Ctrl-click, and choose Scrape Similar. You should see the
entire table. You can easily export it to Google Docs with the button.
Notice the XPath description. This is the code the scraper used to find the
table.
//div[4]/table[1]/tbody/tr[td]
It found the first table in the fourth div and extracted elements in the tds.
This method did not capture the links. To get those, try right-clicking just
the first item in the table (unselect the entire row first). This should find
all the links. Take a look at the difference in the XPath code:
//td/i/a
This technique finds all the links within <i> tags in the tds.
You can learn more about XPath syntax at
http://www.w3schools.com/xpath/xpath_syntax.asp.
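If you want to experiment with XPath expressions outside the extension, you can
evaluate them against the same page in Python. A minimal sketch, assuming the
third-party requests and lxml libraries:

# Sketch: evaluate the //td/i/a expression from the example above.
# Assumes: pip install requests lxml
import requests
from lxml import html

page = requests.get(
    "http://en.wikipedia.org/wiki/List_of_Academy_Award-winning_films")
tree = html.fromstring(page.content)

# Every link inside an <i> tag inside a table cell
for link in tree.xpath("//td/i/a"):
    print(link.text_content(), link.get("href"))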
Import.io
Import.io offers Web-based scraping and API creation. This is a great site: you
can input a URL and get the resulting data. You can download the data, or you
can create an API that allows you to access it live in an application.
Using the Import.io App
Tutorial by Becky Larson
Import.io (https://www.import.io/) is a great tool for extracting data from
a website. The Web version of import.io is very powerful: simply enter a
URL and let it find the data. But if you have more advanced requirements,
like scraping data from more than one page throughout a site, you can
use the Import.io desktop application.
1. Download the app from the website for your platform -
https://www.import.io/download/download-info/
2. Open the import.io desktop app
3. Click New
a. Choose which option you want: the regular "magic" setting, the
extractor, or the crawler. We'll use the crawler.
4. Navigate to a page - Import.io will open a new tab, the top half of
which will look sort of like a normal browser. It will tell you to open a
page from which you’d like to extract data.
a. Open the page you’d like to go to in a regular browser and copy
and paste the URL into the address bar in the tab import.io has
opened.
i. Which page you use will depend on what you want the
crawler to do – see step 7 below and decide what kind of
data you need before choosing your URL.
5. Click “I’m there” in the lower right corner of the crawler window once
the page has opened.
6. Click “Detect optimal settings” - Import.io will try to detect optimal
settings to display your page, which means it may try to turn off
JavaScript or other advanced functions.
a. If pieces of your page you were hoping to capture or see do not
load, click the option saying so in the lower right crawler window
– import.io will turn those functions back on.
7. Tell import.io whether the page you're on gives data for one thing or for
many.
a. In their example on YouTube, the creator is crawling a clothing
site and is using product pages for specific clothing items to train
the crawler. In this example, the page is giving data on one item
(one piece of clothing).
b. In my example of crawling the SXSW Schedule, I am using
pages for specific panels, so same thing: each page has data for
ONE thing. It might be lots of data, but it’s for one panel or one
item.
i. If I only needed data that was available on the main
schedule page, I could crawl from there and choose the
“many items” option, but I want all the specific data each
panel page gives me.
ii. If you were on a page that looked more like a table or list
(like the main SXSW schedule page) and included data for
many different items, you would choose that option.
Instructions from here will follow the "one item" format.
8. Yay! Start training!
a. Essentially we are teaching the crawler what we want from this
type of page, and eventually it will have learned what we need
and will crawl the entire website to gather data from all pages
matching that description.
9. Click “add column”, give the column a name and tell the crawler what
kind of data this is – this is your first data point.
i. In my SXSW example, my first column will be the panel
name. That’s important data to have right off the bat. This
column is just “text”.
10. Once you have a column created, highlight what piece of data on
the page you want to be associated with that column (in my example,
I would highlight the panel name where it appears on the page) and
click “Train”.
a. Once you hit train, the value from the page will appear under the
column name in the lower left crawler window.
11. Columns with multiple data points
a. Sometimes a column may have multiple pieces of data
associated with it – in the SXSW example, each panel may have
up to four speakers. You could create multiple
columns – speaker 1, speaker 2, etc. – and have some without
data, or you can gather all that information in one column.
b. To have multiple entries, highlight the first piece of data and
train the column, then highlight the second piece and train,
and so on until you have each piece. They should all be similar
types of data (like all the presenter names); don't try to gather
different types in one column.
12. Continue to create columns, highlight the associated data and
train those columns until you have everything you need on the page.
13. At this point, click “I have what I need”
a. Import.io will prompt you to go to another page.
14. Navigate through import.io to another page like the one you just
entered (another panel page or product page from the clothing site
example). Don't just copy and paste another page's URL from your
browser. Letting the crawler follow your navigation helps it
understand how to move through the site to the other pages it needs.
15. Once you’re there, hit “I’m there” and begin the process again.
a. After the first page, much of your data will import automatically.
It's important to check through all your columns to make sure
the right data has been selected. If it hasn't, or if the selection is
blank, simply click on the column, highlight the right data from
the page, and click Train.
16. Keep adding pages until you have a minimum of 5.
17. Click “Done training”
18. You’ll now have the option to upload the crawler to import.io –
go ahead and click to do so.
19. You have some advanced options for running your crawler –
what kinds of URLs you want it to look for, etc. – but for right now just
go ahead and click Run.
a. This will take a while, depending on your Internet connection.
b. Watch as the crawler finds this information on all pages on the
site!
c. When it is finished, you will have the option to download the
data as a CSV or JSON file.
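Once the export finishes, the file is ready for any spreadsheet or code-based
workflow. A minimal sketch for loading it in Python, assuming the pandas
library and a hypothetical export file name:

# Sketch: load an import.io CSV export for further analysis.
# Assumes: pip install pandas
import pandas as pd

# "crawl_results.csv" is a hypothetical name - use your actual export file.
results = pd.read_csv("crawl_results.csv")
print(results.head())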
More Scraping Tools and Resources
There are many other tools that can be used effectively to scrape data from
Web pages. Here are a few additional resources:
• OutWit Hub – a Firefox extension and desktop program that can
provide some advanced scraping capabilities
• ProPublica's Scraping for Journalism: A Guide For Collecting Data –
http://www.propublica.org/nerds/item/doc-dollars-guides-collecting-the-data
• Getting to Grips with ScraperWiki For Those Who Don't Code –
http://datamineruk.wordpress.com/2011/07/21/getting-to-grips-with-scraperwiki-for-those-who-dont-code/
• Web Scraping for Non-Programmers by Michelle Minkoff –
http://michelleminkoff.com/outwit-needlebase-hands-on-lab/