http://serai.utsc.utoronto.ca/rrsi2014
"Unlike traditional academic conferences, the Roots & Routes Summer Institute features a combination of informal presentations, seminar-style discussions of shared materials, hands-on workshops on a variety of digital tools, and small-group project development sessions. The institute welcomes participants from a range of disciplines with an interest in engaging with digital scholarship; technical experience is not a requirement. Graduate students (MA and PhD), postdoctoral fellows and faculty are all encouraged to apply."
2. In this presentation
● Part One: Preparing to create machine-readable
data at the outset of a research endeavour
● Part Two: Working with “messy” datasets
3. Benefits of machine-readable data
● Easier to query for new insights
● Easier to mount in a computing environment
● Easier to share with others
4. Just a .csv + Fusion Tables
● Fusion Tables is an experimental, web-based
Chrome app
● Took a spreadsheet that Natalie has been
working on and loaded it into the app
● Results have not been massaged at all
● We can expect additional benefits from
having structured data in the future
6. Best Case Scenario
Start by adopting a few best practices
4 pieces of low-hanging fruit...
7. 1. No Word documents
● use a database (even a spreadsheet), not .doc files
● avoid a lot of style information in your
research documents (such as bolding and
italicizing text, or moving things to other
areas of the page using the tab key or
spacebar)
● Why?
8. Look beyond the surface.
&nbsp; &nbsp; &nbsp; &nbsp; no thank you!
http://www.bartleby.com/103/33.html
9. Beauty is more than browser deep
http://www.gutenberg.org/ebooks/18827
10. 2. Use consistent formats for
elements such as date & language
● e.g. dates recorded consistently where
possible (05/25/2014)
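For dates that have already been recorded in mixed styles, one option is to normalize them to a single format in a script. The sketch below is only an illustration: the list of input formats, the choice of ISO 8601 (YYYY-MM-DD) as the target, and the function name normalize_date are assumptions, not part of the presentation.

```python
from datetime import datetime

# Hypothetical list of formats found in the data. Order matters for
# ambiguous dates such as 05/06/2014, so list the most likely format first.
KNOWN_FORMATS = ["%m/%d/%Y", "%Y-%m-%d", "%d %B %Y", "%B %d, %Y"]

def normalize_date(raw):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    text = raw.strip()
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue  # try the next format
    return None  # leave unparseable values for manual review

print(normalize_date("05/25/2014"))
print(normalize_date("25 May 2014"))
print(normalize_date("sometime in May"))
```

Returning None rather than guessing keeps uncertain dates visible so they can be checked by hand.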
11. 3. Taxonomies & Standards
● use controlled vocabularies for keywords,
place names, and relevant personal names
o using an open format for a place name can make
geocoding much easier
o stay consistent in a given language
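A controlled vocabulary can be as simple as a lookup table that maps every variant you encounter to one preferred term. The sketch below assumes a small, made-up place-name vocabulary; the entries, the names PLACE_VOCABULARY and preferred_place, and the choice of preferred forms are all hypothetical.

```python
# Hypothetical controlled vocabulary: each variant maps to one preferred term.
PLACE_VOCABULARY = {
    "constantinople": "Istanbul",
    "istanbul": "Istanbul",
    "stamboul": "Istanbul",
    "venice": "Venice",
    "venezia": "Venice",
}

def preferred_place(raw):
    """Return (term, True) for a known place, or (cleaned input, False)."""
    cleaned = " ".join(raw.split())  # trim and collapse stray whitespace
    key = cleaned.casefold()         # compare without regard to case
    if key in PLACE_VOCABULARY:
        return PLACE_VOCABULARY[key], True
    return cleaned, False            # unknown: flag it for review

for value in ["  Venezia", "CONSTANTINOPLE", "Ragusa"]:
    print(preferred_place(value))
```

Flagging unknown values instead of silently accepting them is what keeps the vocabulary controlled as the dataset grows.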
12. 4. Text Encoding
● Ensure you are using Unicode (UTF-8)
● How do you know?
o Notepad can be your friend
o Test a sample between systems
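Another way to check is to try decoding the file's bytes as UTF-8. This is a minimal sketch, assuming you can run Python; the function names are illustrative. Note that a file containing only plain ASCII characters will also pass, since ASCII is a subset of UTF-8.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes can be decoded as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

def file_is_utf8(path):
    """Read a file in binary mode and test its contents."""
    with open(path, "rb") as handle:
        return is_valid_utf8(handle.read())

# The same word saved in two encodings.
as_utf8 = "Café".encode("utf-8")
as_latin1 = "Café".encode("latin-1")
print(is_valid_utf8(as_utf8))
print(is_valid_utf8(as_latin1))
```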
14. Changing the way you
think about your research
process
Draw a picture
15. 1. Think small.
Atomistic information (what is the smallest
meaningful unit of information you are
collecting?)
For example:
● A person’s name, religion, and DOB
● Mention of a location or name
● Repeated occurrence
16. 2. Connect the dots.
What are the relationships between your data
elements?
Useful tool: The Entity Relationship Diagram
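The same idea can be sketched in code: each entity becomes a small record, and each relationship becomes a record that points at two entities by ID. Everything below (the entity types, field names, and sample values) is hypothetical and only meant to show the shape of the data.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Person:
    person_id: int
    name: str

@dataclass(frozen=True)
class Place:
    place_id: int
    name: str

@dataclass(frozen=True)
class Mention:
    """One occurrence of a person mentioned in connection with a place."""
    person_id: int
    place_id: int
    source: str

people = {1: Person(1, "Person A")}
places = {10: Place(10, "Place X"), 11: Place(11, "Place Y")}
mentions = [
    Mention(1, 10, "letter-001"),
    Mention(1, 11, "letter-002"),
    Mention(1, 10, "letter-003"),  # a repeated occurrence is its own record
]

def places_for(person_id):
    """List the distinct places linked to one person."""
    return sorted({places[m.place_id].name
                   for m in mentions if m.person_id == person_id})

print(places_for(1))
```

Because names are stored once and referenced by ID, correcting a spelling later means changing a single record.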
21. Tools for dealing with messy data
● Regular Expressions
● OpenRefine
22. Regular Expressions: Find &
Replace on Steroids
● Available in most productivity suites (iWork,
Microsoft Word, LibreOffice/OpenOffice)
● The syntax often differs slightly across
systems
23. “The regular expression
(?<=\.) {2,}(?=[A-Z]) matches at least two
spaces occurring after period (.) and before an
upper case letter as highlighted in the text
above.”
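The pattern quoted on this slide can be tried out in Python's re module. This sketch assumes the period in the lookbehind is escaped as \. so that it matches a literal period rather than any character; the function name and sample text are illustrative.

```python
import re

# Two or more spaces that follow a literal period and precede an
# upper-case letter. The lookbehind and lookahead only check context,
# so the substitution replaces nothing but the spaces themselves.
DOUBLE_SPACE = re.compile(r"(?<=\.) {2,}(?=[A-Z])")

def collapse_sentence_spacing(text):
    """Replace runs of spaces between sentences with a single space."""
    return DOUBLE_SPACE.sub(" ", text)

sample = "First sentence.  Second sentence.   Third one.  not a sentence."
print(collapse_sentence_spacing(sample))
```

The last gap in the sample is left alone because the next character is lower case, which shows how the lookahead narrows the match.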
26. OpenRefine
● Similar to spreadsheet
software
● Installed on your computer,
but used through your
browser
● “Power Tool” for messy data
The following draws heavily on this lesson:
http://programminghistorian.org/lessons/cleaning-data-with-openrefine
(Thanks to Seth van Hooland, Ruben Verborgh, and Max De Wilde)
27. Base Assumption of OpenRefine
● You have “structured data”
● some consistent and machine-readable
logic has been applied to your data
o Excel, .csv, XML
● you may have structured data and not
know it
o Check export options from any software you
regularly use
28. 1. Remove duplicates
2. Remove blanks
3. Make data atomistic (smallest meaningful
unit)
4. Keep terms/formats consistent
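The four steps can also be sketched in plain Python, which may help clarify what each one does. The column names ("name", "categories") and the "|" separator inside the categories cell are assumptions made for this example, as is the function name clean_rows.

```python
def clean_rows(rows):
    """Apply the four cleaning steps to a list of row dictionaries."""
    cleaned = []
    seen = set()
    for row in rows:
        name = " ".join(row.get("name", "").split())
        # Step 2: remove blanks (rows with no name are skipped).
        if not name:
            continue
        # Step 3: make data atomistic by splitting the multi-valued cell.
        # Step 4: keep terms consistent (trimmed, lower case).
        raw_terms = row.get("categories", "").split("|")
        terms = sorted({t.strip().lower() for t in raw_terms if t.strip()})
        # Step 1: remove duplicates, compared after normalization.
        record = (name, tuple(terms))
        if record in seen:
            continue
        seen.add(record)
        cleaned.append({"name": name, "categories": terms})
    return cleaned

rows = [
    {"name": "Vase ", "categories": "Glass|glass |Ceramics"},
    {"name": "Vase", "categories": "ceramics|glass"},
    {"name": "", "categories": ""},
    {"name": "Coin", "categories": "Numismatics"},
]
for record in clean_rows(rows):
    print(record)
```

Duplicates are checked after the other steps so that rows differing only in spacing or capitalization are recognized as the same record.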
46. Creating a text facet on “Categories” brings
up all the options in this column.
We can “cluster” to detect similar terms that
might have variations in spelling or capitalization
4. Make terms consistent
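To give a feel for how clustering can detect near-identical terms, here is a simplified sketch loosely modeled on the "fingerprint" key-collision approach. It is not OpenRefine's own code: the normalization steps, function names, and sample values are assumptions for illustration.

```python
import string
import unicodedata
from collections import defaultdict

def fingerprint(value):
    """Reduce a term to a key that ignores case, accents, punctuation,
    and word order."""
    # Fold accented characters to their closest ASCII form.
    text = unicodedata.normalize("NFKD", value)
    text = text.encode("ascii", "ignore").decode("ascii")
    text = text.strip().lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = sorted(set(text.split()))  # unique words in a stable order
    return " ".join(tokens)

def cluster(values):
    """Group values that share a fingerprint; keep groups with variants."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return [sorted(group) for group in groups.values() if len(group) > 1]

terms = ["Glass plates", "glass plates", "Plates, Glass", "Ceramics", "Coins"]
for group in cluster(terms):
    print(group)
```

A key like this only catches variants that differ in case, punctuation, accents, or word order. Genuine misspellings need a fuzzier comparison, and a person should still review each cluster before merging.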
52. Finally, projects can be exported as Refine
projects, but also in a number of additional
structured formats.
Do this frequently.
53. Structured data is beautiful data. Make a plan
to create structured data during your research.
Clean legacy data, or data you inherit, by
becoming a regular expression (regex) expert
and/or using a tool like OpenRefine.
54. Go to your library or ITS department to see if you can get
support. Thanks for listening to me!
Editor's notes
Take Dragomans File and load it into
padani example
Difficult to install on Windows?
Here I have launched OpenRefine in my browser. The sample file I’m using is located at the url on the slide. Remember that a longer version of this tutorial is available at http://programminghistorian.org/lessons/cleaning-data-with-openrefine