1. Data visualization in the
newsroom
{
“presented by”: “carl v. lewis”,
“for”: “the florida times-union”,
“slides”: “bit.ly/NIXkOD”,
“email”:“carl@carlvlewis.net”
}
2. What is data visualization?
•Data itself is the story; standalone narrative.
•Interactive, communicative, visual.
•Ranges from simple (charts) to complex
(database-driven applications).
•Both a technique and a format.
•Both entertaining and factual.
• See: “The Many Words for Visualization”
3. The history of data journalism
•Grew out of CAR
(computer-assisted reporting)
tradition
•John Snow’s 1854 cholera
map
•Has coincided with the era
of “Big Data”
4. On the emergence of the field of
data journalism:
•"When information was scarce, most of our efforts
were devoted to hunting and gathering. Now that
information is abundant, processing is more
important." –Philip Meyer, UNC Chapel Hill
5. On the growing importance of
data-driven journalism:
•“Journalists need to be data-savvy . . . Data-driven
journalism is the future.” –Sir Tim Berners-Lee
•“The explosion of Web-based tools and ways of
sifting through and sharing data has created
something approaching a revolution, and the
potential benefits for journalism are only just
beginning to reveal themselves.” –Mathew Ingram
6. What data journalism is not:
• Simply incorporating public data into your
textual narrative
• Infographics
• Illustration
• Resource-intensive
• Just about numbers and programming
• Just about making data flashy
7. What data journalism is:
• Visual
• Often evergreen
• Transparent – direct access to primary
source
• Credible
• Engaging
• A good business model
9. Democratization of data
journalism
• Free and open-source
tools (Google Drive,
JavaScript libraries, etc.).
• Open Data laws.
• “Anyone can do it. Data
journalism is the new punk.”
-Simon Rogers, The
Guardian
10. The job of the data journalist
• Part statistician, part journalist, part
programmer.
• “We're statisticians. We don't program.”
• “We’re programmers. We don’t report.”
• “We’re journalists. We don’t code.”
11. Notable examples of data visualization
• “Mapping America: Every City, Every Block,”
NYTimes.com.
• “Where Does My Money Go?”, Open Knowledge
Foundation.
• “Illinois school report cards,” Chicago Tribune
• “We Feel Fine,” Jonathan Harris
• “Top Secret America,” The Washington Post
14. When to use data visualization:
• Showing change over time
• Comparing discrete values
• Showing connections and flows
• Showing hierarchy
• Browsing large databases
15. When not to use data
visualization:
• When text or multimedia tells the story better
• When you have very few data points
• When there is no statistical significance
• When a map is not a map
• When a table would do
16. Process of data journalism
1. Research – Think of topic and research
factors.
2. Find the data – Locate and retrieve relevant
public data
3. Analysis and evaluation – Crunch numbers,
look for trends or inconsistencies
4. Visualize – Display the data in appropriate
manner
18. Research
1. Think of a topic – what factors influence it?
2. What public data might shed light on those
factors?
3. Seek out the data
19. Locating public data
• Thousands of public “data dumps” by
government bodies and nonprofits.
• Most commonly in delimited spreadsheet
format (look for .csv, .xls), sometimes in
XML and JSON.
• For geographic data, look for .kml or .shp
• Can be found directly at source or by
search engine keyword
20. Search tips for data retrieval
• If you don’t know which source to
look to find your data, an initial Web
search might help.
• After your keywords, type
“filetype:XLS”, “filetype:CSV”, or
whatever the extension is of the
data you’re seeking, and you’ll see
only files of that type from across
the Web.
• If you get no results, try broadening
your search term to locate sources
that cover the general discipline (e.g.,
instead of “malaria deaths,” try
“public health data”)
21. Locating public data
• Federal sources: Data.gov,
Census.gov, OpenSecrets.org,
FollowTheMoney.org, USA.gov,
USGovXML.com (full federal list by
topic/agency here).
• Data catalogs such as
thedatahub.org, datamarket.com,
infochimps.org, datacatalogs.org are
good places to find non-
22. Florida public data sources
• Florida’s “Sunshine” law requires all state agencies
to provide open access to public records, including
data.
• Chapter 119 of the Florida Statutes mandates
that “any records made or received by any public
agency in the course of its official business are
available for inspection, unless specifically exempted by
the Florida Legislature.”
23. Florida public data sources
• Dozens of useful open data sources
maintained by Florida government
agencies, including
TransparencyFlorida.gov,
FloridaHasARightToKnow.com and
MyFlorida.gov
• Full list of state-maintained databases
by topic here.
• A few state-maintained databases
worth mentioning: the Division of
Elections’ campaign finance data, the
DOE’s test score reports and the
Department of Law Enforcement’s
arrest and officer reports.
24. Florida public data sources
• A number of advocacy groups also maintain useful,
downloadable statewide databases:
• FloridaOpenGov.org, which focuses on public employee
payroll data.
• FloridaRedistricting.org, which provides demographic
data (.csv) and geographic polygons (.shp) for new
district boundaries.
• Florida Housing Data Clearinghouse, which provides
regularly updated property values, housing data (.xls).
(for even more, see my semi-exhaustive list with descriptions here).
25. Georgia public data sources
• Although Georgia has no law
requiring all government agencies
to make public data accessible
online, many do anyway.
• In 2008, the Transparency in
Government Act expanded the
public data site,
Open.Georgia.gov, to include all
three branches of government,
regional education service
agencies, local boards of
education, and transactions made
by the General Assembly.
26. Georgia public data sources
• A comprehensive list of downloadable databases from
state agencies in Georgia can be found here.
• The State Ethics Commission has made all campaign
finance reports, lobbyist reports and campaign
contributions available in downloadable spreadsheets.
• OASIS provides a set of web-based tools to browse the
Georgia Department of Public Health’s Data Warehouse,
and download the data yourself if you wish.
27. Locating geographic data
• Most geographic data available
as TIGER/Line Shapefile
packages (archives
containing .shp, .dbf, .prj, .xml,
.shx) from U.S. Census Bureau.
• Google also hosts a directory
of .kml files for most geographic
boundaries here.
• Alternatively, Florida and
Georgia GIS data can be found
at FGDL.org, Geoplan and
Data.GeorgiaSpatial.org.
28. What to look for
• Most numeric spreadsheet data comes either as a comma-separated values
(.csv) or Microsoft Excel (.xls) file. Example of a .csv header row:
"Name","Date","Address","Zip","State","Country"
• XML (eXtensible Markup Language) stores data hierarchically for the
Web, and is good for building news applications because of its broad
interoperability.
<menu id="file" value="File">
  <popup>
    <menuitem value="New" onclick="CreateNewDoc()" />
    <menuitem value="Open" onclick="OpenDoc()" />
    <menuitem value="Close" onclick="CloseDoc()" />
  </popup>
</menu>
• JSON (JavaScript Object Notation) – Similar to XML in structure, but uses
“lighter” punctuation based on JavaScript conventions. May eventually
replace XML as standard.
{"menu": {
  "id": "file",
  "value": "File",
  "popup": {
    "menuitem": [
      {"value": "New", "onclick": "CreateNewDoc()"},
      {"value": "Open", "onclick": "OpenDoc()"},
      {"value": "Close", "onclick": "CloseDoc()"}
    ]
  }
}}
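A minimal sketch of how the three formats above can be read with Python's standard library. The sample strings are invented for illustration (a shortened version of the menu example plus a made-up CSV row); real files would be opened from disk or downloaded instead.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Hypothetical sample data, invented for illustration only.
csv_text = '"Name","Zip","State"\n"City Hall","32202","FL"\n'
xml_text = (
    '<menu id="file" value="File"><popup>'
    '<menuitem value="New" onclick="CreateNewDoc()" />'
    '<menuitem value="Open" onclick="OpenDoc()" />'
    '</popup></menu>'
)
json_text = (
    '{"menu": {"id": "file", "popup": {"menuitem": '
    '[{"value": "New"}, {"value": "Open"}]}}}'
)

# CSV: each row becomes a dict keyed by the header row.
csv_rows = list(csv.DictReader(io.StringIO(csv_text)))

# XML: walk the tree and collect one attribute from every <menuitem>.
root = ET.fromstring(xml_text)
xml_values = [item.get("value") for item in root.iter("menuitem")]

# JSON: the text maps directly onto nested dicts and lists.
data = json.loads(json_text)
json_values = [item["value"] for item in data["menu"]["popup"]["menuitem"]]

print(csv_rows[0]["Zip"], xml_values, json_values)
```

Note that CSV values arrive as text (the zip code is the string "32202"), so numeric columns need converting before any calculation.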
29. Scraping other sources
• Scrape data from an HTML table with a
simple Google Spreadsheets formula (the last argument is
the table's position on the page, counting from 1):
=ImportHtml("http://the-url-goes-here", "table", 1)
• For a database of HTML tables, try
Haystax.
• For PDFs, try CometDocs.
• Scrape webpages by running or creating
a Python script at ScraperWiki.
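For a sense of what a table-scraping script involves, here is a rough sketch using only Python's built-in html.parser module. The HTML snippet and its figures are made up for illustration, and the TableParser class is a hypothetical helper, not part of ScraperWiki or any other tool named above. A real script would first download the page.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows
        self._row = None    # row being built, or None outside a <tr>
        self._cell = None   # text pieces of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical page content, invented for illustration.
html = """
<table>
  <tr><th>County</th><th>Turnout</th></tr>
  <tr><td>County A</td><td>61%</td></tr>
  <tr><td>County B</td><td>58%</td></tr>
</table>
"""

parser = TableParser()
parser.feed(html)
print(parser.rows)
```

The result is a list of rows, header first, which can be written straight to a .csv file.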
30. APIs for data retrieval
• APIs (application programming interfaces) are how many
websites and services share content with one another.
• Allow a computer system to fetch, interpret and use data
created on another system, even if it uses a different
programming language or structure.
• Examples: Twitter Search API, Google Maps API, NYTimes
Campaign Finance API.
• Usually returns data as XML, JSON or .txt
• Often requires use of an API key.
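A sketch of the typical request-and-parse cycle, under loud assumptions: the endpoint, parameter names, key and response body below are all placeholders invented for illustration and do not describe any real service's interface. Each API's own documentation gives the actual URL pattern and fields.

```python
import json
from urllib.parse import urlencode

# Everything here is hypothetical: the endpoint, the parameter names and
# the key are placeholders, not the interface of any real service.
BASE_URL = "https://api.example.com/v1/contributions"
API_KEY = "YOUR-API-KEY"


def build_request_url(candidate, year):
    """Return the full URL for one query, with the key as a parameter."""
    params = {"candidate": candidate, "year": year, "api-key": API_KEY}
    return BASE_URL + "?" + urlencode(params)


def total_amount(response_text):
    """Sum the 'amount' field across the records in a JSON response."""
    payload = json.loads(response_text)
    return sum(record["amount"] for record in payload["results"])


url = build_request_url("Jane Doe", 2012)

# A made-up response body standing in for what a server might send back.
sample_response = (
    '{"results": [{"donor": "A", "amount": 250.0}, '
    '{"donor": "B", "amount": 100.0}]}'
)

print(url)
print(total_amount(sample_response))
```

In a live script the URL would be fetched (for example with urllib.request.urlopen) and the returned text passed to total_amount.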
32. Manipulating datasets
• Data rarely ready for analysis and visualization out-of-the-
box (hence “raw data”).
• Spreadsheet applications most common and easiest way to
work with data (Excel, Google Spreadsheets).
• Allow for complex calculations, formulas, sorting.
• Compatible with a variety of file formats
(.xls, .ods, .csv, .txt, .tsv).
• Scripts may also be written to automate bulk manipulation
(Python).
• R Project (r-project.org)
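As an illustration of scripted bulk manipulation, this sketch tidies a small raw export with Python's csv module. The records are invented, and clean_row is a hypothetical helper; the cleaning rules (collapse whitespace, fix capitalization, strip currency symbols) are assumptions about what a messy file might need.

```python
import csv
import io

# Hypothetical raw export, invented for illustration: stray whitespace,
# inconsistent capitalization, and numbers stored as text with symbols.
raw = io.StringIO(
    'name,salary\n'
    '  john   SMITH ,"$52,000"\n'
    'jane doe,"$48,500"\n'
)


def clean_row(row):
    """Normalize one record: tidy the name, turn the salary into a number."""
    name = " ".join(row["name"].split()).title()
    salary = float(row["salary"].replace("$", "").replace(",", ""))
    return {"name": name, "salary": salary}


cleaned = [clean_row(row) for row in csv.DictReader(raw)]

# Write the cleaned rows back out as CSV text.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["name", "salary"])
writer.writeheader()
writer.writerows(cleaned)
print(out.getvalue())
```

The same loop works unchanged on a file with thousands of rows, which is where a script starts to beat hand-editing in a spreadsheet.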
33. Data analysis
• To figure out what your data
says, you’ll need to crunch the
numbers.
• Statistical significance is litmus
test.
• Skewed or normal distribution?
Why?
• Outliers? If so, error or
unexplained factor?
34. Benchmarks for analysis
• Mean (μ) simplest to calculate, but
easily distorted by
outliers.
• Median usually a better metric in
determining conclusion, especially
with skewed distribution.
• If mean ≈ median, distribution is roughly symmetric (little or no skew).
• Standard deviation (σ) measures
how spread out the values are around the mean.
• Z-score = how many standard
deviations a value is away from the
mean and, thus, its likelihood of
being an outlier.
[Figure: formulas for standard deviation, mean and z-score]
35. Calculating values in Excel
• Mean: =AVERAGE(A1:A27)
• Median: =MEDIAN(A1:A27)
• Standard deviation: =STDEV(A1:A27)
• Z-score of a given value: Subtract mean of dataset from
value. Divide result by the standard deviation, e.g.
=(A1-AVERAGE(A$1:A$27))/STDEV(A$1:A$27)
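The same benchmarks can be computed in a script. This sketch uses Python's statistics module on a made-up list of values with one extreme entry, to show how an outlier pulls the mean away from the median. The z_score helper is a hypothetical name.

```python
import statistics

# Hypothetical values, invented for illustration; note the one extreme value.
values = [2, 4, 4, 4, 5, 5, 7, 9, 50]

mean = statistics.mean(values)
median = statistics.median(values)
# Sample standard deviation (divides by n - 1), like Excel's STDEV.
stdev = statistics.stdev(values)


def z_score(value):
    """How many standard deviations a value sits from the mean."""
    return (value - mean) / stdev


print("mean:", mean)
print("median:", median)
print("standard deviation:", round(stdev, 2))
print("z-score of 50:", round(z_score(50), 2))
```

Here the mean is twice the median, a sign of a skewed distribution, and the value 50 sits well over two standard deviations above the mean.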
36. Other commonly used Excel
formulas
• CONCATENATE to merge multiple columns.
• MID to split columns.
• Percent change to display relative change over time
=(new_value-original_value)/ABS(original_value)
• See this guide to helpful Excel tricks for data
journalists, compiled by Mary-Jo Webster of St. Paul
Pioneer Press:
https://docs.google.com/file/d/0ByLyArAQRhaBNDc3NjJjYTUtY2U0Yi00NmIwLThkNTgtYzNlYThmNGE1ZTEz/edit
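For comparison, rough Python equivalents of the three formulas above. The function names and sample values are hypothetical; mid mimics Excel's MID(text, start, length), which counts characters from 1.

```python
def percent_change(original_value, new_value):
    """Relative change from original to new, as a fraction (0.25 means 25%)."""
    if original_value == 0:
        raise ValueError("percent change is undefined when the original value is 0")
    return (new_value - original_value) / abs(original_value)


def concatenate(*parts):
    """Merge several column values into one string, like CONCATENATE."""
    return "".join(str(part) for part in parts)


def mid(text, start, length):
    """Extract part of a string, like MID (start is 1-based)."""
    return text[start - 1:start - 1 + length]


# Hypothetical values, invented for illustration.
print(percent_change(80, 100))              # rose from 80 to 100
print(concatenate("Jane", " ", "Doe"))      # two name columns merged
print(mid("904-555-0100", 1, 3))            # first three characters
```

Dividing by the absolute value of the original keeps the sign of the result meaningful when the starting figure is negative, such as a budget deficit.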
37. Refining and cleaning data
• Sometimes Excel and Google
Spreadsheets aren’t enough, especially
when working with large datasets.
• Google Refine – free tool that lets you
explore, power sort and process data.
• Useful for finding and fixing errors
and inconsistencies; a “power tool for
working with messy data.”
• Facets to sort data
• Cleaning with clusters
• Shan Carter’s Mr. Data Converter to
convert spreadsheets to more web-
friendly format.
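To show the idea behind cleaning with clusters, here is a sketch of key-collision grouping: each value is reduced to a normalized "fingerprint", and values that share a fingerprint are likely variants of the same entry. This is a simplified illustration written for this deck, not Google Refine's actual code, and the agency names are invented.

```python
import string
from collections import defaultdict


def fingerprint(value):
    """Build a key that ignores case, punctuation, spacing and word order."""
    cleaned = value.strip().lower()
    cleaned = cleaned.translate(str.maketrans("", "", string.punctuation))
    tokens = sorted(set(cleaned.split()))
    return " ".join(tokens)


def cluster(values):
    """Group values whose fingerprints collide; keep groups with variants."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return [sorted(group) for group in groups.values() if len(group) > 1]


# Hypothetical messy column, invented for illustration.
names = [
    "Jacksonville Sheriff's Office",
    "jacksonville sheriffs office",
    "Office, Jacksonville Sheriff's",
    "Duval County Schools",
]

clusters = cluster(names)
print(clusters)
```

A person still has to review each cluster and pick the correct spelling; the script only proposes candidates.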
38. Other data analysis tips and tricks
• Put field names in first row.
• Put geographic data in first columns
• When you have two different datasets, a good tool to
merge them is Google Fusion Tables (make sure they
share a common attribute).
• Never round until the end of calculations. Round to
two decimal places for visualization purposes.
• Cut and paste calculations into a new column as values
only.
• Know the principal data types (integer, real, string,
boolean), and make sure numeric data is classified as
either integer (whole numbers only) or real (any
value).
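A sketch of two of the tips above: merging datasets on a shared attribute, and rounding only at the end. The merge_on helper and both datasets are hypothetical and invented for illustration; Fusion Tables performs the equivalent merge through its interface.

```python
# Hypothetical datasets, invented for illustration, sharing a "county" column.
population = [
    {"county": "County A", "population": 200000},
    {"county": "County B", "population": 50000},
    {"county": "County C", "population": 120000},
]
arrests = [
    {"county": "County A", "arrests": 3100},
    {"county": "County B", "arrests": 640},
    {"county": "County D", "arrests": 75},
]


def merge_on(key, left, right):
    """Join two lists of dicts, keeping only rows whose key is in both."""
    right_by_key = {row[key]: row for row in right}
    merged = []
    for row in left:
        match = right_by_key.get(row[key])
        if match is not None:
            merged.append({**row, **match})
    return merged


merged = merge_on("county", population, arrests)

# Calculate at full precision; round only when preparing the display value.
rates = {}
for row in merged:
    per_thousand = row["arrests"] / row["population"] * 1000
    rates[row["county"]] = round(per_thousand, 2)

print(rates)
```

Counties present in only one dataset drop out of the merge, so it is worth checking which rows failed to match before publishing.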
40. Planning your visualization
• Identify your key message
• Choose the best data series to illustrate your point
• Consider the number of points in the data
• Think about complementary/supporting datasets you can
incorporate, e.g. sanitation with poverty.
• Plan for user interaction, e.g. visual feedback.
• Transform the raw data where it sharpens your
point, e.g. percent change instead of absolute values
• Brainstorm potential technologies
• Consult experts on topic to back up your interpretation
of data
42. Choosing the right type of
visualization
• Change of single variable over time: line chart.
• Comparison of single variable among multiple classes: bar chart.
• Two variables: scatter plot, bubble chart.
• Hierarchical data: treemap, bubbletree.
• Area charts for area only
• Makeup of whole: pie chart.
• Distribution: histograms, box-and-whisker plots.
• Geographic data (point, polygon, choropleth and symbol maps).
• Records: searchable database.
• Chronological data: timeline, sparklines.
• Other possibilities: matrices, heatmap, games, slopegraphs, stepper graphics,
43. Visualization design principles
• Typography: clear, consistent, not
distracting.
• Use bold, mix of serif/sans-serif to
provide emphasis.
• Don’t set type at an angle
• Color: Let color correspond to
variable, design for accessibility, choose
from same side of color wheel,
consider cultural associations but avoid
thematic palettes. Use Adobe Kuler or
0to255.com
• Visual overload, emotional design,
skeuomorphism.
No white type on
black background
No angled type
44. Visualization design principles
• Some guidelines for graphical integrity,
according to Edward Tufte in The Visual
Display of Quantitative Information:
1. Representation of numbers should
be directly proportional to
numerical quantities represented.
2. Clear, detailed labeling throughout.
3. Show data variation, not design
variation.
4. Avoid excessive and unnecessary
use of graphical effects.
What Edward Tufte calls “the worst
visualization ever published.”
45. Visualization design principles
• Design for the eye
• User should be able to
discern key message
visually.
• Design for interaction
• Highlighting and details on
demand (example)
• User-driven content
selection (example)
50. “Four Ways to Slice
Obama’s Budget Proposal”
• From NYTimes.com:
http://www.nytimes.com/interactive/2012/02/13/us/politics/2013-budget-proposal-graphic.html
• What makes this visualization
effective? How does it approach
color, complexity, interactivity
and typography? How does it
avoid visual overload?
51. Wireframing/
prototyping
• Follow a structured grid system
(e.g., 12-column, 960px grid –
see 960.gs and Subtraction).
• Very selectively, you can
break the grid to emphasize
a certain visual element.
• Sketch out/prototype your
wireframe on paper first (print
templates such as this)
52. Selecting tools/technologies
• A wealth of free, open-source
data visualization tools and
libraries exist to shorten
development times
• Examples: Google
Visualization API, Google
Fusion Tables,
Highcharts.js, CartoDB,
d3.js, Tableau Public.
• For everything else, HTML5 +
CSS + JavaScript
54. Web app anatomy
Three components of a Web app:
1. HTML (structure)
2. CSS (styles)
3. JavaScript (interactivity)
55. Parts of an HTML file
An HTML file is made up of:
1. Doctype declaration
2. Head <head>
3. CSS/JavaScript references
4. Title <title>
5. Body <body>
6. A Div container
7. Divs (IDs and classes)
56. Parts of a CSS file
A CSS file is made up of:
1. Container ID
2. Default paragraph (p) style
3. Default H1,H2, etc. styles
4. Default body style
5. Styles for all divs
58. Maps 101
• Interactive maps combine
geocoded data – points or
polygons – along with metadata
and/or numeric data.
• KML (Keyhole Markup Language) is
quickly becoming a popular file
format, but Shapefile (shp.zip) is
still the most widely available.
• Geographic data can either be
geocoded, downloaded from the
Web, or custom-drawn.
• Good purveyor of news maps:
The Texas Tribune.
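Because KML is plain XML, a point layer can be generated from a list of coordinates with a short script. This is a minimal sketch using Python's standard library; the place names and coordinates are invented, and build_kml is a hypothetical helper. One detail worth knowing: KML writes longitude before latitude.

```python
import xml.etree.ElementTree as ET

KML_NS = "http://www.opengis.net/kml/2.2"

# Hypothetical points, invented for illustration: (name, latitude, longitude).
points = [
    ("Polling place 1", 30.3322, -81.6557),
    ("Polling place 2", 30.25, -81.5),
]


def build_kml(points):
    """Return a KML document with one Placemark per point."""
    # Write the KML namespace as the default one, without a prefix.
    ET.register_namespace("", KML_NS)
    kml = ET.Element(f"{{{KML_NS}}}kml")
    document = ET.SubElement(kml, f"{{{KML_NS}}}Document")
    for name, lat, lon in points:
        placemark = ET.SubElement(document, f"{{{KML_NS}}}Placemark")
        ET.SubElement(placemark, f"{{{KML_NS}}}name").text = name
        point = ET.SubElement(placemark, f"{{{KML_NS}}}Point")
        # KML lists longitude first, then latitude.
        ET.SubElement(point, f"{{{KML_NS}}}coordinates").text = f"{lon},{lat}"
    return ET.tostring(kml, encoding="unicode")


kml_text = build_kml(points)
print(kml_text)
```

Saved with a .kml extension, the output should open in mapping tools that accept KML.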
59. Mapping services and libraries
• Google Fusion Tables – Quick, versatile
and classic maps that integrate seamlessly
with the Google Maps JavaScript API.
• CartoDB – A newer open-source tool
much like Fusion Tables, but with a better
looking out-of-the-box experience.
• Leaflet – An open-source, client-side
mapping library with an API that allows
you to achieve a number of advanced
features. Plays nicely with Fusion Tables
and CartoDB-hosted maps. Part of
CloudMade suite.
60. Handy desktop mapping
software
• QGIS – Free program that supports
almost every conceivable map file
type, and allows you to add or
manipulate vector data, which can
then be exported as a KML
or Shapefile package.
• TileMill – A map creation and
styling software; ideal for those
with little programming
experience. UTF-grid enabled
tilesets only.
61. Primary map types
• Choropleth – Colors
for each geometry
correspond to numeric
values of a given
variable.
• Point – Locations on a
map displayed by
geocoded markers.
• Less frequently:
proportional maps and
geo maps.
Choropleth map of Georgia voter turnout
Point map of Jacksonville polling locations
62. Tips and tricks
• If you have street address data, you
can use BatchGeocode to convert
them to lat-long coordinates.
• For choropleth maps:
• Include no more than five fill
colors or “buckets”
• Don’t define an equidistant color
ramp; use ColorBrewer instead.
• Use MarkerClusterer when there
are too many points for certain
zoom levels.
Using ColorBrewer to define an accurate, accessible color ramp.
Using MarkerClusterer to cluster points at further zoom levels.
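One way to sort values into five choropleth buckets is by quantiles, so each color covers roughly the same number of areas. This sketch is an assumption about method (equal-interval and natural-breaks classing are alternatives the slides do not choose between), the values are invented, and the hex colors are placeholders for a light-to-dark ramp that would be copied from ColorBrewer. It needs Python 3.8 or later for statistics.quantiles.

```python
import bisect
import statistics

# Hypothetical values, one per area, invented for illustration.
values = [12, 15, 18, 22, 25, 31, 38, 44, 52, 67]

# Placeholder light-to-dark ramp; copy the real one from ColorBrewer.
colors = ["#eff3ff", "#bdd7e7", "#6baed6", "#3182bd", "#08519c"]

# Four cut points split the data into five buckets of similar size.
breaks = statistics.quantiles(values, n=5)


def bucket(value):
    """Return the bucket index (0 to 4) that a value falls into."""
    return bisect.bisect_right(breaks, value)


def fill_color(value):
    """Return the fill color for a value's bucket."""
    return colors[bucket(value)]


for value in values:
    print(value, bucket(value), fill_color(value))
```

The cut points in breaks double as the legend boundaries for the finished map.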
63. Tips and tricks
• To convert Shapefiles so they can
be imported into Fusion Tables,
either use Shape to Fusion, or
export it as KML from CartoDB.
• Before using the embed tool in
Fusion Tables or CartoDB, make
sure the map is centered where
you want it.
• Ensure your map is set to
“Public.”
Export a Shapefile as KML in CartoDB.
Making your map public in Fusion Tables
65. Charts
• Basic building block of visualization
• Simple, but also easy to mess up.
• Should always be interactive.
• Should always include data source.
• Should always include a legend.
• Unless necessary, only show labels
on mouseover.
66. Interactive charting tools
• Out-of-the-box: Google Drive
charts, infogr.am.
• More advanced: Google Code
Playground.
• Most agile: Highcharts.js.
• Most extendible: Tableau Public
A combo chart made using Highcharts.js
67. Charting best practices
• Color: Pick palette of no more
than 3-4 colors from same side of
color wheel.
• Increments: Use natural
increments like (0,2,4,6...) instead
of, say, (0,3,6,9...)
• Scale: Don’t plot two unrelated
series with one scale on left and
one on right.
• Style: Flat and simple. No 3D
effects, shadows, narrow bars or
distracting shading.
Don’t plot two different variables on same scale.
Bars too narrow Distracting shading
Misleading 3D effects Pointless shadows
Source: The Wall Street Journal Guide
to Information Graphics, Dona M. Wong.
68. Charting best practices
• Always set the baseline to
zero.
• Always order starting with
greatest value
• Use broken bars sparingly
• No more than five slices on
pie charts; no “donut” pie
charts.
• No more than 3-4 lines on
line chart
Wrong order Right order
Wrong baseline Right baseline
No donut-pies
Source: The Wall Street Journal Guide
to Information Graphics, Dona M. Wong.
70. Utilizing JavaScript/HTML5 libraries
• Together, JavaScript, HTML5
and jQuery have expanded
boundaries of data
visualization
• Abundance of open-source
libraries and packages mean
less programming required to
produce unique, interactive
visualizations.
• Examples: Timeline.js,
Bubbletree.js, Raphael.js,
ProPublica tools
71. The HTML5 revolution
• Adobe Edge for HTML5
development; end of Flash’s
reign
• Platform-agnostic, mobile-
first movement
• Forking resources and
packages off GitHub
72. Pushing the limits
• RaphaelJS for easier
manipulation of serialized
vector graphics
• Other boundary-pushing data
visualization projects:
Processing!, Gephi, d3.js,
IBM’s Many Eyes.
A network map produced using D3.js
73. Helpful resources and communities
• Blogs/Tutorials:
FlowingData.com, Vis4.net, Driven-by-data.net,
Chryswu.com, datavisualization.ch
• Books: The Data Journalism
Handbook, O’Reilly Media. Visualize
This: The FlowingData Guide to Design,
Visualization, and Statistics, Nathan
Yau. The Wall Street Journal Guide to
Information Graphics, Dona M.
Wong.
• Communities: visual.ly, Hacks/
Hackers, NICAR.
Free data journalism handbook
from O’Reilly Media
74. For slides and list of links,
http://bit.ly/NIXkOD
@carlvlewis