This document gives an overview of techniques and tools for visualizing big data. It begins with general principles of visualization and then discusses specific tools, including spreadsheets, programming languages like R and Python, and interactive web tools. Geographic data formats and techniques like geocoding are explained, and issues of scale, overlays, text labeling and points of interest are covered. The goal is to provide guidance on effectively visualizing large and complex datasets.
3. 3
Before we start…
1. There are no rules, only suggestions.
2. Sometimes suggestions are contradictory.
3. Be opinionated.
4. Guidelines may vary depending on your intended audience.
6. !6
The Big Picture
Tell a story.
[Figure: five panels, A to E. Axis labels: time; d, distance from PaP (km); percentage traveled further than d; population difference since December 1, 2009; distance traveled in one day, D (km); cumulative distribution.]
Fig. 1. Overview of population movements. (A) Shows the geography of Haiti, with distances from PaP marked. The epicenter of the earthquake is marked by a cross. (B) Gives the proportion of individuals who traveled more than d km between day t − 1 and t. Distances are calculated by comparing the person’s current location with his or her latest observed location. In (C), we graph the change in the number of individuals in the various provinces in Haiti. (D) Gives a cumulative probability distribution of the daily travel distances d for people in PaP at the time of the earthquake. (E) Shows the cumulative probability dis-
http://www.pnas.org/content/early/2012/06/11/1203882109.abstract
“The Predictability of population displacement after the 2010 Haiti earthquake”
The increase in average daily travel distances lasted for two to three weeks after the earthquake. It is worth noting that other periods also saw sudden increases in average daily travel distances. These periods coincided with Christmas and New Year from around December 20 to January 3—just before the earthquake—as well as the Easter holidays (early April). The earthquake did not directly affect large parts of Haiti. In the rest of our analyses, we therefore focus on the population of the heavily affected capital region (PaP). As we show in Fig. 1C, the population movements after the earthquake on January 12, 2010, led to a rapid decrease in the PaP population. Nineteen days after the earthquake (January 31), the net population decrease was an estimated 23% compared to the stable level before Christmas (December 1–20, 2009), assuming the phone move-
phone users in PaP. Increased numbers of people are present in PaP during working days, with corresponding smaller numbers present during weekends (Fig. 1C). This pattern was restored as early as three weeks after the earthquake. To get a detailed view of the daily travel distances, d, we plot for a few different dates the cumulative probability distributions of d for two groups of people: persons present and not present in PaP on the day of the earthquake. The distributions are basically the same for both groups before the earthquake as well as eight months after the earthquake, when social life had stabilized considerably. However, right after the disaster there is a striking deviation in the distribution of travel distances (Fig. 1D), which is not present for people located outside PaP on the day of the earthquake (Fig. 1E). We fitted the curves in panels D and E
7. 7
Scientific vs. Pop Visualization
Scientific
• Must maintain data integrity.
• Quantification more important.
• Be consistent with tradition and expectations.
• Work within publication medium (black and white, non-interactive).
Popular
• Interpretable by viewers of different backgrounds.
• “Smooth” data to show trends without losing people in details.
• Experiment with new formats, styles, designs.
Many principles overlap!
8. !8
Tufte Design Principles
• Maximize data-ink ratio
• Maximize data density
• Avoid “chartjunk”
Prof. Edward Tufte
Statistics, Computer Science, Political Science
Yale University
10. !10
Data Density
Data Density = (# Data Points) / (Sq. Area)
Faded series provide
context and comparison.
Highlight focal point.
http://projects.flowingdata.com/life-expectancy/
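The data density formula above can be sketched in a few lines. This is only an illustration: the function name, the figure size in inches, and the point counts are assumptions made up for the example.

```python
def data_density(n_data_points: int, width: float, height: float) -> float:
    """Data points shown per unit of graphic area."""
    area = width * height
    if area <= 0:
        raise ValueError("graphic area must be positive")
    return n_data_points / area


# Hypothetical comparison: the same 6 x 4 inch figure with different amounts of data.
sparse = data_density(12, 6.0, 4.0)     # a dozen bars
dense = data_density(2400, 6.0, 4.0)    # many faded series plus one highlight
print(sparse, dense)
```

A higher value is not automatically better; fading the context series, as on the slide, is what keeps a dense figure readable.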
11. !11
Data-Ink Ratio
Data-Ink Ratio = (Data-Ink) / (Total-Ink)
Low
http://www.statmethods.net/advgraphs/ggplot2.html
High
The Visual Display of Quantitative Information - Edward Tufte
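As a rough sketch of the ratio above, "ink" can be approximated by counting drawn pixels. The pixel counts and the function name below are hypothetical; how to separate data pixels from non-data pixels is left open and depends on the plotting tool.

```python
def data_ink_ratio(data_ink: float, total_ink: float) -> float:
    """Share of the ink (here: drawn pixels) that encodes data."""
    if total_ink <= 0:
        raise ValueError("total ink must be positive")
    if not 0 <= data_ink <= total_ink:
        raise ValueError("data ink must lie between 0 and total ink")
    return data_ink / total_ink


# Hypothetical pixel counts for two versions of the same chart.
low = data_ink_ratio(data_ink=2_000, total_ink=10_000)   # heavy grid, borders, shading
high = data_ink_ratio(data_ink=2_000, total_ink=2_500)   # same data, decoration removed
print(low, high)
```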
13. 13
Tools: Analog
Don’t assume visualization has to be digital!
http://petapixel.com/2011/05/24/long-exposure-night-photos-of-airplanes-taking-off-and-landing/
22. 22
Tools: Web Interactive
Examples: Tableau, Google Fusion Tables
http://www.tableausoftware.com/
http://www.google.com/drive/apps.html#fusiontables
23. 23
Tools: Programming (Figures)
Examples: R (ggplot2), Matlab, Python (Matplotlib)
Scripting figures programmatically for higher control and reproducibility.
http://is-r.tumblr.com
http://flowingdata.com
24. 24
Tools: Programming (Advanced)
Examples: processing.org
• Built on Java
• Use any Java library
• Relatively fast
• Rapid prototyping
• Active community
• Hard to share
29. !29
Chartjunk (Infographics)
If you have to write every data value on your
chart, re-think your design.
http://junkcharts.typepad.com/junk_charts/2010/04/another-ipad-post.html
30. !30
Chartjunk (Infographics)
Don’t use stick figure people. Just don’t. Please.
http://www.marketplace.org/topics/business/news-brief/us-unemployment-picture-glance-august-2011
37. 37
Visualization Toolkit Models
Data
Scales: Response (categorical), Gender (categorical)
Statistical Transform: Bin
Geometry Mapping: Data Interval
Positioning: Stacked
Coordinates: Euclidean, Polar
Aesthetic Mappings: Color = Response
Slides by Eugene Wu: http://www.mit.edu/~eugenewu/
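The layered model on this slide can be sketched as a small pipeline: a statistical transform (binning into counts), a positioning step (stacking), and an aesthetic mapping (color by response). This is not how any particular toolkit implements it; the rows, function names, and hex colors are hypothetical.

```python
from collections import Counter

# Hypothetical survey rows: (response, gender). Both variables are categorical.
rows = [
    ("yes", "F"), ("yes", "M"), ("no", "F"),
    ("yes", "F"), ("no", "M"), ("no", "M"), ("no", "F"),
]


def bin_counts(rows):
    """Statistical transform: count rows per (gender, response) pair."""
    return Counter((gender, response) for response, gender in rows)


def stack(counts, x_values, fill_values):
    """Positioning: turn counts into stacked intervals (x, fill, y0, y1)."""
    intervals = []
    for x in x_values:
        base = 0
        for fill in fill_values:
            height = counts.get((x, fill), 0)
            intervals.append((x, fill, base, base + height))
            base += height
    return intervals


# Aesthetic mapping: color encodes the response category (arbitrary colors).
palette = {"yes": "#1b9e77", "no": "#d95f02"}

counts = bin_counts(rows)
bars = stack(counts, x_values=["F", "M"], fill_values=["yes", "no"])
for x, fill, y0, y1 in bars:
    print(f"gender={x} response={fill} color={palette[fill]} interval=[{y0}, {y1}]")
```

Swapping the coordinate step from Euclidean to polar would turn the same intervals into wedges, which is the point of keeping the stages separate.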
39. 39
Visualization Toolkit Models
Data ⨝ DOM el
Array Utilities
Formatting
Shapes
Layout
Data Utilities
Color
Scales
Interaction
http://bost.ocks.org/mike/join/
Slides by Eugene Wu: http://www.mit.edu/~eugenewu/
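The "Data ⨝ DOM el" idea, joining data to existing elements by key, can be illustrated outside the browser. The sketch below is plain Python, not D3's API: the function name, the dictionary of stand-in elements, and the data records are all hypothetical.

```python
def data_join(elements, data, key):
    """Match data to existing elements by key.

    elements: dict mapping key -> existing element
    data:     list of data records
    key:      function returning a record's key
    Returns (enter, update, exit): records with no element yet,
    (element, record) pairs that match, and elements with no record.
    """
    incoming = {key(d): d for d in data}
    enter = [d for k, d in incoming.items() if k not in elements]
    update = [(elements[k], d) for k, d in incoming.items() if k in elements]
    exit_ = [el for k, el in elements.items() if k not in incoming]
    return enter, update, exit_


# Stand-ins for elements already on the page, keyed by id.
elements = {"a": "<circle a>", "b": "<circle b>"}
data = [{"id": "b", "r": 5}, {"id": "c", "r": 9}]

enter, update, exit_ = data_join(elements, data, key=lambda d: d["id"])
print(enter)   # records that need a new element
print(update)  # existing elements to restyle
print(exit_)   # elements to remove
```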
41. !41
Retinal Variables (Encodings)
How many encodings?
• Size
• Color
• Position
http://www.nytimes.com/interactive/2012/06/11/sports/basketball/nba-shot-analysis.html
43. !43
Basic Charts
What do you want to show?
A trend, a distribution, a
relationship?
Choose a chart that tells your
story.
http://labs.juiceanalytics.com/chartchooser.html
50. !50
Color
What type of relationship do you want to show?
Qualitative or quantitative? Do they blend?
ColorBrewer for maps and more!
Brewer, Cynthia A., 2013. http://www.ColorBrewer.org
56. !56
Geocoding
Coordinate systems: Map Projection
Some terminology:
• UTM - Universal Transverse Mercator (cartesian)
• UPS - Universal Polar Stereographic (cartesian, polar regions)
• Datum - Reference point (origin)
Standards:
• World Geodetic System (WGS84)
• North American Datum (NAD83)
• UTM Zones
• State Plane Coordinate System (SPCS)
** Be sure all data are using the same projection! **
http://en.wikipedia.org/wiki/Geodetic_system
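Two of the points above lend themselves to a short sketch: finding the UTM zone for a longitude, and checking that every layer declares the same coordinate reference system before combining them. The layer dictionaries and function names are hypothetical, and the zone formula ignores the special zones around Norway and Svalbard.

```python
import math


def utm_zone(lon_deg: float) -> int:
    """UTM longitudinal zone (1-60) for a longitude in degrees.

    Simplified: the irregular zones near Norway and Svalbard are ignored.
    """
    if not -180.0 <= lon_deg <= 180.0:
        raise ValueError("longitude must be within [-180, 180] degrees")
    return min(int(math.floor((lon_deg + 180.0) / 6.0)) + 1, 60)


def require_same_crs(layers):
    """Raise if the layers do not all declare the same CRS; else return it."""
    crs_values = {layer["crs"] for layer in layers}
    if len(crs_values) != 1:
        raise ValueError(f"mixed coordinate reference systems: {sorted(crs_values)}")
    return crs_values.pop()


# Hypothetical layers, both tagged as WGS84 latitude/longitude.
layers = [
    {"name": "tweets", "crs": "EPSG:4326"},
    {"name": "roads", "crs": "EPSG:4326"},
]
shared_crs = require_same_crs(layers)
boston_zone = utm_zone(-71.06)
print(shared_crs, boston_zone)
```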
59. !59
Geocoding
Tips
• Look for locations accumulating more points than expected
• Know where your software defaults at higher spatial levels (city centers, state centers, etc.)
• Clean common typos before geocoding addresses
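The first tip can be checked mechanically: count how many geocoded points land on exactly the same rounded coordinate and flag locations that hold an unusually large share. The threshold, rounding precision, and sample points below are assumptions chosen for illustration.

```python
from collections import Counter


def suspicious_locations(points, precision=4, max_share=0.05):
    """Return rounded (lat, lon) pairs holding more than max_share of all points."""
    if not points:
        return {}
    counts = Counter(
        (round(lat, precision), round(lon, precision)) for lat, lon in points
    )
    limit = max_share * len(points)
    return {loc: n for loc, n in counts.items() if n > limit}


# Hypothetical result: 30 addresses fell back to one city-center coordinate,
# while 70 others geocoded to distinct places.
fallback = [(42.3601, -71.0589)] * 30
spread = [(42.0 + i * 0.01, -71.0 - i * 0.01) for i in range(70)]
flagged = suspicious_locations(fallback + spread)
print(flagged)
```

Flagged coordinates are worth comparing against the geocoder's known fallback points (city or state centers) before plotting.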
60. !60
Scale and Scope
Draw a scale on your map, or use common
references.
http://www.theatlantic.com/technology/archive/2012/08/the-apollo-11-landing-site-superimposed-on-a-baseball-diamond/261802/
61. !61
Scale and Scope
How much context do you need?
http://en.wikipedia.org/
ProTip: wikimedia.org has amazing SVG maps!
65. !65
Overlays
An image is placed on top of a geographic map.
• Google Earth KML
• Google Maps API
• Need to know the spatial coordinates of the image boundaries
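For the KML route, the image boundaries go into a GroundOverlay element with a LatLonBox. The sketch below builds such a document as a string; the file name and the bounding coordinates are hypothetical placeholders, and the helper function is not part of any library.

```python
from xml.sax.saxutils import escape


def ground_overlay_kml(name, image_href, north, south, east, west):
    """Build a minimal KML document placing an image inside a lat/lon box."""
    if not south < north:
        raise ValueError("south must be less than north")
    return f"""<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
  <GroundOverlay>
    <name>{escape(name)}</name>
    <Icon><href>{escape(image_href)}</href></Icon>
    <LatLonBox>
      <north>{north}</north>
      <south>{south}</south>
      <east>{east}</east>
      <west>{west}</west>
    </LatLonBox>
  </GroundOverlay>
</kml>
"""


# Hypothetical image and bounds (decimal degrees).
kml = ground_overlay_kml(
    "Density heatmap", "heatmap.png",
    north=19.0, south=18.0, east=-72.0, west=-73.0,
)
print(kml)
```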
75. !75
Effective Distance
What distance do we care about? Time, geographic, number of transfers?
20 min walk.
http://www.mapnificent.net/
20 min drive.
20 min subway.
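To make the contrast concrete, the sketch below computes geographic (great-circle) distance with the haversine formula and converts it to travel time at a given speed. The speeds are rough assumptions, and straight-line distance ignores streets, schedules, and transfers, which is exactly why the two notions of distance can disagree.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def travel_minutes(distance_km, speed_kmh):
    """Time to cover a distance at a constant speed."""
    return 60.0 * distance_km / speed_kmh


# Assumed average speeds in km/h; real values vary by city and time of day.
SPEEDS = {"walk": 5.0, "drive": 30.0, "subway": 35.0}

one_degree = haversine_km(0.0, 0.0, 0.0, 1.0)  # one degree of longitude at the equator
for mode, speed in SPEEDS.items():
    reach_km = speed * 20 / 60  # straight-line reach in 20 minutes
    print(f"{mode}: about {reach_km:.1f} km in 20 min")
```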
76. !76
Spatial Patterns
How can we reveal different spreading behaviors?
http://www.historyofinformation.com/index.php?category=Statistics+%2F+Demography
http://mobs.soic.indiana.edu/projects/contagion-models-andadaptive-behavior
77. !77
Spatial Patterns
Do we need backgrounds, scales, context?
http://cargocollective.com/coopersmith#1327371/Nike-Plus-Visualization
78. !78
Geographic Data
Good start, but what about Tufte’s principles? How could we improve this?
What encodings could we add?
http://blog.echen.me/2012/07/06/soda-vs-pop-with-twitter/
81. !81
Geographic Data
Eric Fischer is the king of mapping dots.
https://www.mapbox.com/labs/twitter-gnip/locals/#5/38.000/-95.000
https://www.mapbox.com/blog/mapping-millions-of-dots/
http://www.flickr.com/photos/walkingsf/sets/72157627140310742/
http://demographics.coopercenter.org/DotMap/index.html
89. !89
Provide context
Even if you have never been to
Paris, you know how big your
country is.
http://persquaremile.com/2011/01/18/if-the-worlds-population-lived-in-one-city/
91. !91
When maps shouldn’t be maps
“But sometimes the reflexive impulse to map the data
can make you forget that showing the data in another
form might answer other — and sometimes more
important — questions.” - Matthew Ericson
http://www.ericson.net/content/2011/10/when-maps-shouldnt-be-maps/
93. 93
Maps for non-spatial data
Show hierarchy and proportion.
http://bigthink.com/strange-maps/579-a-1939-map-of-physics
94. !94
When maps shouldn’t be maps
• When the interesting patterns
aren’t geographic patterns
• When the geographic data is
more effective for analysis
http://www.ericson.net/content/2011/10/when-maps-shouldnt-be-maps/
95. !95
When maps shouldn’t be maps
Should this be a better map or not a map at all?
http://life.mappinglondon.co.uk/
96. 96
Relationships
Same system, different intent.
http://www.washingtontimes.com/blog/watercooler/2010/jul/28/republicans-release-new-more-complexobamacare-cha/
http://stevemackley.com/2009/08/healthcare-graphic/
97. 97
Network Visualization
Problem: Networks are high-dimensional objects that must be visualized in two-dimensional space. The same network has many visualizations.
Choose a mapping that gives insight into the structure of your data.
[Figure: "Using just visualization"]
http://jponnela.com/web_documents/icpsr_visualization.pdf
http://upload.wikimedia.org/wikipedia/commons/d/d2/Internet_map_1024.jpg
98. 98
Network Flows
Nodes and edges are encoded with color, size, and direction.
http://www.nytimes.com/imagepages/2011/10/22/opinion/20111023_DATAPOINTS.html?ref=sunday-review
99. 99
Networks
Networks can be stunningly effective if presented correctly.
Watch Eric Berlow explain how to interpret this network in a great TED Talk.
http://www.ted.com/talks/eric_berlow_how_complexity_leads_to_simplicity.html
http://www.nytimes.com/2010/04/27/world/27powerpoint.html
101. 101
Networks
Spatial networks have constrained topology and different statistical properties.
Eric Fischer uses Twitter data to
map important roads.
http://www.flickr.com/photos/walkingsf/6747484741/
102. 102
Networks
Statistically similar networks can have strikingly different topologies.
[Figure: "Using just metrics", comparing Network A (Barabasi-Albert) with Network B (Watts-Strogatz).]
http://jponnela.com/web_documents/icpsr_visualization.pdf
108. 108
Network Visualization
Node and edge attributes show important relationships (sometimes)…
http://mashable.com/2010/12/13/facebook-members-visualization/
Population density problem!
109. 109
Mashups make stories
The whole of two data sets is greater than the sum of its parts.
http://woj.com/False-Color-Facebook-NASA-Mashup.png
Now it tells a geopolitical story!
111. 111
Mashups make stories
Does this visualization best convey the claim?
“An image of regional communication
diversity and socioeconomic ranking
for the UK. We find that communities
with diverse communication patterns
tend to rank higher (represented from
light blue to dark blue) than the
regions with more insular
communication. This result implies
that communication diversity is a key
indicator of an economically healthy
community.”
http://www.sciencemag.org/content/328/5981/1029.abstract
113. 113
Time
How do you show changes in order/rank over time?
[Figure 4.1 (Mapping Paths to Prosperity, p. 81): Evolution of the ranking of countries based on ECI between 1964 and 2008. Please see pages 352-353 for a larger version. Countries are listed by rank, 1 to 101, at both ends of the chart; the x-axis runs from 1964 to 2008 in two-year steps.]
ranking 4 looks at changes in economic complexity. Here countries are ranked according to the change in ECI experienced between 1964 and 2008. Because of data availability,
position of China in this ranking reflects the fact that China’s transformation built on a productive structure that was more sophisticated than that of many of its regional neigh-
http://www.chidalgo.com/Papers/HidalgoHausmann_DAI_2008.pdf
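A rank-over-time chart like this one needs the ranks per period before anything is drawn. A minimal sketch of that preparation step follows; the scores and country labels are invented for illustration and are not taken from the figure.

```python
def rank_by_year(scores):
    """scores: {year: {name: value}} -> {year: {name: rank}}, rank 1 = highest value."""
    ranks = {}
    for year, values in scores.items():
        ordered = sorted(values, key=values.get, reverse=True)
        ranks[year] = {name: i + 1 for i, name in enumerate(ordered)}
    return ranks


def rank_change(ranks, start, end):
    """Positions gained between two years (positive = moved up the ranking)."""
    return {
        name: ranks[start][name] - ranks[end][name]
        for name in ranks[start]
        if name in ranks[end]
    }


# Hypothetical index values for three made-up countries.
scores = {
    1964: {"A": 2.0, "B": 1.5, "C": 0.3},
    2008: {"A": 1.1, "B": 0.2, "C": 1.9},
}
ranks = rank_by_year(scores)
change = rank_change(ranks, 1964, 2008)
print(ranks)
print(change)
```

Each name then becomes one line whose y-position is its rank in each year, so crossings show overtakes directly.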
114. 114
Time
The x-axis is reserved for left and
right political meanings, time is
moved to the y-axis.
http://friggeri.net/research/senate/
115. 115
Time
Aligning different units of time makes for easier comparison.
http://www.vijayp.ca/blog/2012/06/colours-in-movie-posters-since-1914/
116. 116
Streamgraphs
Show relative proportion over time. What is lost?
http://www.nytimes.com/interactive/2008/02/23/movies/20080223_REVENUE_GRAPHIC.html
https://euro2012.twitter.com/
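One way to see what is lost is to compute where each layer actually sits. The sketch below stacks series around a centered baseline, one simple choice among several used for streamgraphs; the input values are hypothetical and the function is not taken from any library.

```python
def stream_layers(series):
    """Stack series around a centered baseline.

    series: list of equal-length lists of non-negative values
    Returns one list per series of (y0, y1) pairs, one pair per time step.
    """
    n_steps = len(series[0])
    totals = [sum(layer[t] for layer in series) for t in range(n_steps)]
    lower = [-total / 2 for total in totals]  # baseline centers the whole stack on 0
    layers = []
    for layer in series:
        upper = [lower[t] + layer[t] for t in range(n_steps)]
        layers.append(list(zip(lower, upper)))
        lower = upper
    return layers


# Two hypothetical series over two time steps.
layers = stream_layers([[1, 3], [3, 9]])
print(layers)
```

No layer starts at zero, so a layer's thickness can be compared over time, but its absolute value is hard to read off an axis.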
117. 117
Know your audience
Who is watching? What do they need to know?
The Weather Channel
http://understandinggraphics.com/visualizations/communicatingcritical-information-hurricane-irene/
New York Times
118. 118
Complexity
“Measures of Complexity: a non-exhaustive list”
1. Difficulty of description. Typically measured in bits.
• Information;
• Entropy;
• Algorithmic Complexity or Algorithmic Information
Content;
• Minimum Description Length;
• Fisher Information; Renyi Entropy;
• Code Length (prefix-free, Huffman, Shannon-Fano, error-correcting, Hamming);
• Chernoff Information;
• Dimension;
• Fractal Dimension;
• Lempel-Ziv Complexity.
2. Difficulty of creation. Typically measured in time,
energy, dollars, etc.
• Computational Complexity;
• Time Computational Complexity;
• Space Computational Complexity;
• Information-Based Complexity;
• Logical Depth;
• Thermodynamic Depth;
• Cost;
• Crypticity.
3. Degree of organization. This may be divided up into two quantities:
a) Difficulty of describing organizational structure, whether corporate,
chemical, cellular, etc.;
b) Amount of information shared between the parts of a system as the
result of this organizational structure.
a) Effective Complexity
• Metric Entropy; Fractal Dimension; Excess Entropy;
• Stochastic Complexity;
• Sophistication;
• Effective Measure Complexity;
• True Measure Complexity;
• Topological epsilon-machine size;
• Conditional Information;
• Conditional Algorithmic Information Content;
• Schema length;
• Ideal Complexity;
• Hierarchical Complexity;
• Tree subgraph diversity; Homogeneous Complexity;
• Grammatical Complexity.
b) Mutual Information:
• Algorithmic Mutual Information;
• Channel Capacity;
• Correlation;
• Stored Information;
• Organization.
In addition to the above measures, there are a number of related concepts that are not quantitative measures of complex
Gell-Mann, Murray and Seth Lloyd. Information measures, effective complexity, and total information. Complexity 2 (1996): 44-52.
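As one concrete instance of a "difficulty of description" measure in bits, Shannon entropy can be computed from symbol frequencies. This is a minimal sketch of that single measure, with made-up input strings; it does not cover the other entries in the list.

```python
import math
from collections import Counter


def shannon_entropy_bits(symbols):
    """Shannon entropy, in bits per symbol, of an observed sequence."""
    counts = Counter(symbols)
    n = sum(counts.values())
    if n == 0:
        return 0.0
    return sum((c / n) * math.log2(n / c) for c in counts.values())


uniform = shannon_entropy_bits("abcd")   # four equally likely symbols
two_way = shannon_entropy_bits("aabb")   # two equally likely symbols
constant = shannon_entropy_bits("aaaa")  # no uncertainty at all
print(uniform, two_way, constant)
```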
121. 121
Complexity
When “at a glance” is not enough.
[Figure 2 (Developing Alternatives): Network representation of the 1998–2000 product space. Legend: node color (product categories including Petroleum, Raw Materials, Forest Products, Tropical Agriculture, Animal Agriculture, Cereals, Labor Intensive, Capital Intensive, Machinery, Chemicals); link color (proximity); node size (millions of dollars). Labeled regions: Fruit, Fishing, Oil, Vegetable Oils, Vegetables, Forest Products, Vehicles, Mining, Garments, Iron/Steel, Textiles, Machinery, Electronics.]
most poor countries can only reach the levels of development enjoyed by rich countries if they are able to jump distances that are quite infrequent in the historical record (Figure 2). In other words, the “stairway to heaven” presents some very tall steps.
http://www.chidalgo.com/Papers/HidalgoHausmann_DAI_2008.pdf
135. 135
Huge list of resources
Matplotlib (Python plotting)
ggplot2 (R and Python)
Processing
Unfolding Maps (maps for Processing)
D3.js
OpenLayers (JavaScript drawing on maps)
WebGL
Google Maps API
OpenStreetMap API
CloudMade (OSM Style Editor)
Quantum GIS (QGIS)
TileMill (Custom Map Tiles)
ColorBrewer
COLOURlovers (Crowdsourced color palettes)
JunkCharts (commentary on bad visualizations)
Flowing Data (tutorials, commentary, and more)
WTFviz (collection of bad examples)
Information is Beautiful (data visualizations)
InfoAesthetics (data visualization blog)
136. 136
Many thanks to
Tom Crawford
Visual Thinker, Speaker Coach, App Designer, Educator
http://www.viznetwork.com/about.html
Karl Gude
Designer, Story Teller, Journalist, Educator
http://karlgude.com/