The presentation has two parts. The first introduces the core fundamentals of data visualization and applies them in two case studies. The second covers the challenges of data journalism and what pykih is doing about them.
Visualizing Data Journalism (HasGeek Fifth Elephant)
1. Visualizing Data Journalism
Ritvvij Parrikh,
Founder, www.pykih.com
Fifth Elephant, Delhi Run-up Event,
India Today Mediaplex, June 14, 2014
2. Pykih is a data Visualization company. We build custom visual representations of
large data sets to make data actionable for readers.
We have satisfied customers in six countries.
Introduction
3. • Data Viz.
• Theory
• Case Study 1
• Case Study 2
• Summary
• Challenges in Data Journalism
• What we are doing about it for ourselves
Agenda
5. Let’s explore the humble pie chart…
Party Percentage
E 38%
D 25%
C 20%
B 15%
A 2%
Break the whole into parts.
Data: One dimensional. Visual encoding: Area.
7. New Terms
• Dimension: Columns by which you group data.
• Facts: Numbers that you can count, sum, average, etc.
• Examples:
• Seat count by party
• Seat count by party and state
• Visual Encoding: Area, Position, Colour, Length, Thickness, etc.
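A minimal sketch of the two terms, assuming rows shaped like a spreadsheet export. The rows, the column names and the helper `sum_fact` are hypothetical and only illustrate "seat count by party" (one dimension) and "seat count by party and state" (two dimensions).

```python
from collections import defaultdict

# Illustrative rows only, not real election results.
rows = [
    {"party": "A", "state": "North", "seats": 3},
    {"party": "A", "state": "South", "seats": 2},
    {"party": "B", "state": "North", "seats": 4},
    {"party": "B", "state": "South", "seats": 1},
    {"party": "A", "state": "North", "seats": 1},
]


def sum_fact(rows, dimensions, fact):
    """Group rows by the dimension columns and sum the fact column."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[d] for d in dimensions)  # one key per group
        totals[key] += row[fact]
    return dict(totals)


# One dimension: seat count by party.
seats_by_party = sum_fact(rows, ["party"], "seats")

# Two dimensions: seat count by party and state.
seats_by_party_state = sum_fact(rows, ["party", "state"], "seats")
```

The dimensions decide how many groups you get; the fact is the number that is aggregated inside each group.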
12. One-dimensional Charts …
What is wrong here?
Problems:
• Colour communicates no data
• 3D communicates no data
Source: thehindu.com
#2 - Your goal is to communicate data. Wrong use of visual encoding confuses.
15. One-dimensional Charts …
What is wrong here?
Problems:
• Colour
• Too many values. Too cluttered.
Source: firstpost.com
#3 - AREA encoding is useful for only a few values; beyond that it becomes unreadable.
21. Grouped One-dimensional Charts
Group various bubbles by colours
Party Alliance Percentage
A NDA 38%
B NDA 25%
C NDA 20%
D UPA 15%
E Others 2%
#4 - You can always fit in an extra dimension (GROUP) in charts using colour.
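A minimal sketch of lesson #4, using the table on this slide. The palette values and the helper `colour_by_group` are assumptions for illustration; the point is only that every party in the same alliance receives the same colour.

```python
# Table from the slide: (party, alliance, percentage).
parties = [
    ("A", "NDA", 38),
    ("B", "NDA", 25),
    ("C", "NDA", 20),
    ("D", "UPA", 15),
    ("E", "Others", 2),
]

# Hypothetical palette; any set of clearly distinct colours works.
PALETTE = ["#e6550d", "#3182bd", "#969696"]


def colour_by_group(rows, palette):
    """Assign one colour per group, in order of first appearance."""
    group_colour = {}
    bubbles = []
    for party, alliance, percentage in rows:
        if alliance not in group_colour:
            # Cycle through the palette if there are more groups than colours.
            group_colour[alliance] = palette[len(group_colour) % len(palette)]
        bubbles.append({
            "party": party,
            "alliance": alliance,
            "percentage": percentage,
            "colour": group_colour[alliance],
        })
    return bubbles


bubbles = colour_by_group(parties, PALETTE)
```

Here colour carries the GROUP dimension, so it communicates data instead of decoration.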
22. New Data Set
One dimensional:
Seat count by party
Grouped One dimensional:
Seat count by party grouped by alliance
Two dimensional:
Which party won in which year
25. Two-dimensional Charts…
Scatter, Line, Area, Bar, Column, Spider
All these charts require the same data.
#5 - The number of dimensions in the data determines which chart to use.
26. New Data Set
Two dimensional:
Which party won in which constituency
Weighted Two dimensional:
Which party won in which constituency by what vote margin
28. Weighted Two-dimensional Charts …
Let’s add a weight to each point, so every row now has three values.
X axis Y axis Weight
A Z 40
B Y 20
C X 1
D V 300
E W 60
Visual encoding: Position, Length, Area
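A minimal sketch of encoding the weight column as area, using the rows from this slide. The slide does not say how size is computed; scaling the radius by the square root of the weight (so that circle area, not radius, is proportional to the weight) is a common convention and an assumption here, as is `max_radius`.

```python
import math

# Rows from the slide: (x value, y value, weight).
points = [
    ("A", "Z", 40),
    ("B", "Y", 20),
    ("C", "X", 1),
    ("D", "V", 300),
    ("E", "W", 60),
]


def radius_for(weight, max_weight, max_radius=30.0):
    """Radius whose circle AREA is proportional to the weight."""
    return max_radius * math.sqrt(weight / max_weight)


max_weight = max(weight for _, _, weight in points)
radii = {x: radius_for(weight, max_weight) for x, _, weight in points}
```

With this scaling, A (weight 40) is drawn with twice the area of B (weight 20), not twice the radius, which would exaggerate the difference.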
29. Weighted Two-dimensional Charts …
Weighted Scatter, Circle Comparison
All these charts require the same data.
#6 - You can always fit in an extra fact (WEIGHT) in charts using size.
30. New Data Set
Grouped Weighted Two dimensional:
Which party won in which constituency by what vote margin grouped by alliance
32. Multi-series Two-dimensional Charts …
Range, Gantt, Multi-series Line
Group Column, Stack Column, Group Stack Column
Stack Area, Stack Percentage Area
Add more dimensions in creative ways.
33. Multi-series Two-dimensional Charts …
What is right and wrong here?
Source: livemint.com
Is the equities rally percolating into the broader market?
Bad parts:
• The BSE Small-cap line is not visible, and that’s the story.
Good parts:
• Y axis starts from 97 instead of 0
#7 - The purpose of a line chart is to show the trend. Focus on it.
37. Multi-series Two-dimensional Charts …
What is wrong here?
Source: livemint.com
Problems:
• Cannot find the IMF line.
Does IMF wear rose-tinted glasses?
#8 - Highlight the story for the user. Use colour to highlight, not confuse.
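A minimal sketch of lesson #8: give the series that carries the story an accent colour and mute every other series. The colour values, the helper `series_colours` and the series names other than IMF are hypothetical.

```python
HIGHLIGHT = "#d62728"  # hypothetical accent colour for the story series
MUTED = "#cccccc"      # hypothetical neutral grey for everything else


def series_colours(series_names, story_series):
    """Return one colour per series, accenting only the story series."""
    if story_series not in series_names:
        raise ValueError(f"unknown series: {story_series!r}")
    return {
        name: HIGHLIGHT if name == story_series else MUTED
        for name in series_names
    }


# "Forecaster B" and "Forecaster C" are placeholder names.
colours = series_colours(["IMF", "Forecaster B", "Forecaster C"], "IMF")
```

The reader's eye lands on the one coloured line first, which is the line the headline is about.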
39. New Data Set
All the data we have encountered so far was tabular (RDBMS-style), i.e. it could fit in a spreadsheet (rows and columns).
Sometimes data is more complex. It can have “relationships”.
Types of relationships:
• Hierarchy / Tree
• Multi-level relationships
51. Taxonomy of Standard Data Visualizations
• One dimensional charts
• Grouped one dimensional charts
• Two dimensional charts
• Weighted Two dimensional charts
• Grouped Two dimensional charts
• Grouped Weighted Two dimensional charts
• Multi-dimensional Charts
• Tree Charts
• Grouped Weighted Tree Charts
• Multi-level Relationships Charts
• Grouped Weighted Multi-level Relationships Charts
• Two-level Relationships Charts
• Grouped Weighted Two-level Relationships Charts
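A minimal sketch of how the tabular half of this taxonomy could be looked up from the shape of the data. The function `chart_family` and its parameters are hypothetical, and the mapping is one reading of the list, not a definitive rule; tree and relationship data are left out.

```python
def chart_family(dimensions, weighted=False, grouped=False):
    """Name the chart family for tabular data.

    dimensions: number of columns the data is grouped by
    weighted:   True if an extra fact is encoded as size
    grouped:    True if an extra dimension is encoded as colour
    """
    if dimensions < 1:
        raise ValueError("need at least one dimension")
    if dimensions > 2:
        return "Multi-dimensional charts"
    if dimensions == 1:
        if weighted:
            # The taxonomy lists no weighted one-dimensional family.
            raise ValueError("no weighted one-dimensional family in the taxonomy")
        base = "one-dimensional charts"
    else:
        base = "two-dimensional charts"
    flags = (("grouped", grouped), ("weighted", weighted))
    prefix = " ".join(word for word, enabled in flags if enabled)
    name = f"{prefix} {base}".strip()
    return name[0].upper() + name[1:]


# "Which party won in which constituency by what vote margin grouped by alliance"
family = chart_family(2, weighted=True, grouped=True)
```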
52. The same data can be visualized in many (MANY!)
ways. Without exploring the data, you will end up
visualizing all your data in pies, lines and bars.
Most Important Lesson
57.
Ball-by-ball commentary
Per Batsman Statistics
Per Bowler Statistics
Fall of Wickets
Partnerships
Two innings
Pre-match: Toss, Playing 11, Location, Time
Post-match: Win, by how much, Man of the match
Second Innings: Current Run Rate, Required Run Rate, Target score
58. Overs: Most important data-point
1. Overs = Time
2. One over
   1. has_many balls
   2. has_one bowler
   3. has_many batsmen
3. Existence of batsmen across overs is partnerships
4. Partnerships and Fall of wickets are different views of the same data set
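A minimal sketch of the over-centred model above, assuming each over records who was at the crease. The class names, fields, player names and the helper `batsman_spans` are hypothetical; the spans it returns are the kind of rows a batsmen-by-overs Gantt chart needs.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Ball:
    runs: int
    wicket: bool = False


@dataclass
class Over:
    number: int                                      # overs act as the time axis
    bowler: str                                      # has_one bowler
    batsmen: List[str]                               # has_many batsmen (at the crease)
    balls: List[Ball] = field(default_factory=list)  # has_many balls


def batsman_spans(overs: List[Over]) -> Dict[str, Tuple[int, int]]:
    """First and last over in which each batsman was at the crease."""
    spans: Dict[str, Tuple[int, int]] = {}
    for over in sorted(overs, key=lambda o: o.number):
        for name in over.batsmen:
            first, _ = spans.get(name, (over.number, over.number))
            spans[name] = (first, over.number)
    return spans


# Made-up innings fragment: Q is out in over 2 and R comes in.
overs = [
    Over(1, "Bowler 1", ["P", "Q"],
         [Ball(1), Ball(0), Ball(4), Ball(0), Ball(2), Ball(1)]),
    Over(2, "Bowler 2", ["P", "Q", "R"],
         [Ball(0), Ball(0, wicket=True), Ball(1), Ball(0), Ball(6), Ball(0)]),
    Over(3, "Bowler 1", ["P", "R"],
         [Ball(1), Ball(1), Ball(0), Ball(0), Ball(4), Ball(0)]),
]
spans = batsman_spans(overs)
runs_in_first_over = sum(ball.runs for ball in overs[0].balls)
```

Two batsmen whose spans overlap formed a partnership, and the over where a span ends with a wicket is a fall of wicket, so both can be derived from the same overs.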
61. Combine the two
Weighted two-dimensional chart
Y-axis: Balls per over
X-axis: Overs + Bowlers
Gantt chart
Y-axis: Batsmen
X-axis: Overs + Bowlers
All other “zoomable” information is shown via interactions
66. Election Counting Day
Data Set:
• India has 50+ regional parties and two national parties.
• During Election Counting Day (live), seats are either “Leading” or “Won”.
Data Properties / Relationships:
• Hierarchical relation between Alliance and Party
• Won is confirmed. Leading is transient.
What readers wanted to know this election:
• How badly the UPA would lose
• How big the BJP victory would be
• How big the impact of AAP would be
Real-world facts to inspire design:
• BJP is a right-wing party
• AAP is leftmost, followed by UPA
• The Sansad Hall is a semi-circle
67. Election Counting Day
Design decisions drawn from the points above:
• Group
• Tree
• Weight
• Limitation
• Shape
• Placement
Hence, all other parties can be clubbed into “Others”.
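A minimal sketch of turning live tallies into an alliance-to-party tree, with parties outside the alliances of interest clubbed into "Others". All names and counts are made up, and `build_tree` and `alliance_weight` are hypothetical helpers. Won and Leading are kept separate because one is confirmed and the other is transient.

```python
from collections import defaultdict

# Illustrative live tallies only; not real results.
tallies = [
    {"party": "P1", "alliance": "X", "won": 10, "leading": 5},
    {"party": "P2", "alliance": "X", "won": 3, "leading": 1},
    {"party": "P3", "alliance": "Y", "won": 4, "leading": 2},
    {"party": "P4", "alliance": "Z", "won": 1, "leading": 0},
    {"party": "P5", "alliance": "W", "won": 0, "leading": 2},
]


def build_tree(tallies, focus_alliances):
    """Build alliance -> party -> counts, clubbing the rest into Others."""
    tree = defaultdict(lambda: defaultdict(lambda: {"won": 0, "leading": 0}))
    for row in tallies:
        in_focus = row["alliance"] in focus_alliances
        alliance = row["alliance"] if in_focus else "Others"
        party = row["party"] if in_focus else "Others"
        node = tree[alliance][party]
        node["won"] += row["won"]
        node["leading"] += row["leading"]
    # Convert to plain dicts so missing keys raise instead of being created.
    return {alliance: dict(members) for alliance, members in tree.items()}


def alliance_weight(tree, alliance):
    """Weight of an alliance: seats won plus seats currently leading."""
    return sum(n["won"] + n["leading"] for n in tree[alliance].values())


tree = build_tree(tallies, focus_alliances={"X", "Y"})
```

The tree gives the hierarchy, the summed counts give the weight, and the alliance gives the group, which is the input a Grouped Weighted Tree chart expects.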
68. Choosing the right Grouped Weighted Tree Chart
Packed Circle, Sunburst, Tree Rectangle, Tree Bar
Grouped Weighted Tree
Visual encoding: Position, Size, Colour
75. Steps in data journalism
• Data Collection: Govt. data, APIs, Scrape, Mine web, PDFs
• What’s the story: Clean the data, Model the data, Investigate
• Visualize: Design, Build
• Story: Write
Roles involved: Journalist, Developer, Designer
Technology is an integral part of data journalism.
76. Formats in data journalism
• Data Driven Stories: Day-to-day short stories derived from data
• Visualization App: Big apps to educate readers about large and important events, e.g. budget, election, etc.
77. Format #1 - Data Driven Stories
Source: http://factchecker.in/data-are-crimes-against-scheduled-castes-on-an-upswing-in-india/
Badaun Case —> Find legit Data —> Analyse —> Plot —> Story
79. Implication
• Format: Visualization app: Journalist, Developer, Designer
• Format: Data Driven Stories: Journalist
80. Challenges
High Level
1. Quick access to appropriate data set
2. Quick analysis of this data
3. Consistently churn out neat charts, graphs and maps
81. Challenges
Technical
1. Live Data Modelling
2. SEO
3. How to handle high traffic
82. Challenges
From pykih perspective
1. How do you consistently build beautiful, real-time Visualizations?
83. What we are doing about it
In-house tool called "Backstage"
84. Principles
#1 - Instead of waiting for data to be standardised, we want to make large-scale, high-velocity, multi-format data extraction durable.
#2 - Instead of expecting data-users / journalists to have analytical skills, we are:
• simplifying exploration of large data sets
• automating extraction of metadata from data sets
• simplifying assisted data standardisation
• building tools for assisted analysis
#3 - Instead of expecting data-users / journalists to visualize data correctly, we are attempting to automate metadata-driven visualization.
Other Experiments
• A data-driven blogging software
• Configuration Editor
Demos and notes:
• Demo the worker
• Demo the census dashboard
• ISO example
• Demo NLP based Date Standardiser
• Story is in the outliers
Example: If data is ordinal, colour automatically leverages saturation; if data is nominal, colour is distinct.
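A minimal sketch of metadata-driven colour, assuming a column has already been tagged as ordinal (ordered categories) or nominal (unordered categories). The function names, the default hue and the saturation range are assumptions; this is not the Backstage implementation.

```python
import colorsys


def to_hex(hue, saturation, value):
    """Convert an HSV colour (all components in 0..1) to a hex string."""
    r, g, b = colorsys.hsv_to_rgb(hue, saturation, value)
    return "#{:02x}{:02x}{:02x}".format(
        round(r * 255), round(g * 255), round(b * 255)
    )


def colours_for(values, kind, hue=0.6):
    """Pick colours from the column's metadata rather than by hand."""
    n = len(values)
    if kind == "ordinal":
        # One hue; saturation rises with rank from 0.2 to 1.0.
        steps = [0.2 + 0.8 * i / max(n - 1, 1) for i in range(n)]
        return {v: to_hex(hue, s, 0.9) for v, s in zip(values, steps)}
    if kind == "nominal":
        # Evenly spaced hues so categories look unrelated.
        return {v: to_hex(i / n, 0.7, 0.9) for i, v in enumerate(values)}
    raise ValueError(f"unsupported kind: {kind!r}")


ordinal = colours_for(["low", "medium", "high"], "ordinal")
nominal = colours_for(["Party A", "Party B", "Party C"], "nominal")
```

Values for an ordinal column must be passed in rank order, since position in the list decides the saturation.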
85. Data Visualization company => Data and Visualization company
Effective Data Journalism leverages: NoSQL, memory-based databases, NLP, OLAP modelling, Free Text Search, Statistics, etc.
Summary
86. We are at @pykih
Fun fact: The word pykih came to us
in a CAPTCHA. That’s the day we
decided that till we do good work it
does not matter what we are called.