An Introduction to Data Journalism

One take on what data journalism may or may not be… a lecture presented to journalism students at the University of Lincoln, UK, February 2014.

1

Let’s start with an easy(?!) question – what is journalism?

One way of answering that question is to list some of the functions, or attributes, associated with it – informing, educating, holding to account, watchdog function, campaigning, contextualising.

2

Sensemaking seems to me to be an important part of it… In part contextualisation, in part identifying the bits that make the difference, the bits that make it important, the bits that make it news that people need to know.

3

Second question: what is data? National statistics, sports results, polls, financial figures, health data, school league tables, etc etc.

Is a book data? Or a speech? What if I split a speech up into separate words, count the occurrence of each unique word and then display the result as a “tag cloud”, or word frequency diagram?
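The word-counting step described above can be sketched in a few lines of Python. This is only an illustration: the sample text is made up, and the rule for what counts as a "word" (runs of letters and apostrophes, lowercased) is an assumption rather than anything prescribed in the lecture.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Split text into lowercase words and count each unique word."""
    # Assumption: a "word" is a run of letters or apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


# Illustrative text only, not a real speech.
speech = "We will count the words. We will count every word we find."
freqs = word_frequencies(speech)

# The most frequent words are the ones a tag cloud would draw largest.
for word, count in freqs.most_common(3):
    print(word, count)
```

The resulting counts are exactly the data a tag cloud displays: each word becomes a label, and its count sets the size.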

4

One way of thinking about data is that it is a particular sort of source, or a source that can respond to a particular style of questioning in a particular way.

Another take on this is that many “data sources” are experts on a particular topic, experts that know a lot of a very particular class of facts.

5

So what is data journalism?

One way is to think of it as a process, as exemplified by Paul Bradshaw’s inverted pyramid of data journalism. I see it more as a conversation in which data is one of the conversants. The conversational view also allows us to think about process, but more important, for me, is that in a conversation, it gets personal…

6

The inverted pyramid gives us one way of considering the data journalistic process, or at least identifying some of the steps involved in a data investigation.

But there are many other ways of conceptualising the process – for example, finding stories and telling stories…

7

When it comes to finding stories, do we:

a)  want to find stories in a dataset we are provided with, or
b)  use data to help draw out a story lead we have already been tipped off to?

8

One of the ways I like to work with data is to have a conversation with it – asking questions of it and then further questions based on the responses I get.

9

Sometimes it looks at first as if we have data in a form where we might be able to do something with it – then we realise it needs cleaning and reshaping.

For example, in this case we have percentage signs contaminating numbers and data organised in separate sections, so how do we get a “well behaved” view over data from all the wards? We also have different sorts of data: votes polled per candidate versus the size of the electorate in a particular ward, for example.

Walkthrough: http://blog.ouseful.info/2013/05/03/a-wrangling-example-with-openrefine-making-ready-data/
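The walkthrough linked above uses OpenRefine. As a rough sketch of the same kind of cleaning step in Python (not the method used in the walkthrough), the following strips percentage signs so that cells can be treated as numbers. The sample cell values are invented for illustration.

```python
def parse_percentage(cell):
    """Turn a string such as ' 12.5% ' into the float 12.5.

    Returns None for blank or non-numeric cells, so that dirty values
    stand out rather than being silently treated as zero.
    """
    cleaned = cell.strip().rstrip("%").strip()
    if not cleaned:
        return None
    try:
        return float(cleaned)
    except ValueError:
        return None


# Made-up cells of the sort found in a results table.
raw_cells = ["45.2%", " 12% ", "", "n/a"]
cleaned_cells = [parse_percentage(c) for c in raw_cells]
print(cleaned_cells)
```

Returning None for anything that cannot be read as a number is a design choice: it leaves a visible gap that can be checked against the original source.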

10

One of the first datasets I played with was MPs’ expenses data. Here are a couple of ways I started to cha

The bar chart is ordered, for a particular expenses area, by total amount for each individual MP.

The block histogram shows how many MPs made a total claim in a particular expenses area of a particular

A scatterplot is another very powerful sort of chart – we can plot two sorts of value against each other to

Some scatterplot tools allow you to size or colour nodes according to further dimensions. Colouring node

Maps can be used to pull out different sorts of relationships. For example, by plotting markers in the centre of each MP’s ward, coloured by the total value of the travel expenses claimed in a particular area, we can easily see whether or not an MP is claiming an amount significantly different to MPs in neighbouring wards. In this case – travel expenses – we might expect (at first glance at least) a homophilic effect: folk a similar distance away from Westminster should presumably make similar sorts of travel claim? At second glance, we might then start to refine our questioning – does ward size (in terms of geographical area) or rurality have an effect? Does an MP travel to and from home more than neighbours (or perhaps claim more in terms of accommodation in London)?

13

Sometimes we need to provide quite a lot of explanation when it comes to making sense of even a simple data visualisation – “what am I supposed to be looking at?”

14

Contextualisation can take many forms – Trinity Mirror Group have a data unit that produces partially packaged data stories and lines for regional titles, who can then add local colour, knowledge, interpretation and spin to the resulting story.

15

For many readers, it may be that data ONLY makes sense when appropriately contextualised.

In passing, it’s also worth noting that sometimes the data you don’t collect affects the interpretation of the data you do…

For example: http://www.open.edu/openlearn/science-maths-technology/mathematics-and-statistics/statistics/diary-data-sleuth-when-the-data-you-dont-collect-affects-the-data-you-do

16

In passing, it’s worth mentioning that one thing statistics does is help provide context. Is this number a big number in the greater scheme of things? Is this thing likely to happen by chance or is there a meaningful causal relationship between this thing and another thing?

The chart in the corner is a reminder about how surprising probabilities can be. The chart shows the probability (y-axis) that at least two people in a group share a birthday (the number of people in the group is given on the x-axis). The chart shows that if there are 23 or more people in a room, there is more than a 50/50 chance that two of them will share a birthday (that is, share the same birth day and month, though not necessarily the same birth year).

How many people are in the room? If it’s 23 or more – I bet that at least two people share a birthday (at least in terms of day and month).
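The 23-person threshold can be checked with a short calculation. The sketch below assumes 365 equally likely birthdays and ignores leap years, which is the usual simplification for this puzzle; the function name is just illustrative.

```python
def shared_birthday_probability(people, days=365):
    """Probability that at least two of `people` share a birthday.

    Assumes every day is equally likely and ignores 29 February.
    """
    p_all_different = 1.0
    for i in range(people):
        # The (i+1)th person must avoid the i birthdays already taken.
        p_all_different *= (days - i) / days
    return 1.0 - p_all_different


for n in (10, 22, 23, 30):
    print(n, round(shared_birthday_probability(n), 3))
```

It is easier to work out the chance that everyone's birthday is different and subtract that from 1 than to count all the ways a match could happen directly.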

17

The other way of using data is to tell stories. But what does that even mean…?

18

A common source of stories based on data is polls, either polls that are commissioned by a publisher with a view to generating a story, or polls commissioned by a lobbying group or PR firm to promote not only stories around a particular issue, but stories that follow a line favourable to the organisation that commissioned the poll (or detrimental to positions that whoever commissioned the poll is campaigning against).

When presented with a press release written around a poll commissioned by a PR company, look to the raw data to see where the numbers that appear in the press release quotes actually come from.

In the above example, I could for example claim that 96% of people (creative reading of the numbers) did not appear to disagree with the idea that press behaviour should be independently regulated (creative reading of the question; the repeated negatives also serve to further confuse what is, or isn’t, actually being claimed…).

And when reading raw results, or quoting from them, take care which numbers you quote. Sometimes the presentation of the results can lead to you misreading them or the way they add up.

Sometimes, two or more polls may be commissioned around the same topic and appear to give contradictory results. For an example of this, see: http://www.open.edu/openlearn/science-maths-technology/mathematics-and-statistics/statistics/two-can-play-game-when-polls-collide

19

Many polling organisations publish press releases featuring “highlight” results from a poll. The more reputable ones also publish copies of the poll or survey questions and the results that were returned.

YouGov polls often split results down by political persuasion or newspaper preference, as well as demographically segmenting responses by gender, age or region.

The majority of polling organisations publish the data via PDFs rather than “as data”, for example, in the form of spreadsheet datatables. Tools such as Tabula (URL) are making it increasingly easy to extract the data contained within PDFs into actual datatables. Your local techie should also be able to “scrape” the data from a PDF document and put it into a data form.

For examples of how to scrape data as well as images from PDF documents, see:

- scraping data tables from PDFs:
- extracting images from PDFs:

Even if you feel as if you can’t do this yourself, you should make yourself aware of what is possible and achievable by people who have the skills to perform these tasks.

20

Stephen Few has written several excellent books about creating data visualisations and data dashboards, although you shouldn’t necessarily believe everything he says!

This quote gets across the idea that just as we use emphasis and tone in written communication, we also can and should make use of emphasis and tone in charts.

Many newspapers are starting to make use of charts that show several datapoints (for example, several bars in a bar chart) but highlight one or two of them that are the focus of a particular storyline, the other points or bars being used to provide context.

In chart design, “less is more” often works (this reflects a principle attributed to data visualisation guru Edward Tufte of using “least ink” when creating charts).

21

This video – showing part of a lecture by science fiction writer Kurt Vonnegut – shows how simple lines can tell archetypal stories. Note how the narration sets the scene: the axes are explained, then the line is constructed. When the x-axis represents time, remember that someone riding the line as it was constructed does not necessarily know what the future holds. When you see a line chart with time as an x-axis, remember that it shows a trace of a story that unfolded over time.

Another powerful example of this can be found on YouTube – search for house price rollercoaster to find an animation where house price values over time are visualised as an animated roller coaster ride…

22

This second clip shows Hans Rosling, the Swedish health statistician made famous by his “data performances”, narrating an animated data visualisation rendered using a dynamic bubble chart technique that he popularised via his Gapminder website.

Note how the first 30 seconds of the clip are spent explaining the set up of the chart – what the axes mean, what the bubbles represent. When you see a rich data driven interactive on a website, how much coaching and contextualisation is provided to help the user/reader make sense of it?

If you turn the sound off on the Rosling clip, how much sense do the moving bubbles make in terms of the story they tell without the benefit of Rosling’s narration? Can you tell where to focus your attention to pull out a meaningful storyline? Are there many possible storylines that can be pulled out? What tricks does Rosling use to focus your attention on – and illustrate – the story he is telling? Is there any sleight of hand in terms of not commenting on what some of the other bubbles are doing (is he using, or could he potentially use, misdirection to focus your attention away from possible stories he does not want you to pull out of the data?)

For more examples of Rosling’s compelling performances, see the recent OU/BBC Two co-production “Don’t Panic – The Truth About Population Change” available on the Gapminder website: http://www.gapminder.org/videos/dont-panic-the-facts-about-population/

23

Few suggests that graphical communication requires stylistic devices that emphasise particular aspects of a graphic. Hans Rosling achieves this by pointing to items of interest and reinforcing the emphasis with both his narration and the use of overlays on the graphic itself.

So how can we go about drawing emphasis within a static graphic or chart, such as one might find in a print publication?

24

To show one way of emphasising particular elements of a graphic, let’s produce a quick chart of our own.

The first thing we need is some data – I’m going to use some data from the Winter Olympics, a grab of the medal table from the back end of the first week of the 2014 games. The question I want to explore is the extent to which the countries leading the medal table when it is ranked by the number of gold medals awarded keep their positions when the table is instead ordered according to the total number of medals awarded.

The data I’m going to use comes from a Wikipedia page. The medal table is contained within an HTML table. To get the data out of the page we are going to screenscrape the HTML table that contains the data. There are a variety of tools for doing this, from browser extensions to scraper applications such as import.io, to environments such as Scraperwiki that provide a range of developer tools configured to support screenscraping based data collection.

But the tool I’m going to use is…

25

..Google (spread)sheets, and in particular a formula that will import a particular HTML table – in this case, the 2nd table in the page – from a specified URL, in this case the URL of the Wikipedia page containing the medal table.

The formula?

=importhtml("URL", "table", tableNumber)

On entering the formula, the spreadsheet will pull the data in from the Wikipedia page and make it available as spreadsheet data.

We can now use the spreadsheet to create charts within the sheet itself. If the data in the Wikipedia page is updated, the data in the spreadsheet will be updated whenever the spreadsheet is refreshed.
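To give a feel for what a formula like importhtml does behind the scenes, here is a minimal Python sketch that pulls the rows out of the 2nd table in a page. It is not how Google Sheets is implemented; the `TableExtractor` class, the tiny sample page and the "Country A/B" figures are all invented for illustration, and nested tables are not handled.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the cell text of every <table> in a page as lists of rows."""

    def __init__(self):
        super().__init__()
        self.tables = []   # one list of rows per table found
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


# A made-up page: the medal table is the 2nd table, as in the lecture example.
page = """
<table><tr><td>Navigation</td></tr></table>
<table>
<tr><th>Nation</th><th>Gold</th><th>Total</th></tr>
<tr><td>Country A</td><td>7</td><td>12</td></tr>
<tr><td>Country B</td><td>5</td><td>17</td></tr>
</table>
"""

parser = TableExtractor()
parser.feed(page)
medal_table = parser.tables[1]  # index 1 is the 2nd table in the page
for row in medal_table:
    print(row)
```

Note that the spreadsheet formula counts tables from 1, whereas the Python list here counts from 0.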

26

Whilst we could generate charts within the spreadsheet, I’m actually going to use an online tool called datawrapper (available at datawrapper.de).

Datawrapper charts are starting to make an appearance in many online news reports, such as those published by the Guardian and Trinity Mirror’s ampp3d, so being familiar with this tool – and what you can do with it – could be a useful skill to have.

To get the data in to datawrapper you can upload a CSV file, or paste a copy of the data in to the upload area. I’ve taken the latter approach, highlighting and copying the table from the spreadsheet and then pasting it in to datawrapper.

27

Having uploaded the data, we can configure several properties for each column. In many cases datawrapper should be able to detect what sort of content is contained within each column (for example, whether it is a number or a text field).

If necessary, we can apply a limited amount of processing to the contents of a specified column. We can also choose to hide one or more columns from the displayed view. In this case, I am going to hide the Rank, Silver and Bronze columns.

28

We now get to choose the chart type – I’m going to go for a horizontal bar chart and select the default datawrapper style.

29

Different chart types have different configuration options. I’m going to choose to automatically sort the bars based on the selected value – notice the buttons in the chart that allow us to select whether to display the Gold medal count or the Total medal count.

30

Now we get to add some emphasis – remember emphasis? This is an example about how to show emphasis in a chart…

In this case, I’m going to emphasise the top 2 positions in the Gold medal ranking – the “point” of the piece is to explore the extent to which these positions hold, or don’t hold, when we rank the table by total medal count.

At this point, we can also give the chart a title, and add some provenance information describing and pointing to the source of the data.

31

Here’s an example of the final chart, with the ranking (automatically) sorted according to total medal count. Note how the order and positioning of the two highlighted countries has changed.

The difference is made clearer still by the animation used when switching between the Gold and Total counts – the highlighted bars draw the eye and allow you to better see how their relative positions change across each of the two ranking schemes.
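The comparison the chart makes visually can also be sketched in code: rank the same table two ways and see where the highlighted countries land. The medal counts below are hypothetical and are not the real 2014 figures.

```python
# Hypothetical medal counts (gold, silver, bronze), not the real 2014 table.
medals = {
    "Country A": (7, 2, 3),
    "Country B": (5, 6, 6),
    "Country C": (4, 4, 2),
    "Country D": (3, 7, 8),
}


def rank_by(key):
    """Return country names ordered from highest to lowest by key(medals)."""
    ordered = sorted(medals.items(), key=lambda item: key(item[1]), reverse=True)
    return [name for name, _ in ordered]


by_gold = rank_by(lambda m: m[0])  # gold medals only
by_total = rank_by(sum)            # gold + silver + bronze

# Highlight the top 2 by gold, then see where they sit in the total ranking.
highlighted = by_gold[:2]
for name in highlighted:
    print(name,
          "gold rank:", by_gold.index(name) + 1,
          "total rank:", by_total.index(name) + 1)
```

Fixing the highlighted set from one ranking and then tracking it through the other is the same trick the chart uses: the emphasis stays on the same countries while the ordering changes around them.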

32

Having created the chart, you can now save it to your datawrapper account. An embed code for the chart is provided so that you can embed the chart within your own web page.

33

Bar charts are a very effective way of displaying particular sorts of information, such as counts. But what other ways are there of displaying data?

34

Datawrapper provides a variety of chart types, including:

- horizontal and vertical (column) bar charts,
- grouped bars that collate different bars according to groups (for example, election on election percentage of the vote for different political parties),
- stacked column charts (for example, for a selection of countries we could display a column showing the total number of medals constructed by stacking the individual gold, silver and bronze medal counts for those countries)
- line charts, which are widely used for plotting some value on the vertical y-axis against time on the horizontal x-axis
- pie charts, to show proportions of a whole, and variants thereof, such as the donut chart (a pie chart with the middle cut out)
- simple data tables (never underestimate the power of a table – they can be really useful for showing specific values, and can be very powerful when allowing the user to sort the table either by ascending or descending values in particular columns)
- maps, which as we shall see, can draw out very powerful relationships across data elements.

35

We’ve also seen some other “basic” charts that can be useful for displaying the distribution of data elements:

- the block histogram shows a count on the y-axis of data elements falling within particular ranges of values on the x-axis
- the scatterplot allows us to plot two values against each other, for example height versus weight. These charts can provide us with clues about possible correlations or relationships between the two values. Some scatterplot tools further allow us to colour each point according to group membership so that we can look to see whether points are clustered according to the group they belong to.
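The counting that sits behind a block histogram can be sketched as follows. The claim amounts and the bin width of 500 are made up for illustration; the point is simply that each value is assigned to a range and the ranges are then counted.

```python
from collections import Counter


def histogram(values, bin_width):
    """Count how many values fall in each bin [start, start + bin_width)."""
    counts = Counter(int(v // bin_width) * bin_width for v in values)
    # Sort by bin start so the result reads left to right along the x-axis.
    return dict(sorted(counts.items()))


# Made-up claim totals, one per person.
claims = [120, 480, 510, 950, 990, 1010, 1500]
bins = histogram(claims, 500)

for start, count in bins.items():
    print(f"{start}-{start + 499}: {'#' * count}")
```

Changing the bin width changes the shape of the histogram, so it is worth trying more than one before reading a story into the result.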

36

Visualising data is a powerful way of asking questions of data – what data points you choose to display and how you display them represent the framing of the question. What the data looks like is the response, but a response that often takes careful reading. The data source has drawn you the answer – you need to turn it into words that you can use to formulate further questions to check your understanding of the answer first provided. (Each question (each chart) typically leads to another… or more than one other…)

Asking questions that have a graphical answer is one way of querying a data source – but are there other approaches?

Let’s explore that a little more – what do we mean by asking questions of data?

37

A database that most of us use every day is the Google web search engine. We put in a key term or phrase and Google returns the web pages it deems most relevant to the query, ranked according to a variety of criteria (and it could well be that who you actually are affects the ranking).

Sometimes we may know what websites we actually want to search over. Google Custom Search Engines provide one way of defining your own search engine that just searches over the part of the web that you are interested in.

One of the custom search engines I have developed searches over websites that act as wire services for press releases: https://www.google.com/cse/publicurl?cx=016419300868826941330:wvfrmcn2oxc

This allows us to track down the source of many a news item and explore the extent to which a given news story has just churned a press release.

See also: http://blog.ouseful.info/2014/02/06/polling-the-news/ This post also describes how to create a bookmarklet that allows you to highlight a quote in a news report and search for press releases that contain that quote.

38

Here’s an example of the search engine in action – I’ve used a bookmarklet that takes a highlighted quote from a news story and passes it to the custom search engine, allowing me to easily see the source of the quote, and the story itself.

I’ve also started defining another related custom search engine that allows us to search news sites and polling companies for stories about, and sources of, polls and surveys:

https://www.google.com/cse/publicurl?cx=016419300868826941330:ewbi9skvnmq

39

Custom search engines are a powerful tool for helping us develop focussed web search tools that limit results to a particular part of the web we are interested in, either by location or topic.

We can also use (advanced) search limits in ‘everyday’ web queries using the major web search engine.

For example, the query shown on this slide searches for the word underspend appearing in Excel spreadsheets (filetype:xls) that can be found on UK government websites (or more specifically, websites hosted on the gov.uk domain (site:gov.uk)).

Another query limit combination I have found useful is:

confidential filetype:ppt

This can turn up presentations that have been delivered at closed corporate events but that have leaked on to the web…

40

Even if you don’t consider yourself a geek or database expert, writing advanced search queries using search limits is but a small step away from writing queries over databases themselves.

One of the most widely used languages for querying databases is SQL. The above slide shows a simple, made up SQL query that could have a similar effect to the search engine query if it were run over a very simple search engine database.

The idea is that we select those webPages where the text content of the webpage contains the word underspend anywhere – the % signs denote wildcard characters so the underspend word can appear preceded or followed by any number of arbitrary characters. We also want the query to be limited to pages that have a particular filetype and domain.
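As a rough, runnable illustration of that kind of query, the sketch below builds a tiny in-memory database with Python's sqlite3 module and runs a LIKE query over it. The table layout (a `webPages` table with `url`, `domain`, `filetype` and `content` columns) and the three sample rows are assumptions made for this example; they are not taken from the slide and bear no relation to how a real search engine stores its index.

```python
import sqlite3

# Assumed, simplified schema for a toy "search engine" database.
pages_db = sqlite3.connect(":memory:")
pages_db.execute(
    "CREATE TABLE webPages (url TEXT, domain TEXT, filetype TEXT, content TEXT)"
)
pages_db.executemany(
    "INSERT INTO webPages VALUES (?, ?, ?, ?)",
    [
        ("http://example.gov.uk/budget.xls", "example.gov.uk", "xls",
         "Q3 underspend by department"),
        ("http://example.gov.uk/news.html", "example.gov.uk", "html",
         "Council reports underspend"),
        ("http://example.com/sheet.xls", "example.com", "xls",
         "underspend forecast"),
    ],
)

# % is the SQL wildcard: any run of characters may appear in its place.
query = """
SELECT url
FROM webPages
WHERE content LIKE '%underspend%'
  AND filetype = 'xls'
  AND domain LIKE '%gov.uk'
"""
matches = [row[0] for row in pages_db.execute(query)]
print(matches)
```

Only the first row satisfies all three conditions: the second has the wrong filetype and the third is on the wrong domain, which mirrors how filetype: and site: narrow a web search.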

Far more complicated queries can be written over far more complex databases. What’s important is that you develop an idea of what sorts of database structure and query are possible, not necessarily that you can run and query such databases yourself.

For more examples, see:

Asking Questions of Data – Garment Factories Data Expedition – http://schoolofdata.org/2013/05/24/asking-questions-of-data-garment-factories-data-expedition/

Asking Questions of Data – Some Simple One-Liners http://schoolofdata.org/2013/05/13/asking-questions-of-data-some-simple-one-liners/

41

One of the simplest, but often one of the most useful, things we can do is to count things. You just need to be creative in what you count!

One of the nice features about working with database query languages such as SQL is that we can write queries that count the number of responses and allow us to rank results on that basis. For example, in a database of public spending transactions with different companies, we could count the number of transactions with a particular company, sum the value of transactions carried out with a particular company, or find the companies with the largest total amount spent.
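Counting, summing and ranking of this sort map directly onto SQL's COUNT, SUM, GROUP BY and ORDER BY. The following sketch uses Python's sqlite3 module with a made-up `spending` table; the supplier names and amounts are purely illustrative.

```python
import sqlite3

# Made-up spending transactions: (supplier, amount in pounds).
spend_db = sqlite3.connect(":memory:")
spend_db.execute("CREATE TABLE spending (supplier TEXT, amount REAL)")
spend_db.executemany(
    "INSERT INTO spending VALUES (?, ?)",
    [
        ("Supplier A", 1200.0),
        ("Supplier B", 300.0),
        ("Supplier A", 800.0),
        ("Supplier C", 5000.0),
        ("Supplier B", 150.0),
    ],
)

# Count and total the transactions per supplier, largest total first.
league_table = spend_db.execute(
    """
    SELECT supplier, COUNT(*) AS transactions, SUM(amount) AS total
    FROM spending
    GROUP BY supplier
    ORDER BY total DESC
    """
).fetchall()

for supplier, transactions, total in league_table:
    print(supplier, transactions, total)
```

Swapping `ORDER BY total DESC` for `ORDER BY transactions DESC` turns the same query into a ranking by number of payments rather than by value, which can tell a different story.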

42

As has already been mentioned, a key part of the journalistic exercise is putting things into context.

When working with data, interpreting what the data says often depends on understanding the context and, more importantly, the caveats that arise by virtue of asking a particular question of a particular dataset that has been collected in a particular way under particular conditions.

That said, given a particular data set, are there any obvious questions we can ask of it?

43

When results are ranked, as for example in the case of league tables, there are often easy picking stories to be had around the top three/bottom three positions. In national rankings, local news stories can be identified if your local schools or council appear in either of those extremes.

For contextualisation purposes, it often makes sense to look at distributions. Many summary statistics report on the mean value, but looking at measures of variation, or spread, about a mean, as well as the position of a median value, can often change the context of a story.

If the lecture room has 20 students in it, each on an income of a £6,000 maintenance loan per year, the total income is £120,000 and their average (mean) income is £6,000. If an academic in the room is on £40,000, the total income for the room is £160,000. The average (mean) income is now just a little over £7,500. If we define a poverty level as a mean income below £10,000, the members of the room are, on average, in poverty.

If a senior academic such as a professor on an income over £65,000 wanders into the room, the total income goes to over £225,000. With 22 people now in the room, the average (mean) income is now over £10,000: the room is out of poverty. The median average income, however, is still at £6,000.
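The lecture room arithmetic can be checked with Python's statistics module. The figures are the ones used above; the professor is taken to be on exactly £65,000, the lower bound of the "over £65,000" in the example.

```python
from statistics import mean, median

students = [6000] * 20                  # 20 students on £6,000 each
with_academic = students + [40000]      # an academic on £40,000 joins
with_professor = with_academic + [65000]  # then a professor (lower bound)

for label, room in [
    ("students only", students),
    ("plus academic", with_academic),
    ("plus professor", with_professor),
]:
    print(label,
          "total:", sum(room),
          "mean:", round(mean(room)),
          "median:", median(room))
```

Two people out of 22 are enough to lift the mean over the £10,000 line, while the median does not move at all, because the middle of the sorted list is still made up entirely of students.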

As well as top, bottom, mean and median, we should also look to outliers. If Bill Gates or Mark Zuckerberg walks into a bar, the average net worth of people in that bar is likely to go up to a level of previously unimagined wealth.

Here are several reasons why you should pay attention to outliers:

- they may be ‘dirty’ or incorrect data points that need to be corrected and that may well raise questions about data quality;
- the outlier may truly be an outlier, a remarkable point and a story in its own right;
- the outlier may skew other measures, such as mean values or other summary statistics. In such cases, it may make sense to use other measures or to rerun the

44

This rather dense graphic is a view over local council spending data in my local area as relates to spend on libraries. The separate charts show the accumulated spend over a period of time with different suppliers. The intention of the display was to provide at a glance a view of accumulated spend with different companies across different directorates and spending areas to see whether any companies had a significant spend compared to other companies.

The table at the bottom shows the top of a league table of companies with the largest accumulated spend by directorate and expense type.

At first glance, the spend on phone lines with different suppliers seems to outweigh the spend on books. How can that be? Are the librarians spending their time calling premium rate phone lines?

If we guess at 20 libraries and a 6 month spend period, then assume that the phone lines correspond to broadband data bills, do the monthly payments per library still seem outrageous? These assumptions are testable via questions to the relevant authorities, of course, but demonstrate the care we need to take when trying to understand why a number that may appear to be large is that large.

See also: Local Council Spending Data – Time Series Charts http://blog.ouseful.info/2013/11/06/local-council-spending-data-time-series-charts/

45

As well as looking for outliers, we should also look for similarities between things we expect to be different and differences between things we expect to be the same, or at least, similar.

46

Looking again at some of my local council’s spending data, I noticed a search on “music” pulled back what appeared to be a shift in responsibility between directorates for spend on school music service provision.

An obvious question that follows is: if the service did change hands (something we can check), was there a resulting difference in the way that the directorates were spending? Could we, for example, identify whether any projects got dropped (or at least, renamed out of scope!)?

This forensic approach can also be used to track the consequences of a shift in control of a service, if we know it has happened. When a service changes hands, we can keep a note of the fact and then a year on look for evidence of whether treatment of the service has changed, at least in terms of its consequences for spending.

See also: What Role, If Any, Does Spending Data Have to Play in Local Council Budget Consultations? http://blog.ouseful.info/2013/11/03/what-role-if-any-does-spending-data-have-to-play-in-local-council-budget-consultations/

47

When asking questions of data, one question can often lead to another.

For example, a query over my local council spending data about amounts spent with the local newspaper, the Isle of Wight County Press, identified a variety of expense types associated with those spending transactions. One such expense type was Advertising & Publicity. This led me to steer the conversation I was having with this expert (data) source on council spending on to a slightly different tack: so who else have you been spending advertising and publicity budgets with?

48

If you are in the position of paying for energy supply bills – electricity and gas – you’ll probably be familiar with the idea that payments are set so you tend to overpay on a monthly basis. After collecting the interest on your overpayments, the utility companies may eventually get round to sending you a small repayment to cover the excess (ex- of any interest, of course…).

Is the same true at the council level?

One thing I noticed in my local council’s spend with supplier Southern Electric was that there appeared to be more than a few “negative payments”. So where were these coming from? The chart shown in this slide has positive payments made by date (not ordered on an evenly spaced timeline) in black, and the magnitude of negative payments shown in red. Where a red triangle sits over a black dot, this shows that a positive and negative payment of the same amount were made on the same day. Why’s that?

Some days show several negative payments – again, what’s happening? There’s not necessarily anything suspicious going on, but what story does this chart appear to tell us, particularly in terms of the similarities in amount of certain positive and negative spends?
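The "red triangle over a black dot" pattern can also be looked for directly in the data. The sketch below is one possible way of doing that, not the method used to produce the chart; the payment records are invented, and a real dataset would need care over rounding and date formats.

```python
from collections import defaultdict

# Invented payments to one supplier: (date, amount in pounds).
payments = [
    ("2013-04-02", 250.0),
    ("2013-04-02", -250.0),
    ("2013-05-01", 400.0),
    ("2013-05-14", -75.5),
]


def cancelling_pairs(payments):
    """Find (date, amount) where a negative payment matches a positive
    payment of the same size made on the same day."""
    by_date = defaultdict(list)
    for date, amount in payments:
        by_date[date].append(amount)

    pairs = []
    for date, amounts in sorted(by_date.items()):
        positives = [a for a in amounts if a > 0]
        for a in amounts:
            if a < 0 and -a in positives:
                positives.remove(-a)  # each positive can be matched only once
                pairs.append((date, -a))
    return pairs


print(cancelling_pairs(payments))
```

A list like this does not explain why the matching payments exist, but it gives a precise set of dates and amounts to put to the council or the supplier.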

49

Just by the by, this chart refines the question I’m asking of the spend with Southern Electric, asking for more information about positive and negative payments made on the gas and electricity accounts separately.

50

As well as similarities and differences, data can tell us tales about trends…

51

Regular releases from the ONS – the Office for National Statistics – provide bread and butter news stories on a regular basis according to a known schedule.

For example, monthly job seeker figures get a monthly write-up in OnTheWight, the hyperlocal news blog local to me. The report makes a comparison between the current figures and figures from the previous month and from the same month of the previous year. The aim is to let us see how the numbers have changed month on month, and year on year.

I started to explore a simple script that would take data directly from the ONS and produce assets that could be reused in a news story – for example, to produce a table showing the change in figures over recent months.

I also started to explore ways in which we could automate the production of prose from the data [code: https://gist.github.com/psychemedia/7536017]. For example, the following phrase was generated automatically from monthly figures:

The total number of people claiming Job Seeker's Allowance (JSA) on the Isle of Wight in October was 2781, up 94 from 2687 in September, 2013, and down 377 from 3158 in October, 2012.

The words up and down were selected based on a simple if-then rule that compared figures to see which was the greater. The numbers and dates are pulled in from the data. The other words are canned phrases.
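The if-then rule and canned phrases can be sketched in a few lines. This is an independent illustration written for these notes, not the code in the gist linked above; the function names and parameters are invented, and the figures are the ones quoted in the example sentence.

```python
def change_phrase(current, previous):
    """Pick a canned word with a simple if-then rule and report the change."""
    if current > previous:
        return f"up {current - previous} from {previous}"
    if current < previous:
        return f"down {previous - current} from {previous}"
    return f"unchanged from {previous}"


def jsa_sentence(area, month, current,
                 last_month, last_month_label,
                 last_year, last_year_label):
    """Slot the numbers and dates into a canned sentence template."""
    return (
        "The total number of people claiming Job Seeker's Allowance (JSA) "
        f"{area} in {month} was {current}, "
        f"{change_phrase(current, last_month)} in {last_month_label}, "
        f"and {change_phrase(current, last_year)} in {last_year_label}."
    )


sentence = jsa_sentence(
    "on the Isle of Wight", "October", 2781,
    2687, "September, 2013",
    3158, "October, 2012",
)
print(sentence)
```

The "unchanged" branch is an addition for completeness: any rule that compares two figures needs to say something sensible when they are equal.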

The automated production of text from data is something that has received attention from several companies, particularly in the area of baseball reports and financial reporting. See for example: http://blog.ouseful.info/2013/05/22/notes-on-narrative-science-and-automated-insight/

52

If we plot a line chart with some quantity against a time axis, we can often see increasing or decreasing trends over time. If we are looking for constant rates of increase in some value, it often makes sense to use a logarithmic (log) scale to display that value on the y-axis, because a constant percentage rate of growth then shows up as a straight line. Periodic trends can also be seen as ‘waves’ appearing in the line over time, but other displays can draw out periodicity or seasonality in a more visually compelling way.
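
On the point about log scales, here is a small sketch using illustrative values only (a made-up quantity growing by 10% per period). It shows why a constant rate of increase is easier to spot on a log scale: the steps between periods keep growing on a linear scale, but are all the same size once we take logs.

```python
import math

# Illustrative values only: a quantity that grows by 10% per period.
values = [100 * 1.1 ** t for t in range(6)]

# On a linear scale the step between periods keeps getting bigger...
linear_steps = [b - a for a, b in zip(values, values[1:])]

# ...but on a log scale every step is the same size, log10(1.1).
log_steps = [math.log10(b) - math.log10(a) for a, b in zip(values, values[1:])]

print([round(s, 1) for s in linear_steps])
print([round(s, 4) for s in log_steps])
```

Equal steps on the log scale are what a straight line on a log-scaled y-axis looks like.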

For example, in these charts (of jobless figures on the Isle of Wight once again) we have months ordered along the horizontal x-axis and the number of Job Seeker's Allowance claimants on the vertical y-axis. The separate coloured lines represent different years.

On the left, we use a legend to identify the lines; on the right is an example of labelling the lines directly.

The lines show strong seasonality in behaviour. Because the island is a tourist destination, job seeker figures tend to fall over the summer months. Putting lines for several years on the same axes allows us to compare annual cycles over time.

53

Another trend we can try to pull out is change over years for each given month. Here, the horizontal x-axis blocks out the months, as before, but within each month we have an ordered range of years. The line within each block thus represents the year-on-year change in numbers within a given month.
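
The regrouping behind a chart like this can be sketched as follows. The records are made up for illustration (they are not real claimant figures) and the function names are hypothetical: the idea is simply to gather the figures by month first, then order the years within each month.

```python
from collections import defaultdict

# Made-up (year, month, count) records for illustration; not real claimant data.
records = [
    (2012, 1, 3300), (2012, 7, 2700), (2012, 12, 3200),
    (2013, 1, 3100), (2013, 7, 2500), (2013, 12, 2900),
]


def blocks_by_month(records):
    """Return {month: [(year, count), ...]} with years in order inside each month."""
    blocks = defaultdict(list)
    for year, month, count in records:
        blocks[month].append((year, count))
    return {month: sorted(points) for month, points in sorted(blocks.items())}


def year_on_year(points):
    """Change between consecutive years within one month's block."""
    return [(y2, c2 - c1) for (y1, c1), (y2, c2) in zip(points, points[1:])]


for month, points in blocks_by_month(records).items():
    print(month, points, year_on_year(points))
```

Each month's block then gets its own short line on the chart, one point per year.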

The step change within each month suggests that the way the figures were calculated changed significantly several years ago.

Further reading: a good guide to statistics as used by government, including a description of the way that “seasonal adjustments” are handled, is provided by the House of Commons Library’s Statistical Literacy Guide: http://www.parliament.uk/business/publications/research/briefing-papers/SN04944/statistical-literacy-guide

54

As well as the patterns we can see over time by plotting data against a time axis, we can also look for patterns in space…

55

In part because they are so recognisable to the majority of people as an idea as well as an artefact, maps are widely used in many publications.

I have already mentioned how the use of a map to compare travel claims by MPs based on their constituency locations provided a way of making a particular sort of comparison between MPs (in particular, a comparison based on geographical location).

But we can take the idea of a map more generally, as a spatial distribution of points that are related in some way, with strong relations represented as spatial proximity.

Things that are close together on the page are taken to be close together in some sort of space, a space which may be conceptual or social, not just (or not even) geographic.

56

Take this map, for example, a map of Twitter users commonly followed by a sample of followers of @UL_journalism.

The map has been laid out so that Twitter users who are heavily interlinked are grouped closely together (for the most part, at least). A network statistic has been used in an attempt to colour clusters of nodes with high interconnection. The coloured regions thus represent a first attempt at identifying different groupings of Twitter users. You will note how the spatial layout algorithm and the grouping/colouring algorithm complement each other well: they both seem to tell a similar story, where the story is that certain groups of individuals are somehow alike.
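
The starting point for a map like this, finding the accounts that are commonly followed by a sample of followers, might be sketched as below. This is only an illustration of the counting step: the account names are hypothetical, the threshold is an assumption of mine, and the layout and cluster colouring (which would need a network analysis tool) are not shown.

```python
from collections import Counter

# Hypothetical sample: each sampled follower mapped to the accounts they follow.
friends_of = {
    "follower_a": {"acct_1", "acct_2", "acct_3"},
    "follower_b": {"acct_1", "acct_3"},
    "follower_c": {"acct_1", "acct_4"},
}


def commonly_followed(friends_of, min_followers=2):
    """Accounts followed by at least min_followers members of the sample."""
    counts = Counter(acct for friends in friends_of.values() for acct in friends)
    return {acct: n for acct, n in counts.items() if n >= min_followers}


print(commonly_followed(friends_of))
```

The accounts that survive the threshold become the nodes of the map; the links between them are then what the layout and colouring steps work on.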

About the technique: http://schoolofdata.org/2014/02/14/mapping-social-positioning-on-twitter/

Let’s have a closer look at some of the regions…

57

This area seems to be Twitter accounts that relate in large part to the University of Lincoln and its related organisations and activities.

58

This area of the map contains accounts associated with Lincoln more generally. Such a map may be useful for identifying companies that are used by students, which in turn may be useful leads for advertising agents looking to sell adverts in university magazines or poster areas.

59

This area of the map actually conflates several different groupings, at least on my reading of it. In fact, it may make sense to try to find clusters within this group on its own and then recolour accordingly.

So what groups can I see? Bottom left, there look to be Lincoln local media outlets. Moving counter-clockwise between the 6 and 3 o’clock positions, we see photography-related users moving up into celebrities. As we move further up towards the 12 o’clock position, we see news sites, both “popular” and more industry-related (@journalismnews, for example).

That there does not appear to be a strong independent cluster of journalists and industry-related sites suggests that, from the sampled followers of @UL_journalism at least, there isn’t necessarily a very strong notion of following these industry lights…

60

One of the things to mention about data mapping and visualisation techniques is that they often tell us things we already (think we) know; in that sense, they are not news. But they may also tell us things we know in new, visually appealing ways. And by making use of such ‘confirmatory’ visualisations and displays we can build confidence within an audience that they know how to interpret these sorts of representation.

61

As the audience becomes comfortable reading the charts and making sense of data, then when there is something new or surprising in the data, the surprise manifests itself in the reading of the data or chart.

For journalists working with data, developing a sense of familiarity with how to interpret and read data when it is just confirming what you already know helps to refine your senses for spotting things that are odd, noteworthy, or newsworthy.

Taking a little bit of time each day to:

- read charts as if they were stories;

- look behind the data to find original sources, such as polls or data-containing news releases, and then compare the original release with the way it is reported, paying particular attention to the points that are highlighted, and how the data is contextualised;

will help you develop some of the skills you will need if you want to be able to identify, develop and treat some of the stories that your specialist source that is data can provide you with, if only you ask…

62

And finally, a couple of handy books and resources on data journalism if you’re interested in reading more generally around the subject…

63
