Monkigras 2012: Networks Of Data

networks of data

Matt Biddulph
@mattb | matt@hackdiary.com

Every data scientist has their own favourite way of representing their data. For some people
it’s Excel, and they think in rows and columns. For others it’s matrices, and they use linear
algebra to interrogate their data. For me, it’s graphs.

We’re all pretty used to the idea that you can model human relationships in a social graph.

“Social network analysis views social relationships in terms of network theory consisting of
nodes and ties. Nodes are the individual actors within the networks, and ties are the
relationships between the actors.”

There’s a pretty deep area of mathematical study called Social Network Analysis that goes
back at least 20 years. It tries to create insight by analysing the structure of social networks,
and usually doesn’t incorporate any elements of culture or sociology in doing so.

Centrality measures

It led to the creation of techniques like centrality measures, which try to find the nodes that
are most central to the network. These might be the kind of people on Twitter who have the
highest chance of being retweeted.
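
As a rough illustration, here is one of the simplest centrality measures, in-degree centrality,
computed over a made-up follower graph. The names and edges are hypothetical, and this is only
a sketch of the general idea; it is not necessarily the measure used for the diagrams in this
talk.

```python
from collections import Counter

# Hypothetical follower edges as (follower, followed) pairs. All names are invented.
follows = [
    ("alice", "bob"),
    ("carol", "bob"),
    ("dave", "bob"),
    ("bob", "carol"),
    ("dave", "carol"),
    ("alice", "dave"),
]

def in_degree_centrality(edges):
    """Return each node's follower count divided by (n - 1), the maximum possible."""
    nodes = {node for edge in edges for node in edge}
    in_degree = Counter(followed for _, followed in edges)
    denominator = max(len(nodes) - 1, 1)
    return {node: in_degree[node] / denominator for node in nodes}

centrality = in_degree_centrality(follows)
most_central = max(centrality, key=centrality.get)
print(most_central, centrality[most_central])
```

In this toy graph everyone else follows bob, so he gets the highest possible score. Other
measures (betweenness, eigenvector centrality and so on) weigh the graph structure differently.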

Community detection

There are also community detection algorithms that try to find the most tightly-knit
subgraphs and cluster those nodes together. If you ran this over the network of people I
follow on Twitter, it might be able to pick out my work colleagues or the people I socialise
with face-to-face.
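
To make the idea concrete, here is a minimal sketch of one community detection technique, label
propagation, on an invented graph of two tight triangles joined by a single bridge. The data is
hypothetical, and the deterministic tie-breaking rule (pick the alphabetically largest label) is
an assumption made so the example is repeatable; typical implementations break ties randomly and
can give different answers from run to run.

```python
from collections import Counter, defaultdict

# Hypothetical undirected friendships: two triangles and one bridge between them.
edges = [
    ("ann", "bob"), ("bob", "cat"), ("ann", "cat"),
    ("dan", "eve"), ("eve", "fay"), ("dan", "fay"),
    ("cat", "dan"),
]

def label_propagation(edges, max_rounds=20):
    """Group nodes by repeatedly adopting the most common label among neighbours."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    # Every node starts in its own community, labelled with its own name.
    labels = {node: node for node in neighbours}

    for _ in range(max_rounds):
        changed = False
        for node in sorted(neighbours):
            counts = Counter(labels[n] for n in neighbours[node])
            best = max(counts.values())
            # Assumed tie-break: the alphabetically largest of the most common labels.
            new_label = max(label for label, c in counts.items() if c == best)
            if new_label != labels[node]:
                labels[node] = new_label
                changed = True
        if not changed:
            break

    groups = defaultdict(set)
    for node, label in labels.items():
        groups[label].add(node)
    return sorted(sorted(group) for group in groups.values())

communities = label_propagation(edges)
print(communities)
```

On this graph the two triangles end up as separate communities, because each node has more
neighbours inside its own triangle than across the bridge.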

People you may know

Sites like LinkedIn build almost-telepathic “people you may know” features by walking around
the graph starting at your node and looking for people that show up a lot in your
neighbourhood that you haven’t connected with yet.
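
A minimal sketch of that walk, assuming the simplest version: count how many of your existing
connections each not-yet-connected person shares with you. The graph below is invented, and this
is not a description of how LinkedIn actually implements the feature.

```python
from collections import Counter

# Hypothetical connection graph: each person maps to the set of people they know.
connections = {
    "you": {"ann", "bob", "cat"},
    "ann": {"you", "bob", "dan"},
    "bob": {"you", "ann", "dan", "eve"},
    "cat": {"you", "dan"},
    "dan": {"ann", "bob", "cat"},
    "eve": {"bob"},
}

def people_you_may_know(graph, person, limit=3):
    """Rank people two steps away by how many mutual connections they share."""
    direct = graph[person]
    mutual_counts = Counter()
    for friend in direct:
        for candidate in graph[friend]:
            if candidate != person and candidate not in direct:
                mutual_counts[candidate] += 1
    # Most mutual connections first; names break ties so the order is stable.
    ranked = sorted(mutual_counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:limit]

suggestions = people_you_may_know(connections, "you")
print(suggestions)
```

Here dan shows up through all three of your connections, so he is the strongest suggestion.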

To demonstrate what these techniques can do, I downloaded some data from GitHub’s API. I
wanted to identify and map London’s most-connected developers.

[Figure: network diagram of London developers on GitHub, each node labelled with a username.]

This diagram, created in 2009, has several dimensions. Each node is a London developer with
a GitHub account. Lines show follower relationships. Nodes are sized according to number of
followers, and coloured according to network centrality (red for most-central). The layout
shows community structure: for example, the top-left cluster is mostly Perl developers.

[Figure: cropped detail of the London developer network diagram.]

Let’s go beyond purely social data. James Governor suggested I explore the connection
between music taste and choice of programming language. I wrote a script to correlate
last.fm usernames with GitHub usernames and created a graph structure linking the music
genre taste of each developer to the languages their GitHub projects are implemented in.
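
The original script isn't shown here, but the linking step might look something like the sketch
below. It assumes that a person uses the same username on both sites and that the language and
genre lists have already been fetched; the usernames and data are invented for illustration.

```python
from collections import Counter

# Hypothetical, already-fetched data keyed by username.
github_languages = {
    "dev_a": ["Ruby"],
    "dev_b": ["Ruby", "JavaScript"],
    "dev_c": ["JavaScript"],
}
lastfm_genres = {
    "dev_a": ["folk"],
    "dev_b": ["folk", "indie"],
    "dev_d": ["metal"],
}

def language_genre_edges(languages_by_user, genres_by_user):
    """Build weighted (language, genre) edges from users present in both datasets."""
    edges = Counter()
    # Assumption: identical usernames on both sites belong to the same person.
    for user in languages_by_user.keys() & genres_by_user.keys():
        for language in languages_by_user[user]:
            for genre in genres_by_user[user]:
                edges[(language, genre)] += 1
    return edges

edges = language_genre_edges(github_languages, lastfm_genres)
for (language, genre), weight in sorted(edges.items()):
    print(language, genre, weight)
```

Each edge weight counts how many matched developers connect that language to that genre; users
found on only one site (dev_c and dev_d here) contribute nothing.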

This diagram is just a small sample amongst the people I follow on GitHub and last.fm, not
enough to provide a statistically significant judgement.

In this small sample we can see that my Ruby-coding friends tend towards singer-songwriter
acoustic folk, and the JavaScript coders are all about rock and indie.

This is a great book that goes into these techniques in depth. However, it’s useful for any
networked data, not just social networks. And it’s useful to anyone, not just startups.

So let’s take a step back and think about what other kinds of graph we could form, from what
kinds of data.

I used to work in location apps at Nokia, and so I naturally think of places. Wouldn’t it be
interesting to study the connections between cities instead of people? For example, people
probably fly more often between NYC and LA than they do between NYC and New Jersey. We
could re-draw the map based on closeness in the travel network.

In 2011 I turned to the Hadoop cluster at Nokia and took a sample of several weeks of logs
from our routing servers. These are used every time someone uses our maps application to
request a driving route from one place to another. Every time someone drove from A to B, I
made an edge in a “place graph” from A to B.
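
A minimal sketch of that edge-building step, using a handful of invented route requests in place
of the real logs. The record format and the decision to skip routes that start and end in the
same place are assumptions for illustration; the actual job ran on Hadoop over far more data.

```python
from collections import Counter

# Hypothetical route requests as (origin, destination) pairs.
route_requests = [
    ("London", "Brighton"),
    ("London", "Brighton"),
    ("Brighton", "London"),
    ("Madrid", "Lisbon"),
    ("London", "London"),
]

def build_place_graph(requests):
    """Count routes per directed (origin, destination) pair to get weighted edges."""
    weights = Counter()
    for origin, destination in requests:
        # Assumption: trips within a single place say nothing about links between places.
        if origin != destination:
            weights[(origin, destination)] += 1
    return weights

place_graph = build_place_graph(route_requests)
for (origin, destination), weight in sorted(place_graph.items()):
    print(origin, "->", destination, weight)
```

The edge weight is the number of requested routes, which is the kind of "strength of connection"
a clustering or layout tool can then work with.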

I ran the data through Gephi and asked it to cluster it based on the strength of connections
between towns. The result is a not-quite-geographic new map of the world, where two cities
are close to each other if people often drive between them.

[Figure: clustered place graph, with clusters labelled UK; China; Korea, Japan, etc; Spain;
Most of Europe; India; Pakistan; Finland; Russia.]

As you’d expect, because the UK is an island, people don’t drive in and out of it very often.
Spain and Portugal are not islands, but they appear separate because they’re attached to the
rest of Europe by a very narrow neck of land. So people are much more likely to fly than drive
out of Spain.

Times Square (New York) = Piccadilly Circus (London)

What kind of questions can this data answer? Say I’m coming to London for the first time and
I’m familiar with New York. I could ask a friend what the equivalent of Times Square is in
London. If they know both towns, they’d probably tell me that Times Square is the Piccadilly
Circus of New York.

What is the Holborn of
Amsterdam?

... the De Pijp of New York?

... the Williamsburg of London?

But if we delve into the place graph, we could answer much more interesting questions, and
create a “neighbourhood isomorphism” from city to city. People who like the Mission in SF
and Shoreditch in London could find out that Williamsburg is probably the best place for
them to stay in New York.
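
One possible way to approximate such a mapping, sketched under loud assumptions: treat two
neighbourhoods as similar when the same people visit both, and score candidates with Jaccard
similarity. The visitor sets below are invented, and the talk does not say which similarity
measure would actually be used.

```python
# Hypothetical sets of anonymous user ids seen in each (city, neighbourhood).
visitors = {
    ("San Francisco", "Mission"): {"u1", "u2", "u3"},
    ("London", "Shoreditch"): {"u1", "u2", "u4"},
    ("New York", "Williamsburg"): {"u1", "u2", "u3", "u4"},
    ("New York", "Times Square"): {"u4", "u5", "u6"},
}

def jaccard(a, b):
    """Size of the overlap divided by size of the union (0.0 for two empty sets)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def best_match(visitors, liked, target_city):
    """Score each neighbourhood in target_city by its average similarity to liked places."""
    scores = {}
    for place, people in visitors.items():
        city, name = place
        if city != target_city or place in liked:
            continue
        total = sum(jaccard(people, visitors[other]) for other in liked)
        scores[name] = total / len(liked)
    return max(scores, key=scores.get), scores

liked = [("San Francisco", "Mission"), ("London", "Shoreditch")]
best, scores = best_match(visitors, liked, "New York")
print(best, scores)
```

With this toy data Williamsburg shares most of its visitors with both liked neighbourhoods,
while Times Square shares almost none, so Williamsburg comes out on top.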

the Place Graph is just like the Social Graph

This is just one example of viewing data as a graph and then using Social Graph analytics on
it. Many more are possible: the link structure of Wikipedia, the co-occurrence of
topics in a newspaper, the implicit social network of @replies on Twitter, etc.

Thanks!
Matt Biddulph
@mattb | matt@hackdiary.com
