Congressional PageRank: Finding Influential US Legislators Using Graph Analytics

Congressional PageRank:
Graph Analytics Of US Congress
William Lyon
Graph Day - Austin, TX
January 2016

About me
Software Developer @Neo4j
william.lyon@neo4j.com
@lyonwj
lyonwj.com
William Lyon

Agenda
• Brief intro to Neo4j graph database
• Modeling US Congress as a graph
• Exploring the 114th Congress
• Finding influential legislators

Neo4j – Key Features
• Native Graph Storage: ensures data consistency and performance
• Native Graph Processing: millions of hops per second, in real time
• “Whiteboard Friendly” Data Modeling: model data as it naturally occurs
• High Data Integrity: fully ACID transactions
• Powerful, Expressive Query Language: requires 10x to 100x less code than SQL
• Scalability and High Availability: vertical and horizontal scaling optimized for graphs
• Built-in ETL: seamless import from other databases
• Integration: drivers and APIs for popular languages

The Whiteboard Model Is the Physical Model

Relational Versus Graph Models
[Figure: Relational model (Person, Person-Friend, and Friend tables) versus graph model (Andreas, Tobias, Mica, and Delia connected by KNOWS relationships)]

Property Graph Model Components
Nodes
• The objects in the graph
• Can have name-value properties
• Can be labeled
Relationships
• Relate nodes by type and
direction
• Can have name-value properties
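[Figure: two PERSON nodes (name: "Dan", born: May 29, 1970, twitter: "@dan"; name: "Ann", born: Dec 5, 1975) and one CAR node (brand: "Volvo", model: "V70"), connected by LOVES, LIVES WITH, DRIVES, and OWNS relationships; one relationship carries the property since: Jan 10, 2011]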
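The components above can be sketched as plain data structures. This is a minimal illustration in Python, not how Neo4j stores data; the class names `Node` and `Relationship`, the helper `outgoing`, and the choice of which node drives the car (and carries the `since` property) are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A labeled node with name-value properties."""
    labels: set
    properties: dict = field(default_factory=dict)


@dataclass
class Relationship:
    """A typed, directed relationship that can also carry properties."""
    start: Node
    type: str
    end: Node
    properties: dict = field(default_factory=dict)


dan = Node({"Person"}, {"name": "Dan", "twitter": "@dan"})
ann = Node({"Person"}, {"name": "Ann"})
car = Node({"Car"}, {"brand": "Volvo", "model": "V70"})

# Assumption for illustration: Dan is the driver and DRIVES carries "since".
relationships = [
    Relationship(dan, "LOVES", ann),
    Relationship(ann, "LOVES", dan),
    Relationship(dan, "DRIVES", car, {"since": "2011-01-10"}),
]


def outgoing(node, rel_type):
    """Return the end nodes of all relationships of rel_type leaving node."""
    return [r.end for r in relationships if r.start is node and r.type == rel_type]


print([n.properties["name"] for n in outgoing(dan, "LOVES")])
```

Direction and type live on the relationship itself, so following `LOVES` from Dan is a lookup rather than a join.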

Cypher: Powerful and Expressive Query Language
CREATE (:Person {name: "Dan"})-[:LOVES]->(:Person {name: "Ann"})
[Figure: the pattern annotated as two NODEs, each with a LABEL (Person) and a PROPERTY (name), joined by the LOVES relationship from Dan to Ann]

Express Complex Queries Easily with Cypher
Find all direct reports and how many people they manage, up to 3 levels down
Cypher Query
MATCH (boss)-[:MANAGES*0..3]->(sub),
      (sub)-[:MANAGES*1..3]->(report)
WHERE boss.name = "John Doe"
RETURN sub.name AS Subordinate,
       count(report) AS Total
SQL Query: [shown as an image on the slide]

Getting Data into Neo4j
Cypher-Based “LOAD CSV” Capability
• Transactional (ACID) writes
• Initial and incremental loads of up to 10 million nodes and relationships
Command-Line Bulk Loader (neo4j-import)
• For initial database population
• For loads with 10B+ records
• Up to 1M records per second
4.58 million things and their relationships load in 100 seconds!

Neo4j
Graph Database
• Property graph datamodel
• Nodes and relationships
• Native graph processing
• Cypher query language

https://github.com/legis-graph/legis-graph

LOAD CSV WITH HEADERS
FROM "file:///legislators.csv" AS line
MERGE (l:Legislator {thomasID: line.thomasID})
SET l = line
MERGE (s:State {code: line.state})<-[:REPRESENTS]-(l)
…
US Congress

Which legislators represent Texas?
MATCH (s:State {code: "TX"})<-[:REPRESENTS]-(l:Legislator)
RETURN l,s;

…include congressional body and party
MATCH (s:State {code: "TX"})<-[:REPRESENTS]-(l:Legislator)
MATCH (p:Party)<-[:IS_MEMBER_OF]-(l)-[:ELECTED_TO]->(b:Body)
RETURN b,l,s,p;

How to find influential legislators?

• Cosponsors are “influenced by” bill sponsors
• Add INFLUENCED_BY relationships
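The idea in the two bullets above can be sketched outside the database. This is a minimal Python illustration with made-up sample data; the `bills` dictionary, the legislator names "A", "B", "C", and the function `influence_edges` are hypothetical and are not taken from the legis-graph dataset.

```python
from collections import Counter

# Hypothetical sample data: bill id -> (sponsor, [cosponsors]).
bills = {
    "hr1": ("A", ["B", "C"]),
    "hr2": ("A", ["B"]),
    "s3": ("C", ["A"]),
}


def influence_edges(bills):
    """Map (cosponsor, sponsor) -> number of bills linking the pair.

    Each pair corresponds to one cosponsor-[:INFLUENCED_BY]->sponsor
    relationship; the count could serve as a relationship weight.
    """
    edges = Counter()
    for sponsor, cosponsors in bills.values():
        for cosponsor in cosponsors:
            if cosponsor != sponsor:
                edges[(cosponsor, sponsor)] += 1
    return edges


edges = influence_edges(bills)
for (cosponsor, sponsor), weight in sorted(edges.items()):
    print(f"{cosponsor} -INFLUENCED_BY-> {sponsor} (bills: {weight})")
```

The relationship points from the cosponsor to the sponsor, so a legislator who attracts many cosponsors accumulates incoming INFLUENCED_BY relationships.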

Betweenness centrality
The number of times a node acts as a bridge along the shortest path between two other nodes.
https://en.wikipedia.org/wiki/Betweenness_centrality
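To make the definition concrete, here is a small Python sketch of betweenness centrality for an unweighted, directed graph, using Brandes' algorithm. It is an illustration only, not the implementation used in the talk; the function name `betweenness` and the three-node example graph are assumptions. Every node must appear as a key in the adjacency dictionary.

```python
from collections import deque


def betweenness(adj):
    """Betweenness for an unweighted, directed graph {node: [neighbors]}."""
    score = {v: 0.0 for v in adj}
    for s in adj:
        # BFS from s: count shortest paths (sigma) and record predecessors.
        order = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, passing each node's share of
        # shortest paths to its predecessors.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    return score


# Hypothetical path graph a -> b -> c: only b lies between two other nodes.
scores = betweenness({"a": ["b"], "b": ["c"], "c": []})
print(scores)
```

For an undirected graph stored with each edge listed in both directions, every pair is counted twice, so the scores are usually halved.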

image credit: https://en.wikipedia.org/wiki/PageRank


PageRank
Cypher approximation
UNWIND range(1,10) AS round
MATCH (l:Legislator)
WHERE rand() < 0.1
MATCH (l:Legislator)-[:INFLUENCED_BY]->(o:Legislator)
SET o.rank = coalesce(o.rank,0) + 1;
http://neo4j.com/blog/using-neo4j-hr-analytics/
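For comparison with the sampling approximation above, the standard iterative formulation of PageRank can be sketched in a few lines of Python. This is a minimal power-iteration example under stated assumptions, not the code used in the talk: the function name `pagerank`, the damping factor of 0.85, the iteration count, and the three sample edges are all chosen for illustration.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over (source, target) pairs."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for source, target in edges:
        out_links[source].append(target)

    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing links is spread evenly.
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        new_rank = {
            n: (1 - damping + damping * dangling) / len(nodes) for n in nodes
        }
        # Each node passes its rank, split equally, along its outgoing links.
        for source, targets in out_links.items():
            for target in targets:
                new_rank[target] += damping * rank[source] / len(targets)
        rank = new_rank
    return rank


# Hypothetical INFLUENCED_BY pairs: (influenced legislator, influencer).
ranks = pagerank([("B", "A"), ("C", "A"), ("A", "C")])
for name, value in sorted(ranks.items(), key=lambda item: -item[1]):
    print(name, round(value, 3))
```

Every iteration touches every relationship in the graph, which is why the computation is global rather than a local traversal from one starting node.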

Neo4j server extensions with Java

curl http://localhost:7474/service/v1/pagerank/Person/KNOWS

PageRank
Graph processing server extension
https://github.com/maxdemarzi/graph_processing
curl http://localhost:7474/service/v1/pagerank/Person/KNOWS

PageRank
neo4j-noderank
https://github.com/graphaware/neo4j-noderank

Two issues
• Local vs global
• Iterative algorithms and graph complexity

Local vs global
Local: OLTP / realtime
Global: offline / batch

Graph complexity
For iterative algorithms like PageRank, it’s all about the complexity of the graph: lots of paths, lots of iterations.

PageRank
Graph global!
Iterative!

Apache Spark is a fast and general engine for large-scale data processing.
• Efficient in-memory data processing and machine learning platform
• Graph analytics with GraphX
• In-memory message passing algorithm
http://spark.apache.org/

PageRank
Spark with Neo4j - Scala
https://github.com/AnormCypher/AnormCypher
Extract subgraph. Run PageRank using Spark GraphX.

import org.anormcypher._
import org.apache.spark.graphx._
import org.apache.spark.graphx.lib._

// `sc` is assumed to be an existing SparkContext (for example in spark-shell).
val total = 100000000
val batch = total/1000000
// One partition per batch; each one pages through the relationships
// 1,000,000 rows at a time using skip/limit.
val links = sc.range(0,batch).repartition(batch).mapPartitionsWithIndex( (i,p) => {
  val dbConn = Neo4jREST("localhost", 9474, "/db/data/", "neo4j", "test")
  val q = "MATCH (l1:Legislator)-[:INFLUENCED_BY]->(l2:Legislator) RETURN id(l1) as from, id(l2) as to skip {skip} limit 1000000"
  p.flatMap( skip => {
    Cypher(q).on("skip"->skip*1000000).apply()(dbConn).map(row =>
      (row[Int]("from").toLong,row[Int]("to").toLong)
    )
  })
})
links.cache
links.count
// Build a GraphX graph from the (from, to) pairs and run 5 PageRank iterations.
val edges = links.map( l => Edge(l._1,l._2, None))
val g = Graph.fromEdges(edges,"none")
val v = PageRank.run(g, 5).vertices

Write back to graph
val res = v.repartition(total/100000).mapPartitions( part => {
  val localConn = Neo4jREST("localhost", 9474, "/db/data/", "neo4j", "test")
  val updateStmt = Cypher("UNWIND {updates} as update MATCH (p) where id(p) = update.id SET p.pagerank = update.rank")
  // Materialize the partition first: an Iterator can only be traversed once,
  // so it cannot be sent as a parameter and then counted.
  val updates = part.map( v => Map("id"->v._1.toLong, "rank" -> v._2.toDouble)).toList
  val count = updateStmt.on("updates"->updates).execute()(localConn)
  Iterator(updates.size)
})

PageRank
Mazerunner
http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
• Enables two-way ETL between Spark and Neo4j
• Run GraphX jobs from data in Neo4j
• Write results back to Neo4j
• Support for:
  • PageRank
  • Closeness Centrality
  • Betweenness Centrality
  • Triangle Counting
  • Connected Components
  • Strongly Connected Components

https://github.com/neo4j-contrib/neo4j-mazerunner

curl http://localhost:7474/service/mazerunner/analysis/pagerank/INFLUENCED_BY

Who are the influential legislators?

Influential legislators by topic

http://portal.graphgist.org/challenge/index.html

Links
• http://www.lyonwj.com/2015/09/20/legis-graph-congressional-data-using-neo4j/
• http://www.lyonwj.com/2015/10/11/congressional-pagerank/
• https://github.com/legis-graph/legis-graph
• https://github.com/neo4j-contrib/neo4j-mazerunner
• http://www.kennybastani.com/2014/11/graph-analytics-docker-spark-neo4j.html
• http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
