The document discusses modeling the US Congress as a graph database using Neo4j and analyzing relationships between legislators to identify influential members. It describes loading congressional data into Neo4j and querying the relationships between legislators and the states they represent. Methods for identifying influential legislators include degree centrality, betweenness centrality, and PageRank, computed on a bill co-sponsorship graph using both Neo4j and Apache Spark with GraphX.
3. Agenda
• Brief intro to Neo4j graph database
• Modeling US Congress as a graph
• Exploring the 114th Congress
• Finding influential legislators
4. Neo4j – Key Features
• Native Graph Storage: ensures data consistency and performance
• Native Graph Processing: millions of hops per second, in real time
• “Whiteboard Friendly” Data Modeling: model data as it naturally occurs
• High Data Integrity: fully ACID transactions
• Powerful, Expressive Query Language: requires 10x to 100x less code than SQL
• Scalability and High Availability: vertical and horizontal scaling optimized for graphs
• Built-in ETL: seamless import from other databases
• Integration: drivers and APIs for popular languages
7. Relational Versus Graph Models
[Figure: the same friendship data in both models. Relational model: Person, Friend, and Person-Friend tables. Graph model: nodes ANDREAS, TOBIAS, MICA, and DELIA connected by KNOWS relationships.]
8. Property Graph Model Components
Nodes
• The objects in the graph
• Can have name-value properties
• Can be labeled
Relationships
• Relate nodes by type and direction
• Can have name-value properties
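[Figure: two PERSON nodes (name: “Dan”, born: May 29, 1970, twitter: “@dan”; name: “Ann”, born: Dec 5, 1975) and one CAR node (brand: “Volvo”, model: “V70”), connected by LOVES, LIVES WITH, DRIVES, and OWNS relationships. One relationship carries the property since: Jan 10, 2011.]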
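The components above can be sketched as plain data structures. This is only an illustrative model, not how Neo4j stores data: the `Node` and `Relationship` classes, the `outgoing` helper, and the relationship directions chosen here are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """An object in the graph: a set of labels plus name-value properties."""
    labels: set
    properties: dict = field(default_factory=dict)


@dataclass
class Relationship:
    """A typed, directed connection between two nodes, with its own properties."""
    start: Node
    type: str
    end: Node
    properties: dict = field(default_factory=dict)


# Values taken from the slide; the relationship directions are assumed.
dan = Node({"Person"}, {"name": "Dan", "twitter": "@dan"})
ann = Node({"Person"}, {"name": "Ann"})
car = Node({"Car"}, {"brand": "Volvo", "model": "V70"})

relationships = [
    Relationship(dan, "LOVES", ann),
    Relationship(ann, "LOVES", dan),
    Relationship(dan, "DRIVES", car),
]


def outgoing(node, rel_type):
    """Return the nodes reached from `node` by following `rel_type` forward."""
    return [r.end for r in relationships
            if r.start is node and r.type == rel_type]


names = [n.properties["name"] for n in outgoing(dan, "LOVES")]
```

Because type and direction are part of every relationship, a question such as "whom does Dan love?" is a filter on both, which is roughly what a Cypher pattern like `(dan)-[:LOVES]->(x)` expresses.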
13. Getting Data into Neo4j
Cypher-Based “LOAD CSV” Capability
• Transactional (ACID) writes
• Initial and incremental loads of up to 10 million nodes and relationships
Command-Line Bulk Loader
neo4j-import
• For initial database population
• For loads with 10B+ records
• Up to 1M records per second
4.58 million things and their relationships… Loads in 100 seconds!
17. https://github.com/legis-graph/legis-graph
LOAD CSV WITH HEADERS
FROM "file:///legislators.csv" AS line
MERGE (l:Legislator {thomasID: line.thomasID})
SET l = line
MERGE (s:State {code:line.state})<-[:REPRESENTS]-(l)
…
US Congress
19. Which legislators represent Texas?
MATCH (s:State {code: "TX"})<-[:REPRESENTS]-(l:Legislator)
RETURN l,s;
20. …include congressional body and party
MATCH (s:State {code: "TX"})<-[:REPRESENTS]-(l:Legislator)
MATCH (p:Party)<-[:IS_MEMBER_OF]-(l)-[:ELECTED_TO]->(b:Body)
RETURN b,l,s,p;
31. Betweenness centrality
The number of times a node acts as a bridge along the shortest path between two other nodes.
https://en.wikipedia.org/wiki/Betweenness_centrality
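To make the definition concrete, here is a minimal sketch that computes betweenness on a tiny unweighted graph using Brandes' algorithm. It is not the method used in the talk; the `betweenness` function and the three-node example graph are assumptions for illustration only.

```python
from collections import deque


def betweenness(adj):
    """Brandes' algorithm on an unweighted graph given as {node: [neighbors]}.

    Scores count ordered (source, target) pairs, so for an undirected
    graph every unordered pair is counted twice.
    """
    score = {v: 0.0 for v in adj}
    for s in adj:
        # BFS from s: count shortest paths (sigma) and record predecessors.
        sigma = {v: 0 for v in adj}
        sigma[s] = 1
        dist = {s: 0}
        preds = {v: [] for v in adj}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, accumulating each node's
        # share of the shortest paths that pass through it.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    return score


# Hypothetical co-sponsorship ties: b is the only link between a and c.
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
scores = betweenness(adj)
```

In this example only `b` gets a non-zero score, because every shortest path between `a` and `c` has to pass through it.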
35. PageRank
Cypher approximation
UNWIND range(1,10) AS round
MATCH (l:Legislator)
WHERE rand() < 0.1
MATCH (l:Legislator)-[:INFLUENCED_BY]->(o:Legislator)
SET o.rank = coalesce(o.rank,0) + 1;
http://neo4j.com/blog/using-neo4j-hr-analytics/
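For comparison with the sampling-based Cypher approximation above, the following is a minimal sketch of PageRank by power iteration. It is a textbook formulation, not the code from the talk; the `pagerank` function, the damping factor of 0.85, the iteration count, and the three-legislator edge list are assumptions chosen for the example.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over (source, target) pairs.

    An edge (a, b) is read here as "a is INFLUENCED_BY b",
    so rank flows from a to b.
    """
    nodes = sorted({n for edge in edges for n in edge})
    count = len(nodes)
    out_degree = {n: 0 for n in nodes}
    for src, _ in edges:
        out_degree[src] += 1

    rank = {n: 1.0 / count for n in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no outgoing edges is spread evenly.
        dangling = sum(rank[n] for n in nodes if out_degree[n] == 0)
        base = (1 - damping) / count + damping * dangling / count
        new_rank = {n: base for n in nodes}
        for src, dst in edges:
            new_rank[dst] += damping * rank[src] / out_degree[src]
        rank = new_rank
    return rank


# Hypothetical legislators: a and b are influenced by c, c by a.
edges = [("a", "c"), ("b", "c"), ("c", "a")]
ranks = pagerank(edges)
top = max(ranks, key=ranks.get)
```

Unlike the random-sampling approximation, this version redistributes rank on every round, so a node's score depends on the scores of the nodes pointing at it rather than only on how many of them there are.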
46. • Efficient in-memory data processing and machine learning platform
• Graph analytics with GraphX
• In-memory message passing algorithm
Apache Spark is a fast and general engine for large-scale data processing.
http://spark.apache.org/
49. Extract subgraph. Run PageRank using Spark GraphX.
import org.anormcypher._
import org.apache.spark.graphx._
import org.apache.spark.graphx.lib._

val total = 100000000
val batch = total/1000000  // number of 1M-row pages to fetch
// Each partition pages through the INFLUENCED_BY relationships via skip/limit
val links = sc.range(0,batch).repartition(batch).mapPartitionsWithIndex( (i,p) => {
  val dbConn = Neo4jREST("localhost", 9474, "/db/data/", "neo4j", "test")
  val q = "MATCH (l1:Legislator)-[:INFLUENCED_BY]->(l2:Legislator) RETURN id(l1) as from, id(l2) as to skip {skip} limit 1000000"
  p.flatMap( skip => {
    Cypher(q).on("skip"->skip*1000000).apply()(dbConn).map(row =>
      (row[Int]("from").toLong,row[Int]("to").toLong)
    )
  })
})
links.cache
links.count  // action that forces the load and fills the cache

// Build the GraphX graph and run 5 PageRank iterations
val edges = links.map( l => Edge(l._1,l._2, None))
val g = Graph.fromEdges(edges,"none")
val v = PageRank.run(g, 5).vertices
50. Write back to graph
val res = v.repartition(total/100000).mapPartitions( part => {
  val localConn = Neo4jREST("localhost", 9474, "/db/data/", "neo4j", "test")
  val updateStmt = Cypher("UNWIND {updates} as update MATCH (p) where id(p) = update.id SET p.pagerank = update.rank")
  // Materialize the partition first: an Iterator can only be traversed once
  val updates = part.map( v => Map("id"->v._1.toLong, "rank" -> v._2.toDouble)).toList
  val count = updateStmt.on("updates"->updates).execute()(localConn)
  Iterator(updates.size)
})
// mapPartitions is lazy: nothing is written until an action such as res.count runs