This presentation delivers a tour of the graph analytics and open source projects from the graph team at PokitDok. This talk is from the inaugural GraphDay in Austin, TX on Jan 17th, 2016.
Graph Day Texas: Open Source Graph Projects from PokitDok
1. A tour of the PokitDok Health Graph and some open source graph projects
Graph Day Texas, Jan 2016
Denise Gosnell, PhD
Twitter and Github:
@pokitdok
@denisekgosnell
2. PokitDok APIs:
The business of health, for developers.
https://platform.pokitdok.com/
6. Talk Outline:
What we built:
• The HealthGraph
What we’ve open sourced:
• A Gremlin-Python Library
• Custom Titan Build
• Dynamic JSON Graph [WIP]
• HealthGraph DSL [WIP]
10. Health Graph: Transactions as Trees
• We treat transactions as first-class objects in the graph
• Buried in the depths of an X12 transaction are the entities of interest
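As a rough illustration of treating a transaction as a tree, the sketch below flattens a nested record into vertices and edges. It is not PokitDok's loader: the function name `transaction_to_graph`, the `has_` edge-label prefix, and every field name and value in the sample claim are hypothetical, and a real X12 transaction is far deeper than this.

```python
from itertools import count


def transaction_to_graph(transaction, label="transaction"):
    """Flatten a nested dict into (vertices, edges), treating it as a tree.

    Hypothetical helper for illustration only.
    """
    vertices, edges = [], []
    ids = count()

    def is_nested(value):
        # A value is nested if it is an object or a list containing objects.
        return isinstance(value, dict) or (
            isinstance(value, list) and any(isinstance(i, dict) for i in value)
        )

    def visit(node, node_label):
        vertex_id = next(ids)
        # Non-nested fields become properties on the vertex itself.
        properties = {k: v for k, v in node.items() if not is_nested(v)}
        vertices.append(
            {"id": vertex_id, "label": node_label, "properties": properties}
        )
        for key, value in node.items():
            if not is_nested(value):
                continue
            children = value if isinstance(value, list) else [value]
            for child in children:
                # Each nested object becomes a child vertex linked by an edge.
                if isinstance(child, dict):
                    child_id = visit(child, key)
                    edges.append(
                        {"out": vertex_id, "in": child_id, "label": "has_" + key}
                    )
        return vertex_id

    visit(transaction, label)
    return vertices, edges


# Made-up sample record; the identifiers are placeholders, not real data.
claim = {
    "transaction_type": "claim",
    "provider": {"npi": "0000000000", "specialty": "family practice"},
    "payer": {"name": "Insurance Company A"},
    "services": [{"cpt_code": "99213"}, {"cpt_code": "99214"}],
}
vertices, edges = transaction_to_graph(claim)
```

The root vertex stands for the transaction itself, and the provider, payer, and service entities that were buried inside it become vertices reachable by one edge each.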
Interactive graph available at:
https://fullmetalhealth.com/dsl/
14. HealthGraph: Predictive Models
• What is the probability claim X will be denied?
• A new customer just searched for “family practice”; recommend the best provider within 10 miles.
• Given a CPT code, what is the expected reimbursement rate from insurance company A in zip code 37601?
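To make the third question concrete, here is a naive empirical baseline: the mean paid-to-charged ratio over matching historical claims. This is not the predictive model behind the HealthGraph, only a sketch of the quantity being asked for. The function name, the record fields (`cpt_code`, `payer`, `zip_code`, `charged`, `paid`), and all sample values are assumptions made for illustration.

```python
def expected_reimbursement_rate(claims, cpt_code, payer, zip_code):
    """Mean of paid/charged over matching claims, or None if none match.

    Hypothetical baseline; field names are assumed, not taken from any API.
    """
    ratios = [
        c["paid"] / c["charged"]
        for c in claims
        if c["cpt_code"] == cpt_code
        and c["payer"] == payer
        and c["zip_code"] == zip_code
        and c["charged"] > 0  # skip records that would divide by zero
    ]
    return sum(ratios) / len(ratios) if ratios else None


# Made-up claim history for illustration.
history = [
    {"cpt_code": "99213", "payer": "A", "zip_code": "37601", "charged": 100.0, "paid": 50.0},
    {"cpt_code": "99213", "payer": "A", "zip_code": "37601", "charged": 200.0, "paid": 150.0},
    {"cpt_code": "99213", "payer": "B", "zip_code": "37601", "charged": 100.0, "paid": 90.0},
]
rate = expected_reimbursement_rate(history, "99213", "A", "37601")
missing = expected_reimbursement_rate(history, "99214", "A", "37601")
```

A simple average like this has nothing to say when no matching claims exist, which is one reason a model that can generalize across neighboring codes, payers, or regions is attractive.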
17. Our HealthGraph Production Stack
• Titan 0.5.3
• TinkerPop’s Blueprints 2.5.0
• Cassandra and Elasticsearch
Gremlin-Python
18. Gremlin-Python Motivation
• Lighter context switching between development tools and environments
• Incompatible syntax issues between Gremlin and Python
• Using Python.
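One plausible source of the syntax incompatibility mentioned above is that several Gremlin step names are reserved words in Python, so they cannot be used directly as method names. The sketch below checks an illustrative, non-exhaustive list of step names against Python's keyword list and applies the trailing-underscore convention from PEP 8. The helper `python_safe_name` is hypothetical and is not claimed to be how the PokitDok library resolves the clash.

```python
import keyword

# Illustrative subset of Gremlin step names; not an exhaustive list.
GREMLIN_STEPS = ["out", "in", "both", "as", "and", "or", "has", "back"]


def python_safe_name(step):
    """Append an underscore when a step name is a Python reserved word."""
    return step + "_" if keyword.iskeyword(step) else step


# Steps that cannot be written as plain attribute names in Python.
clashes = [s for s in GREMLIN_STEPS if keyword.iskeyword(s)]

# Mapping from Gremlin step name to a name Python will accept.
safe_names = {s: python_safe_name(s) for s in GREMLIN_STEPS}
```

In this list, `in`, `as`, `and`, and `or` clash with Python keywords, while `out`, `both`, `has`, and `back` do not.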
Twitter and Github:
@corbinbs
@denisekgosnell
19. Gremlin-Python Test Drive
Option 1: Grab our Docker container
1. Install Docker
https://www.docker.com/docker-toolbox
2. Jump in the “Docker Quickstart Terminal”
3. Fire up our example container:
docker run -i -t pokitdok/gremlin-python-test-drive
Option 2: Shell script install
1. Clone our repo:
https://github.com/pokitdok/gremlin-python
2. Run the set-up scripts:
$ ./test_drive/setup.sh && ./test_drive/run.sh
27. Motivation for Release of Custom Build:
Graph Production Stack: Titan 0.5.x ships with Hadoop 2.2
API Production Stack: contains Cloudera’s CDH5 containers and Hadoop 2.6.0
You guessed it: infrastructure dependency errors upon integration
The Hadoop 2.6.0 API is not fully backwards compatible with Hadoop 2.2
28. Released:
A modification of the Titan 0.5.3 build to upgrade to Hadoop 2.6.0 and resolve numerous conflicts among transitive dependencies.
… someone had to do it.
Grab it here:
https://github.com/pokitdok/titan/tree/0.5.3-hadoop2.6.0
Tested for Cassandra but not HBase.
31. Dynamic JSONLoader Future Work
1. Extract PokitDok HealthGraph specific features
2. Move to Titan 1.0 and TP3 compatibility
3. Release on PokitDok GitHub
36. DSL Future Work
1. Move to Titan 1.0 and TP3 compatibility
2. Release on PokitDok GitHub
3. Current Open Question:
We are looking for(ward to) more documentation on implementing custom Gremlin steps (DSLs) in TP3
Editor's notes
Personal story of how I got into graph analytics; graph lineage
We made all of our stuff available via API.
For something the crowd can go see ---
Relevant Timing: Xerox is powered by PokitDok
We are tackling two wild fields.
Navigating the wild and quickly changing space of graph technology while also trying to modernize healthcare
transitional purposes only
what kind of data do we have
We are using graph paths to calculate a high density of providers with a co-occurrence across payors – we can also find this by plan.
GOAL: infer provider networks across plans – or whichever slice of the data we prefer
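The co-occurrence idea in the notes above can be sketched without a graph database: for each pair of providers, count how many payors list both. The notes describe doing this with graph paths, so this plain-dictionary version is a simplified stand-in, and the function name `provider_cooccurrence` and the payor and provider identifiers are made up.

```python
from collections import Counter
from itertools import combinations


def provider_cooccurrence(payor_to_providers):
    """Count, for each provider pair, how many payors list both providers.

    Hypothetical helper; input maps a payor name to its provider IDs.
    """
    pair_counts = Counter()
    for providers in payor_to_providers.values():
        # Sort and de-duplicate so each pair has one canonical ordering.
        for pair in combinations(sorted(set(providers)), 2):
            pair_counts[pair] += 1
    return pair_counts


# Made-up payor rosters for illustration.
rosters = {
    "payor_a": ["p1", "p2", "p3"],
    "payor_b": ["p1", "p2"],
    "payor_c": ["p2", "p3"],
}
counts = provider_cooccurrence(rosters)
```

Pairs with high counts are candidates for belonging to the same inferred network. Keying the input by plan instead of payor gives the per-plan slice mentioned in the notes.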
we can also answer all sorts of questions
Current healthcare infrastructure is fractured and antiquated… they can’t answer these questions.
4.3 million providers
This is a slide about why
data management:
data engineering: loading of data into a database
data science: probabilistic inferences
updates to transitive dependencies aren’t sexy, but aren’t you glad you don’t have to do this now? Someone had to do it.
There were people on the Titan users group who suggested they had built Titan 0.5 for Hadoop 2.6 themselves, but we could not find any such build publicly. That is why we released this.
slightly more interesting than dependency whack a mole --
Bulk load of JSON from sequenced HDFS files
We have created a Groovy-Gremlin based graph DSL for entity retrieval. The DSL is accessible from client scripts in Python or Groovy, or via TinkerPop’s Gremlin console.