
Graph Realities


Graph applications were once considered “exotic” and expensive. Until recently, few software engineers had much experience putting graphs to work. However, the use cases are now becoming more commonplace.

This talk explores a practical use case, one which addresses key issues of data governance and reproducible research, and depends on sophisticated use of graph technology.

Consider: some academic disciplines such as astronomy enjoy a wealth of data — mostly open data. Popular machine learning algorithms, open source Python libraries, and distributed systems all owe much to those disciplines and their history of big data.

Other disciplines require strong guarantees for privacy and security. Datasets used in social science research involve confidential details about human subjects: medical histories, wages, home addresses for family members, police records, etc.

Those cannot be shared openly, which impedes researchers from learning about related work by others. Reproducibility of research and the pace of science in general are limited. Nonetheless, social science research is vital for civil governance, especially for evidence-based policymaking (US federal law since 2018).

Even when data may be too sensitive to share openly, often the metadata can be shared. Constructing knowledge graphs of metadata about datasets, along with metadata about authors, their published research, methods used, data providers, data stewards, and so on, provides an effective means to tackle hard problems in data governance.

Knowledge graph work supports use cases such as entity linking, discovery and recommendations, inference over compliance axioms, and so on. This talk reviews the Rich Context AI competition and the related ADRF framework, now in use by more than 15 federal agencies in the US.

We’ll explore knowledge graph use cases, use of open standards and open source, and how this enhances reproducible research. Social science research for the public sector has much in common with data use in industry.

Issues of privacy, security, and compliance overlap, pointing toward what will be required of banks, media channels, etc., and what technologies apply. We’ll look at comparable work emerging in other parts of industry: open source projects, open standards emerging, and in particular a new set of features in Project Jupyter that support knowledge graphs about data governance.



  1. Graph Realities, Paco Nathan @pacoid
  2. Personal Background • applied math, machine learning, distributed systems • R&D for neural networks, incl. hardware (1986-1997) • “guinea pig” for early cloud (2005-ff) • led data teams in industry • assisted popular open source projects: Spark, Jupyter, etc. • development focus on natural language plus adjacent knowledge graph use cases • since 2018, increasingly working at the intersection of public sector + enterprise + open source
  3. Motivations: Not all that long ago, graph applications were considered exotic and expensive. Until recently, few software engineers had much experience putting graphs to work; however, those use cases have now become much more commonplace. This talk explores a practical use case, one that addresses key issues of data governance and reproducible research, and depends on sophisticated use of graph technology. First, some perspectives and industry analysis…
  4. part 1: graph perspectives
  5. Perspectives • the ubiquity of linked data • the tyranny of “thinking relational” • the primacy of working with graphs (and their math analog, tensors) • nouns vs. verbs vs. adjectives (extreme nominalization) • evolution of hardware, cloud, and cluster topologies • the power of graph embeddings
  6. Historical Context • “Data Science: Past and Future”, Rev 2 (2019-05-24) slides • “What is Data Science?”, IBM Data Science Community (2019-03-04) • Just Enough Math, O’Reilly Media (2014) • John Tukey: data analytics as an intrinsically empirical and interdisciplinary field (1962) • most popular data frameworks leveraged some graph processing, albeit obscured, ad-hoc, clumsy… • they did well, given the hardware available at the time
  7. Beauty in sparsity… SuiteSparse Matrix Collection: a widely used set of sparse matrix benchmarks collected from a wide range of applications, sparse.tamu.edu/ …for when you really, really need some interesting graph data
  8. Theme 1: Stuffing graphs into matrices. Algebraic graph theory allowed reuse of linear algebra implementations • e.g., transform a graph to an adjacency matrix; for the 4-node graph {u, v, w, x}: u: [0 1 0 1], v: [1 0 1 1], w: [0 1 0 1], x: [1 1 1 0] • most will be relatively sparse • use LINPACK, BLAS, or libraries built atop • much to leverage: SVD, power method, QR decomp, etc.
  9. Theme 1: Stuffing graphs into matrices. For many real-world problems, the data are essentially graphs: 1. real-world data 2. graph theory for representation 3. convert to sparse matrix for production 4. cost-effective parallel processing at scale. Ergo, leverage low dimensional structure in high dimensional data
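The two slides above can be made concrete with a minimal sketch (illustrative, not from the talk): the 4-node graph {u, v, w, x} from slide 8 as a SciPy sparse adjacency matrix, with degrees and the leading eigenpair falling out of standard linear algebra routines.

```python
# Sketch of "stuffing a graph into a matrix": the 4-node graph
# {u, v, w, x} from slide 8 as a sparse adjacency matrix.
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import eigsh

nodes = ["u", "v", "w", "x"]
A = csr_matrix(np.array([
    [0, 1, 0, 1],   # u -- v, x
    [1, 0, 1, 1],   # v -- u, w, x
    [0, 1, 0, 1],   # w -- v, x
    [1, 1, 1, 0],   # x -- u, v, w
]))

# node degrees fall out of a row sum over the sparse structure
degrees = A.sum(axis=1).A1

# leading eigenpair, i.e. the power-method family the slide mentions
vals, vecs = eigsh(A.asfptype(), k=1, which="LA")
```

At scale, the same pattern applies: the matrix stays sparse, and SVD, QR, or power iteration run over it without ever materializing the dense form.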
  10. N Dims good, 2 Dims baa-d. However, complex graphs cannot be represented as 2D matrices without serious information loss. Ideally, tensors would be a better representation to use for linear algebra libraries. While tensor decomposition is a hard problem, the general class of problems became much more interesting after 2012…
  11. N Dims good, 2 Dims baa-d (cont.) “The real problem is that programmers have spent far too much time worrying about efficiency in the wrong place and at the wrong times; premature optimization is the root of all evil (or at least most of it) in programming.” – Don Knuth
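The tensor point can be sketched with a toy example (all names invented, not from the talk): an edge-labeled graph stored as a 3-way adjacency tensor, one n × n slice per relation type. Collapsing the slices back into a single 2D matrix is exactly the information loss the slide warns about.

```python
# Illustrative sketch: an edge-labeled graph as a 3-way adjacency tensor,
# one (n x n) slice per relation type; node/relation names are invented.
import numpy as np

nodes = ["alice", "bob", "carol"]
relations = ["knows", "cites"]

T = np.zeros((len(relations), len(nodes), len(nodes)), dtype=np.int8)

def add_edge(rel: str, src: str, dst: str) -> None:
    T[relations.index(rel), nodes.index(src), nodes.index(dst)] = 1

add_edge("knows", "alice", "bob")
add_edge("knows", "bob", "carol")
add_edge("cites", "carol", "alice")

# flattening to 2D merges the relation types: which edges were
# "knows" vs. "cites" can no longer be recovered from `flat`
flat = T.sum(axis=0)
```

Tensor decomposition methods (the hard problem the slide mentions) factor T directly, keeping the per-relation structure that the flattened matrix discards.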
  12. Theme 2: Nouns, Verbs, Adjectives. Tracing back to the origins of relational databases, Edgar Codd was furious about how badly SQL and RDBMS had misinterpreted his mathematical modeling of relations. Years of EDW reinforced a sense of extreme nominalization, with so much of the data representation being reduced into dimensions, facts, indexes
  13. Theme 2: Nouns, Verbs, Adjectives. A carry-over of extreme nominalization into graph DBs also over-emphasizes the role of nodes and centrality for adjusting the granularity of graph representations: • discounts the importance of relations • “mostly nouns, a few verbs, some adjectives” • serious information loss. IMO: graph DB frameworks tend to err in this aspect, both in terms of representation and algorithm support.
  14. Part of a long-term narrative arc in IT… • arguably, circa 2001 was the heyday of DW+BI – later acting as an “embedded institution” w.r.t. data science • Agile Manifesto became another “embedded institution” • a generation of developers equated “database” with “relational”, with a belief that legibility of systems == legibility of the data • even so, first-movers collectively made a sudden turn toward NoSQL, partly in reaction to RDBMS pricing • see also: “Statistical Modeling: The Two Cultures”, Leo Breiman, UC Berkeley (2001)
  15. Adjusting data resolution in graphs. In contrast, consider: “Extracting the multiscale backbone of complex weighted networks”, M. Ángeles Serrano, Marián Boguñá, Alessandro Vespignani, PNAS (2009-04-21). Filtering large noisy graphs based on both nodes and edges can be useful for automated approaches in knowledge graph construction; see: github.com/DerwenAI/disparity_filter
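As a simplified sketch of the idea behind that paper (the fuller implementation lives in the DerwenAI/disparity_filter repo linked above; this re-derivation assumes NetworkX): each edge gets a significance α = (1 − w/s)^(k−1) computed at both endpoints, and the backbone keeps only edges significant at a chosen cut-off.

```python
# Simplified sketch of the Serrano et al. disparity filter;
# see github.com/DerwenAI/disparity_filter for the full version.
import networkx as nx

def disparity_alpha(g: nx.Graph) -> None:
    """Annotate each edge with its significance alpha, taken at the
    more significant (smaller alpha) of its two endpoints."""
    for u, v, data in g.edges(data=True):
        alphas = []
        for node in (u, v):
            k = g.degree(node)
            if k > 1:
                strength = sum(d["weight"] for _, _, d in g.edges(node, data=True))
                p = data["weight"] / strength
                alphas.append((1.0 - p) ** (k - 1))
        data["alpha"] = min(alphas) if alphas else 1.0

def backbone(g: nx.Graph, alpha_cut: float = 0.05) -> nx.Graph:
    """Multiscale backbone: keep edges significant at the alpha_cut level."""
    disparity_alpha(g)
    kept = [(u, v) for u, v, d in g.edges(data=True) if d["alpha"] < alpha_cut]
    return g.edge_subgraph(kept)
```

For a hub with one heavy edge among many light ones, only the heavy edge survives the cut; unlike a global weight threshold, the filter is local to each node, which is what lets it preserve structure across scales.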
  16. Theme 3: Hardware in perspective. An emerging trend disrupts the past 15-20 years of software engineering practice: hardware > software > process. Hardware is now evolving more rapidly than software, which is evolving more rapidly than effective process. Moore’s Law is all but dead, although ironically many inefficiencies had been based on it. See also: Pete Warden (2018) regarding TensorFlow.js on low-power devices
  17. Theme 3: Evolution of cloud patterns (2009). UC Berkeley published a 2009 report about early use cases for cloud computing, which foresaw the shape of industry deployments over much of the next decade, and led directly to Apache Mesos and Apache Spark. It’s fascinating to study the contrasts between that 2009 report and its 2019 follow-up. (minor footnote: vimeo.com/3616394)
  18. Theme 3: Evolution of cloud patterns (2019). Early cloud was intentionally “dumbed down” to resemble popular virtualization software – recognizable by IT staff – to support migration. That approach is no longer needed. Also, the physics + economics of cloud use tend to imply fewer “framework” layers. More contemporary patterns will force a restructuring – for efficiency and security – i.e., decoupling computation and storage.
  19. Theme 3: Cluster topologies, by generation (1990s, mid-2000s, current). Opinion: one problem with the software/hardware interface for distributed systems is that it’s taken decades to prioritize the need for handling graphs/tensors directly within popular, accessible open source libraries, without having some commercial database vendor intermediate.
  20. Theme 3: Cluster topologies, by generation (1990s, mid-2000s, current; NB: graph). See also: Jeff Dean (2013) youtu.be/S9twUcX1Zp0
  21. part 2: industry analysis
  22. “Two Cultures” for AI. A more useful distinction: • ML is about the tools and technologies • AI is about use case impact on social systems
  23. Industry surveys for AI and Cloud adoption • “Three surveys of AI adoption reveal key advice from more mature practices”, Ben Lorica, Paco Nathan, O’Reilly Media (2019-02-20) • Episode 7, Domino: surveying “ABC” adoption in enterprise (2019-03-03)
  24. Trends: Knowledge Graphs. We found that close to one quarter of respondents were using knowledge graphs.
  25. Trends: Knowledge Graphs. Healthcare is adopting knowledge graphs more than Finance.
  26. Trends: Knowledge Graphs. Mature practices show more interest in knowledge graphs than firms that are still evaluating ML use cases.
  27. Trends: an accelerating gap in AI funding. Note: firms with an early advantage are investing more, moving still further away from the pack.
  28. Overview of Data Governance, derwen.ai/s/6fqt [architecture diagram: multiple clouds, security and compliance layers, edge inference, streaming data, durable store, DW, business analytics, and data science workflows, with data governance touching each component] We noted a resurgence in data governance; this report examines key themes, vendors, issues, etc.
  29. Unpacking AutoML, derwen.ai/s/yvkg [diagram: meta-learning, feature selection, hyperparameter optimization, model selection, auto scaling, and feature engineering across the data prep / train / evaluate / integrate-deploy workflow on a data platform] We noted an uptick in adoption for a third aspect, co-evolving along with DG and MLOps.
  30. Data Gov dovetails with MLOps and AutoML [same workflow diagram, annotated: data gov trends augment AutoML; data gov practices follow MLOps]
  31. Emerging category: watch the “AI Natives”. Projects (mostly OSS) that leverage knowledge graphs of metadata about datasets and their usage: • Amundsen @ Lyft: data discovery and metadata • Databook @ Uber: manage metadata about datasets (pending OSS) • Marquez @ WeWork, Stitch Fix: collect, aggregate, visualize metadata • Data Hub @ LinkedIn: data discovery and lineage • Metacat @ Netflix: data discovery, metadata service • Dataportal @ Airbnb: integrated data-space (not OSS)
  32. part 3: case study – rich context
  33. Administrative Data Research Facility. Coleridge Initiative, Julia Lane, et al., NYU Wagner • FedRAMP-compliant ADRF framework on AWS GovCloud: “public agency capacity to accelerate the effective use of new datasets” • for research projects using cross-agency sensitive data, in US and EU (and UK) – now in use by 15+ agencies • cited as the first federal example of Secure Access to Confidential Data in the final report of the Commission on Evidence-Based Policymaking • augments Data Stewardship practices; collaboration with Project Jupyter on the related data gov features • funding by Schmidt Futures, Sloan, Overdeck
  34. ADRF and Rich Context. Coleridge Initiative, Julia Lane, et al., NYU Wagner • Rich Context: knowledge graph of metadata about datasets, used for entity linking, link prediction, recommendations, etc. • benefits: agencies, researchers, publishers, data stewards, data providers – see white paper • ongoing ML competition for linking research publications with dataset attribution (first comp. won by Allen AI) • see “Human-in-the-loop AI for scholarly infrastructure” • upcoming book: Rich Search and Discovery for Research Datasets: Building the next generation of scholarly infrastructure
  35. AI for Scholarly Infrastructure. Rich Context overall scope [diagram: (1) leaderboard competition, (2) publisher use cases, (3) HITL via RePEc, etc.; models infer links over a corpus of research pubs, authors accept/reject links, leaderboard evals yield results as inferred linked data] • collaboration with SAGE Pub, Digital Science, RePEc, etc.; partnering with Bundesbank (EU) • knowledge graph vocabulary integrates W3C metadata standards: DCAT, PAV, DCMI, CITO, FaBiO, FOAF, etc. • data as a strategic asset: knowledge graph produces an open corpus for the leaderboard competition • human-in-the-loop AI used to infer metadata, then confirm with authors via RePEc, etc. • adjacent work: graph embedding, meta-learning, persistent identifiers, reproducible research
  36. Related work at Project Jupyter. Make datasets and projects top-level constructs, support metadata exchange and privacy-preserving telemetry from notebook usage (due Oct 2019): • JupyterLab Commenting and real-time collab, similar to Google Docs • JupyterLab Data Explorer: register datasets within research projects • JupyterLab Metadata Explorer: browse metadata descriptions, get recommendations through knowledge graph inference (via extension) • Data Registry (original proposal) • Telemetry (privacy-preserving, reports usage)
  37. Related work at Project Jupyter
  38. Active Learning as a data strategy, derwen.ai/s/d8b7 [diagram: Organizational Learning loop – Customers request Sales, Marketing, Service, Training; Experts decide about edge cases, providing examples; Experts learn through customer interactions and gain insights via model explanations; ML models focus Experts (e.g., weak supervision), act on decisions when possible, and explore uncertainty when needed] Teams of people + machines, leveraging the complementary strengths of both.
  39. Parting thought. In many ways, we’re at a point in the industry with graph data – particularly for use of knowledge graphs of metadata about dataset usage – which resembles conditions immediately before “Web 2.0” became big news. The emerging category of “AI natives” projects mentioned earlier could be parlayed into data utilities more flexible than the AI services which the current hyperscalers are fielding. Watch this space.
  40. Just Enough Math • Rich Context • Hylbert-Speys • Themes + Confs per Pacoid – publications, interviews, conference summaries… https://derwen.ai/paco @pacoid
