2. PwC | Data-centric design and the knowledge graph
Agenda
A useful vision
Growth and emerging markets
Data-centric methods
Use cases
Adoption
Takeaways
4. PwC | Data-centric design and the knowledge graph
In the mirrorworld,
everything will have a
paired twin.
Kevin Kelly in Wired
Feb 12, 2019
June 2019
5. PwC | Data-centric design and the knowledge graph
What’s a digital twin? Depends on who you ask
GE: “At its core, the Digital Twin consists of sophisticated models or system
of models based on deep domain knowledge of specific industrial assets.
The Digital Twin is informed by a massive amount of design,
manufacturing, inspection, repair, online sensor and operational data.”
Goals: Predictive analytics, knowledge representation, etc.
From “What is a digital twin?” GE Digital, 2019
Finger Food: “We Are Industry-leading Digital Twin Holographic Service
Providers….
Imagine taking all of your disparate data sets from multiple spreadsheets
and diagrams and combining them into one live-streaming visual
holographic representation of your data – at full scale.”
Goals: “We can take your data from your spreadsheets and turn it into
clear, actionable context like never before…”
From “Digital Twin Solutions to Improve your Bottom Line,” Finger Food
Advanced Technology Group, 2019
6. PwC | Data-centric design and the knowledge graph
What digital twin environments and AI need versus what they have
•What they need: Contextualized, disambiguated, highly relevant
and specific integrated data, flowing to the point of need
•What they have: Single batch datasets cleaned up to be good
enough by data scientists, who spend 80% of their time
on cleanup
•What they need: Knowledge engineers, and many bold Data
Visionaries in addition to big D Data Scientists, data-centric
architects, pipeline engineers, specialists in many new data niches
•What they have: A growing group of tool users versed only in
probability theory, neural networks, python and R, including small
D data scientists, engineers and architects, plus scads of
entrenched application-centric developers
[Figure labels: Finance, Operations, Marketing; Input, Output; neural network with input layer, hidden layer 1, hidden layer 2 and output layer]
7. PwC | Data-centric design and the knowledge graph
Consider how long it took to build out the world’s oil &
gas infrastructure.
Now think about where we are with traditional data
management:
• How do we free ourselves from legacy IT?
• How do we build sharable digital twins?
• How do we scale a shared data infrastructure?
The mirrorworld poses a
massive global data
infrastructure challenge
8. PwC | Data-centric design and the knowledge graph
Why treating smart data as a strategic asset is so critical right now
Challenge of the 2020s: Feeding your AIs enough
relevant, quality data
• Emerging tech often gets adopted just in pockets.
• That’s particularly the case with AI.
• Retraining, hiring new people, or buying more tools
isn’t enough.
• Many never figure out how to take advantage of
important AI-enabling tech. They’ll just use it in
ad hoc projects or subscribe to AI-enhanced apps.
• But the impact on decision making will be minimal
without an industrial-scale approach to data and
flow.
Opportunity of the 2020s:
Pipelines, distribution networks and
volumes of quality, contextualized
smart data flowing to the point of
need
The challenge we face is the same
as the oil and gas industry faced in
the 1920s:
• Collecting enough raw material
• Refining and enriching it
• Distributing it to the places that
need it most
• Creating enough supply to
generate massive demand and
drive down the cost of AI
10. PwC | Data-centric design and the knowledge graph
Global software market — knowledge graph addressable segments
Excerpted from Andrew Bartels, “Global Tech Market Outlook Update For 2019 To 2020,” Webinar, Forrester Research, 2019
68 percent of DBMS growth is cloud related, says Gartner.
76 percent of total software growth is cloud related, says
Forrester, who says that cloud services infrastructures “are
becoming the new data management platform.”
$billions
11. PwC | Data-centric design and the knowledge graph
Emerging techs – How are all these things interrelated?
Are they addressable too?
Knowledge graphs, the manifestation of a data-centric
architecture, can empower the other technologies
in these ways:
1. Accelerate machine learning training set
development
2. Enable multi-domain virtual
assistants/chatbots
3. Add reasoning to conversational AI platforms
4. Become a means of sharing and interoperation
of digital twins
12. PwC | Data-centric design and the knowledge graph
Emerging markets — related to most relevant hype cycle techs
Total projected revenue: $58.2 billion (2021)
Source: Tractica, Grandview Research and PwC analysis, 2019
13. PwC | Data-centric design and the knowledge graph
The DaaS market has existed in some forms for decades. Some segments are quite
mature; others, such as KGaaS, are brand new
[Chart: Data as a Service, organization domain only, $billions, 2018 to 2023; y-axis from 8.5 to 11.0]
Lynne Schneider, Worldwide Data-as-a-Service Organization Domain Forecast, 2019–2023, IDC 201
(Providers include Bloomberg, D&B, Lexis-Nexis, Moody's, Refinitiv, et al.)
• Overall CAGR: 2.6%
• Knowledge graphs are in
use here, but just in
pockets, so there is
growth potential as well
• Knowledge graphs as
a service (KGaaS)
from GraphPath and
Refinitiv offer discovery
platforms, access to
libraries, etc.
Refinitiv via Giovanni Tummarello of Siren.io. “Seven great
advancements in enterprise knowledge graphs in 2018”,
https://siren.io/enterprise-knowledge-graphs-advancements/
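As a quick check on what a 2.6% overall CAGR implies, the sketch below compounds that rate over the five annual steps from 2018 to 2023. Only the rate and the year range come from the slide; the function name is illustrative, and no dollar values are assumed.

```python
def growth_multiple(cagr: float, years: int) -> float:
    """Return the end/start ratio implied by compounding `cagr` for `years` years."""
    return (1.0 + cagr) ** years


# 2.6% per year over 2018-2023 (five annual steps)
multiple = growth_multiple(0.026, 5)
print(f"Growth multiple: {multiple:.3f}")  # about 1.137, i.e. roughly 14% total growth
```

In other words, a market growing at this rate ends the period only about one-seventh larger than it started, which is consistent with the slide's description of a mature segment.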
14. PwC | Data-centric design and the knowledge graph
Platform as a service providers include collaboration environments for working
with multiple datasets in the cloud
$billions
15. PwC | Data-centric design and the knowledge graph
Summary: A very large available market, but of course there’s a catch….
[Pie chart: Summary of global target markets for knowledge graph technology, 2021]
Segments: Digital twins; PaaS (data mgmt.); DaaS (org. domain); Virtual assistants; Conversational AI; Deep learning; PaaS (integration, orchestration); Info mgmt software; Integration software; DBMS software
Segment shares (unlabeled in the extracted chart): 4%, 5%, 5%, 8%, 8%, 9%, 14%, 13%, 8%, 26%
Total: $205 billion
Sources: Gartner (hype cycle only), IDC, Tractica, PwC analysis, 2019
16. PwC | Data-centric design and the knowledge graph
Semantic PaaSes are becoming more collaboration friendly
16
Exploring Knowledge Graphs on Amazon Neptune Using Metaphactory,
AWS Partner Network Blog, January 11, 2019
18. PwC | Data-centric design and the knowledge graph
Three steps to understanding smart data
Step I: A logical, unified model in data at the data layer clears a path to actionable understanding
[Pyramid levels, top to bottom: Understanding, Knowledge, Interpretation, Contextualization, Recognition, Data collection]
Labels: Smart data for decision-making; Actionable
Data maturity levels:
5: Unified model
3 to 4: Competency with knowledge graphs
1 to 2: Struggles with basic entity resolution
Enables IT rationalization and AI at scale
19. PwC | Data-centric design and the knowledge graph
• “Carole Cole disappeared in 1970 after running away from a
juvenile detention center in Texas. She was 17.
• A year later an unidentified murdered body was found in
Louisiana. It was Carole, but Louisiana police had no idea.
They couldn’t identify her. Carole’s disappearance went cold,
as did the unidentified body.
• Thirty-four years later Carole’s sister posted messages on
Craigslist asking for clues into her sister’s disappearance. At
nearly the same time, a sheriff’s department in Louisiana
made a Facebook page asking for help identifying the Jane
Doe body found 34 years before.
• Six days later, someone connected the dots between the two
posts.
• What stumped detectives for almost four decades was solved
by Facebook and Craigslist in less than a week.”
From “Three Big Things: The Most Important Forces Shaping the
World,” blogpost, Morgan Housel, Oct 4, 2019
One big force shaping the world
How do enterprises make
these connections more
reliably so that machines can
replicate human discovery?
Put more people and machines
together in feedback loops.
20. PwC | Data-centric design and the knowledge graph
Largest changes in market cap by global company, cross industry, 2018
Rank | Company name | Location | Industry | Change in market cap 2009–2018 ($bn) | Market cap 2018 ($bn)
1 | Apple | United States | Technology | 757 | 851
2 | Amazon.Com | United States | Consumer Services | 670 | 701
3 | Alphabet | United States | Technology | 609 | 719
4 | Microsoft Corp | United States | Technology | 540 | 703
5 | Tencent Holdings | China | Technology | 483 | 496
6 | Facebook | United States | Technology | 383 [1] | 464
7 | Berkshire Hathaway | United States | Financials | 358 | 492
8 | Alibaba | China | Consumer Services | 302 [1] | 470
9 | JPMorgan Chase | United States | Financials | 275 | 375
10 | Bank of America | United States | Financials | 263 | 307
[1] Change in market cap from IPO date
[2] Market cap at IPO date
Source: Bloomberg and PwC analysis
• Other major tech, FS and pharma cos. are also working on cross-enterprise knowledge graphs
• Many have cross-enterprise knowledge graph ambitions, but most are focused on a single use case
• S&P does cross-enterprise data management using relational tech
Callouts on the table: Known knowledge graph builders; Operator of Taobao and AliBot, KG builder; Known KG builders
The most value-creating companies in the world are using knowledge graphs
21. PwC | Data-centric design and the knowledge graph
Why traditional data management doesn’t scale
1. Relational databases don’t treat relationship
data as a first-class citizen
2. As a result, most companies have buried or are
missing the relationship data they need for
contextualization
3. Tables alone don’t help you dynamically model
your data or share the models
4. Managing large numbers of tables soon gets
unwieldy
5. Limiting your database resources to tabular
methods ensures you won’t take full advantage
of today’s compute, networking and storage
From relationship sparseness to relationship richness:
Relational: Row and column headers and up-front taxonomies (static, selective, fragmented, labor intensive)
Document: Nested, cumulative hierarchies (additive, index friendly, immutable, versioning possible)
Graph: Any-to-any relationships (more dynamic, more inclusive, more integrated, more machine assisted)
PwC, 2016
When overused, RDBMSes
perpetuate the provincial data
mentality of the 1980s, back
when computing didn’t scale
Lots of data is missing from relational
datasets—namely the contextual clues
needed for disambiguation via entity
resolution and, therefore, large-scale
integration
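A minimal sketch of the relational-versus-graph contrast on this slide, using made-up entities and predicate names (all of them assumptions for illustration): in a graph, each relationship is itself a (subject, predicate, object) record, so a new kind of relationship is one more row rather than a new table or schema change.

```python
from collections import defaultdict

# Each relationship is a first-class (subject, predicate, object) record.
# All entity and predicate names here are hypothetical.
triples = [
    ("acme", "supplies", "widgetco"),
    ("widgetco", "located_in", "austin"),
    ("acme", "owned_by", "holdco"),
    ("holdco", "located_in", "austin"),
]


def neighbors(triples, node):
    """Return every relationship touching `node`, in either direction."""
    out = defaultdict(list)
    for s, p, o in triples:
        if s == node:
            out[p].append(o)
        elif o == node:
            out[f"inverse_{p}"].append(s)
    return dict(out)


# Adding a new relationship type is just another row, not a schema change.
triples.append(("widgetco", "competes_with", "gadgetinc"))

print(neighbors(triples, "widgetco"))
```

The same `neighbors` function works for any node and any predicate, which is the sense in which relationship data is treated as a first-class citizen.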
22. PwC | Data-centric design and the knowledge graph
The consequence of logic and data siloing – App-centric system-level complexity
and disconnectedness spinning out of control (Result – Table and code sprawl)
[Figure: enterprise application stacks by era; timeline labels: Pre 1970, 1973-1990s, Early 1990s, Late 1990s, 2000s, 2010s, 2020s]
Stacks shown:
• Hardware; DBMS; OS; Custom code
• Hardware; A few DBMSes; A few OSes; ERP+ suites; Custom code
• Hardware; A few more OSes; More DBMSes; Custom code; ERP+ suites
• Hardware; Lots of OSes; 1,000+ SQL/NoSQL DBs; Custom code; ERP+ suites
• Hardware; Lots more OSes; 5,000+ databases; Componentized suites; Custom code; Cloud layer
• Hardware; More types of OSes; 10,000+ DBs + blockchains; Multicloud layer; Suites as services; Various SaaSes; Custom code
Callout: Threat of more application-centric sprawl
23. PwC | Data-centric design and the knowledge graph
Data-centric design at the micro level brings humans and machines together, with
the humans helping the machines build and scale relationship data
Relationship logic to be shared at scale needs to be created in human-machine feedback loops and
embedded in a standard form at the data layer for full reuse, not trapped in app silos
Relationship-sparse, but highly articulated data context that humans need to help machines refine and enrich
Relationship-rich smart data that uses description or predicate logic to scale integration, context and interoperation
24. PwC | Data-centric design and the knowledge graph
The key opportunity – Large-scale integration and model-driven intelligence in
a de-siloed and de-duplicated way
Previously dominant: Rule-based systems (includes KR)
“Handcrafted knowledge” is the term DARPA uses; rule-based programming + procedure
replication in process automation, + some knowledge representation (KR)
• Strong on logical reasoning in specific concrete contexts
- Procedural + declarative programming + set theory, etc.
- Deterministic
• Can’t learn or abstract
• Still exceptionally common and useful
Example: Consumer tax software
On the rise and rapidly improving: Statistical machine learning
• Probabilistic
• From Bayesian algorithms to neural nets (yes, deep learning also)
• Strong on perceiving and learning (classifying, predicting)
• Weak on abstracting and reasoning
• Quite powerful in the aggregate but individually (instance by instance) unreliable
• Can require lots of data
Example: Facial recognition using deep learning/neural nets
Nascent, just beginning: Contextualized, model-driven approach
• Contextualized modeling approach allows efficiency, precision and certainty
• Combines power of deterministic, probabilistic and description logic
• Allows explanations to be added to decisions
• Accelerates the training process with the help of specific, contextual human input
• Takes less data
Example: Explains first how handwritten letters are formed so machines can decide; less data needed, more transparency.
[Chart labels for each approach: Perceiving, Learning, Abstracting, Reasoning]
John Launchbury of DARPA (https://www.youtube.com/watch?v=N2L8AqkEDLs), Estes Park Group and PwC research, 2017
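One way to picture the contextualized, model-driven approach is a probabilistic score that is checked against a small declarative model, with the reasons recorded alongside the decision. The sketch below is only an illustration under stated assumptions: the letter model, the stroke features, the threshold and all names are hypothetical, and no real classifier is involved.

```python
from dataclasses import dataclass, field


@dataclass
class Decision:
    label: str
    accepted: bool
    explanation: list = field(default_factory=list)


# Hypothetical declarative description of how each letter is formed.
LETTER_MODEL = {
    "A": {"diagonal_strokes": 2, "crossbars": 1},
    "H": {"vertical_strokes": 2, "crossbars": 1},
}


def decide(label, score, observed, threshold=0.9):
    """Accept a confident statistical guess; otherwise check it against the model."""
    reasons = [f"classifier score for '{label}' is {score:.2f}"]
    if score >= threshold:
        reasons.append("score meets threshold")
        return Decision(label, True, reasons)
    expected = LETTER_MODEL.get(label, {})
    matches = bool(expected) and all(
        observed.get(key) == value for key, value in expected.items()
    )
    reasons.append(
        f"observed strokes {'match' if matches else 'do not match'} the model for '{label}'"
    )
    return Decision(label, matches, reasons)


d = decide("A", 0.72, {"diagonal_strokes": 2, "crossbars": 1})
print(d.accepted, d.explanation)
```

The point of the design is that a borderline statistical guess can still be accepted, and explained, when it agrees with explicit knowledge about the domain.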
25. PwC | Data-centric design and the knowledge graph
Origins of data-centric thinking
Software Wasteland: How the Application-Centric Mindset is Hobbling our Enterprises, by Dave McComb
The Data-Centric Manifesto
Principles
1. Data is a key asset of any organization.
2. The current enterprise software paradigm is
“Application-Centric.”
3. Hoarding data in proprietary and complex
apps is a mistake.
4. Most of the excessive cost and complexity in
Enterprise Apps stems from the relationship
of the apps to the data.
5. We are committed to reversing this trend.
6. We understand that there is money to be
made in the application-centric paradigm.
http://datacentricmanifesto.org/principles/
Data-centric Architecture Forum
Fort Collins, CO|February 3 – 5, 2020
In February 2019 we hosted the inaugural Data-Centric Conference,
where we started a profound conversation about the exploding costs
of enterprise systems, discussed strategies to reverse the application-centric
mindset and committed to moving the needle in the right
direction by forging data-centric projects going forward. We are very
pleased to announce we’ll do this again in February 2020 as the
Data-centric Architecture Forum. The theme of next year’s forum will be
experience reports on attempting to implement portions of the
architecture. Join us and our mission to get more people involved and
skilled in data-centricity. Here’s a quick summary of our 2019 event to
give you an idea of what to expect.
Hold the date, and save some money: Super Early Bird Discount
of $300 off if you register by June 30, 2019!
https://www.semanticarts.com/dcc/
26. PwC | Data-centric design and the knowledge graph
The solution – Data-centric architecture reduces both application and
database sprawl
Trapped app code and databases
Application centric versus Data centric
Semantic model/rules
Data lake or hub
Applets
Applications for execution only
Models exposed with the data
27. PwC | Data-centric design and the knowledge graph
Rationalize – Identify and declare the few hundred business rules you need
as a model
“In every company I’ve ever studied, there are only a few hundred key concepts and relationships that the entire business runs on. Once you
understand that, you realize all of these millions of distinctions are just slight variations of those few hundred important things.”
--Dave McComb, author of Software Wasteland, quoted in Strategy + Business
See “Are You Spending Way Too Much on Software?” at
https://www.strategy-business.com/article/Are-You-Spending-Way-Too-Much-on-Software?
28. PwC | Data-centric design and the knowledge graph
Reuse – Call the model to reuse those rules whenever necessary
“You discover that many of the slight variations aren’t variations at all. They’re really the same things with different names, different structures,
or different labels. So it’s desirable to describe those few hundred concepts and relationships in the form of a declarative model that small
amounts of code refer to again and again.”
--Dave McComb (as previously cited)
See “Are You Spending Way Too Much on Software?” at
https://www.strategy-business.com/article/Are-You-Spending-Way-Too-Much-on-Software?
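To make the "declarative model that small amounts of code refer to again and again" concrete, here is a minimal sketch. The concept names, source labels and sample rows are all hypothetical; the idea shown is only that the variations live in data (the model) while the code stays small and generic.

```python
# Declarative model: a few canonical concepts and the labels each source uses for them.
# Concept and field names are hypothetical.
MODEL = {
    "customer_id": {"cust_no", "client_id", "CustomerID"},
    "legal_name": {"name", "company_name", "acct_name"},
}

# Invert the model once so the generic code below can look up any source label.
LABEL_TO_CONCEPT = {
    label: concept for concept, labels in MODEL.items() for label in labels
}


def normalize(record: dict) -> dict:
    """Rename source-specific fields to canonical concepts; keep unknown fields as-is."""
    return {LABEL_TO_CONCEPT.get(key, key): value for key, value in record.items()}


crm_row = {"client_id": 17, "company_name": "Acme"}
billing_row = {"cust_no": 17, "acct_name": "Acme"}

print(normalize(crm_row) == normalize(billing_row))  # True
```

Supporting a new source label means adding one entry to `MODEL`; `normalize` itself does not change.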
30. PwC | Data-centric design and the knowledge graph
State of the art knowledge graph – Blue Brain Nexus (1 of 2)
How do scientists record the provenance of, curate, share as open
source and collaborate on what they’ve documented using 3D
imaging techniques generated with the help of a supercomputer,
such as the slices of a rat’s brain?
From the EPFL Blue Brain Portal Gallery, https://portal.bluebrain.epfl.ch/gallery-2/
31. PwC | Data-centric design and the knowledge graph
State of the art knowledge graph – Blue Brain Nexus (2 of 2)
Bogdan Roman, “Blue Brain Nexus Technical Introduction,” March 2018, https://www.slideshare.net/BogdanRoman1/bluebrain-nexus-technical-introduction-91266871
32. PwC | Data-centric design and the knowledge graph
Montefiore’s semantic data lake
Montefiore Health, Franz, Intel and PwC research, 2017
Various data sources,
some structured, some
not, now all part of
a knowledge graph with
a simple patient
care-centric ontology
Hadoop cluster with
high-performance
processors and memory
Scalable graph database
supporting open W3C
semantic standards
Standard open source
querying, ML and
analytics frameworks, API
accessibility
Data sources: HL7 feed, Web services, EMR, LIMS, Legacy, OMICs, CTMS, Claims
Annotation engine
HDFS/Hadoop (x10)
AllegroGraph (x5)
SDL loader
Query and analytics: ML-LIB/R, SPARQL, Prolog, Spark, Java API
Doctors can query the graph or
harness ML + analytics and receive
answers from the system at the
point of care via their handhelds.
The system also acts as a giant
feedback-response or learning loop
which learns from the data collected
via user/system interactions.
33. PwC | Data-centric design and the knowledge graph
A semantic knowledge graph could enable the model-driven organization (a digital
twin) at the data layer
Step One: Model the relevant
elements of the organization, how
they relate to one another
and interoperate
Step Two: Embed the model where
it lives as machine-readable data
Step Three: Integrate the source
datasets as a target knowledge
graph with model-driven mappings
Step Four: Browse, query,
disambiguate, detect and discover
via the resulting knowledge graph
[Diagram: business model graph]
Entities: Company, Portfolio, Prog/proj, Strategy, Milestone, Work package, Role, Competence, Person, Risk, Process, Information, Technology, Capability
Relationships:
Company: employs person; has portfolio; has prog/proj; has role
Portfolio: has person; has prog/proj
Prog/proj: has parent prog/proj; has person; has role; has risk; has milestone; outputs work package; delivers strategy; delivers capability; supports process; creates information; creates technology; uses information; uses technology
Strategy: has milestone
Work package: has person; needs competence
Role: needs competence
Person: has competence; identified risk; uses process; uses information; uses technology; uses capability; creates information
Risk: owned by person
Capability: enables process; uses information; uses technology
Process: uses information
Information: uses technology
Technology: supports process
https://virtualdutchman.com/2018/10/14/moving-to-a-model-based-enterprise-the-business-model/
Clearvision, 2019. Used with permission.
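The four steps on this slide can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not a production knowledge graph: the model uses four relationships taken from the diagram, while the instance names (acme, dana and so on) and helper functions are hypothetical.

```python
# Step One: model entity types and allowed relationships (a small subset of the diagram).
MODEL = {
    ("Company", "employs", "Person"),
    ("Person", "has", "Competence"),
    ("Role", "needs", "Competence"),
    ("Prog/proj", "has", "Role"),
}

# Step Two: the model is itself plain data, kept alongside the instance data.
types = {}       # instance -> entity type
triples = set()  # (subject, predicate, object) instance facts


def add_fact(s, s_type, p, o, o_type):
    """Step Three: load a fact only if the model allows the relationship."""
    if (s_type, p, o_type) not in MODEL:
        raise ValueError(f"model does not allow {s_type} {p} {o_type}")
    types[s], types[o] = s_type, o_type
    triples.add((s, p, o))


# Hypothetical source rows mapped into the graph.
add_fact("acme", "Company", "employs", "dana", "Person")
add_fact("dana", "Person", "has", "sparql", "Competence")
add_fact("data_engineer", "Role", "needs", "sparql", "Competence")
add_fact("kg_pilot", "Prog/proj", "has", "data_engineer", "Role")


def people_for_role(role):
    """Step Four: discover people whose competences cover what a role needs."""
    needed = {o for s, p, o in triples if s == role and p == "needs"}
    people = {s for s, p, o in triples if p == "has" and types[s] == "Person"}
    return sorted(
        person
        for person in people
        if needed <= {o for s, p, o in triples if s == person and p == "has"}
    )


print(people_for_role("data_engineer"))  # ['dana']
```

Because the model is data, the loader rejects facts the model does not describe, and queries such as `people_for_role` can be written against the model's vocabulary rather than against any one source system.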
35. PwC | Data-centric design and the knowledge graph
Seven obstacles to semantics and knowledge graph adoption and ways around them
1. Tribalism
Nature of the problem: Each tribe works off on its own, rarely with other tribes
Ways to overcome: Encourage activist leadership and hire to emphasize the blended nature of the solution
2. Low awareness in the trenches
Nature of the problem: Few seem to acknowledge or care about what’s actually happening
Ways to overcome: Find those who want to learn and be inspired
3. Magic bullet mentality
Nature of the problem: Inflated, unrealistic expectations regarding “AI”, RPA, blockchain, etc.
Ways to overcome: Promote foxes (breadth) rather than hedgehogs (depth)
4. Indifference about the back end
Nature of the problem: While the front end seems always bright and shiny, few seem to care about the plumbing
Ways to overcome: Highlight the end user benefits the back end and a systems approach enables
5. Lack of university coursework
Nature of the problem: Few universities in the US seem to offer courses in semantics
Ways to overcome: Follow the European example
6. Misplaced belief in the centrality of the app layer
Nature of the problem: Shallow understanding of data + logic, declarative versus imperative programming, etc.; reinforcement of the status quo
Ways to overcome: Focus on less mature areas where alternative approaches are more likely to be accepted
7. Buy rather than build habit
Nature of the problem: Enthusiasm for the latest new products and services
Ways to overcome: Focus on the system rather than the piece parts
37. PwC | Data-centric design and the knowledge graph
Graphs (including hybrids) complete the picture of your transformed data lifecycle
and how it’s managed
38. PwC | Data-centric design and the knowledge graph
Bottom line – The 4D approach to insight
1. De-silo: Integrate all the relevant sources in a declarative fashion that enables reuse, cross-enterprise scalability, and continuous refinement.
2. Disambiguate: Triangulate using set theory and linguistic description logic in addition to statistical methods, enabling precise
entity resolution.
3. Detect: Uncover weaker signals by articulating the most relevant and distant relationships between entities, via richer contextualization.
4. Discover: Radically expand the ability to discover insights, moving beyond keywords to concepts.
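The "Disambiguate" step, triangulating with set theory in addition to statistical methods, can be sketched as follows. This is a simplified illustration, not a full entity resolution method: the records, thresholds and field names are hypothetical, and Python's standard `difflib.SequenceMatcher` stands in for the statistical signal.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Statistical signal: character-level similarity of two names (0.0 to 1.0)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def context_overlap(a: set, b: set) -> float:
    """Set-theoretic signal: Jaccard overlap of the entities each record relates to."""
    return len(a & b) / len(a | b) if a | b else 0.0


def same_entity(rec_a, rec_b, name_min=0.6, context_min=0.5):
    """Treat two records as one entity only when both signals agree."""
    return (
        name_similarity(rec_a["name"], rec_b["name"]) >= name_min
        and context_overlap(rec_a["related"], rec_b["related"]) >= context_min
    )


a = {"name": "J. Smith", "related": {"acme", "austin", "kg_pilot"}}
b = {"name": "John Smith", "related": {"acme", "austin"}}
c = {"name": "John Smith", "related": {"globex", "boston"}}

print(same_entity(a, b), same_entity(b, c))  # True False
```

Records `b` and `c` share an identical name but no relationships, so name matching alone would merge them; the relationship context is what keeps them apart.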
Outlook and conclusion
Kevin Kelly's concept of the mirrorworld describes the future vision, which he says will take 25
years to materialize.
Poor data management is the main reason we're stuck at the starting gate with the mirrorworld.
Moving to the cloud will not fix your data management problems.