DCAF 2023 1 and 2.pdf

DCA: Current Themes and
Trends*
Alan Morrison
Data-Centric Architecture Forum
May 2023
1
Alain Audet at https://pixabay.com/photos/lake-foggy-lake-nature-landscape-6839357/
*Separate talk to cover NLP/LLMs

Business goals enabled by a connected, shared data
ecosystem
2
Buying Helping
Making Selling
Sharing

Inhibitors to ecosystem-level sharing
● Data feudalism
● Poorly defined regulatory challenges
● Weak public sector
● Public apathy
● Technology + investor inertia and lack of clear vision
● Magic bullet syndrome
● Media groupthink
● Idol worship
● Pervasive myopia
● Lack of organizational empowerment of foxes over hedgehogs
3

Unclaimed data market territory
Present vs Future Shared Data Market Map
12 steps to FAIR data power:
● FAIR*
● Actionability
● Immediacy
● Divining purpose
● Divining intent
● Synthesis
● Reasoning
● Abstraction
● Contextualization
● Connection
● Classification
● Identification
Legend: Unclaimed market territory; Staked claims; Reach of current ML efforts
*Findable, accessible, interoperable, reusable data

Challenge: Seamless, at-scale, FAIR data collaboration
5
James Kobielus, 2016
Association of European Libraries, 2017

Opportunity: Unitary data + description logic = knowledge
7
● “Data management” (structured data, mostly)
● Knowledge management (internally shared)
● Content management (externally shared)
● Learning management (internal coursework)
FAIR data and associated description logic
FAIR data is data users can have confidence in for many purposes.
Data becomes FAIR when it disambiguates concepts, individuals and roles and how they interact and relate to one another.
In a knowledge graph context, documented knowledge = FAIR data.
Under the FAIR data umbrella are all heterogeneous types of data/content.

To create a knowledge graph, users can start with a single triple
8
Linked Open Data Cloud, 2022
Starter triple for a knowledge graph
A standard knowledge graph consists of triplified, relationship-rich
data. The data model, or ontology, is also described in triples and
lives with the rest of the data. Ontologies can also be managed as
data. Linking triples merely requires a verb (or predicate, or
described edge) between them.
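The starter-triple idea can be sketched in a few lines of plain Python. This is only an illustration using tuples in a set, not a real triple store, and every identifier (`ex:AcmeCorp`, `ex:sells`, `ex:Widget`, and so on) is a hypothetical name that does not come from the talk.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# A knowledge graph can begin with one statement (hypothetical names).
graph = {Triple("ex:AcmeCorp", "ex:sells", "ex:Widget")}

# Ontology statements are triples too, stored alongside the instance data.
graph.add(Triple("ex:Widget", "rdf:type", "ex:Product"))
graph.add(Triple("ex:Product", "rdfs:subClassOf", "ex:Offering"))


def neighbors(graph, node):
    """Return the objects of every triple whose subject is `node`."""
    return {t.obj for t in graph if t.subject == node}


def reachable(graph, start):
    """Follow predicates transitively from `start` and collect what is found."""
    seen, frontier = set(), [start]
    while frontier:
        node = frontier.pop()
        for nxt in neighbors(graph, node):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen


linked = reachable(graph, "ex:AcmeCorp")
```

Because the model statements share nodes with the data statements, walking from the company reaches both the product and its class without any separate schema lookup.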

Simple way to start a business knowledge graph (besides using gist)
● “Use JSON-LD to atomise your enterprise data down into three-part statements and voila!
You get a connected graph!
● ✨ Decentralize the process by having each team publish their own JSON-LD, for example,
let the sales team publish the sales data and ask them to link each sale to the correct product
and client.
● 🤖 Connect GPT to the JSON-LD that your teams have published. Then, harness the power
of GPT to assist new teams in publishing their JSON-LD and integrating it back into your
enterprise-wide Knowledge Graph.”
Key to scaling external/internal integration: use the schema.org modeled JSON-LD from websites
GPT is trained on and connect it with internal data also modeled with schema.org
–#HT Tony Seale, UBS
https://www.linkedin.com/posts/tonyseale_mlops-dataintegration-ai-activity-7052551060237819904-bAZc
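To make the "three-part statements" point concrete, here is a minimal sketch that flattens one small JSON-LD record into triples. It assumes a single flat node whose `@context` is a plain vocabulary URL; a real JSON-LD processor handles far more. The sale, client and product URLs are hypothetical, while `Order`, `customer` and `orderedItem` are schema.org terms.

```python
import json

# Hypothetical record a sales team might publish, modeled with schema.org terms.
SALE_JSONLD = """
{
  "@context": "https://schema.org/",
  "@id": "https://example.com/sales/1001",
  "@type": "Order",
  "customer": {"@id": "https://example.com/clients/42"},
  "orderedItem": {"@id": "https://example.com/products/widget"}
}
"""


def to_triples(doc):
    """Flatten one flat JSON-LD node into (subject, predicate, object) tuples.

    Simplified on purpose: the @context string is used as a plain prefix.
    """
    vocab = doc["@context"]
    subject = doc["@id"]
    triples = []
    for key, value in doc.items():
        if key in ("@context", "@id"):
            continue
        if key == "@type":
            triples.append((subject, "rdf:type", vocab + value))
        elif isinstance(value, dict) and "@id" in value:
            # A link to another node, such as the client or the product.
            triples.append((subject, vocab + key, value["@id"]))
        else:
            triples.append((subject, vocab + key, value))
    return triples


triples = to_triples(json.loads(SALE_JSONLD))
```

The links to the client and product are what connect this team's record to records published by other teams.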
9

Yes, data warehousing focused on the integration problem
10
● Pro: Identified the critical problem to solve
● Con: Advocated a method that doesn’t delve deep enough to solve today’s
problem
● Still face the unified data model challenge

No, data warehousing model conformance doesn’t scale
“I spent a good 15 years working in financial services at some
pretty big banks. Half of the IT change budget is spent on
integration and the by-products of integration…. I saw as the
technology was advancing that the percentage wasn’t going
down – in fact, it was going up. At some point, is the integration
tax going to be 100 percent?”
– Dan DeMers, CEO of Cinchy
“Disambiguation of Data Mesh, Fabric, Centric, Driven, and Everything!” YouTube video,
https://www.youtube.com/watch?v=M5XlGloj4UY&t=564s, 2021
11

How data warehousing stopped scaling
“They recognized that these themes ended up in all these legacy apps. Sales rolled up against a
geographic and a product hierarchy, and an organizational hierarchy…. They said, Let’s have
those conformed dimensions and a small number of facts. Let’s bring the facts from all the
different systems and snap them together according to these conformed dimensions….
Brilliant idea, but I think what actually happened over time is the workload just got greater and
greater. The ability of people to actually conform those dimensions kept eroding….”
–Dave McComb, President, Semantic Arts
“Disambiguation of Data Mesh, Fabric, Centric, Driven, and Everything!” YouTube video, https://www.youtube.com/watch?v=M5XlGloj4UY&t=564s, 2021
12

Data warehousing can’t solve today’s integration challenge
13
● Thousands of databases per enterprise (siloing)
● Thousands of applications (code sprawl)
● Data models buried in the app code
● Every app a special snowflake with its own data model

How did we get here? By selling the old as new
14

Why large-scale integration?
15
Large-scale integration is essential to avoiding observational bias. The analogy of the drunk
looking for his money under the lamppost describes the nature of this bias: he searches where
the light is, even though he knows the money is in the shadows.
To manage today’s business at scale, enterprises need light and visibility
across departments, organizations and supply networks.

Semantic standards allow a desiloed data landscape for
interactive, interoperable digital twins and agents
16

Promise of digital twins and agents–way beyond APIs
17
Autonomous agents
Digital twins/
Small KGs
Locale: Portsmouth, UK
Sensor nets
Iotics, 2019 and 2023

How shared graph semantics helps
● Boosts meaningful results and relevancy (poor results stem from a lack of data and logic
transparency and cohesiveness)
● Contextualizes data for management and reuse with relationship logic
● Scales meaningful connections between contexts (relevant relationships
living with entities)
● Enables Metcalfe’s network of networks effect (network_effect^N)
● Enables model-driven development via knowledge graphs (code once, reuse
anywhere)
● Provides access via KGs to logic programs as well as heterogeneous, smart data
● Scales efficiencies and economies so that energy consumption is reduced
18

KG centricity makes reliable, automated data webs possible
19
“Data teams report spending 25-30% of their time cleaning, labelling, and
gathering data sets.... [Some can spend 80% plus]
What we know for sure is that data teams and knowledge workers
generally spend a noteworthy amount of their time procuring data
points that are available on the public web…”
It took Google knowledge panels one month and twenty days to update
following the inception of a new CEO at Citi, a F100 company. In Diffbot’s
Knowledge Graph, a new fact was logged within the week, with zero
human intervention and sourced from the public web.
– Merrill Cook, Diffbot Blog, 2021-2022

Example capabilities in Diffbot’s AI automated KG
20
Mike Tung, “VLDB2020: The Diffbot Knowledge Graph,” 2020

“Decentralization”: Why you should care
● Further desiloing
● More systems federation
● More interorganizational use potential
● Data-centric approach to architecture
● “Decentralized/Web3 stack”
● More storage options and tiering
● Options at different temperatures (hot vs. cold storage) for new use cases
● More captive and independent storage
21

The Five Commingled Phases of Compute, Networking and Storage
22
● 1st: Mainframe and Green Screens. Centralized storage and compute, with minimal networking
● 2nd: Client-Server and Desktops. Application Distribution via Proprietary and IP Networking
● 3rd: Early Web (on Client-Server). Simple web hosting + legacy Client-Server storage
● 4th: Distributed Cloud and Mobile Devices. Commodity servers + storage + some virtualization
● 5th: “Decoupled” and “Decentralized” Cloud. Compute and storage more loosely coupled, virtualized, controlled and data-centric
Axis labels: Time; Less centralized; More centralized; Application Centric; Data Centric
All phases are still active and evolving

Degree of control assumes a continuum–not a binary split
23
See Thomas W. Malone, Inventing the Organizations of the 21st Century, MIT Press, 2003, 45ff.

SOLID: Federated storage and decentralized apps
24
Ruben Verborgh, “Decentralizing personal data management with Solid: a hands-on workshop,” SEMIC Workshop, October 2020

SOLID shared, federated XaaS: Construction industry
25
“TrinPod™: World's first conceptually indexed space-time
digital twin using Solid,” Graphmetrix, 2022,
https://graphmetrix.com/trinpod
Company-specific SOLID storage pods and access
control can be managed by each supply chain partner.
Graphmetrix as digital twin provider manages the
system and system-level apps.

Peergos makes personal file storage management possible via IPFS and a
browser
26
Peergos technology logical architecture, https://peergos.org/technology, 2019
Peergos is a personal data cloud storage environment that also uses
blockchain-based decentralized public-key infrastructure (dpki).
Consider it as an alternative to Google or Amazon Photos, for example.

Enterprise decentralized app environment: OriginTrail.io
27
https://origintrail.io/

OriginTrail + BSI’s supply chain tracking and tracing
28
OriginTrail and the British Standards Institution (BSI), https://twitter.com/origin_trail/status/1339606640887152642?s=20, Dec. 2020
The Monasterevin whiskey produced in Ireland is tracked and traced from “grain to glass” with the OriginTrail.io approach.
OT uses a decentralized knowledge graph that connects to one of several different blockchains.
This method enables shared data reuse and other synergies across the supply chain.

Seven obstacles to adoption of decentralized,
interorganizational environments
29

To succeed, organizations will have to become
more bona fide data-centric organizations first
30

Seven obstacles to adoption of FAIR data development at scale
31

Thoughts and Reactions?
Feel free to ping me anytime with questions, etc.
Alan Morrison
Data Science Central
LinkedIn | Twitter | Quora | Slideshare
+1 408 205 5109
a.s.morrison@gmail.com
32

From NLP, to stochastic parrots,
to neurosymbolic AI
Alan Morrison
Data-Centric Architecture Forum
May 2023
33

What’s a “stochastic parrot” and one who worships the same?
“A Language Model is a system for haphazardly stitching together sequences of linguistic
forms it has observed in its vast training data, according to probabilistic information about
how they combine, but without any reference to meaning: a stochastic parrot.”
–Emily Bender, et al., “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?,”
ACM paper presented at FAccT ’21, March 3–10, 2021, virtual event, Canada
Stochastic parrot worshippers: Those who mindlessly praise LLMs without realizing they’ve
mistaken the parrot part (probabilistic language methods alone) for the whole. These
worshippers seem to assume those methods alone will deliver artificial general intelligence.
Related term: Documentation debt (also per Bender, et al.)
“When we rely on ever larger datasets we risk incurring documentation debt,” they say, “i.e.,
putting ourselves in the situation where the datasets are both undocumented and too large to
document post hoc…. The solution, we propose, is to budget for documentation as part of the
planned costs of dataset creation.”
34

Deep learning guru Yann LeCun on LLMs
35

What’s Natural Language Processing (NLP)?
36
“The root of Natural Language Processing dates back to the 1950s when Alan Turing first devised the Turing Test.
“The objective of the Turing Test was to determine whether a computer was truly intelligent based on its ability to interpret and generate natural language as a criterion of intelligence.”
– Tithy Sreemani, Analytics Vidhya blog, 2022

What’s natural language understanding (NLU)?
1. A form of overpromising and underdelivering, or
2. A serious, ongoing linguistics + cognition endeavor to model how human
understanding works.
37
A sentence-level model based on Role and Reference Grammar by PAT Inc., 2022.

What’s a large language model (LLM)?
1. A neural network with many layers (“deep learning”).
2. A transformer model that “learns” context a token at a time, in sequence.
3. A tokenizer that converts words to numbers and numbers to words.
4. A token-to-embedding (vectorization) transformer.
5. An ML model that is trained on very large data sets, with millions or billions of
parameters (akin to multi-dimensional topographic features)
6. The NLP (natural language processing) system currently in vogue.
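Items 3 and 4 above can be illustrated with a toy sketch. It assumes a word-level vocabulary and a randomly initialized embedding table; production LLMs use subword tokenizers and learned embeddings, so treat the names and sizes here (`corpus`, `EMBED_DIM`) as illustrative assumptions only.

```python
import random

corpus = "data becomes fair when it disambiguates concepts"

# Item 3: a tokenizer maps words to integer ids and back again.
vocab = {word: idx for idx, word in enumerate(sorted(set(corpus.split())))}
inverse_vocab = {idx: word for word, idx in vocab.items()}


def encode(text):
    """Convert known words to token ids."""
    return [vocab[word] for word in text.split()]


def decode(ids):
    """Convert token ids back to words."""
    return " ".join(inverse_vocab[i] for i in ids)


# Item 4: each token id selects one row of an embedding table.
# Random values stand in for what training would normally learn.
rng = random.Random(0)
EMBED_DIM = 4
embeddings = [[rng.uniform(-1, 1) for _ in range(EMBED_DIM)] for _ in vocab]


def embed(text):
    """Look up one vector per token."""
    return [embeddings[i] for i in encode(text)]


vectors = embed("fair data")
```

The round trip `decode(encode(...))` returns the original words, while `embed` yields one fixed-length vector per token for the network to process.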
38

LLM Leaderboard (partial)
39
Dan Saatrup Nielsen, Alexandra Institute, LinkedIn post, 2023

Solving arithmetic or chasing “facts” with LLMs wastes time and energy
“Suppose that I wanted to find out the square root of five. If I asked an LLM (say ChatGPT), getting this answer involves the
following steps:
● Me: Send a prompt saying “What is the square root of 5?”
● ChatGPT: Do I understand the concept of square root? Yes, I do … it’s a math function.
● ChatGPT: There is a Python function that can be used to invoke that function, in the Python Math Library. Retrieve
that library.
● ChatGPT: Evaluate the number 5 with the function call to get the value 2.236.
● ChatGPT: Construct a response and send that response back to the client.
This assumes that everything goes right.”
– Kurt Cagle, The Cagle Report
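For comparison, the direct route the quote alludes to is a single standard-library call. This is a minimal sketch, with variable names chosen only for illustration.

```python
import math

# One library call on the CPU, with no prompt or response round trip.
root = math.sqrt(5)
rounded = f"{root:.3f}"
print(rounded)  # prints 2.236
```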
40

Knowledge graphs know; LLMs need prompts and figure it out, sort of.
“LLMs have to figure things out. They follow an iterative feedback loop called a
langchain, with either a human, itself, or a combination of the two. This
langchain model should be emulatable with SPARQL.
“Update. I’m playing around with this idea on Jena/Fuseki, and the early results
are … intriguing. The key is to recognize that you are doing mutations to the
database, which makes many DBAs cringe. However, I don’t think there is any
way you can get to conversational AI on a knowledge graph without constantly
building (and, when necessary, destroying) contextual graphs.”
– Kurt Cagle, “Figuring Out vs. Knowing,” The Cagle Report
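The build-then-destroy pattern for contextual graphs can be mimicked in plain Python. This sketch is not SPARQL and does not use Jena/Fuseki; it only assumes a set of triples that is mutated for the length of one conversation turn. All identifiers (`ex:Order1001`, `ex:Conversation7`, `contextual_graph`) are hypothetical.

```python
from contextlib import contextmanager

# Long-lived facts (hypothetical example data).
base_graph = {("ex:Order1001", "ex:status", "ex:Shipped")}


@contextmanager
def contextual_graph(store, triples):
    """Temporarily add conversation-specific triples, then remove them."""
    added = set(triples) - store  # track only what this context introduced
    store |= added                # mutate the store in place
    try:
        yield store
    finally:
        store -= added            # destroy the contextual graph


turn_context = {("ex:Conversation7", "ex:about", "ex:Order1001")}

with contextual_graph(base_graph, turn_context) as g:
    size_during = len(g)

size_after = len(base_graph)
```

Tracking only the triples the context introduced means facts that were already in the store survive when the context is torn down.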
41

Idea: Connect the LLM directly to a KG such as Wikidata
“We can just use the SPARQL query generation ability directly and ask queries
against Wikidata. Not only can we connect the LLM to a knowledge graph, but
also to a repository of functions such as Wikifunctions.” LLMs can learn to use KGs
and functions as tools.
–Denny Vrandečić, Wikimedia Foundation, 2023
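As a rough sketch of what such a query looks like, the code below builds a SPARQL request for the public Wikidata endpoint. In the workflow the quote describes, an LLM would generate the query text; here it is written by hand, and no network call is made. `Q76` (Barack Obama) and `P19` (place of birth) are Wikidata identifiers; the helper function names are assumptions for illustration.

```python
from urllib.parse import urlencode

WIKIDATA_ENDPOINT = "https://query.wikidata.org/sparql"


def birthplace_query(entity_id):
    """Build a SPARQL query for an entity's place of birth (property P19)."""
    return (
        "SELECT ?placeLabel WHERE {\n"
        f"  wd:{entity_id} wdt:P19 ?place .\n"
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }\n'
        "}"
    )


def request_url(query):
    """Encode the query as a GET URL that asks for JSON results."""
    return WIKIDATA_ENDPOINT + "?" + urlencode({"query": query, "format": "json"})


query = birthplace_query("Q76")  # Q76 is Barack Obama on Wikidata
url = request_url(query)
```

Fetching `url` with any HTTP client would return the stored answer as structured data rather than a generated guess.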
42

Each machine learning answer creates some uncertainty
“You can use machine learning to retrieve Obama’s birthplace every time you
need it, but it costs a lot, and you’re never sure it’s correct.”
–Jamie Taylor of Google
43

Efficiency argument for knowledge graphs
“Why would you ever use a 96-layer, 175 billion parameter large language model
to do multiplication, when that’s something you can do in a single operation on
your CPU?”
“Why internalize knowledge in an LLM, when you can externalize it in a graph
store and look it up when you need it?”
“Use LLMs where they are efficient.”
– Denny Vrandečić of the Wikimedia Foundation
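One way to read "use LLMs where they are efficient" is as a routing decision. The sketch below sends simple multiplication to the CPU, known facts to an externalized lookup, and everything else to a placeholder for a language model. The `FACTS` dictionary stands in for a graph store, and the function and pattern names are assumptions for illustration, not an API from the talk.

```python
import re

# Hypothetical externalized facts; in practice this would be a graph store.
FACTS = {("Barack Obama", "birthplace"): "Honolulu"}

MULTIPLY = re.compile(r"^\s*(\d+)\s*[*x]\s*(\d+)\s*$")


def answer(question, subject=None, predicate=None):
    """Route to the cheapest tool that can answer; otherwise defer to an LLM."""
    match = MULTIPLY.match(question)
    if match:
        # A single multiply instruction instead of a pass through a large model.
        return int(match.group(1)) * int(match.group(2)), "cpu"
    if (subject, predicate) in FACTS:
        # Externalized knowledge: look it up when needed.
        return FACTS[(subject, predicate)], "graph"
    return None, "llm"  # placeholder: hand the question to a language model


product = answer("12 * 12")
fact = answer("Where was Obama born?", "Barack Obama", "birthplace")
fallback = answer("Summarize this memo")
```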
44

To scale FAIR data, use an assisted, hybrid AI approach
45
Amit Sheth, “From NLP to NLU: Why we need varied, comprehensive, and stratified knowledge (Neuro-symbolic AI),” USC Information Sciences Institute on
YouTube, March 2023, https://www.youtube.com/watch?v=xyxQXka6dRY&t=2377s

46
How hybrid AI helps in research
“LLMs have amazing abilities in manipulating natural language text, but generating timely and factually verified recommendations is one thing LLMs are not naturally great at.”
–Mike Tung, CEO of Diffbot, Diffbot Blog, April 2023,
https://blog.diffbot.com/generating-company-recommendations-using-large-language-models-and-knowledge-graphs/
LLMs alone aren’t a reliable research tool because they hallucinate. You can’t trust the answers unless you already know them.
Mike Tung recommends more precise prompting on the query side and answer verification via a knowledge graph such as Diffbot. Both of these capabilities harness the precise logical description missing in current LLM Q&As.
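Answer verification against a knowledge graph can be sketched as a simple comparison. This is not Diffbot's API; the `KG` dictionary is a hand-entered stand-in for a graph lookup, and `verify` is a hypothetical helper that classifies a claim extracted from an LLM answer.

```python
# Hand-entered stand-in for a knowledge graph: (subject, predicate) -> object.
KG = {("Citi", "ceo"): "Jane Fraser"}


def verify(claim, kg):
    """Classify an LLM-produced (subject, predicate, object) claim against the KG."""
    subject, predicate, obj = claim
    known = kg.get((subject, predicate))
    if known is None:
        return "unverifiable"  # the graph has nothing to check against
    return "supported" if known == obj else "contradicted"


supported = verify(("Citi", "ceo", "Jane Fraser"), KG)
contradicted = verify(("Citi", "ceo", "Someone Else"), KG)
unknown = verify(("Citi", "headquarters", "New York"), KG)
```

Keeping "unverifiable" separate from "contradicted" matters: a missing fact in the graph is not evidence that the model's answer is wrong.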

NLP’s compost grinder data mentality
47
https://pixabay.com/photos/compost-grinder-compost-chipper-3389088/

Versus KGs growing naturally in companion plant mode
48
Rich data ecosystems evolve naturally by comparison with underdescribed, fragmented data assets
Zero-copy integration becomes possible, reducing complexity, labor and energy waste by up to 90 percent
Second-order cybernetics (humans in the loop) and precise facts and contextualization complement probabilistic methods
https://www.fruitsaladtrees.com/blogs/news/ediblegarden

AI’s Wave III: Less wasteful, more explicit smart data
management via a knowledge graph foundation
49


NLP versus NLU: Most true understanding is unclaimed territory
51
Unclaimed data market territory
Present vs Future Data Market Map
12 steps to FAIR data power:
● FAIR*
● Actionability
● Immediacy
● Divining purpose
● Divining intent
● Synthesis
● Reasoning
● Abstraction
● Contextualization
● Connection
● Classification
● Identification
Legend: Unclaimed market territory; Staked claims; Reach of current ML efforts
*Findable, accessible, interoperable, reusable data

Stochastic parrots and hallucination
55

Teaching LLMs to query knowledge graphs
57

Datalanguage hackathon results
58

Semantic community LLM use results
59

Goal: Develop FAIR data efficiently
60

