Creating a Data Distribution Knowledge Base using Neo4j, UBS
1. Public
11th May 2017
Syed Haniff
Using graph technologies to map and manage data flows within the Bank
2. 1
Reference data at UBS
Building an integrated data distribution platform
Creating a Knowledge Base using Neo4j
Overview
3. 2
Founded 1854
Headquarters: Zurich, Switzerland
Operates in 50+ countries
Around 60,000 employees
6 Businesses
– Wealth Management
– Wealth Management Americas
– Personal & Corporate Banking
– Asset Management
– Investment Bank
– Corporate Centre
About UBS
4. 3
GDS manages the mastering and distribution of reference data to consumers
within the Bank.
About Group Data Services
5. 4
Externally and internally sourced non-transactional data:
Reference Data at UBS
– Account
– Book
– Calendar
– Client
– Confirms
– Financial Instrument
– Legal Entity
– Group Dictionary
– Prices
– Product
– Trading Agreement
– Settlement Instruction
6. 5
12 Data Domains
18 Datasets
7 Distribution Channels
400+ Integrations
Thousands of Attributes
Group Data Services in Numbers
7. 6
Providing timely, accurate, and complete reference data to users, systems, and
processes through a number of channels.
Reference Data Distribution
8. 7
Features
– Masters send normalized, canonical datasets
– Consumers land and join datasets themselves
– Good for producers (master data sources) … not so good for consumers
Data Distribution – Previously
9. 8
Example – Consumer joins
Consumers store multiple messages from multiple domains and resolve joins
themselves
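To make the consumer-side burden concrete, here is a minimal Python sketch. It assumes hypothetical message shapes: the field names (`domain`, `instrument_id`, `issuer_id`, `entity_id`, `name`) and the sample values are invented for illustration and are not taken from the UBS platform. Each consumer has to land messages from every domain and resolve the join itself:

```python
# Hypothetical consumer-side store: one landing table per domain.
# All field names and values are illustrative assumptions.
instruments = {}     # instrument_id -> instrument message
legal_entities = {}  # entity_id -> legal entity message


def land(message):
    """Store an incoming reference-data message in its domain table."""
    if message["domain"] == "instrument":
        instruments[message["instrument_id"]] = message
    elif message["domain"] == "legal_entity":
        legal_entities[message["entity_id"]] = message


def joined_view():
    """Resolve the instrument -> issuer join locally, as each consumer must."""
    rows = []
    for inst in instruments.values():
        # The issuer is None until its legal entity message has been landed.
        issuer = legal_entities.get(inst["issuer_id"])
        rows.append({
            "instrument_id": inst["instrument_id"],
            "issuer_name": issuer["name"] if issuer else None,
        })
    return rows


land({"domain": "instrument", "instrument_id": "I1", "issuer_id": "E1"})
before = joined_view()  # issuer not landed yet, so the join is incomplete
land({"domain": "legal_entity", "entity_id": "E1", "name": "Example Issuer"})
after = joined_view()
```

The incomplete `before` result illustrates the staleness and error risk that arises when every consumer repeats this logic independently.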
10. 9
Driver: Simplification
– Situation: Multiple components doing the same or similar tasks
– Impact: Cost+, Complexity+
Driver: Risk Reduction
– Situation: Consumers have to store and join reference data
– Impact: Data staleness+, Potential for errors+
Driver: Efficiency
– Situation: Consumers receive updates they are not interested in
– Impact: Storage volumes+, Processing volumes+
Business Drivers for Change
11. 10
Features
– Single platform consuming data from masters
– Platform integrates datasets
– Custom or normalized datasets sent via standardized channels
Distribution Platform – Blueprint
12. 11
Example – Platform joins
Data joined at source and available for multiple consumers – simplifies consumption
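A rough Python sketch of the platform-side alternative follows. It is only an illustration under assumptions: the function names (`platform_join`, `project`), the consumer names, and the sample records are hypothetical and do not describe the actual UBS implementation. The join runs once at the platform, and each consumer then receives a custom dataset containing only the attributes it asked for:

```python
def platform_join(instruments, legal_entities):
    """Join once at the platform; every consumer reuses the result."""
    joined = []
    for inst in instruments:
        issuer = legal_entities.get(inst["issuer_id"], {})
        joined.append({**inst, "issuer_name": issuer.get("name")})
    return joined


def project(rows, attributes):
    """Keep only the attributes a consumer subscribed to."""
    return [{key: row[key] for key in attributes} for row in rows]


# Illustrative sample data (assumed, not real reference data).
instruments = [{"instrument_id": "I1", "issuer_id": "E1", "currency": "CHF"}]
legal_entities = {"E1": {"name": "Example Issuer"}}

# Hypothetical consumers and their attribute subscriptions.
subscriptions = {
    "risk_system": ["instrument_id", "issuer_name"],
    "settlement_system": ["instrument_id", "currency"],
}

joined = platform_join(instruments, legal_entities)
deliveries = {
    consumer: project(joined, attributes)
    for consumer, attributes in subscriptions.items()
}
```

Compared with consumer-side joins, the join logic lives in one place, and consumers no longer store attributes they never use.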
13. 12
Single Platform
Pre-joined datasets
Flexible subscription to attributes
More consumer-oriented …
But there are still things we'd like to know …
Platform Benefits
17. 16
What datasets and attributes do we provide?
How are the different datasets related?
How are users receiving our data?
Which consumers are using which attributes?
Data Distribution – Questions
18. 17
What datasets and attributes do we provide?
How are the different datasets related?
How are users receiving our data?
Which consumers are using which attributes?
Knowledge Base
Data Distribution – Questions
19. 18
A system component that lets us describe the journey of the
datasets and attributes from master systems to consumers
What is the Knowledge Base?
21. 20
Initially, platform (not human) requirements
XLS + custom DSL (Domain Specific Language)
E.g. composite INSTRUMENT dataset
– BOND_BONDRATING, EQUITY_EQUITYRATING
– "," denotes a union between two datasets
– "_" denotes a join between two datasets
Innovative, and allowed us to build the platform
Limited, Complex, Inflexible
Physical Model – 1.0
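The DSL notation above can be illustrated with a small parser. This is a sketch only: the slides do not give the grammar, so it assumes that "_" (join) binds tighter than "," (union) and that dataset names contain neither character. The function name `parse_composite` is hypothetical.

```python
def parse_composite(expression):
    """Parse 'A_B, C_D' into a list of join groups that are unioned together.

    Assumption: '_' (join) binds tighter than ',' (union), and dataset
    names themselves contain no underscores or commas.
    """
    union_branches = []
    for branch in expression.split(","):
        datasets = [name.strip() for name in branch.split("_") if name.strip()]
        if datasets:
            union_branches.append(datasets)
    return union_branches


# Outer list: branches of the union. Inner lists: datasets joined together.
instrument = parse_composite("BOND_BONDRATING, EQUITY_EQUITYRATING")
```

Under these assumptions the composite INSTRUMENT dataset reads as "BOND joined with BONDRATING, unioned with EQUITY joined with EQUITYRATING". Encoding structure in delimiters like this is compact, but it hints at why the approach became limited and inflexible as the model grew.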
23. 22
Challenging to build a relational model that answers all the (diverse) questions
Lots of different entities …
Lots of different relationships …
Not all data flows are the same …
Tough to get performance needed with a generic relational model
… Not really or easily anyway
25. 24
Store the metamodel in a graph database
Neo4j
– Used in the Bank
– Mature
– Comprehensive resources online
– Drivers / Adapters matching language choices
Physical Model – 2.0
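To show what storing a metamodel as a graph means, here is a tiny in-memory stand-in written in plain Python. It is not Neo4j and not the Bank's model: the `MetaGraph` class and the sample names (`BOND`, `ISIN`, `MATURITY_DATE`) are assumptions for illustration, while the `Dataset` and `PhysicalAttribute` labels and the `OWNS` relationship type are borrowed from the Cypher queries in this deck.

```python
from collections import namedtuple

Node = namedtuple("Node", ["label", "name"])
Rel = namedtuple("Rel", ["start", "type", "end"])


class MetaGraph:
    """Minimal labelled graph: nodes carry a label, relationships a type."""

    def __init__(self):
        self.nodes = set()
        self.rels = []

    def relate(self, start, rel_type, end):
        """Add a typed, directed relationship, registering both end nodes."""
        self.nodes.update((start, end))
        self.rels.append(Rel(start, rel_type, end))

    def match(self, start_label, rel_type, end_label):
        """Return (start, end) pairs for one labelled, typed hop."""
        return [
            (rel.start, rel.end)
            for rel in self.rels
            if rel.type == rel_type
            and rel.start.label == start_label
            and rel.end.label == end_label
        ]


kb = MetaGraph()
bond = Node("Dataset", "BOND")
kb.relate(bond, "OWNS", Node("PhysicalAttribute", "ISIN"))
kb.relate(bond, "OWNS", Node("PhysicalAttribute", "MATURITY_DATE"))

# Roughly what (d:Dataset)-[:OWNS]->(a:PhysicalAttribute) expresses.
owned = [(d.name, a.name) for d, a in kb.match("Dataset", "OWNS", "PhysicalAttribute")]
```

New kinds of nodes or relationships need no schema change here, which is the flexibility a graph database offers for a metamodel that keeps evolving.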
27. 26
Answers to the questions …
What datasets and attributes do we provide?
MATCH (d:Dataset)-[:OWNS]->(a:PhysicalAttribute)
RETURN d, a;
CYPHER QUERY
28. 27
Answers to the questions …
How are the different datasets related?
MATCH (d1:Dataset)<-[:JOINS]-(j:JoinRelation),
      (d2:Dataset)<-[:JOINS]-(j)
RETURN d1, j, d2;
CYPHER QUERY
29. 28
Answers to the questions …
How are users receiving our data?
MATCH (c:Consumer)-[:RECEIVES_VIA|INTERESTED_IN]->(v)
RETURN c, v;
CYPHER QUERY
30. 29
Answers to the questions …
Which consumers are using which attributes?
MATCH (c:Consumer)-[:INTERESTED_IN]->(view:Dataset),
      (view)-[:SELECTS]->(output:Dataset),
      (output)<-[:TARGET_OF]-(aggregation:Transformer)-[:SOURCE_OF]->(aggregate:Dataset),
      (aggregate)-[:OWNS]->(parts:Dataset),
      (parts)-[:OWNS]->(a:PhysicalAttribute)
RETURN c, view, output, aggregation, aggregate, parts, a;
CYPHER QUERY
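The multi-hop pattern in this query can be read as a fixed sequence of relationship hops. The following Python sketch walks the same sequence over an in-memory edge list. It is an illustration under assumptions: the node names (`risk_system`, `risk_view`, `agg_1`, and so on) are invented, node labels are ignored for brevity, and only the relationship types and directions come from the query.

```python
from collections import defaultdict

# (start, relationship type, end); node names are invented for illustration.
EDGES = [
    ("risk_system", "INTERESTED_IN", "risk_view"),
    ("risk_view", "SELECTS", "instrument_out"),
    ("agg_1", "TARGET_OF", "instrument_out"),
    ("agg_1", "SOURCE_OF", "instrument_agg"),
    ("instrument_agg", "OWNS", "BOND"),
    ("BOND", "OWNS", "ISIN"),
    ("BOND", "OWNS", "MATURITY_DATE"),
]

outgoing = defaultdict(list)
incoming = defaultdict(list)
for start, rel_type, end in EDGES:
    outgoing[(start, rel_type)].append(end)
    incoming[(end, rel_type)].append(start)

# Each step mirrors one hop of the Cypher pattern: (type, direction).
PATH = [
    ("INTERESTED_IN", "out"),
    ("SELECTS", "out"),
    ("TARGET_OF", "in"),   # (output)<-[:TARGET_OF]-(aggregation)
    ("SOURCE_OF", "out"),
    ("OWNS", "out"),       # aggregate -> parts
    ("OWNS", "out"),       # parts -> physical attribute
]


def attributes_used_by(consumer):
    """Follow the fixed hop sequence from a consumer to its attributes."""
    frontier = [consumer]
    for rel_type, direction in PATH:
        index = outgoing if direction == "out" else incoming
        frontier = [nxt for node in frontier for nxt in index[(node, rel_type)]]
    return sorted(set(frontier))


used = attributes_used_by("risk_system")
```

Unlike the Cypher query, this sketch returns only the final attributes rather than every intermediate node on the path.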
31. 30
Single source of truth
Governance and lineage easier
New insights for consumers
New insights for producers!
Knowledge Base – Benefits
32. 31
Coverage – not all datasets entered yet
Lots of data – we store source, interim, target datasets
Concept can be a bit intangible at times
Knowledge Base – Challenges
33. 32
Data Distribution is a natural "flow" from one processing node to another
Ad-hoc relationship traversal difficult in relational databases
Flexibility essential
– New sources, datasets, consumers, rules, …
Everything is an instance
– Model very organic by focusing on relationship between processing nodes rather than structure
How did a graph database help?
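The idea of data distribution as a flow between processing nodes can be sketched as a breadth-first traversal. This is a generic illustration, not the Bank's tooling: the `FLOWS` map and its node names are assumptions made up for the example.

```python
from collections import deque

# Hypothetical flow map: processing node -> nodes it sends data to.
FLOWS = {
    "master_instrument": ["platform"],
    "master_legal_entity": ["platform"],
    "platform": ["risk_system", "settlement_system"],
    "risk_system": [],
    "settlement_system": [],
}


def downstream(start):
    """Return every node reachable from start by following data flows."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in FLOWS.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return sorted(seen)


reached = downstream("master_instrument")
```

A traversal like this is awkward to express over a generic relational model, where each extra hop typically means another self-join, whereas it follows directly from relationships stored in a graph.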
34. 33
Answers our questions … and more
Flexible schema: can model different flows
Easy(-ish) query language: Cypher
Easy to create platform service layer
Good performance
Good support from vendor
Neo4j – Benefits
35. 34
Loading data required manual work
No out-of-the-box tools to manage the data
Skills rare … but easy to grow
Neo4j – Challenges
36. 35
Focus on human interactions
– Better search
– Better visualisation
Widen coverage of datasets
Offer to other parts of Bank
Impact Analysis tools
Self-service data integration
Next steps