CONTENTS
Foreword
What you will learn
State of play
Global architecture overview
Reference architecture
The Real-time data catalog
Ensuring data quality
Regulations
About Lenses.io and Viseca
[1] VentureBeat, 2019 - Why do 87% of data science projects never make it into production?
FOREWORD
BY ANTONIOS CHALKIOPOULOS & VLADIMIRO BORSI
It's time for leading regulated companies to share great real-time data governance principles

What's the most overlooked piece of your company's data strategy? If you're like many companies today, it's probably proprietary data and its governance: data that is unique to your organization, that can be used to create a sustainable competitive advantage, and that has enough added value to make it a unique business asset.

This data can be big or small, raw or refined, structured or unstructured. As apps increasingly deliver instant information and immediate value, this data is in most cases real-time. In every case, it needs to be available to the right people, actionable, and of the highest quality.

This is one of the reasons that 87% of data projects never make it to production [1]. As data proliferates across technologies and business users collect it into silos, making analytics available to the data-literate - and not only the technology-literate - grows in complexity and expense.

At Viseca and Lenses.io, we've spent much of our time working with enterprise data engineering and analytics teams in highly regulated industries like financial services. On the data projects we've managed, granular access controls and quality assurance have been a must from the ground up rather than an afterthought. Good governance has meant empowering the people who really know the data to truly own it and innovate with it.

We also acknowledge that not every data-driven company is evangelical about deep data governance, for fear of slowing time-to-market. In the following pages, we'll outline an approach, reference architecture, and set of capabilities that have delivered the following results for leading Swiss financial services provider Viseca across customer applications and business units:

• 10x reduction in time-to-market of strategic streaming applications
• 600K targeted marketing communications sent to customers
• <10 minutes to integrate with data on demand, instead of weeks
WHAT YOU'LL LEARN
• How to sculpt event data based on the proven reference architecture of a European bank with 2+ million customers
• The elements to ensure your streaming data project is both scalable and well-governed, using logical access models to segregate responsibilities across your lines of business
• The principles and approaches behind data operations, and why a good data governance framework is central to this
[2] McKinsey, 2019 - Global Digital Transformation Survey

STATE OF PLAY
Good governance can't wait

Executives in every industry know that data is business. Without it, there can be no digital transformation to propel progress and no analytics to identify new revenue opportunities. Without data, even keeping the lights on isn't possible. But for data to fuel these initiatives, it must be:
• Readily available
• Of high quality, and
• Relevant

Without quality-assured governance, companies not only miss out on data-driven opportunities; they waste resources. McKinsey reports that "an average of 30 percent of their total enterprise time was spent on non-value-added tasks because of poor data quality and availability." [2]

Good data governance ensures high quality and availability of data so that enterprises are positioned to create value. McKinsey states: "Leading firms have eliminated millions of dollars in cost from their data ecosystems and enabled digital and analytics use cases worth millions or even billions of dollars. Data governance is one of the top three differences between firms that capture this value and firms that don't. In addition, firms that have underinvested in governance have exposed their organizations to real regulatory risk, which can be costly." [2]
GLOBAL ARCHITECTURE OVERVIEW
Moving strategic data projects past pilot purgatory

With today's enterprises rapidly and successfully adopting multiple data technologies, new data architectures must be governed by business principles and data operations (DataOps). Bringing people, real-time data and apps together in the right measure, and underpinning these business assets with the right metrics, makes global management of information more effective as well as ensuring data privacy and preparedness for emerging regulations.

The major drivers of DataOps are operational agility, governance and analytics - making data readily available, high quality and relevant. And although unlocking the value of data via increased operational agility is a key benefit of DataOps, governance initiatives are a prerequisite to introducing new apps, technologies and, with them, business opportunities.

A DataOps architecture focused on federated data, implemented over heterogeneous data stores, unleashes new opportunities. When data is well-governed from the ground up, strategic data projects are set up for success and can move more seamlessly past proof-of-technology and into production.
REFERENCE ARCHITECTURE
Turning lines of business into logical units

Set theory in mathematics is the logic that studies collections of objects. Similarly, this document defines the logical and conceptual units that can enable a future-proof implementation of federated data governance over heterogeneous systems.

This allows architects to align each logical unit, or line of business (LOB), with data, apps, rules, functions and processes in a secure data and application container. A unit or line of business (LOB) in an enterprise can be Private Banking, Mortgages, Credit Cards or other units that are segregated in terms of responsibilities and jurisdiction.
THE UNITS

Data Store
A physical data repository implemented with the same technology (i.e. Kafka or RDBMS); a base layer for data storage and retrieval.

Actor
An entity that can access and manage data and has an identity that can be authenticated (i.e. individual people, services, applications or processes).

Dataset+
A collection of homogeneous data (i.e. a Topic or Table as a dataset that lives in a Space) enhanced with context-aware, dynamic (real-time) and static metadata (i.e. descriptions in the form of schemas, tags...).

Space
A set of federated resources including data, applications, rules and processes relevant for a specific function of a Line of Business (LOB), acting as a secure data and application container.
REFERENCE ARCHITECTURE
The access model

In addition to the logical model, three technical entities complete the access model:

Identity
A unique identity of a physical person or virtual entity (a system). An Identity Provider (i.e. AD/LDAP, SSO) authenticates and provides a unique physical or virtual Identity.

Collection
A data repository subset where datasets are grouped by naming convention or tagging. This is a pragmatic approach that simplifies and improves access management.

Data repository
A population of datasets authorized for use in a Space (i.e. ACL-ed, permission granted). The data repository is an abstraction of the access architecture required to implement data access.

[Diagram: an Actor is authorised via its Identity (RBAC) into a Space; the Space contains Dataset+ entries (namespaced), each Dataset+ belongs to a Data Repository, and the Data Repository connects to a Data Store.]
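The units and entities above can be sketched as a minimal data model. This is an illustrative sketch only; the class and field names are assumptions, not a prescribed schema:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Identity:
    """A unique, authenticatable identity (person or system), i.e. from AD/LDAP or SSO."""
    name: str
    provider: str  # i.e. "ldap", "sso"

@dataclass
class Dataset:
    """A collection of homogeneous data, i.e. a Kafka topic or an RDBMS table."""
    name: str
    data_store: str                          # i.e. "kafka", "oracle"
    tags: set = field(default_factory=set)   # static metadata

@dataclass
class Collection:
    """Datasets grouped by naming convention or tagging."""
    name: str
    datasets: list

@dataclass
class Space:
    """A secure data and application container for one function of a LOB."""
    lob: str        # i.e. "Private Banking"
    function: str   # i.e. "Marketing"
    collections: list

# A Space for the Marketing function of a Private Banking LOB:
campaigns = Dataset("marketing.campaigns", "kafka", {"marketing", "pii"})
space = Space("Private Banking", "Marketing",
              [Collection("marketing-data", [campaigns])])
```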
[3] Production environment: an environment that contains real data, even partially

REFERENCE ARCHITECTURE
The access roles

An Actor (an application or a user role) has an authorized Identity and is granted fine-grained RBAC (Role Based Access Control) over a collection of datasets across multiple data repositories.

The RBAC management process must be:
• Based on a finite set of rules
• Subject to versioning, each version representing a finite state
• Deployed in a production environment [3] through a formalized/managed change process
• Traceable in terms of planned/firecall changes (audit log)
• Defined in terms of process roles: requestor, approver or change executor

Access Officer
The Access Officer is accountable and responsible for Identity and RBAC rules management, and also acts as the first line of defense.
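The versioned, role-separated change process above can be sketched in a few lines; the structure and names here are illustrative assumptions, not a reference implementation:

```python
from dataclasses import dataclass, field

@dataclass
class RbacRule:
    actor: str       # identity of the application or user role
    collection: str  # collection of datasets the rule applies to
    permission: str  # i.e. "READ" or "READ|WRITE"

@dataclass
class RbacRuleSet:
    """A finite set of rules; every change produces a new version (a finite state)."""
    version: int = 0
    rules: list = field(default_factory=list)
    audit_log: list = field(default_factory=list)

    def apply_change(self, rule, requestor, approver, executor):
        # Process roles must be distinct: requestor, approver, change executor.
        if len({requestor, approver, executor}) != 3:
            raise ValueError("requestor, approver and executor must be distinct")
        self.rules.append(rule)
        self.version += 1
        # Traceability: every planned/firecall change lands in the audit log.
        self.audit_log.append((self.version, rule.actor, requestor, approver, executor))

ruleset = RbacRuleSet()
ruleset.apply_change(RbacRule("app-a", "marketing-data", "READ"),
                     requestor="alice", approver="bob", executor="carol")
```

Each applied change bumps the version, so any historical state of the rule set can be reconstructed and audited.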
REFERENCE ARCHITECTURE
The business relationship

Lines of Business (LOB) are driving new initiatives with a data-driven approach. Data and application logic exist in a Space in which technology and domain expertise must integrate to operate business processes.

With multiple LOBs and Spaces executing initiatives in parallel, an enterprise must also be in a position to aggregate information, to control and enforce Data Governance practices globally.

[Diagram: a LOB (i.e. Private Banking) manages and governs multiple Spaces - such as a Customer Support Space and a Marketing Space - over existing Data Stores (i.e. Kafka, HDFS, ERP, CRM) on a logical level, with a Global Space above them.]

Global Data Governance can be achieved by dynamically aggregating all the logical data, metadata and processes across multiple Spaces.
[4] Gartner glossary

REFERENCE ARCHITECTURE
Data governance goals

"Data governance is the specification of decision rights and an accountability framework to ensure the appropriate behavior in the valuation, creation, consumption and control of data and analytics." [4]

Data governance enables risk management and value creation for the organization with the following pillars:

Data confidentiality
Minimal access to data, only for a known, accepted scope; based on the principle of least privilege (POLP - giving a user account or process only those privileges which are essential to perform its intended function).

Data integrity
Data is scoped to pre-defined, known rules governing each piece of information and its relation to other data (referential integrity).

Data availability
Data supports operational and business processes as expected, in time and as per deliverables.

Contribution to value creation
Data-driven organization knowledge enhancement for in-time strategic decisions. This requires integrated stakeholders, digital interactions and communication processes, such as for Business Intelligence, analytics, Know Your Customer or 360° Customer View.
REFERENCE ARCHITECTURE
Granting access

Data Access, as both an enabler and a guarantor of Data Confidentiality, must be enforced at two levels.

1. Technical
At the Data Store infrastructure level, when access is granted over a Space.

[Diagram: a Kafka admin (Data Store admin) grants SPACE-A a data repository of topics A1-A4 and SPACE-B a data repository of topics B1-B4.]
2. Logical
At the Space level, within the Line of Business.

[Diagram: federated data access control model via LOB delegation. Within a LOB (i.e. Private Banking), a Marketing Space owner applies role-based permissions at the logical layer: App A is granted access over Collection-1 (Topic A, Topic B) and App B over Collection-2 (Topic A, Table A), reaching the underlying Kafka and Oracle Data Stores - each managed by its DS owner at the technical layer - over bi-directional secure connections.]
REFERENCE ARCHITECTURE
Granting access - how it works

The Space owner is granted access by the Access Officer into the relevant logical Data Repositories, resulting in self-sufficient support, information control and access governance.

Fine-grained access and confidentiality levels are set within each Space and require different levels of access depending on the Actor. Numerous personas collaborate in a Space, including Data Engineers, Business Analysts, Data Scientists, Testers and Product Owners.

The Space acts as a federated data container that enables secure and efficient access to Datasets.

A Collection groups together Datasets across multiple Data Store systems, enabling Actors to access them. The entire design aims at providing a technology-independent and practical solution that can be seamlessly applied over a large number of heterogeneous Data Store systems.
REFERENCE ARCHITECTURE
The Data Container

Data Governance must introduce a technology-agnostic, future-proof architecture, standards and processes around data knowledge management. The "secure gateway" is a well-known pattern in enterprise security.

The concept is based on isolating and restricting infrastructure access to technicians by providing a secure gateway (also commonly known as a workspace, bastion or jump-box) to access infrastructure for technical administration purposes.

[Diagram: an admin reaches infrastructure over SSH only through a workspace, with an audit log feeding a SIEM; similarly, a Space applies RBAC, a Data Policy and audit logging (SIEM) in front of the Data Repositories.]
The same concept can be effectively applied to cover modern data repositories and provide a secure layer for governance and management of information and knowledge. When an Actor (a User or Application) accesses a data repository via a secure Space, there are numerous opportunities for enhanced, technology-transparent governance:

Role Based Access Control (RBAC)
This layer ensures that the Identity of the Actor is verified and appropriate data access permissions are available.

Data Policy (DP)
This layer works with a Data Catalog, automatically and in real-time, to detect sensitive data (i.e. PII, GDPR, HIPAA) and applies data anonymization and masking to data repositories.

Audit Log (AL)
This layer captures any Actor activity in the audit logs, flagging access to classified/sensitive data.

A future-proof governance framework requires a technology-agnostic, plug-and-play design to avoid vendor and technology lock-in:
• RBAC must seamlessly work with SSO, LDAP, AD, Kerberos or any other identity provider
• DP enforcement should enable global policies to be pushed down to Spaces and LOBs
• AL must be pluggable to multiple Security Information and Event Management (SIEM) systems
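A Data Policy layer of the kind described above can be sketched in a few lines: detect sensitive fields and mask them before the Actor sees the record. The field-name fragments and masking rule below are simplified, illustrative assumptions:

```python
# Field-name fragments the policy treats as sensitive - an assumed, simplified catalog.
SENSITIVE_FRAGMENTS = ("email", "phone", "ssn", "card")

def classify(field_name):
    """Return 'pii' if the field name matches a sensitive fragment, else 'public'."""
    lowered = field_name.lower()
    return "pii" if any(f in lowered for f in SENSITIVE_FRAGMENTS) else "public"

def apply_policy(record):
    """Mask every field classified as PII; leave the rest untouched."""
    return {k: ("*****" if classify(k) == "pii" else v) for k, v in record.items()}

masked = apply_policy({"customer_email": "a@b.com", "country": "CH"})
```

A real Data Policy would classify fields from catalog metadata rather than name matching, but the push-down shape - classify, then mask at read time - is the same.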
REFERENCE ARCHITECTURE
Dataset+

Datasets are enriched with contextual information statically (i.e. users adding labels, tags or descriptions) for improved classification and categorization, and dynamically and automatically in real-time with meta-information. In this model, the statically and dynamically enriched Dataset is called Dataset+.

[Diagram: a Dataset+ combines connection, technical metadata, tags, apps, activity, and data & business quality SLAs.]
The Dataset+ contains information such as:

• Where does the data live? - Dynamic
• What technical metadata exists? - Static & Dynamic
• What applications & processes are using them? - Dynamic
• What data and business quality SLAs and alerts are set up? - Static
• Who is using the data? - Dynamic
• How is the data classified and categorized (via labels/tags)? - Static

Dynamic information is populated in real-time, in a fully automated fashion, within the data container/space, and static information is contextualized via human operations. The Dataset+ is pivotal to the Data Catalog.
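The static and dynamic facets of a Dataset+ could be represented as a single record, with the dynamic fields refreshed automatically by the platform. All field names and values below are illustrative assumptions:

```python
# Static metadata: contextualized by humans.
static_meta = {
    "tags": ["marketing", "pii", "crm"],
    "description": "Enriched CRM customer events",
    "quality_slas": {"max_latency_seconds": 10},
}

# Dynamic metadata: populated automatically, in real-time, by the data container.
dynamic_meta = {
    "data_store": "kafka",        # where the data lives
    "schema_version": 3,          # technical metadata
    "consuming_apps": ["app-a"],  # applications & processes using it
    "active_actors": ["alice"],   # who is using the data
}

# A Dataset+ is the union of both facets.
dataset_plus = {"name": "marketing.crm.customers", **static_meta, **dynamic_meta}
```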
REFERENCE ARCHITECTURE
Tags

The concept of tags (or labels) is a capability that can strategically address multiple enterprise governance challenges by logically grouping and linking objects like datasets, application resources and fields. It solves the problem of master data management with an agile approach (by scoping and enabling each line of business to define their own context and language within the space they operate in) and covers use cases such as:

• Classifying data based on context, i.e. [ marketing ] or [ crm ]
• Identifying sensitive data, i.e. [ pii ] [ financial ] [ confidential ]
• Applying role-based access on collections of data using include or exclude rules
• Coping with auditing reviews, i.e. by focusing on [ pii ] data

Dataset Tags
A set of tags classifies and organizes datasets. For example, the context of "a dataset with marketing data, that contains personally identifiable information (pii), originating from the CRM, that is already enriched" can be expressed with the following tags:

[ marketing ] [ pii ] [ crm ] [ enriched ]

This classification drives reporting, analytics, auditing and discoverability. A commonly used primitive form of tagging is implementing naming semantics (namespacing).

Application Tags
A set of tags classifies and organizes business processes, applications and quality guarantors.

An organization can "free the data" by agreeing on sensitivity levels and giving employees access to use and explore it.
Field Tags
These enable dynamic/logical linking between individual information fields.

Value creation with tagging
Tagging represents a value-creation process because it is:
• Business oriented
• Supportive of multiple scopes
• Independent from data store technology
• Implemented if and when useful (cost/benefit ratio)
• Managed by data users

[Diagram: RBAC example - an Actor is granted access and use over Kafka and MS SQL datasets, excluding those tagged [ confidential ] [ pii ].]
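The include/exclude rules described above reduce to a few lines of set logic. The datasets and tags below are illustrative examples, not real resources:

```python
datasets = {
    "marketing.campaigns": {"marketing", "enriched"},
    "crm.customers": {"crm", "pii", "confidential"},
    "iot.telemetry": {"iot"},
}

def accessible(datasets, include=None, exclude=frozenset()):
    """Grant access to datasets matching an include tag, minus any excluded tag."""
    granted = set()
    for name, tags in datasets.items():
        if include is not None and not (tags & set(include)):
            continue  # no include tag matched
        if tags & set(exclude):
            continue  # an exclude rule (i.e. [ confidential ] [ pii ]) fires
        granted.add(name)
    return granted

# An Actor may use marketing, iot and crm data, but never [ confidential ] [ pii ] datasets:
allowed = accessible(datasets, include={"marketing", "iot", "crm"},
                     exclude={"confidential", "pii"})
```

Because the rule refers to tags rather than dataset names, newly tagged datasets are covered automatically, independently of the underlying data store.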
THE REAL-TIME DATA CATALOG

The Data Catalog (DC) is a known concept with well-defined benefits to an organization, providing a "Google Search" and "Google Maps" view over data and applications.

The needs of a Real-time Data Catalog are:
• Active and automatic discoverability of data, information and applications
• Classification and categorization via tagging/labelling
• Schema and metadata monitoring
• Organizational and auditing reporting

[Diagram: within a LOB (i.e. Private Banking), a real-time Data Catalog with a Data Policy sits over a Marketing Space, connected over bi-directional secure connections to a Kafka Data Store (Topics A-C) and an Oracle Data Store (Tables A-C), each managed by its DS owner.]
THE REAL-TIME DATA CATALOG
How it works

The Data and Application Catalog operates in real-time, continuously identifying data and schema changes and storing information such as which Actor (Application or Human) is accessing and using any data. This builds and preserves a rich graph of interactions. Apart from the business benefits of a Data Catalog, the following capabilities are critical in governing confidential data:

• Automatically identifying all new data via a real-time continuous learning process
• Applying Data Policies to automatically monitor and mask sensitive data
• Providing enriched Audit Logs for reporting
• Identifying whether the principle of least access is applied to Actors

The principles of a real-time Data Catalog per Space focus on avoiding cataloging ALL information related to every data point (a humongous task). Instead, it enables Lines of Business to self-govern their own assets.

Global Data Governance can then easily be built up using the local data catalogs and applications, seamlessly enabling stakeholders, data officers and auditors to run reports and review governance levels.

[Diagram: a Global Space aggregates multiple Spaces.]
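One of the checks listed above - whether the principle of least access holds - can be sketched by comparing granted permissions against the activity the catalog actually observed. The actors, permissions and data here are illustrative assumptions:

```python
# Permissions granted to each Actor vs. datasets the catalog observed them using.
granted = {"app-a": {"topic-a", "topic-b", "topic-c"}, "app-b": {"table-a"}}
observed = {"app-a": {"topic-a"}, "app-b": {"table-a"}}

def unused_grants(granted, observed):
    """Report grants never exercised - candidates for revocation under least privilege."""
    return {actor: perms - observed.get(actor, set())
            for actor, perms in granted.items()
            if perms - observed.get(actor, set())}

report = unused_grants(granted, observed)
```

Here app-a holds two grants it never used, while app-b is already at least privilege, so only app-a appears in the report.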
THE REAL-TIME DATA CATALOG
The components

On the Global Data Catalog, an Auditor can review Global Data Governance rules and also inspect different LOBs. A Governance Data Officer can also review reports and policies across multiple LOBs.

Data Catalog
Where it is, how to use it, who is using it

Dictionary
What it is, why it exists

Schema
Technical information on how to use the data at a field level

Associations
Data lineage, audits, policies, rules, metrics, reports

[Diagram: a Data Catalog for agility and governance - search & discover, describe & tag, track & protect sensitive data, topology & reporting.]
[5] Forbes, 2016 - Cleaning Big Data

Ensuring data quality

"Approximately 60% of the time and effort of Data Scientists is spent on cleaning and organizing data." [5]

In this logical model, we define the set of principles upon which Data Quality - a dimension of Data Governance - can be governed. Its main areas are:

Semantic Quality
The technical schema of a Dataset adheres to formal criteria, i.e. age is INTEGER. This is typically guaranteed by the technology itself. Looking at the bigger picture, it is highly valuable to identify how schema changes might affect business outcomes across the entire estate.

Information Quality
The actual information within the data adheres to business quality rules, i.e. the age of a card-holder can be >16 and <120 years; the temperature of an IoT device should be between 0°C and 100°C. These rules require an external guarantor process based on technology-agnostic business rules that evolve over time. This includes "volume" quality, which identifies anomalies in produced or consumed datasets: if the business is expecting an average of 1-10 million IoT device events per day, for example, a statistical analysis on metadata can reveal information quality issues.

In-Time Quality
Is data consumed and produced in time with business and operational needs, for example codified in an SLA as "within 10 seconds"? This requires an external guarantor process, based on statistical analysis, that manages false/true positives.

Referential Quality
Does data in a Data Store have referential integrity within the enterprise universe? I.e. the e-mail address is in Apache Kafka, but do I have a customer in the CRM with that email address? (Cross-Data-Store quality checks based on rules.)

[Diagram: data quality rules on a Dataset+ - semantic, information, in-time and referential quality - trigger alerts and report insights.]
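As a sketch, each of the four quality areas above can be expressed as a rule over a record or its metadata. The thresholds and field names are the document's own examples or illustrative assumptions:

```python
def semantic_ok(record):
    """Semantic quality: the schema adheres to formal criteria, i.e. age is INTEGER."""
    return isinstance(record.get("age"), int)

def information_ok(record):
    """Information quality: the value obeys business rules, i.e. 16 < age < 120."""
    return isinstance(record.get("age"), int) and 16 < record["age"] < 120

def in_time_ok(latency_seconds, sla_seconds=10):
    """In-time quality: consumed/produced within the SLA (i.e. within 10 seconds)."""
    return latency_seconds <= sla_seconds

def referential_ok(email, crm_emails):
    """Referential quality: the email seen in Kafka exists as a customer in the CRM."""
    return email in crm_emails

record = {"age": 34, "email": "a@b.com"}
checks = [semantic_ok(record), information_ok(record),
          in_time_ok(4), referential_ok(record["email"], {"a@b.com"})]
```

An external guarantor process would evaluate such rules continuously over the Dataset+ and trigger alerts on any failure.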
REGULATIONS
Responding to data in real-time

Data Governance should go above and beyond box-checking for regulatory requirements, but regulations do provide a set of guiding principles to center on and expand upon.

Data masking
Data masking is appropriate for ensuring privacy. A high degree of automation in metadata collection in a data catalog can protect sensitive data (at a column/field level). Note that PCI compliance can be tackled via "tokenization" of credit card numbers, data expiration and CIV.

Data access alerts
Regulations (such as MiFID) impose real-time reporting of data access to reduce time-to-detect in the event of a breach. This also provides justification of data access: for example, a marketing campaign manager who needs to access customer information to scope their campaign must provide this justification as part of viewing the data.

[Diagram: data governance and regulations - tokenization, the privilege to know, the right to be forgotten; i.e. access "non-confidential" data, GDPR-forget "EMAIL@XX.COM", mask customer phone numbers (GDPR, HIPAA).]
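The tokenization idea mentioned for PCI can be sketched minimally: replace the card number with an opaque token and keep the mapping in a separate, tightly controlled vault. The vault here is just a dict for illustration; a real deployment would use a separately secured store:

```python
import secrets

vault = {}  # token -> real card number; in practice a separately secured store

def tokenize(card_number):
    """Replace a card number with an opaque token; only the vault can reverse it."""
    token = "tok_" + secrets.token_hex(8)
    vault[token] = card_number
    return token

def detokenize(token):
    """Recover the real card number - a privileged operation behind the vault."""
    return vault[token]

token = tokenize("4111111111111111")
```

Downstream systems can then store and process the token freely, keeping the real card number out of scope.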
The privilege to know
In the context of putting data in front of the right Actor, there are some interlinking and logical solutions to common regulatory issues (i.e. IT general controls for SOX), such as the "privilege to know".

In the context of real-time and dynamic data, "row level" permissions can address this. Imagine a credit card processor with datasets (i.e. in Kafka, or HDFS) of transactions streaming from different service providers:

AMEX | TRANSACTION_ID | ..
MASTERCARD | TRANSACTION_ID | ..

For the internal client organization, a supplier or Actor should only access a subset of the above events, i.e. a specific provider (i.e. AMEX).

The right to forget
The "right to forget" is part of the European citizens' GDPR regulation. By introducing privilege-based rules on datasets, one can avoid technical and costly implementations such as crypto-shredding.
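A "row level" permission of this kind amounts to a per-Actor predicate over the stream. The provider names and rule format below are illustrative assumptions:

```python
events = [
    {"provider": "AMEX", "transaction_id": "t1"},
    {"provider": "MASTERCARD", "transaction_id": "t2"},
    {"provider": "AMEX", "transaction_id": "t3"},
]

# Row-level rules: each Actor may only see events from its authorized providers.
row_rules = {"amex-supplier": {"AMEX"}}

def visible_events(actor, events, rules):
    """Filter the stream down to the rows the Actor is privileged to know."""
    allowed_providers = rules.get(actor, set())
    return [e for e in events if e["provider"] in allowed_providers]

amex_view = visible_events("amex-supplier", events, row_rules)
```

An Actor with no rule sees nothing by default, which keeps the model aligned with the principle of least privilege.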
About Lenses.io
lenses.io/
Lenses.io was founded to help enterprises simplify their application
development and transform data operations by making data work for
teams and not the other way around.
Around 100 organizations and 25,000 engineers around the globe have
placed their trust in Lenses.io to help them build, maintain, and secure
their most strategic applications.
Engineers use Lenses workspaces to build and operate real-time
applications on any Apache Kafka. By enabling teams to monitor,
investigate, secure and deploy on their data platform, organizations can
shift their focus to data-driven business outcomes and help teams get
their weekends back.
About the authors
ANTONIOS CHALKIOPOULOS, CEO
With 20 years of engineering experience, Antonios has led and contributed to big data and digital transformation projects in finance, media and government for organizations including Barclays, BSkyB and the Ministry of Education for Greece.
In 2017 Antonios co-founded Lenses.io, and with it a
new DataOps movement, to make it possible for
everyone, in any company, to access, understand and
maximize their data through application development
and deployment.
VLADIMIRO BORSI, ENTERPRISE IT ARCHITECT
Vladi has been in the IT and services industry for 25
years, primarily focusing on enterprise architectures
and processes.
Viseca – Swiss cashless competence
www.viseca.ch/en
Viseca is a leading Swiss provider of cashless payment products and services. This includes issuing Viseca and Accarda payment cards and developing innovative finance management solutions through Contovista. In 2019, revenue was CHF 544.2 million and net profit was CHF 58.3 million. Viseca is wholly owned by the largest Swiss cantonal and retail banks. These include all cantonal banks, the Raiffeisen Group, Migros Bank, Bank Cler, regional banks and a number of private and commercial banks.
DARIO CARNELLI, GOVERNANCE SPECIALIST
Dario Carnelli is an ISACA-certified governance expert.
Start your DataOps journey at lenses.io

CASE STUDIES - Read more
LENSES WORKSPACE - Choose
SLACK COMMUNITY - Join