2. Introduction
Neil Murray – Lead Big Data Architect
• Big Data platforms, products and solutions
• Telco
• Public Sector
6point6 - Technology consultancy with strong expertise in digital, data,
emerging technology and cyber.
About
3. Digital
Helping businesses on their
digital transformation journey,
which we see as a continuum
from past to future.
Data
Helping businesses
leverage data platforms,
data science and data
engineering to drive value
from the data they generate.
Cyber
Helping businesses
understand, manage and
contain cyber risks, with
appropriate measures,
prioritised for all their
digital assets.
What we do
Emerging
Technology
Working with businesses to
combine human and
artificial intelligence (AI),
optimising data, process
and technology to support
enhanced decision making.
4. Relational Database Stockholm Syndrome - the psychological
phenomenon, often observed in hostage situations, in which the hostages
(Data/Services) come to identify with (and sympathise with) their captor
(Relational Database), even though they are trapped.
Situation
5. Typical Situation : Data Warehouse
• The Challenge:
• 40+ siloed data sources ingested at irregular intervals ‘as-is’ into a data
warehouse
• Deploy an OLAP data warehouse to co-locate the previously disparate data
for exploration, analysis and analytics.
• A need for abstraction! Business domains, entities and data models to be
identified - subsets of data transformed
• The Result:
• Batch based, single vendor, MPP OLAP shared database with a universal
interface - SQL
• Data team select ELT tooling leveraging push-down SQL to exploit DB
investment – storage and compute
• Platform team to monitor, support, tune, tidy and upgrade
We have all lived the typical situation
[Diagram: OLTP databases, files (CSV/XML/XLS), APIs (REST/gRPC) and streams (Kafka/CDC) collected via a specific ELT tool (Extract Load Transform) into an MPP OLAP relational data warehouse holding atomic and abstract data]
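The ELT pattern above can be sketched in a few lines. This is a minimal illustration, not the deck's actual tooling: an in-memory SQLite database stands in for the MPP OLAP warehouse, and the table and column names are hypothetical. The key point is that the Transform step is push-down SQL executed inside the database, so the warehouse's own storage and compute do the work.

```python
import sqlite3

# In-memory SQLite stands in for the MPP OLAP warehouse (illustration only).
db = sqlite3.connect(":memory:")

# Extract + Load: raw source rows land 'as-is' in a staging table.
db.execute("CREATE TABLE stg_orders (order_id TEXT, amount TEXT, region TEXT)")
db.executemany(
    "INSERT INTO stg_orders VALUES (?, ?, ?)",
    [("o1", "10.50", "EMEA"), ("o2", "4.25", "EMEA"), ("o3", "7.00", "APAC")],
)

# Transform: push-down SQL runs inside the database engine itself,
# exploiting the warehouse investment in storage and compute (ELT).
db.execute(
    """
    CREATE TABLE fct_region_sales AS
    SELECT region, SUM(CAST(amount AS REAL)) AS total
    FROM stg_orders
    GROUP BY region
    """
)

rows = dict(db.execute("SELECT region, total FROM fct_region_sales"))
print(rows["EMEA"], rows["APAC"])  # 14.75 7.0
```

The same shape scales to the 40+ siloed sources described above: load everything raw, then let the single shared engine reshape it.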
6. Typical Situation : Data Access
• Services deployed - Reporting, Analytics, Search,
Rules Engines, ML, …
• Services access data only via abstract layer
• SQL interface (JDBC, ODBC, Clients)
• Storage and compute - utilise DB to transform and
shape data for specific service usage
• Emerging pattern of exporting transformed data
where appropriate to support specific use cases
The service solution
[Diagram: search, reporting and rules services, data scientists/analysts and other service consumers all reading abstract data from the shared MPP OLAP relational data warehouse]
7. Exercising Control
• How to support rich data structures?
• Impossible to predict future needs!
• Data ‘lost in translation’
• Difficult to model key metadata
• Design shortcuts lead to bad, leaky data
• Changes to the schema design are difficult to manage and
have an enormous blast radius
Relational Schema is too rigid
8. Exercising Control
• Services struggle to be independently deployable using a
single shared database
• Transformations embed business logic in the DB
• Failed attempts to create Service APIs
• ELT tooling may only work for the target DB
• ELT requires careful scheduling and hand-cranking
Lost independence
9. Exercising Control
• Complex data ecosystem
• Limited shared resources
• SQL interface
• Punishing queries
• High-latency, polling-based 'subscriptions'
• Poll/batch/bulk/delta are the only viable approaches =
stale data
Performance
[Diagram: an additional analytics service joins search, reporting and rules in contending for the shared MPP OLAP data warehouse]
10. Sympathy
• The temptation to reach around the abstraction layer is
irresistible! Hybrid access patterns emerge
• Spaghetti SQL code harbouring complex nested
dependencies
• No protection from upstream changes
• Tight coupling and technical debt that will never be
repaid
Workarounds
11. Sympathy
• Solutions converge around data locality (gravity)
• ELTTTTTTT (transform after transform after transform)
• Repetition of compute
• Challenging to share
• Challenging to maintain
• Bespoke frameworks emerge
Locality and Transformations
[Diagram: a bespoke transformation framework emerges alongside the services clustered around the shared warehouse]
12. Data Hostage Test
• Does your data reside in a single DBMS, and does the SMT frequently discuss the purchase of additional nodes,
capacity and licenses?
• Is the business slow to adapt? Are opportunities frequently missed?
• Is it easy to trial/adopt new technologies?
• Is there an appetite for change, or does change = unpredictable risk and cost?
• Do you have agile teams, yet agility is not reflected in services/products?
• Is there a personnel skew towards a large DBA/Platform team?
• Cookie-cutter solution architecture - one tool fits all? 'We've always done it this way!'
Question your data captivity…
13. Data Liberation
• Kappa Architecture (J. Kreps)
• Turning the Database Inside Out (M. Kleppmann)
• A Database Unbundled (B. Stopford)
• Deconstruct the data warehouse
• Data relocated to a distributed log
• State management
• Queries/Projections relocated to services
• The right tool for the right job
• Separation of storage and compute - BYOC
An alternative approach
[Diagram: OLTP databases, files (CSV/XML/XLS), APIs (REST/gRPC) and streams (Kafka/CDC) collected via ETP record assembly (Extract Transform Publish) into events on a distributed log with state processing (Apache Kafka), feeding search, reporting and rules services, document and object storage, and data consumers]
14. Data Liberation
• Entities modelled as Domain Driven Design Aggregates or
Events
• Evolvable Schema – Avro/Protobuf/Thrift
• Support for rich data structures
• Governable/extensible/composable/versionable
• Efficient/full data capture with generics
• A contract, insulating change and democratising data whilst not
mandating the shape of storage
Domain Entities in an Evolvable Schema
[Diagram: an abstract evolvable schema built from metadata, common, type and generic attribute/feature blocks, replacing the abstract relational schema]
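The evolvability property above can be made concrete. This is a sketch only: Avro-style record schemas are written as plain Python dicts, and the entity and field names ("Customer", "email", ...) are hypothetical. The check mirrors the Avro schema-resolution rule that a field added in a new version must carry a default if old data is to remain readable.

```python
# Avro-style record schemas expressed as plain dicts (illustration only).
v1 = {
    "type": "record",
    "name": "Customer",
    "fields": [
        {"name": "id", "type": "string"},
        {"name": "name", "type": "string"},
    ],
}

# v2 evolves the entity: a new optional field carries a default, so
# records written under v1 can still be read (backward compatibility).
v2 = {
    "type": "record",
    "name": "Customer",
    "fields": v1["fields"] + [
        {"name": "email", "type": ["null", "string"], "default": None},
    ],
}

def backward_compatible(old, new):
    """Every field in the new schema absent from old data needs a default."""
    old_names = {f["name"] for f in old["fields"]}
    return all(
        f["name"] in old_names or "default" in f
        for f in new["fields"]
    )

print(backward_compatible(v1, v2))  # True
```

In practice a schema registry enforces this contract at publish time rather than in application code.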
15. Data Liberation
• Paradigm shift from stateful entity ELT to stateless
entity ETP
• Extract: extensible adapters get data from a variety of
data sources, batch or stream
• Transform: map entity data to schema
• Publish: submit commands
• Language/tool/log agnostic
Stateful Entity Data Pipeline - ETP
[Diagram: OLTP databases, files, APIs and streams feeding ETP record assembly (Extract Transform Publish), which maps entities onto the abstract evolvable schema, wraps them in command envelopes and publishes them as events to the distributed log with state processing (Apache Kafka)]
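The ETP shape described above can be sketched as three small, stateless functions. All names here are hypothetical illustrations; in the real pipeline the Publish step would hand the envelope to a Kafka producer rather than append to a list.

```python
import json

def extract(source_rows):
    """Extract: an adapter yields raw records from some source, batch or stream."""
    yield from source_rows

def transform(raw):
    """Transform: map raw source fields onto the evolvable entity schema."""
    return {"entity": "customer", "key": raw["cust_no"], "attrs": {"name": raw["nm"]}}

def publish(entity, log):
    """Publish: wrap the entity in a command envelope and submit it to the log."""
    log.append(json.dumps({"command": "upsert", "payload": entity}))

log = []  # stands in for a Kafka topic
for raw in extract([{"cust_no": "42", "nm": "Ada"}]):
    publish(transform(raw), log)

print(log[0])
```

Because each record is handled independently, the pipeline itself holds no state; state lives downstream in the log and the domain processors, which is what makes ETP language-, tool- and log-agnostic.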
16. Data Liberation
• Domain Processing consumes commands to
handle state mutation, de-duplication, out-of-
order resolution
• Utilise Kafka Streams API and RocksDB
• Produces state change events and entity
snapshots for downstream consumers
• Distributed, resilient, separation of concerns
• Eventual Consistency paradigm can remove
assembly complexity (joins)
Stateful Entity Data Pipeline – Domain Processing
[Diagram: command envelopes from the distributed log (Apache Kafka) consumed by domain/aggregate processing (Kafka Streams, RocksDB) holding state and producing state-change and snapshot fact events]
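The domain-processing responsibilities above (state mutation, de-duplication, out-of-order resolution) can be sketched as follows. This is an illustration, not the Kafka Streams topology itself: a dict stands in for the RocksDB state store, and the command fields (`id`, `key`, `ts`, `attrs`) are hypothetical.

```python
state = {}    # entity key -> {"ts": event time, "attrs": ..., "seen": command ids}
changes = []  # emitted state-change events with entity snapshots

def process(cmd):
    entry = state.setdefault(cmd["key"], {"ts": -1, "attrs": {}, "seen": set()})
    if cmd["id"] in entry["seen"]:
        return                      # de-duplication: command already applied
    entry["seen"].add(cmd["id"])
    if cmd["ts"] <= entry["ts"]:
        return                      # out-of-order: newer state already held
    old = dict(entry["attrs"])
    entry["attrs"].update(cmd["attrs"])
    entry["ts"] = cmd["ts"]
    # Emit both the state change and a full snapshot for downstream consumers.
    changes.append({"key": cmd["key"], "before": old,
                    "snapshot": dict(entry["attrs"])})

process({"id": "c1", "key": "42", "ts": 2, "attrs": {"name": "Ada"}})
process({"id": "c1", "key": "42", "ts": 2, "attrs": {"name": "Ada"}})  # duplicate
process({"id": "c2", "key": "42", "ts": 1, "attrs": {"name": "Bob"}})  # late arrival
print(len(changes), changes[0]["snapshot"])  # 1 {'name': 'Ada'}
```

In the real topology the same logic runs partitioned by entity key, which is what makes it distributed and resilient while keeping processing per key strictly ordered.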
18. Data Liberation
• Polyglot data stores, the right tool for the right
job
• Maintain existing services
• Embedded lightweight materialised views
(CQRS)
• Low latency
• Independently deployable, agile
• Service patterns: stateful (snapshot), stateless
(state change), ephemeral, one-time, serverless, …
Stateful Entity Data Pipeline - Services
[Diagram: state-change and snapshot events from the distributed log (Apache Kafka) feeding search, reporting and rules services plus document and object serving stores]
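The embedded materialised view (CQRS) idea above can be sketched as a tiny read model. This is an illustration with hypothetical event fields: the service consumes snapshot events from the log and keeps exactly the query shape it needs, locally, so lookups never touch a shared database.

```python
class SearchView:
    """A minimal embedded read model: name -> entity key, rebuilt from events."""

    def __init__(self):
        self.by_name = {}

    def apply(self, event):
        # Each snapshot event replaces this entity's entry in the local index.
        self.by_name[event["snapshot"]["name"]] = event["key"]

    def lookup(self, name):
        return self.by_name.get(name)

view = SearchView()
for event in [{"key": "42", "snapshot": {"name": "Ada"}},
              {"key": "7", "snapshot": {"name": "Grace"}}]:
    view.apply(event)

print(view.lookup("Grace"))  # 7
```

Because the view is derived entirely from the log, the service is independently deployable: it can be rebuilt from scratch by replaying events, or swapped for a different store better suited to its queries.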
19. The way forward
• Right tool for the right job – an enabler for innovation
• Don’t underestimate the sympathy factor – the process will be like removing a comfort blanket, there will be
resistance. Challenge opinions on storage vs compute, batch vs stream, SQL all the things
• Your schema is your contract – take care to maintain compatibility, put governance in place. Avoid the
temptation to use JSON
• Utilise Kafka ecosystem first
• Leverage Kafka Streams for domain processing and services or KSQL where appropriate
• Use Schema Registry
• Prefer Kafka Connect
• Consider Kafka managed service offerings – Confluent Cloud, KMS
Learnings
20. Get in touch
Neil Murray
Lead Big Data Architect, Data
neil.murray@6point6.co.uk
About 6point6
Integrating digital technology into your business can result in fundamental changes to
how you operate and deliver value to your customers. To go digital is to reinvent
yourself to the core, opening yourself and your clients to a world of possibilities.
6point6 is a technology consultancy. We bring a wealth of hands-on experience to help
financial service providers, media houses and government achieve more with digital.
Using cutting edge technology and agile delivery methods, we help you reinvent,
transform and secure a brighter digital future.
Visit us on www.6point6.co.uk
Twitter: @6point6ltd
LinkedIn: linkedin.com/company/6point6