A Better Architecture for Data: Adaptable, Scalable, and Smart
Paul Boal & Adam Doyle
June 8, 2018, St. Louis
Agenda
1. Modern Data Architecture Myths
2. Characteristics of Modern Data Architecture
a. Governed, Secure
b. Adaptable, Customer Centric, Collaborative
c. Flexible, Elastic, Simple, Resilient
d. Smart, Automated
3. Reference Data Architecture
4. How do I get there?
5. Recap
MYTH #1: Just install Hadoop
A modern data architecture is not a single technology or single vendor solution.
Modern data architectures combine a portfolio of technologies to create an ecosystem with certain characteristics.
MYTH #2: Modern must mean NoSQL
NoSQL technologies provide an efficient way to manage and access data under certain circumstances, but traditional relational databases and SQL continue to provide the most powerful way to organize and query well-known data.
MYTH #3: Big data is magical pixie dust
We talk a lot about the accelerating growth of data, the decreasing cost of storage and compute power, and the power of data science. It's convenient to believe that throwing all of this into a pot and simmering will produce results while we wait. The truth is that applying data, technology, and analytics still requires planning, analysis, and careful execution.
MYTH #4: More data is always better
Not all data is created equal. Sometimes you might have unreliable or invalid data that will obfuscate results if used inappropriately.
Using extraneous data can make analysis more complicated by adding time to filter the data set and select features. Sometimes more just means more work.
MYTH #5: I have to replace everything I have right now
One of the characteristics of a modern data architecture is flexibility, meaning that your modernization should be developed incrementally, implementing new capabilities in a way that integrates with and slowly supplants existing limited technologies.
Governed, Secure
Governed
The architecture and its components have to evolve and adapt in ways that are intentional and informed by enterprise strategy.
Make collaboration the default.
Communicate and then communicate some more.
Treat every component as if another team may want to use it, too.
Secure
Accessing information should be easy and should effortlessly ensure that users are knowingly using the right information for the right purpose.
Security as an enabler of usage, not a denier of access.
Track and log access for audit purposes and for learning.
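A minimal sketch of "track and log access for audit purposes and for learning", written in Python. The logger name, the `audited` decorator, the `net_sales` dataset, and the `read_net_sales` function are all hypothetical names chosen for illustration; they are not part of any tool named in this deck.

```python
import functools
import logging
from datetime import datetime, timezone

audit_log = logging.getLogger("data_access_audit")  # hypothetical logger name

# In-memory list of access events, kept only so the sketch is easy to inspect.
ACCESS_EVENTS = []


def audited(dataset_name):
    """Decorator that records who read which dataset, for what purpose, and when."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(user, *args, **kwargs):
            event = {
                "user": user,
                "dataset": dataset_name,
                "purpose": kwargs.get("purpose", "unspecified"),
                "at": datetime.now(timezone.utc).isoformat(),
            }
            ACCESS_EVENTS.append(event)       # available later for audit and learning
            audit_log.info("access %s", event)
            return func(user, *args, **kwargs)
        return wrapper
    return decorator


@audited("net_sales")  # hypothetical dataset name
def read_net_sales(user, purpose="unspecified"):
    # Placeholder rows; a real reader would query a governed data store.
    return [{"region": "midwest", "net_sales": 100}]


rows = read_net_sales("analyst_1", purpose="quarterly report")
```

Recording the stated purpose alongside the user is one way to support "the right information for the right purpose" without blocking access.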
Apache Atlas
Open Metadata and Governance - APIs, notification systems, integration of metadata, security, and governance related tools
https://www.slideshare.net/Hadoop_Summit/open-metadata-and-governance-with-apache-atlas?qid=6ea30d4f-15af-46ad-b580-349f78bb7752&v=&b=&from_search=9
Frameworks and Tools
Open Source Core
Apache Atlas - Open Metadata Management
Apache NiFi - Data Provenance
Apache Sentry/Ranger - Fine-grained Access Control
Vendor Participants
Adaptable, Customer Centric, Collaborative
It is not the strongest of the species that survives, nor the most intelligent. It is the one that is most adaptable to change.
~commonly attributed to Charles Darwin
Adaptable
The more you deliver, the more you will learn about what is really needed, so be prepared to change and build solutions that can change easily.
Agile data modeling.
Agile analytics.
Customer Centric
Focus on delivering solutions that make sense to the people who will use them rather than following standards and rules above all else.
The DBMS is not your user.
Ralph Kimball and Edgar Codd are not your users.
The Architecture Review Board is not your user.
Collaborative
Solutions that are interactively designed and built by a team with diverse capabilities and backgrounds can produce a better result than any one individual would have achieved.
Collaboration is more than requirements gathering.
Collaboration is something that has to happen every day.
Communicate, communicate, communicate. And then communicate.
Tools and Techniques
Model Storming
Rapid experimentation
Data science environments
WhereScape, Snowflake, ThoughtSpot
Simple, Elastic, Resilient, Flexible
Notice that the stiffest tree is most easily cracked while the bamboo or willow survives by bending with the wind.
-Bruce Lee
Simple
Individual components should only be as complex as necessary.
Reduce interdependencies.
Use shared components.
Elastic
The system can easily handle an increase in data volume, users, or complexity.
Distributed computing.
Cloud.
DevOps.
Resilient
Errors in data or processing don't cause large parts of the system to fail.
Isolate components.
Tolerate, isolate, and report bad data.
Flexible
Change to the system is easy to accommodate and doesn't break other components.
Microservices.
Versioned interfaces.
Backward compatibility.
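One way to read "tolerate, isolate, and report bad data" is a pipeline step that quarantines failing records instead of aborting the whole load. The Python sketch below assumes a hypothetical `process_records` helper and a hypothetical `to_sale` transform; the field names are illustrative only.

```python
def process_records(records, transform):
    """Apply transform to each record; quarantine failures instead of aborting."""
    good, quarantined = [], []
    for record in records:
        try:
            good.append(transform(record))
        except (KeyError, ValueError, TypeError) as exc:
            # Isolate the bad record and keep the reason so it can be reported.
            quarantined.append({"record": record, "error": repr(exc)})
    return good, quarantined


def to_sale(record):
    # Hypothetical transform: requires an id and a numeric amount.
    return {"id": record["id"], "amount": float(record["amount"])}


raw = [
    {"id": 1, "amount": "19.99"},
    {"id": 2, "amount": "not a number"},  # invalid amount
    {"amount": "5.00"},                   # missing id
]

good, quarantined = process_records(raw, to_sale)
print(f"{len(good)} loaded, {len(quarantined)} quarantined")
```

The quarantined list doubles as the report: each entry keeps the original record and the error that rejected it.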
Data staging and Data Lake only contain needed data.
Each data pipeline is only as complex as it needs to be to deliver on a narrow scope.
Data is only integrated as needed, keeping processes simple.
https://www.slideshare.net/AmazonWebServices/aws-summit-singapore-get-to-know-your-customers-modern-data-architecture-93784711
Automated, Smart
I'm afraid I can't make that into a star schema, Dave.
We are going through the process where software will automate software, automation will automate automation.
-Mark Cuban
Automated
Automate tasks needed to optimize the function of the system, to detect significant changes, and to alert users when attention is needed.
Metadata injection.
Schema change detection.
Anomaly detection.
Alerting.
Smart
Schema detection. Self-tuning databases. Jeopardy champion.
Data shaping, data quality recommendations.
Natural Language Processing.
Machine Learning.
Recommender systems.
Deep Learning.
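Schema change detection with alerting could look roughly like the Python sketch below. It infers a simple schema from sample rows and compares it with a stored baseline; `infer_schema`, `diff_schema`, and the sample columns are hypothetical, and a real system would likely read schemas from a metadata catalog rather than from sample rows.

```python
def infer_schema(rows):
    """Map each column name to the set of Python type names observed in it."""
    schema = {}
    for row in rows:
        for column, value in row.items():
            schema.setdefault(column, set()).add(type(value).__name__)
    return schema


def diff_schema(expected, observed):
    """Report columns that were added, removed, or changed type."""
    shared = set(expected) & set(observed)
    return {
        "added": sorted(set(observed) - set(expected)),
        "removed": sorted(set(expected) - set(observed)),
        "retyped": sorted(c for c in shared if expected[c] != observed[c]),
    }


# Hypothetical baseline and a new delivery of the same feed.
baseline = infer_schema([{"id": 1, "amount": 9.5, "region": "east"}])
todays = infer_schema([{"id": "A-1", "amount": 3.0, "channel": "web"}])

changes = diff_schema(baseline, todays)
if any(changes.values()):
    # Stand-in for alerting; a real system might notify an on-call channel.
    print("schema change detected:", changes)
```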
How do I get there from here?
Start with something you understand well from a business perspective.
Select specific, valuable, measurable business cases.
Add simple machine learning use cases.
Identify use cases to move from a batch processing system to a streaming solution.
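As a rough illustration of the batch-to-streaming step, the Python sketch below computes the same order total two ways: once over a complete batch, and once incrementally as each record arrives. The function names and the `amount` field are hypothetical, and a plain generator stands in for a real streaming platform.

```python
def batch_total(orders):
    """Batch style: wait for the full set, then compute once."""
    return sum(order["amount"] for order in orders)


def streaming_totals(order_stream):
    """Streaming style: update the result as each record arrives."""
    running = 0.0
    for order in order_stream:
        running += order["amount"]
        yield running  # an up-to-date answer is available after every event


orders = [{"amount": 10.0}, {"amount": 2.5}, {"amount": 7.5}]

totals = list(streaming_totals(iter(orders)))
final_batch = batch_total(orders)
```

A good first candidate for streaming is a metric like this one, where the incremental result is easy to check against the existing batch output.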
The Myths are Just Myths
● You don't "just need Hadoop". You may not even need Hadoop at all!
● NoSQL has a place, but that isn't the entire solution either.
● There's no magical pixie dust here. This transformation will take real work.
● More data is not necessarily better, no matter how much we data hoarders want it to be.
● By definition, you have to create your modern data architecture incrementally, because it also has to continue to evolve.
Governed, Secure
Maintain data and the data architecture in a way that makes governance and security a natural and easy part of doing work.
Adaptable, Customer Centric, Collaborative
Apply data toward real challenges and opportunities that focus on customers, and be willing and able to pivot as needed.
Simple, Elastic, Resilient, Flexible
Build your data architecture, your teams, and your processes in a way that creates a high capacity for change.
Intro and Myths - Paul
Characteristics A, B - Paul
Characteristics C, D - Adam
Reference Architecture - Adam
How do I Get There - Adam or Paul or Back-and-Forth
Recap - Paul
These characteristics describe the processes by which your data is maintained.
Maybe here we want to tell stories about companies that didn’t secure their data (Target, Equifax, Schnucks)
These characteristics describe the way in which you use your data.
Built for purpose
These characteristics describe the architecture and its capacity to change.
These characteristics describe the way in which your data is integrated.
Informatica ClAIre
Processing data - Mastering, Integration, De-identification,
Data Warehouse/Data Mart for reporting with rigor
Provisioning - Pie in the Sky - I’d like some “Net Sales”