Data Lake or Data Swamp? By now, we’ve likely all heard the comparison. Data Lake architectures can integrate vast amounts of disparate data from across the organization for strategic business analytic value. But without a proper architecture and metadata management strategy in place, a Data Lake can quickly devolve into a swamp of information that is difficult to understand. This webinar will offer practical strategies to architect and manage your Data Lake in a way that optimizes its success.
Data Lake Architecture – Modern Strategies & Approaches
1. Data Lake Architecture
Modern Strategies & Approaches
Donna Burbank, Managing Director
Global Data Strategy, Ltd.
August 23rd, 2018
Follow on Twitter @donnaburbank
Twitter Event hashtag: #DAStrategies
2. Global Data Strategy, Ltd. 2018
Donna Burbank
Donna is a recognised industry expert in
information management with over 20 years
of experience in data strategy, information
management, data modeling, metadata
management, and enterprise architecture.
Her background is multi-faceted across
consulting, product development, product
management, brand strategy, marketing,
and business leadership.
She is currently the Managing Director at
Global Data Strategy, Ltd., an international
information management consulting
company that specializes in the alignment of
business drivers with data-centric
technology. In past roles, she has served in
key brand strategy and product
management roles at CA Technologies and
Embarcadero Technologies for several of the
leading data management products in the
market.
As an active contributor to the data
management community, she is a long time
DAMA International member, Past President
and Advisor to the DAMA Rocky Mountain
chapter, and was recently awarded the
Excellence in Data Management Award from
DAMA International in 2016.
Donna is also an analyst at the Boulder BI
Brain Trust (BBBT) where she provides advice
and gains insight on the latest BI and
Analytics software in the market. She was on
several review committees for the Object
Management Group for key information
management and process modeling
notations.
She has worked with dozens of Fortune 500
companies worldwide in the Americas,
Europe, Asia, and Africa and speaks regularly
at industry conferences. She has co-
authored two books: Data Modeling for the
Business and Data Modeling Made Simple
with ERwin Data Modeler and is a regular
contributor to industry publications. She can
be reached at
donna.burbank@globaldatastrategy.com
Donna is based in Boulder, Colorado, USA.
3. Global Data Strategy, Ltd. 2018
DATAVERSITY Data Architecture Strategies
• January - on demand Panel: Emerging Trends in Data Architecture – What’s the Next Big Thing?
• February - on demand Building an Enterprise Data Strategy – Where to Start?
• March - on demand Modern Metadata Strategies
• April - on demand The Rise of the Graph Database
• May - on demand Data Architecture Best Practices for Today’s Rapidly Changing Data Landscape
• June - on demand Artificial Intelligence: Real-World Applications for Your Organization
• July - on demand Data as a Profit Driver – Emerging Techniques to Monetize Data as a Strategic Asset
• August Data Lake Architecture – Modern Strategies & Approaches
• Sept Master Data Management: Practical Strategies for Integrating into Your Data Architecture
• October Business-Centric Data Modeling: Strategies for Maximizing Business Benefit
• December Panel: Self-Service Reporting and Data Prep – Benefits & Risks
This Year’s Line Up for 2018
4. Global Data Strategy, Ltd. 2018
Today’s Topic
Building a Successful Data Lake Architecture
• Data Lake or Data Swamp? By now, we’ve likely all heard the comparison.
• Data Lake architectures can integrate vast amounts of
disparate data from across the organization for strategic business analytic value.
• But without a proper architecture and metadata management strategy in place, a Data Lake can
quickly devolve into a swamp of information that is difficult to understand.
• This webinar will offer practical strategies to architect and manage your Data Lake in a way that
optimizes its success.
5. Global Data Strategy, Ltd. 2018
Data Lakes – the Opportunity
• Data Lakes provide a response to the opportunity & reality of today’s data-focused world.
• Consumer data provides a myriad of opportunities
• IoT data for machine logs, sensors, etc.
• And more…
• aka “Big Data”
Opportunity and Complexity
Diagram: Consumer data sources
• Purchasing Patterns
• Photos & Video
• Support Call Logs
• Web Click Activity
• Social Media Interactions
• IoT data from wearable tech
• Location data from phone
• Etc…
6. Global Data Strategy, Ltd. 2018
What is Big Data?
• Big Data is often characterised by the “3 Vs”:
• Volume: Is there a high volume of data? (e.g. terabytes per day)
• Velocity: Is data generated or changed at a rapid pace? (e.g. per second, sub-second)
• Variety: Is data stored across multiple formats? (e.g. machine data, media files, log files)
• The ability to understand and manage these sources and integrate them into the
larger Business Intelligence ecosystem can provide the ability to gain valuable
insights from data.
• Social Media Sentiment Analysis – e.g. What are customers saying about our products?
• Web Browsing Analytics – Customer usage patterns
• Internet of Things (IoT) Analysis – e.g. Sensor data, Machine log data
• Customer Support – e.g. Call log analysis
• This ability leads to the “4th V” of Big Data – Value.
• Value: Valuable insights gained from the ability to analyze and
discover new patterns and trends from high-volume and/or
cross-platform systems.
Diagram: Volume, Velocity, and Variety lead to Value.
7. Global Data Strategy, Ltd. 2018
The Business Need Spans Traditional and
Modern Technology
Business need: “Tell me what customers are saying about our product.”
Traditional Databases & DW (Sybase, SAP, DB2, Oracle, SQL Server, SQL Azure, Informix, Teradata)
• DBA: “Which customer database do you want me to pull this from? We have 25.”
• Data Architect: “And, by the way, the databases all store customer information in a different format. ‘CUST_NM’ on DB2, ‘cust_last_nm’ on Oracle, etc. It’s a mess.”
Big Data & Data Lake
• Data Scientist: “I’ll need to input the raw data from thousands of sources, and write a program to parse and analyze the relevant information.”
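The naming mismatch the Data Architect describes can be sketched in a few lines of Python. This is only an illustration, not part of the presentation: the column names CUST_NM and cust_last_nm come from the slide, while the canonical term customer_name, the source labels, and the assumption that both columns hold the same attribute are hypothetical.

```python
# Hypothetical mapping from (source system, physical column) to a shared
# business term. Only CUST_NM and cust_last_nm come from the slide; the
# canonical name and the idea that both columns mean the same thing are
# assumptions made for illustration.
CANONICAL_NAMES = {
    ("db2", "CUST_NM"): "customer_name",
    ("oracle", "cust_last_nm"): "customer_name",
}


def to_canonical(source: str, record: dict) -> dict:
    """Rename the columns of one source record to their shared business terms.

    Columns without a mapping keep their original name, so nothing is
    silently dropped.
    """
    return {
        CANONICAL_NAMES.get((source, column), column): value
        for column, value in record.items()
    }


db2_row = to_canonical("db2", {"CUST_NM": "Smith"})
oracle_row = to_canonical("oracle", {"cust_last_nm": "Smith"})
print(db2_row == oracle_row)
```

In practice a mapping like this would live in shared metadata rather than in code, which is the point the later slides make about metadata management.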
8. Global Data Strategy, Ltd. 2018
The 5th “V” - Veracity
• Only through proper Governance, Data Quality Management, Metadata Management, etc., can
organizations achieve the 5th “V” – Veracity.
• Veracity: Trust in the accuracy, quality and content of the organizations’ information assets.
• i.e. The hard work doesn’t go away with Big Data
Data Science
Raw data used in Self-Service Analytics and BI environments is
often so poor that many data scientists and BI professionals
spend an estimated 50 – 90% of their time cleaning and
reformatting data to make it fit for purpose.
Source: DataCenterJournal.com
Data Lakes
The absence of commonly understood and shared metadata
and data definitions is cited as one of the main impediments
to the success of Data Lakes.
Source: Radiant Advisors
Data Science
Correcting poor data quality is a Data Scientist’s least favorite
task, consuming on average 80% of their working day.
Source: Forbes 2016
Digitization & Data Quality
71% of interviewees expect digitization to grow their
business. But 70% say the biggest barrier is finding the right
data; 62% cite inconsistent data.
Source: Stibo Systems
9. Global Data Strategy, Ltd. 2018
Big Data a Growing Trend
• Over 70% of organizations are
either using Big Data solutions, or
planning to in the future.
• Analysis & Discovery are leading
trends including:
• Data Science & Discovery
• Reporting & Analytics
• “Sandbox” Exploration
Analysis & Discovery are Key Drivers
Source: Trends in Data Architecture, 2017, DATAVERSITY, by Donna Burbank and Charles Roe
10. Global Data Strategy, Ltd. 2018
Big Data Concerns
• The Complexity of current Big
Data solutions & the Skills
Required to manage them
were also common issues.
• Security is a leading concern, and Data
Governance was a top write-in response.
Source: Trends in Data Architecture, 2017, DATAVERSITY, by Donna Burbank and Charles Roe
11. Global Data Strategy, Ltd. 2018
Balance Opportunity & Risk
With the Opportunity of Big Data Comes Risk
Architecture
• Scalability
• Cost Considerations
• Latency
• Storage of Diverse Data Sources
Governance
• Privacy
• Security
• Compliance
• Collaboration between New Roles
• With the opportunity from Big Data comes a myriad of risks and concerns such as scalability, security, etc.
• These concerns can be addressed through a combination of Data Lake architecture and the supporting
governance mechanisms.
12. Global Data Strategy, Ltd. 2018
A Successful Data Strategy links Business Goals with Technology Solutions
“Top-Down” alignment with
business priorities
“Bottom-Up” management &
inventory of data sources
Managing the people, process,
policies & culture around data
Coordinating & integrating
disparate data sources
Leveraging & managing data for
strategic advantage
Copyright 2018 Global Data Strategy, Ltd
Aligning Business Strategy and Data Strategy
13. Global Data Strategy, Ltd. 2018
Traditional Relational Technologies and “Big Data”:
a Paradigm Shift
Traditional
• Top-Down, Hierarchical
• Design, then Implement
• “Passive”, Push technology
• “Manageable” volumes of information
• “Stable” rate of change
• Data Warehouse
• Business Intelligence
Big Data
• Distributed, Democratic
• Discover and Analyze
• Collaborative, Interactive
• Massive volumes of information
• Rapid and Exponential rate of growth
• Data Lake
• Statistical Analysis
Diagram: Traditional: Design -> Implement. Big Data: Discover -> Analyze.
14. Global Data Strategy, Ltd. 2018
“Traditional” way of Looking at the World: Hierarchies
• Carolus Linnaeus in 1735 established a hierarchy/taxonomy for organizing and identifying
biological systems.
Kingdom
Phylum
Class
Order
Family
Genus
Species
15. Global Data Strategy, Ltd. 2018
“New” Way of Looking at the World - Emergence
In philosophy, systems theory, science, and art, emergence is
the way complex systems and patterns arise out of a
multiplicity of relatively simple interactions.
- Wikipedia
I love my new
Levis jeans.
Is Levi coming
to my party?
Sale #LEVIS
20% at Macys.
LOL. TTYL.
Leving soon.
16. Global Data Strategy, Ltd. 2018
Data Warehouse vs. Data Lake
Data Warehouse
A Data Warehouse is a storage repository that holds current
and historical data used for creating analytical reports. Data
structures & requirements are pre-defined, and data is
organized & stored according to these definitions.
Data Lake
A Data Lake is a storage repository that holds a vast
amount of raw data in its native format, including
structured, semi-structured, and unstructured data.
The data structure & requirements are not defined until
the data is needed.
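The difference between the two definitions is often described as schema-on-write versus schema-on-read. The sketch below is a minimal illustration of that idea in plain Python, not a description of any particular product; the column names, the sample records, and the helper functions are all hypothetical.

```python
import json

# Schema-on-write (warehouse style): structure is fixed before loading,
# and records that do not fit are rejected at load time.
WAREHOUSE_COLUMNS = ("customer_id", "product", "amount")


def load_into_warehouse(table: list, record: dict) -> None:
    """Append the record as a row only if it matches the predefined columns."""
    if set(record) != set(WAREHOUSE_COLUMNS):
        raise ValueError(f"record does not match schema: {sorted(record)}")
    table.append(tuple(record[c] for c in WAREHOUSE_COLUMNS))


# Schema-on-read (lake style): raw payloads are stored untouched,
# and structure is applied only when a question is asked.
def store_in_lake(lake: list, raw: str) -> None:
    lake.append(raw)


def read_amounts(lake: list) -> list:
    """Apply structure at read time: pull 'amount' from any JSON object found."""
    amounts = []
    for raw in lake:
        try:
            doc = json.loads(raw)
        except json.JSONDecodeError:
            continue  # not JSON (e.g. a log line); irrelevant to this question
        if isinstance(doc, dict) and "amount" in doc:
            amounts.append(doc["amount"])
    return amounts


warehouse, lake = [], []
load_into_warehouse(warehouse, {"customer_id": 1, "product": "jeans", "amount": 40.0})
store_in_lake(lake, '{"customer_id": 1, "amount": 40.0, "channel": "web"}')
store_in_lake(lake, "2018-08-23 12:00:01 GET /products/jeans 200")
print(read_amounts(lake))
```

The lake accepts both payloads without complaint, which is the flexibility the definition describes; the cost is that every reader has to decide what the data means.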
17. Global Data Strategy, Ltd. 2018
Combining DW & Big Data Provides Value
• There are numerous ways to gain value from data
• Relational Database and Data Warehouse systems are one key source of value
• Customer information
• Product information
• Big Data can offer new insights from data
• From new data sources (e.g. social media, IoT)
• By correlating multiple new and existing data sources (e.g. network patterns & customer data)
• Integrating DW and Big Data can provide valuable new insights.
• Examples include:
• Customer Experience Optimization
• Churn Management
• Products & Services Innovation
Diagram: Data Warehouse and Data Lake combine to produce New Insights.
18. Global Data Strategy, Ltd. 2018
Data Lake Adoption is Varied
• Most are using a Data Lake along
with a Data Warehouse
• Many are not currently using a Data
Lake
Source: Trends in Data Architecture, 2017, DATAVERSITY, by Donna Burbank and Charles Roe
19. Global Data Strategy, Ltd. 2018
Poll: Are you currently implementing a Data Lake?
YES NO
20. Global Data Strategy, Ltd. 2018
Integrating the Data Lake & Traditional Data Sources
• The Data Lake has a different architecture & purpose than traditional data sources such as data
warehouses.
• But the two environments can co-exist to share relevant information.
Diagram: two environments that share information
• Data Analysis & Discovery – Data Lake: Sandbox, Lightly Modeled Data, Data Exploration
• Enterprise Systems of Record: Master & Reference Data, Data Warehouse, Data Marts, Operational Data
• Reporting & Analytics: Advanced Analytics, Self-Service BI, Standard BI Reports
• Spanning both environments: Data Governance & Collaboration, Security & Privacy
21. Global Data Strategy, Ltd. 2018
The Data Ecosystem
• Know what to manage closely and what to leave alone
• The more the data is shared across & beyond the organization, the more formal governance needs to be
Master & Reference Data
• Common data elements used by multiple stakeholders
across functional areas, applications, etc.
• Highly governed
• Highly published & shared
Examples
• Reference Data: Department Codes, Country Codes, etc.
• Master Data: Customer, Product, Student, Supplier, etc.
Core Enterprise Data
• Common data elements used by multiple
stakeholders, departments, etc. (e.g. DW)
• Highly governed
• Highly published & shared
Examples
• Common Financial Metrics: for Financial & Regulatory Reporting
• Common Attributes: Core attributes reused across multiple areas
Functional & Operational Data
• Lightly modeled & prepared data for
limited sharing & reuse
• Collaboration-based governance
• May be future candidates for core data
Examples
• Operational Reporting
• Non-productionized analytical model data
• Ad hoc reporting & discovery
Exploratory Data
• Raw or lightly prepped data for
exploratory analysis
• Mainly ad hoc, one-off analysis
• Light touch governance
Examples
• Raw data sets for exploratory analytics
• External & Open data sources
Publish: Exploratory analysis uses core data sets when applicable.
Promote: Derived variables of value can be fed into Core Enterprise, or even Master Data.
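One way to picture the tiers and the "Promote" path is as a small data structure. The following is a sketch under stated assumptions, not the presenter's model: the tier names and governance styles come from the slide, while the strict ordering of tiers, the one-step promotion rule, and the two promotion criteria (a named steward and an agreed definition) are hypothetical.

```python
from dataclasses import dataclass
from enum import IntEnum


class Tier(IntEnum):
    """The four tiers named on the slide; the ordering is an assumption."""
    EXPLORATORY = 1
    FUNCTIONAL_OPERATIONAL = 2
    CORE_ENTERPRISE = 3
    MASTER_REFERENCE = 4


# Governance style per tier, paraphrased from the slide.
GOVERNANCE = {
    Tier.EXPLORATORY: "light touch",
    Tier.FUNCTIONAL_OPERATIONAL: "collaboration-based",
    Tier.CORE_ENTERPRISE: "highly governed",
    Tier.MASTER_REFERENCE: "highly governed",
}


@dataclass
class DataSet:
    name: str
    tier: Tier
    has_steward: bool = False      # hypothetical promotion criterion
    has_definition: bool = False   # hypothetical promotion criterion


def promote(ds: DataSet) -> DataSet:
    """Return a copy one tier up, if the (assumed) criteria are met."""
    if ds.tier is Tier.MASTER_REFERENCE:
        raise ValueError(f"{ds.name} is already in the top tier")
    if not (ds.has_steward and ds.has_definition):
        raise ValueError(f"{ds.name} needs a steward and a definition first")
    return DataSet(ds.name, Tier(ds.tier + 1), ds.has_steward, ds.has_definition)


churn = DataSet("churn_score", Tier.FUNCTIONAL_OPERATIONAL, True, True)
promoted = promote(churn)
print(promoted.tier.name, "->", GOVERNANCE[promoted.tier])
```

The useful part of the sketch is the gate in `promote`: moving data toward the shared core is the moment where formal governance has to start applying.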
22. Global Data Strategy, Ltd. 2018
Governance Requires Interaction Between Roles
Data Scientist
“Citizen Data
Scientist”
Data Architect
BI Reporting
Analyst
ETL Developer
Data Steward
Data Warehouse – centric roles Data Lake – centric roles
Alignment
DW Developer
Data Lake Platform
Administrator
Data Governance
Manager
Cross-cutting Governance
& Architecture Roles
23. Global Data Strategy, Ltd. 2018
Metadata Repository vs. Data Catalogue
• The collaboration paradigm of the data lake can require a different way of managing metadata
Different Data Sources Require Different Ways of Working
Encyclopedia – Metadata Repository
• Created by a few, then published as read-only
• Single source of “vetted” truth
• Slowly-changing
• For Standardized, Enterprise Data Sets (Data Warehouse)
Wikipedia – Data Catalogue
• Created by many, edited by many
• Eventual consistency with multiple inputs
• Dynamic
• For Data Exploration, Self Service (Data Lake)
24. Global Data Strategy, Ltd. 2018
Collaboration, Governance & Metadata
Data Lakes require new ways of collaborating
Diagram: Reference & Master Data, Core Enterprise Data, Functional & Operational Data, Exploratory Data
Metadata Repository – Stricter Governance
• Glossary: Strictly vetted
• Data Dictionary: approved sources
• Data Lineage: detailed source/target mapping at
field level
• Audit Trails
• PII mapping and audit
• Data classification
Data Catalogue – Collaborative Governance
• Glossary: Crowdsourced & open
• Data Dictionary: exploratory sources
• Data Lineage: high-level data flow and
lineage between source and target
• Usage ranking
• Usefulness ranking and “likes”
• Tagging
25. Global Data Strategy, Ltd. 2018
Data Catalogue: Harnessing “Tribal Knowledge”
Helpfulness Ranking
• Which:
• Definitions are most
complete & helpful?
• Algorithms offer a helpful
starting point?
• Queries offer great logic
to share?
• Etc.
Usage Ranking
• Which:
• Queries are others using?
• Tables are accessed the
most?
• Glossary terms are most
often searched?
• Etc.
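A question such as "which tables are accessed the most?" reduces to counting entries in an access log. The snippet below is a minimal sketch of that idea; the log format, user names, and table names are invented for illustration and do not come from any particular catalogue tool.

```python
from collections import Counter

# Hypothetical query log: (user, table) pairs.
query_log = [
    ("ana", "sales.orders"),
    ("ben", "sales.orders"),
    ("ana", "support.call_logs"),
    ("cho", "sales.orders"),
    ("ben", "web.clicks"),
    ("cho", "support.call_logs"),
]


def most_accessed(log, top_n=3):
    """Rank tables by number of accesses, most used first."""
    return Counter(table for _, table in log).most_common(top_n)


print(most_accessed(query_log))
```

The same counting approach would apply to glossary searches or shared queries, given a log of those events.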
Collaboration & Crowdsourcing
Term: Part Number
Alternate Names: Component Number
Definition:
A part number is an 8 digit alphanumeric field that uniquely
identifies a machine part used in the manufacturing process.
Is this truly the same as the old Component
Number? That was a 10 digit numeric field. It
didn’t have letters.
Yes, it is. I had the same problem for the
finance app, and I wrote a quick program to
convert the numbers. We just strip off the first
two chars now. Click here to find it.
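The "quick program" mentioned in the reply is not shown in the slides. A guess at what such a conversion might look like, assuming the legacy Component Number is exactly 10 numeric digits and the Part Number is simply its last 8 characters, is sketched below; the function name and the validation are hypothetical.

```python
def component_to_part_number(component_number: str) -> str:
    """Convert a legacy 10-digit Component Number to an 8-character Part Number
    by dropping the first two characters, as the comment thread describes."""
    is_numeric = component_number.isascii() and component_number.isdigit()
    if len(component_number) != 10 or not is_numeric:
        raise ValueError(
            f"expected a 10-digit numeric value, got {component_number!r}"
        )
    return component_number[2:]


print(component_to_part_number("0012345678"))  # prints 12345678
```

Capturing this kind of rule next to the glossary term is exactly the tribal knowledge the slide suggests a catalogue should hold.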
26. Global Data Strategy, Ltd. 2018
Avoiding Silos
• Don’t create Data Lily Pads – i.e. disparate Data Lakes not connected with a wider Data Strategy.
• Often, teams create their own “stealth” data
lakes in order to solve an immediate, tactical
problem.
• This approach loses the value of cross-
functional data sharing.
• Cost issues and redundancy are also a
concern.
27. Global Data Strategy, Ltd. 2018
Considerations & Risks to Avoid
• The World of Data Lakes brings with it new risks and concerns
Platform
- On Prem
- Cloud
- Provider selection
Skills
- Outsourced
- In House
- Training Requirements
Cost
- Is Cloud the right model
for our scalable usage?
- Are we shutting off
sandboxes when we’re
done?
Data Lifecycle
- What can be cold storage vs hot storage?
- When can data be deleted?
- How do we move from Exploration to Enterprise?
Data Security
- Who has access?
- How is PII managed?
Data Governance
- Is there common semantic meaning?
- How do teams work together – operating model?
- Policies & Procedures
- Who is spinning up a sandbox & why?
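The Data Lifecycle questions above (hot vs cold storage, when data can be deleted) can be turned into a simple rule based on how long a data set has been idle. This is a sketch only: the 90-day and 365-day thresholds and the function name are hypothetical, and real values would come from the organization's own retention and compliance policies.

```python
from datetime import date, timedelta

# Hypothetical thresholds; real values depend on policy and regulation.
COLD_AFTER = timedelta(days=90)
DELETE_AFTER = timedelta(days=365)


def lifecycle_action(last_accessed: date, today: date) -> str:
    """Classify a data set as 'hot', 'cold', or 'review for deletion' by idle time."""
    idle = today - last_accessed
    if idle >= DELETE_AFTER:
        return "review for deletion"
    if idle >= COLD_AFTER:
        return "cold"
    return "hot"


today = date(2018, 8, 23)
print(lifecycle_action(date(2018, 8, 1), today))
print(lifecycle_action(date(2018, 3, 1), today))
print(lifecycle_action(date(2017, 1, 1), today))
```

The same idle-time check could flag sandboxes that are still running after their owners have finished with them.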
28. Global Data Strategy, Ltd. 2018
Summary
• Data Lakes can provide significant opportunity to an organization to gain value from
cross-functional, disparate data sources
• Data Warehouses and Data Lakes work well together for a comprehensive enterprise
view.
• Data Governance is critical for the success of data lakes:
• Collaboration and sharing of information
• Access control and security
• Lifecycle and promotion to enterprise data assets
• Operating model and ways of working between roles and departments
29. Global Data Strategy, Ltd. 2018
DATAVERSITY Data Architecture Strategies
• January - on demand Panel: Emerging Trends in Data Architecture – What’s the Next Big Thing?
• February - on demand Building an Enterprise Data Strategy – Where to Start?
• March - on demand Modern Metadata Strategies
• April - on demand The Rise of the Graph Database: Practical Use Cases & Approaches to Benefit your Business
• May - on demand Data Architecture Best Practices for Today’s Rapidly Changing Data Landscape
• June - on demand Artificial Intelligence: Real-World Applications for Your Organization
• July - on demand Data as a Profit Driver – Emerging Techniques to Monetize Data as a Strategic Asset
• August - soon on demand Data Lake Architecture – Modern Strategies & Approaches
• Sept Master Data Management: Practical Strategies for Integrating into Your Data Architecture
• October Business-Centric Data Modeling: Strategies for Maximizing Business Benefit
• December Panel: Self-Service Reporting and Data Prep – Benefits & Risks
This Year’s Line Up for 2018 – Join Us Next Month
30. Global Data Strategy, Ltd. 2018
White Paper: Trends in Data Architecture
Free Download
• Download from
www.globaldatastrategy.com
• Under ‘Resources/Whitepapers’