5 Steps for Architecting a Data Lake

5 STEPS FOR ARCHITECTING A DATA LAKE
How to maximize intelligence by unifying enterprise data
© 2018 MetroStar Systems, Inc. - All Rights Reserved

TABLE OF CONTENTS
SECTION 1: INTRODUCTION
  Data Growth Challenges
SECTION 2: WHAT IS A DATA LAKE?
  How Does a Data Lake Work?
  Data Lake vs Traditional Approach
SECTION 3: DATA LAKE REQUIREMENTS
  Creating a Successful Data Lake
  Data Lake Governance
  Selecting the Right Platform
SECTION 4: 5 STEPS FOR ARCHITECTING A DATA LAKE
  1. Ingestion & Storage
  2. Data Processing
  3. Robust Data Governance
  4. Data Retrieval and Visualization
  5. Advanced Analytics
  Overview of a Data Lake’s Capabilities
SECTION 5: MAXIMIZING THE VALUE OF A DATA LAKE
  Data Revolves Around Citizens
  Enhancing Citizen Experience
ASSESSING READINESS

SECTION 1:
INTRODUCTION

DATA GROWTH CHALLENGES
Data Growth Challenges:
• High overhead costs due to inflexible architecture and legacy technology maintenance
• Antiquated data environments that suffer from poor master data management practices
• Low data integrity due to a lack of a single source of truth with respect to the data
• Inability to provide internal users, analysts, developers, and management the tools
  needed to perform their respective roles at the high caliber of quality expected
  from today’s workplace
Enterprises that do not employ Data Lake platforms can find themselves being outpaced
by the rate of their agency’s data growth.
AS AN AGENCY GROWS, SO DOES ITS DATA. Data is no longer limited to structured,
relational, or transactional forms. It now includes semi-structured and unstructured data,
operational logs, social media, free text, and more. The ability to ingest data of all varieties
is imperative to gaining a holistic understanding of the digital ecosystem. Agencies can
apply cutting-edge technologies to wide-ranging, high-integrity data sources to derive
powerful insights into their operational and theoretical questions. Coupling the robust
technologies of a Data Lake with the flexible, cost-effective capabilities of a Cloud Service
Provider (CSP) such as Amazon Web Services (AWS) or Microsoft Azure makes the
Data Lake a powerful asset for agencies large and small.
Source: http://infosysblogs.com/brandededge/2013/04/20130419infographic.html

SECTION 2:
WHAT IS A DATA LAKE?

HOW DOES A DATA LAKE WORK?
A Data Lake is a natural
maturation of data
migrating to a single
environment. The Data
Lake provides capabilities
seldom seen in IT
enterprises that employ
disparate data stores and
databases.
“A data lake is like a large
body of water in a more
natural state. The contents
of the data lake stream in
from a source to fill the
lake, and various users of
the lake can come to
examine, dive in, or take
samples.”
– James Dixon, CTO, Pentaho

DATA LAKE vs TRADITIONAL APPROACH
Data Storage
  Data Lake: Structured, semi-structured, or unstructured data can be stored at low
  costs and can be stored with a schema (e.g. relational) or can be schema-less.
  Traditional: Data is stored in vertically scaling relational database management
  systems (RDBMS) at high costs.

Advanced Analytics
  Data Lake: Analytics can be run on any and all data sets in real time (e.g. in-memory
  machine learning algorithms) without requiring upfront manual processing or preparation.
  Traditional: Data typically has to be manually prepared and integrated from multiple
  sources, which can be a significant barrier to generating rapid insights.

Enterprise Data Taxonomy
  Data Lake: Multiple taxonomies, schemas, and standards can exist in a single data
  environment while being applied by different data stakeholder groups.
  Traditional: Agencies struggled in the past to create a single taxonomy or schema to
  represent the enterprise data model.

User Access Control
  Data Lake: Data is tagged at ingestion (and automatically analyzed on read) with the
  appropriate authorization rules. Authentication can be controlled through single
  sign-on (SSO) capabilities.
  Traditional: Data authentication and authorization is specified using manually
  controlled and disparate tools (e.g. Access Control Lists).

Business Intelligence
  Data Lake: Information and analytics are conveyed using automated, feature-rich
  dashboards and visualizations.
  Traditional: Information and analytics are presented in compiled, static reports.
Data Lake implementations using Big Data technologies such as Hadoop represent a
transformational shift in the enterprise data objectives of agencies. This shift
allows legacy or traditional approaches to data utilization to advance drastically.
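The schema-less storage option in the comparison above can be illustrated with a small schema-on-read sketch. It assumes raw records are kept as text lines and that each consumer applies its own schema only when reading; the sample lines, the VISIT_SCHEMA mapping, and the read_with_schema helper are hypothetical names used for illustration, not part of any particular platform.

```python
import json

# Raw records land in the lake exactly as they arrived (hypothetical sample data).
RAW_LINES = [
    '{"id": 1, "name": "Ada", "visits": "3"}',
    '{"id": 2, "name": "Grace", "channel": "web"}',
    'free-text note with no structure at all',
]

# A schema chosen by one consumer at read time: field name -> (type, default).
VISIT_SCHEMA = {"id": (int, None), "name": (str, ""), "visits": (int, 0)}


def read_with_schema(lines, schema):
    """Parse raw lines on read, keeping only records that are JSON objects."""
    rows = []
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # unstructured line: stays in the lake, skipped by this reader
        if not isinstance(record, dict):
            continue
        row = {}
        for name, (cast, default) in schema.items():
            value = record.get(name, default)
            row[name] = cast(value) if value is not None else None
        rows.append(row)
    return rows


rows = read_with_schema(RAW_LINES, VISIT_SCHEMA)
print(rows)
```

A second stakeholder group could read the same raw lines with a different schema, which is the sense in which multiple taxonomies can coexist in one environment.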

SECTION 3:
DATA LAKE REQUIREMENTS

CREATING A SUCCESSFUL DATA LAKE
Scaling the value proposition of the Data Lake starts by making data
accessible and easy to use. The Data Lake’s data consumers will have
diverse needs, so using a common data storage and access infrastructure
alongside a fully featured Cloud Service Provider (e.g., Amazon Web
Services or Microsoft Azure) provides the capabilities and flexibility
needed to drive innovative uses of data and data services.
Using best-of-breed open-source cloud architectures for a Data Lake overcomes
“vendor lock-in” challenges, eliminates the maintenance of links between
stove-piped systems, makes data easier to use, expedites delivery, and ultimately
reduces the risks and costs associated with achieving innovation.
A successful Data Lake implementation also allows data from across the agency
to be integrated and leveraged in a sophisticated solution. It begins with a
modular, modern, cluster-based architecture (multiple interconnected servers)
that is grounded in a flexible infrastructure platform.
A significant challenge
when striving for innovative
results is “vendor lock-in,”
which is caused by
proprietary commercial-off-
the-shelf (COTS)
technologies that make it
difficult to modify, scale, or
transition to new data
uses/services.

DATA LAKE GOVERNANCE
Without data lake governance, businesses could be left without meaningful business intelligence -- or even jeopardize the
business.

SELECTING THE RIGHT PLATFORM
Amazon Web Services (AWS)
• Amazon EMR (Elastic MapReduce), a managed Hadoop, Spark, and Presto solution
• EMR ingests data from a number of AWS services
• AWS also has real-time analytics, predictive analytics, and data dashboard and
  visualization capabilities
• AWS has been used to support government missions in health and human sciences,
  defense, intelligence, statistical, regulatory, and financial industries

Microsoft Azure
• Azure includes the managed Apache platform HDInsight (Hadoop, Spark, Storm, HBase)
• HDInsight includes a local Hadoop Distributed File System (HDFS), connected to the
  Data Lake
• Azure Data Lake Store can store data in its native format, without prior
  transformations
• Recently added Azure Data Lake Analytics, a serverless hyper-scale data storage and
  analytical platform

Google Cloud Platform
• Fully managed Hadoop and Spark offering
• Provides a fully programmable framework for Java and Python
• Cloud Dataflow & Spark for pipeline execution
• Machine Learning as a fully managed platform for training and hosting
• Google offers a Cloud Machine Learning Engine to build models based on TensorFlow’s
  deep learning library

Agencies have successfully used AWS to support workloads and solutions with data
from Controlled Unclassified Information (CUI) to Top Secret classifications.
*Comparisons shown above based on August 2017 data

SECTION 4:
5 STEPS FOR ARCHITECTING A DATA LAKE

1. INGESTION & STORAGE
DATA INGESTION
To begin data ingestion, agencies must perform an
analysis of the high value data sources present in the
enterprise. These data sources are typically relational
and/or transactional and offer quick-win opportunities to
establish the Data Lake as the center for a single source of
truth. The processes used to obtain and capture data can
be iterated upon, and open source tools can reduce the
complexities of data ingestion configuration.
DATA STORAGE
A data pipeline is built from components called processors, each of which
handles a specific extract, transform, and load (ETL) step on incoming
data. For data that requires more advanced processing, native tools can
help bridge the gap between data collection, data ETL
(including applying governance policies and access
control), and data storage. Data from relational sources
can be stored using native technologies.
PROPER DATA INGESTION IS CRUCIAL TO THE SUCCESS OF A DATA
LAKE. Understanding the velocity, size, format, and frequency of the
data being ingested, and how it will be analyzed, ensures that the
architecture properly accommodates the data.
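The idea of a pipeline made of small ETL processors can be sketched as follows. This is a minimal illustration, assuming each processor is a plain function that takes a record and returns an enriched copy; the CSV layout, the field names, and the run_pipeline helper are hypothetical, and a real deployment would typically rely on a dedicated ingestion tool instead.

```python
from datetime import datetime, timezone


def parse_csv_line(record):
    """Extract: split a raw CSV line into named fields (hypothetical layout)."""
    case_id, status, amount = record["raw"].split(",")
    return {
        **record,
        "case_id": case_id.strip(),
        "status": status.strip(),
        "amount": amount.strip(),
    }


def normalize(record):
    """Transform: standardize the status and cast the amount to a number."""
    return {
        **record,
        "status": record["status"].upper(),
        "amount": float(record["amount"]),
    }


def stamp_ingestion(record):
    """Load preparation: record when the row entered the lake."""
    return {**record, "ingested_at": datetime.now(timezone.utc).isoformat()}


def run_pipeline(raw_lines, source, processors):
    """Pass every incoming line through each processor in order."""
    stored = []
    for line in raw_lines:
        record = {"raw": line, "source": source}
        for processor in processors:
            record = processor(record)
        stored.append(record)
    return stored


PROCESSORS = [parse_csv_line, normalize, stamp_ingestion]
stored = run_pipeline(
    ["A-100, open, 12.50", "A-101, closed, 7"], "case_system", PROCESSORS
)
for row in stored:
    print(row["case_id"], row["status"], row["amount"])
```

Because each processor does one thing, the ingestion process can be iterated upon by adding, removing, or reordering entries in the PROCESSORS list.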

2. DATA PROCESSING
The processing capabilities of a Data Lake enable
innovative and creative questioning to happen at speeds
and scales never before seen in legacy data processing
environments. Queries and workloads run across the Data
Lake’s cluster of nodes rather than on a single server,
which reduces the load placed on any one machine.
This maximizes the Data Lake’s ability to deliver results in
a timely, streamlined way.
The freedom and expressive power of a Data Lake’s
processing paradigms allow users to think beyond simply
asking questions of single data sources (e.g., a query
performed on a relational data store).
Newer technologies allow entire datasets across the Data
Lake to be loaded into the memory of the cluster, further
reducing the time to compute heavy workloads and, for
some workloads, delivering results up to 100 times faster.
By lowering the barriers to accessing and extracting value
from the agency’s data, the Data Lake’s processing paradigms
advance the ability to gain new insights from the data.
From challenges as simple as the word count of a dataset, to as
complicated as processing streaming biometric information, no
workload is too small, too large, too simple, or too complex to be
performed inside the Data Lake.
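The word count mentioned above is a convenient way to see how work is split across a cluster. The sketch below only simulates the pattern on one machine: each list in PARTITIONS stands in for the data held by one node, and a thread pool stands in for the cluster. The sample text and function names are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Each partition stands in for the slice of a dataset held by one cluster node.
PARTITIONS = [
    ["the lake stores raw data", "data streams in"],
    ["users sample the data"],
    ["the lake scales out"],
]


def count_partition(lines):
    """Map step: count words within a single partition."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def word_count(partitions, workers=3):
    """Run the map step per partition in parallel, then merge the partial counts."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(count_partition, partitions))
    total = Counter()
    for partial in partials:
        total.update(partial)  # reduce step: add counts together
    return total


totals = word_count(PARTITIONS)
print(totals.most_common(3))
```

No single worker ever sees the whole dataset; only the small partial counts are brought together, which is why the same approach scales from small to very large inputs.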

3. ROBUST DATA GOVERNANCE
Data Lakes offer a single source of truth for an agency.
Therefore, it’s imperative that the data is appropriately
secured and only accessed by authorized individuals. Data
accountability can be established by using a combination
of native tools to ensure that users are only authorized to
view and execute actions that are approved for their role.
This accountability also allows security and audit
specialists to easily evaluate the data configurations and
operations across the Data Lake.
In addition to restricting access, an important piece in the
data and information access control strategy is
implementing data governance, retention, and lineage
policies. Introducing these types of policies at the point of
ingestion to the Data Lake automates an otherwise tedious
and complicated process.
Conducting stakeholder interviews to gain an
understanding of target high-value data systems enables a
holistic understanding of the taxonomies present in the
enterprise, and establishes the data governance and
access needs of the Data Lake.
Governance combines data quality, data management, policy management,
business process management, and risk management to ensure data
is formally and properly managed.
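Applying access, retention, and lineage policies at the point of ingestion can be sketched as below. This is an illustration under stated assumptions: the POLICIES table, the role names, the retention periods, and the ingest and read helpers are all hypothetical, and a production Data Lake would enforce these rules with its platform's own security tooling.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta

# Hypothetical policy table: classification -> (roles allowed to read, retention days).
POLICIES = {
    "public": ({"analyst", "auditor", "admin"}, 3650),
    "sensitive": ({"auditor", "admin"}, 365),
}


@dataclass
class GovernedRecord:
    payload: dict
    classification: str
    allowed_roles: set
    expires_on: date
    lineage: list = field(default_factory=list)


def ingest(payload, classification, source, today):
    """Attach access, retention, and lineage metadata as the record enters the lake."""
    roles, retention_days = POLICIES[classification]
    return GovernedRecord(
        payload=payload,
        classification=classification,
        allowed_roles=set(roles),
        expires_on=today + timedelta(days=retention_days),
        lineage=[f"ingested from {source} on {today.isoformat()}"],
    )


def read(records, role, today):
    """Return only unexpired payloads that the given role is authorized to view."""
    return [
        r.payload
        for r in records
        if role in r.allowed_roles and today <= r.expires_on
    ]


today = date(2018, 1, 1)
lake = [
    ingest({"case": "A-100"}, "public", "case_system", today),
    ingest({"ssn_last4": "1234"}, "sensitive", "hr_system", today),
]
print(read(lake, "analyst", today))
print(read(lake, "auditor", today))
print(read(lake, "auditor", today + timedelta(days=400)))
```

Because every record carries its own rules and lineage from the moment it is ingested, an auditor can inspect the metadata directly rather than reconstructing it from separate tools.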

4. DATA RETRIEVAL & VISUALIZATION
One of the most important components of a Data Lake is
the ability to retrieve, analyze, visualize, and share insights
derived from data. Communicating data visually is directly
in line with the key pillars of a successful Data Lake.
Legacy COTS reporting tools are not designed to provide
the creative, captivating, and accessible analytics and
insight desired by users.
This means that the Data Lake’s tools must support the
dynamic challenge of enabling users to easily prepare
visually compelling data stories. As data-related problems
grow in size and complexity, traditional reverse-engineered
analysis methods that require pre-formulated hypotheses
and data source/schema decisions become more
expensive, less accurate, and too rigid for analysts to use
to make timely decisions.
Custom data visualization tools are well suited to giving
an agency a platform for visual reporting based on public
data, and built-in automation features can deliver those
reports directly to a user’s email.
Today’s data user is accustomed to interaction with apps and data on
their personal devices via sophisticated user experiences and
compelling visual narratives.
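Automated delivery of a visual report by email can be sketched in a few lines. The example below is a stand-in only: a text bar chart takes the place of a real dashboard, the message is composed but not sent, and the metric values, addresses, and function names are hypothetical.

```python
from email.message import EmailMessage

# Hypothetical aggregated metrics already retrieved from the lake.
REQUESTS_BY_MONTH = {"Jan": 120, "Feb": 180, "Mar": 90}


def text_bar_chart(values, width=30):
    """Render a mapping of label -> number as a fixed-width text bar chart."""
    peak = max(values.values())
    lines = []
    for label, value in values.items():
        bar = "#" * round(width * value / peak)
        lines.append(f"{label:<4}{bar} {value}")
    return "\n".join(lines)


def build_report_email(recipient, values):
    """Compose (but do not send) a report message for one subscriber."""
    message = EmailMessage()
    message["To"] = recipient
    message["From"] = "reports@example.com"
    message["Subject"] = "Monthly service requests"
    message.set_content("Service requests by month\n\n" + text_bar_chart(values))
    return message


chart = text_bar_chart(REQUESTS_BY_MONTH)
report = build_report_email("analyst@example.com", REQUESTS_BY_MONTH)
print(chart)
```

Scheduling a function like build_report_email and handing the result to a mail server is what turns a one-off chart into a recurring report.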

5. ADVANCED ANALYTICS
Traditionally, data science projects were incredibly costly
due to the amount of resources needed to perform the
analytical processing required by certain algorithms and
processes. These barriers made the field of data science
difficult to access, because a successful project was too
expensive in both time and costs. However, with a Data
Lake, the ability to process data at huge scales is now
more readily available for data science applications.
An agency can exploit the capabilities found in the Data
Lake by using its cluster-based data processing
paradigms. Advanced analytical techniques commonly
found in data science applications can then be applied.
These techniques include machine learning, natural
language processing, image processing, data mining,
predictive analytics, statistical analytics, and more.

OVERVIEW OF A DATA LAKE’S CAPABILITIES
Successfully implementing a Data Lake environment
requires an advanced understanding of the analytical
insight possibilities the holistic platform provides via its
mixed ecosystem of cutting-edge open-source
technologies and best-of-breed commercial software.
Identifying the best approach for developing and
implementing the components, as well as the end goal of
the insights to be derived from a Data Lake, is critical for
architecting a successful environment.
Incorporating best practices for analyzing, interpreting,
and understanding data science-generated results to
support data-driven decision making also helps ensure
success. Best practices, coupled with teams that combine
skills in mathematics, computer science, and domain
expertise to solve complex data challenges, allow
agencies to maximize data discovery, data-driven
decision making, and return on analytics innovation. All
of this is built on a foundation of standardized
metadata, firm access protocols, intelligent discovery
mechanisms, and a flexible data governance process that
reduces data silos.

SECTION 5:
MAXIMIZING THE VALUE OF A DATA LAKE

DATA REVOLVES AROUND CITIZENS
A Data Lake is only as
powerful as the insights an
agency is able to derive
from its contents. Those
insights are only as
valuable as the agency’s
ability to power change via
them.
This end state requires the
ability for stakeholders and
users to derive insights
leveraging a Citizen
Engagement Model (CEM)
integration.
Using a component driven
design and development
approach leveraging best
practices from Human
Centered Design and Agile
principles will help
agencies increase the
usability, searchability,
findability, and extensibility
of their data.

ENHANCING CITIZEN EXPERIENCE
By integrating the citizen-
centric data lake with the
CEM, agencies are able to
gather new, valuable insights
from previously siloed
datasets. Those insights:
• Enable quantitative assessment of changing customer needs and technological
  innovations
• Identify metrics, KPIs, and requirements needed to build CEM dashboards
• Identify additional data sources required
• Improve relevancy of search index and recommendations related to structured and
  unstructured searches
• Provide support to create, maintain, and improve loading process
• Support configuration and maintenance of the current data environments
Properly architecting a Data Lake will provide agencies with numerous benefits including
low-cost storage, custom configurations, unified enterprise data, and the ability to
securely scale – all of which provide agencies with a unique competitive advantage.

ASSESSING READINESS
The delivery of the Data Lake does not end with architecting, deploying, integrating, and
configuring the solution. The Data Lake is built on the concept of removing barriers to
innovating with data, but without proper education delivered by expert practitioners in the
fields of Data Science, Big Data, and Cloud Computing, the opportunities the Data Lake
enables cannot be fully realized.
Having a team of highly skilled experts supporting a Data Lake is essential to the realization
of a fully functioning Data Lake. Our team of full-service data scientists, with
specializations across Big Data, large-scale data platforms, advanced analytics,
mathematical modeling, and computer science, is uniquely qualified to provide the level of
education our customers require.
We possess deep technical expertise in open source development technologies and
containerization methods that bring efficiencies to development efforts, and we have a deep
bench of software developer consultants who bring the greatest level of technical acumen
and availability. Our team is not only an avid user and implementer of open source
software, but has also given back to the open source community as active contributors to
the Apache Accumulo, Hadoop, NiFi, and Mahout projects.
ABOUT METROSTAR SYSTEMS
MetroStar Systems has been a trusted partner, delivering leading-edge technology
solutions to federal and defense agencies since 1999. MetroStar’s unique blend of
cross-functional experts across three practice areas (Cybersecurity, Digital, and Enterprise IT)
enables the successful delivery of transformative solutions. Learn more about our work
implementing data lakes for federal agencies: https://www.metrostarsystems.com

TO LEARN MORE ABOUT METROSTAR SYSTEMS:
Contact: Debbie Peterson
1856 Old Reston Avenue, Suite 100
Reston, VA 20190
703.481.9581
dpeterson@metrostarsystems.com
www.metrostarsystems.com
© Copyright 2018 MetroStar Systems, Inc., This document is current as of the initial date of publication and may be changed by MetroStar Systems at any time. The
performance data and examples cited are presented for illustrative purposes only. Actual performance results may vary depending on specific configurations and
operating conditions. THE INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS” WITHOUT ANY WARRANTY, EXPRESS OR IMPLIED, INCLUDING WITHOUT
ANY WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND ANY WARRANTY OR CONDITION OF NON-INFRINGEMENT.
