5 Steps for Architecting a Data Lake

  1. 5 STEPS FOR ARCHITECTING A DATA LAKE: How to maximize intelligence by unifying enterprise data. © 2018 MetroStar Systems, Inc. - All Rights Reserved
  2. TABLE OF CONTENTS
     SECTION 1: INTRODUCTION
       Data Growth Challenges
     SECTION 2: WHAT IS A DATA LAKE?
       How Does a Data Lake Work?
       Data Lake vs Traditional Approach
     SECTION 3: DATA LAKE REQUIREMENTS
       Creating a Successful Data Lake
       Data Lake Governance
       Selecting the Right Platform
     SECTION 4: 5 STEPS FOR ARCHITECTING A DATA LAKE
       1. Ingestion & Storage
       2. Data Processing
       3. Robust Data Governance
       4. Data Retrieval and Visualization
       5. Advanced Analytics
       Overview of a Data Lake’s Capabilities
     SECTION 5: MAXIMIZING THE VALUE OF A DATA LAKE
       Data Revolves Around Citizens
       Enhancing Citizen Experience
     ASSESSING READINESS
  3. SECTION 1: INTRODUCTION
  4. DATA GROWTH CHALLENGES
     Data Growth Challenges:
     • High overhead costs due to inflexible architecture and legacy technology maintenance
     • Antiquated data environments that suffer from poor master data management practices
     • Low data integrity due to the lack of a single source of truth for the data
     • Inability to give internal users, analysts, developers, and management the tools they need to perform their roles at the level of quality expected in today’s workplace
     Enterprises that do not employ Data Lake platforms can find themselves outpaced by their agency’s data growth.
     AS AN AGENCY GROWS, SO DOES ITS DATA. Data is no longer limited to structured, relational, or transactional records. It now includes semi-structured data, unstructured data, operational logs, social media, free text, and more. The ability to ingest data of all varieties is imperative to gaining a holistic understanding of the digital ecosystem. Agencies can pair cutting-edge technologies with wide-ranging, high-integrity data sources to derive powerful insights into their operational and theoretical questions. By coupling the robust technologies of a Data Lake with the flexible, cost-effective capabilities of a Cloud Service Provider (CSP) such as Amazon Web Services (AWS) or Microsoft Azure, among others, the Data Lake becomes a powerful asset for agencies large and small.
     Source: http://infosysblogs.com/brandededge/2013/04/20130419infographic.html
  5. SECTION 2: WHAT IS A DATA LAKE?
  6. HOW DOES A DATA LAKE WORK?
     A Data Lake is a natural maturation of data migrating to a single environment. The Data Lake provides capabilities seldom seen in IT enterprises that employ disparate data stores and databases.
     “A data lake is like a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.” – James Dixon, CTO, Pentaho
  7. DATA LAKE vs TRADITIONAL APPROACH
     • Data Storage
       Data Lake: Structured, semi-structured, or unstructured data can be stored at low cost, either with a schema (e.g. relational) or schema-less.
       Traditional: Data is stored in vertically scaling relational database management systems (RDBMS) at high cost.
     • Advanced Analytics
       Data Lake: Analytics can be run on any and all data sets in real time (e.g. in-memory machine learning algorithms) without requiring upfront manual processing or preparation.
       Traditional: Data typically has to be manually prepared and integrated from multiple sources, which can be a significant barrier to generating rapid insights.
     • Enterprise Data Taxonomy
       Data Lake: Multiple taxonomies, schemas, and standards can exist in a single data environment while being applied by different data stakeholder groups.
       Traditional: Agencies have struggled to create a single taxonomy or schema to represent the enterprise data model.
     • User Access Control
       Data Lake: Data is tagged at ingestion (and automatically analyzed on read) with the appropriate authorization rules. Authentication can be controlled through single sign-on (SSO) capabilities.
       Traditional: Data authentication and authorization are specified using manually controlled and disparate tools (e.g. Access Control Lists).
     • Business Intelligence
       Data Lake: Information and analytics are conveyed using automated, feature-rich dashboards and visualizations.
       Traditional: Information and analytics are presented in compiled, static reports.
     Data Lake implementations using Big Data technologies like Hadoop represent a paradigm shift in the data enterprise objectives for agencies. This shift allows legacy or traditional approaches to data utilization to advance dramatically.
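The schema-less storage and multiple-taxonomy rows above can be illustrated with a minimal schema-on-read sketch. This is an illustration under assumptions, not a method described in the deck: the record contents, the `read_with_schema` helper, and the two views are all hypothetical, and a plain Python list stands in for lake storage.

```python
import json

# Hypothetical raw records, stored exactly as they arrived (no upfront schema).
RAW_RECORDS = [
    '{"id": 1, "name": "Ada", "visits": "3"}',
    '{"id": 2, "name": "Lin", "email": "lin@example.org"}',
    'free-text note with no structure',
]


def read_with_schema(raw_lines, schema):
    """Apply a schema at read time; skip lines that are not JSON objects."""
    for line in raw_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # unstructured data stays in storage, just not in this view
        if not isinstance(record, dict):
            continue
        # Project onto the fields this consumer cares about, casting as requested.
        yield {
            field: cast(record[field]) if field in record else None
            for field, cast in schema.items()
        }


# Two stakeholder groups read the same raw data through different schemas.
analytics_view = list(read_with_schema(RAW_RECORDS, {"id": int, "visits": int}))
contact_view = list(read_with_schema(RAW_RECORDS, {"name": str, "email": str}))
```

The point of the design is that neither view required the data to be reshaped before it was stored; each consumer decides at read time which fields and types matter.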
  8. SECTION 3: DATA LAKE REQUIREMENTS
  9. CREATING A SUCCESSFUL DATA LAKE
     Scaling the value proposition of the Data Lake starts with making data accessible and easy to use. The Data Lake’s data consumers will have diverse needs, so a common data storage and access infrastructure, paired with a fully featured Cloud Service Provider (e.g., Amazon Web Services or Microsoft Azure), provides the capabilities and flexibility needed to drive innovative uses of data and data services.
     A significant challenge when striving for innovative results is “vendor lock-in,” which is caused by proprietary commercial-off-the-shelf (COTS) technologies that make it difficult to modify, scale, or transition to new data uses and services. Building a Data Lake on best-of-breed open-source cloud architectures helps overcome vendor lock-in, eliminates linkage maintenance of stove-piped systems, increases ease of data use, expedites delivery, and ultimately reduces the risks and costs associated with achieving innovation.
     A successful Data Lake implementation also allows data across the agency to be integrated and leveraged in a sophisticated solution. It begins with a modular, modern, cluster-based architecture (multiple interconnected servers) grounded in a flexible infrastructure platform.
  10. DATA LAKE GOVERNANCE
     Without data lake governance, businesses could be left without meaningful business intelligence, or could even put the business itself at risk.
  11. SELECTING THE RIGHT PLATFORM
     AWS
     Agencies have successfully used AWS to support workloads and solutions with data ranging from Controlled Unclassified Information (CUI) to Top Secret classifications.
     • Amazon EMR (Elastic MapReduce), a managed Hadoop, Spark, and Presto solution
     • EMR ingests data using a number of AWS services
     • AWS also has real-time analytics, predictive analytics, and data dashboard and visualization capabilities
     • AWS has been used to support government missions in health and human sciences, defense, intelligence, statistical, regulatory, and financial industries
     Microsoft Azure
     • Azure includes the managed Apache platform HDInsight (Hadoop, Spark, Storm, HBase)
     • HDInsight includes a local Hadoop Distributed File System (HDFS), connected to the Data Lake
     • Azure Data Lake Store can store data in its native format, without prior transformations
     • Recently added Azure Data Lake Analytics, a serverless hyper-scale data storage and analytical platform
     Google
     • Fully managed Hadoop and Spark offering
     • Provides a fully programmable framework for Java and Python
     • Cloud Dataflow & Spark for pipeline execution
     • Machine Learning as a fully managed platform for training and hosting
     • Google offers a Cloud Machine Learning Engine to build models based on TensorFlow’s deep learning library
     *Comparisons shown above based on August 2017 data
  12. SECTION 4: 5 STEPS FOR ARCHITECTING A DATA LAKE
  13. 1. INGESTION & STORAGE
     PROPER DATA INGESTION IS CRUCIAL TO THE SUCCESS OF A DATA LAKE. Understanding the velocity, size, format, and frequency of the data being ingested, and how it will be analyzed, ensures that the architecture properly accommodates the data.
     DATA INGESTION
     To begin data ingestion, agencies must analyze the high-value data sources present in the enterprise. These data sources are typically relational or transactional and offer quick-win opportunities to establish the Data Lake as the single source of truth. The processes used to obtain and capture data can be iterated upon, and open source tools can reduce the complexity of configuring data ingestion.
     DATA STORAGE
     Developing a data pipeline means implementing events called processors, each of which handles a specific extract, transform, and load (ETL) process on incoming data. For data that requires more advanced processing, native tools can help bridge the gap between data collection, data ETL (including applying governance policies and access control), and data storage. Data from relational sources can be stored using native technologies.
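The idea of a pipeline built from processors can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the tooling the deck has in mind: the processor functions, the `hr` tagging rule, and the in-memory list standing in for lake storage are all hypothetical.

```python
from typing import Callable, Optional

Record = dict
Processor = Callable[[Record], Optional[Record]]


def drop_incomplete(record: Record) -> Optional[Record]:
    """Extract step: discard records missing a required field."""
    return record if "source" in record and "value" in record else None


def normalize_value(record: Record) -> Optional[Record]:
    """Transform step: cast the value to a float."""
    return {**record, "value": float(record["value"])}


def tag_sensitivity(record: Record) -> Optional[Record]:
    """Governance step: attach an access tag at ingestion (hypothetical rule)."""
    tag = "restricted" if record["source"] == "hr" else "general"
    return {**record, "access_tag": tag}


def run_pipeline(records, processors):
    """Pass each record through every processor; a None result drops the record."""
    stored = []
    for record in records:
        for processor in processors:
            record = processor(record)
            if record is None:
                break
        else:
            stored.append(record)  # load step: a list stands in for lake storage
    return stored


incoming = [
    {"source": "hr", "value": "42"},
    {"source": "web", "value": "3.5"},
    {"value": "7"},  # missing "source", so it is dropped
]
lake = run_pipeline(incoming, [drop_incomplete, normalize_value, tag_sensitivity])
```

Keeping each processor small and single-purpose is what lets the ingestion process be iterated upon: a step can be added, removed, or reordered without touching the others.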
  14. 2. DATA PROCESSING
     The processing capabilities of a Data Lake enable innovative and creative questioning at speeds and scales not available in legacy data processing environments. Queries and workloads run across the Data Lake’s cluster of nodes rather than on a single server, which reduces the resources required of any one server. This maximizes the Data Lake’s ability to deliver results in a timely, streamlined way.
     The freedom and expressiveness of a Data Lake’s processing paradigms allow users to think beyond asking questions of single data sources (e.g., a query performed on a relational data store). Newer technologies allow entire datasets across the Data Lake to be loaded into the memory of the cluster, further reducing the time needed to compute heavy workloads and delivering results up to 100 times faster.
     By lowering the barriers to accessing and extracting value from the agency’s data, the Data Lake’s processing paradigms advance the ability to gain new insights from that data. From challenges as simple as counting the words in a dataset to challenges as complicated as processing streaming biometric information, no workload is too small, too large, too simple, or too complex to be performed inside the Data Lake.
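The word count mentioned above is the classic example of this style of processing: split the data into partitions, count each partition independently, then merge the partial results. The sketch below is a single-machine illustration only; the thread pool is an assumed stand-in for cluster nodes, and the function names and sample data are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def count_words(partition):
    """Map step: count words within one partition of the dataset."""
    counts = Counter()
    for line in partition:
        counts.update(line.lower().split())
    return counts


def word_count(lines, workers=3):
    """Split lines across workers, count in parallel, then merge."""
    partitions = [lines[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = list(pool.map(count_words, partitions))
    # Reduce step: merge the per-partition counts into one result.
    return reduce(lambda a, b: a + b, partial_counts, Counter())


dataset = [
    "the lake stores data",
    "the cluster processes data",
    "data drives insight",
]
totals = word_count(dataset)
```

Because each partition is counted without reference to the others, the same logic scales from three lines on one machine to a dataset spread across many nodes; only the partitioning and the merge need to know about the cluster.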
  15. 3. ROBUST DATA GOVERNANCE
     Governance combines data quality, data management, policy management, business process management, and risk management to ensure data is formally and properly managed.
     Data Lakes offer a single source of truth for an agency. It is therefore imperative that the data is appropriately secured and accessed only by authorized individuals. Data accountability can be established by using a combination of native tools to ensure that users are authorized to view and execute only the actions approved for their role. This accountability also allows security and audit specialists to easily evaluate the data configurations and operations across the Data Lake.
     In addition to restricting access, an important piece of the data and information access control strategy is implementing data governance, retention, and lineage policies. Introducing these policies at the point of ingestion into the Data Lake automates an otherwise tedious and complicated process. Conducting stakeholder interviews to understand the target high-value data systems gives a holistic view of the taxonomies present in the enterprise and establishes the data governance and access needs of the Data Lake.
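Applying access, retention, and lineage policies at the point of ingestion might look like the following minimal sketch. Everything here is an assumption made for illustration: the `POLICIES` table, the role names, the retention periods, and the `GovernedRecord` type are hypothetical and do not describe any particular governance tool.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Hypothetical policy table: which roles may read each classification,
# and how long records of that classification are retained.
POLICIES = {
    "public": {"roles": {"analyst", "developer", "auditor"}, "retention_days": 3650},
    "sensitive": {"roles": {"auditor"}, "retention_days": 365},
}


@dataclass(frozen=True)
class GovernedRecord:
    payload: dict
    classification: str
    source_system: str  # lineage: where the record came from
    ingested_on: date
    expires_on: date


def ingest(payload, classification, source_system, today):
    """Attach governance metadata when the record enters the lake."""
    policy = POLICIES[classification]
    return GovernedRecord(
        payload=payload,
        classification=classification,
        source_system=source_system,
        ingested_on=today,
        expires_on=today + timedelta(days=policy["retention_days"]),
    )


def can_read(role, record):
    """Authorize by role against the record's classification."""
    return role in POLICIES[record.classification]["roles"]


def is_expired(record, today):
    """Retention check: True once the record has passed its expiry date."""
    return today > record.expires_on


record = ingest({"case_id": 17}, "sensitive", "case-management", date(2018, 1, 1))
```

Because the classification, source system, and expiry date travel with the record from the moment it is ingested, later access checks and retention sweeps need only the record itself and the policy table, which is also what makes the configuration straightforward to audit.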
  16. 4. DATA RETRIEVAL & VISUALIZATION
     Today’s data users are accustomed to interacting with apps and data on their personal devices through sophisticated user experiences and compelling visual narratives.
     One of the most important capabilities of a Data Lake is the ability to retrieve, analyze, visualize, and share insights derived from data. Communicating data visually is directly in line with the key pillars of a successful Data Lake. Legacy COTS reporting tools are not designed to provide the creative, captivating, and accessible analytics and insight that users want. The Data Lake’s tools must therefore make it easy for users to prepare visually compelling data stories.
     As data-related problems grow in size and complexity, traditional reverse-engineered analysis methods that require pre-formulated hypotheses and upfront data source and schema decisions become more expensive, less accurate, and too rigid for analysts to use in making timely decisions. Custom data visualization tools are well suited to giving an agency a platform for visual reporting based on public data, which built-in automation features can deliver directly to a user’s email.
  17. 5. ADVANCED ANALYTICS
     Traditionally, data science projects were very costly because of the resources needed to perform the analytical processing that certain algorithms and processes require. These barriers made the field of data science difficult to access, because a successful project was too expensive in both time and money. With a Data Lake, however, the ability to process data at very large scale is more readily available for data science applications.
     An agency can exploit the capabilities of the Data Lake by using its cluster-based data processing paradigms. Advanced analytical techniques commonly found in data science applications can then be applied. These techniques include machine learning, natural language processing, image processing, data mining, predictive analytics, statistical analytics, and more.
  18. OVERVIEW OF A DATA LAKE’S CAPABILITIES
     Successfully implementing a Data Lake environment requires an advanced understanding of the analytical insights the platform makes possible through its mixed ecosystem of cutting-edge open-source technologies and best-of-breed commercial software. Identifying the best approach for developing and implementing the components, and the end goal of the insights to be derived from the Data Lake, is critical to architecting a successful environment. Incorporating best practices for analyzing, interpreting, and understanding data science-generated results in support of data-driven decision making also helps ensure success.
     Best practices, coupled with teams that combine skills in mathematics, computer science, and domain expertise to solve complex data challenges, allow agencies to maximize data discovery, data-driven decision making, and return on analytics innovation. All of this is built on a foundation of standardized metadata, firm access protocols, intelligent discovery mechanisms, and a flexible data governance process that reduces data silos.
  19. SECTION 5: MAXIMIZING THE VALUE OF A DATA LAKE
  20. DATA REVOLVES AROUND CITIZENS
     A Data Lake is only as powerful as the insights an agency is able to derive from its contents, and those insights are only as valuable as the agency’s ability to drive change with them. Reaching this end state requires that stakeholders and users be able to derive insights through integration with a Citizen Engagement Model (CEM). A component-driven design and development approach that draws on best practices from Human Centered Design and Agile principles will help agencies increase the usability, searchability, findability, and extensibility of their data.
  21. ENHANCING CITIZEN EXPERIENCE
     By integrating the citizen-centric data lake with the CEM, agencies are able to gather new, valuable insights from previously siloed datasets. Those insights:
     • Enable quantitative assessment of changing customer needs and technological innovations
     • Identify metrics, KPIs, and requirements needed to build CEM dashboards
     • Identify additional data sources required
     • Improve relevancy of search index and recommendations related to structured and unstructured searches
     • Provide support to create, maintain, and improve the loading process
     • Support configuration and maintenance of the current data environments
     Properly architecting a Data Lake will provide agencies with numerous benefits, including low-cost storage, custom configurations, unified enterprise data, and the ability to scale securely, all of which give agencies a unique competitive advantage.
  22. ASSESSING READINESS
     The delivery of the Data Lake does not end with architecting, deploying, integrating, and configuring the solution. The Data Lake is built on the concept of removing barriers to innovating with data, but without proper education delivered by expert practitioners in Data Science, Big Data, and Cloud Computing, the opportunities the Data Lake enables cannot be fully realized.
     A team of highly skilled experts supporting the Data Lake is essential to realizing a fully functioning Data Lake. Our team of full-service data scientists, with specializations across Big Data, large-scale data platforms, advanced analytics, mathematical modeling, and computer science, is uniquely qualified to provide the level of educational care our customers require. We possess deep technical expertise in open source development technologies and containerization methods that bring efficiencies to development efforts, and we have a deep bench of software developer consultants who bring the greatest level of technical acumen and availability. Our team is not only an avid user and implementer of open source software, but has also given back to the open source community as active contributors to the Apache Accumulo, Hadoop, NiFi, and Mahout projects.
     ABOUT METROSTAR SYSTEMS
     MetroStar Systems has been a trusted partner, delivering leading-edge technology solutions to federal and defense agencies since 1999. MetroStar’s unique blend of cross-functional experts across three practice areas (Cybersecurity, Digital, and Enterprise IT) enables the successful delivery of transformative solutions. Learn more about our work implementing data lakes for federal agencies: https://www.metrostarsystems.com
  23. TO LEARN MORE ABOUT METROSTAR SYSTEMS:
     Contact: Debbie Peterson, 1856 Old Reston Avenue, Suite 100, Reston, VA 20190, 703.481.9581, dpeterson@metrostarsystems.com, www.metrostarsystems.com
     © Copyright 2018 MetroStar Systems, Inc. This document is current as of the initial date of publication and may be changed by MetroStar Systems at any time. The performance data and examples cited are presented for illustrative purposes only. Actual performance results may vary depending on specific configurations and operating conditions. THE INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS” WITHOUT ANY WARRANTY, EXPRESS OR IMPLIED, INCLUDING WITHOUT ANY WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND ANY WARRANTY OR CONDITION OF NON-INFRINGEMENT.