Data Lake Architecture

  1. The First Step in Information Management looker.com Produced by: MONTHLY SERIES In partnership with: Data Lake Architecture October 5, 2017
  2. Topics for Today’s Analytics Webinar
     - Benefits and Risks of a Data Lake
     - Data Lake Reference Architecture
     - Lab and the Factory
     - Base Environment for Batch Analytics, Streaming and Real-Time Data
     - Critical Governance Components
     - Key Take-Aways
     - Q&A
     © 2017 First San Francisco Partners www.firstsanfranciscopartners.com
  3. Polling Questions
     - Do you have a data lake?
       - Yes
       - No
       - Unsure
     - If yes, is it:
       - Operational and regularly used for analytics
       - Informally used, like a lab or sandbox
       - Unsure
  4. Defining the Data Lake
     - A data lake is a collection of storage instances of various data assets additional to the originating data sources. These assets are stored in a near-exact, or even exact, copy of the source format.
     - The purpose of a data lake is to present an unrefined view of data to only the most highly skilled analysts, to help them explore their data refinement and analysis techniques independent of any of the system-of-record compromises that may exist in a traditional analytic data store (such as a data mart or data warehouse).
     - A data lake can support either/or exploratory analytics and operational uses of data.
     Source: Gartner IT Glossary
  5. Benefits and Risks of the Data Lake
  6. Benefits of the Data Lake
     - Enables “productionizing” advanced analytics
     - Cost-effective scalability and flexibility
     - Derives value from unlimited data types (including raw data)
     - Reduces long-term cost of ownership across entire spectrum of data use
  7. Risks of the Data Lake
     - Loss of trust
     - Loss of relevance and momentum
     - Increased risk
     - Long-term excessive cost
  8. Data Lake Reference Architecture
  9. Modern Reality of the Data Lake
     - The data lake has changed due to storage availability, data management tools and the ease with which data can be managed.
     - Today’s data lake is composed of:
       - Landing Zone
       - Standardization Zone
       - Analytics Sandbox
  10. Modern Reality of the Data Lake (diagram: Data Sources, Landing Zone)
      Landing Zone: Closest to the original data lake conception, where raw data is stored and available for consumption
  11. Modern Reality of the Data Lake (diagram: Data Sources, Landing Zone, Standardization Zone)
      Standardization Zone: Standardized, cleaned data – the preferred version for downstream consumers and the Analytics Sandbox
  12. Modern Reality of the Data Lake (diagram: Data Sources, Landing Zone, Standardization Zone, Analytics Sandbox)
      Analytics Sandbox: Where Data Scientists work to create new models
  13. Modern Reality of the Data Lake (diagram adds: Data Management)
  14. Modern Reality of the Data Lake (diagram adds: Data Governance, Data Operations)
  15. Modern Reality of the Data Lake (diagram adds: Data Scientists)
  16. Modern Reality of the Data Lake (complete diagram: Data Sources, Landing Zone, Standardization Zone, Analytics Sandbox, Data Management, Data Governance, Data Operations, Data Scientists, Data Consumers)
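The three zones named on slides 9 to 12 can be pictured as successive copies of the same data. The following is only a minimal Python sketch under stated assumptions: the `DataLake` class, its method names and the cleaning rule (trim whitespace, lower-case field names) are hypothetical illustrations, not something the slides prescribe.

```python
from dataclasses import dataclass, field


@dataclass
class DataLake:
    """Hypothetical in-memory stand-in for the three zones on the slides."""
    landing: list = field(default_factory=list)       # raw copies, as received
    standardized: list = field(default_factory=list)  # cleaned, preferred version
    sandbox: list = field(default_factory=list)       # data scientists' working copies

    def ingest(self, raw_record: dict) -> None:
        # Landing Zone: keep an exact copy of what the source sent.
        self.landing.append(dict(raw_record))

    def standardize(self) -> None:
        # Standardization Zone: assumed cleaning rule (trim and lower-case names).
        self.standardized = [
            {k.strip().lower(): (v.strip() if isinstance(v, str) else v)
             for k, v in record.items()}
            for record in self.landing
        ]

    def checkout_to_sandbox(self) -> list:
        # Analytics Sandbox: work on copies so experiments do not alter the lake.
        self.sandbox = [dict(record) for record in self.standardized]
        return self.sandbox


lake = DataLake()
lake.ingest({" Name ": " Ada "})
lake.standardize()
experiment = lake.checkout_to_sandbox()
experiment[0]["name"] = "changed in the sandbox"
```

The raw record stays untouched in the landing zone, and edits made in the sandbox do not flow back into the standardized copy.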
  17. Reminder: Two Lenses to Derive an Effective Architecture
      - Form: Developing the architecture so all stakeholders can actually understand and develop it
      - Progression: Develop architectures that are best fit for purpose and effective, no matter how simple or complex
  18. Lab and the Factory
  19. Why is This Topic Important?
      - A key to successful data lake management is understanding if it is a lab, a factory or both.
      - There are architectural, governance and organizational impacts.
      - You must clearly identify if you are evolving from a lab to a factory or intend to keep them separate.
  20. First Progression – Lab Elements (diagram)
      Headings: Functional Elements, Technology Elements, Organization Elements; Data Consumption, Data Supply Chain/Logistics, Data Management
      Elements shown: Landing/Staging; ETL; Data Analysts; Access – Publish, Subscribe, Notify; Access Tools – BI, Analytics; Analytics – Descriptive, Predictive, Prescriptive; HDFS, Columnar and Graph
  21. Operational Elements (diagram)
      Headings: Functional Elements, Technology Elements, Organization Elements; Data Consumption, Data Supply Chain/Logistics, Data Management
      Elements shown: Pedigree and Preparation; Landing/Staging; Model/Metrics Management; Data Reduction; Glossary Management; Machine Learning/AI; Data Governance; Data Operations; Data Ingestion; Reference and Master Data; Competency Centers; Self-Service/Data Citizens; ETL/Virtualization; Distributed Processing; Metadata; Data Quality/Hygiene; Lake, Pond, Warehouse; HDFS, Columnar and Graph; Data Streaming; Data Glossary; Data Lake Management; Taxonomy/Ontology; Web Services; Policy and Process; Data Analysts and Scientists; Collaboration, Decision-Making; Access – Publish, Subscribe, Notify; Access Tools – BI, Analytics; Applications; Analytics – Descriptive, Predictive, Prescriptive; Business/Tech. Planning; Security, Privacy; Business Continuity
  22. The Lab – Characteristics
      - Allows for experimentation, testing new models, proofs of concept
      - Technical
        - Flexible architectures, even ad hoc or non-persistent
        - Rarely documented
        - Schema on read
      - Organizational
        - Run by the main users, hence informal or departmental
      - Functional
        - By nature, results should be evaluated for relevance
  23. The Factory – Characteristics
      - Addressing directed requirements, producing regular outputs associated with a business service, product or action
      - Technical
        - Architecture needs to be defined so its use and limits are understood
      - Organizational
        - Published rules of engagement
      - Functional
        - Data quality is monitored and known
        - Lineage and metadata support navigation and use of content
        - May need scheduled access and loading
        - Publishing results will require some form of quality control and approval
        - Models that are executed on a scheduled basis will require some sort of administrative and maintenance capabilities
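"Schema on read" (slide 22) and "data quality is monitored and known" (slide 23) can be contrasted in a few lines. The sketch below is an illustration only: the function name `read_with_schema`, the sample records and the choice to count rejected records as the quality signal are all assumptions, not part of the presentation.

```python
import json


def read_with_schema(raw_lines, schema):
    """Apply a schema while reading raw JSON lines (schema on read).

    `schema` maps field names to casting functions. Records that fail to
    parse or cast are counted rather than silently dropped, so a
    factory-style process can monitor the rejection rate.
    """
    rows, rejected = [], 0
    for line in raw_lines:
        try:
            record = json.loads(line)
            rows.append({name: cast(record[name]) for name, cast in schema.items()})
        except (ValueError, KeyError, TypeError):
            rejected += 1
    return rows, rejected


# Raw lines stay untyped in storage; structure is imposed only when reading.
raw = ['{"id": "1", "amount": "9.5"}', '{"id": "x", "amount": "2"}', "not json"]
rows, rejected = read_with_schema(raw, {"id": int, "amount": float})
```

In a lab, the schema can change with every experiment because the stored lines never change. In a factory, the same reader would typically run on a schedule, with the rejection count reported to whoever monitors data quality.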
  24. Base Environment for Batch Analytics, Streaming and Real-Time Data
  25. A Base Environment (diagram labels: Data Governance, Data Operations)
      - Rapid ingestion – stream, low latency or batch updating
      - Ease of access – find it, use it, know what it means
      - Effective data supply chain – data of the correct quality needs to be where it is supposed to be
      - Flexibility – Data Scientists need to be able to experiment, but without polluting the lake
  26. Additional Components for Real-time Analytics and Ingesting Streaming Data
      - Are you replacing the Operational Data Store (ODS)?
      - Will you be doing full CRUD operations (Create, Read, Update, Delete)?
      - How fast do you need to go? Latencies should match your real needs.
      - Vendors – Hortonworks, Attunity, Splice
        - Ingest
        - Process
        - Consumption
      - Technologies you will hear about
        - Apache Kafka, Storm (real-time streaming components)
        - Apache Spark (fast batches)
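The "how fast do you need to go" question on slide 26 is largely a question of batch window size. As a rough illustration in plain Python (it does not use or imitate the API of Kafka, Storm or Spark, and the helper `micro_batches` is hypothetical), time-ordered events can be grouped into fixed windows. A smaller window means lower latency but more, smaller batches.

```python
from itertools import groupby


def micro_batches(events, window_seconds):
    """Group time-ordered (timestamp, payload) events into fixed windows.

    Assumes events arrive sorted by timestamp; each yielded list holds the
    payloads whose timestamps fall in the same window.
    """
    for _, group in groupby(events, key=lambda e: int(e[0] // window_seconds)):
        yield [payload for _, payload in group]


events = [(0.0, "a"), (0.4, "b"), (1.2, "c"), (3.9, "d")]
one_second = list(micro_batches(events, 1))   # three small batches
five_second = list(micro_batches(events, 5))  # one larger batch
```

With a one-second window, event "a" waits at most about a second before it can be processed. With a five-second window it may wait up to five, which is acceptable only if the business need tolerates that delay.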
  27. Critical Governance Components
  28. Major Areas of Data Governance Concern in the Data Lake
      Diagram components: Data Sources, Landing Zone, Standardization Zone, Analytics Sandbox, Data Governance, Data Operations, Data Scientists, Data Consumers
      Numbered areas on the diagram (area 3 appears three times):
      1. Data Acquisition
      2. Data Catalog
      3. Data Decisions
      4. Analytics Governance
      5. Data Usage
      6. Model Productionalization
      Some Data Governance approaches are new, and others are applications of traditional approaches
  29. Major Areas of Data Governance Concern in the Data Lake
      - Data is cataloged/mapped so it’s easily found
      - Data is described adequately to permit reuse for any need
      - Decisions about data are logged and communicated
      - Flow of data (data lineage) is documented, so users/regulators can understand where it came from
      - Staff who know and understand the data are identified
      Data Governance defines the information you need to maintain your data, develops the processes to do this, trains staff and provides the environments to manage the knowledge, while monitoring and ensuring compliance.
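The cataloging, description, lineage and "staff who know the data" points on slide 29 map naturally onto a small catalog record. The following Python sketch is an assumption-laden illustration: `CatalogEntry`, its fields, the `lineage` helper and the sample dataset names are all hypothetical and not taken from the presentation or from any catalog product.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Hypothetical catalog record covering the points on slide 29."""
    name: str                  # cataloged so the data is easily found
    description: str           # described adequately to permit reuse
    steward: str               # staff member who knows the data
    derived_from: list = field(default_factory=list)  # direct upstream datasets


def lineage(catalog, name):
    """Return all upstream dataset names, nearest first."""
    seen, queue = [], list(catalog[name].derived_from)
    while queue:
        parent = queue.pop(0)
        if parent not in seen:
            seen.append(parent)
            if parent in catalog:
                queue.extend(catalog[parent].derived_from)
    return seen


catalog = {
    "crm_raw": CatalogEntry("crm_raw", "CRM export as landed", "steward A"),
    "crm_clean": CatalogEntry("crm_clean", "Standardized CRM data", "steward A",
                              derived_from=["crm_raw"]),
    "churn_features": CatalogEntry("churn_features", "Model inputs", "steward B",
                                   derived_from=["crm_clean"]),
}
upstream = lineage(catalog, "churn_features")
```

A user or regulator asking where `churn_features` came from gets the chain back to the raw landing copy, along with the steward to contact at each step.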
  30. Evolution of Critical Governance Components
      - While flexible, governance is required to ensure appropriate use
      - While operational, governance will ensure legitimacy, compliance and verify alignment with business needs
      - To move to operational, governance should supply road map, new policies, training and organization management
  31. Key Take-Aways
  32. Key Take-Aways
      - Make sure you offer up business benefits in addition to traditional “access to data”, such as lower costs and more nimble reactions.
      - Avoid additional data risks by providing oversight of data quality and sources. Do not take a casual approach to managing the data lake assets.
      - Understand that the architectural aspect of the data lake (as it is evolving) is becoming a standard, much like the data warehouse.
      - Maintain an open mind for supporting technologies, because they are changing every day.
      - Implement Data Governance. It is a critical success factor, no matter how you view the data lake.
  33. Questions?
  34. Thank you for joining today! Please join us Thursday, Nov. 2 for the Keys to Effective Data Visualization webinar. John Ladley @jladley john@firstsanfranciscopartners.com Kelle O’Neal @kellezoneal kelle@firstsanfranciscopartners.com