
Shaping the Role of a Data Lake in a Modern Data Fabric Architecture

Denodo
21 Oct 2022


  1. Data Virtualization Packed Lunch Webinar Series: Sessions Covering Key Data Integration Challenges Solved with Data Virtualization
  2. Shaping the Role of a Data Lake in a Modern Data Fabric Architecture
     Pablo Alvarez, Global Director of Product Management, Denodo
     Alberto Pan, CTO, Denodo
  3. The Rise and Fall of the Data Lake
     • Data lakes were often the flagship initiatives of the Hadoop era
     • However, few data lakes managed to fulfill initial expectations, and many failed to deliver results
     • Those “data swamps” were often criticized for their lack of process, governance and security
     • However, many of the technological advances of those data lakes lived on in newer technologies
  4. The Advent of Object Storage
     • Object storage is a form of storage for unstructured data (objects) that eliminates the scaling limitations of traditional storage options
     • In other words, its capacity is effectively limitless
     • It is rooted in the Big Data initiatives of the early 2010s, especially HDFS
     • But it became popular once cloud providers adopted it
     • Nowadays, Amazon’s S3 (Simple Storage Service) and Azure’s ADLS (Azure Data Lake Storage) are the most popular
     • There are also alternatives from other vendors (Google, Oracle, IBM, etc.) and open source options (like MinIO)
  5. Object Storage is the Foundation of Cloud Data Systems
     • Modern cloud data systems, like cloud EDWs, data lakes and the “lakehouse”, have evolved based on the premise of separating processing from storage
     • Unlike in a traditional EDW, adding processing power does not require adding disk space
     • Object storage technologies provide the limitless storage these systems need, in a more cost-efficient way, and are adapted to the cloud
     • Open formats, like Parquet and Avro, specifically designed for interoperability in analytics, helped them grow and gain adoption
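To illustrate why a column-oriented open format such as Parquet suits analytics, the sketch below stores the same records row-wise and column-wise and sums a single field. It is a toy model in plain Python, not Parquet's actual on-disk format; the sample records, field names and the `values_touched` counter are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical sample records; the field names are illustrative only.
ROWS: List[Dict[str, object]] = [
    {"order_id": 1, "country": "ES", "amount": 120.0},
    {"order_id": 2, "country": "US", "amount": 80.0},
    {"order_id": 3, "country": "ES", "amount": 50.0},
]


def to_columns(rows: List[Dict[str, object]]) -> Dict[str, List[object]]:
    """Pivot row-oriented records into one list of values per column."""
    columns: Dict[str, List[object]] = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def sum_row_layout(rows: List[Dict[str, object]], field: str) -> Tuple[float, int]:
    """Sum one field; a row layout means every value of every row is read."""
    total, values_touched = 0.0, 0
    for row in rows:
        values_touched += len(row)
        total += float(row[field])
    return total, values_touched


def sum_column_layout(columns: Dict[str, List[object]], field: str) -> Tuple[float, int]:
    """Sum one field; a column layout reads only the requested column."""
    column = columns[field]
    return sum(float(value) for value in column), len(column)
```

Both functions return the same total, but the column layout touches only the values of the requested field, which is the property analytical engines exploit when scanning wide tables.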
  6. However, its versatility has made object storage useful beyond “just” storage for those systems. Let’s look at some examples.
  7. Common Usage Patterns for Modern Data Lakes
     • Cheap storage for backup, old or rarely used data
     • Ingest third-party data
     • Move non-critical workloads to cheaper systems
     • Data science playground
     • New life for legacy Hadoop efforts
     • And many others
  8. Can you work with object storage alone?
     • Object storage platforms provide limitless, cost-efficient storage space
     • However, they are still filesystems
     • Although some client applications can connect and use those files directly, as if they were in a local filesystem, processing data that way is not efficient
     • In addition, object storage platforms offer limited granularity in terms of security, and few options for governance
     • Incorporating object storage into your data strategy requires additional pieces
  9. What else do we need?
     1. To process data in the object storage efficiently, we need a modern MPP (massively parallel processing) engine that can work in parallel to process large data volumes
        • Most new-generation cloud data systems, like Snowflake, Databricks, Presto, Redshift, etc., follow that design
     2. But an MPP engine alone is not enough, as shown by the failures of previous incarnations of data lake projects!
     3. We need to bring additional options for data management:
        • Fine-grained security and access control
        • Documentation, classification and search capabilities to bring cataloguing and governance into the process
        • Data integration capabilities to ingest, massage, curate and expose information in the right format
     4. Additionally, we need to keep in mind that data in the object storage is just a portion of the data in the organization. All data should be managed consistently, regardless of location
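As a rough illustration of what "fine-grained security and access control" can mean on top of files in object storage, the sketch below applies a per-role policy that filters rows and hides columns before data reaches the consumer. The `Policy` class, role names and sample records are hypothetical and are not taken from any Denodo API; a real platform would enforce such rules inside its query engine.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set

Row = Dict[str, object]


@dataclass
class Policy:
    """Hypothetical per-role policy: visible columns plus a row filter."""
    visible_columns: Set[str]
    row_filter: Callable[[Row], bool]


def apply_policy(rows: List[Row], policy: Policy) -> List[Row]:
    """Drop rows the role may not see, then hide restricted columns."""
    return [
        {name: value for name, value in row.items() if name in policy.visible_columns}
        for row in rows
        if policy.row_filter(row)
    ]


# Illustrative data standing in for the content of one object in the lake.
CUSTOMERS: List[Row] = [
    {"customer": "A", "region": "EU", "card": "1111"},
    {"customer": "B", "region": "US", "card": "2222"},
]

# Illustrative roles: an analyst limited to EU rows without card numbers,
# and an auditor who sees everything.
POLICIES: Dict[str, Policy] = {
    "eu_analyst": Policy({"customer", "region"}, lambda row: row["region"] == "EU"),
    "auditor": Policy({"customer", "region", "card"}, lambda row: True),
}
```

Calling `apply_policy(CUSTOMERS, POLICIES["eu_analyst"])` returns only the EU record without its `card` field, a level of control that bucket-level or file-level permissions alone cannot express.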
  10. Adding an MPP engine to the Denodo Platform
      [Diagram labels: Logical Layer; Traditional DB & DW; Cloud; Excel; Lake filesystem (S3/ADLS); Lake Engine; MPP Engine]
  11. How does it work?
      • Easy, efficient MPP access to content in the object storage
      • No need for an additional external engine
      • Integrated security and management
      • Out-of-the-box MPP options for caching and query acceleration
      [Diagram labels: Logical Layer; MPP Coordinator; MPP worker (×4); Object Storage]
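The coordinator and worker roles named on this slide can be sketched as a scatter-gather aggregation: the coordinator assigns one object to each worker task, each worker computes a partial result, and the coordinator merges the partials. This is a minimal single-process sketch using threads, not how the Denodo MPP engine is implemented; the in-memory `OBJECT_STORE`, its keys and the function names are assumptions for illustration, and a real engine distributes work across separate nodes.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple

# In-memory stand-in for an object store: key -> rows of (country, amount).
OBJECT_STORE: Dict[str, List[Tuple[str, float]]] = {
    "sales/part-0": [("ES", 10.0), ("US", 5.0)],
    "sales/part-1": [("ES", 7.0), ("FR", 3.0)],
    "sales/part-2": [("US", 2.5)],
}


def worker_scan(key: str) -> Dict[str, float]:
    """Worker: scan one object and return a partial SUM(amount) per country."""
    partial: Dict[str, float] = {}
    for country, amount in OBJECT_STORE[key]:
        partial[country] = partial.get(country, 0.0) + amount
    return partial


def coordinator_query(prefix: str, workers: int = 4) -> Dict[str, float]:
    """Coordinator: fan out one task per object, then merge the partial sums."""
    keys = sorted(key for key in OBJECT_STORE if key.startswith(prefix))
    merged: Dict[str, float] = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(worker_scan, keys):
            for country, subtotal in partial.items():
                merged[country] = merged.get(country, 0.0) + subtotal
    return merged
```

Because each worker returns an already-aggregated partial result, the coordinator only merges a few small dictionaries instead of pulling every row, which is the main reason this pattern scales with the number of workers.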
  12. How does it work?
      • Automated deployment using Kubernetes and Helm charts
      • Integrated configuration
      • Graphical browsing and introspection of object storage
      [Screenshots: Object Storage configuration; Object Storage browsing]
  13. Putting in Context
      [Diagram labels: Denodo Virtualization Server; Denodo Data Catalog; Denodo Web Services; On-prem data; Other Apps; IdP; Denodo MPP; Warehouse A; Warehouse B; AWS S3 bucket; AWS Aurora]
  14. Move Non-Critical Workloads to Cheaper Systems
      • Separation of compute and storage means that the same data and queries can be run on other engines with minimal changes
      • Denodo includes the tools to move data and keep it updated when needed
      • A logical layer means that the change is transparent to consumers
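The idea that a logical layer makes a backend change transparent to consumers can be sketched as a registry that maps stable view names to whichever system currently serves the data. The `LogicalLayer` class, the view name `orders` and the two backend functions are hypothetical stand-ins, not Denodo constructs.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]
Backend = Callable[[], List[Row]]


class LogicalLayer:
    """Maps stable view names to whichever backend currently holds the data."""

    def __init__(self) -> None:
        self._views: Dict[str, Backend] = {}

    def bind(self, view: str, backend: Backend) -> None:
        """Point a view at a backend; rebinding does not affect consumers."""
        self._views[view] = backend

    def query(self, view: str) -> List[Row]:
        """Return the rows of a view from its current backend."""
        return self._views[view]()


def warehouse_orders() -> List[Row]:
    """Stand-in for the original, more expensive warehouse table."""
    return [{"order_id": 1, "amount": 20.0}, {"order_id": 2, "amount": 10.0}]


def lake_orders() -> List[Row]:
    """Stand-in for the same data after it was moved to object storage."""
    return [{"order_id": 1, "amount": 20.0}, {"order_id": 2, "amount": 10.0}]


def total_amount(layer: LogicalLayer) -> float:
    """Consumer code: it knows only the view name, never the backend."""
    return sum(float(row["amount"]) for row in layer.query("orders"))
```

A consumer such as `total_amount` keeps working unchanged after `layer.bind("orders", lake_orders)` replaces the warehouse binding, since it depends only on the view name.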
  15. Cheap storage for backup, old or rarely used data
      • Object storage is a great option for data that is rarely used but needs to be stored for backup or compliance reasons
      • This data can be exported to Parquet and moved to the object storage
      • Denodo can automatically map this data and make it accessible at no additional cost
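The export step described on this slide can be sketched as follows: rows older than a cutoff date are written out as one object per year and removed from the hot set. To stay within the Python standard library this sketch writes CSV rather than Parquet, and it uses a dictionary as a stand-in for a bucket; the key layout `archive/orders/year=YYYY/data.csv` and the column names are assumptions for illustration.

```python
import csv
import io
from datetime import date
from typing import Dict, List

Row = Dict[str, str]


def archive_cold_rows(rows: List[Row], cutoff: date, bucket: Dict[str, bytes]) -> List[Row]:
    """Write rows older than `cutoff` to per-year objects; return the rows kept hot."""
    hot: List[Row] = []
    cold_by_year: Dict[int, List[Row]] = {}
    for row in rows:
        day = date.fromisoformat(row["order_date"])
        if day < cutoff:
            cold_by_year.setdefault(day.year, []).append(row)
        else:
            hot.append(row)
    for year, cold in cold_by_year.items():
        buffer = io.StringIO()
        writer = csv.DictWriter(buffer, fieldnames=list(cold[0].keys()))
        writer.writeheader()
        writer.writerows(cold)
        # One object per year, so later queries can skip years they do not need.
        bucket[f"archive/orders/year={year}/data.csv"] = buffer.getvalue().encode("utf-8")
    return hot


# Illustrative input: two old orders and one recent order.
ORDERS: List[Row] = [
    {"order_id": "1", "order_date": "2019-03-01"},
    {"order_id": "2", "order_date": "2019-07-15"},
    {"order_id": "3", "order_date": "2022-01-10"},
]
```

Partitioning the archive by year is a design choice of the sketch: it keeps each object small and lets an engine read only the partitions a query filters on.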
  16. Ingest third-party data
      • An object storage that our partners can access is a great way to bring third-party data into the organization
      • Data can be in Parquet, but also in JSON, CSV or even Excel
      • Denodo can automatically map it
      • And provide the right tools to massage it and load it into the corporate systems on a periodic basis
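A minimal sketch of the ingestion pattern above: partners drop files in different formats under a shared prefix, and a loader normalizes them into one list of records. Only CSV and JSON are handled here because they are covered by the Python standard library; the bucket contents, key names and the choice to normalize every value to a string are assumptions for illustration.

```python
import csv
import io
import json
from typing import Dict, List

Record = Dict[str, str]

# In-memory stand-in for a bucket that partners can write to.
PARTNER_BUCKET: Dict[str, bytes] = {
    "partners/acme/prices.csv": b"sku,price\nA1,9.5\n",
    "partners/globex/prices.json": b'[{"sku": "B2", "price": 4}]',
}


def parse_object(key: str, payload: bytes) -> List[Record]:
    """Parse one CSV or JSON object into string-valued records."""
    text = payload.decode("utf-8")
    if key.endswith(".csv"):
        return [dict(row) for row in csv.DictReader(io.StringIO(text))]
    if key.endswith(".json"):
        return [{name: str(value) for name, value in item.items()} for item in json.loads(text)]
    raise ValueError(f"unsupported format for object: {key}")


def ingest(bucket: Dict[str, bytes], prefix: str) -> List[Record]:
    """Load every object under `prefix`, in key order, into one record list."""
    records: List[Record] = []
    for key in sorted(bucket):
        if key.startswith(prefix):
            records.extend(parse_object(key, bucket[key]))
    return records
```

Running `ingest(PARTNER_BUCKET, "partners/")` yields records with the same keys regardless of the source format, ready for a periodic load into a corporate system.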
  17. Data Science Playground
      • Denodo provides SQL access to any company data asset
      • This data can be easily moved into the object storage, where the MPP engine can efficiently process it for deeper analysis
      • Denodo offers native Python drivers and is compatible with popular data science toolkits (e.g. pandas) and tools (R, Dataiku, etc.)
      • Additionally, a data scientist may prefer to export content to a Parquet file and connect directly to that file from a different platform, like Databricks
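The SQL-to-toolkit workflow on this slide can be sketched with Python's generic DB-API 2.0 interface. The example uses the standard library's `sqlite3` purely as a stand-in connection; it does not show a Denodo driver, and the table name `customer_orders`, its columns and the helper names are assumptions for illustration.

```python
import sqlite3
from typing import Dict, List


def fetch_columns(connection: sqlite3.Connection, sql: str) -> Dict[str, List[object]]:
    """Run a query over a DB-API 2.0 connection and return column -> values."""
    cursor = connection.cursor()
    cursor.execute(sql)
    names = [description[0] for description in cursor.description]
    columns: Dict[str, List[object]] = {name: [] for name in names}
    for row in cursor.fetchall():
        for name, value in zip(names, row):
            columns[name].append(value)
    return columns


def demo_connection() -> sqlite3.Connection:
    """Build an in-memory database standing in for a SQL-accessible data asset."""
    connection = sqlite3.connect(":memory:")
    connection.execute("CREATE TABLE customer_orders (customer TEXT, amount REAL)")
    connection.executemany(
        "INSERT INTO customer_orders VALUES (?, ?)",
        [("A", 10.0), ("B", 4.5), ("A", 2.0)],
    )
    return connection


QUERY = (
    "SELECT customer, SUM(amount) AS total "
    "FROM customer_orders GROUP BY customer ORDER BY customer"
)
```

The dictionary of column lists returned by `fetch_columns(demo_connection(), QUERY)` has the shape that `pandas.DataFrame` accepts directly, so the same helper could feed a data frame when pandas is installed.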
  18. Conclusions
      1. Object storage technologies, especially in the cloud (S3, ADLS, etc.), offer an attractive and flexible way to store very large data volumes at low cost
      2. New-generation MPP engines provide efficient processing capabilities for data stored in an object storage, especially when formats like Parquet are used
      3. A logical layer, like Denodo, provides the additional security, governance and data integration capabilities needed to safely introduce an object-storage-based data lake into your data strategy
  19. Fireside Chat: Shaping the Role of a Data Lake in a Modern Data Fabric Architecture
      Pablo Alvarez, Global Director of Product Management, Denodo
      Alberto Pan, CTO, Denodo
  20. Q&A
  21. Next Steps
      • Access Denodo Platform in the Cloud. Start your free trial today! Get started today: www.denodo.com/free-trials
      • Logical Data Fabric: A Technical Whitepaper. Download whitepaper
  22. Thanks!
      www.denodo.com | info@denodo.com
      © Copyright Denodo Technologies. All rights reserved. Unless otherwise specified, no part of this PDF file may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying and microfilm, without the prior written authorization of Denodo Technologies.