
From Insights to Value - Building a Modern Logical Data Lake to Drive User Adoption and Business Value



Businesses often have to interact with different data sources to get a unified view of the business or to resolve discrepancies. These EDW data repositories are often large, complex, and business-critical, and cannot afford downtime. This session shares best practices and lessons learned for building a data fabric on Spark / Hadoop / Hive / NoSQL that provides a unified view, simplifies access to the data repositories, resolves technical challenges, and adds business value.



1. From Insights to Value: Building a Modern Logical Data Lake to Drive User Adoption and Business Value. Vineet Tyagi / Impetus
2. We Make Big Data Work. We have been supporting several Fortune 500 customers on their big data journey for the last 10 years. Across the board we have seen: • Fast-changing analytic and reporting requirements • Lack of end-user self-service capabilities • Need for better collaboration and agility in working with trusted data • Data in silos, making it difficult to get closer to customers, while data from traditional sources still remains important.
3. © 2017 Impetus Technologies – Confidential. "Enterprises today are realizing about 15% of potential ROI on BI investments."
4. "Fragmented, purpose-driven Hadoop data lakes are creating integration challenges."
5. The new normal for enterprise IT: EDW + BDW (lakes) == unified enterprise data.
6. Challenge 1: Providing a complete, seamless view of business data. Making insights and data in the lake readily discoverable, accessible, and usable. "Visual data discovery, an important enabler of end-user self-service."
7. Challenge 2: Simplified and self-serve enablement of BI. Stages: provision cluster → discover and blend new sources → data access and exploration → ingest and transform data → security and governance → BI, analytics, and models.
8. Challenge 3: Support use-case-driven data access mechanisms. Specific query and reporting → SQL; cross-dimensional fast slice, dice, and drill-down → OLAP; data from MPP, relational, and Hadoop sources → data virtualization; finding the "needle in a haystack" → search; "don't know what you don't know" → self-service data discovery.
9. Challenge 4: Leverage EDW and BDW coexistence. Optimize the placement of enterprise workloads and the data on which they operate; the multi-platform environment is the warehouse. Freeing capacity on high-end analytic and data warehouse systems yields immediate ROI on Hadoop and provides a platform better suited to advanced analytics.
10. Challenge 5: Collaboration and reuse of data and analytical assets on a logical data lake. Data democratization: • Lowering adoption barriers for your stakeholders • Getting the data they want should be fast and easy. Analytical democratization: • Publishing and discovery of analytical assets • Ability to reuse them in a simple and consistent way.
11. Let's Build the Logical Lake
12. Logical Data Lake: Modern Analytical Data Fabric. [Architecture diagram: landing and ingestion of structured, unstructured, external, social, machine, geospatial, time-series, and streaming data into the enterprise data lake; data federation/virtualization spanning the lake and traditional data repositories (RDBMS, MPP); exploration and discovery, data wrangling, and real-time applications on top; enterprise metadata management and accelerators; provisioning, workflow, monitoring, and governance underneath.]
13. Providing a complete, seamless view of business data: • Simple, consistent view of meta-information • Automated sourcing and seeding • Social- and usage-based enrichment • Not ONLY a data catalogue, but an analytical asset catalogue • Leverage and supplement existing business ontologies.
14. Leverage EDW and BDW coexistence. Optimize the placement of enterprise workloads and the data on which they operate. "Right positioning" of workloads based on price/performance • Most bang for the buck. Build a platform better suited to advanced analytics with big data technologies • Retaining what works "in situ."
15. Technical Perspective and Choices
16. Architectural Patterns. Streaming: Pattern 1 — streaming ingestion; Pattern 2 — near-real-time event processing with external context; Pattern 3 — near-real-time partitioned event processing with external context; Pattern 4 — complex topology for aggregations or machine learning. Batch + streaming: Pattern 5 — the Lambda architecture (Hadoop and Storm); Pattern 6 — merging batch and streaming with Kappa, a post-Lambda architecture; Pattern 7 — unified batch and stream processing (Flink or Spark).
17. Pattern 1: Streaming Ingestion. Use-case scenarios: 1. Efficiently collecting, aggregating, and moving large amounts of streaming data into a Hadoop cluster. 2. Emphasis on low-latency persisting of events to HDFS, Apache HBase, or Apache Solr.
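The ingestion pattern above can be sketched as a small buffering agent: events are collected, micro-batched, and flushed to a sink in groups, trading tiny batches for low-latency persistence. This is a minimal illustration, not any specific Flume/HBase API; `BatchingIngester` and the list-backed sink are hypothetical stand-ins.

```python
# Hypothetical sketch of Pattern 1: buffer incoming events and flush them to a
# sink (HDFS/HBase/Solr in the slide; a plain list here) in small batches.

class BatchingIngester:
    def __init__(self, sink, batch_size=3):
        self.sink = sink            # stand-in for an HDFS/HBase/Solr writer
        self.batch_size = batch_size
        self.buffer = []

    def ingest(self, event):
        """Accept one event; flush when the micro-batch is full."""
        self.buffer.append(event)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        """Persist the current batch with a single sink write."""
        if self.buffer:
            self.sink.append(list(self.buffer))
            self.buffer.clear()

sink = []
ingester = BatchingIngester(sink, batch_size=2)
for event in ["e1", "e2", "e3"]:
    ingester.ingest(event)
ingester.flush()  # drain the partial final batch
# sink now holds [["e1", "e2"], ["e3"]]
```

Real agents such as Flume add durability and back-pressure on top of this core loop; the batch size is the knob that trades latency against per-write overhead.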
18. Pattern 2: Near-Real-Time Event Processing. Use-case scenarios: alerting, flagging, transforming, and filtering events as they arrive; taking immediate decisions to transform the data or trigger some external action. The decision often depends on an external profile or metadata, and the user code can interact with local memory, a distributed cache, or an external storage system such as HBase.
19. Pattern 3: Near-Real-Time Processing with Partitioned External Context. Use-case scenarios: the external context information required for event processing doesn't fit in local memory, and calling an external system such as HBase does not meet the SLA requirements.
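The usual answer, sketched below under illustrative names, is to hash-partition events by key so that each worker only caches its own shard of the context, keeping lookups local even when the full table would not fit in one process's memory.

```python
# Hypothetical sketch of Pattern 3: route events by key so the context lookup
# always lands on the worker that holds that key's context shard.

import zlib

NUM_PARTITIONS = 4

def partition_for(key):
    """Deterministically route a key; the same key always hits the same worker."""
    return zlib.crc32(key.encode()) % NUM_PARTITIONS

# Each worker holds only the context shard for the keys routed to it.
context_shards = [dict() for _ in range(NUM_PARTITIONS)]

def load_context(key, value):
    context_shards[partition_for(key)][key] = value

def enrich(event_key):
    """An event is processed on its key's partition, so the lookup is local."""
    shard = context_shards[partition_for(event_key)]
    return shard.get(event_key, "unknown")

load_context("device-42", "factory-A")
```

Stream processors expose this as keyed/partitioned state (e.g., a group-by-key before the stateful operator); the sketch just makes the routing explicit.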
20. Pattern 4: Complex Topology for Aggregations or ML. Use-case scenarios: • Real-time data flowing through a complex and flexible set of operations • Complex operations such as counts, averages, and sessionization • Results that often depend on windowed computations or require more active data • Focus shifting from ultra-low latency to functionality and accuracy • Machine-learning model building that operates on batches of data.
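The windowed computations mentioned above can be illustrated with a tumbling-window count/average, the simplest of the aggregations the slide lists; the 60-second window and `tumbling_window_avg` name are assumptions for the sketch.

```python
# Hypothetical sketch of Pattern 4: counts and averages per tumbling
# 60-second window over (timestamp, value) events.

from collections import defaultdict

WINDOW_SECONDS = 60

def tumbling_window_avg(events):
    """events: iterable of (timestamp_seconds, value) pairs.
    Returns {window_start: (count, average)} for each tumbling window."""
    acc = defaultdict(lambda: [0, 0.0])  # window_start -> [count, sum]
    for ts, value in events:
        window = (ts // WINDOW_SECONDS) * WINDOW_SECONDS
        acc[window][0] += 1
        acc[window][1] += value
    return {w: (count, total / count) for w, (count, total) in acc.items()}

stats = tumbling_window_avg([(10, 4.0), (50, 6.0), (70, 9.0)])
# windows: [0, 60) holds two events, [60, 120) holds one
```

Sessionization and sliding windows follow the same shape with a different window-assignment rule; engines like Spark and Flink provide these operators natively.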
21. Pattern 5: Lambda Architecture. Batch layer (Hadoop, Solr cluster): 1. Stores the master data set 2. Computes batch views. Serving layer (HBase, Solr, etc.): 1. Random access to batch views 2. Bulk updates from the batch layer 3. No random writes. Speed layer (Storm, Flume, etc.): 1. Compensates for recently updated data 2. Does fast incremental computations on newly arrived data 3. Is eventually overwritten by the batch layer 4. Provides random reads and random writes.
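The query path across these layers can be sketched in a few lines: the serving layer answers from the batch view, and the speed layer compensates for events the batch layer has not processed yet. The page-count views below are illustrative data, not from the deck.

```python
# Hypothetical sketch of the Lambda query path: stitch the batch view and the
# speed (real-time) view together to produce a complete answer.

batch_view = {"page_a": 100, "page_b": 40}   # computed by the batch layer
speed_view = {"page_a": 3, "page_c": 1}      # increments since the last batch run

def query(page):
    """Serve a complete count: precomputed batch view + real-time compensation."""
    return batch_view.get(page, 0) + speed_view.get(page, 0)
```

When the next batch run completes, its output absorbs the speed layer's increments and the speed view for that period is discarded, which is what "the batch layer eventually overwrites the speed layer" means.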
22. Pattern 6: Kappa Architecture. Problems with the Lambda architecture: • You implement your transformation logic twice, once in the batch system and once in the stream-processing system, and the two need to stay in sync to give the right result. • You must stitch together results from both the batch views and the real-time views to produce a complete answer. Kappa switches to a canonical data store that is an append-only, immutable log, instead of a relational database like SQL or a key-value store like Cassandra. The serving layer uses the data streamed through the computational system and stored in auxiliary stores for serving.
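The Kappa idea can be shown in miniature: with an append-only log as the canonical store, any view is rebuilt by replaying the log through the same processing code, so the transformation logic exists only once. The running-totals view below is an illustrative example.

```python
# Hypothetical sketch of Kappa: one immutable log, one copy of the
# transformation logic, and views rebuilt by replay.

log = []  # the append-only, immutable canonical store

def append(event):
    log.append(event)  # events are only ever appended, never updated

def build_view(events):
    """The single copy of the transformation logic: running totals per user."""
    view = {}
    for e in events:
        view[e["user"]] = view.get(e["user"], 0) + e["amount"]
    return view

append({"user": "u1", "amount": 10})
append({"user": "u1", "amount": 5})
append({"user": "u2", "amount": 7})

serving_view = build_view(log)  # materialized into an auxiliary store for serving
reprocessed = build_view(log)   # replaying the log reproduces the same view
```

Reprocessing after a logic change is just another replay into a fresh auxiliary store, which replaces the Lambda architecture's dual batch/stream code paths.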
23. Pattern 7: Unified Batch and Streaming Framework. 1. Flink: + Distributed stream and batch data processing + Distributed computations over data streams + High throughput and low latency + Exactly-once guarantee + Batch processing applications as special cases of stream processing. 2. Spark: + Fast + Low latency.
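The "batch as a special case of streaming" idea can be illustrated with one processing function that works identically on a bounded collection and an event-at-a-time source; `running_max` is an illustrative operator, not an API of either framework.

```python
# Hypothetical sketch of Pattern 7's core idea: a single incremental operator
# serves both a finite batch and an unbounded-style stream.

def running_max(stream):
    """Emit the running maximum; works on any iterable, bounded or not."""
    current = float("-inf")
    for value in stream:
        current = max(current, value)
        yield current

# Batch path: a bounded list of records.
batch_result = list(running_max([3, 1, 4, 1, 5]))

# Streaming path: an event-at-a-time generator, consumed with the same code.
def event_source():
    yield from (2, 7, 2)

stream_result = list(running_max(event_source()))
```

In Flink this unification is explicit in the runtime; in Spark, Structured Streaming expresses streams with the same DataFrame operations used for batch.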
24. Time-range-based technology choices. [Chart residue: latency bands of ~50 ms, >500 ms, >30,000 ms, and >90,000 ms, mapping candidate technologies — Flume interceptors, Storm/Trident, Spark Streaming, Spark, Impala, Tez, custom code, and MapReduce — to the latency range each can serve.]
25. Logical Data Lake: Modern Analytical Data Fabric (recap). [Same architecture diagram as slide 12, now annotated with the capabilities discussed: data blending, metadata and discovery, and workload migration.]
26. Thank you. Questions?
