End to End Streaming Architectures

Data is being generated at a feverish pace, and many businesses want all of it at their disposal to solve complex strategic problems. As decision making moves to real time, enterprises need data ready for analysis immediately. Sean Anderson and Amandeep Khurana discuss common pipeline trends in modern streaming architectures, the Hadoop components that enable streaming capabilities, and popular use cases that are enabling the world of IoT and real-time data science.

Published in: Software

  1. Beyond ETL: End to End Streaming Architectures (© Cloudera, Inc. All rights reserved.)
  2. Your Speakers: Amandeep Khurana, Solutions Architect; Sean Anderson, Product Marketing
  3. Agenda
     • Trends: I. Traditional Architectures; II. New Solutions
     • Primer on Streaming Systems: I. Kafka; II. Flume; III. Storm; IV. Spark Streaming; V. Flink
     • Typical Architectures: I. Bus centric; II. File system centric; III. Hybrid
  4. Poll: How many are familiar with the Hadoop ecosystem components?
     • Little/no familiarity
     • Starting out
     • Very familiar, not in production
     • Using in production
  5. Disruption in the Space
     • Explosion of Unstructured Data: 16 billion connected devices generating more data than ever. Data is driving modern businesses.
     • Data Warehouse Limitations: Popular data warehouse platforms were not designed to handle the scale of modern data.
     • Increased Processing Demands: Processing technology struggling to keep up with increased data and new real-time formats.
  6. Traditional Pipelines
  7. Challenges with a Traditional Solution: 1) Limited Data Archive
     [Pipeline diagram labels: Data Sources (Structured Data, Unstructured Data); Ingest; ETL/ELT Staging Environments; Process; Load; Enterprise Data Warehouse; Storage Archive; Model; Serve; Applications, BI System, Modeling, Reporting]
  8. Challenges with a Traditional Solution: 1) Limited Data Archive; 2) Poor Performance
     [Same pipeline diagram as slide 7, with the second challenge marked on it]
  9. Challenges with a Traditional Solution: 1) Limited Data Archive; 2) Poor Performance; 3) Operational Complexity
     [Same pipeline diagram as slide 7, with all three challenges marked on it]
  10. What is the impact to the business?
      • Limited Data Access: archived data is inaccessible; streaming data not captured; unstructured data not captured
      • Missed Processing SLAs: data ingest/transformations exceeding processing windows; new projects abandoned due to workload; decreased Engineering and Data Science productivity
      • Poor ROI: high Data Warehouse/RDBMS costs; data archived to save cost; increased performance drag on analytic systems
      • Operational Fragmentation: separate platforms for batch and stream processing; separate security, governance, and management; insufficient access for developers
  11. Data Ingestion and Processing at Cloudera
  12. Cloudera Enterprise, A New Way Forward
  13. Poll: Which streaming systems have you heard of?
      • Kafka
      • Storm
      • Spark Streaming
      • Samza
      • Flink
      • Kafka Streams
      • Flume
  14. Streaming Systems and Architectures
  15. Ingestion: The foundation of your data platform
      • Data can come from a variety of “siloed” sources: existing databases, sensor data, server logs, chat transcripts
      • Value of data is multiplied when combined and correlated with other data
      • “40% value improvement from combining data from multiple IoT sources” (McKinsey Global Institute)
  16. Apache Sqoop: SQL to Hadoop
      • Efficiently exchange data between database and Hadoop
        ▪ Bidirectional
        ▪ Import all or partial/new data
        ▪ Export for shared data access across systems
      • Easily get started with high performance connectors
        ▪ Free to use
        ▪ Optimized connectors for popular RDBMS, EDW, and NoSQL options
      [Platform diagram labels: Operations; Data Management; Unified Services; Process, Analyze, Serve; Store; Integrate]
  17. Streaming Systems in Hadoop
      • Tightly coupled ingestion: Flume
      • General purpose bus: Kafka
      • Processing: Spark Streaming, Storm, Flink, Samza
  18. Apache Flume: Log & Event Aggregation for Hadoop
      • Efficiently move large amounts of streaming/log data
      • Easily collect data from multiple systems (sources)
      • Built-in sources, sinks, and channels
      • Customize data flow to transform data on-the-fly
      • Reliable, scalable, and extensible for production
      • Manage and monitor with Cloudera Manager
      [Platform diagram labels: Operations; Data Management; Unified Services; Process, Analyze, Serve; Store; Integrate]
  19. Typical pipeline
  20. Apache Kafka: Pub-Sub Messaging for Hadoop
      • Backbone for real-time architectures
        ▪ Fast, flexible messaging for a wide range of use cases
        ▪ Scale to support more data sources and growing data volumes
        ▪ Zero data loss durability and always-on fault tolerance
        ▪ Built-in security and data protection
      • Seamless integration across the platform
        ▪ Connect to Flume, Spark Streaming, HBase, and more
        ▪ Manage and monitor with Cloudera Manager
      [Platform diagram labels: Operations; Data Management; Unified Services; Process, Analyze, Serve; Store; Integrate]
  21. Kafka is a Publish-Subscribe Messaging System
      • What is a pub-sub messaging system?
        ▪ Acts as a Broker between producers of data and consumers of data
        ▪ Producers don’t worry about who will consume, or about making sure consumers get the data
        ▪ Consumers consume from the Broker and don’t talk to the producers
        ▪ Broker makes sure data is delivered fast and reliably
      • Decouple Data Pipelines
      [Diagram: many Producers connected to many Consumers, first directly and then through a single Pub-Sub Broker]
  22. The Basics
      • Messages are organized into topics
      • Topics are broken into partitions
      • Partitions are replicated across the brokers as replicas
      • Kafka runs in a cluster. Nodes are called brokers
      • Producers push messages
      • Consumers pull messages
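The basics on this slide can be illustrated with a small in-memory sketch. It is not the Kafka client API: `ToyTopic`, `append`, and `read` are hypothetical names invented here, and replication and brokers are left out. It only shows how a topic is split into append-only partitions, how a producer's push yields an offset, and how a consumer pulls from an offset it chooses.

```python
class ToyTopic:
    """In-memory stand-in for a topic: a fixed set of append-only partitions."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, partition, message):
        """Producer side: push a message and return its offset in that partition."""
        log = self.partitions[partition]
        log.append(message)
        return len(log) - 1

    def read(self, partition, offset, max_messages=10):
        """Consumer side: pull up to max_messages starting at the given offset."""
        return self.partitions[partition][offset:offset + max_messages]


topic = ToyTopic(num_partitions=2)
first = topic.append(0, "a")    # first message in partition 0
second = topic.append(0, "b")   # second message in partition 0
other = topic.append(1, "c")    # offsets are tracked per partition
pulled = topic.read(0, offset=1)
```

Because the consumer supplies the offset, it can re-read older messages or resume where it left off without the producer being involved.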
  23. Beyond Basics…
      • Replicas
        ▪ A partition has 1 leader replica. The others are followers.
        ▪ Followers are considered in-sync when the replica is alive and the replica is not “too far” behind the leader (configurable)
        ▪ The group of in-sync replicas for a partition is called the ISR (In-Sync Replicas)
        ▪ Replicas map to physical locations on a broker
      • Messages
        ▪ Can optionally be keyed in order to map to a static partition
        ▪ Use keys if ordering within a partition is needed; avoid them otherwise (extra complexity, skew, etc.)
        ▪ Location of a message is denoted by its topic, partition, and offset
        ▪ A partition’s offset increases as messages are appended
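A minimal sketch of two ideas from this slide, under stated assumptions. `partition_for` maps a key to a static partition with `zlib.crc32`, which is only a stand-in for whatever hash a real client uses. `in_sync_followers` applies the slide's two conditions (alive, and not too far behind) using a hypothetical offset-based `max_lag` threshold. Both function names are invented for illustration.

```python
import zlib


def partition_for(key: bytes, num_partitions: int) -> int:
    """Map a key to a partition. crc32 is stable across runs, unlike hash()."""
    return zlib.crc32(key) % num_partitions


def in_sync_followers(leader_offset, followers, max_lag):
    """Return names of followers that are alive and within max_lag of the leader.

    followers maps a replica name to an (alive, offset) pair.
    """
    return sorted(
        name
        for name, (alive, offset) in followers.items()
        if alive and leader_offset - offset <= max_lag
    )


# The same key always lands on the same partition, which preserves per-key order.
same_key = [partition_for(b"user-42", 6) for _ in range(3)]

followers = {"b2": (True, 98), "b3": (True, 40), "b4": (False, 100)}
isr = in_sync_followers(leader_offset=100, followers=followers, max_lag=10)
```

The skew warning on the slide follows from the first function: if a few keys dominate the traffic, their partitions receive most of the messages.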
  24. Beyond Basics…
      • Brokers
        ▪ Heavily rely on Linux PageCache
        ▪ The I/O scheduler will batch together consecutive small writes into bigger physical writes, which improves throughput
        ▪ The I/O scheduler will attempt to re-sequence writes to minimize movement of the disk head, which improves throughput
        ▪ It automatically uses all the free memory on the machine
      • Clients
        ▪ Batch messages: reduce network overhead; allow efficient compression
        ▪ Load balance across the cluster via partitions: they talk to multiple nodes
        ▪ Utilize zero copy I/O using sendfile
  25. Kafka Cardinality: What is large?
      • Brokers: 3 to 15 per cluster
        ▪ Common to start with 3-5
        ▪ Largest clusters ~30-40 nodes
        ▪ Having many clusters is common
      • Topics: 1 to 100s per cluster
      • Partitions: 1 to 1000s per topic
        ▪ Clusters with up to 10k total partitions are workable. Beyond that we don't aggressively test. [src]
      • Consumer Groups: 1 to 100s active per cluster
        ▪ Could consume 1 to all topics
  26. Large Messages
      • Kafka is not designed for very large messages
      • Optimal performance ~10KB
      • Could consider breaking up the messages/files into smaller chunks
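The chunking suggestion on this slide could look roughly like the sketch below. The chunk envelope (`index`, `total`, `data`) and both function names are assumptions made for illustration. The slide does not prescribe a format, and a real pipeline would also need a message identifier to group chunks belonging to the same payload.

```python
def split_into_chunks(payload: bytes, chunk_size: int = 10 * 1024):
    """Split a payload into chunks of at most chunk_size bytes (~10KB default)."""
    total = max(1, -(-len(payload) // chunk_size))  # ceiling division
    return [
        {
            "index": i,
            "total": total,
            "data": payload[i * chunk_size:(i + 1) * chunk_size],
        }
        for i in range(total)
    ]


def reassemble(chunks):
    """Rebuild the payload from chunks that may arrive in any order."""
    ordered = sorted(chunks, key=lambda c: c["index"])
    if len(ordered) != ordered[0]["total"]:
        raise ValueError("missing chunks")
    return b"".join(c["data"] for c in ordered)


payload = bytes(25 * 1024)           # 25 KB of zero bytes
chunks = split_into_chunks(payload)  # two full chunks plus one half chunk
restored = reassemble(list(reversed(chunks)))
```

Sending all chunks of one payload with the same key would keep them in one partition and therefore in order, at the cost of the skew mentioned on slide 23.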
  27. Typical pipeline
  28. Kafka + Apache Flume
      • Kafka can be configured as a fast, reliable Flume Channel
      • Flume Sources and Sinks can be used as out-of-the-box Kafka Producers and Consumers
      • Flume Sources write to Kafka: read from logs, files, JMS, HTTP, RPC, Thrift, etc., and write events to Kafka
      • Flume Sinks consume from Kafka: write data to HDFS, HBase, or Search
  29. Data Processing: Leverage the right processing for your job
      • Data may require unique processing characteristics: batch, streaming, real-time
      • Hadoop arose to address one and now the ecosystem is answering the rest
      • “We’re doubling down on Spark. We invested earliest, and we’ve invested most, in making Hadoop enterprise-grade” (Doug Cutting)
  30. Processing System Latencies
      [Latency-scale figure. Thresholds shown: ~50 ms, >500 ms, >30,000 ms, >90,000 ms. Systems shown: Custom, Samza/Storm, Flume Interceptors, Flink, Trident, Spark Streaming, Spark, Impala, Hive, MR. Band label: Near Real-Time Processing]
  31. Storm 101
      • Open source project
      • Fundamental abstraction: Streams, consisting of tuples
      • Deployment: Nimbus (master) and Supervisors (workers)
      • ZooKeeper (ZK) for membership and state
      • Storm processes are stateless
      • Applications defined by topologies
      • Topologies consist of Spouts and Bolts
        ▪ Spout: source of a stream
        ▪ Bolt: consumer of a stream from Spouts. Outputs a stream
      • A topology runs until you terminate it
      • When nodes fail, Storm restarts them. You can set parallelism
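The spout and bolt roles described here can be mimicked with plain Python generators. This is only a conceptual sketch and not the Storm API: the function names are invented, everything runs in one process, and the stream is finite, whereas a real topology runs until it is terminated.

```python
def sentence_spout(sentences):
    """Spout: the source of a stream, emitting one tuple per sentence."""
    for sentence in sentences:
        yield (sentence,)


def split_bolt(stream):
    """Bolt: consumes sentence tuples and outputs a stream of (word,) tuples."""
    for (sentence,) in stream:
        for word in sentence.split():
            yield (word,)


def count_bolt(stream):
    """Bolt: keeps a running count per word and emits (word, count) tuples."""
    counts = {}
    for (word,) in stream:
        counts[word] = counts.get(word, 0) + 1
        yield (word, counts[word])


# Wiring the pieces together plays the role of the topology definition.
topology = count_bolt(split_bolt(sentence_spout(["to be", "or not to be"])))
results = list(topology)
```

Each bolt works one tuple at a time, which matches the tuple-level processing model the next slide describes.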
  32. Storm 101
      • Work happens in a bolt at a tuple level
      • 3 levels of guarantees
        ▪ At-least-once
        ▪ At-most-once
        ▪ Exactly-once (most expensive; needs Trident)
      • New project @ Twitter: Heron
  33. Typical pipeline
  34. Flink 101
      • Fundamentally a stream processing system
      • Core abstraction: DataStreams
      • Consume events from any streaming source
      • Transformation operators: Map, FlatMap, Filter, Reduce, Fold, Window, etc.
      • Fault tolerance based on Chandy-Lamport distributed snapshots
      • Ensures exactly-once semantics
      • Internally optimizes by combining consecutive transformations where possible
  35. Spark Streaming
      • What is it?
        ▪ Run continuous processing of data using Spark’s core API
        ▪ Extends Spark concepts to fault-tolerant, transformable streams
        ▪ Adds “rolling window” operations
        ▪ Example: compute rolling averages or counts for data over the last five minutes
      • Benefits
        ▪ Reuse knowledge and code in both contexts
        ▪ Same programming paradigm for streaming and batch
        ▪ Simplicity of development
        ▪ High-level API with automatic DAG generation
        ▪ Excellent throughput
        ▪ Scale easily to support large volumes of data ingest
      • Common Use Cases
        ▪ “On-the-fly” ETL as data is ingested into Hadoop/HDFS
        ▪ Detect anomalous behavior and trigger alerts
        ▪ Continuous reporting of summary metrics for incoming data
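The rolling-window example from this slide (averages or counts over the last five minutes) can be sketched in plain Python. This is not the Spark Streaming API, which works on micro-batches. The `RollingWindow` class is a hypothetical per-event illustration, and it assumes timestamps arrive in non-decreasing order, in seconds.

```python
from collections import deque


class RollingWindow:
    """Tracks the count and average of values seen in the last window_seconds."""

    def __init__(self, window_seconds=300):
        self.window_seconds = window_seconds
        self.events = deque()  # (timestamp, value) pairs, oldest first
        self.total = 0.0

    def add(self, timestamp, value):
        """Record an event, then evict events that fell out of the window."""
        self.events.append((timestamp, value))
        self.total += value
        cutoff = timestamp - self.window_seconds
        while self.events and self.events[0][0] <= cutoff:
            _, old_value = self.events.popleft()
            self.total -= old_value

    def count(self):
        return len(self.events)

    def average(self):
        return self.total / len(self.events) if self.events else None


window = RollingWindow(window_seconds=300)
window.add(0, 10.0)
window.add(100, 20.0)
window.add(350, 30.0)  # the event at t=0 is now older than five minutes
```

Keeping a running total means each new event costs only the evictions it triggers, rather than a rescan of the whole window.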
  36. DStreams
  37. Typical pipeline
  38. Key difference in approach
      • Spark is a batch processing system that can approximate stream processing.
      • Flink is a stream processing system that can look like a batch processor.
  39. The Spark Ecosystem & Hadoop
      [Platform diagram labels]
      • Batch & Stream: Spark (Spark Streaming, Spark SQL, DataFrames, MLlib, …)
      • SQL: Impala; Search: Solr; SDK: Kite
      • Unified Services: Resource Management (YARN); Security (Sentry, RecordService)
      • Store: Filesystem (HDFS); Relational (Kudu); NoSQL (HBase)
      • Integrate: Structured (Sqoop); Unstructured (Kafka, Flume)
  40. Architectural Patterns
  41. One source, one destination
  42. Multiple sources, one destination
  43. Multiple sources, multiple destinations
  44. Hypothetical Anomaly Detection System
      • Definition of anomalous activity:
        ▪ Amount > previous max amount (per-event decision)
        ▪ Location is different from what the mobile device suggests (per-event decision)
        ▪ >2 transactions in the last 10 seconds (window-based decision)
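The three rules above can be sketched as a single per-account check. The slide gives only the rule definitions, so everything else is an assumption: the `AccountState` and `detect` names, the string reason codes, counting the current transaction toward the window, and not flagging the very first transaction for exceeding a maximum that does not exist yet. Timestamps are seconds in non-decreasing order.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Optional

WINDOW_SECONDS = 10       # window length from the slide
MAX_TXNS_IN_WINDOW = 2    # more than this many in the window is anomalous


@dataclass
class AccountState:
    max_amount: Optional[float] = None           # largest amount seen so far
    recent: deque = field(default_factory=deque) # timestamps inside the window


def detect(state, amount, txn_location, device_location, timestamp):
    """Return the list of rules this transaction violates, then update state."""
    reasons = []
    # Per-event rule: amount exceeds the previous maximum.
    if state.max_amount is not None and amount > state.max_amount:
        reasons.append("amount_exceeds_previous_max")
    # Per-event rule: transaction location differs from the device location.
    if txn_location != device_location:
        reasons.append("location_mismatch")
    # Window rule: count transactions in the last WINDOW_SECONDS.
    state.recent.append(timestamp)
    while state.recent[0] <= timestamp - WINDOW_SECONDS:
        state.recent.popleft()
    if len(state.recent) > MAX_TXNS_IN_WINDOW:
        reasons.append("too_many_transactions")
    if state.max_amount is None or amount > state.max_amount:
        state.max_amount = amount
    return reasons


state = AccountState()
r1 = detect(state, 50.0, "NYC", "NYC", timestamp=0)
r2 = detect(state, 20.0, "NYC", "NYC", timestamp=3)
r3 = detect(state, 80.0, "NYC", "SFO", timestamp=5)
r4 = detect(state, 10.0, "NYC", "NYC", timestamp=20)
```

The first two rules need only the current event plus a small piece of per-account state, while the third needs a window, which is why the slide separates per-event from window-based decisions.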
  45. Architecture
  46. Architectural patterns
      • Bus centric: Kafka is front and center
      • Data hub centric: HDFS is front and center
      • Hybrid: Best tool for the job
  47. Real world: Hybrid
      [Architecture diagram labels]
      • Data Sources: EDW, OLTP DB, Custom Apps
      • Ingestion: Kafka (stream ingestion via Pub-Sub); Sqoop (RDBMS integration)
      • Cloudera Enterprise Data Hub: Ingestion, Preparation, Analytics
      • Spark: stream processing, iterative processing, machine learning
      • MapReduce: deep, batch processing
      • Impala: fast analytics; HBase: NoSQL; Cloudera Search: real-time search; Analytical Tools
      • Unified Cluster Management Suite: Cluster Management (Cloudera Manager); Security (Sentry, Record Service); Metadata & Governance (Cloudera Navigator)
      • Flexible Deployment Options: On-Premise; Cloud (Cloudera Director)
  48. Cloudera makes Data Processing Fast, Easy, & Secure. The leading big data platform from the leaders in enterprise Hadoop.
      • Fast: Leadership in Kafka and Spark to help turn processing windows from hours to minutes.
      • Easy: Deliver optimum system utilization and meet SLA commitments, on-premises or in the cloud, with minimum effort.
      • Secure: End-to-end Security, Governance, and Data Management
  49. Getting Started is Easy
      • 1: Visit our Data Engineering webpage
      • 2: Sign up for Spark Training
      • 3: Contact us to start a POC
  50. Thank you
