
Using the SDACK Architecture to Build a Big Data Product


You have probably heard of the SMACK architecture, which stands for Spark, Mesos, Akka, Cassandra, and Kafka. It is especially suitable for building a lambda-architecture system. But what is SDACK? It is very similar to SMACK, except that the “D” stands for Docker. While SMACK is an enterprise-scale, multi-tenant solution, the SDACK architecture is particularly suitable for building a data product. In this talk, I'll discuss the advantages of the SDACK architecture and how Trend Micro uses it to build an anomaly detection data product. The talk will cover:
1) The architecture we designed based on SDACK to support both batch and streaming workloads.
2) The data pipeline built on Akka Streams, which is flexible, scalable, and self-healing.
3) The Cassandra data model designed to support time series data writes and reads.


Using the SDACK Architecture to Build a Big Data Product

  1. Yu-hsin Yeh (Evans Ye), Apache Big Data NA 2016, Vancouver: Using the SDACK Architecture to Build a Big Data Product
  2. Outline • A Threat Analytic Big Data product • The SDACK Architecture • Akka Streams and data pipelines • Cassandra data model for time series data
  3. Who am I • Apache Bigtop PMC member • Software Engineer @ Trend Micro • Develop big data apps & infra
  4. A Threat Analytic Big Data Product
  5. Target Scenario: InfoSec Team, Security information and event management (SIEM)
  6. Problems • Too many logs to investigate • Lack of actionable, prioritized recommendations
  7. Threat Analytic System (diagram): Splunk, ArcSight, Windows event, Web Proxy, DNS, AD, … → Anomaly Detection, Expert Rule Filtering, Prioritization → Notification
  8. Deployment
  9. On-Premises (diagram): InfoSec Team, Security information and event, Anomaly Detection, Expert Rule Filtering, Prioritization
  10. On-Premises (diagram): InfoSec Team, Anomaly Detection, Expert Rule Filtering, Prioritization, AD, Windows Event Log, DNS, …
  11. On the cloud? (diagram): InfoSec Team, AD, Windows Event Log, DNS, …
  12. There's too much PII data in the logs
  13. System Requirements • Identify anomalous activities based on data • Look up extra info such as LDAP, Active Directory, hostname, etc. • Correlate with other data • EX: C&C connection followed by an anomalous login • Support both streaming and batch analytic workloads • Run streaming detections to identify anomalies in a timely manner • Run batch detections with arbitrary start and end times • Scalable • Stable
  14. No problem. I have a solution. How exactly are we going to build this product? PM RD
  15. SMACK Stack: toolbox for a wide variety of data processing scenarios
  16. SMACK Stack • Spark: fast and general purpose engine for large-scale data processing scenarios • Mesos: cluster resource management system • Akka: toolkit and runtime for building highly concurrent, distributed, and resilient message-driven applications • Cassandra: distributed, highly available database designed to handle large amounts of data across datacenters • Kafka: high-throughput, low-latency distributed pub-sub messaging system for real-time data feeds. Source: http://www.slideshare.net/akirillov/data-processing-platforms-architectures-with-spark-mesos-akka-cassandra-and-kafka
  17. Medium-sized Enterprises (diagram): InfoSec Team, AD, Windows Event Log, DNS, …
  18. No problem. I have a solution. Can't we simplify it? PM RD
  19. SDACK Stack
  20. SDACK Stack • Spark: fast and general purpose engine for large-scale data processing scenarios • Docker: deployment and resource management • Akka: toolkit and runtime for building highly concurrent, distributed, and resilient message-driven applications • Cassandra: distributed, highly available database designed to handle large amounts of data across datacenters • Kafka: high-throughput, low-latency distributed pub-sub messaging system for real-time data feeds. Source: http://www.slideshare.net/akirillov/data-processing-platforms-architectures-with-spark-mesos-akka-cassandra-and-kafka
  21. Docker - How it works • Three key techniques in Docker • Use Linux Kernel's Namespaces to create isolated resources • Use Linux Kernel's cgroups to constrain resource usage • Union file system that supports git-like image creation
  22. The nice things about it • Lightweight • Fast creation • Efficient to ship • Easy to deploy • Portability • Runs on *any* Linux environment
  23. Medium-sized Enterprises (diagram): InfoSec Team, AD, Windows Event Log, DNS, …
  24. Large Enterprises (diagram): InfoSec Team, AD, Windows Event Log, DNS, …
  25. Fortune 500 (diagram): InfoSec Team, AD, Windows Event Log, DNS, …
  26. Threat Analytic System Architecture
  27. Architecture diagram: Log, Actors, DB, Web Portal, Detection Result, API Server
  28. API Server, Web Portal
  29. Docker Compose (docker-compose.yml): kafka: build: . ports: - "9092:9092" spark: image: spark ports: - "8080:8080" … Web Portal, API Server
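The compose snippet on this slide can be laid out as a file. This is a hypothetical reconstruction: only the kafka and spark services with these ports appear on the slide; the cassandra service and its port are assumptions added to match the rest of the stack.

```yaml
# Hypothetical docker-compose.yml expanded from the slide's snippet.
# The cassandra service below is an assumption, not shown on the slide.
version: "2"
services:
  kafka:
    build: .          # build Kafka image from the local Dockerfile
    ports:
      - "9092:9092"   # Kafka broker
  spark:
    image: spark
    ports:
      - "8080:8080"   # Spark master UI
  cassandra:
    image: cassandra
    ports:
      - "9042:9042"   # CQL native transport
```

With a file like this, `docker-compose up -d` brings the whole stack up on one host, which is the simplification over Mesos that the "D" in SDACK buys.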
  30. Akka Streams and data pipelines
  31. receiver
  32. receiver → buffer
  33. receiver → buffer → transformation, lookup ext info
  34. receiver → buffer → transformation, lookup ext info → batch / streaming
  35. Why not Spark Streaming? • Each streaming job occupies at least 1 core • There are so many types of logs • AD, Windows Event, DNS, Proxy, Web Application, etc. • Kafka offsets are lost when code changes • No back-pressure (back then) • Estimated back-pressure available in Spark 1.5 • Integration with Akka-based info lookup services
  36. Akka • High performance concurrency framework for Java and Scala • Actor model for message-driven processing • Asynchronous by design to achieve high throughput • Each message is handled in a single-threaded context (no locks or synchronization needed) • Let-it-crash model for fault tolerance and auto-healing • Clustering mechanism available (The Road to Akka Cluster, and Beyond)
  37. Akka Streams • Akka Streams is a DSL library for doing streaming computation on Akka • Materializer transforms each step into an Actor • Back-pressure enabled by default • Source → Flow → Sink (The Reactive Manifesto)
  38. Reactive Kafka • Akka Streams wrapper for Kafka • Commits processed offsets back into Kafka • Provides an at-least-once message delivery guarantee https://github.com/akka/reactive-kafka
  39. Message Delivery Semantics • Actor Model: at-most-once • Akka Persistence: at-least-once • Persists log to external storage (like a WAL) • Reactive Kafka: at-least-once + back-pressure • Writes offsets back into Kafka • At-least-once + idempotent writes = exactly-once
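The last bullet is the key trick: an at-least-once source may redeliver a message, but if the write is idempotent the redelivery has no extra effect, so the end-to-end result is effectively exactly-once. A minimal Python sketch of the idea (the actual pipeline is Scala/Akka; the names here are illustrative):

```python
# Sketch of "at-least-once + idempotent writes = exactly-once".
# The store is keyed by a unique message id, so redelivering the same
# message (as an at-least-once source like Kafka may do) changes nothing.

store = {}

def idempotent_write(msg_id, payload):
    """Keyed upsert: writing the same (msg_id, payload) twice is a no-op."""
    store[msg_id] = payload

# An at-least-once delivery stream: message "a" is redelivered after a retry.
deliveries = [("a", 1), ("b", 2), ("a", 1)]
for msg_id, payload in deliveries:
    idempotent_write(msg_id, payload)

print(store)   # {'a': 1, 'b': 2} -- the duplicate collapsed
```

This is also why Cassandra pairs well with Reactive Kafka in this stack: its writes are upserts by primary key, so replayed messages overwrite themselves.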
  40. Implementing Data Pipelines: Kafka source → Parsing Phase → Storing Phase → manual commit (Akka Streams + Reactive Kafka)
  41. Scale Up: Kafka source → mapAsync router → parser, parser, parser, … → Storing Phase → manual commit
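The scale-up slide fans the parsing phase out to a pool of parsers while back-pressure caps how much work is in flight. A rough Python analogue of that bounded-parallelism stage (the real pipeline uses Akka Streams' mapAsync in Scala; `parse` and the field names are made up for illustration):

```python
from concurrent.futures import ThreadPoolExecutor

def parse(raw):
    """Stand-in for the parsing phase: split a raw log line into fields."""
    ts, _, msg = raw.partition(" ")
    return {"ts": ts, "msg": msg}

def parse_all(lines, parallelism=4):
    # Like mapAsync(parallelism): at most `parallelism` parses run at once,
    # and results come out in input order (vs. mapAsyncUnordered).
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        return list(pool.map(parse, lines))

records = parse_all(["12:01 login ok", "12:02 login failed"])
print(records[0])  # {'ts': '12:01', 'msg': 'login ok'}
```

Keeping the parallelism bounded is what preserves back-pressure: the upstream Kafka source is only asked for more records when a parser slot frees up.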
  42. Scale Out • Scale out by creating more Docker containers with the same Kafka consumer group
  43. Be careful using Reactive Kafka manual commit
  44. A bug that periodically reset the committed offset (diagram): kafka source, manual commit; partition 0: offset 50, partition 1: offset 90, partition 2: offset 70
  45. A bug that periodically reset the committed offset (diagram): two kafka sources with manual commit; partition 0: offset 300, partition 1: offset 250, partition 2: offset 70; second source: partition 2: offset 350
  46. Normal: red line (lag) keeps declining
  47. Abnormal: red line (lag) keeps getting reset
  48. PR merged in the 0.8 branch. Apply the patch if version <= 0.8.7 https://github.com/akka/reactive-kafka/pull/144/commits
  49. Akka Cluster • Connects Akka runtimes into a cluster • No SPOF • Gossip Protocol • ClusterSingleton library available
  50. Lookup Service Integration (diagram): Akka Streams ×3, ClusterSingleton, Ldap Lookup, Hostname Lookup, Akka Cluster
  51. Lookup Service Integration (diagram): Akka Streams ×3, ClusterSingleton, Ldap Lookup, Hostname Lookup, Akka Cluster
  52. Lookup Service Integration (diagram): Akka Streams ×3, ClusterSingleton, Ldap Lookup, Hostname Lookup, Akka Cluster, Service Discovery
  53. Async Lookup Process (diagram): Akka Streams, ClusterSingleton, Hostname Lookup, Cache; Synchronous / Asynchronous
  54. Async Lookup Process (diagram): Akka Streams, ClusterSingleton, Hostname Lookup, Cache; Synchronous / Asynchronous
  55. Async Lookup Process (diagram): Akka Streams ×3, each with a Cache, ClusterSingleton, Hostname Lookup; Synchronous / Asynchronous
  56. Cassandra Data Model for Time Series Data
  57. Event Time vs. Processing Time
  58. Problem • We need to do analytics based on event time instead of processing time • Example: modelling user behaviour • Logs may arrive late • network slow, machine busy, service restart, etc. • Logs may arrive out of order • multiple sources, different network routing, etc.
  59. Problem • Spark Streaming can't handle event time (ts 12:03, ts 11:50, ts 12:04, ts 12:01, ts 12:02)
  60. Simple Solution • Store time series data in Cassandra • Query Cassandra based on event time • Get the most accurate result when the query executes • Late-tolerance time is flexible
  61. Cassandra Data Model • Retrieving an entire row is fast • Retrieving a range of columns is fast • Reads continuous blocks on disk (no random seeks)
  62. Data Model for AD log: use shardId to distribute the log into different partitions
  63. Data Model for AD log: partition key to support frequent queries for a specific hour
  64. Data Model for AD log: clustering key for range queries
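The three slides above describe the shape of the table without showing it. A hypothetical CQL sketch of such a schema: the slides only name shardId, an hour-based partition key, and a clustering key for range queries, so every other column name and type here is an assumption.

```sql
-- Hypothetical schema for the AD log; only the key structure is from the talk.
CREATE TABLE ad_log (
    day        text,       -- time bucket supporting the frequent per-hour queries
    hour       int,
    shard_id   int,        -- shardId: spreads one hour's writes across partitions
    event_time timestamp,  -- clustering key: enables event-time range queries
    event_id   timeuuid,   -- disambiguates events with the same timestamp
    payload    text,
    PRIMARY KEY ((day, hour, shard_id), event_time, event_id)
);

-- Event-time range query within one hour bucket and shard:
SELECT * FROM ad_log
WHERE day = '2016-05-09' AND hour = 12 AND shard_id = 3
  AND event_time >= '2016-05-09 12:00:00'
  AND event_time <  '2016-05-09 12:30:00';
```

Because the clustering key orders rows by event_time on disk, such a range query is a sequential read, matching the "continuous blocks, no random seeks" point on slide 61.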
  65. Wrap up
  66. Recap: SDACK Stack • Spark: streaming and batch analytics • Docker: ship, deploy and resource management • Akka: resilient, scalable, flexible data pipelines • Cassandra: batch queries based on event time • Kafka: durable buffer, stream processing
  67. Questions? Thank you!
