
Fraud Detection for Israel BigThings Meetup

Modern data systems don't just process massive amounts of data; they need to do it very fast. Using fraud detection as a convenient example, this session covers best practices for building real-time data processing applications with Apache Kafka. We'll explain how Kafka makes real-time processing almost trivial, discuss the pros and cons of the famous lambda architecture, help you choose a stream processing framework, and talk about deployment options.

Published in: Technology

  1. Real Time Anomaly Detection Patterns and reference architectures
     Gwen Shapira, System Architect
  2. Overview
     • Intro
     • Review Problem
     • Quick overview of key technology
     • High level architecture
     • Deep Dive into NRT Processing
     • Completing the Puzzle – Micro-batch, Ingest and Batch
     ©2014 Cloudera, Inc. All rights reserved.
  3. Gwen Shapira
     • 15 years of moving data
     • Formerly consultant, engineer
     • System Architect @ Confluent
     • Kafka Committer
     • @gwenshap
  4. There’s a Book on That
  5. Founded by creators of Kafka - @jaykreps, @nehanarkhede, @junrao
     We help you gather, transport, organize, and analyze all of your stream data
     What we offer
     • Confluent Platform
     • Kafka plus critical bug fixes not yet applied in Apache release
     • Kafka ecosystem projects
     • Enterprise support
     • Training and Professional Services
  6. The Problem
  7. Credit Card Transaction Fraud
  8. Coupon Fraud
  9. Video Game Strategy
  10. Health Insurance Fraud
  11. How do we React
     • Human Brain at Tennis
     • Muscle Memory
     • Reaction Thought
     • Reflective Meditation
  12. Overview of Key Technologies
  13. Kafka
  14. The Basics
     • Messages are organized into topics
     • Producers push messages
     • Consumers pull messages
     • Kafka runs in a cluster. Nodes are called brokers
  15. Topics, Partitions and Logs
  16. Each partition is a log
  17. Each Broker has many partitions
     [Diagram: Partition 0, Partition 1 and Partition 2, each shown three times across the brokers]
  18. Producers load balance between partitions
     [Diagram: a Client, plus Partition 0, Partition 1 and Partition 2, each shown three times across the brokers]
  19. Producers load balance between partitions
     [Diagram: same layout as slide 18]
  20. Consumers
     [Diagram: a Kafka Cluster topic with Partition A, Partition B and Partition C (each a file), read by consumers in Consumer Group X and Consumer Group Y; each partition carries an OffsetX and an OffsetY]
     • Order is retained within a partition, but not across partitions
     • Offsets are kept per consumer group
  21. Consumer-Producer Pattern
  22. Keeping Things Simple
     • Consume records from Kafka Topic
     • Filter, transform, join, lookups, aggregate
     • Write to another Kafka Topic
     • https://github.com/confluentinc/examples/tree/master/specific-avro-consumer
  23. Kafka Makes Streams Easy
     • Producers partition the data
     • Consumers load balance partitions
     • Add / remove consumers any way you want
     • Will work with any framework (or none!)
  24. Coming Soon to Kafka Near You
     • KafkaConnect - Export / Import for Kafka - 0.9.0 (It's here!)
     • KStream
     • Consumer-Producer client - Processor (0.10.0 - April?)
     • DSLs:
        • KStream (a bit like Spark) - (0.10.0 - April?)
        • SQL - ???
  25. KConnect - It's a thing
     • Easy to add connectors to Kafka
     • Existing connectors
        • JDBC
        • HDFS
        • MySQL * 2
        • ElasticSearch * 4
        • Cassandra
        • S3 * 2
        • MQTT
        • Twitter
  26. Links
     • Kafka Connectors:
        • http://www.confluent.io/developers/connectors
        • http://docs.confluent.io/2.0.0/connect/index.html
     • KStreams:
        • https://github.com/gwenshap/kafka-examples/blob/master/KafkaStreamsAvg
  27. Spark Streaming
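  28. Spark Example
     import org.apache.spark.{SparkConf, SparkContext}
     // SparkContext requires an application name as well as a master
     val conf = new SparkConf().setMaster("local[2]").setAppName("WordCount")
     val sc = new SparkContext(conf)
     val lines = sc.textFile(path, 2) // path: location of the input text file
     val words = lines.flatMap(_.split(" "))
     val pairs = words.map(word => (word, 1))
     val wordCounts = pairs.reduceByKey(_ + _)
     // RDDs have no print(); collect the counts to the driver and print them there
     wordCounts.collect().foreach(println)
  29. Spark Streaming Example
     import org.apache.spark.SparkConf
     import org.apache.spark.streaming.{Seconds, StreamingContext}
     val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingWordCount")
     val ssc = new StreamingContext(conf, Seconds(1)) // one-second micro-batches
     val lines = ssc.socketTextStream("localhost", 9999)
     val words = lines.flatMap(_.split(" "))
     val pairs = words.map(word => (word, 1))
     val wordCounts = pairs.reduceByKey(_ + _)
     wordCounts.print() // prints the first elements of each batch
     ssc.start() // nothing runs until the context is started
     ssc.awaitTermination()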
  30. Spark Streaming
     [Diagram: across the Pre-first Batch, First Batch and Second Batch, a Source feeds a Receiver that produces RDDs making up a DStream; each batch runs Filter, Count and Print in a Single Pass]
  31. Spark Streaming
     [Diagram: the same pipeline as slide 30, with Stateful RDD 1 and Stateful RDD 2 added alongside the batches]
  32. High Level Architecture
  33. Real-Time Event Processing Approach
     [Diagram labels]
     • Components: Kafka Clients (Swipe here!), Web App, Flume Agents, Local Cache, HBase / Memory, Spark Streaming, HDFSEventSink, Solr Sink
     • Hadoop Cluster I and Hadoop Cluster II: Storage (HDFS, Solr) and Processing (Hive/Impala, Map/Reduce, Spark, Search)
     • Flows: Fetching & Updating Profiles; Adjust NRT Statistics; Batch Time Adjustments; Automated & Manual Analytical Adjustments and Pattern detection; Automated & Manual Review of NRT Changes and Counters
  34. [Diagram labels]
     • Components: Kafka Clients (Swipe here!), Web App, Kafka, KStreams, Local Cache, Connector (two), HDFS, NoSQL, DWH, Solr, Analytics Layer, YARN / Mesos
     • Flows: Fetching & Updating Profiles; Adjusting NRT Stats; Batch Time Adjustments; Analytical Adjustments and Pattern detection; Review of NRT Changes and Counters
  35. [Diagram labels: three KStream Processors, Local Store, Profile Updates, Model Updates, Transactions, Decisions, DWH, RedoLog]
  36. NRT Processing
  37. Focus on NRT First
     [Diagram: the Real-Time Event Processing Approach diagram from slide 33, with "Processor" in place of "Flume Agents"]
  38. Streaming Architecture – NRT Event Processing
     [Diagram: a KafkaConsumer reads the Kafka Initial Events Topic into the Event Processing Logic, which uses Local Memory and an HBase Client backed by HBase; a KafkaProducer writes to the Kafka Answer Topic]
     • Able to respond within 10s of milliseconds
  39. Partitioned NRT Event Processing
     [Diagram: the slide 38 pipeline with a Local Cache, where producers, each with a Custom Partitioner, write to a topic split into Partition A, Partition B and Partition C]
     • Better use of local memory
  40. Questions?
     http://confluent.io
     @confluentInc @gwenshap
     gwen@confluent.io
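
The partitioned processing idea from slide 39 can be sketched in plain Scala, with no Kafka dependency. This is an illustration under stated assumptions, not the speaker's code: the names `Transaction`, `partitionFor`, `Processor` and `run`, the `hashCode`-based routing, and the "three times the running average" rule are all hypothetical. Kafka's default partitioner hashes the key bytes with murmur2, and a real deployment would plug its routing in through the producer's partitioner setting. The point shown is only that routing every event for a card to the same partition lets each processor keep that card's profile in its own local memory.

```scala
// Hypothetical sketch of key-based partitioning with per-partition local state.
object PartitionedProcessing {
  final case class Transaction(cardId: String, amount: Double)

  // Route every event for a given card to the same partition.
  // Assumption: hashCode stands in for Kafka's murmur2 hash of the key bytes.
  def partitionFor(cardId: String, numPartitions: Int): Int =
    Math.floorMod(cardId.hashCode, numPartitions)

  // One processor per partition. It only ever sees the cards routed to it,
  // so its local cache holds just that slice of the profiles.
  final class Processor {
    // cardId -> (number of transactions seen, total amount seen)
    private val profiles =
      scala.collection.mutable.Map.empty[String, (Long, Double)]

    // Illustrative rule: flag an amount above `factor` times the card's
    // running average. The first transaction for a card is never flagged.
    def isSuspicious(tx: Transaction, factor: Double = 3.0): Boolean = {
      val (count, total) = profiles.getOrElse(tx.cardId, (0L, 0.0))
      val suspicious = count > 0 && tx.amount > factor * (total / count)
      profiles.update(tx.cardId, (count + 1, total + tx.amount))
      suspicious
    }
  }

  // Simulates the consume, process, produce loop: each event goes to the
  // processor that owns its partition, and the decision is emitted with it.
  def run(events: Seq[Transaction], numPartitions: Int): Seq[(Transaction, Boolean)] = {
    val processors = Vector.fill(numPartitions)(new Processor)
    events.map { tx =>
      tx -> processors(partitionFor(tx.cardId, numPartitions)).isSuspicious(tx)
    }
  }
}
```

Calling `PartitionedProcessing.run` with a sequence of transactions and a partition count returns each transaction paired with its decision. Because a card always maps to the same partition, adding partitions spreads the profiles across processors without any of them needing a shared store for the hot path.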
