Designing Data Architectures for Robust Decision Making
Gwen Shapira / Software Engineer
About Me
• 15 years of moving data around
• Formerly consultant
• Now Cloudera Engineer:
  – Sqoop Committer
  – Kafka
  – Flume
• @gwenshap
There’s a book on that!
About you:
You know Hadoop
“Big Data” is stuck at The Lab.
We want to move to The Factory
What does it mean to “Systemize”?
• Ability to easily add new data sources
• Easily improve and extend analytics
• Ease data access by standardizing metadata and storage
• Ability to discover mistakes and to recover from them
• Ability to safely experiment with new approaches
We will not discuss:
• Actual decision making
• Data Science
• Machine learning
• Algorithms

We will discuss:
• Architectures
• Patterns
• Ingest
• Storage
• Schemas
• Metadata
• Streaming
• Experimenting
• Recovery
So how do we build real data architectures?
The Data Bus
Data pipelines start like this:
[Diagram: one Client sending data to one Source]
Then we reuse them:
[Diagram: many Clients sending data to the same Source]
Then we add consumers to the existing sources:
[Diagram: many Clients sending data to a Backend, with Another Backend consuming from it too]
Then it starts to look like this:
[Diagram: many Clients and several Backends, cross-connected point to point]
With maybe some of this:
[Diagram: the same tangle of Clients and Backends, with even more point-to-point connections]
Adding applications should be easier. We need:
• Shared infrastructure for sending records
• Infrastructure that scales
• A set of agreed-upon record schemas
Kafka-Based Ingest Architecture
[Diagram: Source Systems feed Kafka (Producers → Brokers → Consumers), which feeds Hadoop, Security Systems, Real-time monitoring, and the Data Warehouse]
Kafka decouples data pipelines.
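To make the decoupling concrete, here is a minimal sketch, not from the slides, of a producer publishing to one of these topics using the standard Java client from Scala; the broker address and payload are assumptions. Any number of consumers can later read the same topic without the producer knowing about them.

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object OrdersProducer {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "broker1:9092") // assumed broker address
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)
    // The producer only knows the topic name; downstream consumers
    // (Hadoop, monitoring, the warehouse) subscribe independently.
    producer.send(new ProducerRecord[String, String](
      "pharmacy.fraud.orders.raw", "order-1", """{"order_id": 1}"""))
    producer.close()
  }
}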
Retain All Data
Data Pipeline – Traditional View
[Diagram: raw data flows through clean, enriched, and aggregated stages; only the input and output are kept, and the intermediate data is labeled a waste of disk space]
It is all valuable data
[Diagram: the same pipeline, but raw, clean, enriched, filtered, and aggregated data all feed dashboards, reports, data scientists, and alerts]
Hadoop-Based ETL – The FileSystem is the DB
/user/…
/user/gshapira/testdata/orders
/data/<database>/<table>/<partition>
/data/<biz unit>/<app>/<dataset>/partition
/data/pharmacy/fraud/orders/date=20131101
/etl/<biz unit>/<app>/<dataset>/<stage>
/etl/pharmacy/fraud/orders/validated
Store intermediate data
/etl/<biz unit>/<app>/<dataset>/<stage>/<dataset_id>
/etl/pharmacy/fraud/orders/raw/date=20131101
/etl/pharmacy/fraud/orders/deduped/date=20131101
/etl/pharmacy/fraud/orders/validated/date=20131101
/etl/pharmacy/fraud/orders_labs/merged/date=20131101
/etl/pharmacy/fraud/orders_labs/aggregated/date=20131101
/etl/pharmacy/fraud/orders_labs/ranked/date=20131101
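A convention like this is easy to encode once and reuse everywhere; a minimal sketch (the helper names are hypothetical, not from the slides):

// Hypothetical helper: builds HDFS paths following the
// /etl/<biz unit>/<app>/<dataset>/<stage>/<dataset_id> convention.
case class EtlDataset(bizUnit: String, app: String, dataset: String) {
  def stagePath(stage: String, datasetId: String): String =
    s"/etl/$bizUnit/$app/$dataset/$stage/$datasetId"
}

val orders = EtlDataset("pharmacy", "fraud", "orders")
orders.stagePath("deduped", "date=20131101")
// => /etl/pharmacy/fraud/orders/deduped/date=20131101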
Batch ETL is old news
Small Problem!
• HDFS is optimized for large chunks of data
• Don’t write individual events or micro-batches
• Think 100MB to 2GB batches
• What do we do with small events?
Well, we have this data bus…
[Diagram: a Kafka topic with three partitions, each an ordered log of numbered offsets; writes append at the new end of each partition]
Kafka has topics. How about:
<biz unit>.<app>.<dataset>.<stage>
pharmacy.fraud.orders.raw
pharmacy.fraud.orders.deduped
pharmacy.fraud.orders.validated
pharmacy.fraud.orders_labs.merged
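A small hypothetical helper (mirroring the HDFS path sketch above) keeps topic names and directory stages aligned:

// The same coordinates that name an HDFS stage directory
// also name the corresponding Kafka topic.
def stageTopic(bizUnit: String, app: String, dataset: String, stage: String): String =
  s"$bizUnit.$app.$dataset.$stage"

stageTopic("pharmacy", "fraud", "orders", "deduped")
// => pharmacy.fraud.orders.deduped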
It’s (almost) all topics
[Diagram: the same pipeline with each stage (raw, clean, enriched, filtered, aggregated) as a topic feeding dashboards, reports, data scientists, and alerts]
Benefits
• Recover from accidents
• Debug suspicious results
• Fix algorithm errors
• Experiment with new algorithms
• Extend pipelines
• Jump-start extended pipelines
Kinda Lambda
Lambda Architecture
• Immutable events
• Store intermediate stages
• Combine batches and streams
• Reprocessing
What we don’t like: maintaining two applications, often in two languages, that do the same thing.
Pain Avoidance #1 – Use Spark + Spark Streaming
• Spark is awesome for batch, so why not?
  – The new kid that isn’t that new anymore
  – Easily 10x less code
  – Extremely easy and powerful API
  – Very good for machine learning
  – Scala, Java, and Python
  – RDDs
  – DAG engine
Spark Streaming
• Calling Spark in a loop
• Extends RDDs with DStream
• Very little code changes from ETL to streaming
Spark Streaming
[Diagram: a Receiver turns each Source into RDDs; every batch (first, second, …) runs the same single pass of Filter → Count → Print]
Small Example

// Assumes the usual Spark Streaming imports:
// org.apache.spark.SparkConf, org.apache.spark.storage.StorageLevel,
// org.apache.spark.streaming.{Seconds, StreamingContext}
val sparkConf = new SparkConf()
  .setMaster(args(0)).setAppName(this.getClass.getCanonicalName)
val ssc = new StreamingContext(sparkConf, Seconds(10))

// Create the DStream from data sent over the network
val dStream = ssc.socketTextStream(args(1), args(2).toInt,
  StorageLevel.MEMORY_AND_DISK_SER)

// Count the errors in each RDD in the stream
// (ErrorCount.countErrors is a helper defined elsewhere in the app,
// assumed to return pairs of (key, error count))
val errCountStream = dStream.transform(rdd => ErrorCount.countErrors(rdd))
val stateStream = errCountStream.updateStateByKey[Int](updateFunc)
errCountStream.foreachRDD(rdd => {
  System.out.println("Errors this minute:%d".format(rdd.first()._2))
})

// Start the streaming job (not shown on the slide); stateful operations
// like updateStateByKey also need ssc.checkpoint(dir) in a real app.
ssc.start()
ssc.awaitTermination()
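The slide doesn’t show updateFunc; a minimal sketch of what it might look like, assuming errCountStream carries (key, count) pairs:

// Hypothetical updateFunc: keep a running error count per key across
// batches. This is the signature updateStateByKey[Int] expects on a
// DStream of (K, Int) pairs.
val updateFunc = (newCounts: Seq[Int], state: Option[Int]) =>
  Some(newCounts.sum + state.getOrElse(0))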
Pain Avoidance #2 – Split the Stream
Why do we even need stream + batch?
• Batch efficiencies
• Re-process to fix errors
• Re-process after delayed arrival
What if we could re-play data?
Let’s re-process with a new algorithm:
[Diagram, shown in two steps: the same Kafka log feeds both Streaming App v1 and Streaming App v2; each writes its own result set (Result set 1, Result set 2), and the downstream App chooses between them]
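Replaying is cheap because Kafka retains the log; a minimal sketch (broker address, topic, and group id are assumptions) of a v2 consumer that starts from the beginning of the retained data:

import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer

val props = new Properties()
props.put("bootstrap.servers", "broker1:9092")  // assumed broker
props.put("group.id", "fraud-detector-v2")      // new group = its own offsets
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("auto.offset.reset", "earliest")      // start from the oldest retained record

val consumer = new KafkaConsumer[String, String](props)
consumer.subscribe(Collections.singletonList("pharmacy.fraud.orders.raw"))
// poll() now replays history before catching up to the live end of the log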
Oh no, we just got a bunch of data for yesterday!
[Diagram: one streaming app instance processes today’s stream while a second instance replays yesterday’s portion of the log]
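One way to keep a replay like this from polluting today’s results (an assumption, not something the slides prescribe) is to route output by event time rather than arrival time:

// Hypothetical record with an embedded event timestamp.
case class Order(id: String, eventDate: String) // e.g. "20131101"

// Route each record to the partition for the day it happened, so
// late-arriving data lands in yesterday's partition, not today's.
def outputPath(o: Order): String =
  s"/etl/pharmacy/fraud/orders/validated/date=${o.eventDate}"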
Note: no need to choose between the approaches. There are good reasons to do both.
Prediction: the batch vs. streaming distinction is going away.
Yes, you really need a Schema
Schema is a MUST HAVE for data integration
[Diagram: the point-to-point tangle of Clients and Backends, shown again]
Remember that we want this?
[Diagram: Source Systems feeding Kafka (Producers, Brokers, Consumers), which feeds Hadoop, Security Systems, Real-time monitoring, and the Data Warehouse]
This means we need this:
[Diagram: the same Kafka architecture, with a Schema Repository shared by everything that produces or consumes]
We can do it in a few ways:
• People go around asking each other: “So, what does the 5th field of the messages in topic Blah contain?”
• There’s utility code for reading/writing messages that everyone reuses
• Schema embedded in the message
• A centralized repository for schemas
  – Each message has a schema ID
  – Each topic has a schema ID
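A minimal sketch of the “schema ID per message” option; the one-byte magic plus four-byte ID layout is a common convention, not something the slides specify:

import java.nio.ByteBuffer

// Prefix each serialized record with a schema ID so any consumer
// can fetch the right schema from the central repository.
def frame(schemaId: Int, avroBytes: Array[Byte]): Array[Byte] = {
  val buf = ByteBuffer.allocate(1 + 4 + avroBytes.length)
  buf.put(0.toByte)     // magic byte marks the framing version
  buf.putInt(schemaId)  // ID to look up in the schema repository
  buf.put(avroBytes)    // the Avro-encoded payload itself
  buf.array()
}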
I ♥ Avro
• Define schema
• Generate code for objects
• Serialize / deserialize into bytes or JSON
• Embed schema in files / records… or not
• Support for our favorite languages… except Go
• Schema evolution
  – Add and remove fields without breaking anything
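A minimal sketch of define / serialize / deserialize using Avro’s generic API; the Order schema itself is a hypothetical example:

import java.io.ByteArrayOutputStream
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericDatumReader, GenericDatumWriter, GenericRecord}
import org.apache.avro.io.{DecoderFactory, EncoderFactory}

// Hypothetical schema for the orders example.
val schema = new Schema.Parser().parse(
  """{"type":"record","name":"Order","fields":[
    |  {"name":"id","type":"string"},
    |  {"name":"amount","type":"double"}
    |]}""".stripMargin)

// Serialize a record to bytes.
val record = new GenericData.Record(schema)
record.put("id", "order-1")
record.put("amount", 9.99)
val out = new ByteArrayOutputStream()
val encoder = EncoderFactory.get().binaryEncoder(out, null)
new GenericDatumWriter[GenericRecord](schema).write(record, encoder)
encoder.flush()

// Deserialize the bytes back into a record.
val decoder = DecoderFactory.get().binaryDecoder(out.toByteArray, null)
val back = new GenericDatumReader[GenericRecord](schema).read(null, decoder)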
Schemas are Agile
• Leave out MySQL and your favorite DBA for a second
• Schemas allow adding readers and writers easily
• Schemas allow modifying readers and writers independently
• Schemas can evolve as the system grows
• Allow validating data soon after it’s written
  – No need to throw away data that doesn’t fit!
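Evolution in practice: a reader with a newer schema can decode data written with an older one, as long as new fields carry defaults. A minimal sketch with hypothetical schema versions:

import org.apache.avro.Schema
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}

// v1: the schema the data was written with.
val writerSchema = new Schema.Parser().parse(
  """{"type":"record","name":"Order","fields":[
    |  {"name":"id","type":"string"}
    |]}""".stripMargin)

// v2: adds a field with a default, so v2 readers can still decode
// v1 data (and v1 readers simply ignore the new field).
val readerSchema = new Schema.Parser().parse(
  """{"type":"record","name":"Order","fields":[
    |  {"name":"id","type":"string"},
    |  {"name":"channel","type":"string","default":"web"}
    |]}""".stripMargin)

// Avro resolves the difference between the two schemas at read time.
val reader = new GenericDatumReader[GenericRecord](writerSchema, readerSchema)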
Woah, that was lots of stuff!
Recap – if you remember nothing else…
• After the POC, it’s time for production
• Goal: evolve fast without breaking things
For this you need:
• Keep all data
• Design pipelines for error recovery, batch or stream
• Integrate with a data bus
• And schemas
Thank you
Slides from my Strata 2015 presentation, showing how to build agile data pipelines.