Ce diaporama a bien été signalé.
Nous utilisons votre profil LinkedIn et vos données d’activité pour vous proposer des publicités personnalisées et pertinentes. Vous pouvez changer vos préférences de publicités à tout moment.
© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Arie Leeuwesteijn,Solutions Architect,AWS
May 20...
Agenda
Big data challenges
How to simplify big data processing
What technologies should you use?
• Why?
• How?
Reference a...
Ever Increasing Big Data
Volume
Velocity
Variety
Big Data Evolution
Batch
processing
Stream
processing
Machine
learning
Plethora of Tools
Amazon
Glacier
S3 DynamoDB
RDS
EMR
Amazon
Redshift
Data Pipeline
Amazon
Kinesis
Kinesis-enabled
app
Lamb...
Big Data Challenges
Is there a reference architecture?
What tools should I use?
How?
Why?
Decoupled “data bus”
• Data → Store → Process → Store → Analyze → Answers
Use the right tool for the job
• Data structure,...
Simplify Big Data Processing
COLLECT STORE
PROCESS/
ANALYZE
CONSUME
Time to answer (Latency)
Throughput
Cost
COLLECT
Types of Data
Database records
Search documents
Log files
Messaging events
Devices / sensors / IoT stream
Devices
Sensors ...
Store
Amazon Kinesis
Firehose
Amazon Kinesis
Streams
Apache Kafka
Amazon DynamoDB
Streams
Amazon SQS
Amazon SQS
• Managed messag...
Why Stream Storage?
Decouple producers & consumers
Persistent buffer
Collect multiple streams
Preserve client ordering
Str...
What About Queues & Pub/Sub ?
• Decouple producers &
consumers/subscribers
• Persistent buffer
• Collect multiple streams
...
What Stream Storage should I use?
Amazon
DynamoDB
Streams
Amazon
Kinesis
Streams
Amazon
Kinesis
Firehose
Apache
Kafka
Amaz...
COLLECT STORE
Mobile apps
Web apps
Data centers
AWS Direct
Connect
RECORDS
Database
AWS Import/Export
Snowball
Logging
Ama...
Why is Amazon S3 Good for Big Data?
• Natively supported by big data frameworks (Spark, Hive, Presto, etc.)
• No need to r...
What about HDFS & Amazon Glacier?
• Use HDFS for very frequently accessed
(hot) data
• Use Amazon S3 Standard for frequent...
Cache, database, search
COLLECT STORE
Mobile apps
Web apps
Data centers
AWS Direct
Connect
RECORDS
AWS Import/Export
Snowb...
Database Anti-pattern
Data Tier
Database tier
Best Practice - Use the Right Tool for the Job
Data Tier
Search
Amazon
Elasticsearch
Service
Cache
Amazon
ElastiCache
Redi...
Materialized Views
Amazon
Elasticsearch
Service
KCL
What Data Store Should I Use?
Data structure → Fixed schema, JSON, key-value
Access patterns → Store data in the format yo...
Data Structure and Access Patterns
Access Patterns What to use?
Put/Get (key, value) Cache, NoSQL
Simple relationships → 1...
What Is the Temperature of Your Data / Access ?
Hot Warm Cold
Volume MB–GB GB–TB PB
Item size B–KB KB–MB KB–TB
Latency ms ms, sec min, hrs
Durability Low – high High Very...
Cache
SQL
Request rate
High Low
Cost/GB
High Low
Latency
Low High
Data volume
Low High
Amazon
Glacier
Structure
NoSQL
Hot ...
Amazon ElastiCache Amazon
DynamoDB
Amazon
RDS/Aurora
Amazon
Elasticsearch
Amazon S3 Amazon Glacier
Average
latency
ms ms m...
Cost Conscious Design
Example: Should I use Amazon S3 or Amazon DynamoDB?
“I’m currently scoping out a project that will g...
Cost Conscious Design
Example: Should I use Amazon S3 or Amazon DynamoDB?
https://calculator.s3.amazonaws.com/index.html
S...
Request rate
(Writes/sec)
Object size
(Bytes)
Total size
(GB/month)
Objects per
month
300 2,048 1,483 777,600,000
Amazon S...
Request rate
(Writes/sec)
Object size
(Bytes)
Total size
(GB/month)
Objects per
month
Scenario 1300 2,048 1,483 777,600,00...
COLLECT STORE PROCESS / ANALYZE
Amazon Elasticsearch
Service
Apache Kafka
Amazon SQS
Amazon Kinesis
Streams
Amazon Kinesis...
PROCESS /
ANALYZE
Process / Analyze
• Batch - Minutes or hours on cold data
• Daily/weekly/monthly reports
• Interactive – Seconds on warm/c...
What Analytics Technology Should I Use?
Amazon
Redshift
Amazon EMR
Presto Spark Hive
Query
latency
Low Low Low High
Durabi...
Predictions via Machine Learning
ML gives computers the ability to learn without being explicitly
programmed
Machine learn...
Tools and Frameworks
Machine Learning
• Amazon ML, Amazon EMR (Spark ML)
Interactive
• Amazon Redshift, Amazon EMR (Presto...
What Streaming / Messaging Technology Should I Use?
Spark
Streaming
Apache
Storm
Kinesis KCL
Application
AWS Lambda Amazon...
What About ETL?
https://aws.amazon.com/big-data/partner-solutions/
ETLSTORE PROCESS / ANALYZE
Amazon SQS apps
Streaming
Amazon Kinesis
Analytics
Amazon KCL
apps
AWS Lambda
Amazon Redshift
COLLECT STORE CONSUMEPROCESS...
CONSUME
STORE CONSUMEPROCESS / ANALYZE
Amazon QuickSight
Apps & Services
Analysis&visualizationNotebooksIDEAPI
Applications & API
...
Putting It All Together
Amazon SQS apps
Streaming
Amazon Kinesis
Analytics
Amazon KCL
apps
AWS Lambda
Amazon Redshift
COLLECT STORE CONSUMEPROCESS...
AWS Summit
Sanoma Big Data Migration Case
Sander Kieft
Sanoma, Publishing and Learning company
27 May 2016 AWSSummit– Sanoma Big Data MIgration47
2+100
2 Finnish newspapers
Over...
§ Users of Sanoma’s websites,mobile applications,online TV products generate large volumes of data
§ We use this data to i...
Starting point
27 May 2016 AWSSummit– Sanoma Big Data MIgration4949
ETL
(Jenkins & Jython)
QlikView
Reporting
Hive
Collect...
60+
NODES
720TB
CAPACITY
475TB
DATA STORED (GROSS)
180TB
DATA STORED (NETTO)
150GB
DAILY GROWTH
50+
DATA SOURCES
200+
DATA PROCESSES
3000+
DAILY HADOOP JOBS
200+
DASHBOARDS
275+
AVG MONTHLY DASHBOARD USERS
Challenges
Positives
§ Users have been steadily increasing
§ Demand for ..
– more real time processing
– faster data avail...
§ Three scenario’s evaluated for moving
Amazon Migration Scenario’s
27 May 2016 AWSSummit– Sanoma Big Data MIgration62
S
M...
Easier to leverage spot pricing, due to data on S3
§ Three scenario’s evaluated for moving
Amazon Migration Scenario’s
27 ...
Target architecture
27 May 2016 AWSSummit– Sanoma Big Data MIgration6464
ETL
(Jenkins & Jython)
QlikView
Reporting
R Studi...
§ We’re almost done with the migration. Running in parallel now.
§ Setup solves our challenges, some still require work
§ ...
Design Patterns
Primitive: Multi-Stage Decoupled “Data Bus”
Multiple stages
Storage decouples multiple processing stages
Store Process Sto...
Primitive: Multiple Stream Processing Applications
Can Read from Amazon Kinesis
Amazon
Kinesis
AWS
Lambda
Amazon
DynamoDB
...
Amazon EMR
Primitive: Analysis Frameworks Could Read
from Multiple Data Stores
Amazon
Kinesis
AWS
Lambda
Amazon S3
Amazon
...
Spark Streaming
Apache Storm
AWS Lambda
KCL apps
Amazon
Redshift
Amazon
Redshift
Hive
Spark
Presto
Amazon Kinesis
Apache K...
Amazon EMR
Real-time Analytics
Apache
Kafka
KCL
AWS Lambda
Spark
Streaming
Apache
Storm
Amazon
SNS
Amazon
ML
Notifications...
Message / Event
Processing
Amazon
SQS
Amazon SQS
App
Amazon SQS
App
Amazon
SNS Subscribers
Amazon
ElastiCache
(Redis)
Amaz...
Interactive
&
Batch
Analytics
Amazon S3
Amazon EMR
Hive
Pig
Spark
Amazon
ML
process
store
Consume
Amazon
Redshift
Amazon E...
Batch layer
Amazon
Kinesis
Data
stream
process
store
Amazon
Kinesis S3
Connector
Amazon S3
A
p
p
l
i
c
a
t
i
o
n
s
Amazon
...
Summary
Decoupled “data bus”
• Data → Store → Process → Store → Analyze → Answers
Use the right tool for the job
• Data st...
Thank you!
aws.amazon.com/big-data
Big Data Architectural Patterns
Big Data Architectural Patterns
Prochain SlideShare
Chargement dans…5
×

Big Data Architectural Patterns

The world is producing an ever increasing volume, velocity, and variety of big data. Consumers and businesses are demanding up-to-the-second (or even millisecond) analytics on their fast-moving data, in addition to classic batch processing. AWS delivers many technologies for solving big data problems. But what services should you use, why, when, and how? In this session, we simplify big data processing as a data bus comprising various stages: ingest, store, process, and visualize. Next, we discuss how to choose the right technology in each stage based on criteria such as data structure, query latency, cost, request rate, item size, data volume, durability, and so on. Finally, we provide reference architecture, design patterns, and best practices for assembling these technologies to solve your big data problems at the right cost.

Presented by: Arie Leeuwesteijn, Principal Solutions Architect, Amazon Web Services

Customer Guest: Sander Kieft, Sanoma

Big Data Architectural Patterns

  1. 1. © 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Arie Leeuwesteijn,Solutions Architect,AWS May 2016 Big Data Architectural Patterns and Best Practices on AWS
  2. 2. Agenda Big data challenges How to simplify big data processing What technologies should you use? • Why? • How? Reference architecture Sanoma Use Case Design patterns
  3. 3. Ever Increasing Big Data Volume Velocity Variety
  4. 4. Big Data Evolution Batch processing Stream processing Machine learning
  5. 5. Plethora of Tools Amazon Glacier S3 DynamoDB RDS EMR Amazon Redshift Data Pipeline Amazon Kinesis Kinesis-enabled app Lambda ML SQS ElastiCache DynamoDB Streams Amazon Elasticsearch Service
  6. 6. Big Data Challenges Is there a reference architecture? What tools should I use? How? Why?
  7. 7. Decoupled “data bus” • Data → Store → Process → Store → Analyze → Answers Use the right tool for the job • Data structure, latency, throughput, access patterns Use Lambda architecture ideas • Immutable (append-only) log, batch/speed/serving layer Leverage AWS managed services • Scalable/elastic, available, reliable, secure, no/low admin Big data ≠ big cost Architectural Principles
  8. 8. Simplify Big Data Processing COLLECT STORE PROCESS/ ANALYZE CONSUME Time to answer (Latency) Throughput Cost
  9. 9. COLLECT
  10. 10. Types of Data Database records Search documents Log files Messaging events Devices / sensors / IoT stream Devices Sensors & IoT platforms AWS IoT STREAMS Stream storage IoT COLLECT STORE Mobile apps Web apps Data centers AWS Direct Connect RECORDS Database Applications AWS Import/Export Snowball Logging Amazon CloudWatch AWS CloudTrail DOCUMENTS FILES Search File store LoggingTransport Messaging Message MESSAGES Queue Messaging
  11. 11. Store
  12. 12. Amazon Kinesis Firehose Amazon Kinesis Streams Apache Kafka Amazon DynamoDB Streams Amazon SQS Amazon SQS • Managed message queue service Apache Kafka • High throughput distributed messaging system Amazon Kinesis Streams • Managed stream storage + processing Amazon Kinesis Firehose • Managed data delivery Amazon DynamoDB • Managed NoSQL database • Tables can be stream-enabled Message & Stream Storage Devices Sensors & IoT platforms AWS IoT STREAMS IoT COLLECT STORE Mobile apps Web apps Data centers AWS Direct Connect RECORDS Database Applications AWS Import/Export Snowball Logging Amazon CloudWatch AWS CloudTrail DOCUMENT S FILES Search File store LoggingTransport Messaging Message MESSAGES Messaging Queue Stream
  13. 13. Why Stream Storage? Decouple producers & consumers Persistent buffer Collect multiple streams Preserve client ordering Streaming MapReduce Parallel consumption 4 4 3 3 2 2 1 1 4 3 2 1 4 3 2 1 4 3 2 1 4 3 2 1 4 4 3 3 2 2 1 1 shard 1 / partition 1 shard 2 / partition 2 Consumer 1 Count of red = 4 Count of violet = 4 Consumer 2 Count of blue = 4 Count of green = 4 DynamoDB stream Amazon Kinesis stream Kafka topic
  14. 14. What About Queues & Pub/Sub ? • Decouple producers & consumers/subscribers • Persistent buffer • Collect multiple streams • No client ordering • No parallel consumption for Amazon SQS • Amazon SNS can route to multiple queues or ʎ functions • No streaming MapReduce Consumers Producers Producers Amazon SNS Amazon SQS queue topic function ʎ AWS Lambda Amazon SQS queue Subscriber
  15. 15. What Stream Storage should I use? Amazon DynamoDB Streams Amazon Kinesis Streams Amazon Kinesis Firehose Apache Kafka Amazon SQS AWS managed service Yes Yes Yes No Yes Guaranteed ordering Yes Yes Yes Yes No Delivery exactly-once at-least-once exactly-once at-least-once at-least-once Data retention period 24 hours 7 days N/A Configurable 14 days Availability 3 AZ 3 AZ 3 AZ Configurable 3 AZ Scale / throughput No limit/ ~ table IOPS No limit/ ~ shards No limit/ automatic No limit/ ~ nodes No limits / automatic Parallel clients Yes Yes No Yes No Stream MapReduce Yes Yes N/A Yes N/A Record/object size 400 KB 1 MB Amazon Redshift row size Configurable 256 KB Cost Higher (table cost) Low Low Low (+admin) Low-medium Hot Warm
  16. 16. COLLECT STORE Mobile apps Web apps Data centers AWS Direct Connect RECORDS Database AWS Import/Export Snowball Logging Amazon CloudWatch AWS CloudTrail DOCUMENT S FILES Search Messaging Message MESSAGES Devices Sensors & IoT platforms AWS IoT STREAMS Apache Kafka Amazon Kinesis Streams Amazon Kinesis Firehose Amazon DynamoDB Streams Hot Stream Amazon S3 Amazon SQS Message Amazon S3 File LoggingIoTApplicationsTransportMessaging File Storage
  17. 17. Why is Amazon S3 Good for Big Data? • Natively supported by big data frameworks (Spark, Hive, Presto, etc.) • No need to run compute clusters for storage (unlike HDFS) • Can run transient Hadoop clusters & Amazon EC2 Spot Instances • Multiple distinct (Spark, Hive, Presto) clusters can use the same data • Unlimited number of objects • Very high bandwidth – no aggregate throughput limit • Highly available – can tolerate AZ failure • Designed for 99.999999999% durability • Tired-storage (Standard, IA, Amazon Glacier) via life-cycle policy • Secure – SSL, client/server-side encryption at rest • Low cost
  18. 18. What about HDFS & Amazon Glacier? • Use HDFS for very frequently accessed (hot) data • Use Amazon S3 Standard for frequently accessed data • Use Amazon S3 Standard – IA for infrequently accessed data • Use Amazon Glacier for archiving cold data
  19. 19. Cache, database, search COLLECT STORE Mobile apps Web apps Data centers AWS Direct Connect RECORDS AWS Import/Export Snowball Logging Amazon CloudWatch AWS CloudTrail DOCUMENT S FILES Messaging Message MESSAGES Devices Sensors & IoT platforms AWS IoT STREAMS Apache Kafka Amazon Kinesis Streams Amazon Kinesis Firehose Amazon DynamoDB Streams Hot Stream Amazon SQS Message Amazon Elasticsearch Service Amazon DynamoDB Amazon S3 Amazon ElastiCache Amazon RDS SearchSQLNoSQLCacheFile LoggingIoTApplicationsTransportMessaging
  20. 20. Database Anti-pattern Data Tier Database tier
  21. 21. Best Practice - Use the Right Tool for the Job Data Tier Search Amazon Elasticsearch Service Cache Amazon ElastiCache Redis Memcached SQL Amazon Aurora MySQL PostgreSQL Oracle SQL Server MariaDB NoSQL Amazon DynamoDB Cassandra HBase MongoDB Database tier
  22. 22. Materialized Views Amazon Elasticsearch Service KCL
  23. 23. What Data Store Should I Use? Data structure → Fixed schema, JSON, key-value Access patterns → Store data in the format you will access it Data / access characteristics → Hot, warm, cold Cost → Right cost
  24. 24. Data Structure and Access Patterns Access Patterns What to use? Put/Get (key, value) Cache, NoSQL Simple relationships → 1:N, M:N NoSQL Cross table joins, transaction, SQL SQL Faceting, search, free text Search Data Structure What to use? Fixed schema SQL, NoSQL Schema-free (JSON) NoSQL, Search (Key, value) Cache, NoSQL
  25. 25. What Is the Temperature of Your Data / Access ?
  26. 26. Hot Warm Cold Volume MB–GB GB–TB PB Item size B–KB KB–MB KB–TB Latency ms ms, sec min, hrs Durability Low – high High Very high Request rate Very high High Low Cost/GB $$-$ $-¢¢ ¢ Hot data Warm data Cold data Data / Access Characteristics: Hot, Warm, Cold
  27. 27. Cache SQL Request rate High Low Cost/GB High Low Latency Low High Data volume Low High Amazon Glacier Structure NoSQL Hot data Warm data Cold data Low High Search
  28. 28. Amazon ElastiCache Amazon DynamoDB Amazon RDS/Aurora Amazon Elasticsearch Amazon S3 Amazon Glacier Average latency ms ms ms, sec ms,sec ms,sec,min (~ size) hrs Typical data stored GB GB–TBs (no limit) GB–TB (64 TB max) GB–TB MB–PB (no limit) GB–PB (no limit) Typical item size B-KB KB (400 KB max) KB (64 KB max) KB (2 GB max) KB-TB (5 TB max) GB (40 TB max) Request Rate High – very high Very high (no limit) High High Low – high (no limit) Very low Storage cost GB/month $$ ¢¢ ¢¢ ¢¢ ¢ ¢/10 Durability Low - moderate Very high Very high High Very high Very high Availability High 2 AZ Very high 3 AZ Very high 3 AZ High 2 AZ Very high 3 AZ Very high 3 AZ Hot data Warm data Cold data What Data Store Should I Use?
  29. 29. Cost Conscious Design Example: Should I use Amazon S3 or Amazon DynamoDB? “I’m currently scoping out a project that will greatly increase my team’s use of Amazon S3. Hoping you could answer some questions. The current iteration of the design calls for many small files, perhaps up to a billion during peak. The total size would be on the order of 1.5 TB per month…” Request rate (Writes/sec) Object size (Bytes) Total size (GB/month) Objects per month 300 2048 1483 777,600,000
  30. 30. Cost Conscious Design Example: Should I use Amazon S3 or Amazon DynamoDB? https://calculator.s3.amazonaws.com/index.html Simple Monthly Calculator
  31. 31. Request rate (Writes/sec) Object size (Bytes) Total size (GB/month) Objects per month 300 2,048 1,483 777,600,000 Amazon S3 or Amazon DynamoDB?
  32. 32. Request rate (Writes/sec) Object size (Bytes) Total size (GB/month) Objects per month Scenario 1300 2,048 1,483 777,600,000 Scenario 2300 32,768 23,730 777,600,000 Amazon S3 Amazon DynamoDB use use
  33. 33. COLLECT STORE PROCESS / ANALYZE Amazon Elasticsearch Service Apache Kafka Amazon SQS Amazon Kinesis Streams Amazon Kinesis Firehose Amazon DynamoDB Amazon S3 Amazon ElastiCache Amazon RDS Amazon DynamoDB Streams HotHotWarm SearchSQLNoSQLCacheFileMessage Stream Mobile apps Web apps Devices Messaging Message Sensors & IoT platforms AWS IoT Data centers AWS Direct Connect AWS Import/Export Snowball Logging Amazon CloudWatch AWS CloudTrail RECORDS DOCUMENT S FILES MESSAGES STREAMS LoggingIoTApplicationsTransportMessaging Process / analyze
  34. 34. PROCESS / ANALYZE
  35. 35. Process / Analyze • Batch - Minutes or hours on cold data • Daily/weekly/monthly reports • Interactive – Seconds on warm/cold data • Self-service dashboards • Messaging – Milliseconds or seconds on hot data • Message/event buffering • Streaming - Milliseconds or seconds on hot data • Billing/fraud alerts, 1 minute metrics
  36. 36. What Analytics Technology Should I Use? Amazon Redshift Amazon EMR Presto Spark Hive Query latency Low Low Low High Durability High High High High Data volume 1.6 PB max ~Nodes ~Nodes ~Nodes AWS managed Yes Yes Yes Yes Storage Native HDFS / S3 HDFS / S3 HDFS / S3 SQL compatibility High High Low (SparkSQL) Medium (HQL) Slow
  37. 37. Predictions via Machine Learning ML gives computers the ability to learn without being explicitly programmed Machine learning algorithms: Supervised learning ← “teach” program - Classification ← Is this transaction fraud? (yes/no) - Regression ← Customer life-time value? Unsupervised learning ← Let it learn by itself - Clustering ← Market segmentation
  38. 38. Tools and Frameworks Machine Learning • Amazon ML, Amazon EMR (Spark ML) Interactive • Amazon Redshift, Amazon EMR (Presto, Spark) Batch • Amazon EMR (MapReduce, Hive, Pig, Spark) Messaging • Amazon SQS application on Amazon EC2 Streaming • Micro-batch: Spark Streaming, KCL • Real-time: Amazon Kinesis Analytics, Storm, AWS Lambda, KCL Amazon SQS apps Streaming Amazon Kinesis Analytics Amazon KCL apps AWS Lambda Amazon Redshift PROCESS / ANALYZE Amazon Machine Learning Presto Amazon EMR FastSlowFast BatchMessageInteractiveStreamML Amazon EC2 Amazon EC2
  39. 39. What Streaming / Messaging Technology Should I Use? Spark Streaming Apache Storm Kinesis KCL Application AWS Lambda Amazon SQS Apps Scale ~ Nodes ~ Nodes ~ Nodes Automatic ~ Nodes Micro-batch or Real-time Micro-batch Real-time Near-real-time Near-real-time Near-real-time AWS managed service Yes (EMR) No (EC2) No (KCL + EC2 + Auto Scaling) Yes No (EC2 + Auto Scaling) Scalability No limits ~ nodes No limits ~ nodes No limits ~ nodes No limits No limits Availability Single AZ Configurable Multi-AZ Multi-AZ Multi-AZ Programming languages Java, Python, Scala Any language via Thrift Java, via MultiLang Daemon (.NET, Python, Ruby, Node.js) Node.js, Java, Python AWS SDK languages (Java, .NET, Python, …)
  40. 40. What About ETL? https://aws.amazon.com/big-data/partner-solutions/ ETLSTORE PROCESS / ANALYZE
  41. 41. Amazon SQS apps Streaming Amazon Kinesis Analytics Amazon KCL apps AWS Lambda Amazon Redshift COLLECT STORE CONSUMEPROCESS / ANALYZE Amazon Machine Learning Presto Amazon EMR Amazon Elasticsearch Service Apache Kafka Amazon SQS Amazon Kinesis Streams Amazon Kinesis Firehose Amazon DynamoDB Amazon S3 Amazon ElastiCache Amazon RDS Amazon DynamoDB Streams HotHotWarm FastSlowFast BatchMessageInteractiveStreamML SearchSQLNoSQLCacheFileMessage Stream Amazon EC2 Amazon EC2 Mobile apps Web apps Devices Messaging Message Sensors & IoT platforms AWS IoT Data centers AWS Direct Connect AWS Import/Export Snowball Logging Amazon CloudWatch AWS CloudTrail RECORDS DOCUMENT S FILES MESSAGES STREAMS LoggingIoTApplicationsTransportMessaging ETL
  42. 42. CONSUME
  43. 43. STORE CONSUMEPROCESS / ANALYZE Amazon QuickSight Apps & Services Analysis&visualizationNotebooksIDEAPI Applications & API Analysis and visualization Notebooks IDE Business users Data scientist, developers COLLECT ETL
  44. 44. Putting It All Together
  45. 45. Amazon SQS apps Streaming Amazon Kinesis Analytics Amazon KCL apps AWS Lambda Amazon Redshift COLLECT STORE CONSUMEPROCESS / ANALYZE Amazon Machine Learning Presto Amazon EMR Amazon Elasticsearch Service Apache Kafka Amazon SQS Amazon Kinesis Streams Amazon Kinesis Firehose Amazon DynamoDB Amazon S3 Amazon ElastiCache Amazon RDS Amazon DynamoDB Streams HotHotWarm FastSlowFast BatchMessageInteractiveStreamML SearchSQLNoSQLCacheFileQueueStream Amazon EC2 Amazon EC2 Mobile apps Web apps Devices Messaging Message Sensors & IoT platforms AWS IoT Data centers AWS Direct Connect AWS Import/Export Snowball Logging Amazon CloudWatch AWS CloudTrail RECORDS DOCUMENT S FILES MESSAGES STREAMS Amazon QuickSight Apps & Services Analysis&visualizationNotebooksIDEAPI Reference architecture LoggingIoTApplicationsTransportMessaging ETL
  46. 46. AWS Summit Sanoma Big Data Migration Case Sander Kieft
  47. 47. Sanoma, Publishing and Learning company 27 May 2016 AWSSummit– Sanoma Big Data MIgration47 2+100 2 Finnish newspapers Over 100 magazines inThe Netherland, Belgium and Finland 7 TV channels inFinlandand The Netherlands, incl. on demand platforms 200+ Websites 100 Mobile applications on various mobile platforms 30+ Learning applications
  48. 48. § Users of Sanoma’s websites,mobile applications,online TV products generate large volumes of data § We use this data to improve our products for our users and our advertisers § Use cases: Sanoma, Big Data use cases 27 May 2016 AWSSummit– Sanoma Big Data MIgration48 Dashboards and Reporting Product Improvements Advertising optimization • Reporting on various data sources • Self Service Data Access • Editorial Guidance • A/B testing • Recommenders • Search optimizations • Adaptive learning/tutoring • (Re)Targeting • Attribution • Auction price optimization
  49. 49. Starting point 27 May 2016 AWSSummit– Sanoma Big Data MIgration4949 ETL (Jenkins & Jython) QlikView Reporting Hive Collecting Serving R Studio Stream processing (Storm)Kafka Sources EDW Redis / Druid.io Compute & Storage (Cloudera Dist. Hadoop) AWS cloud Sanoma data center S3 Hadoop User Environment (HUE)
  50. 50. 60+ NODES
  51. 51. 720TB CAPACITY
  52. 52. 475TB DATA STORED (GROSS)
  53. 53. 180TB DATA STORED (NETTO)
  54. 54. 150GB DAILY GROWTH
  55. 55. 50+ DATA SOURCES
  56. 56. 200+ DATA PROCESSES
  57. 57. 3000+ DAILY HADOOP JOBS
  58. 58. 200+ DASHBOARDS
  59. 59. 275+ AVG MONTHLY DASHBOARD USERS
  60. 60. Challenges Positives § Users have been steadily increasing § Demand for .. – more real time processing – faster data availability – higher availability (SLA) – quicker availability of new versions – specialized hardware (GPU’s/SSD’s) – quicker experiments (Fail Fast) Negatives § Data Center network support end of life § Outsourcing of own operations team § Version upgrades harder to manage § Poor job isolation between test, dev, prod and interactive, etl and data science workloads § Higher level of security and access control 27 May 2016 AWSSummit– Sanoma Big Data MIgration61
  61. 61. § Three scenario’s evaluated for moving Amazon Migration Scenario’s 27 May 2016 AWSSummit– Sanoma Big Data MIgration62 S M L EC2 EBS S3 S3EMREC2 EBS § All services on EC2 § Only source dataonS3 § All data on S3 § EMR for Hadoop § EC2 only for utility services not providedby EMR S3EMREC2 RedshiftEBS § All data on S3 § EMR for Hadoop § Interactive querying workload is movedto Redshift insteadofHive
  62. 62. Easier to leverage spot pricing, due to data on S3 § Three scenario’s evaluated for moving Amazon Migration Scenario’s 27 May 2016 AWSSummit– Sanoma Big Data MIgration63 S M L EC2 EBS S3 S3EMREC2 EBS § All services on EC2 § Only source dataonS3 § All data on S3 § EMR for Hadoop § EC2 only for utility services not providedby EMR S3EMREC2 RedshiftEBS § All data on S3 § EMR for Hadoop § Interactive querying workload is movedto Redshift insteadofHive
  63. 63. Target architecture 27 May 2016 AWSSummit– Sanoma Big Data MIgration6464 ETL (Jenkins & Jython) QlikView Reporting R Studio Sources EDW AWS cloud Sanoma data center Amazon RDS S3 S3 Buckets Sources Work Hive Data warehouse EMR Cluster 1 Core instances Task instances Task Spot Instances Amazon EMR HUE Zepp elin Hive Collecting Serving Stream processing (Storm) Kafka Redis / Druid.io
  64. 64. § We’re almost done with the migration. Running in parallel now. § Setup solves our challenges, some still require work § Take your time testing and validating your setup – If you have time, rethink whole setup – No time, move as-is first and then start optimizing Conclusion & Learnings 27 May 2016 AWSSummit– Sanoma Big Data MIgration65 EMR – Check data formats! RC vs ORC/Parquet – Start jobs from master node – Break up long running jobs, shorter independent – Spot pricing & pre-empted nodes /w Spark – HUE and Zeppelin meta data on RDS – Research EMR FS behavior for your use case § S3 – Bucket structure impacts performance – Setup access control, human error will occur – Uploading data takes time. Start early! § Check Snowballor newuploadservice
  65. 65. Design Patterns
  66. 66. Primitive: Multi-Stage Decoupled “Data Bus” Multiple stages Storage decouples multiple processing stages Store Process Store Process process store
  67. 67. Primitive: Multiple Stream Processing Applications Can Read from Amazon Kinesis Amazon Kinesis AWS Lambda Amazon DynamoDB Amazon Kinesis S3 Connector process store Amazon S3
  68. 68. Amazon EMR Primitive: Analysis Frameworks Could Read from Multiple Data Stores Amazon Kinesis AWS Lambda Amazon S3 Amazon DynamoDB Spark Streaming Amazon Kinesis S3 Connector process store Spark SQL
  69. 69. Spark Streaming Apache Storm AWS Lambda KCL apps Amazon Redshift Amazon Redshift Hive Spark Presto Amazon Kinesis Apache Kafka Amazon DynamoDB Amazon S3data Hot Cold Data temperature Processingspeed Fast Slow Answers Hive Native apps KCL apps AWS Lambda Data Temperature vs. Processing Speed
  70. 70. Amazon EMR Real-time Analytics Apache Kafka KCL AWS Lambda Spark Streaming Apache Storm Amazon SNS Amazon ML Notifications Amazon ElastiCache (Redis) Amazon DynamoDB Amazon RDS Amazon ES Alert App state Real-time prediction KPI process store DynamoDB Streams Amazon Kinesis
  71. 71. Message / Event Processing Amazon SQS Amazon SQS App Amazon SQS App Amazon SNS Subscribers Amazon ElastiCache (Redis) Amazon DynamoDB Amazon RDS Amazon ES Publish App state KPI process store Amazon SQS App Amazon SQS App Auto Scaling group Amazon SQSPriority queue Messages / events
  72. 72. Interactive & Batch Analytics Amazon S3 Amazon EMR Hive Pig Spark Amazon ML process store Consume Amazon Redshift Amazon EMR Presto Spark Batch Interactive Batch prediction Real-time prediction Amazon Kinesis Firehose
  73. 73. Batch layer Amazon Kinesis Data stream process store Amazon Kinesis S3 Connector Amazon S3 A p p l i c a t i o n s Amazon Redshift Amazon EMR Presto Hive Pig Spark answer Speed layer answer Serving layer Amazon ElastiCache Amazon DynamoDB Amazon RDS Amazon ES answer Amazon ML KCL AWS Lambda Storm Lambda Architecture Spark Streaming on Amazon EMR
  74. 74. Summary Decoupled “data bus” • Data → Store → Process → Store → Analyze → Answers Use the right tool for the job • Data structure, latency, throughput, access patterns Use Lambda architecture ideas • Immutable (append-only) log, batch/speed/serving layer Leverage AWS managed services • Scalable/elastic, available, reliable, secure, no/low admin Big data ≠ big cost
  75. 75. Thank you! aws.amazon.com/big-data

×