In this session, we simplify big data processing as a data bus comprising various stages: ingest, store, process, and visualize. Next, we discuss how to choose the right technology in each stage based on criteria such as data structure, query latency, cost, request rate, item size, data volume, durability, and so on. Finally, we provide reference architecture, design patterns, and best practices for assembling these technologies to solve your big data problems at the right cost.
2016 Utah Cloud Summit: Big Data Architectural Patterns and Best Practices on AWS
1. Martin Schade, Solutions Architect, Amazon Web Services
January 2016
Big Data Architectural
Patterns and Best Practices
on AWS
2. What to Expect from the Session
Big data challenges
How to simplify big data processing
What technologies should you use?
• Why?
• How?
Reference architecture
Design patterns
5. Plethora of Tools
Amazon
Glacier
S3 DynamoDB
RDS
EMR
Amazon
Redshift
Data PipelineAmazon Kinesis
Cassandra
CloudSearch
Kinesis-
enabled
app
Lambda ML
SQS
ElastiCache
DynamoDB
Streams
6. Is there a reference architecture ?
What tools should I use ?
How ?
Why ?
7. Architectural Principles
• Decoupled “data bus”
• Data → Store → Process → Answers
• Use the right tool for the job
• Data structure, latency, throughput, access patterns, cost
• Use Lambda architecture ideas
• Immutable (append-only) log, batch/speed/serving layer
• Leverage AWS managed services
• No/low admin
• Big data ≠ big cost
8. Simplify Big Data Processing
ingest /
collect
store process /
analyze
consume /
visualize
Time to Answer (Latency)
Throughput
Cost
15. Why Is Amazon S3 Good for Big Data?
• Natively supported by big data frameworks (Spark, Hive, Presto, etc.)
• No need to run compute clusters for storage (unlike HDFS)
• Can run transient Hadoop clusters & Amazon EC2 Spot instances
• Multiple distinct (Spark, Hive, Presto) clusters can use the same data
• Unlimited number of objects
• Very high bandwidth – no aggregate throughput limit
• Highly available – can tolerate AZ failure
• Designed for 99.999999999% durability
• Tired-storage (Standard, IA, Amazon Glacier) via life-cycle policy
• Secure – SSL, client/server-side encryption at rest
• Low cost
16. What about HDFS & Amazon Glacier?
• Use HDFS for very frequently
accessed (hot) data
• Use Amazon S3 Standard for
frequently accessed data
• Use Amazon S3 Standard –
IA for infrequently accessed
data
• Use Amazon Glacier for
archiving cold data
17. Database +
Search
Tier
A
iOS Android
Web Apps
Logstash
Amazon
RDS
Amazon
DynamoDB
Amazon
ES
Amazon
S3
Apache
Kafka
Amazon
Glacier
Amazon
Kinesis
Amazon
DynamoDB
Amazon
ElastiCache
SearchSQLNoSQLCacheStreamStorageFileStorage
Transactional Data
File Data
Stream Data
Mobile
Apps
Search Data
Collect Store
19. Best Practice — Use the Right Tool for the Job
Data Tier
Search
Amazon
Elasticsearch
Service
Amazon
CloudSearch
Cache
Redis
Memcached
SQL
Amazon Aurora
MySQL
PostgreSQL
Oracle
SQL Server
NoSQL
Cassandra
Amazon
DynamoDB
HBase
MongoDB
Database + Search Tier
20. What Data Store Should I Use?
• Data structure → Fixed schema, JSON, key-value
• Access patterns → Store data in the format you will
access it
• Data / access characteristics → Hot, warm, cold
• Cost → Right cost
21. Data Structure and Access Patterns
Access Patterns What to use?
Put/Get (Key, Value) Cache, NoSQL
Simple relationships → 1:N, M:N NoSQL
Cross table joins, transaction, SQL SQL
Faceting, Search Search
Data Structure What to use?
Fixed schema SQL, NoSQL
Schema-free (JSON) NoSQL, Search
(Key, Value) Cache, NoSQL
24. Process / Analyze
Analysis of data is a process of inspecting, cleaning,
transforming, and modeling data with the goal of discovering
useful information, suggesting conclusions, and supporting
decision-making.
Examples
• Interactive dashboards → Interactive analytics
• Daily/weekly/monthly reports → Batch analytics
• Billing/fraud alerts, 1 minute metrics → Real-time analytics
• Sentiment analysis, prediction models → Machine learning
26. What Stream Processing Technology Should I Use?
Spark Streaming Apache Storm Amazon Kinesis
Client Library
AWS Lambda Amazon EMR (Hive,
Pig)
Scale /
Throughput
~ Nodes ~ Nodes ~ Nodes Automatic ~ Nodes
Batch or Real-
time
Real-time Real-time Real-time Real-time Batch
Manageability Yes (Amazon EMR) Do it yourself Amazon EC2 +
Auto Scaling
AWS managed Yes (Amazon EMR)
Fault Tolerance Single AZ Configurable Multi-AZ Multi-AZ Single AZ
Programming
languages
Java, Python, Scala Any language
via Thrift
Java, via
MultiLangDaemon (
.Net, Python, Ruby,
Node.js)
Node.js, Java Hive, Pig, Streaming
languages
High
27. What Data Processing Technology Should I Use?
Amazon
Redshift
Impala Presto Spark Hive
Query
Latency
Low Low Low Low Medium (Tez) –
High (MapReduce)
Durability High High High High High
Data Volume 2 PB
Max
~Nodes ~Nodes ~Nodes ~Nodes
Managed Yes Yes (EMR) Yes (EMR) Yes (EMR) Yes (EMR)
Storage Native HDFS / S3A* HDFS / S3 HDFS / S3 HDFS / S3
SQL
Compatibility
High Medium High Low (SparkSQL) Medium (HQL)
HighMedium
28. What about ETL?
Store Analyze
https://aws.amazon.com/big-data/partner-solutions/
ETL
30. Collect Store Analyze Consume
A
iOS Android
Web Apps
Logstash
Amazon
RDS
Amazon
DynamoDB
Amazon
ES
Amazon
S3
Apache
Kafka
Amazon
Glacier
Amazon
Kinesis
Amazon
DynamoDB
Amazon
Redshift
Impala
Pig
Amazon ML
Streaming
Amazon
Kinesis
AWS
Lambda
AmazonElasticMapReduce
Amazon
ElastiCache
SearchSQLNoSQLCache
StreamProcessingBatchInteractive
Logging
StreamStorage
IoTApplications
FileStorage
Analysis&Visualization
Hot
Cold
Warm
Hot
Slow
Hot
ML
Fast
Fast
Transactional Data
File Data
Stream Data
Notebooks
Predictions
Apps & APIs
Mobile
Apps
IDE
Search Data
ETL
Amazon
QuickSight
31. Consume
• Predictions
• Analysis and Visualization
• Notebooks
• IDE
• Applications & API
Consume
Analysis&Visualization
Amazon
QuickSight
Notebooks
Predictions
Apps & APIs
IDE
Store Analyze ConsumeETL
Business
users
Data Scientist,
Developers
33. Collect Store Analyze Consume
A
iOS Android
Web Apps
Logstash
Amazon
RDS
Amazon
DynamoDB
Amazon
ES
Amazon
S3
Apache
Kafka
Amazon
Glacier
Amazon
Kinesis
Amazon
DynamoDB
Amazon
Redshift
Impala
Pig
Amazon ML
Streaming
Amazon
Kinesis
AWS
Lambda
AmazonElasticMapReduce
Amazon
ElastiCache
SearchSQLNoSQLCache
StreamProcessingBatchInteractive
Logging
StreamStorage
IoTApplications
FileStorage
Analysis&Visualization
Hot
Cold
Warm
Hot
Slow
Hot
ML
Fast
Fast
Amazon
QuickSight
Transactional Data
File Data
Stream Data
Notebooks
Predictions
Apps & APIs
Mobile
Apps
IDE
Search Data
ETL
Reference Architecture
36. Multi-Stage Decoupled “Data Bus”
• Multiple stages
• Storage decoupled from processing
Store Process Store Process
process
store
37. Multiple Processing Applications (or
Connectors) Can Read from or Write to Multiple
Data Stores
Amazon
Kinesis
AWS
Lambda
Amazon
DynamoDB
Amazon
Kinesis S3
Connector
Amazon S3
process
store
38. Processing Frameworks (KCL, Storm, Hive,
Spark, etc.) Could Read from Multiple Data
Stores
Amazon
Kinesis
AWS
Lambda
Amazon
S3
Amazon
DynamoDB
Hive SparkStorm
Amazon
Kinesis S3
Connector
process
store
41. Batch Layer
Amazon
Kinesis
data
process
store
Lambda Architecture
Amazon
Kinesis S3
Connector
Amazon S3
A
p
p
l
i
c
a
t
i
o
n
s
Amazon
Redshift
Amazon EMR
Presto
Hive
Pig
Spark
answer
Speed Layer
answer
Serving
Layer
Amazon
ElastiCache
Amazon
DynamoDB
Amazon
RDS
Amazon
ES
answer
Amazon
ML
KCL
AWS Lambda
Spark Streaming
Storm
42. Summary
• Build decoupled “data bus”
• Data → Store ↔ Process → Answers
• Use the right tool for the job
• Latency, throughput, access patterns
• Use Lambda architecture ideas
• Immutable (append-only) log, batch/speed/serving layer
• Leverage AWS managed services
• No/low admin
• Be cost conscious
• Big data ≠ big cost
Wednesday, Oct 7
2:45 PM - 3:45 PM
Marcello 4501B
Siva Raghupathy leads the Americas Big Data Solutions Architecture team at AWS. He guides developers and architects to build successful big data solutions on AWS. Previously, as a principal technical program manager for AWS database service, he gathered emerging NoSQL requirements and wrote the first version of DynamoDB product specification. Later, as a development manager for Amazon Relational Database Services (RDS) he drove several enhancements. Prior to AWS, he spent several years at Microsoft.
Abstract:
The world is producing an ever increasing volume, velocity, and variety of big data. Consumers and businesses are demanding up-to-the-second (or even millisecond) analytics on their fast-moving data, in addition to classic batch processing. AWS delivers many services for solving big data problems. But what service should you use, why, when, and how? In this session, we simplify big data processing as a data bus comprising various stages: ingest, store, process, and visualize. Next, we discuss how to choose the right technology in each stage based on criteria such as data structure, query latency, cost, request rate, item size, data volume, durability, and so on. Finally, we provide reference architecture, design patterns, and best practices for assembling these technologies to solve your big data problems at the right cost.
Abstract:
The world is producing an ever increasing volume, velocity, and variety of big data. Consumers and businesses are demanding up-to-the-second (or even millisecond) analytics on their fast-moving data, in addition to classic batch processing. AWS delivers many services for solving big data problems. But what service should you use, why, when, and how? In this session, we simplify big data processing as a data bus comprising various stages: ingest, store, process, and visualize. Next, we discuss how to choose the right technology in each stage based on criteria such as data structure, query latency, cost, request rate, item size, data volume, durability, and so on. Finally, we provide reference architecture, design patterns, and best practices for assembling these technologies to solve your big data problems at the right cost.
Volume – 100 to 150TB a day
Velocity – 1million reads and writes per second is becoming a norm
Variety ->
IOT/log data/ streaming data
Transactional data
File data
Fixed Schema
CSV
Parquet
Avero
Schema-free
JSON
Key-value
Small files, large files,
Hourly server logs: were your systems misbehaving 1hr ago
Weekly / Monthly Bill: what you spent this billing cycle
Daily customer-preferences report from your web site’s click stream: what deal or ad to try next time
Daily fraud reports: was there fraud yesterday
Real-time alerts: what went wrong now
Real-time spending caps: prevent overspending now
Real-time analysis: what to offer the current customer now
Real-time detection: block fraudulent use now
I need to harness big data, fast
I want more happy customers
I want to save/make more money
Before we go into solving the Big architecture, I want to introduce some “tried and test” architecture principles.
Here at AWS we believe you should be using the right tool for the job – “instead of using a big swiss army knife for using a screw dreive, it will be best to use a screw drive - this is especially important for big data architectures. We’ll talk about this more.
Decoupled architecture http://whatis.techtarget.com/definition/decoupled-architecture - In general, a decoupled architecture is a framework for complex work that allows components to remain completely autonomous and unaware of each other…this has been tried and battle test.
Managed services – this is relatively now - Should I install Cassandra or MongoDB or CouchDB on AWS. You obviously can. Sometimes there are good reasons for doing this. Many customers still do this. Netflix is a great example. They run a multi-region Cassandra and are a poster child for how to do this. But for most customers, delegating this task to AWS makes more sense….you are better of spending your time on building features for your customers rather than building highly scalable distributed systems.
Lambda Architecture - Nathan Marz designed this generic architecture addressing common requirements Twitter
throughput = f (volume, request rate)
latency
Cost
Event to action/answer latency?
http://calculator.s3.amazonaws.com/index.html#r=IAD&key=calc-BE3BA3E4-1AC5-4E7A-B542-015056D8EDAF
Kinesis -> $52.14 per month
SQS -> $133.42 per month for puts or $400/month (put, get, delete)
DynamoDB -> $3809.88 per month (10TB of storage cost itself is $2500/month)
Cost (100rpsx 35KB)
$52/month
$133/month * 2 = $266/month
?
Amazon DynamoDB Service (US-East)
$
Provisioned Throughput Capacity:
$120
Indexed Data Storage:
$2560.90
DynamoDB Streams:
$1.3
Amazon SQS Service (US-East)
Pricing Example
Let’s assume that our data producers put 100 records per second in aggregate, and each record is 35KB. In this case, the total data input rate is 3.4MB/sec (100 records/sec*35KB/record). For simplicity, we assume that the throughput and data size of each record are stable and constant throughout the day. Please note that we can dynamically adjust the throughput of our Amazon Kinesis stream at any time.
We first calculate the number of shards needed for our stream to achieve the required throughput. As one shard provides a capacity of 1MB/sec data input and supports 1000 records/sec, four shards provide a capacity of 4MB/sec data input and support 4000 records/sec. So a stream with four shards satisfies our required throughput of 3.4MB/sec at 100 records/sec.
We then calculate our monthly Amazon Kinesis costs using Amazon Kinesis pricing in the US-East Region:
Shard Hour: One shard costs $0.015 per hour, or $0.36 per day ($0.015*24). Our stream has four shards so that it costs $1.44 per day ($0.36*4). For a month with 31 days, our monthly Shard Hour cost is $44.64 ($1.44*31).
PUT Payload Unit (25KB): As our record is 35KB, each record contains two PUT Payload Units. Our data producers put 100 records or 200 PUT Payload Units per second in aggregate. That is 267,840,000 records or 535,680,000 PUT Payload Units per month. As one million PUT Payload Units cost $0.014, our monthly PUT Payload Units cost is $7.499 ($0.014*535.68).
Adding the Shard Hour and PUT Payload Unit costs together, our total Amazon Kinesis costs are $1.68 per day, or $52.14 per month. For $1.68 per day, we have a fully-managed streaming data infrastructure that enables us to continuously ingest 4MB of data per second, or 337GB of data per day in a reliable and elastic manner.
No need to run compute clusters for storage (unlike HDFS)
Can run transient Hadoop clusters & Amazon EC2 spot Instances
Multiple distinct (Spark, Hive, Presto) clusters can use the same data
Highly durable, highly available, highly scalable
Secure
Designed for 99.999999999% durability
https://aws.amazon.com/s3/storage-classes/
Lifecycle Policies - migrated to S3-Standard - IA, archive to Amazon Glacier, or deleted after a specific period of time!
No need to run a Hadoop cluster for storage (unlike HDFS)
Unlimited number of objects
Object size up to 5TB
Very high bandwidth
Versioning
Lifecycle Policies to Achieve to Glacier
Designed for 99.999999999% durability
Server size encryption
Highly available – SLA 99.99 (Standard)
Highly scalable
Low cost
Event notification
Cross-region replication
Natively supported by almost all big data frameworks & tools
2 x 2 Matrix
Structured
Level of query (from none to complex)
Draw down the slide
Analysis of data is a process of inspecting, cleaning, transforming, and modeling data with the goal of discovering useful information, suggesting conclusions, and supporting decision-making.
Examples:
Data Warehousing: Monthly bill, Interactive dashboards
Machine Learning: Sentiment Analysis, Prediction
Stream processing: Alerts, 1 minute metrics, etc.
http://www.jmlr.org/papers/volume9/li08a/li08a.pdf
Forecasting Web Page Views:
http://nlp.stanford.edu/sentiment/
http://nlp.stanford.edu/sentiment/Methods and Observations
https://en.wikipedia.org/wiki/Data_analysis
Add connector
Direct Acyclic Graphs?
Exactly once processing & DAG? – how do you do this??
https://storm.apache.org/documentation/Rationale.html
http://www.slideshare.net/ptgoetz/apache-storm-vs-spark-streaming
Cost:
Redshift – Moderate
Impala -
Presto – Low
S3A* is an open source connector. It is not in EMR 1.2.1 – using bootstrap you can install 2.2 ( we have a bootstrap action)
Query Speed
Redshift – Extremely fast SQL queries
Spark, Impala – Extremely Fast to Fast Hive QL
Hive, Tez – Moderately Fast to Slow Hive QL
Data Volume?
UDFs?
Manageability?
http://yahoodevelopers.tumblr.com/post/85930551108/yahoo-betting-on-apache-hive-tez-and-yarn
https://amplab.cs.berkeley.edu/benchmark/
http://nerds.airbnb.com/redshift-performance-cost/
Similar to multi-tier web-app-data architectures
Concept of a “data bus” or “data pipeline”
Best practices
BDT318 - Netflix Keystone: How Netflix Handles Data Streams Up to 8 Million Events Per Second
Use Kinesis or Kafka for stream storage
Use appropriate stream processing tool
KCL – Simple, only for Kinesis
Apache Storm – general purpose
Spark Streaming – Windowing, MLlib, Statefull
AWS Lambda – fully managed, no servers, stateless
Use Amazon SNS for Alerts
Use Amazon ML for predictions
Use a managed database/cache for state management
Use a visualization tool for KPI visualization
Lambda Architecture best practices
Use Kinesis or Kafka for stream storage
Maintain an immutable (append only) raw master data in Amazon S3 - use Amazon Kinesis S3 connector or Firehose
Use appropriate batch processing technologies for creating batch views
Use appropriate stream processing technology for real-time views
Use a serving layer with an appropriate database to serve downstream applications
Before we go into solving the Big architecture, I want to introduce some “tried and test” architecture principles.
Here at AWS we believe you should be using the right tool for the job – “instead of using a big swiss army knife for using a screw dreive, it will be best to use a screw drive - this is especially important for big data architectures. We’ll talk about this more.
Decoupled architecture http://whatis.techtarget.com/definition/decoupled-architecture - In general, a decoupled architecture is a framework for complex work that allows components to remain completely autonomous and unaware of each other…this has been tried and battle test.
Managed services – this is relatively now - Should I install Cassandra or MongoDB or CouchDB on AWS. You obviously can. Sometimes there are good reasons for doing this. Many customers still do this. Netflix is a great example. They run a multi-region Cassandra and are a poster child for how to do this. But for most customers, delegating this task to AWS makes more sense….you are better of spending your time on building features for your customers rather than building highly scalable distributed systems.
Lambda Architecture -