The world is producing an ever-increasing volume, velocity, and variety of big data. Consumers and businesses demand up-to-the-second (or even millisecond) analytics on their fast-moving data, in addition to classic batch processing. AWS delivers many technologies for solving big data problems. But which services should you use, why, when, and how? In this session, we simplify big data processing as a data bus comprising various stages: ingest, store, process, and visualize. Next, we discuss how to choose the right technology in each stage based on criteria such as data structure, query latency, cost, request rate, item size, data volume, and durability. Finally, we provide reference architectures, design patterns, and best practices for assembling these technologies to solve your big data problems at the right cost.
3. Powerful Services, Open Ecosystem
AWS services in the portfolio: Amazon S3, Amazon Glacier, Amazon DynamoDB, Amazon DynamoDB Streams, Amazon RDS, Amazon EMR, Amazon Redshift, AWS Data Pipeline, Amazon Kinesis, Kinesis-enabled apps, AWS Lambda, Amazon ML, Amazon SQS, Amazon ElastiCache, and Amazon Elasticsearch Service, all part of the Amazon Web Services open source & commercial ecosystem.
4. Architectural Principles
• Separation of compute & storage: event driven, loose coupling, independent scaling & sizing
• "Data sympathy" - use the right tool for the job: data structure, latency, throughput, access patterns
• Managed services, open technologies: scalable/elastic, available, reliable, secure, no/low admin
• Focus on the data and your business, not on undifferentiated tasks
5. Big Data != Big Cost
Aim to reduce the cost of experimentation to near 0
6. Simplify Big Data Processing
COLLECT → STORE → PROCESS/ANALYZE → CONSUME
Evaluate each stage against time to answer (latency), throughput, and cost.
7. Collect
Sources and data types:
• Devices: sensors & IoT platforms → AWS IoT (STREAMS)
• Mobile apps, web apps, applications → RECORDS
• Data centers → AWS Direct Connect, AWS Import/Export Snowball (FILES)
• Logging → Amazon CloudWatch, AWS CloudTrail (DOCUMENTS, FILES)
• Messaging → message queues (MESSAGES)
Collection tools:
• AWS Mobile SDK, Amazon Cognito Sync, AWS Mobile Analytics: easily capture mobile data through sync and server logging
• Amazon CloudWatch Logs Agent, Amazon Kinesis Agent: file-based integration to streaming storage
• AWS IoT, Amazon Kinesis Producer Library: API integration via SDKs and ecosystem tools
8. Principles for Collection Architecture
• Should be easy to install, manage, and understand
• Meet requirements for data durability & recovery
• Ensure the ability to meet latency & performance requirements
• Data handling should be sympathetic to the type of data
9. Simplify Big Data Processing
COLLECT → STORE → PROCESS/ANALYZE → CONSUME
Evaluate each stage against time to answer (latency), throughput, and cost.
10. Types of Data
Data arrives as STREAMS (devices, sensors & IoT platforms via AWS IoT), RECORDS (mobile apps, web apps), FILES (data centers via AWS Direct Connect and AWS Import/Export Snowball), DOCUMENTS (logging via Amazon CloudWatch and AWS CloudTrail), and MESSAGES (messaging systems). Match each to a store type:
• Cache: heavily read, application-defined structure
• Database: consistently read, relational or document data
• Search: frequently read, often requiring natural language processing or geospatial access
• File store: data received as files, or where object storage is appropriate
• Queue: data intended for a single recipient, with low latency
• Stream storage: data intended for many recipients, with low latency
11. Collect → Store: Service Mapping
• Stream storage: Amazon Kinesis Firehose, Amazon Kinesis Streams, Amazon DynamoDB Streams
• Queue: Amazon SQS
• File store: Amazon S3
• Cache: Amazon ElastiCache
• SQL: Amazon RDS
• NoSQL: Amazon DynamoDB
• Search: Amazon Elasticsearch Service
13. Myriad of Applications
Best practice - use the right tool for the job. Data tier / database tier options:
• Search: Amazon Elasticsearch Service, Amazon CloudSearch
• Cache: Amazon ElastiCache (Redis, Memcached)
• SQL: Amazon Aurora, Amazon RDS, Amazon Redshift
• NoSQL: Amazon DynamoDB, Cassandra, HBase, MongoDB
14. Collect → Store (continued)
The same service mapping, with the queue (Amazon SQS), stream storage (Amazon Kinesis Firehose, Amazon Kinesis Streams, Amazon DynamoDB Streams), and file store (Amazon S3) tiers highlighted alongside Amazon ElastiCache, Amazon RDS, Amazon DynamoDB, and Amazon Elasticsearch Service.
15. Why Is Amazon S3 Good for Big Data?
• Virtually unlimited storage, unlimited object count
• Very high bandwidth - no aggregate throughput limit
• Highly available - can tolerate AZ failure
• Designed for 99.999999999% durability
• Tiered storage (Standard, Standard-IA, Amazon Glacier) via lifecycle policies
• Native support for versioning
• Secure - SSL in transit, client/server-side encryption at rest
• Low cost
16. Why Is Amazon S3 Good for Big Data? (continued)
• Natively supported by big data frameworks (Spark, Hive, Presto, etc.)
• No need to run compute clusters for storage (unlike HDFS)
• No need to pay for data replication
• Can run transient Hadoop clusters & Amazon EC2 Spot Instances (sketched below)
• Multiple, heterogeneous analysis clusters can use the same data
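Because the data stays in S3, a cluster can exist only for the duration of a job. A minimal boto3 sketch of a transient EMR cluster on Spot Instances, under stated assumptions: the cluster name, instance types, Spot bid, and S3 script path are all hypothetical, and the EMR release label is merely typical of this talk's era.

```python
# A hedged sketch: launch a transient EMR cluster on Spot Instances that runs one
# Spark step against data in S3, then terminates. Names and sizes are hypothetical.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="transient-spark-job",                      # hypothetical cluster name
    ReleaseLabel="emr-4.2.0",                        # release typical of this era
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"Name": "master", "InstanceRole": "MASTER",
             "InstanceType": "m3.xlarge", "InstanceCount": 1},
            {"Name": "core", "InstanceRole": "CORE", "Market": "SPOT",
             "BidPrice": "0.10",                     # hypothetical Spot bid
             "InstanceType": "m3.xlarge", "InstanceCount": 4},
        ],
        # Transient: the cluster shuts itself down when the step finishes
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    Steps=[{
        "Name": "spark-etl",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://my-bucket/jobs/etl.py"],  # hypothetical path
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("Started cluster:", response["JobFlowId"])
```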
17. What About HDFS & Data Tiering (S3-IA & Amazon Glacier)?
• Use HDFS for very frequently accessed (hot) data
• Use Amazon S3 Standard as the system of record, for durability
• Use Amazon S3 Standard-IA for nearline archive data
• Use Amazon Glacier for cold/offline archive data
A lifecycle-policy sketch for this tiering follows.
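The tiering above can be automated with an S3 lifecycle policy (slide 15). A minimal boto3 sketch; the bucket name, key prefix, and day thresholds are hypothetical choices, not from the talk.

```python
# A hedged sketch: lifecycle rule tiering objects Standard -> Standard-IA -> Glacier.
# Bucket, prefix, and transition days are hypothetical.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="my-data-lake",                            # hypothetical bucket
    LifecycleConfiguration={
        "Rules": [{
            "ID": "tier-logs",
            "Filter": {"Prefix": "logs/"},            # apply only to this key prefix
            "Status": "Enabled",
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},  # nearline after 30 days
                {"Days": 90, "StorageClass": "GLACIER"},      # cold after 90 days
            ],
        }]
    },
)
```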
18. Message & Stream Data
• Amazon SQS: managed message queue service
• Amazon Kinesis Streams: managed stream storage + processing
• Amazon Kinesis Firehose: managed data delivery
• Amazon DynamoDB Streams: managed NoSQL database whose tables can be stream-enabled
These services sit on the COLLECT → STORE boundary: messages land in the queue and streams land in stream storage, alongside the file store (Amazon S3), cache (Amazon ElastiCache), SQL (Amazon RDS), NoSQL (Amazon DynamoDB), and search (Amazon Elasticsearch Service) tiers. A producer sketch follows.
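At scale, producers usually go through the Kinesis Producer Library or Kinesis Agent (slide 7); a minimal boto3 sketch of writing one record to a stream, where the stream name and payload are hypothetical:

```python
# A hedged sketch: put a single record into a Kinesis stream. Production producers
# would typically use the Kinesis Producer Library or Kinesis Agent instead.
import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

event = {"device_id": "sensor-42", "temp_c": 21.5}   # hypothetical payload
kinesis.put_record(
    StreamName="clickstream",                        # hypothetical stream name
    Data=json.dumps(event).encode("utf-8"),
    PartitionKey=event["device_id"],                 # same key -> same shard, ordered
)
```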
20. What Data Store Should I Use?
• Data structure → fixed schema, JSON, key-value
• Access patterns → store data in the format you will access it in
• Data / access characteristics → hot, warm, cold
• Cost → the right cost for the workload
21. Data Structure and Access Patterns
Data structure → what to use:
• Fixed schema → SQL, NoSQL
• Schema-free (JSON) → NoSQL, Search
• (Key, value) → Cache, NoSQL
Access patterns → what to use (a Put/Get sketch follows this list):
• Put/Get (key, value) → Cache, NoSQL
• Simple relationships (1:N, M:N) → NoSQL
• Cross-table joins, transactions, SQL → SQL
• Faceting, natural language, geo → Search
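For the Put/Get (key, value) pattern, a minimal boto3 DynamoDB sketch; the table and attribute names are hypothetical.

```python
# A hedged sketch of the Put/Get (key, value) access pattern on DynamoDB.
# Table name and attributes are hypothetical.
import boto3

table = boto3.resource("dynamodb", region_name="us-east-1").Table("user-profiles")

# Put: write an item keyed by user_id
table.put_item(Item={"user_id": "u-123", "plan": "pro", "visits": 7})

# Get: read it back by key (the single-digit-ms pattern the slide 24 table assumes)
item = table.get_item(Key={"user_id": "u-123"})["Item"]
print(item["plan"])
```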
22. What Is the Temperature of Your Data / Access?
23. Data / Access Characteristics: Hot, Warm, Cold

              Hot data    Warm data   Cold data
Volume        MB–GB       GB–TB       PB
Item size     B–KB        KB–MB       KB–TB
Latency       ms          ms, sec     min, hrs
Durability    Low–high    High        Very high
Request rate  Very high   High–med    Low
Cost/GB       $$–$        $–¢¢        ¢
24. What Data Store Should I Use? (hot data on the left, cold data on the right)

                     ElastiCache      DynamoDB          RDS/Aurora         Elasticsearch    S3                Glacier
Average latency      ms               ms                ms, sec            ms, sec          ms, sec, min      hrs
                                                                                            (~ size)
Typical data stored  GB               GB–TB (no limit)  GB–TB (64 TB max)  GB–TB            MB–PB (no limit)  GB–PB (no limit)
Typical item size    B–KB             KB (400 KB max)   KB (64 KB max)     KB (2 GB max)    KB–TB (5 TB max)  GB (40 TB max)
Request rate         High–very high   Very high         High               High             Low–high          Very low
                                      (no limit)                                            (no limit)
Storage cost/GB/mo   $$               ¢¢                ¢¢                 ¢¢               ¢                 ¢/10
Durability           Low–moderate     Very high         Very high          High             Very high         Very high
Availability         High (2 AZ)      Very high (3 AZ)  Very high (3 AZ)   High (2 AZ)      Very high (3 AZ)  Very high (3 AZ)
25. Simplify Big Data Processing
COLLECT → STORE → PROCESS/ANALYZE → CONSUME
Evaluate each stage against time to answer (latency), throughput, and cost.
26. Collect → Store → Process/Analyze
The store tier is annotated hot to warm: stream storage (Amazon Kinesis Firehose, Amazon Kinesis Streams, Amazon DynamoDB Streams) and the message queue (Amazon SQS) hold hot data, while the cache, file, NoSQL, SQL, and search tiers (Amazon ElastiCache, Amazon S3, Amazon DynamoDB, Amazon RDS, Amazon Elasticsearch Service) span hot to warm, all feeding the PROCESS / ANALYZE stage.
27. Tools and Frameworks
Process / analyze modes, from slow to fast:
• Batch: minutes or hours on cold data - daily/weekly/monthly reports (Amazon EMR, Amazon Redshift)
• Interactive: seconds on warm/cold data - self-service dashboards (Presto on Amazon EMR, Amazon Redshift)
• Messaging: milliseconds or seconds on hot data - message/event buffering (Amazon SQS apps on Amazon EC2)
• Streaming: milliseconds or seconds on hot data - billing/fraud alerts, 1-minute metrics (Amazon Kinesis Analytics, Amazon KCL apps, AWS Lambda; see the Lambda sketch below)
• ML: Amazon Machine Learning
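For the streaming row, a minimal AWS Lambda handler sketch consuming a Kinesis event batch. Kinesis delivers record payloads base64-encoded inside the Lambda event; the fraud threshold and field names are hypothetical.

```python
# A hedged sketch of a Lambda function processing a Kinesis Streams event batch,
# e.g. for fraud alerts or 1-minute metrics. The alerting logic is hypothetical.
import base64
import json

def handler(event, context):
    suspicious = 0
    for record in event["Records"]:
        # Kinesis record payloads arrive base64-encoded
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        if payload.get("amount", 0) > 10000:          # hypothetical fraud threshold
            suspicious += 1
            print("ALERT:", payload)                  # lands in CloudWatch Logs
    return {"records": len(event["Records"]), "alerts": suspicious}
```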
29. What Analytics Technology Should I Use?

                   Amazon Redshift   Presto (EMR)   Spark (EMR)       Hive (EMR)
Query latency      Low               Low            Low               High
Durability         High              High           High              High
Data volume        1.6 PB max        ~ nodes        ~ nodes           ~ nodes
AWS managed        Yes               Yes            Yes               Yes
Storage            Native            HDFS / S3      HDFS / S3         HDFS / S3
SQL compatibility  High              High           Low (Spark SQL)   Medium (HQL)
30. What Streaming Analysis Technology Should I Use?
• Spark Streaming: micro-batch; scale ~ nodes, no limits; AWS managed (EMR); single-AZ availability; Java, Python, Scala
• Apache Storm: real-time; scale ~ nodes, no limits; not managed (EC2); availability configurable; any language via Thrift
• Amazon Kinesis KCL application: near-real-time; scale ~ nodes, no limits; not managed (KCL + EC2 + Auto Scaling); multi-AZ; Java, plus .NET, Python, Ruby, and Node.js via the MultiLang Daemon
• AWS Lambda: near-real-time; scales automatically, no limits; AWS managed; multi-AZ; Node.js, Java, Python
• Amazon SQS apps: near-real-time; scale ~ nodes, no limits; not managed (EC2 + Auto Scaling); multi-AZ; AWS SDK languages (Java, .NET, Python, …)
31. Simplify Big Data Processing
COLLECT → STORE → PROCESS/ANALYZE → CONSUME
Evaluate each stage against time to answer (latency), throughput, and cost.
33. Consume
CONSUME closes the pipeline (COLLECT → ETL → STORE → PROCESS/ANALYZE → CONSUME):
• Analysis & visualization: Amazon QuickSight for reporting, exploratory and ad hoc analysis, and data visualization (business users)
• Notebooks: data science (data scientists, developers)
• IDE, apps & services: applications & APIs
39. Combining Primitives: Event Processing Network
Combines data bus and pub/sub patterns to create asynchronous, continuously executing transformations: chains of store → process → store stages, in which one process stage can fan out to several downstream stores and processes.
41. Real-Time & Batch Architecture
Amazon Kinesis feeds two paths:
• Real-time processing: KCL apps, AWS Lambda, Storm, or Spark Streaming on Amazon EMR (optionally scoring with Amazon ML) write to real-time storage (Amazon ElastiCache, Amazon DynamoDB, Amazon RDS, Amazon ES) that serves applications
• Batch analytics: the same stream is persisted to Amazon S3 and analyzed with Amazon Redshift or Amazon EMR (Presto, Hive, Pig, Spark)
A minimal consumer sketch for the real-time path follows.
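On the real-time path, the KCL handles shard leases, checkpointing, and failover; as a minimal stand-in, a bare boto3 loop over a single shard. The stream name and handler are hypothetical.

```python
# A hedged sketch of the real-time path: a bare-bones consumer reading one Kinesis
# shard with boto3. The KCL does this properly (leases, checkpoints, multi-shard).
import time
import boto3

def process(data):
    # Hypothetical handler: in this architecture it would update the real-time
    # store (ElastiCache / DynamoDB / RDS / Amazon ES)
    print(data)

kinesis = boto3.client("kinesis", region_name="us-east-1")
stream = "clickstream"                                # hypothetical stream name

shard_id = kinesis.describe_stream(
    StreamName=stream)["StreamDescription"]["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName=stream, ShardId=shard_id, ShardIteratorType="LATEST"
)["ShardIterator"]

while True:
    out = kinesis.get_records(ShardIterator=iterator, Limit=100)
    for record in out["Records"]:
        process(record["Data"])
    iterator = out["NextShardIterator"]
    time.sleep(1)                                     # stay under per-shard read limits
```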
43. About Adbrain
• Adbrain is a market leader in probabilistic identity solutions
• At the core of our business, we build an identity graph representing the connected devices on the planet
• We enable our clients to understand and target their audience better
44. What Does That Mean?
• Ingesting large amounts of log data from different data sources
• Building a graph using statistical and machine learning algorithms
• Extracting a relevant user graph for our clients
45. Our Main Scenarios
• Production graph generation and serving our clients
• Data science team running computations (Spark)
• R&D work with ML
• Data feed evaluation
• Graph analytics
• Data engineering development workloads
46. Our Original Setup
• HDFS cluster of 18 nodes holding most of our hot data
• Not leveraging co-location of data
  • Spark clusters are separate
  • Spark instance types optimized per workload
• Custom cluster-provisioning solution to run Spark jobs
  • Built in-house
  • Required maintenance
48. What Are the Issues?
• "Hadoop is a cheap big data solution on commodity hardware" … define cheap for a startup
• Scaling Hadoop means more focus on infrastructure
• Data grows really quickly (adding a new country, data feed, etc.)
49. What do we need?
• Better scale
• Cheaper solution
• Less focus on infrastructure
• Reduced risk
50. EMR and S3 to the Rescue!
• S3 is significantly cheaper than HDFS
  • We use Infrequent Access and Glacier storage as well
• EMR is a fully managed solution
  • Up-to-date Spark distributions
  • Support for Spot Instances
  • Managed Hadoop when / if needed
52. Some Technical Challenges
• Spark 1.3.1 did not work well with S3 and Parquet files
  • Writing to S3 using the default ParquetOutputCommitter is slow
  • Calculating output summary metadata is slow
• S3 is not consistent by default (it's not a file system)
• S3 can be slower than HDFS
53. Solutions
• Upgrade to Spark 1.5.1+
• Use DirectParquetOutputCommitter with speculation disabled
• Set parquet.enable.summary-metadata = false
• Enable S3 consistent view in EMR! (And remember the DynamoDB tables it creates!)
• In a multi-step Spark workflow, use HDFS on the Spark cluster for intermediate data
A configuration sketch for these settings follows.
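A hedged configuration sketch for a Spark 1.5-era PySpark job. The config keys are the ones the slide names, as they existed in Spark 1.5; the app name and S3 paths are hypothetical.

```python
# A minimal sketch, assuming Spark 1.5.x on EMR: apply the slide's three fixes for
# writing Parquet to S3. Paths and app name are hypothetical.
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

conf = (SparkConf()
        .setAppName("s3-parquet-writer")
        # Direct committer writes straight to S3, skipping the slow rename step
        .set("spark.sql.parquet.output.committer.class",
             "org.apache.spark.sql.parquet.DirectParquetOutputCommitter")
        # Speculation must be off: duplicate attempts would race on direct writes
        .set("spark.speculation", "false"))

sc = SparkContext(conf=conf)
# Skip Parquet summary metadata, which requires reading footers back from S3
sc._jsc.hadoopConfiguration().set("parquet.enable.summary-metadata", "false")

sqlContext = SQLContext(sc)
df = sqlContext.read.parquet("s3://my-bucket/input/")    # hypothetical path
df.write.parquet("s3://my-bucket/output/")               # hypothetical path
```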
54. The End Result
• Decreased costs by 70%+
• Focus on development instead of commodities (hardware, versioning, patching)
• Eliminated risks