In this session, you will learn best practices for implementing simple to advanced real-time streaming data use cases on AWS. First, we will review decision points for near-real-time versus real-time scenarios. Next, we will look at streaming data architecture patterns that include Amazon Kinesis Analytics, Amazon Kinesis Firehose, Amazon Kinesis Streams, Spark Streaming on Amazon EMR, and other open source libraries. Finally, we will dive deep into the most common of these patterns and cover design and implementation considerations.
2. It’s all about the pace
Batch processing:
• Hourly server logs
• Weekly or monthly bills
• Daily website clickstream
• Daily fraud reports
Stream processing:
• Real-time metrics
• Real-time spending alerts/caps
• Real-time clickstream analysis
• Real-time fraud detection
3. Simple pattern for streaming data
Data producer:
• Continuously creates data
• Continuously writes data to a stream
• Can be almost anything
Streaming storage:
• Durably stores data
• Provides a temporary buffer
• Supports very high throughput
Data consumer:
• Continuously processes data
• Cleans, prepares, & aggregates
• Transforms data to information
Example: mobile client → Amazon Kinesis → Amazon Kinesis app
5. Amazon Kinesis: Streaming data made easy
Services that make it easy to capture, deliver, and process streams on AWS:
• Amazon Kinesis Streams
• Amazon Kinesis Analytics
• Amazon Kinesis Firehose
6. Amazon Kinesis Streams
• Easy administration
• Build real-time applications with framework of choice
• Low cost
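As a sketch of the producer side of a Streams application, the helper below writes one JSON event with boto3's `put_record`. The stream name `clickstream` and the event fields are hypothetical, and the client is passed in so the helper can be exercised without AWS credentials (with credentials you would pass `boto3.client("kinesis")`).

```python
import json

def put_click_event(kinesis_client, stream_name, event):
    """Write one JSON-encoded event to a Kinesis stream.

    Partitioning by user_id keeps a single user's events ordered
    within one shard.
    """
    return kinesis_client.put_record(
        StreamName=stream_name,
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=str(event["user_id"]),
    )
```

A random partition key would instead spread load evenly across shards; keying by user trades even distribution for per-user ordering.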
7. Amazon Kinesis Firehose
• Zero administration and seamless elasticity
• Direct-to-data store integration
• Continuous data transformations
Capture and submit streaming data → Firehose loads streaming data continuously into Amazon S3, Amazon Redshift, and Elasticsearch → analyze streaming data using your favorite BI tools
8. Easily capture and deliver data
• Write data to a Firehose delivery stream from a variety of sources
• Transform, encrypt, and/or compress data along the way
• Buffer and aggregate data by time and size before it is written to the destination
• Elastically scales with no resource provisioning
Sources (AWS platform SDKs, mobile SDKs, Kinesis Agent, AWS IoT) → Amazon Kinesis Firehose → Amazon S3, Amazon Redshift, Amazon Elasticsearch Service
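A minimal producer-side sketch for Firehose, again with the client injected. Firehose concatenates records at the destination, so producers typically append their own delimiter (a newline here) to keep records separable in the resulting S3 objects.

```python
def deliver_log_line(firehose_client, delivery_stream, line):
    """Send one log line to a Firehose delivery stream.

    A trailing newline is added because Firehose concatenates
    records when it writes batches to the destination.
    """
    data = line.rstrip("\n") + "\n"
    return firehose_client.put_record(
        DeliveryStreamName=delivery_stream,
        Record={"Data": data.encode("utf-8")},
    )
```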
9. Amazon Kinesis Analytics
• Apply SQL on streams
• Build real-time, stream-processing applications
• Easy scalability
Connect to Kinesis streams or Firehose delivery streams → run standard SQL queries against data streams → Kinesis Analytics sends processed data to analytics tools so you can create alerts and respond in real time
10. Use SQL to build real-time applications
• Connect to a streaming source
• Easily write SQL code to process streaming data
• Continuously deliver SQL results
11. AWS Lambda
• Function code triggered from newly arriving events
• Simple event-based processing of records
• Serverless processing with low administration
A social media stream is loaded into Kinesis in real time → Lambda runs code that generates hashtag trend data and stores it in DynamoDB → trend data is immediately available for business users to query
(Streams → Lambda → DynamoDB)
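A sketch of such a Lambda handler, assuming each Kinesis record carries a JSON payload with a `text` field (Kinesis delivers record data base64-encoded to Lambda; the DynamoDB write is elided here):

```python
import base64
import json
from collections import Counter

def handler(event, context=None):
    """Count hashtags in a batch of Kinesis records.

    In the real pipeline the resulting counts would be written
    to a DynamoDB trends table instead of returned.
    """
    counts = Counter()
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        for token in payload.get("text", "").split():
            if token.startswith("#"):
                counts[token.lower()] += 1
    return dict(counts)
```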
12. Amazon Elastic MapReduce (EMR)
• Ingest streaming data from many sources
• Easily configure clusters with the latest versions of open source frameworks
• Less underlying performance management
Ingest streaming data through Amazon Kinesis Streams → process it with your choice of engine, Spark Streaming or Apache Flink, on EMR → send processed data to S3, HDFS, or a custom destination using an open source connector
14. Stream processing – three example use cases
• Streaming ingest-transform-load: deliver data to analytics tools faster and cheaper
• Continuous metric generation: compute analytics as the data is generated
• Actionable insights: react to analytics based on insights
15. Streaming Ingest-Transform-Load
Use Cases
• Ingest and aggregate log data into Amazon S3
• Clean IoT sensor data and deliver to Elasticsearch
Key Requirements
• Ingest large volumes of small events
• Perform simple transformations
• Persist and store data efficiently
16. Deep Dive: Analyzing VPC Flow Logs
• Virtual Private Cloud (VPC) Flow Logs capture information about IP traffic
• Cost-effective storage and near-real-time analysis
VPC subnet → Amazon CloudWatch Logs → AWS Lambda → Amazon Kinesis Firehose → Amazon S3 bucket → Amazon Athena → Amazon QuickSight
Forward VPC flow logs → aggregate and transform → near-real-time queries and visualization
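The aggregation-and-transformation step can be sketched as a Firehose data-transformation Lambda. This sketch assumes each record already holds one plain-text flow-log line (a production handler would first unpack the gzipped CloudWatch Logs envelope); the field list follows the default version-2 flow-log format.

```python
import base64
import json

# Field names of the default (version 2) VPC flow log format.
FLOW_FIELDS = [
    "version", "account_id", "interface_id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log_status",
]

def transform(event, context=None):
    """Firehose data-transformation handler: flow-log line -> JSON.

    Returns each record with result "Ok" and its payload rewritten
    as a newline-terminated JSON object, ready for Athena to query.
    """
    out = []
    for record in event["records"]:
        line = base64.b64decode(record["data"]).decode("utf-8").strip()
        parsed = dict(zip(FLOW_FIELDS, line.split()))
        data = (json.dumps(parsed) + "\n").encode("utf-8")
        out.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(data).decode("utf-8"),
        })
    return {"records": out}
```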
17. Best Practices for VPC Flow Log Ingestion
• Buffer data in multiple steps, e.g., at the data producer, in Amazon Kinesis, and at the data consumer
• Choose a data format that supports your use case: JSON vs. GZIP vs. Parquet
• Understand the tradeoffs that come with latency, including accuracy and cost
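The first bullet, buffering at the producer, can be sketched as a small batcher that flushes by record count or byte size. `flush_fn` is a stand-in for a batched API call such as Firehose's `PutRecordBatch` (which caps batches at 500 records / 4 MB); a real batcher would also flush on a timer so quiet streams do not sit in the buffer.

```python
class RecordBuffer:
    """Producer-side buffer that flushes by record count or total bytes."""

    def __init__(self, flush_fn, max_records=500, max_bytes=4 * 1024 * 1024):
        self.flush_fn = flush_fn
        self.max_records = max_records
        self.max_bytes = max_bytes
        self.records, self.size = [], 0

    def add(self, data: bytes):
        self.records.append(data)
        self.size += len(data)
        if len(self.records) >= self.max_records or self.size >= self.max_bytes:
            self.flush()

    def flush(self):
        # Hand the batch off and start a fresh buffer.
        if self.records:
            self.flush_fn(self.records)
            self.records, self.size = [], 0
```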
18. Continuous metric generation
Examples
• Compute metrics from application logs
• Build leaderboards for top web pages or app screens
Key Requirements
• Produce accurate results with late and out-of-order data
• Combine results with historical data
• Quickly provide data to tech and non-tech end users
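For combining streaming results with historical data, one simple pattern is to keep running totals and fold each streamed batch into them; a hypothetical page-view leaderboard:

```python
from collections import Counter

def merge_batch(historical: Counter, batch_events) -> Counter:
    """Fold a new batch of page-view events into running totals,
    so the leaderboard reflects both historical and fresh data."""
    totals = historical.copy()
    totals.update(batch_events)
    return totals

def leaderboard(totals: Counter, n=3):
    """Top-n pages by total views."""
    return totals.most_common(n)
```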
19. Deep Dive: Time-Series Analytics
• Ingest and stream IoT data via MQTT
• Compute metrics like average temperature every 10 seconds
IoT sensors → AWS IoT → Amazon Kinesis Streams → Amazon Kinesis Analytics → Amazon Kinesis Streams → AWS Lambda function → RDS MySQL DB instance
Ingest sensor data → compute average temperature every 10 sec → persist time-series analytics to the database
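The 10-second average can be sketched as an event-time tumbling window in plain Python; the field names `ts` (event timestamp in seconds) and `temp` are assumptions for illustration.

```python
def window_averages(events, window_sec=10):
    """Average temperature per event-time tumbling window.

    Keying on the event's own timestamp (not arrival time) means
    late or out-of-order records still land in the right window.
    """
    acc = {}  # window_start -> (running sum, count)
    for e in events:
        start = int(e["ts"] // window_sec) * window_sec
        s, n = acc.get(start, (0.0, 0))
        acc[start] = (s + e["temp"], n + 1)
    return {w: s / n for w, (s, n) in sorted(acc.items())}
```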
20. Best Practices for Time-Series Analytics
• Handle data that arrives late and out of order from IoT sensors that are not always online
• Use event time in your aggregation; only use processing time if your use case is purely real-time
• Perform upserts for your MySQL table, e.g., insert [..] on duplicate key update [..]
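The upsert bullet, sketched against a hypothetical `avg_temperature` table (MySQL `%s` parameter style): re-emitting a window after a late record updates the aggregate overwrites the existing row instead of inserting a duplicate.

```python
# Hypothetical table: avg_temperature(window_start, sensor_id, avg_temp)
# with a unique key on (window_start, sensor_id).
UPSERT_SQL = (
    "INSERT INTO avg_temperature (window_start, sensor_id, avg_temp) "
    "VALUES (%s, %s, %s) "
    "ON DUPLICATE KEY UPDATE avg_temp = VALUES(avg_temp)"
)

def upsert_window(cursor, window_start, sensor_id, avg_temp):
    """Idempotent write of one window's result via a MySQL cursor."""
    cursor.execute(UPSERT_SQL, (window_start, sensor_id, avg_temp))
```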
21. Responsive Analysis
Examples
• Recommendation engines for retail websites
• Device operational intelligence and alerts
• Detecting trends and anomalies in user behavior
Key Requirements
• Ability to notify users and machines with low latency
• Long-running, stateful operations over streams
23. Best Practices for Anomaly Detection
• Keep up with the stream of data through scalable ingest and processing
• Understand your data deeply; an algorithm isn’t a replacement for a data scientist
• Provide enough context around the data for an app or human to derive the “why”
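Kinesis Analytics provides the RANDOM_CUT_FOREST SQL function for streaming anomaly detection; as a much simpler stand-in, a rolling z-score detector illustrates the idea (the window size and threshold below are arbitrary choices, not tuned values):

```python
from collections import deque
from math import sqrt

class RollingZScore:
    """Flag values far from the recent mean of a metric stream."""

    def __init__(self, window=100, threshold=3.0):
        self.values = deque(maxlen=window)
        self.threshold = threshold

    def is_anomaly(self, x):
        vals = self.values
        anomalous = False
        if len(vals) >= 10:  # require some history before flagging
            mean = sum(vals) / len(vals)
            std = sqrt(sum((v - mean) ** 2 for v in vals) / len(vals))
            anomalous = std > 0 and abs(x - mean) / std > self.threshold
        vals.append(x)  # the value joins the history either way
        return anomalous
```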
25. Try these use cases yourself
Many variations of these use cases have sample code on the AWS Big Data Blog. Follow the blog!
Some good examples:
• Analyzing VPC Flow Logs with Amazon Kinesis Firehose, Amazon Athena, and Amazon QuickSight
• Real-time Clickstream Anomaly Detection with Amazon Kinesis Analytics
• Writing SQL on Streaming Data with Amazon Kinesis Analytics | Part 1, Part 2
26. Lots of customer examples
• 1 billion events/wk from connected devices | IoT
• 17 PB of game data per season | Entertainment
• 80 billion ad impressions/day, 30 ms response time | Ad Tech
• 100 GB/day click streams from 250+ sites | Enterprise
• 50 billion ad impressions/day, sub-50 ms responses | Ad Tech
• 10 million events/day | Retail
• Amazon Kinesis as databus - migrate from Kafka to Kinesis | Enterprise
• Funnel all production events through Amazon Kinesis