Explore the options for streaming data on AWS, such as Amazon Kinesis and Amazon Managed Streaming for Kafka, and the options for processing those streams, such as Apache Spark, Apache Flink, AWS Lambda, and Amazon Kinesis Analytics for Java. We then look at what an architecture for processing Australia's new Open Banking data format at 60,000 transactions per second could look like.
2. Typical credit card transaction
[Diagram: User, Mobile client, Merchant, Credit Card Institution]
3. Data is produced continuously
• Mobile Apps
• Web Clickstream
• Application Logs
• Metering Records
• IoT Sensors
• Smart Cities
Example application log entry: [Wed Oct 11 14:32:52 2018] [error] [client 127.0.0.1] client denied by server configuration: /export/home/live/ap/htdocs/test
4. The diminishing value of data over time
[Figure: Information in Decision-Making. Axes: Value of Data to Decision-Making against time (Months…, Days, Hours, Minutes, Seconds, Real Time). Curve labels: Historical, Reactive, Actionable, Preventive/Predictive. Regions: Traditional "Batch" Business Intelligence; Time-critical Decisions]
5. Customer examples
• 1 billion events per week from connected devices
• Near-real-time home valuation (Zestimates)
• Live clickstream dashboards refreshed under 10 sec.
• 50 billion daily ad impressions, sub-50-ms responses
• Facilitate communications between 100+ microservices
• Analyse billions of network flows in real time
• IoT predictive analytics
• Analyse game events in near real time
6. What do we need from our streaming architecture
• Event driven
• Source for batch
• Consumption models
• Schema flexibility
• Microservices
• Loosely coupled
• Elasticity
• Horizontal scalability
• Operations
7. Components of a streaming architecture
[Diagram: Producers, Message Buffer (Topic A, Topic B), Schema Repository, Consumers]
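The components named on this slide can be sketched as a toy, in-memory model. This is only an illustration under stated assumptions: `MessageBuffer`, `SCHEMAS`, and `is_valid` are hypothetical names, not any AWS or Kafka API, and a real buffer would be durable and distributed. The sketch shows producers appending to named topics, each consumer tracking its own read position, and a schema lookup used to validate messages before they are published.

```python
from collections import defaultdict


class MessageBuffer:
    """Toy in-memory stand-in for a message buffer with named topics."""

    def __init__(self):
        self._topics = defaultdict(list)   # topic name -> append-only list of messages
        self._offsets = defaultdict(int)   # (consumer, topic) -> next offset to read

    def publish(self, topic, message):
        """Producer side: append a message to the end of a topic."""
        self._topics[topic].append(message)

    def poll(self, consumer, topic, max_messages=10):
        """Consumer side: return unread messages and advance this consumer's offset."""
        start = self._offsets[(consumer, topic)]
        batch = self._topics[topic][start:start + max_messages]
        self._offsets[(consumer, topic)] = start + len(batch)
        return batch


# Hypothetical schema repository: the fields each topic requires.
SCHEMAS = {"topic-a": {"account_id", "amount"}}


def is_valid(topic, message):
    """Check that a message carries every field the topic's schema requires."""
    return SCHEMAS[topic].issubset(message)


buffer = MessageBuffer()
for message in ({"account_id": "a1", "amount": 12.5}, {"account_id": "a2", "amount": 3.0}):
    if is_valid("topic-a", message):
        buffer.publish("topic-a", message)

# Two consumers read the same topic independently of each other.
fraud_batch = buffer.poll("fraud-check", "topic-a")
archive_batch = buffer.poll("archiver", "topic-a")
```

Because each consumer keeps its own offset, adding a new consumer does not affect existing ones, which is one way to read the "loosely coupled" requirement from the previous slide.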
8. Checkpoint: Amazon SQS, Amazon Kinesis and Amazon Managed Streaming for Kafka
• Traditional Messaging semantics
• Transparent scaling
• Individual message delay
• Dead letter queues
• Multiple Consumers
• Native AWS Integrations
• Fully Managed
• Control over ordering
• Highly configurable retention
• Managed Kafka and Zookeeper
• Existing applications
• Full configurability
• Log compaction
[Column labels: Amazon Simple Queue Service; Amazon Kinesis; Amazon Managed Streaming for Kafka; Kafka on Elastic Compute Cloud (EC2)]
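To make the "Control over ordering" point concrete, here is a minimal sketch of key-based partitioning, the general idea that both Kinesis shards and Kafka partitions rely on: records with the same key land in the same partition, so their relative order is preserved. The partition count, the account keys, and the use of MD5 modulo the partition count are simplifying assumptions for illustration, not the exact algorithm of either service.

```python
import hashlib
from collections import defaultdict

NUM_PARTITIONS = 4  # assumption: a small, fixed number of partitions


def partition_for(key, num_partitions=NUM_PARTITIONS):
    """Map a partition key to a partition index using a stable hash."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % num_partitions


# Hypothetical events as (partition key, payload) pairs, in arrival order.
events = [
    ("acct-1", "debit 10"),
    ("acct-2", "debit 5"),
    ("acct-1", "credit 3"),
    ("acct-2", "debit 7"),
    ("acct-1", "debit 1"),
]

partitions = defaultdict(list)
for key, payload in events:
    partitions[partition_for(key)].append((key, payload))


def events_for(key):
    """Read back one key's events from its partition, in the order they were written."""
    return [payload for k, payload in partitions[partition_for(key)] if k == key]
```

Ordering is only guaranteed within a partition. Events for different keys may be interleaved differently across partitions, which is usually acceptable when the key is something like an account identifier.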
13. Amazon Kinesis Data Streaming
Collect, process and analyse data streams in real time
• Real-time
• Fully managed
• Scalable
• Secure
• Cost effective
• Amazon Kinesis Data Streams: ingest, store data streams
• Amazon Kinesis Data Analytics: aggregate, filter, enrich data
• Amazon Kinesis Data Firehose: egress data streams
[Diagram: AWS Lambda, Amazon EMR/Spark, Custom code on Amazon EC2, Amazon S3, Amazon Redshift, Amazon Elasticsearch Service, Splunk]
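As a rough sizing exercise for the 60,000 transactions per second target mentioned in the abstract, the sketch below estimates how many Kinesis Data Streams shards the write side would need. It relies on the documented per-shard write limits of 1,000 records per second and 1 MB per second. The average record sizes are assumptions, and 1 MB is treated as 1,000,000 bytes, so the result is a back-of-the-envelope figure rather than a capacity plan.

```python
RECORDS_PER_SHARD_PER_SEC = 1_000      # per-shard write limit (records)
BYTES_PER_SHARD_PER_SEC = 1_000_000    # per-shard write limit (1 MB, taken as 10^6 bytes)


def ceil_div(a, b):
    """Integer ceiling division without floating point."""
    return -(-a // b)


def shards_needed(records_per_sec, avg_record_bytes):
    """Shards required to satisfy both the record-count and the byte-rate limit."""
    by_count = ceil_div(records_per_sec, RECORDS_PER_SHARD_PER_SEC)
    by_bytes = ceil_div(records_per_sec * avg_record_bytes, BYTES_PER_SHARD_PER_SEC)
    return max(by_count, by_bytes)


# Assumed average record sizes; the deck does not state one.
small_records = shards_needed(60_000, 500)     # limited by record count
large_records = shards_needed(60_000, 2_000)   # limited by byte rate
```

With small records the record-count limit dominates; with larger records the byte-rate limit does. Read throughput, consumer fan-out, and headroom for bursts would all need to be considered separately.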
18. Apache Flink
Framework and distributed engine for stateful processing of data streams.
• Simple programming: easy to use and flexible APIs make building apps fast
• High performance: in-memory computing provides low latency & high throughput
• Stateful processing: durable application state saves
• Strong data integrity: exactly-once processing and consistent state
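The idea behind durable state and exactly-once results can be illustrated with a deliberately simplified, single-process sketch: state is snapshotted together with the input position, and after a failure the job restores the snapshot and replays from that position. Flink's actual checkpointing is distributed and considerably more involved; the function name, the event shape, and the crash simulation here are all assumptions made for illustration.

```python
import copy


def run(events, crash_at=None, checkpoint_every=3):
    """Sum amounts per account, optionally simulating one crash and recovery.

    events: list of (account, amount) tuples, read by offset like a replayable log.
    """
    state, offset = {}, 0
    checkpoint = (copy.deepcopy(state), offset)   # snapshot: (state, next offset to read)
    crashed = False
    while offset < len(events):
        if crash_at is not None and offset == crash_at and not crashed:
            # Simulated failure: discard in-memory progress and restore the snapshot.
            crashed = True
            state, offset = copy.deepcopy(checkpoint[0]), checkpoint[1]
            continue
        account, amount = events[offset]
        state[account] = state.get(account, 0) + amount
        offset += 1
        if offset % checkpoint_every == 0:
            # State and input position are saved together, so they never disagree.
            checkpoint = (copy.deepcopy(state), offset)
    return state


events = [("a", 5), ("b", 2), ("a", 1), ("b", 4), ("a", 3), ("c", 9), ("a", 2)]
without_failure = run(events)
with_failure = run(events, crash_at=5)
```

Some events are processed twice after the crash, but because the state is rolled back to the snapshot first, each event affects the final state exactly once.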
25. Aggregations and windowing
Window types: Sliding, Tumbling, Session
Tumbling window example:
    SELECT TUMBLE_START(created_date, INTERVAL '20' SECOND) AS wStart,
           TUMBLE_END(created_date, INTERVAL '20' SECOND) AS wEnd,
           SUM(amount),
           AVG(amount),
           COUNT(*)
    FROM Transactions
    GROUP BY TUMBLE(created_date, INTERVAL '20' SECOND)
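The tumbling-window query above can be mirrored in plain Python to show what the grouping does. This is a minimal sketch over a finite list: it assumes timestamps are given as whole seconds and ignores out-of-order events and watermarks, which a streaming engine has to handle. The sample transactions are made up for illustration.

```python
from collections import defaultdict

WINDOW_SECONDS = 20  # same window length as the query


def tumbling_aggregate(transactions, window_seconds=WINDOW_SECONDS):
    """Group (created_seconds, amount) pairs into fixed, non-overlapping windows."""
    windows = defaultdict(list)
    for created, amount in transactions:
        # Each event belongs to exactly one window: [w_start, w_start + window_seconds)
        w_start = created - (created % window_seconds)
        windows[w_start].append(amount)
    return [
        {
            "wStart": w_start,
            "wEnd": w_start + window_seconds,
            "sum": sum(amounts),
            "avg": sum(amounts) / len(amounts),
            "count": len(amounts),
        }
        for w_start, amounts in sorted(windows.items())
    ]


# Hypothetical transactions as (created_seconds, amount).
transactions = [(0, 10.0), (5, 20.0), (19, 30.0), (20, 40.0), (45, 50.0)]
result = tumbling_aggregate(transactions)
```

A sliding window would assign each event to several overlapping windows, and a session window would close a window after a gap of inactivity; both need different assignment logic from the single modulo step used here.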
26. Categorise transactions from a stream
[Architecture diagram: Users, Mobile client, Amazon API Gateway, Amazon Managed Streaming for Kafka, Amazon SageMaker, Amazon DynamoDB, Amazon Simple Notification Service]
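The categorisation step in this architecture can be sketched in isolation. The slide does not describe the processing logic, so everything below is an assumption: the keyword rules stand in for a trained model endpoint, the dictionary stands in for a key-value table, and the transaction fields are hypothetical.

```python
# Hypothetical keyword rules standing in for a trained categorisation model.
CATEGORY_KEYWORDS = {
    "groceries": ("supermarket", "grocer"),
    "transport": ("fuel", "taxi", "rail"),
}


def categorise(description):
    """Return the first category whose keywords appear in the description."""
    text = description.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(word in text for word in keywords):
            return category
    return "uncategorised"


def process_stream(transactions):
    """Consume transactions one by one and store each with its category."""
    store = {}  # stand-in for a key-value table keyed by transaction id
    for txn in transactions:
        store[txn["id"]] = {**txn, "category": categorise(txn["description"])}
    return store


# Hypothetical stream contents.
stream = [
    {"id": "t1", "description": "City Supermarket", "amount": 42.10},
    {"id": "t2", "description": "Airport Taxi", "amount": 18.00},
    {"id": "t3", "description": "Bookshop", "amount": 25.00},
]
categorised = process_stream(stream)
```

Keeping `categorise` as a separate function means the rule-based stand-in could later be swapped for a call to a model endpoint without changing the consumer loop.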
27. Grow with your requirements
[Architecture diagram: Mobile client, Desktop, Analyst, Amazon API Gateway, Amazon Managed Streaming for Kafka, AWS Lambda, Amazon EC2, Amazon Simple Storage Service (S3), Amazon Athena, Amazon Redshift, Amazon Elasticsearch Service, Amazon RDS, Amazon Neptune]
30. Data61 - CSIRO
• Standards developed as part of the introduction in Australia of the Consumer Data Right legislation to give Australians greater control over their data
• The Consumer Data Right is intended to apply sector by sector across the whole economy, beginning in the banking, energy and telecommunications sectors.
• https://consumerdatastandardsaustralia.github.io
31. Resources
• Free online training - https://www.aws.training
• Flink Homepage - https://flink.apache.org/
• AWS Big Data Blog - https://aws.amazon.com/blogs/big-data/
• Real-time bushfire alerting with Complex Event Processing in Apache Flink on Amazon EMR and IoT sensor network
• Confluent - https://www.confluent.io/blog/
• Designing Data-Intensive Applications by Martin Kleppmann