My recent talk at Riga DevDay about Lambda Architecture on AWS. It illustrates a few design simplifications that we can get when we implement Lambda Architecture in a cloud-native way.
4. What is Streaming?
We often want to deploy data models based on new data that
continuously arrives from multiple sources
5. Challenges
Users expect data to appear immediately after it arrives
Fault tolerance
Distributed data consistency
Scalability (how not to lose data when scaling down)
6. What is “λ”
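[Diagram: new data flows into both the Speed Layer and the Batch Layer. The Batch Layer keeps the master data and builds views with map-reduce; the Speed Layer builds realtime views; the Serving Layer answers queries from the views and the realtime views.]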
7. What is “λ” architecture
Batch Layer: Master data sets and precomputed aggregations
• Slow data ingestion – intervals of minutes to days
• Append-only data sets eventually supersede the data
captured in the speed layer
Speed Layer: High-throughput, near-real-time data ingestion
• Fast data ingestion – intervals of seconds
• Concurrent information processing
• Retrieval of the most recent information
Serving Layer: Provides query capability over the Batch Layer
• Low-latency ad-hoc queries
• May also provide access to speed layer views
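The way the three layers fit together can be shown with a small sketch. It assumes both views are simple per-key counters held in memory. The class name ServingLayer, the method query and the page-a/page-b/page-c keys are made up for illustration and are not from the talk.

```java
import java.util.HashMap;
import java.util.Map;

public class ServingLayer {
    /**
     * Answers a query by combining the precomputed batch view with the
     * realtime view that covers data the batch layer has not processed yet.
     * Counts for the same key are summed; the inputs are left untouched.
     */
    public static Map<String, Long> query(Map<String, Long> batchView,
                                          Map<String, Long> realtimeView) {
        Map<String, Long> merged = new HashMap<>(batchView);
        realtimeView.forEach((key, count) -> merged.merge(key, count, Long::sum));
        return merged;
    }

    public static void main(String[] args) {
        // Batch view: slow to build, covers everything up to the last batch run.
        Map<String, Long> batchView = new HashMap<>();
        batchView.put("page-a", 100L);
        batchView.put("page-b", 40L);

        // Realtime view: only the events that arrived since that batch run.
        Map<String, Long> realtimeView = new HashMap<>();
        realtimeView.put("page-a", 3L);
        realtimeView.put("page-c", 1L);

        System.out.println(query(batchView, realtimeView));
    }
}
```

Once the next batch run has absorbed the recent events, the realtime view can be discarded. That is the sense in which the batch data eventually supersedes the speed layer data.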
15. AWS Blueprint for Lambda Architectures
https://d0.awsstatic.com/whitepapers/lambda-architecure-on-for-batch-aws.pdf
Published in July 2015
[Diagram labels: Amazon Kinesis; Amazon Kinesis-enabled app; S3 buckets; Amazon EMR; speed layer; batch layer; EMR as serving and merging layer]
19. Kinesis
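[Diagram: a Kinesis stream inside an AWS region spanning az1, az2 and az3, with producers on one side and consumers on the other; labels include Lambda, EC2 instance, EMR, S3 storage and Redshift]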
Producer:
AmazonKinesis kinesis = ...
...
PutRecordRequest putRecord = new PutRecordRequest();
putRecord.setStreamName(streamName);
putRecord.setData(ByteBuffer.wrap(bytes));
// null: no ordering constraint relative to a previously written record
putRecord.setSequenceNumberForOrdering(null);
...
kinesis.putRecord(putRecord);
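Consumer:
AmazonKinesisClient kinesisClient = ...
GetShardIteratorRequest req = ...
req.setStreamName("my-kinesis");
req.setShardIteratorType("TRIM_HORIZON"); // start from the oldest available record
...
// getRecords() expects a GetRecordsRequest that carries a shard iterator,
// so the iterator has to be obtained first
String shardIterator = kinesisClient.getShardIterator(req).getShardIterator();
GetRecordsRequest recordsReq = new GetRecordsRequest();
recordsReq.setShardIterator(shardIterator);
GetRecordsResult result = kinesisClient.getRecords(recordsReq);
List<Record> records = result.getRecords();
for (Record record : records) {
    ... = record.getData();
}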
20. Kinesis streams
What: Enables building near-real-time data processing
applications
Use cases:
• Real-time analytics
• Log file processing
• Reporting
Durability: data streams are replicated across 3 AZs
21. Kinesis streams
Cost model:
Shard hour:
• 5 read transactions per second
• 2 MB data read per second
• 1,000 write transactions (records) per second
• 1 MB data write per second
approx. 12.5 USD/month
Extended data retention
• Up to 7 days
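The per-shard limits above turn into a quick sizing estimate. The sketch below is a back-of-the-envelope calculation only: it uses the 1 MB/sec write and 2 MB/sec read figures and the approximate 12.5 USD per shard per month quoted above, and it ignores the per-second transaction limits. ShardEstimator and its method names are made up for illustration.

```java
public class ShardEstimator {
    // Per-shard throughput limits as quoted on the slide.
    static final double WRITE_MB_PER_SEC_PER_SHARD = 1.0;
    static final double READ_MB_PER_SEC_PER_SHARD = 2.0;
    // Approximate monthly price of one shard as quoted on the slide (assumption:
    // real pricing varies by region and also includes per-request charges).
    static final double USD_PER_SHARD_MONTH = 12.5;

    /** Smallest shard count that covers both the write and the read throughput. */
    public static int shardsNeeded(double writeMbPerSec, double readMbPerSec) {
        int forWrites = (int) Math.ceil(writeMbPerSec / WRITE_MB_PER_SEC_PER_SHARD);
        int forReads = (int) Math.ceil(readMbPerSec / READ_MB_PER_SEC_PER_SHARD);
        // A stream always has at least one shard.
        return Math.max(1, Math.max(forWrites, forReads));
    }

    /** Rough monthly shard cost in USD. */
    public static double monthlyCostUsd(int shards) {
        return shards * USD_PER_SHARD_MONTH;
    }

    public static void main(String[] args) {
        int shards = shardsNeeded(3.5, 7.0);
        System.out.println(shards + " shards, approx. " + monthlyCostUsd(shards) + " USD/month");
    }
}
```

A stream that takes well under 1 MB/sec still pays for a whole shard, which is why low-throughput workloads are a poor fit.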
22. Kinesis streams
Not good when:
• Throughput is small scale (less than 200 KB/sec)
• Data needs long-term storage (more than 24 hours)
23. Lambda
What: Lambda lets you write and run functions without managing
an actual server
Use cases:
• Real-time stream processing
• Tiny ETL
• In a few cases it can replace EC2
• Processing IaaS events
Runtimes: Java 8, Node.js, Python
Backed by: /tmp is provided for ephemeral storage
Durability: No maintenance windows, 3 retries before failure
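A "tiny ETL" function on a stream could look roughly like the sketch below. It is an illustration, not code from the talk. It assumes the function receives the Kinesis event as a generic map with a "Records" list whose entries carry base64-encoded payloads under "kinesis" and "data", and it uses a plain one-argument handler method so that only the JDK is needed. TinyEtlHandler and the upper-casing transform are made up.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Base64;
import java.util.List;
import java.util.Map;

public class TinyEtlHandler {
    /**
     * Hypothetical handler: decodes each record payload, applies a trivial
     * transform (trim and upper-case) and returns the transformed values.
     * Records without a payload are skipped.
     */
    @SuppressWarnings("unchecked")
    public List<String> handleRequest(Map<String, Object> event) {
        List<String> transformed = new ArrayList<>();
        List<Map<String, Object>> records =
                (List<Map<String, Object>>) event.get("Records");
        if (records == null) {
            return transformed;
        }
        for (Map<String, Object> record : records) {
            Map<String, Object> kinesis = (Map<String, Object>) record.get("kinesis");
            if (kinesis == null || kinesis.get("data") == null) {
                continue;
            }
            // Assumed event shape: the payload arrives base64-encoded.
            byte[] raw = Base64.getDecoder().decode((String) kinesis.get("data"));
            String text = new String(raw, StandardCharsets.UTF_8);
            transformed.add(text.trim().toUpperCase());
        }
        return transformed;
    }
}
```

A real function would write the results somewhere durable such as S3 or Redshift instead of returning them, since /tmp is only ephemeral.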
27. EMR
What: Managed Apache Hadoop service
Use cases:
• MapReduce data processing
• Large data ETL jobs
• Data movement
• Log processing and analytics
Backed by: a single EC2 instance or a cluster of them
Durability: provided at the storage level by S3
See more:
https://media.amazonwebservices.com/AWS_Amazon_EMR_Best_Practices.pdf
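For readers new to the model, the map and reduce steps of a log processing job can be imitated on a single machine with Java streams. This is only a sketch of the idea and not Hadoop or EMR code. It assumes each log line starts with its level, and LogLevelCount and the sample lines are made up.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class LogLevelCount {
    /** Counts log lines per level; blank lines are ignored. */
    public static Map<String, Long> countByLevel(List<String> lines) {
        return lines.stream()
                .filter(line -> !line.trim().isEmpty())
                // "map" step: reduce each line to its key, the log level
                .map(line -> line.trim().split("\\s+")[0])
                // "reduce" step: group identical keys and count them
                .collect(Collectors.groupingBy(level -> level,
                        TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "INFO service started",
                "ERROR disk full",
                "INFO service stopped");
        System.out.println(countByLevel(lines));
    }
}
```

On EMR the same two steps run in parallel across the cluster, with input and output typically kept in S3.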
32. S3
Not good when:
• Fast writes are needed (S3 writes can be slow)
• Archived data must come back quickly (Glacier can restore up to 5% of stored data per month)
33. Redshift
What: Petabyte-scale data warehouse as a managed service
Use cases:
• Data warehouse (OLAP)
• BI and ETL
• Storing large historical data sets
Backed by: AWS provides automatic data backup
Durability: provided at the storage level by S3
Scaling: Start with a 160 GB node and then you can scale