AWS offers many data services, each optimized for a specific set of structure, size, latency, and concurrency requirements. Making the best use of these specialized services has historically required custom, error-prone data transformation and transport. Now, users can use the AWS Data Pipeline service to orchestrate data flows between Amazon S3, Amazon RDS, Amazon DynamoDB, Amazon Redshift, and on-premises data stores, seamlessly and efficiently applying EC2 instances and EMR clusters to process and transform data. In this session, we demonstrate how you can use AWS Data Pipeline to coordinate your Big Data workflows, applying the optimal data storage technology to each part of your data integration architecture. Swipely's Head of Engineering shows how Swipely uses AWS Data Pipeline to build batch analytics, backfilling all their data while using resources efficiently. As a result, Swipely launches novel product features with less development time and less operational complexity.
2. What are some of the challenges in dealing with data?
3. 1. Data is stored in different formats and locations, making it hard to integrate
[Diagram: Amazon Redshift, Amazon RDS, Amazon S3, Amazon EMR, Amazon DynamoDB, On-Premises]
4. 2. Data workflows require complex dependencies
• For example, a data processing step may depend on:
  • Input data being ready
  • Prior step completing
  • Time of day
  • Etc.
[Flowchart: Input Data Ready? → No / Yes → Run…]
5. 3. Things go wrong - you must handle exceptions
• For example, do you want to:
  • Retry in the case of failure?
  • Wait if a dependent step is taking longer than expected?
  • Be notified if something goes wrong?
6. 4. Existing tools are not a good fit
• Expensive upfront licenses
• Scaling issues
• Don’t support scheduling
• Not designed for the cloud
• Don’t support newer data stores (e.g., Amazon DynamoDB)
8. A simple pipeline
[Diagram: Input DataNode with PreCondition check → Activity with failure & delay notifications → Output DataNode]
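As an illustrative sketch (not taken from the deck), a pipeline of this shape could be written in Data Pipeline's JSON definition format roughly as follows; all ids, bucket names, and the referenced HourlySchedule, WorkerInstance, and FailureAlert objects are placeholder assumptions, and data-format details are omitted:

{
  "objects": [
    { "id": "InputReady", "type": "S3PrefixNotEmpty",
      "s3Prefix": "s3://example-bucket/input/" },
    { "id": "InputData", "type": "S3DataNode",
      "schedule": { "ref": "HourlySchedule" },
      "directoryPath": "s3://example-bucket/input/",
      "precondition": { "ref": "InputReady" } },
    { "id": "CopyStep", "type": "CopyActivity",
      "schedule": { "ref": "HourlySchedule" },
      "input": { "ref": "InputData" },
      "output": { "ref": "OutputData" },
      "runsOn": { "ref": "WorkerInstance" },
      "onFail": { "ref": "FailureAlert" } },
    { "id": "OutputData", "type": "S3DataNode",
      "schedule": { "ref": "HourlySchedule" },
      "directoryPath": "s3://example-bucket/output/" }
  ]
}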
9. Manages scheduled data movement and processing across AWS services
Activities:
• Copy
• MapReduce
• Hive
• Pig (New)
• SQL (New)
• Shell command
[Diagram: Amazon Redshift, Amazon RDS, Amazon S3, Amazon EMR, Amazon DynamoDB]
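For instance, a MapReduce step is expressed as an EmrActivity; in this hedged sketch the jar path and arguments are placeholders, and the referenced schedule and cluster would be defined elsewhere in the pipeline:

{ "id": "WordCountStep", "type": "EmrActivity",
  "schedule": { "ref": "HourlySchedule" },
  "runsOn": { "ref": "NightlyEmrCluster" },
  "step": "s3://example-bucket/wordcount.jar,s3://example-bucket/input,s3://example-bucket/output" }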
10. Facilitates periodic data movement to/from AWS
[Diagram: Amazon Redshift, Amazon RDS, Amazon S3, Amazon EMR, Amazon DynamoDB, On-Premises]
11. Supports dependencies (Preconditions)
• Amazon DynamoDB table exists/has data
• Amazon S3 key exists
• Amazon S3 prefix is not empty
• Success of custom Unix/Linux shell command
• Success of other pipeline tasks
[Flowchart: S3 key exists? → Yes → Copy…]
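As a hedged example of the second precondition type (the bucket and key are illustrative; #{...} is Data Pipeline's expression syntax):

{ "id": "LogsArrived", "type": "S3KeyExists",
  "s3Key": "s3://example-bucket/logs/#{format(@scheduledStartTime, 'YYYY-MM-dd')}.gz" }

A data node or activity then depends on it via "precondition": { "ref": "LogsArrived" }.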
12. Alerting and exception handling
• Notification
  • On failure
  • On delay
• Automatic retry logic
[Diagram: Task 1 → Success → Task 2; each task’s Failure → Alert]
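Concretely, a sketch of an SnsAlarm object (the topic ARN and role name are placeholders):

{ "id": "FailureAlert", "type": "SnsAlarm",
  "role": "DataPipelineDefaultRole",
  "topicArn": "arn:aws:sns:us-east-1:111122223333:pipeline-alerts",
  "subject": "Pipeline task failed",
  "message": "#{node.name} failed." }

An activity references it with "onFail": { "ref": "FailureAlert" } (or "onLateAction" for delays) and tunes the retry logic with, e.g., "maximumRetries": "3".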
13. Flexible scheduling
• Choose a schedule
  • Run every: 15 minutes, hour, day, week, etc.
  • User defined
• Backfill support
  • Start pipeline on past date
  • Rapidly backfills to present day
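A schedule is itself a pipeline object; in this sketch (id and dates are illustrative), a startDateTime in the past makes the pipeline backfill hourly runs up to the present day:

{ "id": "HourlySchedule", "type": "Schedule",
  "period": "1 hour",
  "startDateTime": "2013-10-01T00:00:00" }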
14. Massively scalable
• Creates and terminates AWS resources (Amazon EC2 and Amazon EMR) to process data
• Manages resources in multiple regions
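For example (a sketch; the instance types, count, and timeout are placeholder choices), an EmrCluster resource that Data Pipeline creates for each run and terminates afterward:

{ "id": "NightlyEmrCluster", "type": "EmrCluster",
  "schedule": { "ref": "HourlySchedule" },
  "masterInstanceType": "m1.large",
  "coreInstanceType": "m1.large",
  "coreInstanceCount": "4",
  "terminateAfter": "2 hours" }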
15. Easy to get started
• Templates for common use cases
• Graphical interface
• Natively understands CSV and TSV
• Automatically configures Amazon EMR clusters
17. An ETL example (1 of 2)
• Combine logs in Amazon S3 with customer data in Amazon RDS
• Process using Hive on Amazon EMR
• Put output in Amazon S3
• Load into Amazon Redshift
• Run SQL query and load table for BI tools
18. An ETL example (2 of 2)
• Run on a schedule (e.g., hourly)
• Use a precondition to make Hive activity depend on Amazon S3 logs being available
• Set up Amazon SNS notification on failure
• Change default retry logic
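Pulling the pieces above together, the Hive step of such a pipeline might look like this sketch (ids reference the earlier placeholder objects; the Hive script is schematic and the Amazon RDS join side is omitted for brevity; ${input1}/${output1} are Data Pipeline's staged-table variables):

{ "id": "CombineLogsHive", "type": "HiveActivity",
  "schedule": { "ref": "HourlySchedule" },
  "input": { "ref": "RawLogsS3" },
  "output": { "ref": "StagingS3" },
  "runsOn": { "ref": "NightlyEmrCluster" },
  "precondition": { "ref": "LogsArrived" },
  "onFail": { "ref": "FailureAlert" },
  "maximumRetries": "3",
  "stage": "true",
  "hiveScript": "INSERT OVERWRITE TABLE ${output1} SELECT * FROM ${input1};" }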
20. How big is your data?
[Chart: 1 TB]
21-23. How big is your data?
Do you have a big data problem?
Don’t use Hadoop: your data isn’t that big.
Keep your data small and manageable.
24-27. Get ahead of your Big Data - don’t wait for data to become a problem
Build novel product features with a batch architecture.
Decrease development time by easily backfilling data.
Vastly simplify operations with scalable on-demand services.
30-31. Swipely must innovate by making payments data actionable, and rapidly iterate, deploying multiple times a day, with a lean team: we have 2 ops engineers.
32-33. Swipely uses AWS Data Pipeline to build batch analytics (fast, dynamic reports by mashing up data from facts), backfilling all our data, using resources efficiently.
36-38. AWS Data Pipeline orchestrates building of documents from facts
[Diagram: Transaction Facts → EMR Data Transformer (EMR) → Intermediate S3 Bucket → Data Post-Processor → insert → Sales by Day Documents, all coordinated by AWS Data Pipeline]
44. Mash up data for efficient processing
[Diagram: Transaction, Card Opt-In, and Customer Spend facts (e.g., "Cafe 5/11 2472 $57", "2472 Bob") are joined with Hive on EMR into Visits and Sales by Day documents (e.g., "Cafe 5/11: $5432 80 new")]
45. AWS Data Pipeline orchestrates building of documents from facts
[Diagram repeated from slide 38: Transaction Facts → EMR Data Transformer → Intermediate S3 Bucket → Data Post-Processor → Sales by Day Documents]
46-47. Swipely uses AWS Data Pipeline to build batch analytics, backfilling all our data (regularly rebuild to rapidly iterate, using agile process), using resources efficiently.
48-49. Regularly rebuild to avoid backfilling
[Diagram: daily transactions and card opt-in feed the Fact Store; Analytics Documents are rebuilt from it for the web service, with Recent Activity layered on top]
54. Leverage agile development process
• Wrap pipeline definition
• Reduce variability
• Quickly diagnose failures
• Automate common tasks
57. Reduce variability
No small instances:
  "coreInstanceType": "m1.large"
Lock versions:
  "installHive": "0.8.1.8"
Security groups by database:
  "securityGroups": [ "customerdb" ]
58. Quickly diagnose failures
Turn on logging:
  "enableDebugging", "logUri", "emrLogUri"
Namespace your logs:
  "s3://#{LOGS_BUCKET}/#{@s3prefix}/#{START_TIME}/SalesByDayEMRLogs"
Log into dev instances:
  "keyPair"
59. Automate common tasks
Clean up:
  "terminateAfter": "6 hours"
Bootstrap your environment:
  {
    "id": "BootstrapEnvironment",
    "type": "ShellCommandActivity",
    "scriptUri": ".../bootstrap_ec2.sh",
    "runsOn": { "ref": "SalesByDayEC2Resource" }
  }
60-61. Swipely uses AWS Data Pipeline to build batch analytics, backfilling all our data, using resources efficiently: scale horizontally, backfilling in 50 min, storing all your data.
64-68. Cost vs latency sweet spot at 50 min
Use smallest capable on-demand instance type: fixed hourly cost, no idle time.
Scale EMR-heavy jobs horizontally: cost(1 instance, N hours) = cost(N instances, 1 hour).
Target < 1 hour: ~10 min runtime variability.
Crunch 50 GB of facts in 50 min using 40 instances for < $10.
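As a rough sanity check on that figure (assuming the 2013 on-demand price of about $0.24/hour for an m1.large, which is not stated in the deck): 40 instances × 1 hour × $0.24/hour ≈ $9.60, consistent with the quoted < $10.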
69-72. Store all your data - it’s cheap
Store all your facts in Amazon S3 (your source of truth): 50 GB, $5 / month.
Store your analytics documents in Amazon RDS (for indexed queries): 20 GB, $250 / month.
Retain intermediate data in Amazon S3 (for diagnosis): 1.1 TB (60 days), $100 / month.
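For context (assuming 2013-era S3 standard pricing of roughly $0.095 per GB-month, not stated in the deck): 50 GB × $0.095 ≈ $4.75/month and 1,100 GB × $0.095 ≈ $105/month, matching the S3 figures; the higher RDS figure presumably reflects a running database instance, not storage alone.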
73-74. Swipely uses AWS Data Pipeline to build batch analytics, backfilling all our data, using resources efficiently.
76. Please give us your feedback on this presentation (BDT207). As a thank you, we will select prize winners daily for completed surveys!
Thank You