AWS offers many data services, each optimized for a specific set of structure, size, latency, and concurrency requirements. Making the best use of these specialized services has historically required custom, error-prone data transformation and transport. Now, users can use the AWS Data Pipeline service to orchestrate data flows between Amazon S3, Amazon RDS, Amazon DynamoDB, Amazon Redshift, and on-premises data stores, seamlessly and efficiently applying EC2 instances and EMR clusters to process and transform data. In this session, we demonstrate how you can use AWS Data Pipeline to coordinate your Big Data workflows, applying the optimal data storage technology to each part of your data integration architecture. Swipely's Head of Engineering shows how Swipely uses AWS Data Pipeline to build batch analytics, backfilling all their data while using resources efficiently. As a result, Swipely launches novel product features with less development time and less operational complexity.
2. What are some of the challenges in dealing with data?
3. 1. Data is stored in different formats and locations, making it hard to integrate
[Diagram: Amazon Redshift, Amazon RDS, Amazon S3, Amazon EMR, Amazon DynamoDB, On-Premises]
4. 2. Data workflows require complex dependencies
• For example, a data processing step may depend on:
  • Input data being ready
  • Prior step completing
  • Time of day
  • Etc.
[Flowchart: Input Data Ready? → No / Yes → Run…]
5. 3. Things go wrong - you must handle exceptions
• For example, do you want to:
  • Retry in the case of failure?
  • Wait if a dependent step is taking longer than expected?
  • Be notified if something goes wrong?
6. 4. Existing tools are not a good fit
• Expensive upfront licenses
• Scaling issues
• Don’t support scheduling
• Not designed for the cloud
• Don’t support newer data stores (e.g., Amazon DynamoDB)
8. A simple pipeline
[Diagram: Input DataNode with PreCondition check → Activity with failure & delay notifications → Output DataNode]
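As an illustrative sketch (not taken from the deck), a pipeline of this shape could be written in Data Pipeline's JSON definition format roughly as follows; all ids, bucket names, and the referenced HourlySchedule, WorkerInstance, and FailureAlert objects are placeholder assumptions, and data-format details are omitted:

{
  "objects": [
    { "id": "InputReady", "type": "S3PrefixNotEmpty",
      "s3Prefix": "s3://example-bucket/input/" },
    { "id": "InputData", "type": "S3DataNode",
      "schedule": { "ref": "HourlySchedule" },
      "directoryPath": "s3://example-bucket/input/",
      "precondition": { "ref": "InputReady" } },
    { "id": "CopyStep", "type": "CopyActivity",
      "schedule": { "ref": "HourlySchedule" },
      "input": { "ref": "InputData" },
      "output": { "ref": "OutputData" },
      "runsOn": { "ref": "WorkerInstance" },
      "onFail": { "ref": "FailureAlert" } },
    { "id": "OutputData", "type": "S3DataNode",
      "schedule": { "ref": "HourlySchedule" },
      "directoryPath": "s3://example-bucket/output/" }
  ]
}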
9. Manages scheduled data movement and processing across AWS services
Activities:
• Copy
• MapReduce
• Hive
• Pig (New)
• SQL (New)
• Shell command
[Diagram: Amazon Redshift, Amazon RDS, Amazon S3, Amazon EMR, Amazon DynamoDB]
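For instance, a MapReduce step is expressed as an EmrActivity; in this hedged sketch the jar path and arguments are placeholders, and the referenced schedule and cluster would be defined elsewhere in the pipeline:

{ "id": "WordCountStep", "type": "EmrActivity",
  "schedule": { "ref": "HourlySchedule" },
  "runsOn": { "ref": "NightlyEmrCluster" },
  "step": "s3://example-bucket/wordcount.jar,s3://example-bucket/input,s3://example-bucket/output" }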
10. Facilitates periodic data movement to/from AWS
[Diagram: Amazon Redshift, Amazon RDS, Amazon S3, Amazon EMR, Amazon DynamoDB, On-Premises]
11. Supports dependencies (Preconditions)
• Amazon DynamoDB table exists/has data
• Amazon S3 key exists
• Amazon S3 prefix is not empty
• Success of custom Unix/Linux shell command
• Success of other pipeline tasks
[Flowchart: S3 key exists? → Yes → Copy…]
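As a hedged example of the second precondition type (the bucket and key are illustrative; #{...} is Data Pipeline's expression syntax):

{ "id": "LogsArrived", "type": "S3KeyExists",
  "s3Key": "s3://example-bucket/logs/#{format(@scheduledStartTime, 'YYYY-MM-dd')}.gz" }

A data node or activity then depends on it via "precondition": { "ref": "LogsArrived" }.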
12. Alerting and exception handling
• Notification
  • On failure
  • On delay
• Automatic retry logic
[Diagram: Task 1 → Success → Task 2; each task’s Failure → Alert]
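Concretely, a sketch of an SnsAlarm object (the topic ARN and role name are placeholders):

{ "id": "FailureAlert", "type": "SnsAlarm",
  "role": "DataPipelineDefaultRole",
  "topicArn": "arn:aws:sns:us-east-1:111122223333:pipeline-alerts",
  "subject": "Pipeline task failed",
  "message": "#{node.name} failed." }

An activity references it with "onFail": { "ref": "FailureAlert" } (or "onLateAction" for delays) and tunes the retry logic with, e.g., "maximumRetries": "3".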
13. Flexible scheduling
• Choose a schedule
  • Run every: 15 minutes, hour, day, week, etc.
  • User defined
• Backfill support
  • Start pipeline on past date
  • Rapidly backfills to present day
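A schedule is itself a pipeline object; in this sketch (id and dates are illustrative), a startDateTime in the past makes the pipeline backfill hourly runs up to the present day:

{ "id": "HourlySchedule", "type": "Schedule",
  "period": "1 hour",
  "startDateTime": "2013-10-01T00:00:00" }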
14. Massively scalable
• Creates and terminates AWS resources (Amazon EC2 and Amazon EMR) to process data
• Manages resources in multiple regions
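For example (a sketch; the instance types, count, and timeout are placeholder choices), an EmrCluster resource that Data Pipeline creates for each run and terminates afterward:

{ "id": "NightlyEmrCluster", "type": "EmrCluster",
  "schedule": { "ref": "HourlySchedule" },
  "masterInstanceType": "m1.large",
  "coreInstanceType": "m1.large",
  "coreInstanceCount": "4",
  "terminateAfter": "2 hours" }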
15. Easy to get started
• Templates for common use cases
• Graphical interface
• Natively understands CSV and TSV
• Automatically configures Amazon EMR clusters
17. An ETL example (1 of 2)
• Combine logs in Amazon S3 with customer data in Amazon RDS
• Process using Hive on Amazon EMR
• Put output in Amazon S3
• Load into Amazon Redshift
• Run SQL query and load table for BI tools
18. An ETL example (2 of 2)
• Run on a schedule (e.g., hourly)
• Use a precondition to make Hive activity depend on Amazon S3 logs being available
• Set up Amazon SNS notification on failure
• Change default retry logic
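Pulling the pieces above together, the Hive step of such a pipeline might look like this sketch (ids reference the earlier placeholder objects; the Hive script is schematic and the Amazon RDS join side is omitted for brevity; ${input1}/${output1} are Data Pipeline's staged-table variables):

{ "id": "CombineLogsHive", "type": "HiveActivity",
  "schedule": { "ref": "HourlySchedule" },
  "input": { "ref": "RawLogsS3" },
  "output": { "ref": "StagingS3" },
  "runsOn": { "ref": "NightlyEmrCluster" },
  "precondition": { "ref": "LogsArrived" },
  "onFail": { "ref": "FailureAlert" },
  "maximumRetries": "3",
  "stage": "true",
  "hiveScript": "INSERT OVERWRITE TABLE ${output1} SELECT * FROM ${input1};" }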
20. How big is your data?
[Chart: 1 TB]
21-23. How big is your data?
Do you have a big data problem?
Don’t use Hadoop: your data isn’t that big.
Keep your data small and manageable.
24-27. Get ahead of your Big Data - don’t wait for data to become a problem
Build novel product features with a batch architecture.
Decrease development time by easily backfilling data.
Vastly simplify operations with scalable on-demand services.
30-31. Swipely must innovate by making payments data actionable, and rapidly iterate, deploying multiple times a day, with a lean team: we have 2 ops engineers.
32-33. Swipely uses AWS Data Pipeline to build batch analytics (fast, dynamic reports by mashing up data from facts), backfilling all our data, using resources efficiently.
36-38. AWS Data Pipeline orchestrates building of documents from facts
[Diagram: Transaction Facts → EMR Data Transformer (EMR) → Intermediate S3 Bucket → Data Post-Processor → insert → Sales by Day Documents, all coordinated by AWS Data Pipeline]
44. Mash up data for efficient processing
[Diagram: Transaction, Card Opt-In, and Customer Spend facts (e.g., "Cafe 5/11 2472 $57", "2472 Bob") are joined with Hive on EMR into Visits and Sales by Day documents (e.g., "Cafe 5/11: $5432 80 new")]
45. AWS Data Pipeline orchestrates building of documents from facts
[Diagram repeated from slide 38: Transaction Facts → EMR Data Transformer → Intermediate S3 Bucket → Data Post-Processor → Sales by Day Documents]
46-47. Swipely uses AWS Data Pipeline to build batch analytics, backfilling all our data (regularly rebuild to rapidly iterate, using agile process), using resources efficiently.
48-49. Regularly rebuild to avoid backfilling
[Diagram: daily transactions and card opt-in feed the Fact Store; Analytics Documents are rebuilt from it for the web service, with Recent Activity layered on top]
54. Leverage agile development process
• Wrap pipeline definition
• Reduce variability
• Quickly diagnose failures
• Automate common tasks
57. Reduce variability
No small instances:
  "coreInstanceType": "m1.large"
Lock versions:
  "installHive": "0.8.1.8"
Security groups by database:
  "securityGroups": [ "customerdb" ]
58. Quickly diagnose failures
Turn on logging:
  "enableDebugging", "logUri", "emrLogUri"
Namespace your logs:
  "s3://#{LOGS_BUCKET}/#{@s3prefix}/#{START_TIME}/SalesByDayEMRLogs"
Log into dev instances:
  "keyPair"
59. Automate common tasks
Clean up:
  "terminateAfter": "6 hours"
Bootstrap your environment:
  {
    "id": "BootstrapEnvironment",
    "type": "ShellCommandActivity",
    "scriptUri": ".../bootstrap_ec2.sh",
    "runsOn": { "ref": "SalesByDayEC2Resource" }
  }
60-61. Swipely uses AWS Data Pipeline to build batch analytics, backfilling all our data, using resources efficiently: scale horizontally, backfilling in 50 min, storing all your data.
64-68. Cost vs latency sweet spot at 50 min
Use smallest capable on-demand instance type: fixed hourly cost, no idle time.
Scale EMR-heavy jobs horizontally: cost(1 instance, N hours) = cost(N instances, 1 hour).
Target < 1 hour: ~10 min runtime variability.
Crunch 50 GB of facts in 50 min using 40 instances for < $10.
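As a rough sanity check on that figure (assuming the 2013 on-demand price of about $0.24/hour for an m1.large, which is not stated in the deck): 40 instances × 1 hour × $0.24/hour ≈ $9.60, consistent with the quoted < $10.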
69-72. Store all your data - it’s cheap
Store all your facts in Amazon S3 (your source of truth): 50 GB, $5 / month.
Store your analytics documents in Amazon RDS (for indexed queries): 20 GB, $250 / month.
Retain intermediate data in Amazon S3 (for diagnosis): 1.1 TB (60 days), $100 / month.
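For context (assuming 2013-era S3 standard pricing of roughly $0.095 per GB-month, not stated in the deck): 50 GB × $0.095 ≈ $4.75/month and 1,100 GB × $0.095 ≈ $105/month, matching the S3 figures; the higher RDS figure presumably reflects a running database instance, not storage alone.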
73-74. Swipely uses AWS Data Pipeline to build batch analytics, backfilling all our data, using resources efficiently.
76. Please give us your feedback on this presentation (BDT207). As a thank you, we will select prize winners daily for completed surveys!
Thank You