Orchestrating Big Data Integration and Analytics Data Flows with AWS Data Pipeline
Jon Einkauf (Sr. Product Manager, AWS)
Anthony Accardi (Head of Engineering, Swipely)
November 14, 2013

© 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
Friday, November 15, 13
What are some of the challenges in dealing with data?

1. Data is stored in different formats and locations, making it hard to integrate
• Amazon Redshift
• Amazon RDS
• Amazon S3
• Amazon EMR
• Amazon DynamoDB
• On-Premises

2. Data workflows require complex dependencies
• For example, a data processing step may depend on:
  • Input data being ready
  • Prior step completing
  • Time of day
  • Etc.
[Flowchart: "Input Data Ready?" with No and Yes branches, then "Run…"]

3. Things go wrong - you must handle exceptions
• For example, do you want to:
  • Retry in the case of failure?
  • Wait if a dependent step is taking longer than expected?
  • Be notified if something goes wrong?

4. Existing tools are not a good fit
• Expensive upfront licenses
• Scaling issues
• Don’t support scheduling
• Not designed for the cloud
• Don’t support newer data stores (e.g., Amazon DynamoDB)

Introducing AWS Data Pipeline

A simple pipeline
• Input DataNode with PreCondition check
• Activity with failure & delay notifications
• Output DataNode

Manages scheduled data movement and processing across AWS services
Services: Amazon Redshift, Amazon RDS, Amazon S3, Amazon EMR, Amazon DynamoDB
Activities:
• Copy
• MapReduce
• Hive
• Pig (New)
• SQL (New)
• Shell command

Facilitates periodic data movement to/from AWS
• Amazon Redshift
• Amazon RDS
• Amazon S3
• Amazon EMR
• Amazon DynamoDB
• On-Premises

Supports dependencies (Preconditions)
• Amazon DynamoDB table exists/has data
• Amazon S3 key exists
• Amazon S3 prefix is not empty
• Success of custom Unix/Linux shell command
• Success of other pipeline tasks
[Flowchart: "S3 key exists?" with No and Yes branches, then "Copy…"]

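The "S3 key exists?" check above could be expressed in a pipeline definition roughly as follows. This is a minimal sketch, not taken from the deck: it builds the objects as Ruby hashes and prints them as JSON. The ids (`InputReady`, `CopyLogs`, `InputData`, `OutputData`) and the S3 path are hypothetical, and the `S3KeyExists` type and `precondition` field are used as I understand the AWS Data Pipeline definition syntax.

```ruby
require "json"

# Builds two pipeline objects: a precondition and an activity that waits on it.
# All ids and the S3 key are made-up placeholders for illustration.
def copy_with_precondition(s3_key)
  [
    { "id" => "InputReady", "type" => "S3KeyExists", "s3Key" => s3_key },
    {
      "id" => "CopyLogs",
      "type" => "CopyActivity",
      # The activity is only attempted once the referenced precondition passes.
      "precondition" => { "ref" => "InputReady" },
      # InputData and OutputData would be data nodes defined elsewhere.
      "input" => { "ref" => "InputData" },
      "output" => { "ref" => "OutputData" }
    }
  ]
end

objects = copy_with_precondition("s3://example-bucket/logs/ready.flag")
puts JSON.pretty_generate("objects" => objects)
```

Keeping the precondition as its own object means several activities can reference the same check by id.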
Alerting and exception handling
• Notification
  • On failure
  • On delay
• Automatic retry logic
[Diagram: Task 1 and Task 2 each end in Success or Failure; Failure leads to Alert]

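The retry-then-alert behaviour described above can be illustrated with a small generic sketch. It is not the service's implementation: `run_with_retries`, `max_retries`, and `on_fail` are hypothetical names chosen for this example, and the "alert" is just a callback.

```ruby
# Runs the block up to max_retries + 1 times.
# If every attempt raises, calls on_fail once with the last error and returns nil.
def run_with_retries(max_retries:, on_fail:)
  attempts = 0
  begin
    attempts += 1
    yield attempts
  rescue StandardError => e
    retry if attempts <= max_retries
    on_fail.call(e)
    nil
  end
end

alerts = []
notify = ->(error) { alerts << error.message }

# A task that fails twice and then succeeds: no alert is raised.
result = run_with_retries(max_retries: 2, on_fail: notify) do |attempt|
  raise "transient failure" if attempt < 3
  "ok on attempt #{attempt}"
end
puts result
puts "alerts sent: #{alerts.length}"
```

The alert fires only after the retry budget is exhausted, so transient failures stay quiet.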
Flexible scheduling
• Choose a schedule
  • Run every: 15 minutes, hour, day, week, etc.
  • User defined
• Backfill support
  • Start pipeline on past date
  • Rapidly backfills to present day

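Backfill amounts to enumerating every scheduled run between a past start date and now. The sketch below illustrates that arithmetic only; `scheduled_runs` is a hypothetical helper, the dates are made up, and it does not model how the service itself orders or parallelises the runs.

```ruby
# Lists every scheduled run time from start_time up to and including now.
def scheduled_runs(start_time, period_seconds, now)
  runs = []
  t = start_time
  while t <= now
    runs << t
    t += period_seconds
  end
  runs
end

one_day = 24 * 60 * 60
start   = Time.utc(2013, 11, 1)   # pipeline start date set in the past
now     = Time.utc(2013, 11, 14)  # "present day" for this example

runs = scheduled_runs(start, one_day, now)
puts "#{runs.length} daily runs to backfill, first #{runs.first}, last #{runs.last}"
```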
Massively scalable
• Creates and terminates AWS resources (Amazon EC2 and Amazon EMR) to process data
• Manage resources in multiple regions

Easy to get started
• Templates for common use cases
• Graphical interface
• Natively understands CSV and TSV
• Automatically configures Amazon EMR clusters

Inexpensive
• Free tier
• Pay per activity/precondition
• No commitment
• Simple pricing:

An ETL example (1 of 2)
• Combine logs in Amazon S3 with customer data in Amazon RDS
• Process using Hive on Amazon EMR
• Put output in Amazon S3
• Load into Amazon Redshift
• Run SQL query and load table for BI tools

An ETL example (2 of 2)
• Run on a schedule (e.g. hourly)
• Use a precondition to make Hive activity depend on Amazon S3 logs being available
• Set up Amazon SNS notification on failure
• Change default retry logic

Swipely

How big is your data?
1 TB
Do you have a big data problem?
• Don’t use Hadoop: your data isn’t that big.
• Keep your data small and manageable.

Get ahead of your Big Data
don’t wait for data to become a problem
• Build novel product features with a batch architecture
• Decrease development time by easily backfilling data
• Vastly simplify operations with scalable on-demand services

Swipely must innovate by making payments data actionable
and rapidly iterate, deploying multiple times a day
with a lean team: we have 2 ops engineers

Swipely uses AWS Data Pipeline to
build batch analytics,
backfilling all our data,
using resources efficiently.

Build batch analytics: fast, dynamic reports by mashing up data from facts.

Generate fast, dynamic reports

AWS Data Pipeline orchestrates building of documents from facts
[Diagram, all under AWS Data Pipeline: Transaction Facts, EMR Data Transformer (EMR), Intermediate S3 Bucket, Data Post-Processor (insert), Sales by Day Documents]

Mash up data for efficient processing

Transactions
  Cafe 3/30 4980 $72
  Spa 5/11 8278 $140
  Cafe 5/11 2472 $57

EMR

Visits
  Cafe 2472 5/11: $57 0 new
  Cafe 4980 3/30: $72 1 new
  Cafe 4980 5/11: $49 0 new

EMR

Sales by Day
  Cafe 5/10: $4030 60 new
  Cafe 5/11: $5432 80 new
  Cafe 5/12: $6292 135 new

Card Opt-In
  2472 Bob
  8278 Mary

Hive (EMR)

Customer Spend
  Mary 5/11: $309
  4980 5/11: $218
  Bob 5/11: $198
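The Transactions to Sales by Day roll-up above is the kind of job a streaming mapper and reducer pair can do. The following is a minimal sketch only: the deck does not show the contents of `sales_by_day_mapper.rb` or `sales_by_day_reducer.rb`, and the comma-separated input layout (`merchant,date,card,amount`) is an assumption made for this example. The sample rows are the three transactions from the slide.

```ruby
# Mapper: one "merchant,date,card,amount" line in,
# one "merchant date<TAB>amount" line out.
def map_line(line)
  merchant, date, _card, amount = line.strip.split(",")
  "#{merchant} #{date}\t#{amount.delete('$')}"
end

# Reducer: sums the amounts for each "merchant date" key.
def reduce_lines(lines)
  totals = Hash.new(0)
  lines.each do |line|
    key, amount = line.split("\t")
    totals[key] += Integer(amount)
  end
  totals
end

transactions = [
  "Cafe,3/30,4980,$72",
  "Spa,5/11,8278,$140",
  "Cafe,5/11,2472,$57"
]

# Sorting stands in for the shuffle step between map and reduce.
mapped = transactions.map { |line| map_line(line) }.sort
totals = reduce_lines(mapped)
totals.each { |key, total| puts "#{key}: $#{total}" }
```

In a real streaming job each function would read from standard input and write to standard output; they are plain functions here so the logic can be checked in isolation.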

Backfilling all our data: regularly rebuild to rapidly iterate, using agile process.

Regularly rebuild to avoid backfilling
[Diagram: web service, Recent Activity, Analytics Documents (daily), transactions, card opt-in, Fact Store]

Minor changes require little work
change accounting rules without a migration

Rapidly iterate your product
redefine “best”

Leverage agile development process
• Wrap pipeline definition
• Reduce variability
• Quickly diagnose failures
• Automate common tasks

Wrap pipeline definition
{
  "id": "GenerateSalesByDay",
  "type": "EmrActivity",
  "onFail": { "ref": "FailureNotify" },
  "schedule": { "ref": "Nightly" },
  "runsOn": { "ref": "SalesByDayEMRCluster" },
  "dependsOn": { "ref": "GenerateIndexedSwipes" },
  "step": "/.../hadoop-streaming.jar,
    -input, s3n://<%= s3_data_path %>/indexed_swipes.csv,
    -output, s3://<%= s3_data_path %>/sales_by_day,
    -mapper, s3n://<%= s3_code_path %>/sales_by_day_mapper.rb,
    -reducer, s3n://<%= s3_code_path %>/sales_by_day_reducer.rb"
}

Wrap pipeline definition
{
  "id": "GenerateSalesByDay",
  "type": "EmrActivity",
  "onFail": { "ref": "FailureNotify" },
  "schedule": { "ref": "Nightly" },
  "runsOn": { "ref": "SalesByDayEMRCluster" },
  "dependsOn": { "ref": "GenerateIndexedSwipes" },
  "step": "<%= streaming_hadoop_step(
    input: '/indexed_swipes.csv',
    output: '/sales_by_day',
    mapper: '/sales_by_day_mapper.rb',
    reducer: '/sales_by_day_reducer.rb'
  ) %>"
}
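The slides show only the call site of `streaming_hadoop_step`, not its body. One way such a helper could work is sketched below, assuming it simply expands the four paths against the data and code prefixes seen in the longer template. The `PipelineTemplate` class, the bucket names, and the jar value are hypothetical (the deck elides the real jar path), and the `s3n://` and `s3://` schemes are copied from the original step string.

```ruby
require "erb"

# Hypothetical wrapper: holds per-environment settings and renders ERB templates.
class PipelineTemplate
  def initialize(jar:, s3_data_path:, s3_code_path:)
    @jar = jar
    @s3_data_path = s3_data_path
    @s3_code_path = s3_code_path
  end

  # Expands short paths into a comma-separated streaming step string.
  def streaming_hadoop_step(input:, output:, mapper:, reducer:)
    [
      @jar,
      "-input",   "s3n://#{@s3_data_path}#{input}",
      "-output",  "s3://#{@s3_data_path}#{output}",
      "-mapper",  "s3n://#{@s3_code_path}#{mapper}",
      "-reducer", "s3n://#{@s3_code_path}#{reducer}"
    ].join(",")
  end

  # Renders a template with this object's methods in scope.
  def render(template)
    ERB.new(template).result(binding)
  end
end

template = <<~TEMPLATE
  {
    "id": "GenerateSalesByDay",
    "step": "<%= streaming_hadoop_step(
      input: '/indexed_swipes.csv',
      output: '/sales_by_day',
      mapper: '/sales_by_day_mapper.rb',
      reducer: '/sales_by_day_reducer.rb'
    ) %>"
  }
TEMPLATE

pipeline = PipelineTemplate.new(
  jar: "hadoop-streaming.jar",          # placeholder, not the real jar location
  s3_data_path: "example-bucket/data",  # made-up bucket and prefix
  s3_code_path: "example-bucket/code"
)
rendered = pipeline.render(template)
puts rendered
```

Centralising the path expansion keeps each activity definition to the four values that actually differ between jobs.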
Reduce variability
No small instances
  "coreInstanceType": "m1.large"

Lock versions
  "installHive": "0.8.1.8"

Security groups by database
  "securityGroups": [ "customerdb" ]

Quickly diagnose failures
Turn on logging
"enableDebugging", "logUri", "emrLogUri"

Namespace your logs
"s3://#{LOGS_BUCKET}/#{@s3prefix}/#{START_TIME}/SalesByDayEMRLogs"

Log into dev instances
"keyPair"

Automate common tasks
Clean up
  "terminateAfter": "6 hours"

Bootstrap your environment
  {
    "id": "BootstrapEnvironment",
    "type": "ShellCommandActivity",
    "scriptUri": ".../bootstrap_ec2.sh",
    "runsOn": { "ref": "SalesByDayEC2Resource" }
  }

Using resources efficiently: scale horizontally, backfilling in 50 min, storing all your data.

Scale Amazon EMR pipelines horizontally

Cost vs latency sweet spot at 50 min
Use smallest capable on-demand instance type
  fixed hourly cost, no idle time
Scale EMR-heavy jobs horizontally
  cost ( 1 instance, N hours ) = cost ( N instances, 1 hour )
Target < 1 hour
  ~10 min runtime variability
Crunch 50 GB facts in 50 min
  using 40 instances for < $10

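The cost identity above follows from billing each instance in whole hours. The sketch below works through that arithmetic; `cluster_cost_cents` is a hypothetical helper, and the rate of 20 cents per instance-hour is an illustrative number, not a quoted AWS price.

```ruby
# Cost in cents when every instance is billed for each started hour.
def cluster_cost_cents(instances:, runtime_minutes:, rate_cents_per_hour:)
  billed_hours = (runtime_minutes / 60.0).ceil
  instances * billed_hours * rate_cents_per_hour
end

RATE = 20 # hypothetical cents per instance-hour

one_slow    = cluster_cost_cents(instances: 1,  runtime_minutes: 40 * 60, rate_cents_per_hour: RATE)
forty_fast  = cluster_cost_cents(instances: 40, runtime_minutes: 50,      rate_cents_per_hour: RATE)
forty_late  = cluster_cost_cents(instances: 40, runtime_minutes: 65,      rate_cents_per_hour: RATE)

puts "1 instance for 40 hours:    #{one_slow} cents"
puts "40 instances for 50 min:    #{forty_fast} cents"
puts "40 instances for 65 min:    #{forty_late} cents"
```

Under this model the 40-instance run costs the same as the single-instance run while finishing far sooner, but slipping past the hour doubles the bill. That is why a 50 minute target with roughly 10 minutes of runtime variability sits just inside one billed hour.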
Store all your data - it’s cheap
Store all your facts in Amazon S3
  your source of truth: 50 GB, $5 / month
Store your analytics documents in Amazon RDS
  for indexed queries: 20 GB, $250 / month
Retain intermediate data in Amazon S3
  for diagnosis: 1.1 TB (60 days), $100 / month

Swipely uses AWS Data Pipeline to
build batch analytics,
backfilling all our data,
using resources efficiently.

Please give us your feedback on this presentation
BDT207
As a thank you, we will select prize winners daily for completed surveys!

Thank You

Big Data Integration & Analytics Data Flows with AWS Data Pipeline (BDT207) | AWS re:Invent 2013

How Amazon.com is Leveraging Amazon Redshift (DAT306) | AWS re:Invent 2013Amazon Web Services
 
How Parse Built a Mobile Backend as a Service on AWS (MBL307) | AWS re:Invent...
How Parse Built a Mobile Backend as a Service on AWS (MBL307) | AWS re:Invent...How Parse Built a Mobile Backend as a Service on AWS (MBL307) | AWS re:Invent...
How Parse Built a Mobile Backend as a Service on AWS (MBL307) | AWS re:Invent...Amazon Web Services
 
Scaling a SaaS backend with PostgreSQL - A case study
Scaling a SaaS backend with PostgreSQL - A case studyScaling a SaaS backend with PostgreSQL - A case study
Scaling a SaaS backend with PostgreSQL - A case studyOliver Seemann
 
BDA308 Serverless Analytics with Amazon Athena and Amazon QuickSight, featuri...
BDA308 Serverless Analytics with Amazon Athena and Amazon QuickSight, featuri...BDA308 Serverless Analytics with Amazon Athena and Amazon QuickSight, featuri...
BDA308 Serverless Analytics with Amazon Athena and Amazon QuickSight, featuri...Amazon Web Services
 

Big Data Integration & Analytics Data Flows with AWS Data Pipeline (BDT207) | AWS re:Invent 2013

  • 1. Orchestrating Big Data Integration and Analytics Data Flows with AWS Data Pipeline Jon Einkauf (Sr. Product Manager, AWS) Anthony Accardi (Head of Engineering, Swipely) November 14, 2013 © 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc. Friday, November 15, 13
  • 2. What are some of the challenges in dealing with data?
  • 3. 1. Data is stored in different formats and locations, making it hard to integrate: Amazon Redshift, Amazon RDS, Amazon S3, Amazon EMR, Amazon DynamoDB, On-Premises
  • 4. 2. Data workflows require complex dependencies. For example, a data processing step may depend on: input data being ready; prior step completing; time of day; etc. (Diagram: Input Data Ready? No / Yes: Run…)
  • 5. 3. Things go wrong - you must handle exceptions. For example, do you want to: retry in the case of failure? Wait if a dependent step is taking longer than expected? Be notified if something goes wrong?
  • 6. 4. Existing tools are not a good fit: expensive upfront licenses; scaling issues; don’t support scheduling; not designed for the cloud; don’t support newer data stores (e.g., Amazon DynamoDB)
  • 7. Introducing AWS Data Pipeline
  • 8. A simple pipeline: Input DataNode with PreCondition check; Activity with failure & delay notifications; Output DataNode
  • 9. Manages scheduled data movement and processing across AWS services (Amazon Redshift, Amazon RDS, Amazon S3, Amazon EMR, Amazon DynamoDB). Activities: Copy; MapReduce; Hive; Pig (New); SQL (New); Shell command
  • 10. Facilitates periodic data movement to/from AWS: Amazon Redshift, Amazon RDS, Amazon S3, Amazon EMR, Amazon DynamoDB, On-Premises
  • 11. Supports dependencies (Preconditions): Amazon DynamoDB table exists/has data; Amazon S3 key exists; Amazon S3 prefix is not empty; success of custom Unix/Linux shell command; success of other pipeline tasks. (Diagram: S3 key exists? No / Yes: Copy…)
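
The precondition idea on slide 11 can be sketched in a few lines of Python. This is an illustrative model only, not how AWS Data Pipeline is implemented: `Precondition`, `run_if_ready`, `s3_key_exists`, and the in-memory `bucket` set are hypothetical names standing in for real checks against Amazon S3.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Precondition:
    """A named check that must pass before an activity may run."""
    name: str
    check: Callable[[], bool]


def run_if_ready(preconditions: List[Precondition], activity: Callable[[], str]) -> str:
    """Run the activity only when every precondition passes."""
    failed = [p.name for p in preconditions if not p.check()]
    if failed:
        return "WAITING: " + ", ".join(failed)
    return activity()


# Hypothetical stand-in for the keys present in an S3 bucket.
bucket = set()


def s3_key_exists(key: str) -> Precondition:
    """Build a precondition that passes once the key is present."""
    return Precondition(name=f"S3 key exists ({key})", check=lambda: key in bucket)


def copy_activity() -> str:
    return "COPIED"


checks = [s3_key_exists("logs/2013-11-14.csv")]
before = run_if_ready(checks, copy_activity)  # key missing, activity is held back
bucket.add("logs/2013-11-14.csv")
after = run_if_ready(checks, copy_activity)   # key present, activity runs
```

Modeling each precondition as a callable keeps the other checks on the slide (table has data, prefix not empty, shell command succeeded) interchangeable with the S3 key check.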
  • 12. Alerting and exception handling: notification on failure and on delay; automatic retry logic. (Diagram: Task 1 and Task 2 each end in Success, or in Failure followed by an Alert)
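
As a rough illustration of "automatic retry logic" combined with a failure notification, the sketch below retries a task a fixed number of times and calls an alert hook only when the retries are exhausted. The function name `run_with_retries`, the `max_retries` parameter, and the list-based alert sink are assumptions for this example; delay notifications are left out to keep it short.

```python
from typing import Callable, List


def run_with_retries(task: Callable[[], str], max_retries: int,
                     alert: Callable[[str], None]) -> str:
    """Run task, retrying up to max_retries times after the first failure.

    The alert hook fires once, only after the final attempt has failed.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            return task()
        except Exception as exc:
            if attempts > max_retries:
                alert(f"task failed after {attempts} attempts: {exc}")
                raise


alerts: List[str] = []  # hypothetical alert sink (a real setup might publish to a topic)
calls = {"count": 0}


def flaky_task() -> str:
    """Fails on the first two calls, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("transient error")
    return "ok"


result = run_with_retries(flaky_task, max_retries=3, alert=alerts.append)
```

Because the flaky task recovers within the retry budget, no alert is recorded; a task that never recovers raises after the last attempt and leaves exactly one alert behind.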
  • 13. Flexible scheduling. Choose a schedule: run every 15 minutes, hour, day, week, etc.; user defined. Backfill support: start pipeline on past date; rapidly backfills to present day
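
Backfilling from a past start date amounts to enumerating every scheduled period between that date and now. The sketch below shows that arithmetic under a simplifying assumption: each period's start counts as a run time. The service's actual scheduling semantics may differ, and `scheduled_runs` is a hypothetical helper.

```python
from datetime import datetime, timedelta
from typing import List


def scheduled_runs(start: datetime, now: datetime, period: timedelta) -> List[datetime]:
    """List every run time from start up to and including now, one per period."""
    if period <= timedelta(0):
        raise ValueError("period must be positive")
    runs = []
    current = start
    while current <= now:
        runs.append(current)
        current += period
    return runs


# A daily pipeline started "in the past" has several runs to backfill.
backfill = scheduled_runs(
    start=datetime(2013, 11, 1),
    now=datetime(2013, 11, 4, 12, 0),
    period=timedelta(days=1),
)
for run_time in backfill:
    print("backfill run for", run_time.date())
```

Each backfilled run is independent of the others, which is what allows the catch-up to proceed quickly rather than in real time.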
  • 14. Massively scalable: creates and terminates AWS resources (Amazon EC2 and Amazon EMR) to process data; manages resources in multiple regions
  • 15. Easy to get started: templates for common use cases; graphical interface; natively understands CSV and TSV; automatically configures Amazon EMR clusters
  • 16. Inexpensive: free tier; pay per activity/precondition; no commitment; simple pricing
  • 17. An ETL example (1 of 2): combine logs in Amazon S3 with customer data in Amazon RDS; process using Hive on Amazon EMR; put output in Amazon S3; load into Amazon Redshift; run SQL query and load table for BI tools
  • 18. An ETL example (2 of 2): run on a schedule (e.g. hourly); use a precondition to make Hive activity depend on Amazon S3 logs being available; set up Amazon SNS notification on failure; change default retry logic
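
The four settings on slide 18 could be expressed as a pipeline definition along the following lines. This is a partial sketch built as a Python dictionary and serialized to JSON: the object types and field names (`Schedule`, `S3KeyExists`, `SnsAlarm`, `HiveActivity`, `maximumRetries`, `onFail`, and so on) are assumptions modeled on the pipeline definition format and should be checked against the AWS documentation. Every id, bucket path, and ARN is a placeholder, and the input and output data nodes a real Hive activity needs are omitted. The `unresolved_refs` helper is hypothetical and only checks that references inside the sketch point at defined objects.

```python
import json

# Partial, illustrative definition; all ids, paths, and ARNs are placeholders.
definition = {
    "objects": [
        # Run on a schedule (e.g. hourly).
        {"id": "HourlySchedule", "type": "Schedule",
         "period": "1 hour", "startDateTime": "2013-11-14T00:00:00"},
        # Precondition: the S3 logs must be available.
        {"id": "LogsAvailable", "type": "S3KeyExists",
         "s3Key": "s3://example-bucket/logs/_SUCCESS"},
        # Amazon SNS notification on failure.
        {"id": "FailureAlarm", "type": "SnsAlarm",
         "topicArn": "arn:aws:sns:us-east-1:123456789012:example-topic",
         "subject": "Hive step failed",
         "message": "The hourly Hive activity failed."},
        # Cluster the Hive activity runs on.
        {"id": "ExampleCluster", "type": "EmrCluster",
         "schedule": {"ref": "HourlySchedule"}},
        # The Hive activity, with non-default retry behavior.
        {"id": "ProcessLogs", "type": "HiveActivity",
         "schedule": {"ref": "HourlySchedule"},
         "precondition": {"ref": "LogsAvailable"},
         "onFail": {"ref": "FailureAlarm"},
         "runsOn": {"ref": "ExampleCluster"},
         "maximumRetries": "2",
         "scriptUri": "s3://example-bucket/scripts/process_logs.q"},
    ]
}


def unresolved_refs(pipeline: dict) -> list:
    """Return (object id, field, ref) for every reference to an undefined object."""
    ids = {obj["id"] for obj in pipeline["objects"]}
    missing = []
    for obj in pipeline["objects"]:
        for field, value in obj.items():
            if isinstance(value, dict) and "ref" in value and value["ref"] not in ids:
                missing.append((obj["id"], field, value["ref"]))
    return missing


print(json.dumps(definition, indent=2))
```

Keeping the schedule, precondition, and alarm as separate referenced objects lets several activities share them without repeating their settings.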
  • 20. How big is your data? 1 TB
  • 21. How big is your data? Do you have a big data problem?
  • 22. How big is your data? Don’t use Hadoop: your data isn’t that big. Do you have a big data problem?
  • 23. How big is your data? Don’t use Hadoop: your data isn’t that big. Keep your data small and manageable. Do you have a big data problem?
  • 24. Get ahead of your Big Data: don’t wait for data to become a problem.
  • 25. Get ahead of your Big Data: don’t wait for data to become a problem. Build novel product features with a batch architecture.
  • 26. Get ahead of your Big Data: don’t wait for data to become a problem. Build novel product features with a batch architecture. Decrease development time by easily backfilling data.
  • 27. Get ahead of your Big Data: don’t wait for data to become a problem. Build novel product features with a batch architecture. Decrease development time by easily backfilling data. Vastly simplify operations with scalable on-demand services.
  • 29. must innovate by making payments data actionable
  • 30. must innovate by making payments data actionable
      • and rapidly iterate, deploying multiple times a day
  • 31. must innovate by making payments data actionable
      • and rapidly iterate, deploying multiple times a day
      • with a lean team: we have 2 ops engineers
  • 32. Swipely uses AWS Data Pipeline to build batch analytics, backfilling all our data, using resources efficiently.
  • 33. Swipely uses AWS Data Pipeline to build batch analytics, backfilling all our data, using resources efficiently.
      • Build batch analytics: fast, dynamic reports by mashing up data from facts.
  • 34. Generate fast, dynamic reports
  • 36. AWS Data Pipeline orchestrates building of documents from facts
      • [Diagram labels: Transaction Facts; EMR; Intermediate S3 Bucket; insert; Sales by Day Documents]
  • 37. AWS Data Pipeline orchestrates building of documents from facts
      • [Diagram labels: Transaction Facts; Data Transformer; EMR; Intermediate S3 Bucket; Data Post-Processor; EMR; insert; Sales by Day Documents]
  • 38. AWS Data Pipeline orchestrates building of documents from facts
      • [Diagram labels: AWS Data Pipeline; Transaction Facts; Data Transformer; EMR; Intermediate S3 Bucket; Data Post-Processor; EMR; insert; Sales by Day Documents]
  • 40. Mash up data for efficient processing
      • Transactions: Cafe 3/30 4980 $72; Spa 5/11 8278 $140; Cafe 5/11 2472 $57
      • Processing step: EMR
      • Sales by Day: Cafe 5/10: $4030; Cafe 5/11: $5432; Cafe 5/12: $6292
  • 42. Mash up data for efficient processing
      • Transactions: Cafe 3/30 4980 $72; Spa 5/11 8278 $140; Cafe 5/11 2472 $57
      • Visits: Cafe 2472 5/11: $57, 0 new; Cafe 4980 3/30: $72, 1 new; Cafe 4980 5/11: $49, 0 new
      • Sales by Day: Cafe 5/10: $4030, 60 new; Cafe 5/11: $5432, 80 new; Cafe 5/12: $6292, 135 new
      • Processing steps: EMR, EMR
  • 44. Mash up data for efficient processing
      • Transactions: Cafe 3/30 4980 $72; Spa 5/11 8278 $140; Cafe 5/11 2472 $57
      • Visits: Cafe 2472 5/11: $57, 0 new; Cafe 4980 3/30: $72, 1 new; Cafe 4980 5/11: $49, 0 new
      • Sales by Day: Cafe 5/10: $4030, 60 new; Cafe 5/11: $5432, 80 new; Cafe 5/12: $6292, 135 new
      • Card Opt-In: 2472 Bob; 8278 Mary
      • Customer Spend: Mary 5/11: $309; 4980 5/11: $218; Bob 5/11: $198
      • Processing steps: EMR, EMR, Hive (EMR)
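The Transactions to Sales by Day step on these slides is, at its core, a group-and-sum. The sketch below shows that aggregation in plain in-memory Ruby rather than as an EMR job, so it illustrates the idea only and is not Swipely's mapper or reducer. The column interpretation (merchant, date, card, amount) is my assumption from the sample rows, and the fourth row borrows its amount from the Visits table on the assumption that it is also a transaction.

```ruby
# Column meanings are assumed from the sample rows on the slides.
Transaction = Struct.new(:merchant, :date, :card, :amount)

TRANSACTIONS = [
  Transaction.new("Cafe", "3/30", "4980", 72),
  Transaction.new("Spa",  "5/11", "8278", 140),
  Transaction.new("Cafe", "5/11", "2472", 57),
  Transaction.new("Cafe", "5/11", "4980", 49) # assumed: taken from the Visits table
].freeze

# Groups transactions by (merchant, date) and sums the amounts,
# which is the shape of a "Sales by Day" document.
def sales_by_day(transactions)
  transactions
    .group_by { |t| [t.merchant, t.date] }
    .transform_values { |rows| rows.sum(&:amount) }
end

sales_by_day(TRANSACTIONS).each do |(merchant, date), total|
  puts "#{merchant} #{date}: $#{total}"
end
```

The totals here cover only the four sample rows, so they will not match the Sales by Day figures on the slides, which presumably come from the full data set.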
  • 45. AWS Data Pipeline orchestrates building of documents from facts
      • [Diagram labels: AWS Data Pipeline; Transaction Facts; Data Transformer; EMR; Intermediate S3 Bucket; Data Post-Processor; EMR; insert; Sales by Day Documents]
  • 46. Swipely uses AWS Data Pipeline to build batch analytics, backfilling all our data, using resources efficiently.
  • 47. Swipely uses AWS Data Pipeline to build batch analytics, backfilling all our data, using resources efficiently.
      • Backfilling all our data: regularly rebuild to rapidly iterate, using agile process.
  • 48. Regularly rebuild to avoid backfilling
      • [Diagram labels: web service; Analytics Documents; daily; transactions; card opt-in; Fact Store]
  • 49. Regularly rebuild to avoid backfilling
      • [Diagram labels: web service; Recent Activity; Analytics Documents; daily; transactions; card opt-in; Fact Store]
  • 50. Minor changes require little work
  • 51. Minor changes require little work
      • Change accounting rules without a migration
  • 52. Rapidly iterate your product
  • 53. Rapidly iterate your product
      • Redefine “best”
  • 54. Leverage agile development process
      • Wrap pipeline definition
      • Reduce variability
      • Quickly diagnose failures
      • Automate common tasks
  • 55. Wrap pipeline definition
      {
        "id": "GenerateSalesByDay",
        "type": "EmrActivity",
        "onFail": { "ref": "FailureNotify" },
        "schedule": { "ref": "Nightly" },
        "runsOn": { "ref": "SalesByDayEMRCluster" },
        "dependsOn": { "ref": "GenerateIndexedSwipes" },
        "step": "/.../hadoop-streaming.jar, -input, s3n://<%= s3_data_path %>/indexed_swipes.csv, -output, s3://<%= s3_data_path %>/sales_by_day, -mapper, s3n://<%= s3_code_path %>/sales_by_day_mapper.rb, -reducer, s3n://<%= s3_code_path %>/sales_by_day_reducer.rb"
      }
  • 56. Wrap pipeline definition
      {
        "id": "GenerateSalesByDay",
        "type": "EmrActivity",
        "onFail": { "ref": "FailureNotify" },
        "schedule": { "ref": "Nightly" },
        "runsOn": { "ref": "SalesByDayEMRCluster" },
        "dependsOn": { "ref": "GenerateIndexedSwipes" },
        "step": "<%= streaming_hadoop_step( input: '/indexed_swipes.csv', output: '/sales_by_day', mapper: '/sales_by_day_mapper.rb', reducer: '/sales_by_day_reducer.rb' ) %>"
      }
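The slides do not show how `streaming_hadoop_step` is implemented. One way such a helper could expand into the long step string from slide 55 is sketched below. The constants `S3_DATA_PATH` and `S3_CODE_PATH` and the function `render_activity` are hypothetical names standing in for the `s3_data_path` and `s3_code_path` template variables, and the jar path is left elided exactly as it appears on the slide.

```ruby
require "erb"
require "json"

# Hypothetical values standing in for the template variables on slide 55.
S3_DATA_PATH  = "example-bucket/data"
S3_CODE_PATH  = "example-bucket/code"
STREAMING_JAR = "/.../hadoop-streaming.jar" # elided on the slide; left as-is

# Expands short relative paths into the comma-separated step string.
# The s3n:// and s3:// schemes mirror the ones used on slide 55.
def streaming_hadoop_step(input:, output:, mapper:, reducer:)
  [
    STREAMING_JAR,
    "-input",   "s3n://#{S3_DATA_PATH}#{input}",
    "-output",  "s3://#{S3_DATA_PATH}#{output}",
    "-mapper",  "s3n://#{S3_CODE_PATH}#{mapper}",
    "-reducer", "s3n://#{S3_CODE_PATH}#{reducer}"
  ].join(",")
end

ACTIVITY_TEMPLATE = <<~'TEMPLATE'
  {
    "id": "GenerateSalesByDay",
    "type": "EmrActivity",
    "step": "<%= streaming_hadoop_step(input: '/indexed_swipes.csv', output: '/sales_by_day', mapper: '/sales_by_day_mapper.rb', reducer: '/sales_by_day_reducer.rb') %>"
  }
TEMPLATE

# Renders the ERB template into plain JSON text.
def render_activity
  ERB.new(ACTIVITY_TEMPLATE).result(binding)
end

puts render_activity
```

Keeping the bucket layout and jar location in one helper means each activity only states what differs between steps: its input, output, mapper, and reducer.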
  • 57. Reduce variability
      • No small instances: "coreInstanceType": "m1.large"
      • Lock versions: "installHive": "0.8.1.8"
      • Security groups by database: "securityGroups": [ "customerdb" ]
  • 58. Quickly diagnose failures
      • Turn on logging: "enableDebugging", "logUri", "emrLogUri"
      • Namespace your logs: "s3://#{LOGS_BUCKET}/#{@s3prefix}/#{START_TIME}/SalesByDayEMRLogs"
      • Log into dev instances: "keyPair"
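The namespaced log path on slide 58 is a Ruby interpolated string. A minimal sketch of how it could be built is shown below; the class `PipelineLogs`, the bucket name, and the timestamp format are assumptions for illustration, since the slide only shows the string itself.

```ruby
# Hypothetical bucket name; the slide only references LOGS_BUCKET.
LOGS_BUCKET = "example-logs-bucket"

# Builds per-run log locations so that logs from different
# environments and different runs never overwrite each other.
class PipelineLogs
  def initialize(s3prefix, start_time)
    @s3prefix = s3prefix
    # Assumed format: a compact UTC timestamp that sorts chronologically.
    @start_time = start_time.utc.strftime("%Y%m%dT%H%M%SZ")
  end

  def uri(name)
    "s3://#{LOGS_BUCKET}/#{@s3prefix}/#{@start_time}/#{name}"
  end
end

logs = PipelineLogs.new("dev-alice", Time.utc(2013, 11, 15, 2, 0, 0))
puts logs.uri("SalesByDayEMRLogs")
```

With a prefix per environment or developer and a start time per run, finding the logs for one failed run is a matter of listing a single S3 prefix.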
  • 59. Automate common tasks
      • Clean up: "terminateAfter": "6 hours"
      • Bootstrap your environment:
      {
        "id": "BootstrapEnvironment",
        "type": "ShellCommandActivity",
        "scriptUri": ".../bootstrap_ec2.sh",
        "runsOn": { "ref": "SalesByDayEC2Resource" }
      }
  • 60. Swipely uses AWS Data Pipeline to build batch analytics, backfilling all our data, using resources efficiently.
  • 61. Swipely uses AWS Data Pipeline to build batch analytics, backfilling all our data, using resources efficiently.
      • Using resources efficiently: scale horizontally, backfilling in 50 min, storing all your data.
  • 62. Scale Amazon EMR pipelines horizontally
  • 63. Scale Amazon EMR pipelines horizontally
  • 64. Cost vs latency sweet spot at 50 min
  • 65. Cost vs latency sweet spot at 50 min
      • Use smallest capable on-demand instance type: fixed hourly cost, no idle time
  • 66. Cost vs latency sweet spot at 50 min
      • Use smallest capable on-demand instance type: fixed hourly cost, no idle time
      • Scale EMR-heavy jobs horizontally: cost(1 instance, N hours) = cost(N instances, 1 hour)
  • 67. Cost vs latency sweet spot at 50 min
      • Use smallest capable on-demand instance type: fixed hourly cost, no idle time
      • Scale EMR-heavy jobs horizontally: cost(1 instance, N hours) = cost(N instances, 1 hour)
      • Target < 1 hour: ~10 min runtime variability
  • 68. Cost vs latency sweet spot at 50 min
      • Use smallest capable on-demand instance type: fixed hourly cost, no idle time
      • Scale EMR-heavy jobs horizontally: cost(1 instance, N hours) = cost(N instances, 1 hour)
      • Target < 1 hour: ~10 min runtime variability
      • Crunch 50 GB facts in 50 min using 40 instances for < $10
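The cost argument on these slides can be checked with a small model. The sketch below assumes that each instance is billed per started hour, which is how I read "fixed hourly cost" on the slide, and uses a hypothetical rate of $0.25 per instance-hour purely for illustration; it is not the rate Swipely paid.

```ruby
# Cost of a job when every instance is billed for each started hour.
# hourly_rate is a hypothetical figure, not a real AWS price.
def billed_cost(instances:, runtime_minutes:, hourly_rate:)
  billed_hours = (runtime_minutes / 60.0).ceil
  instances * billed_hours * hourly_rate
end

RATE = 0.25 # hypothetical dollars per instance-hour

serial   = billed_cost(instances: 1,  runtime_minutes: 40 * 60, hourly_rate: RATE)
parallel = billed_cost(instances: 40, runtime_minutes: 60,      hourly_rate: RATE)
target   = billed_cost(instances: 40, runtime_minutes: 50,      hourly_rate: RATE)
overrun  = billed_cost(instances: 40, runtime_minutes: 65,      hourly_rate: RATE)

puts "1 instance for 40 hours:   $#{serial}"
puts "40 instances for 1 hour:   $#{parallel}"
puts "40 instances for 50 min:   $#{target}"
puts "40 instances for 65 min:   $#{overrun}"
```

Under this model the serial and parallel runs cost the same, so scaling out buys lower latency for free. A run that slips past the hour boundary doubles the bill, which is why a 50 minute target leaves room for the roughly 10 minutes of runtime variability the slide mentions.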
  • 69. Store all your data - it’s cheap
  • 70. Store all your data - it’s cheap
      • Store all your facts in Amazon S3, your source of truth: 50 GB, $5 / month
  • 71. Store all your data - it’s cheap
      • Store all your facts in Amazon S3, your source of truth: 50 GB, $5 / month
      • Store your analytics documents in Amazon RDS for indexed queries: 20 GB, $250 / month
  • 72. Store all your data - it’s cheap
      • Store all your facts in Amazon S3, your source of truth: 50 GB, $5 / month
      • Store your analytics documents in Amazon RDS for indexed queries: 20 GB, $250 / month
      • Retain intermediate data in Amazon S3 for diagnosis: 1.1 TB (60 days), $100 / month
  • 73. Swipely uses AWS Data Pipeline to build batch analytics, backfilling all our data, using resources efficiently.
  • 74. Swipely uses AWS Data Pipeline to build batch analytics, backfilling all our data, using resources efficiently.
  • 76. Thank You
      • Please give us your feedback on this presentation (BDT207)
      • As a thank you, we will select prize winners daily for completed surveys!