DAT306 - How Amazon.com, with One of the
World’s Largest Data Warehouses, is Leveraging
Amazon Redshift
Erik Selberg (selberg@amazon.com) and
Abhishek Agrawal (abhagrwa@amazon.com)
November 14, 2013

© 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
Agenda
• Amazon Data Warehouse Overview
• Amazon Data Warehouse and Amazon Redshift
Integration Project
• Amazon Redshift Best Practices
• Conclusion
Amazon Data Warehouse
Overview
Erik Selberg <selberg@amazon.com>
Amazon Data Warehouse
• Authoritative repository of data for all Amazon
• Petabytes of data

• Existing EDW is Oracle RAC; also using Amazon Elastic MapReduce
and now Amazon Redshift
• Owns and manages the hardware and software infrastructure
– Apart from Oracle DB, just Amazon IP

• Not part of AWS
Introducing the Elephant…
• Mission: Provide customers the best value
– Leverage AWS only if it provides the best value
– We aren’t moving 100% to Amazon Redshift
• Publish best practices
– If AWS isn’t the best, we’ll say so
• There is a conflict of interest
Amazon Data Warehouse Architecture
[Diagram labels: Control Plane (ETL Manager); Existing EDW; Amazon EMR; Amazon Redshift]
Amazon Data Warehouse – Growth Story
• Petabytes of data
• Growth of data volume – YoY storage
requirements have grown 67%
• Growth of processing volume – YoY
processing demand has grown 47%
Long-Term Sustainable Scale
[Chart labels: $$ Wasted; Demand; SAN-based; Redshift]
Coping with Change
[Chart labels: Growth changes; Demand; SAN; Capacity Unmet; Redshift]
Amazon Data Warehouse – Cost per Job
• Our main efficiency metric – Cost per Job (CPJ)

$CapEx $DataCenter $VendorSup port
PeakJobsPe rDay
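As a rough illustration of the metric, the sketch below computes CPJ under the assumption that the three cost terms on the slide are summed and divided by the peak daily job count. The function name and the dollar figures are hypothetical; the talk does not share real numbers.

```python
def cost_per_job(capex, data_center, vendor_support, peak_jobs_per_day):
    """Cost per Job: summed cost terms divided by the peak daily job count.

    Assumes the three cost terms are additive, which the slide implies
    but does not state explicitly.
    """
    if peak_jobs_per_day <= 0:
        raise ValueError("peak_jobs_per_day must be positive")
    return (capex + data_center + vendor_support) / peak_jobs_per_day


# Hypothetical figures, for illustration only.
cpj = cost_per_job(
    capex=600_000,
    data_center=250_000,
    vendor_support=150_000,
    peak_jobs_per_day=10_000,
)
print(f"CPJ = ${cpj:.2f}")
```

Because the denominator is the peak number of jobs, the metric falls when the same infrastructure spend serves more jobs at peak.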
What Drives Cost per Job…
Up?
• Number of disks
– Data gets bigger!
• Number of servers
• Short-sighted negotiations
– 4th year support…
• Data Center costs (power, rent)
• Software (e.g. DBM)

Down?
• Bidding
– 2+ vendors
• Moore’s Law
– Vendors fight this!
• Data design
Current State and Problems
• Existing EDW
– Multiple multi-petabyte clusters (redundancy and jobs)
– Why not <x>? CPJ not lower

• Data stored in SANs (not Exadata)
• Performs poorly on scans of 10T+
• Long procurement cycles (3 month minimum)
Amazon Data Warehouse and Amazon Redshift
Integration Project
• Spent 2013 evaluating Amazon Redshift for Amazon data
warehouse
– Where does Amazon Redshift provide a better CPJ?
– Can Amazon Redshift solve some pain (without introducing new pain)?

• Picked 10K jobs and 275 tables to copy
Current State of Affairs
• Biggest cluster size: 20+1 8XL
• Peak daily jobs: 7211 (using all 4 clusters)
– 4159 extracts
– 3052 loads
Some Results
• Benchmarking for 4159 jobs
– Outperforming 2719
– Underperforming 1440
– Avg. runtime
• 4:43 mins in Amazon Redshift
• 17:38 mins in existing EDW

• LOADS are slower
• EXTRACTS are faster

Job Type   RS Performance Category   Job Count by Category
EXTRACT    10X Faster                945
EXTRACT    5X Faster                 487
EXTRACT    3X Faster                 393
EXTRACT    2X Faster                 301
EXTRACT    1X or same                480
EXTRACT    2X Slower                 1150
LOAD       10X Faster                7
LOAD       5X Faster                 15
LOAD       3X Faster                 23
LOAD       2X Faster                 23
LOAD       1X or same                45
LOAD       2X Slower                 290
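The per-category counts can be cross-checked against the headline figures with a short tally. This sketch assumes that "1X or same" counts as outperforming, which is the only grouping under which the categories add up to the 2719 / 1440 split reported on the slide. The variable and function names are illustrative.

```python
# Job counts by performance category, copied from the results table.
results = {
    "EXTRACT": {"10X Faster": 945, "5X Faster": 487, "3X Faster": 393,
                "2X Faster": 301, "1X or same": 480, "2X Slower": 1150},
    "LOAD": {"10X Faster": 7, "5X Faster": 15, "3X Faster": 23,
             "2X Faster": 23, "1X or same": 45, "2X Slower": 290},
}


def tally(results):
    """Return (outperforming, underperforming) job counts.

    Any category ending in "Slower" is underperforming; everything
    else, including "1X or same", is counted as outperforming.
    """
    outperforming = underperforming = 0
    for categories in results.values():
        for category, count in categories.items():
            if category.endswith("Slower"):
                underperforming += count
            else:
                outperforming += count
    return outperforming, underperforming


outperforming, underperforming = tally(results)
print("outperforming:", outperforming)
print("underperforming:", underperforming)

# Share of each job type that ran slower on Amazon Redshift.
for job_type, categories in results.items():
    share = categories["2X Slower"] / sum(categories.values())
    print(f"{job_type}: {share:.0%} slower")

# Average runtimes from the slide, converted to seconds.
redshift_secs = 4 * 60 + 43
edw_secs = 17 * 60 + 38
print(f"average speedup: {edw_secs / redshift_secs:.1f}x")
```

The per-type shares make the "loads are slower, extracts are faster" observation concrete: most of the benchmarked loads fall in the slower bucket, while most extracts do not.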
Amazon Redshift Best Practices
Abhishek Agrawal <abhagrwa@amazon.com>
Amazon Redshift Integration Best Practices
• Integrating via Amazon S3 (Manifests)
• Primary key enforcement
• Idempotent loads
– MERGE via INSERT/UPDATE
– Mimic Trunc-Load [Backfills]
• Trunc-partition using sort keys
• Administration automation
• Ensuring data correctness
Integrating via Amazon S3
• S3 in the US Standard Region is eventually consistent!
• S3 LIST might not give the entire list of data right after
you save it (this WILL eventually happen to you!)
• Amazon Redshift loads everything it sees in a bucket
– You may see all data files, Amazon Redshift may not, which can cause
missing data
Best Practices – Using Amazon S3
• Read/COPY
– System table validation – STL_LOAD_ERRORS
– Verify files loaded are the ‘intended’ files
• Write/UNLOAD
– System table validation – STL_UNLOAD_LOG
– Verify that all files containing the data are on S3
• Manifests
– Metadata specifying exactly what to read from S3
– Provides an authoritative reference to data
– Powerful in terms of user metadata format, encryption, etc.
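A minimal sketch of the manifest approach follows. It builds a manifest in the JSON shape that Amazon Redshift COPY accepts (an "entries" list of "url" / "mandatory" pairs), produces a COPY statement with the MANIFEST option, and compares intended files against loaded files. The bucket, table, and role names are placeholders, the helper names are hypothetical, and the use of IAM_ROLE for authorization is an assumption; the talk does not show its actual statements.

```python
import json


def build_manifest(urls):
    """Build a COPY manifest listing exactly the files to load.

    Marking every entry mandatory makes the load fail if a listed
    file is not visible, rather than silently skipping it.
    """
    return {"entries": [{"url": url, "mandatory": True} for url in urls]}


def copy_statement(table, manifest_url, iam_role):
    """Return a COPY statement that reads the manifest, not a key prefix."""
    return (
        f"COPY {table} FROM '{manifest_url}' "
        f"IAM_ROLE '{iam_role}' MANIFEST;"
    )


def missing_files(intended, loaded):
    """Files that were meant to be loaded but were not."""
    return set(intended) - set(loaded)


# Placeholder names, for illustration only.
files = [
    "s3://example-bucket/orders/part-0000.gz",
    "s3://example-bucket/orders/part-0001.gz",
]
print(json.dumps(build_manifest(files), indent=2))
print(copy_statement(
    "orders",
    "s3://example-bucket/orders/load.manifest",
    "arn:aws:iam::123456789012:role/example-role",
))
```

After a load, the list of loaded files would come from a system table query (not shown here), and `missing_files` should return an empty set.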
Primary Key Enforcement
• Amazon Redshift does not enforce primary keys
– You will need to do this yourself to ensure data quality

• Best practice
– Introduce temp table to check duplicates in incoming data
– Validate against incoming data to catch offenders
– Put the data in target table and validate target data in the same
transaction before commit

• Yes, this IS a lot of overhead
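The pattern above can be sketched as follows. SQLite (via Python's `sqlite3`) stands in for Amazon Redshift purely so the example is runnable; the table and column names are hypothetical, and the target table deliberately has no primary key constraint to mimic an unenforced key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# No PRIMARY KEY constraint: uniqueness of `id` must be checked by hand.
conn.execute("CREATE TABLE target (id INTEGER, amount REAL)")
conn.execute("CREATE TEMP TABLE staging (id INTEGER, amount REAL)")

DUPLICATE_IDS = "SELECT id FROM {table} GROUP BY id HAVING COUNT(*) > 1"


def load_with_pk_check(conn, rows):
    """Load rows into target only if `id` stays unique.

    Returns True if the load was committed, False if it was rolled back.
    """
    cur = conn.cursor()
    cur.execute("DELETE FROM staging")
    cur.executemany("INSERT INTO staging VALUES (?, ?)", rows)

    # 1. Catch duplicates within the incoming data.
    if cur.execute(DUPLICATE_IDS.format(table="staging")).fetchall():
        conn.rollback()
        return False

    # 2. Insert, then validate the target in the same transaction.
    cur.execute("INSERT INTO target SELECT id, amount FROM staging")
    if cur.execute(DUPLICATE_IDS.format(table="target")).fetchall():
        conn.rollback()  # undo the insert before anyone sees it
        return False

    conn.commit()
    return True


print(load_with_pk_check(conn, [(1, 10.0), (2, 20.0)]))  # clean load
print(load_with_pk_check(conn, [(3, 1.0), (3, 2.0)]))    # duplicate in input
print(load_with_pk_check(conn, [(2, 99.0)]))             # collides with target
```

The two duplicate scans are the overhead the slide warns about: each load pays for a grouping pass over the incoming data and another over the target.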
Idempotent Loads
• Idempotent loads – doing a load 2+ times has the same result as
doing it once
– Needed to manage load failures

• MERGE – leverages primary key, row at a time

• TRUNC / INSERT – load a partition at a time
MERGE
• No native Amazon Redshift MERGE support
• Merge is implemented as a multi-step process
– Load the data into a temp table
– Figure out inserts and load them
– Figure out updates and modify the target table
– Validate for duplicates
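The multi-step merge can be sketched as below, again using SQLite through Python's `sqlite3` as a runnable stand-in for Amazon Redshift. Table and column names are hypothetical, and the SQL is an illustration of the steps rather than the statements used in the talk.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (id INTEGER, amount REAL)")
conn.execute("CREATE TEMP TABLE staging (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO target VALUES (?, ?)", [(1, 10.0), (2, 20.0)])
conn.commit()


def merge(conn, rows):
    """Merge rows into target by key; safe to re-run with the same rows."""
    cur = conn.cursor()

    # Step 1: load the incoming data into the temp table.
    cur.execute("DELETE FROM staging")
    cur.executemany("INSERT INTO staging VALUES (?, ?)", rows)

    # Step 2: inserts are staging keys that the target does not have yet.
    cur.execute("""
        INSERT INTO target (id, amount)
        SELECT s.id, s.amount FROM staging s
        WHERE NOT EXISTS (SELECT 1 FROM target t WHERE t.id = s.id)
    """)

    # Step 3: updates are existing keys whose value has changed.
    cur.execute("""
        UPDATE target
        SET amount = (SELECT s.amount FROM staging s WHERE s.id = target.id)
        WHERE EXISTS (SELECT 1 FROM staging s
                      WHERE s.id = target.id AND s.amount <> target.amount)
    """)

    # Step 4: validate for duplicates before committing.
    dups = cur.execute(
        "SELECT id FROM target GROUP BY id HAVING COUNT(*) > 1"
    ).fetchall()
    if dups:
        conn.rollback()
        raise ValueError(f"duplicate keys after merge: {dups}")
    conn.commit()


def snapshot(conn):
    """Return the target table contents ordered by key."""
    return conn.execute("SELECT id, amount FROM target ORDER BY id").fetchall()


merge(conn, [(2, 25.0), (3, 30.0)])
first = snapshot(conn)
merge(conn, [(2, 25.0), (3, 30.0)])  # re-run after a presumed failure
print(first == snapshot(conn))
```

Running the same merge twice leaves the target unchanged, which is the idempotence property needed to recover from load failures.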
TRUNC - INSERT
• Solution
– Distribute randomly
– Use sort keys to align data (mimics partition)
– Selectively delete and insert

• Issues
– Inserts are in an “unsorted” bucket – performance degrades without
periodic VACUUM
– Very slow (effectively row at a time)
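A minimal sketch of the selective delete-and-insert follows, with SQLite (Python's `sqlite3`) standing in for Amazon Redshift so that it runs as written. The `sales` table and `sale_day` column are hypothetical; on Redshift, `sale_day` would be the sort key that plays the role of the partition column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sale_day TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("2013-11-13", 5.0), ("2013-11-14", 1.0)],
)
conn.commit()


def replace_partition(conn, day, amounts):
    """Replace all rows for one day in a single transaction.

    Deleting the slice first makes the load idempotent: re-running it
    for the same day yields the same rows instead of appending copies.
    """
    cur = conn.cursor()
    cur.execute("DELETE FROM sales WHERE sale_day = ?", (day,))
    cur.executemany(
        "INSERT INTO sales (sale_day, amount) VALUES (?, ?)",
        [(day, amount) for amount in amounts],
    )
    conn.commit()


replace_partition(conn, "2013-11-14", [7.0, 8.0])
replace_partition(conn, "2013-11-14", [7.0, 8.0])  # backfill re-run
print(conn.execute(
    "SELECT sale_day, amount FROM sales ORDER BY sale_day, amount"
).fetchall())
```

On Redshift the re-inserted rows land in the unsorted region, so a periodic VACUUM would still be needed, as the issues list notes.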
Other Temp Table Uses
• Partial column data load
• Filtered data load
• Column transformations
Automating Administration
• Stored procs / Oracle workflows were used for admin tasks
like retention, stats, etc.
• Solution
– We introduced a software layer to prepare the administrative
task statements based on defined inputs
– Execute using JDBC connection
– Can schedule work like stats collection, vacuum, etc.
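One way to picture such a layer is a function that turns a declared policy into SQL statements. The `TablePolicy` class and its fields are hypothetical, not the team's actual design. The statements use the Redshift VACUUM and ANALYZE commands and the DATEADD function, and only the generation step is shown; execution over JDBC is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TablePolicy:
    """Declared inputs for one table's routine administration."""
    table: str
    date_column: str
    retention_days: int


def admin_statements(policy):
    """Prepare retention, vacuum, and stats statements for one table."""
    if policy.retention_days <= 0:
        raise ValueError("retention_days must be positive")
    return [
        # Retention: drop rows older than the retention window.
        f"DELETE FROM {policy.table} "
        f"WHERE {policy.date_column} < "
        f"DATEADD(day, -{policy.retention_days}, CURRENT_DATE);",
        # Reclaim space and re-sort after the delete.
        f"VACUUM {policy.table};",
        # Refresh planner statistics.
        f"ANALYZE {policy.table};",
    ]


# Hypothetical policy, for illustration only.
for statement in admin_statements(TablePolicy("orders", "order_day", 90)):
    print(statement)
```

Keeping the policy as data means a scheduler can iterate over many tables and run the same three steps for each.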
2013 Results
• CPJ is 55% less on Amazon Redshift in general
– We can’t share the math, sorry; YMMV
– Between Redshift and Amazon data warehouse, known improvements get us to ~66%
– Big wins are in big queries
– Loads are slow and expensive

• Moved ~10K jobs to ~60 8XLs (4 clusters)

• We could move at most 45% of our work to Amazon Redshift with
minimal changes
2014 Plan
• Focus on big tables (100T+)
– Need to solve data expiry and backfill challenges

• Solve problems with being CPU bound
• Interactive analytics (third-party vendor apps
with Amazon Redshift + Oracle)
Please give us your feedback on this
presentation

DAT306
As a thank you, we will select prize
winners daily for completed surveys!

How Amazon.com is Leveraging Amazon Redshift (DAT306) | AWS re:Invent 2013

  • 1. DAT306 - How Amazon.com, with One of the World’s Largest Data Warehouses, is Leveraging Amazon Redshift Erik Selberg (selberg@amazon.com) and Abhishek Agrawal (abhagrwa@amazon.com) November 14, 2013 © 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
  • 2. Agenda • Amazon Data Warehouse Overview • Amazon Data Warehouse and Amazon Redshift Integration Project • Amazon Redshift Best Practices • Conclusion
  • 3. Amazon Data Warehouse Overview Erik Selberg <selberg@amazon.com>
  • 4. Amazon Data Warehouse • Authoritative repository of data for all Amazon • Petabytes of data • Existing EDW is Oracle RAC; also using Amazon Elastic MapReduce and now Amazon Redshift • Owns managing the hardware and software infrastructure – Apart from Oracle DB, just Amazon IP • Not part of AWS
  • 5. Introducing the Elephant… • Mission: Provide customers the best value – Leverage AWS only if it provides the best value – We aren’t moving 100% to Amazon Redshift • Publish best practices – If AWS isn’t the best, we’ll say so • There is a conflict of interest
  • 7. Amazon Data Warehouse – Growth Story • Petabytes of data • Growth of data volume – YoY storage requirements have grown 67% • Growth of processing volume – YoY processing demand has grown 47%
  • 8. Long-Term Sustainable Scale [Chart: demand vs. SAN-based and Redshift capacity; one region is labeled “$$ Wasted”]
  • 10. Amazon Data Warehouse – Cost per Job • Our main efficiency metric – Cost per Job (CPJ) • CPJ = ($CapEx + $DataCenter + $VendorSupport) / PeakJobsPerDay
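A minimal sketch of the CPJ metric, assuming the three cost terms on the slide are summed and divided by peak jobs per day. The function name, the choice of expressing costs per day, and every dollar figure are hypothetical; the talk does not share its real inputs.

```python
def cost_per_job(capex: float, data_center: float, vendor_support: float,
                 peak_jobs_per_day: int) -> float:
    """Cost per Job: summed costs divided by peak jobs per day."""
    if peak_jobs_per_day <= 0:
        raise ValueError("peak_jobs_per_day must be positive")
    return (capex + data_center + vendor_support) / peak_jobs_per_day


# Hypothetical daily cost figures, not taken from the talk.
baseline = cost_per_job(6_000.0, 3_000.0, 1_000.0, 10_000)
print(f"CPJ: ${baseline:.2f}")
```

Because the denominator is peak jobs, anything that raises cost without raising peak throughput pushes CPJ up.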
  • 11. What Drives Cost per Job… Up? • Number of disks – Data gets bigger! • Number of servers • Short-sighted negotiations – 4th year support… • Data Center costs (power, rent) • Software (e.g. DBM) Down? • Bidding – 2+ vendors • Moore’s Law – Vendors fight this! • Data design
  • 12. Current State and Problems • Existing EDW – Multiple multi-petabyte clusters (redundancy and jobs) – Why not <x>? CPJ not lower • Data stored in SANs (not Exadata) • Performs poorly on scans of 10T+ • Long procurement cycles (3 month minimum)
  • 13. Amazon Data Warehouse and Amazon Redshift Integration Project • Spent 2013 evaluating Amazon Redshift for Amazon data warehouse – Where does Amazon Redshift provide a better CPJ? – Can Amazon Redshift solve some pain (without introducing new pain)? • Picked 10K jobs and 275 tables to copy
  • 14. Current State of Affairs • Biggest cluster size: 20+1 8XL • Peak daily jobs: 7211 (using all 4 clusters) • 4159 extracts • 3052 loads
  • 15. Some Results • Benchmarking for 4159 jobs – Outperforming 2719 – Underperforming 1440 – Avg. runtime • 4:43 mins in Amazon Redshift • 17:38 mins in existing EDW • LOADS are slower • EXTRACTS are faster
    Job Type | RS Performance Category | Job Count by Category
    EXTRACT | 10X Faster | 945
    EXTRACT | 5X Faster | 487
    EXTRACT | 3X Faster | 393
    EXTRACT | 2X Faster | 301
    EXTRACT | 1X or same | 480
    EXTRACT | 2X Slower | 1150
    LOAD | 10X Faster | 7
    LOAD | 5X Faster | 15
    LOAD | 3X Faster | 23
    LOAD | 2X Faster | 23
    LOAD | 1X or same | 45
    LOAD | 2X Slower | 290
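The summary bullets on this slide can be cross-checked against the per-category counts. The sketch below copies the counts from the slide; treating "1X or same" as outperforming is an inference from the arithmetic (it is the only grouping that reproduces 2719), not something the slide states.

```python
# Job counts by category, copied from the slide's table.
counts = {
    "EXTRACT": {"10X Faster": 945, "5X Faster": 487, "3X Faster": 393,
                "2X Faster": 301, "1X or same": 480, "2X Slower": 1150},
    "LOAD": {"10X Faster": 7, "5X Faster": 15, "3X Faster": 23,
             "2X Faster": 23, "1X or same": 45, "2X Slower": 290},
}

total = sum(sum(c.values()) for c in counts.values())
under = sum(c["2X Slower"] for c in counts.values())
out = total - under  # every job at least as fast as on the existing EDW

print(total, out, under)  # 4159 2719 1440
for job_type, c in counts.items():
    share_slower = c["2X Slower"] / sum(c.values())
    print(f"{job_type}: {share_slower:.0%} of jobs slower")
```

The per-type split is what backs the "LOADS are slower, EXTRACTS are faster" bullets: most load jobs fall in the slower bucket, while most extracts do not.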
  • 16. Amazon Redshift Best Practices Abhishek Agrawal <abhagrwa@amazon.com>
  • 17. Amazon Redshift Integration Best Practices • Integrating via Amazon S3 (Manifests) • Primary key enforcement • Idempotent loads – MERGE via INSERT/UPDATE – Mimic Trunc-Load [Backfills] • Trunc-partition using sort keys • Administration automation • Ensuring data correctness
  • 18. Integrating via Amazon S3 • S3 in the US Standard Region is eventually consistent! • S3 LIST might not give the entire list of data right after you save it (this WILL eventually happen to you!) • Amazon Redshift loads everything it sees in a bucket – You may see all data files, Amazon Redshift may not, which can cause missing data
  • 19. Best Practices – Using Amazon S3 • Read/COPY – System table validation (STL_LOAD_ERRORS) – Verify files loaded are ‘intended’ files • Write/UNLOAD – System table validation (STL_UNLOAD_LOG) – Verify all files that have the data are on S3 • Manifests – Metadata to know exactly what to read from S3 – Provides authoritative reference to data – Powerful in terms of user metadata format, encryption, etc.
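As a sketch of the manifest approach, the writer can record exactly which files it produced and hand that list to COPY instead of relying on an S3 LIST. The bucket, keys, table name, and role ARN below are placeholders, and `build_manifest` is a hypothetical helper; the JSON shape (`entries`, `url`, `mandatory`) is the COPY manifest format.

```python
import json


def build_manifest(s3_urls):
    """Return a COPY manifest listing exactly the files to load.

    "mandatory": true asks COPY to fail if a listed file is missing,
    rather than silently skipping it.
    """
    return {"entries": [{"url": url, "mandatory": True} for url in s3_urls]}


# Hypothetical bucket and keys: the writer records what it actually wrote.
written = [
    "s3://example-bucket/orders/2013-11-14/part-0000.gz",
    "s3://example-bucket/orders/2013-11-14/part-0001.gz",
]
manifest_json = json.dumps(build_manifest(written), indent=2)
print(manifest_json)

# COPY then names the manifest object instead of a key prefix
# (table name, manifest location, and role ARN are placeholders).
copy_sql = (
    "COPY orders "
    "FROM 's3://example-bucket/manifests/orders-2013-11-14.manifest' "
    "IAM_ROLE 'arn:aws:iam::123456789012:role/ExampleLoadRole' "
    "GZIP MANIFEST;"
)
print(copy_sql)
```

Marking every entry mandatory turns a file that is not yet visible into a load failure that can be retried, instead of a silent gap in the data.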
  • 20. Primary Key Enforcement • Amazon Redshift does not enforce primary key – You will need to do this to ensure data quality • Best practice – Introduce temp table to check duplicates in incoming data – Validate against incoming data to catch offenders – Put the data in target table and validate target data in the same transaction before commit • Yes, this IS a lot of overhead
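The check-then-load pattern on this slide can be sketched as follows. SQLite (via Python's standard library) stands in for Amazon Redshift so the example runs anywhere; the table names `staging` and `target` and the function name are hypothetical, and the target is deliberately created without a primary key constraint to mimic an unenforced key.

```python
import sqlite3


def load_with_pk_check(conn, rows):
    """Stage rows, reject duplicate keys, then insert and re-validate
    the target before committing. Rolls back on any violation."""
    cur = conn.cursor()
    cur.execute("CREATE TEMP TABLE IF NOT EXISTS staging (id INTEGER, val TEXT)")
    cur.execute("DELETE FROM staging")
    try:
        cur.executemany("INSERT INTO staging VALUES (?, ?)", rows)
        # Step 1: duplicates inside the incoming batch.
        dupes = cur.execute(
            "SELECT id FROM staging GROUP BY id HAVING COUNT(*) > 1").fetchall()
        if dupes:
            raise ValueError(f"duplicate keys in incoming data: {dupes}")
        # Step 2: load, then validate the target in the same transaction.
        cur.execute("INSERT INTO target SELECT id, val FROM staging")
        dupes = cur.execute(
            "SELECT id FROM target GROUP BY id HAVING COUNT(*) > 1").fetchall()
        if dupes:
            raise ValueError(f"duplicate keys in target: {dupes}")
        conn.commit()
    except Exception:
        conn.rollback()
        raise


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (id INTEGER, val TEXT)")  # no PK, on purpose
load_with_pk_check(conn, [(1, "a"), (2, "b")])
try:
    load_with_pk_check(conn, [(2, "b2"), (3, "c")])  # id 2 is already loaded
except ValueError as err:
    print("rejected:", err)
print(conn.execute("SELECT COUNT(*) FROM target").fetchone()[0])  # 2
```

The second batch is rolled back as a whole, so the target keeps only the two rows from the first load.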
  • 21. Idempotent Loads • Idempotent Loads – doing a load 2+ times is the same as doing 1 load – Needed to manage load failures • MERGE – leverages primary key, row at a time • TRUNC / INSERT – load a partition at a time
  • 22. MERGE • No native Amazon Redshift MERGE support • Merge is implemented as a multi-step process – Load the data in temp table – Figure out inserts and load – Figure out updates and modify target table – Validation for duplicates
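The multi-step merge can be sketched as below, again with SQLite standing in for Amazon Redshift so the example is runnable. The table names `stage` and `target`, the columns, and the function name are hypothetical; the steps follow the order listed on the slide.

```python
import sqlite3


def merge(conn, rows):
    """Upsert rows into target through a staging table."""
    cur = conn.cursor()
    # Step 1: load the batch into a temp table.
    cur.execute("CREATE TEMP TABLE IF NOT EXISTS stage (id INTEGER, val TEXT)")
    cur.execute("DELETE FROM stage")
    cur.executemany("INSERT INTO stage VALUES (?, ?)", rows)
    # Step 2: inserts are the staged keys not yet in target.
    cur.execute("""INSERT INTO target
                   SELECT s.id, s.val FROM stage s
                   WHERE NOT EXISTS (SELECT 1 FROM target t WHERE t.id = s.id)""")
    # Step 3: updates overwrite target rows whose key is staged. This also
    # touches the rows inserted in step 2, which is harmless.
    cur.execute("""UPDATE target
                   SET val = (SELECT s.val FROM stage s WHERE s.id = target.id)
                   WHERE id IN (SELECT id FROM stage)""")
    # Step 4: validate that no key appears twice.
    dupes = cur.execute(
        "SELECT id FROM target GROUP BY id HAVING COUNT(*) > 1").fetchall()
    if dupes:
        conn.rollback()
        raise ValueError(f"duplicate keys after merge: {dupes}")
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (id INTEGER, val TEXT)")
conn.execute("INSERT INTO target VALUES (1, 'old')")
conn.commit()

batch = [(1, "new"), (2, "b")]
merge(conn, batch)
merge(conn, batch)  # replaying the same batch leaves the table unchanged
rows_after = conn.execute("SELECT id, val FROM target ORDER BY id").fetchall()
print(rows_after)  # [(1, 'new'), (2, 'b')]
```

Replaying the batch is the idempotency property: a failed load can simply be rerun.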
  • 23. TRUNC - INSERT • Solution – Distribute randomly – Use sort keys to align data (mimics partition) – Selectively delete and insert • Issues – Inserts are in an “unsorted” bucket – performance degrades without periodic VACUUM – Very slow (effectively row at a time)
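A sketch of the selective delete-and-insert, assuming a date column plays the role of the partition. SQLite stands in for Amazon Redshift; the `events` table, its columns, and the function name are hypothetical. On Redshift the date column would be the sort key, and the periodic VACUUM mentioned on the slide would still be needed.

```python
import sqlite3


def replace_partition(conn, day, payloads):
    """Delete one day's slice and reinsert it in a single transaction."""
    with conn:  # commits on success, rolls back on error
        conn.execute("DELETE FROM events WHERE event_day = ?", (day,))
        conn.executemany("INSERT INTO events VALUES (?, ?)",
                         [(day, p) for p in payloads])


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (event_day TEXT, payload TEXT)")

replace_partition(conn, "2013-11-13", ["a", "b"])
replace_partition(conn, "2013-11-14", ["c"])
replace_partition(conn, "2013-11-14", ["c"])        # retry: no duplicates
replace_partition(conn, "2013-11-13", ["a", "b2"])  # backfill, corrected data

events = conn.execute(
    "SELECT event_day, payload FROM events ORDER BY event_day, payload"
).fetchall()
print(events)
```

Because each call replaces the whole slice, a retry and a backfill are the same operation.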
  • 24. Other Temp Table Uses • Partial column data load • Filtered data load • Column transformations
  • 25. Automating Administration • Stored procs / Oracle workflow used to do admin tasks like retention, stats, etc. • Solution – We introduced a software layer to prepare the administrative task statements based on defined inputs – Execute using JDBC connection – Can schedule work like stats collection, vacuum, etc.
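One way such a layer could look is a small generator that turns a per-table policy into statements. The `TablePolicy` class, its fields, and the table names are hypothetical and not the speakers' implementation; the talk executes statements over JDBC, whereas this sketch only prints them.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TablePolicy:
    table: str           # target table name
    date_column: str     # column that drives retention
    retention_days: int  # rows older than this are deleted


def admin_statements(policy: TablePolicy) -> List[str]:
    """Build retention, vacuum, and stats statements for one table."""
    return [
        f"DELETE FROM {policy.table} "
        f"WHERE {policy.date_column} < "
        f"DATEADD(day, -{policy.retention_days}, GETDATE());",
        f"VACUUM {policy.table};",
        f"ANALYZE {policy.table};",
    ]


# Hypothetical policies; a scheduler would iterate over these.
policies = [
    TablePolicy("orders", "order_day", 90),
    TablePolicy("clicks", "click_day", 30),
]
for policy in policies:
    for statement in admin_statements(policy):
        print(statement)
```

Keeping the policy as data means a new table needs only a new entry, not a new procedure. Identifiers are interpolated directly, so they should come from trusted configuration only.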
  • 26. 2013 Results • CPJ is 55% less on Amazon Redshift in general – We can’t share the math, sorry; YMMV – Between Redshift and Amazon data warehouse, known improvements get us to ~66% – Big wins are in big queries – Loads are slow and expensive • Moved ~10K jobs to ~60 8XLs (4 clusters) • We could move at most 45% of our work to Amazon Redshift with minimal changes
  • 27. 2014 Plan • Focus on big tables (100T+) – Need to solve data expiry and backfill challenges • Solve problems with CPU bound • Interactive analytics (third-party vendor apps with Amazon Redshift + Oracle)
  • 28. Please give us your feedback on this presentation DAT306 As a thank you, we will select prize winners daily for completed surveys!