SlideShare une entreprise Scribd logo
1  sur  25
Powering interactive data analysis by
Amazon Redshift
Jie Li
Data Infra at Pinterest
Pinterest: a place to get inspired and
plan for the future
Data Infra at Pinterest
Production
data pipeline

Kafka

Pinball (*)
Hive

S3

MySQL
HBase
Redis

Cascading

Hadoop

MySQL

Amazon Web Service
* Pinball is our own workflow manager that we plan to open source.

Ad-hoc data
analysis

Analytics Dashboard
We need a low latency data warehouse!
Production
data pipeline

Kafka

Pinball
Hive

S3

MySQL
HBase
Redis

Cascading

Hadoop

High latency!

Ad-hoc data
analysis

MySQL not a viable data warehouse.
MySQL

Amazon Web Service

Analytics Dashboard
Low-latency data warehouse
• SQL on Hadoop
– Shark, Impala, Drill, Tez, Presto, …
– Open source and free 
– Immature? 

• Massive Parallel Processing (MPP)
– Asterdata, Vertica, ParAccel, …
– Built on mature technologies like Postgres 
– Expensive and only available on-premise 

• Amazon Redshift
– ParAccel on AWS 
– Mature but also cost-effective 
Highlights of Redshift
High cost efficiency
on-demand $0.85 per hour
3yr reserved instances $999/TB/year
Free snapshot on S3
Highlights of Redshift
Low maintenance overhead
Fully self-managed
Automated maintenance & upgrade
Built-in admin dashboard
Highlights of Redshift
Superior performance
6000

5000

25-100x over Hive
• Columnar layout
• Index
• Advanced optimizer
• Efficient execution

second

4000

3000

2000

1000

0

Q1

Q2
Hive

Q3
RedShift

Note: based on our own dataset and queries.

Q4
Cool, but how to integrate Redshift with
Hive/Hadoop
First, get data from Hive into Redshift
Unstructure
d
Unclean

Structured
Clean

Extract & Transform
Hive

Columnar
Compact
Compressed

Load
S3

Hadoop/Hive is perfect for heavy-lifting ETL workloads

Redshift
Building ETL from Hive to Redshift
What worked

What didn’t work

Schematizing Hive tables

Writing column-mapping
scripts to generate ETL
queries

N/A

Cleaning data

Filtering out non-ASCII
characters

Loading all characters

Loading big tables with
sortkey

Sorting externally in
Hadoop/Hive and loading in
chunks

Loading unordered data
directly

Loading time-series tables

Appending to the table in
the order of time (sortkey)

A table per day connected
with view performing poorly

Table retention

Insert into a new table

Delete and vacuum (poor
performance)
But it’s just the beginning.
Make sure you audit the ETL from Day 1
Audit ETL for Data Consistency
Everything was good until one day we noticed
one table was only half of its size 
S3 is only eventual consistent (EC) ! 

Hive

Solutions:

S3

① Audit

Redshift

Audit

② Also reduce number of files on S3 to alleviate EC.
Also, recently there is a new feature to specify a manifest for files on S3.
Now we got the data.
Is it ready for superior performance?
Understand the performance
Leader

① Understand the query execution
plan (via “explain”). Always
update system stats after data
loading by running “analyze”.

System
Stats

Compute

Compute

Compute

② Optimize the data layout by choosing consistent
distkeys across tables, and always choose a
sortkey. Watch out for bad distkey with skew (e.g.
distkey with null values).
What if a query took long
It’s worth doing your own homework
Filing tickets doesn’t work well for perf issues
• Requires a lot of information exchange
• May be caused by minor issues
Case: we optimized a query from 3 hours to 7 seconds
after studying the query plan and fixing the system
stats (the broadcast join regarded the larger table as
the smaller one).
Educate users with best practices 
Best Practice

Details

Select only the columns you need

Redshift is a columnar database and it only scans the
columns you need to speed things up. “SELECT *” is
usually bad.

Use the sortkey (dt or created_at)

Using sortkey can skip unnecessary data. Most of our
tables are using dt or created_at as the sortkey.

Avoid slow data transferring

Transferring large query result from Redshift to the local
client may be slow. Try saving the result as a Redshift
table or using the command line client on EC2.

Apply selective filters before join

Join operation can be significantly faster if we filter out
irrelevant data as much as possible.

Run one query at a time

The performance gets diluted with more queries. So be
patient.

Understand the query plan by
EXPLAIN

EXPLAIN gives you idea why a query may be slow. For
advanced users only.
Hopefully users will follow the best practice 
But Redshift is a shared service

One query may slow down the whole cluster 
Proactive monitoring
System tables
(e.g. stl_query)

It’s easy to write scripts for
Real-time
monitoring
slow queries

• Ping users with best practice
• Send alerts to admin

Analyzing
patterns

• Who need help
• Who was “abusing”

Hint: manually backup these system tables as they will be cleaned up weekly.
Optimizing workload management
• Run heavy ETL during night
– ETL is resource intensive
– No easy way to limit the resource usage (IO/CPU)

• Time out user queries during peak hours
– Long queries (>= 30 mins) likely have mistakes
– Sacrifice a few users for the majority

• Unlike Hadoop, there is no preemption in
Redshift 
Current status of Redshift at Pinterest
•
•
•
•

16 node 256TB cluster with 100TB+ core data
Ingesting 1.5TB data per day with retention
30+ daily users
500+ ad-hoc queries per day
– 75% <= 35 seconds, 90% <= 2 minute

• operational effort <= 5 hours/week
Redshift integrated at Pinterest
Pinball
Hive

Kafka

Cascading

Production
data pipeline

Hadoop

Ad-hoc data
analysis

S3

MySQL
HBase
Redis

Redshift

MySQL

Amazon Web Service

Analytics
Dashboard
Next step
• Next generation of analytics dashboards
– Replace offline MySQL with Redshift
– Replace custom dashboards with Tableau
Pinball
Hive

Kafka

Cascading

Production
data pipeline

Hadoop

Ad-hoc data
analysis

S3
MySQL
HBase
Redis

Redshift

Amazon Web Service

Tableau
Remaining risks
• SLA for low latency queries
– Due to the lack of preemption, it can not
guarantee mission-critical queries to finish fast

• High availability
– Takes hours to restore clusters from snapshots
– May need a standby cluster in future
Questions?
• Quora: http://qr.ae/TwRJf
• Twitter: @jay23jack

Contenu connexe

Tendances

Best Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon RedshiftBest Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon RedshiftAmazon Web Services
 
Data Warehousing with Amazon Redshift
Data Warehousing with Amazon RedshiftData Warehousing with Amazon Redshift
Data Warehousing with Amazon RedshiftAmazon Web Services
 
(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift
(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift
(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon RedshiftAmazon Web Services
 
AWS Webcast - Redshift Overview and New Features
AWS Webcast - Redshift Overview and New Features AWS Webcast - Redshift Overview and New Features
AWS Webcast - Redshift Overview and New Features Amazon Web Services
 
Introduction to Amazon Redshift and What's Next (DAT103) | AWS re:Invent 2013
Introduction to Amazon Redshift and What's Next (DAT103) | AWS re:Invent 2013Introduction to Amazon Redshift and What's Next (DAT103) | AWS re:Invent 2013
Introduction to Amazon Redshift and What's Next (DAT103) | AWS re:Invent 2013Amazon Web Services
 
Building Your Data Warehouse with Amazon Redshift
Building Your Data Warehouse with Amazon RedshiftBuilding Your Data Warehouse with Amazon Redshift
Building Your Data Warehouse with Amazon RedshiftAmazon Web Services
 
AWS June Webinar Series - Getting Started: Amazon Redshift
AWS June Webinar Series - Getting Started: Amazon RedshiftAWS June Webinar Series - Getting Started: Amazon Redshift
AWS June Webinar Series - Getting Started: Amazon RedshiftAmazon Web Services
 
Uses and Best Practices for Amazon Redshift
Uses and Best Practices for Amazon Redshift Uses and Best Practices for Amazon Redshift
Uses and Best Practices for Amazon Redshift Amazon Web Services
 
Deep Dive Redshift, with a focus on performance
Deep Dive Redshift, with a focus on performanceDeep Dive Redshift, with a focus on performance
Deep Dive Redshift, with a focus on performanceAmazon Web Services
 
An overview of Amazon Athena
An overview of Amazon AthenaAn overview of Amazon Athena
An overview of Amazon AthenaJulien SIMON
 
Production NoSQL in an Hour: Introduction to Amazon DynamoDB (DAT101) | AWS r...
Production NoSQL in an Hour: Introduction to Amazon DynamoDB (DAT101) | AWS r...Production NoSQL in an Hour: Introduction to Amazon DynamoDB (DAT101) | AWS r...
Production NoSQL in an Hour: Introduction to Amazon DynamoDB (DAT101) | AWS r...Amazon Web Services
 
Building AWS Redshift Data Warehouse with Matillion and Tableau
Building AWS Redshift Data Warehouse with Matillion and TableauBuilding AWS Redshift Data Warehouse with Matillion and Tableau
Building AWS Redshift Data Warehouse with Matillion and TableauLynn Langit
 
Amazon Redshift in Action: Enterprise, Big Data, and SaaS Use Cases (DAT205) ...
Amazon Redshift in Action: Enterprise, Big Data, and SaaS Use Cases (DAT205) ...Amazon Redshift in Action: Enterprise, Big Data, and SaaS Use Cases (DAT205) ...
Amazon Redshift in Action: Enterprise, Big Data, and SaaS Use Cases (DAT205) ...Amazon Web Services
 
Leveraging Amazon Redshift for your Data Warehouse
Leveraging Amazon Redshift for your Data WarehouseLeveraging Amazon Redshift for your Data Warehouse
Leveraging Amazon Redshift for your Data WarehouseAmazon Web Services
 
A tour of Amazon Redshift
A tour of Amazon RedshiftA tour of Amazon Redshift
A tour of Amazon RedshiftKel Graham
 
Getting Started with Amazon Redshift
Getting Started with Amazon RedshiftGetting Started with Amazon Redshift
Getting Started with Amazon RedshiftAmazon Web Services
 
AWS (Amazon Redshift) presentation
AWS (Amazon Redshift) presentationAWS (Amazon Redshift) presentation
AWS (Amazon Redshift) presentationVolodymyr Rovetskiy
 
Best Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon Redshift Best Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon Redshift Amazon Web Services
 

Tendances (20)

Best Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon RedshiftBest Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon Redshift
 
Data Warehousing with Amazon Redshift
Data Warehousing with Amazon RedshiftData Warehousing with Amazon Redshift
Data Warehousing with Amazon Redshift
 
(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift
(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift
(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift
 
AWS Webcast - Redshift Overview and New Features
AWS Webcast - Redshift Overview and New Features AWS Webcast - Redshift Overview and New Features
AWS Webcast - Redshift Overview and New Features
 
Introduction to Amazon Redshift and What's Next (DAT103) | AWS re:Invent 2013
Introduction to Amazon Redshift and What's Next (DAT103) | AWS re:Invent 2013Introduction to Amazon Redshift and What's Next (DAT103) | AWS re:Invent 2013
Introduction to Amazon Redshift and What's Next (DAT103) | AWS re:Invent 2013
 
Building Your Data Warehouse with Amazon Redshift
Building Your Data Warehouse with Amazon RedshiftBuilding Your Data Warehouse with Amazon Redshift
Building Your Data Warehouse with Amazon Redshift
 
AWS June Webinar Series - Getting Started: Amazon Redshift
AWS June Webinar Series - Getting Started: Amazon RedshiftAWS June Webinar Series - Getting Started: Amazon Redshift
AWS June Webinar Series - Getting Started: Amazon Redshift
 
Uses and Best Practices for Amazon Redshift
Uses and Best Practices for Amazon Redshift Uses and Best Practices for Amazon Redshift
Uses and Best Practices for Amazon Redshift
 
Deep Dive Redshift, with a focus on performance
Deep Dive Redshift, with a focus on performanceDeep Dive Redshift, with a focus on performance
Deep Dive Redshift, with a focus on performance
 
An overview of Amazon Athena
An overview of Amazon AthenaAn overview of Amazon Athena
An overview of Amazon Athena
 
Production NoSQL in an Hour: Introduction to Amazon DynamoDB (DAT101) | AWS r...
Production NoSQL in an Hour: Introduction to Amazon DynamoDB (DAT101) | AWS r...Production NoSQL in an Hour: Introduction to Amazon DynamoDB (DAT101) | AWS r...
Production NoSQL in an Hour: Introduction to Amazon DynamoDB (DAT101) | AWS r...
 
Building AWS Redshift Data Warehouse with Matillion and Tableau
Building AWS Redshift Data Warehouse with Matillion and TableauBuilding AWS Redshift Data Warehouse with Matillion and Tableau
Building AWS Redshift Data Warehouse with Matillion and Tableau
 
Redshift deep dive
Redshift deep diveRedshift deep dive
Redshift deep dive
 
Deep Dive on Amazon Redshift
Deep Dive on Amazon RedshiftDeep Dive on Amazon Redshift
Deep Dive on Amazon Redshift
 
Amazon Redshift in Action: Enterprise, Big Data, and SaaS Use Cases (DAT205) ...
Amazon Redshift in Action: Enterprise, Big Data, and SaaS Use Cases (DAT205) ...Amazon Redshift in Action: Enterprise, Big Data, and SaaS Use Cases (DAT205) ...
Amazon Redshift in Action: Enterprise, Big Data, and SaaS Use Cases (DAT205) ...
 
Leveraging Amazon Redshift for your Data Warehouse
Leveraging Amazon Redshift for your Data WarehouseLeveraging Amazon Redshift for your Data Warehouse
Leveraging Amazon Redshift for your Data Warehouse
 
A tour of Amazon Redshift
A tour of Amazon RedshiftA tour of Amazon Redshift
A tour of Amazon Redshift
 
Getting Started with Amazon Redshift
Getting Started with Amazon RedshiftGetting Started with Amazon Redshift
Getting Started with Amazon Redshift
 
AWS (Amazon Redshift) presentation
AWS (Amazon Redshift) presentationAWS (Amazon Redshift) presentation
AWS (Amazon Redshift) presentation
 
Best Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon Redshift Best Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon Redshift
 

En vedette

Building Your Data Warehouse with Amazon Redshift
Building Your Data Warehouse with Amazon RedshiftBuilding Your Data Warehouse with Amazon Redshift
Building Your Data Warehouse with Amazon RedshiftAmazon Web Services
 
Googleにおける機械学習の活用とクラウドサービス
Googleにおける機械学習の活用とクラウドサービスGoogleにおける機械学習の活用とクラウドサービス
Googleにおける機械学習の活用とクラウドサービスEtsuji Nakai
 
Data Center In Healthcare Presentation 02 12
Data Center In Healthcare Presentation 02 12Data Center In Healthcare Presentation 02 12
Data Center In Healthcare Presentation 02 12todmoore
 
Continuous Delivery in Ruby
Continuous Delivery in RubyContinuous Delivery in Ruby
Continuous Delivery in RubyBrian Guthrie
 
Better Together - Using Spark and Redshift to Combine Your Data with Public D...
Better Together - Using Spark and Redshift to Combine Your Data with Public D...Better Together - Using Spark and Redshift to Combine Your Data with Public D...
Better Together - Using Spark and Redshift to Combine Your Data with Public D...C4Media
 
From One to Many: Evolving VPC Design (ARC401) | AWS re:Invent 2013
From One to Many:  Evolving VPC Design (ARC401) | AWS re:Invent 2013From One to Many:  Evolving VPC Design (ARC401) | AWS re:Invent 2013
From One to Many: Evolving VPC Design (ARC401) | AWS re:Invent 2013Amazon Web Services
 
CIS13: Bootcamp: Ping Identity OAuth and OpenID Connect In Action with PingFe...
CIS13: Bootcamp: Ping Identity OAuth and OpenID Connect In Action with PingFe...CIS13: Bootcamp: Ping Identity OAuth and OpenID Connect In Action with PingFe...
CIS13: Bootcamp: Ping Identity OAuth and OpenID Connect In Action with PingFe...CloudIDSummit
 
AWS Webcast - Informatica - Big Data Solutions Showcase
AWS Webcast - Informatica - Big Data Solutions ShowcaseAWS Webcast - Informatica - Big Data Solutions Showcase
AWS Webcast - Informatica - Big Data Solutions ShowcaseAmazon Web Services
 
OpenID Connect - An Emperor or Just New Cloths?
OpenID Connect - An Emperor or Just New Cloths?OpenID Connect - An Emperor or Just New Cloths?
OpenID Connect - An Emperor or Just New Cloths?Oliver Pfaff
 
The Netflix Way to deal with Big Data Problems
The Netflix Way to deal with Big Data ProblemsThe Netflix Way to deal with Big Data Problems
The Netflix Way to deal with Big Data ProblemsMonal Daxini
 
Hadoop Summit 2012 | Optimizing MapReduce Job Performance
Hadoop Summit 2012 | Optimizing MapReduce Job PerformanceHadoop Summit 2012 | Optimizing MapReduce Job Performance
Hadoop Summit 2012 | Optimizing MapReduce Job PerformanceCloudera, Inc.
 
(APP304) AWS CloudFormation Best Practices | AWS re:Invent 2014
(APP304) AWS CloudFormation Best Practices | AWS re:Invent 2014(APP304) AWS CloudFormation Best Practices | AWS re:Invent 2014
(APP304) AWS CloudFormation Best Practices | AWS re:Invent 2014Amazon Web Services
 
(WEB302) Best Practices for Running WordPress on AWS | AWS re:Invent 2014
(WEB302) Best Practices for Running WordPress on AWS | AWS re:Invent 2014(WEB302) Best Practices for Running WordPress on AWS | AWS re:Invent 2014
(WEB302) Best Practices for Running WordPress on AWS | AWS re:Invent 2014Amazon Web Services
 
(GAM301) Real-Time Game Analytics with Amazon Kinesis, Amazon Redshift, and A...
(GAM301) Real-Time Game Analytics with Amazon Kinesis, Amazon Redshift, and A...(GAM301) Real-Time Game Analytics with Amazon Kinesis, Amazon Redshift, and A...
(GAM301) Real-Time Game Analytics with Amazon Kinesis, Amazon Redshift, and A...Amazon Web Services
 
Amazon Redshift의 이해와 활용 (김용우) - AWS DB Day
Amazon Redshift의 이해와 활용 (김용우) - AWS DB DayAmazon Redshift의 이해와 활용 (김용우) - AWS DB Day
Amazon Redshift의 이해와 활용 (김용우) - AWS DB DayAmazon Web Services Korea
 
Hybrid MongoDB and RDBMS Applications
Hybrid MongoDB and RDBMS ApplicationsHybrid MongoDB and RDBMS Applications
Hybrid MongoDB and RDBMS ApplicationsSteven Francia
 
Core MIDI and Friends
Core MIDI and FriendsCore MIDI and Friends
Core MIDI and FriendsChris Adamson
 
(BDT403) Best Practices for Building Real-time Streaming Applications with Am...
(BDT403) Best Practices for Building Real-time Streaming Applications with Am...(BDT403) Best Practices for Building Real-time Streaming Applications with Am...
(BDT403) Best Practices for Building Real-time Streaming Applications with Am...Amazon Web Services
 
Getting Maximum Performance from Amazon Redshift (DAT305) | AWS re:Invent 2013
Getting Maximum Performance from Amazon Redshift (DAT305) | AWS re:Invent 2013Getting Maximum Performance from Amazon Redshift (DAT305) | AWS re:Invent 2013
Getting Maximum Performance from Amazon Redshift (DAT305) | AWS re:Invent 2013Amazon Web Services
 

En vedette (20)

Building Your Data Warehouse with Amazon Redshift
Building Your Data Warehouse with Amazon RedshiftBuilding Your Data Warehouse with Amazon Redshift
Building Your Data Warehouse with Amazon Redshift
 
Googleにおける機械学習の活用とクラウドサービス
Googleにおける機械学習の活用とクラウドサービスGoogleにおける機械学習の活用とクラウドサービス
Googleにおける機械学習の活用とクラウドサービス
 
Data Center In Healthcare Presentation 02 12
Data Center In Healthcare Presentation 02 12Data Center In Healthcare Presentation 02 12
Data Center In Healthcare Presentation 02 12
 
Continuous Delivery in Ruby
Continuous Delivery in RubyContinuous Delivery in Ruby
Continuous Delivery in Ruby
 
Better Together - Using Spark and Redshift to Combine Your Data with Public D...
Better Together - Using Spark and Redshift to Combine Your Data with Public D...Better Together - Using Spark and Redshift to Combine Your Data with Public D...
Better Together - Using Spark and Redshift to Combine Your Data with Public D...
 
From One to Many: Evolving VPC Design (ARC401) | AWS re:Invent 2013
From One to Many:  Evolving VPC Design (ARC401) | AWS re:Invent 2013From One to Many:  Evolving VPC Design (ARC401) | AWS re:Invent 2013
From One to Many: Evolving VPC Design (ARC401) | AWS re:Invent 2013
 
CIS13: Bootcamp: Ping Identity OAuth and OpenID Connect In Action with PingFe...
CIS13: Bootcamp: Ping Identity OAuth and OpenID Connect In Action with PingFe...CIS13: Bootcamp: Ping Identity OAuth and OpenID Connect In Action with PingFe...
CIS13: Bootcamp: Ping Identity OAuth and OpenID Connect In Action with PingFe...
 
AWS Webcast - Informatica - Big Data Solutions Showcase
AWS Webcast - Informatica - Big Data Solutions ShowcaseAWS Webcast - Informatica - Big Data Solutions Showcase
AWS Webcast - Informatica - Big Data Solutions Showcase
 
OpenID Connect - An Emperor or Just New Cloths?
OpenID Connect - An Emperor or Just New Cloths?OpenID Connect - An Emperor or Just New Cloths?
OpenID Connect - An Emperor or Just New Cloths?
 
Presto - SQL on anything
Presto  - SQL on anythingPresto  - SQL on anything
Presto - SQL on anything
 
The Netflix Way to deal with Big Data Problems
The Netflix Way to deal with Big Data ProblemsThe Netflix Way to deal with Big Data Problems
The Netflix Way to deal with Big Data Problems
 
Hadoop Summit 2012 | Optimizing MapReduce Job Performance
Hadoop Summit 2012 | Optimizing MapReduce Job PerformanceHadoop Summit 2012 | Optimizing MapReduce Job Performance
Hadoop Summit 2012 | Optimizing MapReduce Job Performance
 
(APP304) AWS CloudFormation Best Practices | AWS re:Invent 2014
(APP304) AWS CloudFormation Best Practices | AWS re:Invent 2014(APP304) AWS CloudFormation Best Practices | AWS re:Invent 2014
(APP304) AWS CloudFormation Best Practices | AWS re:Invent 2014
 
(WEB302) Best Practices for Running WordPress on AWS | AWS re:Invent 2014
(WEB302) Best Practices for Running WordPress on AWS | AWS re:Invent 2014(WEB302) Best Practices for Running WordPress on AWS | AWS re:Invent 2014
(WEB302) Best Practices for Running WordPress on AWS | AWS re:Invent 2014
 
(GAM301) Real-Time Game Analytics with Amazon Kinesis, Amazon Redshift, and A...
(GAM301) Real-Time Game Analytics with Amazon Kinesis, Amazon Redshift, and A...(GAM301) Real-Time Game Analytics with Amazon Kinesis, Amazon Redshift, and A...
(GAM301) Real-Time Game Analytics with Amazon Kinesis, Amazon Redshift, and A...
 
Amazon Redshift의 이해와 활용 (김용우) - AWS DB Day
Amazon Redshift의 이해와 활용 (김용우) - AWS DB DayAmazon Redshift의 이해와 활용 (김용우) - AWS DB Day
Amazon Redshift의 이해와 활용 (김용우) - AWS DB Day
 
Hybrid MongoDB and RDBMS Applications
Hybrid MongoDB and RDBMS ApplicationsHybrid MongoDB and RDBMS Applications
Hybrid MongoDB and RDBMS Applications
 
Core MIDI and Friends
Core MIDI and FriendsCore MIDI and Friends
Core MIDI and Friends
 
(BDT403) Best Practices for Building Real-time Streaming Applications with Am...
(BDT403) Best Practices for Building Real-time Streaming Applications with Am...(BDT403) Best Practices for Building Real-time Streaming Applications with Am...
(BDT403) Best Practices for Building Real-time Streaming Applications with Am...
 
Getting Maximum Performance from Amazon Redshift (DAT305) | AWS re:Invent 2013
Getting Maximum Performance from Amazon Redshift (DAT305) | AWS re:Invent 2013Getting Maximum Performance from Amazon Redshift (DAT305) | AWS re:Invent 2013
Getting Maximum Performance from Amazon Redshift (DAT305) | AWS re:Invent 2013
 

Similaire à Powering Interactive Data Analysis at Pinterest by Amazon Redshift

Data warehousing in the era of Big Data: Deep Dive into Amazon Redshift
Data warehousing in the era of Big Data: Deep Dive into Amazon RedshiftData warehousing in the era of Big Data: Deep Dive into Amazon Redshift
Data warehousing in the era of Big Data: Deep Dive into Amazon RedshiftAmazon Web Services
 
Solving Office 365 Big Challenges using Cassandra + Spark
Solving Office 365 Big Challenges using Cassandra + Spark Solving Office 365 Big Challenges using Cassandra + Spark
Solving Office 365 Big Challenges using Cassandra + Spark Anubhav Kale
 
Enterprise Data World 2018 - Building Cloud Self-Service Analytical Solution
Enterprise Data World 2018 - Building Cloud Self-Service Analytical SolutionEnterprise Data World 2018 - Building Cloud Self-Service Analytical Solution
Enterprise Data World 2018 - Building Cloud Self-Service Analytical SolutionDmitry Anoshin
 
Case Study - How Rackspace Query Terabytes Of Data
Case Study - How Rackspace Query Terabytes Of DataCase Study - How Rackspace Query Terabytes Of Data
Case Study - How Rackspace Query Terabytes Of DataSchubert Zhang
 
AWS Webcast - Managing Big Data in the AWS Cloud_20140924
AWS Webcast - Managing Big Data in the AWS Cloud_20140924AWS Webcast - Managing Big Data in the AWS Cloud_20140924
AWS Webcast - Managing Big Data in the AWS Cloud_20140924Amazon Web Services
 
Database and Analytics on the AWS Cloud
Database and Analytics on the AWS CloudDatabase and Analytics on the AWS Cloud
Database and Analytics on the AWS CloudAmazon Web Services
 
AWS Redshift Introduction - Big Data Analytics
AWS Redshift Introduction - Big Data AnalyticsAWS Redshift Introduction - Big Data Analytics
AWS Redshift Introduction - Big Data AnalyticsKeeyong Han
 
Storage Systems For Scalable systems
Storage Systems For Scalable systemsStorage Systems For Scalable systems
Storage Systems For Scalable systemselliando dias
 
Getting Started with Amazon Redshift
Getting Started with Amazon RedshiftGetting Started with Amazon Redshift
Getting Started with Amazon RedshiftAmazon Web Services
 
Getting Started with Amazon Redshift
Getting Started with Amazon RedshiftGetting Started with Amazon Redshift
Getting Started with Amazon RedshiftAmazon Web Services
 
DoneDeal - AWS Data Analytics Platform
DoneDeal - AWS Data Analytics PlatformDoneDeal - AWS Data Analytics Platform
DoneDeal - AWS Data Analytics Platformmartinbpeters
 
Interactively Querying Large-scale Datasets on Amazon S3
Interactively Querying Large-scale Datasets on Amazon S3Interactively Querying Large-scale Datasets on Amazon S3
Interactively Querying Large-scale Datasets on Amazon S3Amazon Web Services
 
DataEngConf: Parquet at Datadog: Fast, Efficient, Portable Storage for Big Data
DataEngConf: Parquet at Datadog: Fast, Efficient, Portable Storage for Big DataDataEngConf: Parquet at Datadog: Fast, Efficient, Portable Storage for Big Data
DataEngConf: Parquet at Datadog: Fast, Efficient, Portable Storage for Big DataHakka Labs
 
World-class Data Engineering with Amazon Redshift
World-class Data Engineering with Amazon RedshiftWorld-class Data Engineering with Amazon Redshift
World-class Data Engineering with Amazon RedshiftLars Kamp
 
Data & Analytics - Session 2 - Introducing Amazon Redshift
Data & Analytics - Session 2 - Introducing Amazon RedshiftData & Analytics - Session 2 - Introducing Amazon Redshift
Data & Analytics - Session 2 - Introducing Amazon RedshiftAmazon Web Services
 
Hardware Provisioning
Hardware ProvisioningHardware Provisioning
Hardware ProvisioningMongoDB
 
AWS Storage and Database Architecture Best Practices (DAT203) | AWS re:Invent...
AWS Storage and Database Architecture Best Practices (DAT203) | AWS re:Invent...AWS Storage and Database Architecture Best Practices (DAT203) | AWS re:Invent...
AWS Storage and Database Architecture Best Practices (DAT203) | AWS re:Invent...Amazon Web Services
 
20160331 sa introduction to big data pipelining berlin meetup 0.3
20160331 sa introduction to big data pipelining berlin meetup   0.320160331 sa introduction to big data pipelining berlin meetup   0.3
20160331 sa introduction to big data pipelining berlin meetup 0.3Simon Ambridge
 

Similaire à Powering Interactive Data Analysis at Pinterest by Amazon Redshift (20)

Data warehousing in the era of Big Data: Deep Dive into Amazon Redshift
Data warehousing in the era of Big Data: Deep Dive into Amazon RedshiftData warehousing in the era of Big Data: Deep Dive into Amazon Redshift
Data warehousing in the era of Big Data: Deep Dive into Amazon Redshift
 
Solving Office 365 Big Challenges using Cassandra + Spark
Solving Office 365 Big Challenges using Cassandra + Spark Solving Office 365 Big Challenges using Cassandra + Spark
Solving Office 365 Big Challenges using Cassandra + Spark
 
Enterprise Data World 2018 - Building Cloud Self-Service Analytical Solution
Enterprise Data World 2018 - Building Cloud Self-Service Analytical SolutionEnterprise Data World 2018 - Building Cloud Self-Service Analytical Solution
Enterprise Data World 2018 - Building Cloud Self-Service Analytical Solution
 
Case Study - How Rackspace Query Terabytes Of Data
Case Study - How Rackspace Query Terabytes Of DataCase Study - How Rackspace Query Terabytes Of Data
Case Study - How Rackspace Query Terabytes Of Data
 
AWS Webcast - Managing Big Data in the AWS Cloud_20140924
AWS Webcast - Managing Big Data in the AWS Cloud_20140924AWS Webcast - Managing Big Data in the AWS Cloud_20140924
AWS Webcast - Managing Big Data in the AWS Cloud_20140924
 
Database and Analytics on the AWS Cloud
Database and Analytics on the AWS CloudDatabase and Analytics on the AWS Cloud
Database and Analytics on the AWS Cloud
 
Amazon Redshift Deep Dive
Amazon Redshift Deep Dive Amazon Redshift Deep Dive
Amazon Redshift Deep Dive
 
AWS Redshift Introduction - Big Data Analytics
AWS Redshift Introduction - Big Data AnalyticsAWS Redshift Introduction - Big Data Analytics
AWS Redshift Introduction - Big Data Analytics
 
Storage Systems For Scalable systems
Storage Systems For Scalable systemsStorage Systems For Scalable systems
Storage Systems For Scalable systems
 
Getting Started with Amazon Redshift
Getting Started with Amazon RedshiftGetting Started with Amazon Redshift
Getting Started with Amazon Redshift
 
Getting Started with Amazon Redshift
Getting Started with Amazon RedshiftGetting Started with Amazon Redshift
Getting Started with Amazon Redshift
 
DoneDeal - AWS Data Analytics Platform
DoneDeal - AWS Data Analytics PlatformDoneDeal - AWS Data Analytics Platform
DoneDeal - AWS Data Analytics Platform
 
Interactively Querying Large-scale Datasets on Amazon S3
Interactively Querying Large-scale Datasets on Amazon S3Interactively Querying Large-scale Datasets on Amazon S3
Interactively Querying Large-scale Datasets on Amazon S3
 
DataEngConf: Parquet at Datadog: Fast, Efficient, Portable Storage for Big Data
DataEngConf: Parquet at Datadog: Fast, Efficient, Portable Storage for Big DataDataEngConf: Parquet at Datadog: Fast, Efficient, Portable Storage for Big Data
DataEngConf: Parquet at Datadog: Fast, Efficient, Portable Storage for Big Data
 
World-class Data Engineering with Amazon Redshift
World-class Data Engineering with Amazon RedshiftWorld-class Data Engineering with Amazon Redshift
World-class Data Engineering with Amazon Redshift
 
Breaking data
Breaking dataBreaking data
Breaking data
 
Data & Analytics - Session 2 - Introducing Amazon Redshift
Data & Analytics - Session 2 - Introducing Amazon RedshiftData & Analytics - Session 2 - Introducing Amazon Redshift
Data & Analytics - Session 2 - Introducing Amazon Redshift
 
Hardware Provisioning
Hardware ProvisioningHardware Provisioning
Hardware Provisioning
 
AWS Storage and Database Architecture Best Practices (DAT203) | AWS re:Invent...
AWS Storage and Database Architecture Best Practices (DAT203) | AWS re:Invent...AWS Storage and Database Architecture Best Practices (DAT203) | AWS re:Invent...
AWS Storage and Database Architecture Best Practices (DAT203) | AWS re:Invent...
 
20160331 sa introduction to big data pipelining berlin meetup 0.3
20160331 sa introduction to big data pipelining berlin meetup   0.320160331 sa introduction to big data pipelining berlin meetup   0.3
20160331 sa introduction to big data pipelining berlin meetup 0.3
 

Dernier

Videogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfVideogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfinfogdgmi
 
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDEADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDELiveplex
 
COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online CollaborationCOMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online Collaborationbruanjhuli
 
AI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity WebinarAI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity WebinarPrecisely
 
UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6DianaGray10
 
OpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability AdventureOpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability AdventureEric D. Schabell
 
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1DianaGray10
 
Computer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsComputer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsSeth Reyes
 
UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7DianaGray10
 
Nanopower In Semiconductor Industry.pdf
Nanopower  In Semiconductor Industry.pdfNanopower  In Semiconductor Industry.pdf
Nanopower In Semiconductor Industry.pdfPedro Manuel
 
Building AI-Driven Apps Using Semantic Kernel.pptx
Building AI-Driven Apps Using Semantic Kernel.pptxBuilding AI-Driven Apps Using Semantic Kernel.pptx
Building AI-Driven Apps Using Semantic Kernel.pptxUdaiappa Ramachandran
 
Machine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfMachine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfAijun Zhang
 
UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8DianaGray10
 
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfJamie (Taka) Wang
 
AI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just MinutesAI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just MinutesMd Hossain Ali
 
Comparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and IstioComparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and IstioChristian Posta
 
Designing A Time bound resource download URL
Designing A Time bound resource download URLDesigning A Time bound resource download URL
Designing A Time bound resource download URLRuncy Oommen
 
Bird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemBird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemAsko Soukka
 
Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024SkyPlanner
 
Cybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxCybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxGDSC PJATK
 

Dernier (20)

Videogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfVideogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdf
 
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDEADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
 
COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online CollaborationCOMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
 
AI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity WebinarAI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity Webinar
 
UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6
 
OpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability AdventureOpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability Adventure
 
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
 
Computer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsComputer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and Hazards
 
UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7
 
Nanopower In Semiconductor Industry.pdf
Nanopower  In Semiconductor Industry.pdfNanopower  In Semiconductor Industry.pdf
Nanopower In Semiconductor Industry.pdf
 
Building AI-Driven Apps Using Semantic Kernel.pptx
Building AI-Driven Apps Using Semantic Kernel.pptxBuilding AI-Driven Apps Using Semantic Kernel.pptx
Building AI-Driven Apps Using Semantic Kernel.pptx
 
Machine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfMachine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdf
 
UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8
 
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
 
AI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just MinutesAI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just Minutes
 
Comparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and IstioComparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and Istio
 
Designing A Time bound resource download URL
Designing A Time bound resource download URLDesigning A Time bound resource download URL
Designing A Time bound resource download URL
 
Bird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemBird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystem
 
Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024
 
Cybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxCybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptx
 

Powering Interactive Data Analysis at Pinterest by Amazon Redshift

  • 1. Powering interactive data analysis by Amazon Redshift Jie Li Data Infra at Pinterest
  • 2. Pinterest: a place to get inspired and plan for the future
  • 3. Data Infra at Pinterest Production data pipeline Kafka Pinball (*) Hive S3 MySQL HBase Redis Cascading Hadoop MySQL Amazon Web Service * Pinball is our own workflow manager that we plan to open source. Ad-hoc data analysis Analytics Dashboard
  • 4. We need a low latency data warehouse! Production data pipeline Kafka Pinball Hive S3 MySQL HBase Redis Cascading Hadoop High latency! Ad-hoc data analysis MySQL not a viable data warehouse. MySQL Amazon Web Service Analytics Dashboard
  • 5. Low-latency data warehouse • SQL on Hadoop – Shark, Impala, Drill, Tez, Presto, … – Open source and free  – Immature?  • Massive Parallel Processing (MPP) – Asterdata, Vertica, ParAccel, … – Built on mature technologies like Postgres  – Expensive and only available on-premise  • Amazon Redshift – ParAccel on AWS  – Mature but also cost-effective 
  • 6. Highlights of Redshift High cost efficiency on-demand $0.85 per hour 3yr reserved instances $999/TB/year Free snapshot on S3
  • 7. Highlights of Redshift Low maintenance overhead Fully self-managed Automated maintenance & upgrade Built-in admin dashboard
  • 8. Highlights of Redshift Superior performance 6000 5000 25-100x over Hive • Columnar layout • Index • Advanced optimizer • Efficient execution second 4000 3000 2000 1000 0 Q1 Q2 Hive Q3 RedShift Note: based on our own dataset and queries. Q4
  • 9. Cool, but how to integrate Redshift with Hive/Hadoop
  • 10. First, get data from Hive into Redshift Unstructure d Unclean Structured Clean Extract & Transform Hive Columnar Compact Compressed Load S3 Hadoop/Hive is perfect for heavy-lifting ETL workloads Redshift
  • 11. Building ETL from Hive to Redshift What worked What didn’t work Schematizing Hive tables Writing column-mapping scripts to generate ETL queries N/A Cleaning data Filtering out non-ASCII characters Loading all characters Loading big tables with sortkey Sorting externally in Hadoop/Hive and loading in chunks Loading unordered data directly Loading time-series tables Appending to the table in the order of time (sortkey) A table per day connected with view performing poorly Table retention Insert into a new table Delete and vacuum (poor performance)
  • 12. But it’s just the beginning. Make sure you audit the ETL from Day 1
  • 13. Audit ETL for Data Consistency Everything was good until one day we noticed one table was only half of its size  S3 is only eventual consistent (EC) !  Hive Solutions: S3 ① Audit Redshift Audit ② Also reduce number of files on S3 to alleviate EC. Also, recently there is a new feature to specify a manifest for files on S3.
  • 14. Now we got the data. Is it ready for superior performance?
  • 15. Understand the performance Leader ① Understand the query execution plan (via “explain”). Always update system stats after data loading by running “analyze”. System Stats Compute Compute Compute ② Optimize the data layout by choosing consistent distkeys across tables, and always choose a sortkey. Watch out for bad distkey with skew (e.g. distkey with null values).
  • 16. What if a query took long It’s worth doing your own homework Filing tickets doesn’t work well for perf issues • Requires a lot of information exchange • May be caused by minor issues Case: we optimized a query from 3 hours to 7 seconds after studying the query plan and fixing the system stats (the broadcast join regarded the larger table as the smaller one).
  • 17. Educate users with best practices  Best Practice Details Select only the columns you need Redshift is a columnar database and it only scans the columns you need to speed things up. “SELECT *” is usually bad. Use the sortkey (dt or created_at) Using sortkey can skip unnecessary data. Most of our tables are using dt or created_at as the sortkey. Avoid slow data transferring Transferring large query result from Redshift to the local client may be slow. Try saving the result as a Redshift table or using the command line client on EC2. Apply selective filters before join Join operation can be significantly faster if we filter out irrelevant data as much as possible. Run one query at a time The performance gets diluted with more queries. So be patient. Understand the query plan by EXPLAIN EXPLAIN gives you idea why a query may be slow. For advanced users only.
  • 18. Hopefully users will follow the best practice  But Redshift is a shared service One query may slow down the whole cluster 
  • 19. Proactive monitoring System tables (e.g. stl_query) It’s easy to write scripts for Real-time monitoring slow queries • Ping users with best practice • Send alerts to admin Analyzing patterns • Who need help • Who was “abusing” Hint: manually backup these system tables as they will be cleaned up weekly.
  • 20. Optimizing workload management • Run heavy ETL during night – ETL is resource intensive – No easy way to limit the resource usage (IO/CPU) • Time out user queries during peak hours – Long queries (>= 30 mins) likely have mistakes – Sacrifice a few users for the majority • Unlike Hadoop, there is no preemption in Redshift 
  • 21. Current status of Redshift at Pinterest • • • • 16 node 256TB cluster with 100TB+ core data Ingesting 1.5TB data per day with retention 30+ daily users 500+ ad-hoc queries per day – 75% <= 35 seconds, 90% <= 2 minute • operational effort <= 5 hours/week
  • 22. Redshift integrated at Pinterest Pinball Hive Kafka Cascading Production data pipeline Hadoop Ad-hoc data analysis S3 MySQL HBase Redis Redshift MySQL Amazon Web Service Analytics Dashboard
  • 23. Next step • Next generation of analytics dashboards – Replace offline MySQL with Redshift – Replace custom dashboards with Tableau Pinball Hive Kafka Cascading Production data pipeline Hadoop Ad-hoc data analysis S3 MySQL HBase Redis Redshift Amazon Web Service Tableau
  • 24. Remaining risks • SLA for low latency queries – Due to the lack of preemption, it can not guarantee mission-critical queries to finish fast • High availability – Takes hours to restore clusters from snapshots – May need a standby cluster in future

Notes de l'éditeur

  1. Add Hadoop Logo
  2. To verify machine generated queries
  3. Use logo for hadoop/hive. Add Pinball?