SlideShare une entreprise Scribd logo
1  sur  38
June 7th, 2017
Introduction to Amazon Redshift Spectrum
Maor Kleider, Sr Product Manager, Amazon Redshift
©2016, Amazon Web Services, Inc. or its affiliates. All rights reserved.
What is Big Data?
When your data sets become so large and diverse
that you have to start innovating around how to
collect, store, process, analyze and share them
Generate
Collect & Store
Analyze
Collaborate & Act
Individual AWS customers
generate over a PB/day
It’s never been easier to generate vast amounts of data
Generate
Collect & Store
Analyze
Collaborate & Act
Individual AWS customers
generating over PB/day
Amazon S3 lets you collect and store all this data
Store exabytes of
data in S3
Generate
Collect & Store
Analyze
Collaborate & Act
Individual AWS customers
generating over PB/day
Highly
Constrained
But how do you analyze it?
Store exabytes of
data in S3
1990 2000 2010 2020
Generated Data
Available for Analysis
Sources:
Gartner: User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011
IDC: Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares
Data Volume
Year
The Dark Data Problem
Most generated data is unavailable for analysis
The tyranny of “OR”
Amazon EMR
Directly access data in S3
Scale out to thousands of nodes
Open data formats
Popular big data frameworks
Anything you can dream up and code
Amazon Redshift
Super-fast local disk performance
Sophisticated query optimization
Join-optimized data formats
Query using standard SQL
Optimized for data warehousing
But I don’t want to choose.
I shouldn’t have to choose
I want “all of the above”
I want
sophisticated query optimization and scale-out processing
super fast performance and support for open formats
the throughput of local disk and the scale of S3
I want all this
From one data processing engine
With my data accessible from all data processing engines
Now and in the future
Amazon Redshift Spectrum
Amazon Redshift Spectrum
Run SQL queries directly against data in S3 using thousands of nodes
Fast @ exabyte scale Elastic & highly available On-demand, pay-per-query
High concurrency: Multiple
clusters access same data
No ETL: Query data in-place
using open file formats
Full Amazon Redshift
SQL support
S3
SQL
Query
SELECT COUNT(*)
FROM S3.EXT_TABLE
GROUP BY…
Life of a query
Amazon
Redshift
JDBC/ODBC
...
1 2 3 4 N
Amazon S3
Exabyte-scale object storage
Data Catalog
Apache Hive Metastore
1
Query is optimized and compiled at
the leader node. Determine what gets
run locally and what goes to Amazon
Redshift Spectrum
Life of a query
Amazon
Redshift
JDBC/ODBC
...
1 2 3 4 N
Amazon S3
Exabyte-scale object storage
Data Catalog
Apache Hive Metastore
2
Query plan is sent to
all compute nodes
Life of a query
Amazon
Redshift
JDBC/ODBC
...
1 2 3 4 N
Amazon S3
Exabyte-scale object storage
Data Catalog
Apache Hive Metastore
3
Compute nodes obtain partition info from
Data Catalog; dynamically prune partitions
Life of a query
Amazon
Redshift
JDBC/ODBC
...
1 2 3 4 N
Amazon S3
Exabyte-scale object storage
Data Catalog
Apache Hive Metastore
4
Each compute node issues multiple
requests to the Amazon Redshift
Spectrum layer
Life of a query
Amazon
Redshift
JDBC/ODBC
...
1 2 3 4 N
Amazon S3
Exabyte-scale object storage
Data Catalog
Apache Hive Metastore
5
Amazon Redshift Spectrum nodes
scan your S3 data
Life of a query
Amazon
Redshift
JDBC/ODBC
...
1 2 3 4 N
Amazon S3
Exabyte-scale object storage
Data Catalog
Apache Hive Metastore
6
7
Amazon Redshift
Spectrum projects,
filters, joins and
aggregates
Life of a query
Amazon
Redshift
JDBC/ODBC
...
1 2 3 4 N
Amazon S3
Exabyte-scale object storage
Data Catalog
Apache Hive Metastore
Final aggregations and joins
with local Amazon Redshift
tables done in-cluster
Life of a query
Amazon
Redshift
JDBC/ODBC
...
1 2 3 4 N
Amazon S3
Exabyte-scale object storage
Data Catalog
Apache Hive Metastore
8
Result is sent back to client
Life of a query
Amazon
Redshift
JDBC/ODBC
...
1 2 3 4 N
Amazon S3
Exabyte-scale object storage
Data Catalog
Apache Hive Metastore
9
Running an analytic query
over an exabyte in S3
Lets build an analytic query - #1
An author is releasing the 8th book in her popular series. How
many should we order for Seattle? What were prior first few
day sales?
Lets get the prior books she’s written.
1 Table
2 Filters
SELECT
P.ASIN,
P.TITLE
FROM
products P
WHERE
P.TITLE LIKE ‘%POTTER%’ AND
P.AUTHOR = ‘J. K. Rowling’
Lets build an analytic query - #2
An author is releasing the 8th book in her popular series. How
many should we order for Seattle? What were prior first few
day sales?
Lets compute the sales of the prior books she’s written in this
series and return the top 20 values
2 Tables (1 S3, 1 local)
2 Filters
1 Join
2 Group By columns
1 Order By
1 Limit
1 Aggregation
SELECT
P.ASIN,
P.TITLE,
SUM(D.QUANTITY * D.OUR_PRICE) AS SALES_sum
FROM
s3.d_customer_order_item_details D,
products P
WHERE
D.ASIN = P.ASIN AND
P.TITLE LIKE '%Potter%' AND
P.AUTHOR = 'J. K. Rowling' AND
GROUP BY P.ASIN, P.TITLE
ORDER BY SALES_sum DESC
LIMIT 20;
Lets build an analytic query - #3
An author is releasing the 8th book in her popular series. How
many should we order for Seattle? What were prior first few
day sales?
Lets compute the sales of the prior books she’s written in this
series and return the top 20 values, just for the first three days
of sales of first editions
3 Tables (1 S3, 2 local)
5 Filters
2 Joins
3 Group By columns
1 Order By
1 Limit
1 Aggregation
1 Function
2 Casts
SELECT
P.ASIN,
P.TITLE,
P.RELEASE_DATE,
SUM(D.QUANTITY * D.OUR_PRICE) AS SALES_sum
FROM
s3.d_customer_order_item_details D,
asin_attributes A,
products P
WHERE
D.ASIN = P.ASIN AND
P.ASIN = A.ASIN AND
A.EDITION LIKE '%FIRST%' AND
P.TITLE LIKE '%Potter%' AND
P.AUTHOR = 'J. K. Rowling' AND
D.ORDER_DAY :: DATE >= P.RELEASE_DATE AND
D.ORDER_DAY :: DATE < dateadd(day, 3, P.RELEASE_DATE)
GROUP BY P.ASIN, P.TITLE, P.RELEASE_DATE
ORDER BY SALES_sum DESC
LIMIT 20;
Lets build an analytic query - #4
An author is releasing the 8th book in her popular series. How
many should we order for Seattle? What were prior first few
day sales?
Lets compute the sales of the prior books she’s written in this
series and return the top 20 values, just for the first three days
of sales of first editions in the city of Seattle, WA, USA
4 Tables (1 S3, 3 local)
8 Filters
3 Joins
4 Group By columns
1 Order By
1 Limit
1 Aggregation
1 Function
2 Casts
SELECT
P.ASIN,
P.TITLE,
R.POSTAL_CODE,
P.RELEASE_DATE,
SUM(D.QUANTITY * D.OUR_PRICE) AS SALES_sum
FROM
s3.d_customer_order_item_details D,
asin_attributes A,
products P,
regions R
WHERE
D.ASIN = P.ASIN AND
P.ASIN = A.ASIN AND
D.REGION_ID = R.REGION_ID AND
A.EDITION LIKE '%FIRST%' AND
P.TITLE LIKE '%Potter%' AND
P.AUTHOR = 'J. K. Rowling' AND
R.COUNTRY_CODE = ‘US’ AND
R.CITY = ‘Seattle’ AND
R.STATE = ‘WA’ AND
D.ORDER_DAY :: DATE >= P.RELEASE_DATE AND
D.ORDER_DAY :: DATE < dateadd(day, 3, P.RELEASE_DATE)
GROUP BY P.ASIN, P.TITLE, R.POSTAL_CODE, P.RELEASE_DATE
ORDER BY SALES_sum DESC
LIMIT 20;
Now let’s run that query over an exabyte of data in S3
Roughly 140 TB of customer item order detail
records for each day over past 20 years.
190 million files across 15,000 partitions in S3.
One partition per day for USA and rest of world.
Need a billion-fold reduction in data processed.
Running this query using a 1000 node Hive cluster
would take over 5 years.*
• Compression ……………..….……..5X
• Columnar file format……….......…10X
• Scanning with 2500 nodes…....2500X
• Static partition elimination…............2X
• Dynamic partition elimination..….350X
• Redshift’s query optimizer……......40X
---------------------------------------------------
Total reduction……….…………3.5B X
* Estimated using 20 node Hive cluster & 1.4TB, assume linear
* Query used a 20 node DC1.8XLarge Amazon Redshift cluster
* Not actual sales data - generated for this demo based on data
format used by Amazon Retail.
Amazon Redshift Spectrum is fast
Leverages Amazon Redshift’s advanced cost-based optimizer
Pushes down projections, filters, aggregations and join reduction
Dynamic partition pruning to minimize data processed
Automatic parallelization of query execution against S3 data
Efficient join processing within the Amazon Redshift cluster
Amazon Redshift Spectrum is cost-effective
You pay for your Amazon Redshift cluster plus $5 per TB scanned from S3
Each query can leverage 1000s of Amazon Redshift Spectrum nodes
You can reduce the TB scanned and improve query performance by:
Partitioning data
Using a columnar file format
Compressing data
Amazon Redshift Spectrum is secure
End-to-end
data encryption
Alerts &
notifications
Virtual private cloud
Audit logging
Certifications &
compliance
Encrypt S3 data using SSE and
AWS KMS
Encrypt all Amazon Redshift
data using KMS, AWS
CloudHSM or your on-premises
HSMs
Enforce SSL with perfect forward
encryption using ECDHE
Amazon Redshift leader node in
your VPC. Compute nodes in
private VPC. Spectrum nodes in
private VPC, store no state.
Communicate event-specific
notifications via email, text
message, or call with Amazon
SNS
All API calls are logged using
AWS CloudTrail
All SQL statements are logged
within Amazon Redshift
PCI/DSSFedRAMP
SOC1/2/3 HIPAA/BAA
Amazon Redshift Spectrum uses standard SQL
Redshift Spectrum seamlessly integrates with your existing SQL & BI apps
Support for complex joins, nested queries & window functions
Support for data partitioned in S3 by any key
Date, Time and any other custom keys
e.g., Year, Month, Day, Hour
Defining External Schema and Creating Tables
Define an external schema in Amazon Redshift using the Amazon Athena data
catalog or your own Apache Hive Metastore
CREATE EXTERNAL SCHEMA <schema_name>
Query external tables using <schema_name>.<table_name>
Register external tables using Athena, your Hive Metastore client, or from
Amazon Redshift CREATE EXTERNAL TABLE SCHEMA syntax
CREATE EXTERNAL TABLE <table_name>
[PARTITIONED BY <column_name, data_type, …>]
STORED AS file_format
LOCATION s3_location
[TABLE PROPERTIES property_name=property_value, …];
Amazon Redshift Spectrum – Current support
File formats
• Parquet
• CSV
• Sequence
• RCFile
• ORC (coming soon)
• RegExSerDe (coming soon)
Compression
• Gzip
• Snappy
• Lzo (coming soon)
• Bz2
Encryption
• SSE with AES256
• SSE KMS with default
key
Column types
• Numeric: bigint, int, smallint, float, double
and decimal
• Char/varchar/string
• Timestamp
• Boolean
• DATE type can be used only as a
partitioning key
Table type
• Non-partitioned table
(s3://mybucket/orders/..)
• Partitioned table
(s3://mybucket/orders/date=YYYY-MM-
DD/..)
Converting to Parquet and ORC using Amazon EMR
You can use Hive CREATE TABLE AS SELECT to convert data
CREATE TABLE data_converted
STORED AS PARQUET
AS
SELECT col_1, col2, col3 FROM data_source
Or use Spark - 20 lines of Pyspark code, running on Amazon EMR
• 1TB of text data reduced to 130 GB in Parquet format with snappy compression
• Total cost of EMR job to do this: $5
https://github.com/awslabs/aws-big-data-blog/tree/master/aws-blog-spark-parquet-conversion
Is Amazon Redshift Spectrum useful if I don’t have an exabyte?
Your data will get bigger
On average, data warehousing volumes grow 10x every 5 years
The average Amazon Redshift customer doubles data each year
Amazon Redshift Spectrum makes data analysis simpler
Access your data without ETL pipelines
Teams using Amazon EMR, Athena & Redshift can collaborate using the same data lake
Amazon Redshift Spectrum improves availability and concurrency
Run multiple Amazon Redshift clusters against common data
Isolate jobs with tight SLAs from ad hoc analysis
The Emerging Analytics Architecture
AthenaAmazon Athena
Interactive Query
AWS Glue
ETL & Data Catalog
Storage
Serverless
Compute
Data
Processing
Amazon S3
Exabyte-scale Object Storage
Amazon Kinesis Firehose
Real-Time Data Streaming
Amazon EMR
Managed Hadoop Applications
AWS Lambda
Trigger-based Code Execution
AWS Glue Data Catalog
Hive-compatible Metastore
Amazon Redshift Spectrum
Fast @ Exabyte scale
Amazon Redshift
Petabyte-scale Data Warehousing
Over 20 customers helped preview Amazon Redshift Spectrum
Thank you!
Questions?

Contenu connexe

Tendances

Tendances (20)

BDA311 Introduction to AWS Glue
BDA311 Introduction to AWS GlueBDA311 Introduction to AWS Glue
BDA311 Introduction to AWS Glue
 
Introduction to Amazon Athena
Introduction to Amazon AthenaIntroduction to Amazon Athena
Introduction to Amazon Athena
 
DynamodbDB Deep Dive
DynamodbDB Deep DiveDynamodbDB Deep Dive
DynamodbDB Deep Dive
 
Querying and Analyzing Data in Amazon S3
Querying and Analyzing Data in Amazon S3Querying and Analyzing Data in Amazon S3
Querying and Analyzing Data in Amazon S3
 
Data Warehousing in the Era of Big Data
Data Warehousing in the Era of Big DataData Warehousing in the Era of Big Data
Data Warehousing in the Era of Big Data
 
Building Your Data Warehouse with Amazon Redshift
Building Your Data Warehouse with Amazon RedshiftBuilding Your Data Warehouse with Amazon Redshift
Building Your Data Warehouse with Amazon Redshift
 
Deep Dive on Amazon DynamoDB
Deep Dive on Amazon DynamoDBDeep Dive on Amazon DynamoDB
Deep Dive on Amazon DynamoDB
 
Amazon Athena Capabilities and Use Cases Overview
Amazon Athena Capabilities and Use Cases Overview Amazon Athena Capabilities and Use Cases Overview
Amazon Athena Capabilities and Use Cases Overview
 
Introduction to Amazon Athena
Introduction to Amazon AthenaIntroduction to Amazon Athena
Introduction to Amazon Athena
 
(BDT320) New! Streaming Data Flows with Amazon Kinesis Firehose
(BDT320) New! Streaming Data Flows with Amazon Kinesis Firehose(BDT320) New! Streaming Data Flows with Amazon Kinesis Firehose
(BDT320) New! Streaming Data Flows with Amazon Kinesis Firehose
 
Announcing Amazon Athena - Instantly Analyze Your Data in S3 Using SQL
Announcing Amazon Athena - Instantly Analyze Your Data in S3 Using SQLAnnouncing Amazon Athena - Instantly Analyze Your Data in S3 Using SQL
Announcing Amazon Athena - Instantly Analyze Your Data in S3 Using SQL
 
Introduction to Amazon DynamoDB
Introduction to Amazon DynamoDBIntroduction to Amazon DynamoDB
Introduction to Amazon DynamoDB
 
Aws Atlanta meetup Amazon Athena
Aws Atlanta meetup Amazon AthenaAws Atlanta meetup Amazon Athena
Aws Atlanta meetup Amazon Athena
 
(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift
(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift
(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift
 
AWS Analytics
AWS AnalyticsAWS Analytics
AWS Analytics
 
Building a Modern Data Warehouse: Deep Dive on Amazon Redshift - SRV337 - Chi...
Building a Modern Data Warehouse: Deep Dive on Amazon Redshift - SRV337 - Chi...Building a Modern Data Warehouse: Deep Dive on Amazon Redshift - SRV337 - Chi...
Building a Modern Data Warehouse: Deep Dive on Amazon Redshift - SRV337 - Chi...
 
February 2016 Webinar Series - Introduction to DynamoDB
February 2016 Webinar Series - Introduction to DynamoDBFebruary 2016 Webinar Series - Introduction to DynamoDB
February 2016 Webinar Series - Introduction to DynamoDB
 
AWS Storage and Database Architecture Best Practices (DAT203) | AWS re:Invent...
AWS Storage and Database Architecture Best Practices (DAT203) | AWS re:Invent...AWS Storage and Database Architecture Best Practices (DAT203) | AWS re:Invent...
AWS Storage and Database Architecture Best Practices (DAT203) | AWS re:Invent...
 
(BDT310) Big Data Architectural Patterns and Best Practices on AWS | AWS re:I...
(BDT310) Big Data Architectural Patterns and Best Practices on AWS | AWS re:I...(BDT310) Big Data Architectural Patterns and Best Practices on AWS | AWS re:I...
(BDT310) Big Data Architectural Patterns and Best Practices on AWS | AWS re:I...
 
Data Warehousing with Amazon Redshift
Data Warehousing with Amazon RedshiftData Warehousing with Amazon Redshift
Data Warehousing with Amazon Redshift
 

Similaire à Intro to Amazon Redshift Spectrum: Quickly Query Exabytes of Data in S3 - June 2017 AWS Online Tech Talks

Running Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data PlatformRunning Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data Platform
Eva Tse
 

Similaire à Intro to Amazon Redshift Spectrum: Quickly Query Exabytes of Data in S3 - June 2017 AWS Online Tech Talks (20)

The Power of Big Data - AWS Summit Bahrain 2017
The Power of Big Data - AWS Summit Bahrain 2017The Power of Big Data - AWS Summit Bahrain 2017
The Power of Big Data - AWS Summit Bahrain 2017
 
Webinar | Accessing Your Data Lake Assets from Amazon Redshift Spectrum
Webinar | Accessing Your Data Lake Assets from Amazon Redshift SpectrumWebinar | Accessing Your Data Lake Assets from Amazon Redshift Spectrum
Webinar | Accessing Your Data Lake Assets from Amazon Redshift Spectrum
 
SRV405 Deep Dive Amazon Redshift & Redshift Spectrum at Cardinal Health
SRV405 Deep Dive Amazon Redshift & Redshift Spectrum at Cardinal HealthSRV405 Deep Dive Amazon Redshift & Redshift Spectrum at Cardinal Health
SRV405 Deep Dive Amazon Redshift & Redshift Spectrum at Cardinal Health
 
Deep Dive on Amazon DynamoDB
Deep Dive on Amazon DynamoDBDeep Dive on Amazon DynamoDB
Deep Dive on Amazon DynamoDB
 
(WRK302) Event-Driven Programming
(WRK302) Event-Driven Programming(WRK302) Event-Driven Programming
(WRK302) Event-Driven Programming
 
Success has Many Query Engines- Tel Aviv Summit 2018
Success has Many Query Engines- Tel Aviv Summit 2018Success has Many Query Engines- Tel Aviv Summit 2018
Success has Many Query Engines- Tel Aviv Summit 2018
 
Amazon DynamoDB Design Workshop
Amazon DynamoDB Design WorkshopAmazon DynamoDB Design Workshop
Amazon DynamoDB Design Workshop
 
Deep Dive on Amazon DynamoDB
Deep Dive on Amazon DynamoDBDeep Dive on Amazon DynamoDB
Deep Dive on Amazon DynamoDB
 
Modernise your Data Warehouse - AWS Summit Sydney 2018
Modernise your Data Warehouse - AWS Summit Sydney 2018Modernise your Data Warehouse - AWS Summit Sydney 2018
Modernise your Data Warehouse - AWS Summit Sydney 2018
 
AWS re:Invent 2016: Big Data Mini Con State of the Union (BDM205)
AWS re:Invent 2016: Big Data Mini Con State of the Union (BDM205)AWS re:Invent 2016: Big Data Mini Con State of the Union (BDM205)
AWS re:Invent 2016: Big Data Mini Con State of the Union (BDM205)
 
From Data Collection to Actionable Insights in 60 Seconds: AWS Developer Work...
From Data Collection to Actionable Insights in 60 Seconds: AWS Developer Work...From Data Collection to Actionable Insights in 60 Seconds: AWS Developer Work...
From Data Collection to Actionable Insights in 60 Seconds: AWS Developer Work...
 
AWS Analytics Immersion Day - Build BI System from Scratch (Day1, Day2 Full V...
AWS Analytics Immersion Day - Build BI System from Scratch (Day1, Day2 Full V...AWS Analytics Immersion Day - Build BI System from Scratch (Day1, Day2 Full V...
AWS Analytics Immersion Day - Build BI System from Scratch (Day1, Day2 Full V...
 
Darin Briskman_Amazon_June_9_2017_Presentation
Darin Briskman_Amazon_June_9_2017_PresentationDarin Briskman_Amazon_June_9_2017_Presentation
Darin Briskman_Amazon_June_9_2017_Presentation
 
Adding Search to Relational Databases
Adding Search to Relational DatabasesAdding Search to Relational Databases
Adding Search to Relational Databases
 
Deep Dive on Amazon DynamoDB
Deep Dive on Amazon DynamoDBDeep Dive on Amazon DynamoDB
Deep Dive on Amazon DynamoDB
 
Running Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data PlatformRunning Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data Platform
 
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data Platform(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
 
Adding Search to Relational Databases
Adding Search to Relational DatabasesAdding Search to Relational Databases
Adding Search to Relational Databases
 
DynamoDB Design Workshop
DynamoDB Design WorkshopDynamoDB Design Workshop
DynamoDB Design Workshop
 
Owning Your Own (Data) Lake House
Owning Your Own (Data) Lake HouseOwning Your Own (Data) Lake House
Owning Your Own (Data) Lake House
 

Plus de Amazon Web Services

Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
Amazon Web Services
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
Amazon Web Services
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
Amazon Web Services
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
Amazon Web Services
 

Plus de Amazon Web Services (20)

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 

Dernier

Dernier (20)

2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 

Intro to Amazon Redshift Spectrum: Quickly Query Exabytes of Data in S3 - June 2017 AWS Online Tech Talks

  • 1. June 7th, 2017 Introduction to Amazon Redshift Spectrum Maor Kleider, Sr Product Manager, Amazon Redshift ©2016, Amazon Web Services, Inc. or its affiliates. All rights reserved.
  • 2. What is Big Data? When your data sets become so large and diverse that you have to start innovating around how to collect, store, process, analyze and share them
  • 3. Generate Collect & Store Analyze Collaborate & Act Individual AWS customers generate over a PB/day It’s never been easier to generate vast amounts of data
  • 4. Generate Collect & Store Analyze Collaborate & Act Individual AWS customers generating over PB/day Amazon S3 lets you collect and store all this data Store exabytes of data in S3
  • 5. Generate Collect & Store Analyze Collaborate & Act Individual AWS customers generating over PB/day Highly Constrained But how do you analyze it? Store exabytes of data in S3
  • 6. 1990 2000 2010 2020 Generated Data Available for Analysis Sources: Gartner: User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011 IDC: Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares Data Volume Year The Dark Data Problem Most generated data is unavailable for analysis
  • 7. The tyranny of “OR” Amazon EMR Directly access data in S3 Scale out to thousands of nodes Open data formats Popular big data frameworks Anything you can dream up and code Amazon Redshift Super-fast local disk performance Sophisticated query optimization Join-optimized data formats Query using standard SQL Optimized for data warehousing
  • 8. But I don’t want to choose. I shouldn’t have to choose I want “all of the above”
  • 9. I want sophisticated query optimization and scale-out processing super fast performance and support for open formats the throughput of local disk and the scale of S3
  • 10. I want all this From one data processing engine With my data accessible from all data processing engines Now and in the future
  • 12. Amazon Redshift Spectrum Run SQL queries directly against data in S3 using thousands of nodes Fast @ exabyte scale Elastic & highly available On-demand, pay-per-query High concurrency: Multiple clusters access same data No ETL: Query data in-place using open file formats Full Amazon Redshift SQL support S3 SQL
  • 13. Query SELECT COUNT(*) FROM S3.EXT_TABLE GROUP BY… Life of a query Amazon Redshift JDBC/ODBC ... 1 2 3 4 N Amazon S3 Exabyte-scale object storage Data Catalog Apache Hive Metastore 1
  • 14. Query is optimized and compiled at the leader node. Determine what gets run locally and what goes to Amazon Redshift Spectrum Life of a query Amazon Redshift JDBC/ODBC ... 1 2 3 4 N Amazon S3 Exabyte-scale object storage Data Catalog Apache Hive Metastore 2
  • 15. Query plan is sent to all compute nodes Life of a query Amazon Redshift JDBC/ODBC ... 1 2 3 4 N Amazon S3 Exabyte-scale object storage Data Catalog Apache Hive Metastore 3
  • 16. Compute nodes obtain partition info from Data Catalog; dynamically prune partitions Life of a query Amazon Redshift JDBC/ODBC ... 1 2 3 4 N Amazon S3 Exabyte-scale object storage Data Catalog Apache Hive Metastore 4
  • 17. Each compute node issues multiple requests to the Amazon Redshift Spectrum layer Life of a query Amazon Redshift JDBC/ODBC ... 1 2 3 4 N Amazon S3 Exabyte-scale object storage Data Catalog Apache Hive Metastore 5
  • 18. Amazon Redshift Spectrum nodes scan your S3 data Life of a query Amazon Redshift JDBC/ODBC ... 1 2 3 4 N Amazon S3 Exabyte-scale object storage Data Catalog Apache Hive Metastore 6
  • 19. 7 Amazon Redshift Spectrum projects, filters, joins and aggregates Life of a query Amazon Redshift JDBC/ODBC ... 1 2 3 4 N Amazon S3 Exabyte-scale object storage Data Catalog Apache Hive Metastore
  • 20. Final aggregations and joins with local Amazon Redshift tables done in-cluster Life of a query Amazon Redshift JDBC/ODBC ... 1 2 3 4 N Amazon S3 Exabyte-scale object storage Data Catalog Apache Hive Metastore 8
  • 21. Result is sent back to client Life of a query Amazon Redshift JDBC/ODBC ... 1 2 3 4 N Amazon S3 Exabyte-scale object storage Data Catalog Apache Hive Metastore 9
  • 22. Running an analytic query over an exabyte in S3
  • 23. Lets build an analytic query - #1 An author is releasing the 8th book in her popular series. How many should we order for Seattle? What were prior first few day sales? Lets get the prior books she’s written. 1 Table 2 Filters SELECT P.ASIN, P.TITLE FROM products P WHERE P.TITLE LIKE ‘%POTTER%’ AND P.AUTHOR = ‘J. K. Rowling’
  • 24. Lets build an analytic query - #2 An author is releasing the 8th book in her popular series. How many should we order for Seattle? What were prior first few day sales? Lets compute the sales of the prior books she’s written in this series and return the top 20 values 2 Tables (1 S3, 1 local) 2 Filters 1 Join 2 Group By columns 1 Order By 1 Limit 1 Aggregation SELECT P.ASIN, P.TITLE, SUM(D.QUANTITY * D.OUR_PRICE) AS SALES_sum FROM s3.d_customer_order_item_details D, products P WHERE D.ASIN = P.ASIN AND P.TITLE LIKE '%Potter%' AND P.AUTHOR = 'J. K. Rowling' AND GROUP BY P.ASIN, P.TITLE ORDER BY SALES_sum DESC LIMIT 20;
  • 25. Lets build an analytic query - #3 An author is releasing the 8th book in her popular series. How many should we order for Seattle? What were prior first few day sales? Lets compute the sales of the prior books she’s written in this series and return the top 20 values, just for the first three days of sales of first editions 3 Tables (1 S3, 2 local) 5 Filters 2 Joins 3 Group By columns 1 Order By 1 Limit 1 Aggregation 1 Function 2 Casts SELECT P.ASIN, P.TITLE, P.RELEASE_DATE, SUM(D.QUANTITY * D.OUR_PRICE) AS SALES_sum FROM s3.d_customer_order_item_details D, asin_attributes A, products P WHERE D.ASIN = P.ASIN AND P.ASIN = A.ASIN AND A.EDITION LIKE '%FIRST%' AND P.TITLE LIKE '%Potter%' AND P.AUTHOR = 'J. K. Rowling' AND D.ORDER_DAY :: DATE >= P.RELEASE_DATE AND D.ORDER_DAY :: DATE < dateadd(day, 3, P.RELEASE_DATE) GROUP BY P.ASIN, P.TITLE, P.RELEASE_DATE ORDER BY SALES_sum DESC LIMIT 20;
  • 26. Lets build an analytic query - #4 An author is releasing the 8th book in her popular series. How many should we order for Seattle? What were prior first few day sales? Lets compute the sales of the prior books she’s written in this series and return the top 20 values, just for the first three days of sales of first editions in the city of Seattle, WA, USA 4 Tables (1 S3, 3 local) 8 Filters 3 Joins 4 Group By columns 1 Order By 1 Limit 1 Aggregation 1 Function 2 Casts SELECT P.ASIN, P.TITLE, R.POSTAL_CODE, P.RELEASE_DATE, SUM(D.QUANTITY * D.OUR_PRICE) AS SALES_sum FROM s3.d_customer_order_item_details D, asin_attributes A, products P, regions R WHERE D.ASIN = P.ASIN AND P.ASIN = A.ASIN AND D.REGION_ID = R.REGION_ID AND A.EDITION LIKE '%FIRST%' AND P.TITLE LIKE '%Potter%' AND P.AUTHOR = 'J. K. Rowling' AND R.COUNTRY_CODE = ‘US’ AND R.CITY = ‘Seattle’ AND R.STATE = ‘WA’ AND D.ORDER_DAY :: DATE >= P.RELEASE_DATE AND D.ORDER_DAY :: DATE < dateadd(day, 3, P.RELEASE_DATE) GROUP BY P.ASIN, P.TITLE, R.POSTAL_CODE, P.RELEASE_DATE ORDER BY SALES_sum DESC LIMIT 20;
  • 27. Now let’s run that query over an exabyte of data in S3 Roughly 140 TB of customer item order detail records for each day over past 20 years. 190 million files across 15,000 partitions in S3. One partition per day for USA and rest of world. Need a billion-fold reduction in data processed. Running this query using a 1000 node Hive cluster would take over 5 years.* • Compression ……………..….……..5X • Columnar file format……….......…10X • Scanning with 2500 nodes…....2500X • Static partition elimination…............2X • Dynamic partition elimination..….350X • Redshift’s query optimizer……......40X --------------------------------------------------- Total reduction……….…………3.5B X * Estimated using 20 node Hive cluster & 1.4TB, assume linear * Query used a 20 node DC1.8XLarge Amazon Redshift cluster * Not actual sales data - generated for this demo based on data format used by Amazon Retail.
  • 28. Amazon Redshift Spectrum is fast Leverages Amazon Redshift’s advanced cost-based optimizer Pushes down projections, filters, aggregations and join reduction Dynamic partition pruning to minimize data processed Automatic parallelization of query execution against S3 data Efficient join processing within the Amazon Redshift cluster
  • 29. Amazon Redshift Spectrum is cost-effective You pay for your Amazon Redshift cluster plus $5 per TB scanned from S3 Each query can leverage 1000s of Amazon Redshift Spectrum nodes You can reduce the TB scanned and improve query performance by: Partitioning data Using a columnar file format Compressing data
  • 30. Amazon Redshift Spectrum is secure End-to-end data encryption Alerts & notifications Virtual private cloud Audit logging Certifications & compliance Encrypt S3 data using SSE and AWS KMS Encrypt all Amazon Redshift data using KMS, AWS CloudHSM or your on-premises HSMs Enforce SSL with perfect forward encryption using ECDHE Amazon Redshift leader node in your VPC. Compute nodes in private VPC. Spectrum nodes in private VPC, store no state. Communicate event-specific notifications via email, text message, or call with Amazon SNS All API calls are logged using AWS CloudTrail All SQL statements are logged within Amazon Redshift PCI/DSSFedRAMP SOC1/2/3 HIPAA/BAA
  • 31. Amazon Redshift Spectrum uses standard SQL Redshift Spectrum seamlessly integrates with your existing SQL & BI apps Support for complex joins, nested queries & window functions Support for data partitioned in S3 by any key Date, Time and any other custom keys e.g., Year, Month, Day, Hour
  • 32. Defining External Schema and Creating Tables Define an external schema in Amazon Redshift using the Amazon Athena data catalog or your own Apache Hive Metastore CREATE EXTERNAL SCHEMA <schema_name> Query external tables using <schema_name>.<table_name> Register external tables using Athena, your Hive Metastore client, or from Amazon Redshift CREATE EXTERNAL TABLE SCHEMA syntax CREATE EXTERNAL TABLE <table_name> [PARTITIONED BY <column_name, data_type, …>] STORED AS file_format LOCATION s3_location [TABLE PROPERTIES property_name=property_value, …];
  • 33. Amazon Redshift Spectrum – Current support File formats • Parquet • CSV • Sequence • RCFile • ORC (coming soon) • RegExSerDe (coming soon) Compression • Gzip • Snappy • Lzo (coming soon) • Bz2 Encryption • SSE with AES256 • SSE KMS with default key Column types • Numeric: bigint, int, smallint, float, double and decimal • Char/varchar/string • Timestamp • Boolean • DATE type can be used only as a partitioning key Table type • Non-partitioned table (s3://mybucket/orders/..) • Partitioned table (s3://mybucket/orders/date=YYYY-MM- DD/..)
  • 34. Converting to Parquet and ORC using Amazon EMR You can use Hive CREATE TABLE AS SELECT to convert data CREATE TABLE data_converted STORED AS PARQUET AS SELECT col_1, col2, col3 FROM data_source Or use Spark - 20 lines of Pyspark code, running on Amazon EMR • 1TB of text data reduced to 130 GB in Parquet format with snappy compression • Total cost of EMR job to do this: $5 https://github.com/awslabs/aws-big-data-blog/tree/master/aws-blog-spark-parquet-conversion
  • 35. Is Amazon Redshift Spectrum useful if I don’t have an exabyte? Your data will get bigger On average, data warehousing volumes grow 10x every 5 years The average Amazon Redshift customer doubles data each year Amazon Redshift Spectrum makes data analysis simpler Access your data without ETL pipelines Teams using Amazon EMR, Athena & Redshift can collaborate using the same data lake Amazon Redshift Spectrum improves availability and concurrency Run multiple Amazon Redshift clusters against common data Isolate jobs with tight SLAs from ad hoc analysis
  • 36. The Emerging Analytics Architecture AthenaAmazon Athena Interactive Query AWS Glue ETL & Data Catalog Storage Serverless Compute Data Processing Amazon S3 Exabyte-scale Object Storage Amazon Kinesis Firehose Real-Time Data Streaming Amazon EMR Managed Hadoop Applications AWS Lambda Trigger-based Code Execution AWS Glue Data Catalog Hive-compatible Metastore Amazon Redshift Spectrum Fast @ Exabyte scale Amazon Redshift Petabyte-scale Data Warehousing
  • 37. Over 20 customers helped preview Amazon Redshift Spectrum