Redshift Introduction 
Keeyong Han 
keeyonghan@hotmail.com
Table of Contents 
1. What is Redshift? 
2. Redshift In Action 
    1. How to Upload? 
    2. How to Query? 
3. Recommendation 
4. Q&A
WHAT IS REDSHIFT?
Brief Introduction (1) 
• A scalable SQL engine in AWS 
– Available in all regions except N. California and São Paulo as 
of Sep 2014 
– Up to 1.6PB of data in a cluster of servers 
– Fast but still in minutes for big joins 
– Columnar storage 
• Adding or deleting a column is very fast!! 
• Supports per-column compression 
– Supports bulk loading 
• Upload gzipped tsv/csv files to S3 and then run the bulk load 
command (called “copy”)
Brief Introduction (2) 
• Supports PostgreSQL 8.x 
– But not all of the features of PostgreSQL 
– Accessible through the ODBC/JDBC interface 
• You can use any tool or library that supports ODBC/JDBC 
– Table schema still matters! 
• It is still SQL
Brief Introduction (3) 
• Dense Compute vs. Dense Storage 
Node type     vCPU   ECU   Memory (GiB)   Storage       Price 
DW1 – Dense Storage 
dw1.xlarge    2      4.4   15             2TB HDD       $0.85/hour 
dw1.8xlarge   16     35    120            16TB HDD      $6.80/hour 
DW2 – Dense Compute 
dw2.xlarge    2      7     15             0.16TB SSD    $0.25/hour 
dw2.8xlarge   32     104   244            2.56TB SSD    $4.80/hour
Brief Introduction (4) 
• Cost Analysis 
– If you need an 8TB RedShift cluster, you will need 4 
dw1.xlarge instances 
• That will be $2448 per 30 days and about $30K per year 
– At a minimum, you will also need to keep RedShift's input 
records in S3, so there will be an S3 cost as 
well. 
• 1TB with “reduced redundancy” would cost $24.5 per 
month
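The arithmetic behind these figures can be checked with a short sketch. It assumes the Sep 2014 on-demand price and the 2TB per-node capacity quoted for dw1.xlarge on the previous slide; the function and constant names are illustrative only.

```python
import math

# Figures as quoted on the slides (Sep 2014, on-demand pricing).
DW1_XLARGE_TB = 2          # storage per dw1.xlarge node, in TB
DW1_XLARGE_HOURLY = 0.85   # USD per node per hour


def nodes_needed(target_tb, tb_per_node=DW1_XLARGE_TB):
    """Smallest number of nodes whose combined storage covers target_tb."""
    return math.ceil(target_tb / tb_per_node)


def cluster_cost(nodes, days, hourly=DW1_XLARGE_HOURLY):
    """On-demand cost in USD for running `nodes` nodes for `days` days."""
    return nodes * hourly * 24 * days


nodes = nodes_needed(8)
print(nodes)                               # 4
print(round(cluster_cost(nodes, 30), 2))   # 2448.0
print(round(cluster_cost(nodes, 365), 2))  # 29784.0
```

The yearly figure of 29784 is what the slide rounds to "about $30K".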
Brief Introduction (5) 
• Tightly coupled with other AWS services 
– S3, EMR (ElasticMapReduce), Kinesis, DynamoDB, RDS and 
so on 
– Backup and Snapshot to S3 
• No Automatic Resizing 
– You have to manually resize and it takes a while 
• Doubling from 2 nodes to 4 took 8 hours. The other way around 
took 18 hours or so (done in summer of 2013 though) 
– But during resizing, read operations still work 
• 30-minute maintenance window every week 
– You have to avoid this window
Brief Summary 
• RedShift is a large-scale SQL engine that can be 
used as a data warehouse/analytics solution 
– You don’t stall your production database! 
– Smoother migration for anyone who knows SQL 
– It offers a SQL interface, but behind the scenes it is a 
distributed columnar engine, not a traditional row-oriented RDBMS 
• RedShift is not a real-time query engine 
– Semi-real-time data consumption might be doable, but 
querying can take a while
Difference from MySQL (1) 
• No guarantee of primary key uniqueness 
– There can be many duplicates if you are not careful 
• You had better delete before inserting (based on date/time range) 
– The primary key is just a hint for the query optimizer 
• You need to define a distkey and a sortkey per table 
– distkey determines which node stores a record 
– sortkey determines in what order records are stored on a machine 
create table session_attribute ( 
browser_id decimal(20,0) not null distkey sortkey, -- picks the node and the on-disk order 
session_id int, 
name varchar(48), 
value varchar(48), 
primary key(browser_id, session_id, name) -- not enforced; only a hint for the optimizer 
);
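The "delete before inserting" advice can be sketched as follows. This is only an illustration: SQLite stands in for RedShift, the table deliberately has no primary key constraint so that duplicates are possible (mimicking RedShift's non-enforced keys), and the `created_on` column and `reload_day` helper are hypothetical names that do not appear in the slides.

```python
import sqlite3

# SQLite stands in for RedShift. No PRIMARY KEY constraint is declared, so
# nothing stops duplicate rows, just as in RedShift.
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table session_attribute ("
    " browser_id integer not null, session_id integer,"
    " name text, value text, created_on text)"  # created_on is hypothetical
)


def reload_day(conn, day, rows):
    """Delete everything for `day`, then insert the fresh rows, in one transaction."""
    with conn:  # commits on success, rolls back on error
        conn.execute("delete from session_attribute where created_on = ?", (day,))
        conn.executemany(
            "insert into session_attribute values (?, ?, ?, ?, ?)",
            [(b, s, n, v, day) for (b, s, n, v) in rows],
        )


batch = [(1, 10, "lang", "en"), (1, 10, "os", "linux")]
reload_day(conn, "2014-09-01", batch)
reload_day(conn, "2014-09-01", batch)  # re-running the same load is safe

count = conn.execute("select count(*) from session_attribute").fetchone()[0]
print(count)  # 2, not 4
```

Because the delete and the insert cover the same date range, loading a batch twice leaves the table unchanged instead of doubling its rows.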
Difference from MySQL (2) 
• char/varchar lengths are in bytes, not in characters 
• "\r\n" is counted as two characters 
• No text field. The max number of bytes in 
char/varchar is 65535 
• Addition/deletion of a column is very fast 
• Some keywords are reserved (user, tag and so 
on) 
• LIKE is case-sensitive (ILIKE is case-insensitive)
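Because lengths are counted in bytes, it can help to check values before loading them. The sketch below assumes text is stored as UTF-8; `fits_varchar` is a hypothetical helper, not a RedShift function.

```python
def fits_varchar(value, max_bytes):
    """True if `value` fits a varchar(max_bytes) column when encoded as UTF-8."""
    return len(value.encode("utf-8")) <= max_bytes


word = "한글"
print(len(word), len(word.encode("utf-8")))  # 2 characters, 6 bytes
print(len("\r\n".encode("utf-8")))           # 2
print(fits_varchar(word, 4))                 # False
```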
Supported Data Types in RedShift 
• SMALLINT (INT2) 
• INTEGER (INT, INT4) 
• BIGINT (INT8) 
• DECIMAL (NUMERIC) 
• REAL (FLOAT4) 
• DOUBLE PRECISION (FLOAT8) 
• BOOLEAN (BOOL) 
• CHAR (CHARACTER) 
• VARCHAR (CHARACTER VARYING) 
• DATE 
• TIMESTAMP
REDSHIFT IN ACTION
What can be stored? 
• Log Files 
– Web access logs 
– But you need to define a schema. It is better to add session-level 
tables 
• Relational Database Tables 
– MySQL tables 
– Almost one to one mapping 
• Any structured data 
– Any data you can represent as CSV
A bit more about Session Table 
• Hadoop can be used to aggregate pageviews 
into sessions (on top of pageviews): 
– Group by session key 
– Order pageviews in the same session by 
timestamp 
• This aggregated info -> session table 
• Example of session table 
– Session ID, Browser ID, user ID, IP, UserAgent, 
Referrer info, Start time, Duration, …
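The two aggregation steps above (group by session key, order by timestamp) can be sketched in plain Python. This is not the Hadoop job itself; the field names `session_id`, `ts` and `url` are assumptions made for the example.

```python
from collections import defaultdict


def build_sessions(pageviews):
    """Group pageviews by session key and order each group by timestamp.

    `pageviews` is an iterable of dicts with hypothetical keys
    "session_id", "ts" (epoch seconds) and "url".
    """
    grouped = defaultdict(list)
    for pv in pageviews:
        grouped[pv["session_id"]].append(pv)

    sessions = []
    for session_id, views in grouped.items():
        views.sort(key=lambda pv: pv["ts"])
        sessions.append({
            "session_id": session_id,
            "start_time": views[0]["ts"],
            "duration": views[-1]["ts"] - views[0]["ts"],
            "pageviews": len(views),
        })
    return sorted(sessions, key=lambda s: s["session_id"])


pageviews = [
    {"session_id": "a", "ts": 130, "url": "/cart"},
    {"session_id": "b", "ts": 200, "url": "/"},
    {"session_id": "a", "ts": 100, "url": "/"},
]
sessions = build_sessions(pageviews)
for row in sessions:
    print(row)
```

Each resulting dict corresponds to one row of the session table; the remaining columns (browser ID, user agent and so on) would be carried over from the first pageview in the same way.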
How to Upload? 
• Need to define schema of your data 
• Create a table (again it is a SQL engine) 
• Generate tsv or csv file(s) from your source data 
• Compress the file(s) 
• Upload the file(s) to S3 
– This S3 bucket should preferably be in the same region as the RedShift 
cluster (but this is no longer required) 
• Run a bulk insert (called “copy”) 
– copy session_attribute [fields] from 's3://your_bucket/…' 
options 
– Options include AWS keys, whether gzipped or not, delimiter 
used, max errors to tolerate and so on 
• Regular insert/update SQL statements can also be used
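A small helper can assemble the copy statement from the options listed above. This is a sketch only: it builds a string and executes nothing, `build_copy_sql` is a hypothetical name, the bucket path in the usage lines is made up, and the credential values are placeholders.

```python
def build_copy_sql(table, s3_path, columns=None, gzipped=True,
                   delimiter="\t", max_errors=0):
    """Return a RedShift COPY statement as a string (nothing is executed)."""
    column_list = " (" + ", ".join(columns) + ")" if columns else ""
    delim = "\\t" if delimiter == "\t" else delimiter  # write a tab as the \t escape
    parts = [
        f"copy {table}{column_list}",
        f"from '{s3_path}'",
        # Placeholders only: keep real keys out of source control.
        "credentials 'aws_access_key_id=<KEY_ID>;aws_secret_access_key=<SECRET>'",
        f"delimiter '{delim}'",
    ]
    if gzipped:
        parts.append("gzip")
    if max_errors:
        parts.append(f"maxerror {max_errors}")
    return "\n".join(parts) + ";"


sql = build_copy_sql(
    "session_attribute",
    "s3://example-bucket/session_attribute/2014-09-01.tsv.gz",  # hypothetical path
    columns=["browser_id", "session_id", "name", "value"],
    max_errors=10,
)
print(sql)
```

The resulting string can be sent through any JDBC/ODBC or PostgreSQL-compatible connection, like any other SQL statement.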
Update Workflow 
[Diagram: Data Source Server -> S3 -> RedShift. A cronjob periodically uploads input files from the data source server to S3; a bulk insert then loads them from S3 into RedShift.] 
• You can introduce a queue to which the S3 locations of all incoming input files are pushed. A consumer of this queue reads from it and bulk inserts into RedShift 
• You might have to do ETL on your source data using Hadoop and so on
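The queue-based variant of this workflow could look roughly like the sketch below. An in-process `queue.Queue` stands in for whatever queueing service is actually used, the S3 paths are made up, and `loaded.append` stands in for a real bulk insert.

```python
import queue


def consume(pending, bulk_insert):
    """Drain the queue, calling bulk_insert(s3_location) once per file.

    Returns the locations that failed so they can be retried later.
    """
    failed = []
    while True:
        try:
            location = pending.get_nowait()
        except queue.Empty:
            break
        try:
            bulk_insert(location)
        except Exception:
            failed.append(location)
    return failed


loaded = []
pending = queue.Queue()
for name in ["part-0001.tsv.gz", "part-0002.tsv.gz"]:
    pending.put("s3://example-bucket/incoming/" + name)  # hypothetical paths

failed = consume(pending, loaded.append)  # loaded.append stands in for a real copy
print(loaded, failed)
```

Decoupling the upload from the load this way means a slow or failed bulk insert does not block the cronjob that produces the files.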
Incremental Update from MySQL 
• Change your table schema if possible 
– You need an updatedon field in your table 
– Never delete a record; mark it as inactive instead 
• Monitor your table changes and propagate them 
to Redshift 
– For example, use DataBus from LinkedIn
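With an updatedon field in place, each run only needs the rows touched since the previous run. The sketch below shows that idea with SQLite standing in for the MySQL source; the `users` table, its columns other than `updatedon`, and the `changed_since` helper are hypothetical.

```python
import sqlite3

# SQLite stands in for the MySQL source database.
src = sqlite3.connect(":memory:")
src.execute("create table users (id integer, name text, active integer, updatedon text)")
src.executemany(
    "insert into users values (?, ?, ?, ?)",
    [
        (1, "ann", 1, "2014-09-01 10:00:00"),
        (2, "bob", 0, "2014-09-02 09:30:00"),  # "deleted" row kept as inactive
        (3, "cho", 1, "2014-09-03 08:15:00"),
    ],
)


def changed_since(conn, watermark):
    """Rows touched after `watermark`, plus the new watermark to store."""
    rows = conn.execute(
        "select id, name, active, updatedon from users"
        " where updatedon > ? order by updatedon",
        (watermark,),
    ).fetchall()
    new_watermark = rows[-1][3] if rows else watermark
    return rows, new_watermark


rows, watermark = changed_since(src, "2014-09-01 23:59:59")
print([r[0] for r in rows], watermark)
```

Marking rows inactive instead of deleting them is what lets a removal show up here as an ordinary change.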
HOW TO ACCESS REDSHIFT
Different Ways to Access (1) 
1. JDBC/ODBC desktop tools such as 
– SQLWorkBench, Navicat and so on 
– Requires IP registration for outside access 
2. JDBC/ODBC Library 
– Any PostgreSQL 8.0.x compatible library should work 
In both cases, you use SQL statements
Different Ways to Access (2) 
3. Use an analytics tool such as Tableau or Birst 
– But these have too many features 
– Will likely need a dedicated analyst
RECOMMENDATION
Things to Consider 
• How big are your tables? 
• Would dumping your tables cause issues? 
– Site stability and so on 
– Or do you have a backup instance? 
• Are your tables friendly for incremental 
update? 
– “updatedon” field 
– no deletion of records
Steps 
• Start with a daily update 
– A daily full refresh is fine to begin with, to set up the end-to-end 
cycle 
– If the tables are big, then dumping them can take a 
while 
• Implement an incremental update mechanism 
– This will require either a table schema change or the 
use of some database change tracking mechanism 
• Move to a shorter update interval
