2. Agenda
08:30 AM Breakfast
09:00 AM Introduction and Strengths of Technologies
10:00 AM break + set up query tool
10:20 AM Hadoop hands-on
10:55 AM break
11:10 AM Redshift hands-on
11:40 AM Operationalizing your code
12:00 PM adjourn
12/6/2014 2
3. Session Goals
• Understand:
• Why an Analytic Database?
• What is Amazon Redshift
• Do:
• ‘Fire Up’ a Redshift Database
• Load Data
• Do a few queries
• Shut it down
• Have fun!
4. Why an Analytic Database?
Why use one?
• It’s a database optimized for read-only queries.
• It’s fast
• It can handle a lot of data
Why not to use one?
• Poor Transaction processing (aka OLTP)
• Rollback, multi-phase commits, etc.
5. Under the hood.
Analytic databases typically have features like:
• Compression
• Column (as opposed to row) storage
• Parallel queries across clusters of machines
• Support for partitioning
• Other cool stuff to make your queries fast
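To make two of the features above concrete (column storage and compression), here is a toy sketch in Python. It is not how Redshift stores data internally; the sample rows, the column names and the run_length_encode helper are all made up for illustration. It only shows why reading one column is cheap and why repeated values compress well.

```python
# Toy illustration only: real analytic databases use far more elaborate formats.
from itertools import groupby

# Row storage: each record is kept together.
rows = [
    ("10.0.0.1", "US", 0.50),
    ("10.0.0.2", "US", 0.25),
    ("10.0.0.3", "US", 0.75),
    ("10.0.0.4", "DE", 1.00),
]

# Column storage: each column is kept together, so a query that needs
# only one column can skip the others entirely.
columns = {
    "source_ip": [r[0] for r in rows],
    "country": [r[1] for r in rows],
    "ad_revenue": [r[2] for r in rows],
}


def run_length_encode(values):
    """Collapse runs of equal neighbouring values into (value, count) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


total_revenue = sum(columns["ad_revenue"])  # touches one column only
encoded_country = run_length_encode(columns["country"])

print(total_revenue)
print(encoded_country)
```

Run-length encoding is just one simple compression scheme, picked here because it fits in a few lines.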
30. Load Data
copy uservisits FROM 's3://big-data-benchmark/pavlo/text/tiny/uservisits/' CREDENTIALS
'aws_access_key_id=<your key>;aws_secret_access_key=<your key>' delimiter ',';
Load Data from S3
copy rankings FROM 's3://big-data-benchmark/pavlo/text/tiny/rankings/' CREDENTIALS
'aws_access_key_id=<your key>;aws_secret_access_key=<your key>' delimiter ',';
31. Load Bigger Data
Load Data from S3
's3://big-data-benchmark/pavlo/text/tiny/uservisits/'
-- options: "tiny", "1node", "5nodes", "10nodes"
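Switching dataset sizes only means changing one path segment in the COPY statement. The sketch below builds the statement as a string in Python. build_copy_statement and DATASET_SIZES are hypothetical names, and the sketch assumes the larger datasets follow the same path layout as the "tiny" example. The credential placeholders are left exactly as on the slide.

```python
# Sizes listed on the slide; the path layout is assumed to match the "tiny" example.
DATASET_SIZES = ("tiny", "1node", "5nodes", "10nodes")


def build_copy_statement(table, size="tiny"):
    """Return a COPY statement for one benchmark table (hypothetical helper)."""
    if size not in DATASET_SIZES:
        raise ValueError(f"unknown dataset size: {size!r}")
    s3_path = f"s3://big-data-benchmark/pavlo/text/{size}/{table}/"
    # Placeholders are kept on purpose; never hard-code real keys in source.
    credentials = "aws_access_key_id=<your key>;aws_secret_access_key=<your key>"
    return (
        f"copy {table} FROM '{s3_path}' "
        f"CREDENTIALS '{credentials}' delimiter ',';"
    )


print(build_copy_statement("uservisits", "1node"))
```

Paste the printed statement into the query tool after filling in your own keys.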
32. Simple Queries
Query
select * from uservisits limit 100;
SELECT COUNT(*) from uservisits;
select * from rankings limit 100;
SELECT COUNT(*) from rankings;
33. Complex Queries
Query
SELECT pageURL, pageRank FROM rankings WHERE pageRank > 10;
SELECT sourceIP, SPLIT_PART(sourceIP, '.', 1) as fn, SPLIT_PART(sourceIP, '.', 2) as sn FROM
uservisits LIMIT 100;
SELECT sourceIP,
       SUM(adRevenue) AS totalRevenue,
       AVG(pageRank) AS pageRank
FROM rankings R
JOIN (SELECT sourceIP,
             destinationURL,
             adRevenue
      FROM uservisits uv) NUV ON (R.pageURL = NUV.destinationURL)
GROUP BY sourceIP
ORDER BY totalRevenue DESC LIMIT 100;
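To see what the join and aggregation query returns without a running cluster, you can try it locally on a handful of rows. The sketch below uses Python's built-in sqlite3 module as a stand-in for Redshift. The table definitions are reduced to the columns the query touches, and the sample rows are invented. SQLite is not Redshift, so result types and performance will differ; only the shape of the result is the point.

```python
import sqlite3

# In-memory SQLite database as a local stand-in; column lists are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE rankings (pageURL TEXT, pageRank INTEGER);
    CREATE TABLE uservisits (sourceIP TEXT, destinationURL TEXT, adRevenue REAL);
""")

# Invented sample rows, just enough to exercise the join.
conn.executemany(
    "INSERT INTO rankings VALUES (?, ?)",
    [("a.com", 10), ("b.com", 20)],
)
conn.executemany(
    "INSERT INTO uservisits VALUES (?, ?, ?)",
    [
        ("1.1.1.1", "a.com", 1.0),
        ("1.1.1.1", "b.com", 2.0),
        ("2.2.2.2", "a.com", 0.5),
    ],
)

# Same query as on the slide: revenue and average page rank per source IP.
query = """
    SELECT sourceIP,
           SUM(adRevenue) AS totalRevenue,
           AVG(pageRank) AS pageRank
    FROM rankings R
    JOIN (SELECT sourceIP,
                 destinationURL,
                 adRevenue
          FROM uservisits uv) NUV ON (R.pageURL = NUV.destinationURL)
    GROUP BY sourceIP
    ORDER BY totalRevenue DESC LIMIT 100;
"""
results = conn.execute(query).fetchall()

for source_ip, total_revenue, avg_rank in results:
    print(source_ip, total_revenue, avg_rank)
```

Each output row is one visitor IP with its summed ad revenue and the average rank of the pages it visited, highest revenue first.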