3. What is Big Data?
When your data sets become so large and diverse
that you have to start innovating around how to
collect, store, process, analyze and share them
4. AWS Data Services to Accelerate Your Move to the Cloud
Databases to Elevate your Apps
Relational: RDS Open Source, RDS Commercial, Aurora
Non-Relational & In-Memory: DynamoDB & DAX, ElastiCache
Analytics to Engage your Data (Inline, Data Warehousing, Reporting, Data Lake): EMR, Amazon Redshift, Redshift Spectrum, Athena, Elasticsearch Service, QuickSight, Glue
Amazon AI to Drive the Future: Lex, Polly, Rekognition, Machine Learning, Deep Learning, MXNet
Migration for DB Freedom: Database Migration, Schema Conversion
5. AWS Glue
AWS Big Data Portfolio
Collect: Amazon Kinesis Firehose, Amazon Kinesis Streams, Amazon Kinesis Analytics, AWS Direct Connect, Amazon Snowball
Store: Amazon S3, Amazon Glacier, Amazon RDS, Amazon Aurora, Amazon DynamoDB, Amazon Elasticsearch
Analyze: Amazon EMR, Amazon EC2, Amazon Redshift, Amazon Machine Learning, Amazon QuickSight, Amazon Athena
AWS Database Migration Service, AWS Glue
7. Generate
Collect & Store
Analyze
Collaborate & Act
Individual AWS customers are generating over 1 PB/day
Amazon S3 lets you collect and store all this data
Store exabytes of data in S3
9. The Dark Data Problem
[Chart: Data Volume vs. Year (1990–2020), comparing Generated Data with data Available for Analysis]
Most generated data is unavailable for analysis
Sources:
Gartner: User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011
IDC: Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares
11. Amazon Redshift: Relational data warehouse
Massively parallel; petabyte scale
Fully managed
HDD and SSD platforms
$1,000/TB/year; starts at $0.25/hour
A lot faster, a lot simpler, a lot cheaper
13. Use Case: Traditional Data Warehousing
Business Reporting
Advanced pipelines and queries
Secure and Compliant
Easy Migration – Point & Click using AWS Database Migration Service
Secure & Compliant – End-to-End Encryption. SOC 1/2/3, PCI-DSS, HIPAA and FedRAMP compliant
Large Ecosystem – Variety of cloud and on-premises BI and ETL tools
Japanese Mobile Phone Provider
Powering 100 marketplaces in 50 countries
World’s Largest Children’s Book Publisher
Bulk Loads and Updates
14. Use Case: Large Scale Time-Series Analysis
Log, Machine & IoT Data
Clickstream Events Data
Time-Series Data
Cheap – Analyze large volumes of data cost-effectively
Fast – Massively Parallel Processing (MPP) and columnar architecture for fast queries and parallel loads
Near real-time – Micro-batch loading and Amazon Kinesis Firehose for near-real-time analytics
Interactive data analysis and recommendation engine
Ride analytics for pricing and product development
Ad prediction and on-demand analytics
15. Use Case: Business Applications
Multi-Tenant BI Applications
Back-end services
Analytics as a Service
Fully Managed – Provisioning, backups, upgrades, security, compression all come built-in so you can focus on your business applications
Ease of Chargeback – Pay as you go, add clusters as needed. A few big common clusters, several data marts
Service Oriented Architecture – Integrated with other AWS services. Easy to plug into your pipeline
Infosys Information Platform (IIP)
Analytics-as-a-Service
Product and Consumer Analytics
16. Amazon Redshift: Cloud Data Warehousing
Leader Node
Simple SQL endpoint
Stores metadata
Optimizes query plan
Coordinates query execution
Compute Nodes
Local columnar storage
Parallel/distributed execution of all queries, loads, backups, restores, resizes
Up to 2 petabytes of managed data
Automated ingestion from S3, Kinesis, EMR and DynamoDB
[Diagram: leader node in front of compute nodes, integrating with S3, EMR, DynamoDB and EC2]
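As a small, hypothetical illustration of that S3 ingestion path (the table name, bucket, and IAM role below are placeholders, not from the deck), a parallel load might look like:

-- Illustrative sketch: parallel load from S3 into a Redshift table.
-- Each compute node slice reads its share of the input files in parallel.
COPY clickstream_events
FROM 's3://mybucket/events/2017/06/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftCopyRole'
FORMAT AS CSV
GZIP;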
17. Amazon Redshift Cluster: The life of a query
[Diagram: BI tools, SQL clients, and analytics tools act as clients, connecting to the leader node, which routes queries through WLM queues (Queue 1, Queue 2) to the compute nodes]
19. Benefit #1: Amazon Redshift is fast
Parallel and distributed
Query
Load
Export
Backup
Restore
Resize
20. Benefit #1: Amazon Redshift is fast
Hardware optimized for I/O intensive workloads, 4 GB/sec/node
Enhanced networking, over 1 million packets/sec/node
Choice of storage type, instance size
Regular cadence of auto-patched improvements
21. Query monitoring rules
Monitor and control cluster resources consumed by a query
Get notified, abort and reprioritize long-running / bad queries
Pre-defined templates for common use cases
22. Query monitoring rules
Common use cases:
• Protect interactive queues
  INTERACTIVE = { "query_execution_time > 15 sec" or
                  "query_cpu_time > 1500 uSec" or
                  "query_blocks_read > 18000 blocks" } [HOP]
• Monitor ad-hoc queues for heavy queries
  AD-HOC = { "query_execution_time > 120" or
             "query_cpu_time > 3000" or
             "query_blocks_read > 180000" or
             "memory_to_disk > 400000000000" } [LOG]
• Limit the number of rows returned to a client
  MAXLINES = { "RETURN_ROWS > 50000" } [ABORT]
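As a hedged sketch of how rules like these surface once applied (not from the deck): Redshift logs each rule action, and you can review the offending queries with SQL. The system table and column names below are as I recall them (STL_WLM_RULE_ACTION joined to STL_QUERY); verify them against your cluster before relying on this.

-- Sketch: review queries that triggered a monitoring rule in the last 24 hours.
SELECT r.recordtime,
       r.rule,                -- e.g. INTERACTIVE, AD-HOC, MAXLINES
       r.action,              -- log / hop / abort
       r.query,
       TRIM(q.querytxt) AS query_text
FROM stl_wlm_rule_action r
JOIN stl_query q ON q.query = r.query
WHERE r.recordtime > DATEADD(hour, -24, GETDATE())
ORDER BY r.recordtime DESC;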
23. Benefit #2: Amazon Redshift is inexpensive
DS2 (HDD) – DS2.XL single node
                      Price per hour    Effective annual price per TB compressed
  On-demand           $0.850            $3,725
  1-year reservation  $0.500            $2,190
  3-year reservation  $0.228            $999
DC1 (SSD) – DC1.L single node
                      Price per hour    Effective annual price per TB compressed
  On-demand           $0.250            $13,690
  1-year reservation  $0.161            $8,795
  3-year reservation  $0.100            $5,500
Pricing is simple
Number of nodes x price/hour
No charge for leader node
No upfront costs
Pay as you go
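As a quick sanity check on the "effective annual price per TB" column (assuming the 2 TB of compressed storage a DS2.XL node holds): $0.850/hour × 8,760 hours ≈ $7,446 per node per year, and $7,446 ÷ 2 TB ≈ $3,725/TB/year, the on-demand DS2 figure above. The same arithmetic with the $0.228/hour 3-year reservation rate gives roughly $999/TB/year.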
24. Benefit #3: Amazon Redshift is fully managed
Continuous/incremental backups
Multiple copies within cluster
Continuous and incremental backups to Amazon S3
Continuous and incremental backups across regions
Streaming restore
25. Benefit #3: Amazon Redshift is fully managed
Fault tolerance
Disk failures
Node failures
Network failures
Availability Zone/region level disasters
35. Amazon Redshift Spectrum
Run SQL queries directly against data in S3 using thousands of nodes
Fast @ exabyte scale | Elastic & highly available | On-demand, pay-per-query
High concurrency: Multiple clusters access same data
No ETL: Query data in-place using open file formats
Full Amazon Redshift SQL support
36. Redshift Spectrum: Life of a query
Query submitted to Amazon Redshift over JDBC/ODBC:
SELECT COUNT(*)
FROM S3.EXT_TABLE
GROUP BY…
[Diagram: the Amazon Redshift cluster dispatches the scan to Redshift Spectrum nodes (1 … N), which read data directly from Amazon S3 (exabyte-scale object storage) using table definitions from the Data Catalog (Apache Hive Metastore). Fast @ exabyte scale.]
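As an illustration of how a cluster gets wired to that Data Catalog (not from the deck; the schema name, database name, and IAM role are placeholders), registering an external schema might look like:

-- Sketch: point Redshift at an external catalog so Spectrum can find table definitions.
CREATE EXTERNAL SCHEMA spectrum_demo
FROM DATA CATALOG
DATABASE 'spectrum_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;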
37. Amazon Redshift Spectrum – Current support
File formats
• Parquet
• CSV
• Sequence
• RCFile
• ORC (coming soon)
• RegExSerDe (coming soon)
Compression
• Gzip
• Snappy
• Lzo (coming soon)
• Bz2
Encryption
• SSE with AES256
• SSE KMS with default key
Column types
• Numeric: bigint, int, smallint, float, double and decimal
• Char/varchar/string
• Timestamp
• Boolean
• DATE type can be used only as a partitioning key
Table type
• Non-partitioned table (s3://mybucket/orders/..)
• Partitioned table (s3://mybucket/orders/date=YYYY-MM-DD/..)
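To tie the file-format and partitioning support above together, a minimal illustrative external table (all names and the S3 layout are placeholders, not from the deck) might be defined and queried like this:

-- Sketch: partitioned Parquet table on S3, queried in place by Redshift Spectrum.
CREATE EXTERNAL TABLE spectrum_demo.orders (
  order_id    bigint,
  customer_id bigint,
  amount      decimal(12,2)
)
PARTITIONED BY (saledate date)
STORED AS PARQUET
LOCATION 's3://mybucket/orders/';

-- Each partition maps to an S3 prefix such as s3://mybucket/orders/saledate=2017-06-01/
ALTER TABLE spectrum_demo.orders
ADD PARTITION (saledate = '2017-06-01')
LOCATION 's3://mybucket/orders/saledate=2017-06-01/';

SELECT saledate, COUNT(*)
FROM spectrum_demo.orders
GROUP BY saledate;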
39. NTT Docomo: Japan’s largest mobile service provider
68 million customers
Tens of TBs of data per day across a mobile network
6 PB of total data (uncompressed)
Data science for marketing operations, logistics, and so on
Greenplum on-premises
Scaling challenges
Performance issues
Need same level of security
Need for a hybrid environment
40. 125-node DS2.8XL cluster
4,500 vCPUs, 30 TB RAM
2 PB compressed
10x faster analytic queries
50% reduction in time for new BI application deployment
Significantly less operations overhead
[Architecture: on-premises data sources feed an ETL/forwarder/loader pipeline over AWS Direct Connect into S3 and an Amazon Redshift cluster plus sandbox, with client access and state management]
NTT Docomo: Japan’s largest mobile service provider
41. Nasdaq: powering 100 marketplaces in 50 countries
Orders, quotes, trade executions, market “tick” data from 7 exchanges
7 billion rows/day
Analyze market share, client activity, surveillance, billing, and so on
Microsoft SQL Server on-premises
Expensive legacy DW ($1.16M/yr.)
Limited capacity (1 yr. of data online)
Needed lower TCO
Must satisfy multiple security and regulatory requirements
Similar performance
Similar performance
42. 23-node DS2.8XL cluster
828 vCPUs, 5 TB RAM
368 TB compressed
2.7 T rows, 900 B derived
8 tables with 100 B rows
7 man-month migration
¼ the cost, 2x storage, room to grow
Faster performance, very secure
Nasdaq: powering 100 marketplaces in 50 countries
43. Amazon.com clickstream analytics
Web log analysis for Amazon.com
• PB-scale workload, 2 TB/day growing at 67% YoY
• Largest table: 400 TB
Understand customer behavior
Previous solution
• Legacy DW (Oracle)—could query across only 1 week of data per hour
• Hadoop—could query across only 1 month of data per hour
44. Results with Amazon Redshift
• Query 15 months in 14 min
• Load 5B rows in 10 min
• 21B rows w/ 10B rows: 3 days to 2 hrs (Hive → Redshift)
• Load pipeline: 90 hrs to 8 hrs (Oracle → Redshift)
• 100 node DS2.8XL clusters
• Easy resizing
• Managed backups and restore
• Failure tolerance and recovery
• 20% time of one DBA
• Increased productivity
AWS’s innovations are about providing the tools, technologies, and techniques for working productively with data, at any scale.
AWS offers a broad portfolio of data services across databases, analytics, and AI to help customers accelerate their move to the cloud.
Over 90% of new features and services are driven by customer feedback, and most customer requests are driven by a need to make application experiences better. Better includes faster to deploy and run, more responsive and reliable, and easier and less expensive to operate.
For databases, we have a mix of relational open-source and commercial engine choices, plus Amazon-created Aurora. For the greatest scalability and performance we have the non-relational, Amazon-created DynamoDB and open-source choices.
Our customers use a range of analytics services to engage their data, including the integration services needed to enable ETL, serverless query, and visualization.
There are a number of open-source-based offerings I will talk about. And many customers are migrating from DW appliances to on-demand cloud services with decoupled data, including using Redshift and Redshift Spectrum.
AI is a growing part of the AWS portfolio: deep learning platforms, machine learning, and finished services that let developers easily add voice and image capabilities to applications.
Continuing this portfolio view, our customers use migration services for a broad range of use cases, across modernizing existing applications and moving between database options, for both relational and non-relational databases and analytics services.
The main goal of this slide is to show platform completeness
Key talking points:
1/ Any big data application has a data acquisition phase, a storage need, and an analytics need
2/ Quick Service overviews. Go fast; especially on ones we talk about later.
Collect
Direct Connect – private, low latency connections between your data centers and ours. Most customers use a pair for redundancy and availability
Import/Export – for moving large volumes of data, FedEx is your highest-bandwidth option. With Snowball, we’ll ship you a ruggedized case; load it up and send it back
Kinesis – real time streaming data; has streams for custom apps, firehose for easy Redshift/s3 integration, Analytics for real time SQL
AWS IoT platform – complete suite for IoT devices to make it easy to manage them and get telemetry data into AWS
Store
S3 is the foundation of any big data app on AWS. Scalable, low cost, default landing zone for data ($0.023/GB-Mo and drops from there with scale. That’s $23/TB for a month, less than $300 for a year)
Glacier is the sister service for cold storage like data you need for compliance. Age data into it using lifecycle. $0.004/GB-Mo, or $48 per TB per year!
DynamoDB – NoSQL store; zero admin; JSON + Key Value with single digit millisecond latency. Great for high concurrency reads and writes
Elasticsearch – managed elasticsearch clusters for operational intelligence and search
Analyze
EMR for fully managed dynamic clusters for running Hadoop/Spark/Presto/HBase
Athena for interactive queries on S3 Data using Standard SQL with no infrastructure to manage
Redshift – fully managed, petabyte scale DW for $1,000/TB/year
ML – fully managed machine learning
EC2 – run anything you want that runs on Linux or Windows
QuickSight for fast, cost-effective BI
Lambda – serverless compute for event driven computing
And rounding all this out, we have DMS for migrating databases and replicating OLTP to Redshift and Data Pipeline for scheduling and orchestration
It has never been easier to generate data – I know people who generate 1 PB a day.
Typically, and especially in on-premises environments, the ability to make sense of that data is pretty limited.
The impedance mismatch between generation and the ability to process and analyze the data creates a gap. Most data isn’t ever actually looked at. Hence, the Data Gap.
Pace of innovation
Continuous deployment every 2 weeks
A different model than Oracle/Teradata etc., which still require a 6-month deployment plan compared to us
Redshift value proposition: fast, simple, cost-effective data warehousing. Massively parallel and columnar technology. Easily scalable to petabytes or more. Makes administration simple, no DBA needed. You can choose from hard disk drive or solid state drive platform depending on your needs. All this for $1,000/TB/year, which is one-tenth of the cost of traditional data warehousing solutions.
USE CASES – to use across the next several slides (6-10)
Amazon – Understand customer behavior; migrated from Oracle; PB-scale workload, 2 TB/day growing at 67% YoY. Could query across 1 week in one hour with Oracle; now can query 15 months in 14 min with Redshift
Boingo – 2,000+ commercial Wi-Fi locations, 1 million+ hotspots, 90M+ ad engagements in 100+ countries. Used Oracle; rapid data growth slowed analytics, with admin overhead and high expense (license, hardware, support). After migration: 180x performance improvement and 7x cost savings
FINRA (Financial Industry Regulatory Authority) – One of the largest independent securities regulators in the United States, established to monitor and regulate financial trading practices. Reacts to changing market dynamics while providing its analysts with the tools (Redshift) to interactively query multi-petabyte data sets. Captures, analyzes, and stores ~75 billion records daily. The company estimates it will save up to $20 million annually by using AWS instead of a physical data center infrastructure.
Desk.com – Ingests 200K case related events/hour and runs a user facing portal on Redshift
Fully Managed – We provision, backup, patch and monitor so you can focus on your data
Fast – Massively Parallel Processing and columnar architecture for fast queries and parallel loads
Nasdaq security – Ingests 5B rows/trading day, analyzes orders, trades and quotes to protect traders, report activity and develop their marketplaces
NTT Docomo - Redshift is NTT Docomo's primary analytics platform for data science and marketing and logistic analytics. Data is pre-processed on premises and loaded into a massive, multi-petabyte data warehouse on Amazon Redshift, which data scientists use as their primary analytics platform.
Pinterest uses Redshift for interactive data analysis. Redshift stores all web event data and is used for KPIs, recommendations, and A/B experimentation.
Lyft uses Redshift for ride analytics across the world (rides / location data ) - Through analysis, company engineers estimated that up to 90% of rides during peak times had similar routes. This led to the introduction of Lyft Line – a service that allows customers to save up to 60% by carpooling with others who are going in the same direction.
Yelp has multiple deployments of RedShift with different data sets in use by product management, sales analytics, ads, SeatMe (Point of sale analytics) and many other teams.
Analyzes tens of millions of ads/day, 250M mobile events/day, ad campaign performance, and new feature usage
Accenture Insights Platform (AIP) is a scalable, on-demand, globally available analytics solution running on Amazon Redshift. AIP is Accenture's foundation for its big data offering to deliver analytics applications for healthcare and financial services.
Like Hadoop, DW is a natural fit for cloud computing. The ability to scale out across many compute nodes to query and analyze data is our sweet spot. Redshift is a Massively Parallel Processing, or MPP, DW that uses columnar storage which makes it optimal for analytics processing.
There are many DW solutions on the market, many of them HW based, but Redshift is different
First, Redshift is fully elastic – you can dynamically provision a warehouse from <1TB to 2PB in size, and can grow and shrink elastically with your data volumes. If you need more capacity, just go into the console and change the size of the cluster. We will build a new cluster, copy your data over and switch over to the new cluster for you.
Second, Redshift is completely utility based in its pricing – there are no licenses to buy, you pay for what you use. Similar to EMR, you can have persistent warehouses that are up 24X7 or have transient warehouses that you use for a short period of time and then bring down.
Third it is fully managed, we do the backups, resizing, patching, etc. for you. You just provision it, load your data and start analyzing your data.
All that for $1,000/TB/year, starting at $0.25/hour. Provision in minutes with just a few clicks, with 10x the performance at 1/10 the price of other solutions
Column storage fetches only the specific columns that are required for queries
Customers typically get 3-5x compression, which means less I/O and more effective storage
Zone maps: Each column is divided into 1MB blocks, we store the min/max values of each block in memory. We are able to quickly identify the blocks that are required to be fetched for a query
Direct-attached storage/large block sizes also enable fast I/O
All the operations that you care about from a database management and performance perspective are all parallelized
Scale horizontally with the number of nodes
We work with EC2 and infrastructure teams closely to optimize the h/w for data warehousing needs.
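A small hypothetical example of how this shows up in table design (the table, columns, and encodings below are illustrative, not from the deck): sort keys let the per-block min/max zone maps skip 1 MB blocks, and per-column encodings provide the compression described above.

-- Illustrative only: sort/dist keys and column encodings mapped to the
-- columnar-storage behavior described above.
CREATE TABLE clickstream_events (
  event_time  timestamp     ENCODE zstd,
  user_id     bigint        ENCODE zstd,
  page_url    varchar(2048) ENCODE lzo,
  response_ms integer       ENCODE delta
)
DISTKEY (user_id)        -- spread rows evenly across compute nodes
SORTKEY (event_time);    -- zone maps prune 1 MB blocks outside the filtered range

-- A range-restricted query only reads blocks whose min/max overlap the filter
SELECT user_id, COUNT(*)
FROM clickstream_events
WHERE event_time >= '2017-06-01' AND event_time < '2017-06-02'
GROUP BY user_id;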
Rules
Auto statistics collection
We keep track of the statistics
Usage patterns – avoid analyze if not required
When you have an interactive queue, you may want to protect it from runaway queries and hop out any heavier-than-allowed queries. Sometimes it’s not enough to limit execution time to 15 seconds, as some queries can consume large amounts of CPU or scan many blocks very fast. With this rule you can simply hop out the queries before they consume excessive cluster resources and slow down the other queries in the “fast” lane.
Since Redshift is accessible to your team(s), it is always good to monitor clusters before users’ queries impact the performance of the system. Using STL_QUERY_METRICS and logs generated by this rule, you can reach out to users to discuss (and help rewrite) their queries
Abort queries that exceed the limit of rows returned to a client (for excel or interactive use cases)
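Building on the STL_QUERY_METRICS suggestion above, a hedged sketch of such a check might look like the following. I'm using the SVL_QUERY_METRICS_SUMMARY rollup of STL_QUERY_METRICS; the view and column names are as I recall them, so verify them against your cluster's documentation.

-- Sketch: find candidate "heavy" ad-hoc queries to discuss with their owners.
SELECT query,
       query_execution_time,
       query_cpu_time,
       query_blocks_read,
       return_row_count
FROM svl_query_metrics_summary
WHERE query_execution_time > 120
   OR query_blocks_read > 180000
ORDER BY query_cpu_time DESC
LIMIT 20;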
Prices start at $1,000/TB/yr. for a 3 year DS2.8XL RI.
RIs can provide more than a 70% discount.
Backups are managed, continuous and incremental
Multiple copies are made within and outside the cluster on S3
Streaming restore => While the restore happens in the background, the cluster is made available for reads/writes within minutes of initiating the restore operation (whether the backup is for a 1 TB or a 100 TB cluster)
Redshift monitors the health of a variety of components and failure conditions within an AZ and recovers from them automatically
For AZ/region level disasters, streaming restore will help get back to business within minutes
We can also fix software issues without replacing the node when replacement isn’t needed
Python 2.7 – do you need support for other languages? Let us know
We respect the investments you made in your analytics platforms and work with a variety of ETL, BI and SI vendors.
Standard JDBC/ODBC drivers make the integration process seamless
We work with these vendors closely to certify the joint platforms