Building big data applications often requires integrating a broad set of technologies to store, process, and analyze the increasing variety, velocity, and volume of data being collected by many organizations.
Using a combination of Amazon EMR, a managed Hadoop framework, and Amazon Redshift, a managed petabyte-scale data warehouse, organizations can effectively address many of these requirements.
In this webinar, we will show how organizations are using Amazon EMR and Amazon Redshift to build more agile and scalable architectures for big data. We will look at how you can use Spark and Presto running on EMR to address multiple data processing requirements. We will also share best practices and common use cases for integrating EMR and Redshift.
Learning Objectives:
• Learn best practices for building a big data architecture that includes Amazon EMR and Amazon Redshift
• Understand how to use technologies such as Amazon EMR, Presto, and Spark to complement your data warehousing environment
• Learn key use cases for Amazon EMR and Amazon Redshift
Who Should Attend:
• Data architects, Data management professionals, Data warehousing professionals, BI professionals
2. Agenda
• AWS Big Data Platform Overview
• Amazon EMR & Amazon Redshift
• Building a Big Data Application
• Customer Use Cases
3. AWS Big Data Platform
Collect: Import/Export, Kinesis, Direct Connect
Store: S3, Glacier, DynamoDB
Analyze: EMR, EC2, Machine Learning, Redshift, Amazon QuickSight
4. Amazon EMR – Managed Hadoop Clusters in the Cloud
Scalable Hadoop clusters as a service
Hadoop, Hive, Spark, Presto, HBase, etc.
Easy to use; fully managed
On demand, reserved, spot pricing
HDFS, Amazon EBS, and S3 filesystems
End to end security
Amazon EMR
5. Easy to deploy
AWS Management Console
Command line
or use the EMR API with your favorite SDK
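As a rough illustration of the SDK route, the sketch below builds a request for the EMR RunJobFlow API and submits it with boto3. The cluster name, log bucket, release label, instance type, and node count are all placeholder assumptions, and the default role names assume the standard EMR default roles already exist in the account.

```python
def build_emr_request(name, log_uri, instance_type="m4.large", instance_count=3):
    """Return keyword arguments for the EMR RunJobFlow call (illustrative values)."""
    return {
        "Name": name,
        "LogUri": log_uri,
        "ReleaseLabel": "emr-5.36.0",  # assumption: use a release available to you
        "Applications": [{"Name": "Hadoop"}, {"Name": "Hive"}, {"Name": "Spark"}],
        "Instances": {
            "MasterInstanceType": instance_type,
            "SlaveInstanceType": instance_type,
            "InstanceCount": instance_count,
            # Keep the cluster running after steps finish so it can be reused.
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",  # assumption: default roles exist
        "ServiceRole": "EMR_DefaultRole",
    }


def launch_cluster(request, region="us-east-1"):
    """Submit the request and return the new cluster (job flow) ID."""
    import boto3  # imported lazily so the request can be built without the SDK

    emr = boto3.client("emr", region_name=region)
    return emr.run_job_flow(**request)["JobFlowId"]
```

Calling `launch_cluster(build_emr_request("demo", "s3://example-bucket/logs/"))` would start a real, billable cluster, so it is kept separate from the function that only builds the request.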
6. Choose your instance types
CPU: C3/C4 family (machine learning)
Memory: R3 family (Spark and interactive)
Disk/IO: D2/I2 family (large HDFS)
General: M3/M4 family (batch process)
Customize your storage type and size using Amazon EBS
Try different configurations to find your optimal architecture
8. Integrated with the AWS Platform
Amazon EMR integrates with:
Amazon DynamoDB: EMR-DynamoDB connector
Amazon RDS: JDBC Data Source w/ Spark SQL
Amazon Kinesis: Streaming data connectors
ElasticSearch: ElasticSearch connector
Amazon Redshift: Amazon Redshift COPY from HDFS
Amazon S3: EMR File System (EMRFS)
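To make the Redshift integration concrete, here is a minimal sketch that assembles a Redshift COPY statement for data sitting in S3 or in an EMR cluster's HDFS. The table name, paths, cluster ID, and IAM role ARN are hypothetical, and the helper `copy_statement` is not part of any AWS library; it only formats SQL text that you would run through your own Redshift connection.

```python
def copy_statement(table, source, iam_role_arn, options=""):
    """Build a Redshift COPY statement for an s3:// or emr:// source."""
    if not source.startswith(("s3://", "emr://")):
        raise ValueError("source must start with s3:// or emr://")
    lines = [
        f"COPY {table}",
        f"FROM '{source}'",
        f"IAM_ROLE '{iam_role_arn}'",
    ]
    if options:
        lines.append(options)  # e.g. a format or delimiter clause
    return "\n".join(lines) + ";"


# Hypothetical role and locations, for illustration only.
ROLE = "arn:aws:iam::123456789012:role/ExampleRedshiftRole"

from_s3 = copy_statement(
    "staging.events", "s3://example-bucket/staging/events/", ROLE, "FORMAT AS PARQUET"
)
from_hdfs = copy_statement(
    "staging.events", "emr://j-EXAMPLECLUSTER/output/part-*", ROLE, "DELIMITER '\\t'"
)
```

Loading from `emr://` also requires that the Redshift cluster can reach the EMR nodes, which is a setup step outside this sketch.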
9. Amazon S3 as your persistent data store
Amazon S3
Designed for 99.999999999% durability
Separate compute and storage
Resize and shut down Amazon EMR clusters with no data loss
Point multiple Amazon EMR clusters at the same data in Amazon S3 using the EMR File System (EMRFS)
10. EMRFS makes it easier to leverage Amazon S3
Better performance and error handling options
Transparent to applications – just read/write to “s3://”
Support for Amazon S3 server-side and client-side encryption
Faster listing using EMRFS metadata
HDFS is still available via local instance storage or Amazon EBS
11. Amazon Redshift
Relational data warehouse
Massively parallel; Petabyte scale
Fully managed
HDD and SSD Platforms
$1,000/TB/Year; start at $0.25/hour
End-to-end security; built-in global DR
Amazon Redshift
12. Amazon Redshift dramatically reduces I/O
ID Age State Amount
123 20 CA 500
345 25 WA 250
678 40 FL 125
957 37 WA 375
• With row storage you do unnecessary I/O
• To get the total amount, you have to read everything
13. Amazon Redshift dramatically reduces I/O
• With column storage, you only read the data you need
ID Age State Amount
123 20 CA 500
345 25 WA 250
678 40 FL 125
957 37 WA 375
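The difference between the two layouts can be sketched with the table above. This toy model counts how many values are touched to compute the total Amount; it is only an illustration of the idea, not how Redshift stores data internally.

```python
COLUMNS = ("ID", "Age", "State", "Amount")
ROWS = [
    (123, 20, "CA", 500),
    (345, 25, "WA", 250),
    (678, 40, "FL", 125),
    (957, 37, "WA", 375),
]


def total_row_store(rows, column_index):
    """Sum one column when data is laid out row by row."""
    total, values_read = 0, 0
    for row in rows:
        values_read += len(row)  # the whole row is read to reach one field
        total += row[column_index]
    return total, values_read


def to_column_store(rows, names):
    """Pivot the rows into one list per column."""
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}


def total_column_store(store, name):
    """Sum one column when each column is stored separately."""
    values = store[name]  # only this column is read
    return sum(values), len(values)


row_total, row_reads = total_row_store(ROWS, COLUMNS.index("Amount"))
col_total, col_reads = total_column_store(to_column_store(ROWS, COLUMNS), "Amount")
print(row_total, row_reads)  # 1250 16
print(col_total, col_reads)  # 1250 4
```

Both layouts give the same total, but the row layout touches all 16 values while the column layout touches only the 4 Amount values.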
15. Amazon Redshift dramatically reduces I/O
Column storage
Data compression
Zone maps
• Track the minimum and maximum value for each block
• Skip over blocks that don’t contain the data needed for a given query
• Minimize unnecessary I/O
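The block-skipping idea can be sketched in a few lines. This is a simplified model that assumes a tiny block size and an equality filter; the function names are made up for illustration and do not reflect Redshift internals.

```python
def build_zone_map(values, block_size):
    """Split a column into blocks and record each block's min and max."""
    blocks = [values[i:i + block_size] for i in range(0, len(values), block_size)]
    return [(min(block), max(block), block) for block in blocks]


def scan_equal(zone_map, target):
    """Return matching values and the number of blocks actually read."""
    matches, blocks_read = [], 0
    for low, high, block in zone_map:
        if target < low or target > high:
            continue  # the min/max range rules this block out, so skip it
        blocks_read += 1
        matches.extend(v for v in block if v == target)
    return matches, blocks_read


# Sorted data gives tight, non-overlapping ranges, so most blocks are skipped.
zone_map = build_zone_map(list(range(1, 13)), block_size=4)
matches, blocks_read = scan_equal(zone_map, 6)
print(matches, blocks_read)  # [6] 1
```

Here only one of the three blocks is read. With unsorted data the min/max ranges overlap more, and fewer blocks can be skipped.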
16. Amazon Redshift dramatically reduces I/O
Column storage
Data compression
Zone maps
Direct-attached storage
Large data block sizes
• Use direct-attached storage to maximize throughput
• Hardware optimized for high-performance data processing
• Large block sizes to make the most of each read
• Amazon Redshift manages durability for you
17. Amazon Redshift Has Security Built In
SSL to secure data in transit
Encryption to secure data at rest
AES-256; hardware accelerated
All blocks on disks and in Amazon S3 encrypted
HSM Support
No direct access to compute nodes
Audit logging, AWS CloudTrail, AWS KMS integration
Amazon VPC support
SOC 1/2/3, PCI-DSS Level 1, FedRAMP, HIPAA
Diagram labels: JDBC/ODBC, Customer VPC, Internal VPC, 10 GigE (HPC), Ingestion, Backup, Restore
19. Building a Big Data Application
Getting Started
Diagram labels: web clients, mobile clients, DBMS, corporate data center
20. Building a Big Data Application
Adding a data warehouse
Diagram labels: web clients, mobile clients, DBMS, corporate data center, AWS cloud, Amazon Redshift, Amazon QuickSight
21. Building a Big Data Application
Bringing in Log Data
Diagram labels: web clients, mobile clients, DBMS, corporate data center, AWS cloud, Raw data, Staging Data, Amazon Redshift, Amazon QuickSight
22. Building a Big Data Application
Extending your DW to S3
Diagram labels: web clients, mobile clients, DBMS, corporate data center, AWS cloud, Raw data, Staging Data, ORC/Parquet (query optimized), Amazon Redshift, Amazon QuickSight
23. Building a Big Data Application
Adding a real-time layer
Diagram labels: web clients, mobile clients, DBMS, corporate data center, AWS cloud, Kinesis Streams, Raw data, Staging Data, ORC/Parquet (query optimized), Amazon Redshift, Amazon QuickSight
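As a rough sketch of how a client might feed the real-time layer, the function below writes one JSON event to a Kinesis stream. The stream name, event fields, and partition key are hypothetical; the client is passed in so that a `boto3.client("kinesis")` object (or a test double) can be supplied by the caller.

```python
import json


def send_event(kinesis_client, stream_name, event, partition_key):
    """Serialize an event as JSON and put it on a Kinesis stream."""
    return kinesis_client.put_record(
        StreamName=stream_name,
        Data=json.dumps(event).encode("utf-8"),
        # Records with the same partition key go to the same shard.
        PartitionKey=partition_key,
    )


# Example call, assuming boto3 is installed and a stream named
# "clickstream" exists (both are assumptions, not from the slides):
#
#   import boto3
#   kinesis = boto3.client("kinesis", region_name="us-east-1")
#   send_event(kinesis, "clickstream", {"user": "u1", "action": "view"}, "u1")
```

Using a user or session ID as the partition key keeps one user's events in order within a shard, which is a common but not mandatory choice.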
24. Building a Big Data Application
Adding predictive analytics
Diagram labels: web clients, mobile clients, DBMS, corporate data center, AWS cloud, Kinesis Streams, Raw data, Staging Data, ORC/Parquet (query optimized), Amazon Redshift, Amazon QuickSight
25. Building a Big Data Application
Adding encryption at rest with AWS KMS
Diagram labels: web clients, mobile clients, DBMS, corporate data center, AWS cloud, AWS KMS, Kinesis Streams, Raw data, Staging Data, ORC/Parquet (query optimized), Amazon Redshift, Amazon QuickSight
26. Building a Big Data Application
Protecting Data in Transit & Adding Network Isolation
Diagram labels: web clients, mobile clients, DBMS, corporate data center, SSL/TLS, AWS cloud, VPC subnet, AWS KMS, Kinesis Streams, Raw data, Staging Data, ORC/Parquet (query optimized), Amazon Redshift, Amazon QuickSight
27. Security
• Encryption at rest with choice of key management
  • Service managed, AWS KMS, CloudHSM, on-premises HSM
• Encryption in transit
  • Require SSL; all internal communication over SSL/TLS
• Network isolation using Amazon VPC
• Fine-grained permissions and auditing using AWS IAM and AWS CloudTrail
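For Amazon Redshift, the first three points above map onto cluster settings. The sketch below builds arguments for the boto3 Redshift `create_cluster` and `modify_cluster_parameter_group` calls; the node type, node count, user name, and all identifiers are placeholder assumptions, and the helper functions themselves are hypothetical.

```python
def encrypted_cluster_params(cluster_id, kms_key_id, subnet_group, master_password):
    """Arguments for redshift.create_cluster with encryption and VPC isolation."""
    return {
        "ClusterIdentifier": cluster_id,
        "ClusterType": "multi-node",
        "NodeType": "dc2.large",  # assumption: pick a node type that fits your data
        "NumberOfNodes": 2,
        "MasterUsername": "awsuser",  # placeholder user name
        "MasterUserPassword": master_password,
        "Encrypted": True,  # encryption at rest
        "KmsKeyId": kms_key_id,  # key managed in AWS KMS
        "ClusterSubnetGroupName": subnet_group,  # places the cluster in your VPC
        "PubliclyAccessible": False,
    }


def require_ssl_params(parameter_group_name):
    """Arguments for redshift.modify_cluster_parameter_group to force SSL."""
    return {
        "ParameterGroupName": parameter_group_name,
        "Parameters": [{"ParameterName": "require_ssl", "ParameterValue": "true"}],
    }
```

With a client from `boto3.client("redshift")`, these would be passed as `client.create_cluster(**params)` and `client.modify_cluster_parameter_group(**params)`. In practice the password should come from a secrets store rather than source code.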
28. Compliance
ISO 9001
SOC 3
SOC 2
ISO 27001
ISO 27017
PCI DSS Level 1
ISO 27018
SOC 1 / ISAE 3402
GxP
HIPAA
ITAR
FERPA
FISMA, RMF, and DIACAP
FedRAMP
Section 508 / VPAT
DoD SRG Levels 2 & 4
FIPS 140-2
CJIS
Cloud Security Alliance
MPAA
NIST
MLPS Level 3
G-Cloud
IT-Grundschutz
MTCS Tier 3
IRAP
Cyber Essentials Plus
29. Disaster Recovery
• Amazon EMR and Amazon Redshift clusters are resilient, and failed nodes/hardware are replaced automatically
• Data on S3 is available in all Availability Zones in a Region
• S3 data can be synced across Regions
• Amazon Redshift clusters are continuously backed up to S3, and snapshots can be synced to a second Region
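Syncing Redshift snapshots to a second Region can be turned on through the API. The sketch below wraps the boto3 Redshift `enable_snapshot_copy` call; the cluster identifier, destination Region, and retention period are illustrative assumptions, and the client is passed in by the caller.

```python
def enable_cross_region_snapshots(redshift_client, cluster_id, destination_region,
                                  retention_days=7):
    """Ask Redshift to copy automated snapshots to another Region."""
    return redshift_client.enable_snapshot_copy(
        ClusterIdentifier=cluster_id,
        DestinationRegion=destination_region,
        RetentionPeriod=retention_days,  # days to keep copied automated snapshots
    )


# Example call, assuming boto3 is installed and the cluster exists:
#
#   import boto3
#   client = boto3.client("redshift", region_name="us-east-1")
#   enable_cross_region_snapshots(client, "dw-example", "us-west-2")
```

For a KMS-encrypted cluster, the call also needs a snapshot copy grant in the destination Region (the `SnapshotCopyGrantName` argument), which this sketch leaves out.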
31. NTT DOCOMO
Diagram labels: Data Source, ET, Direct Connect, Client, Forwarder, Loader, State Management, Sandbox, Redshift, S3
Petabytes of data generated on premises and brought to Redshift in the cloud for analysis
High-speed connectivity over a redundant pair of Direct Connect leased lines
Stringent security requirements met by leveraging VPC, VPN, encryption at rest and in transit, CloudTrail, and database auditing
32. Nasdaq – Legacy Warehouse
Expensive ($1.16M annually)
Limited capacity (1 year of data online)
4-8 billion rows inserted per trading day, storing:
• Orders
• Trades
• Quotes
• Market Data
• Security Master
• Membership
DW can be used to analyze market share, client activity, surveillance, power our billing, and more…
33. Nasdaq Architecture
Diagram zones: On premises; AWS Regional (Multi-AZ) Scope; AWS (US-East, primary AZ/VPC)
Diagram labels: RMS Input Sources (multiple systems), Data Ingest Process, MySQL, S3 (Redshift Load files/Manifests; Redshift Snapshots/Backups), SNS (Data Loaded Topic), Redshift Database Cluster, HSM Key Appliance Cluster
35. Useful Resources
• AWS Big Data Blog
• re:Invent 2015 Big Data Sessions
• AWS Marketplace for Big Data Solutions
• Amazon Big Data Partners
36. Summary
• AWS enables you to build sophisticated big data applications
  • Retrospective, real-time, predictive
• You can build incrementally, adding use cases and increasing scale as you go
• AWS provides a broad range of security and auditing features to enable you to meet your security requirements
• AWS makes it easy to build hybrid applications that span your datacenters and the AWS Cloud