Learn about data lifecycle best practices in the AWS Cloud, so you can optimize performance and lower the costs of data ingestion, staging, storage, cleansing, analytics and visualization, and archiving.
The ingest mechanism describes the movement of data from an external source into the data lifecycle. Data ingest refers to identifying the correct data sources, validating and importing the data files from those sources, and sending the data to the desired destination.
Data sources include transactions, ERP systems, clickstream data, log files, devices, and disparate databases being migrated. Generally, the destination is some form of storage or a database (we discuss the “destination” in the following chapter on Data Staging).
Amazon Kinesis Data Firehose provides a simple way to capture and load streaming data. You can create a Firehose delivery stream from the AWS Management Console, configure it with a few clicks, and start sending data to the stream from hundreds of thousands of data sources to be loaded continuously into AWS, all in just a few minutes.
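Once a delivery stream exists, producers can push records to it through the Firehose API. Below is a minimal sketch using the boto3 Python SDK; the stream name and the event fields are hypothetical placeholders.

    import json
    import boto3

    # Minimal sketch: send one clickstream event to an existing Firehose
    # delivery stream. The stream name "clickstream-ingest" is hypothetical.
    firehose = boto3.client("firehose")

    event = {"user_id": "u-123", "page": "/home", "ts": "2019-06-01T12:00:00Z"}

    firehose.put_record(
        DeliveryStreamName="clickstream-ingest",
        Record={"Data": (json.dumps(event) + "\n").encode("utf-8")},
    )

Firehose buffers records like these and delivers them continuously to the configured destination, such as Amazon S3.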
If you tried to use one giant all-purpose tool to do every function, it wouldn't be really good at any single thing. That's often what customers see running their systems, both in the cloud and on premises. Because of this, AWS offers a set of data tier services, ranging from traditional relational stores such as Aurora, Oracle, and SQL Server, to NoSQL databases that store key/value, document, and graph data, to in-memory stores that provide microsecond retrieval of re-hydratable data.
Different data structures often require different types of storage.
Key/value stores are great for data that needs to be stored and queried quickly; examples include session data or the latitude/longitude tracking data of car locations (as Lyft does on AWS). Other use cases call for different storage: heavily connected graph data, data warehousing data, and relational data each suit a different engine. The way the data gets queried is also a major characteristic, closely related to the structure of the data. These are the characteristics, really the design decisions, that determine which database to use. The two most important factors to focus on are the use case and the shape of the data; the rest is often driven by those.
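To make the key/value case concrete, here is a minimal sketch with DynamoDB, one of the AWS key/value stores. The table name, key schema, and items are hypothetical, and the table is assumed to already exist.

    import boto3

    # Minimal key/value sketch: store and fetch the latest position of a
    # vehicle. Table "vehicle-locations" (partition key "vehicle_id") is a
    # hypothetical, pre-existing table.
    dynamodb = boto3.resource("dynamodb")
    table = dynamodb.Table("vehicle-locations")

    # Write the latest lat/long for one vehicle.
    table.put_item(Item={"vehicle_id": "car-42", "lat": "47.6062", "lon": "-122.3321"})

    # Read it back by key; point lookups like this return in milliseconds.
    item = table.get_item(Key={"vehicle_id": "car-42"})["Item"]
    print(item)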
Staging provides the opportunity to perform any data housekeeping tasks prior to making the data available to the organization or its users for analytics.
One of the most common challenges we hear from customers is that their organization has data in multiple systems or locations, including data warehouses, spreadsheets, databases, and text files. Not only is the variety expanding, but its volume in many cases is growing exponentially. Add the complexity of mandatory data security and governance, user access, and the data demands of the analytics, business, and reporting teams, and an organization can find itself unable to see a way forward.
Before data is analyzed, data cleansing ensures that data is transformed and presented in a format that is optimized for the code that consumes it. The extract, transform, load (ETL) process is carried out as part of the data cleansing stage in the lifecycle. For example, a field may contain a date/time in a format that does not meet an algorithmic requirement, or a name field may need the first and last names separated out. Other concerns addressed during the data cleansing stage include merging data sources, aligning formats, converting strings to numerical data, and summarizing data.
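As an illustration, here is a minimal cleansing sketch in Python with pandas covering the two examples above; the column names and date format are hypothetical.

    import pandas as pd

    # Hypothetical raw records with a combined name field and a
    # non-standard date/time string.
    df = pd.DataFrame({
        "full_name": ["Ada Lovelace", "Alan Turing"],
        "event_time": ["06/01/2019 12:30", "06/02/2019 09:15"],
    })

    # Separate first and last names into their own fields.
    df[["first_name", "last_name"]] = df["full_name"].str.split(" ", n=1, expand=True)

    # Normalize the date/time strings into a proper timestamp type.
    df["event_time"] = pd.to_datetime(df["event_time"], format="%m/%d/%Y %H:%M")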
Problem #1 – Many organizations don’t know what they have.
When you accumulate such a diversity of data, you need mechanisms to understand what data you have, where it is located, and what format it is in.
This is metadata management. And if not managed properly (or at all), the data is essentially lost. It is taking up space, but you have no means to put it to use.
A common issue, regardless of whether the data is on premises or in the cloud, is the lack of a metadata management approach from the outset.
The Financial Industry Regulatory Authority (FINRA) oversees more than 3,900 securities firms with approximately 640,000 brokers.
FINRA processes approximately 6 terabytes of data and 37 billion records on an average day to build a complete, holistic picture of market trading in the U.S. On busy days, the stock markets can generate 75 billion+ records.
The way they’re able to make all this data useful, whether to data scientists or business users or others, is through a metadata system they developed and open sourced, called HERD.
This is the same platform that is used by LinkedIn, for example.
But most organizations don't actually go off and build their own tooling.
Ivy Tech is a community college with 60,000 online and in-person course sections, 8,300 staff, 170,000 students, and 130 locations.
Ivy Tech uses metadata capabilities provided by AWS to manage their information.
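The AWS Glue Data Catalog is one example of such a managed metadata capability. The sketch below lists the databases and tables it knows about; it assumes the catalog has already been populated (by crawlers or applications), and pagination is omitted for brevity.

    import boto3

    # Minimal sketch: browse the Glue Data Catalog to see what data you
    # have, where it lives, and how it is structured. (First page only;
    # pagination omitted for brevity.)
    glue = boto3.client("glue")

    for database in glue.get_databases()["DatabaseList"]:
        for table in glue.get_tables(DatabaseName=database["Name"])["TableList"]:
            location = table.get("StorageDescriptor", {}).get("Location", "n/a")
            print(f"{database['Name']}.{table['Name']} -> {location}")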
Data analytics is the stage where an organization can identify ways to increase revenue or reduce cost. Analytics and visualization deliver to decision makers the insights to transform an organization, whether by identifying unmet customer needs or by optimizing operational processes. Data-driven decisions transform how managers allocate resources and evaluate results within an organization. Reliance on data reduces the role of hearsay and instinct when making choices. A manager's intuition is now backed with data at the front end of the planning process, through the course of implementation, and when evaluating the impact of his or her decisions.
Key considerations in this phase include clearly defining the requirements for analytics, aligning the output to the use cases, and ensuring that the consumers of data within the organization find the generated insights actionable. Let's review some of the solutions available for analytics within the AWS portfolio during this stage.
Picking the right analytical engine for your needs
AWS offers analytical engines for several use cases such as big data processing, data warehousing, ad-hoc analysis, real-time streaming, and operational/log analytics.
In this section, you will learn which engines you can use for your use case to analyze all of your data stored in your Amazon S3 data lake in open formats.
You will also learn how to use these engines together for generating new insights, such as complementing your data warehouse workloads with ad-hoc and real-time analytics engines to incorporate new data into your reports.
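For instance, an ad-hoc query against the data lake can run in Athena directly over S3. The sketch below submits one such query via boto3; the database, table, and result bucket are hypothetical placeholders.

    import boto3

    # Minimal sketch: run an ad-hoc SQL query over open-format data in S3.
    # "datalake", "clickstream", and the output bucket are hypothetical.
    athena = boto3.client("athena")

    response = athena.start_query_execution(
        QueryString="SELECT page, COUNT(*) AS hits FROM clickstream GROUP BY page",
        QueryExecutionContext={"Database": "datalake"},
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
    )
    print(response["QueryExecutionId"])  # poll get_query_execution for status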
Redshift Spectrum supports a range of open file formats and SerDes, including Avro, Parquet, TextFile, SequenceFile, RCFile, ORC, RegexSerDe, Grok, and OpenCSV.
Athena supports SequenceFile, TextFile, RCFile, ORC, Parquet, and Avro.
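Because both engines read the same open formats in S3, a table registered once can serve ad-hoc and warehouse queries alike. Below is a sketch that registers hypothetical Parquet data as an external table through Athena DDL; the table name, columns, and S3 paths are placeholders.

    import boto3

    # Minimal sketch: register Parquet files in S3 as an external table.
    # Table name, columns, and S3 locations are hypothetical.
    ddl = """
    CREATE EXTERNAL TABLE IF NOT EXISTS datalake.trades (
        trade_id string,
        symbol   string,
        price    double
    )
    STORED AS PARQUET
    LOCATION 's3://my-data-lake/trades/'
    """

    athena = boto3.client("athena")
    athena.start_query_execution(
        QueryString=ddl,
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
    )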
As a rule of thumb: Athena is simple, Redshift is fast, and EMR is configurable.
Amazon Web Services offers a complete set of cloud storage services for archiving. You can choose Amazon S3 Glacier or S3 Glacier Deep Archive for affordable, non-time-sensitive cloud storage, or Amazon S3 for faster storage, depending on your needs.
AWS cloud storage solutions have achieved numerous compliance standards and security certifications, and provide built-in encryption, helping to ensure that the data you store in AWS meets the requirements of your business. AWS cloud storage solutions make the archival process easy to manage and allow you to focus on the storage of your data, rather than the management of your tape systems and libraries.
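Archiving is typically automated with S3 lifecycle rules. The sketch below shows one hypothetical configuration that moves objects under a prefix to S3 Glacier after 90 days and to Glacier Deep Archive after a year; the bucket, prefix, and day counts are placeholders.

    import boto3

    # Minimal sketch: lifecycle rule that archives aging objects.
    # Bucket name, prefix, and transition days are hypothetical.
    s3 = boto3.client("s3")

    s3.put_bucket_lifecycle_configuration(
        Bucket="my-archive-bucket",
        LifecycleConfiguration={
            "Rules": [{
                "ID": "archive-old-data",
                "Filter": {"Prefix": "logs/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 90, "StorageClass": "GLACIER"},
                    {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
                ],
            }]
        },
    )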