Preparing Your Data for Cloud Analytics & AI/ML

© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
WWPS EMEA Tech Business Development
Abir Roychoudhury, TechBD Database and Analytics
Data Lifecycle
Preparing Your Data for Cloud Analytics & AI/ML

Agenda
• Public Sector Situation
• Data Lifecycle Walkthrough
• Demonstration around Redshift Analytics + Machine Learning
• Customer References
• Architectural Principles
• Q&A

What do we observe in Public Sector?
• Data is dispersed and difficult to access
• Limited views on what is going on in the business
• Resource constraints limit business value activities
• Governance and compliance

What is the Big Data Challenge?
Volume
Characteristic: Ranges from TB to PB
Use case: Large data set required for an accurate data model
Solution requirements to address the challenge:
• Offline processing of large data sets
• Transportation
• Extraction (key/value pairs)
Variety
Characteristic: Different sources and formats
Use case: Bring together siloed data sources in different formats
Solution requirements to address the challenge:
• Consolidate disparate sources (structured, semi-structured, unstructured; at rest and in motion)
Velocity
Characteristic: Stringent requirements on the time from data generation to actionable insights
Use case: Stream data created at high speed, only relevant for a short period
Solution requirements to address the challenge:
• Capturing stream data
• Cataloguing the data, saving it for offline use
• Real-time analytics, ad-hoc queries
https://aws.amazon.com/big-data/what-is-big-data/

What do we observe in Public Sector?
According to Forbes:
82% of enterprises are prioritizing analytics and BI as part of their
budgets for new technologies and cloud-based services.
Data warehouse or mart in the cloud (41%), data lake in the cloud
(39%) and BI platform in the cloud (38%) are the top three types
of technologies enterprises are planning to use.
42% are seeking to improve user experiences by automating
discovery of data insights and 26% are using AI to provide user
recommendations.

Data Lifecycle

Data Ingest
Mechanism for data
movement from
external sources into
your data system
Questions to ask:
a) What are my data sources?
b) What is the format of the data?
c) Is the data source immutable?
d) Is it real-time or batch?
e) Where is the destination?

Data Ingestion:
Traditional Data Sources
• Media and Log Files
• ERP Systems
• Databases (SQL/NoSQL)
• Data Warehouses (EDW)
Real-time Data Sources
• IoT Sensors
• Clickstream
• Telemetry
• Business Activities
Ingestion services
• AWS Direct Connect
• AWS Snowball
• AWS Snowmobile
• AWS Database Migration Service
• AWS IoT Core
• Amazon Kinesis Data Firehose
• Amazon Kinesis Data Streams
• Amazon Kinesis Video Streams
• Amazon Managed Streaming for Kafka
Destinations
• Data Lake
• Database
• Data Warehouse

Real-time data movement and Data Lakes on AWS
Producers: Kinesis Agent, Apache Kafka, AWS SDK, LOG4J, Flume, Fluentd, AWS Mobile SDK, Kinesis Producer Library
Streaming: Amazon Kinesis Data Streams, Amazon Kinesis Data Firehose
Storage: Amazon S3 (Data Lake on AWS)
Data definition: AWS Glue Data Catalog
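As a rough illustration of how a producer might hand events to Amazon Kinesis Data Firehose, the sketch below serializes events as newline-delimited JSON and sends them in batches. It assumes a boto3-style Firehose client whose `put_record_batch` accepts `DeliveryStreamName` and `Records`; the stream name, the event fields, and the helper names `to_firehose_records`, `chunk`, and `send` are hypothetical and not part of the original deck.

```python
import json
from typing import Any, Dict, Iterable, List

# PutRecordBatch accepts at most 500 records per call.
MAX_BATCH = 500


def to_firehose_records(events: Iterable[Dict[str, Any]]) -> List[Dict[str, bytes]]:
    """Serialize each event as one newline-terminated JSON record."""
    return [
        {"Data": (json.dumps(event, sort_keys=True) + "\n").encode("utf-8")}
        for event in events
    ]


def chunk(records: List[Dict[str, bytes]], size: int = MAX_BATCH) -> List[List[Dict[str, bytes]]]:
    """Split records into batches no larger than `size`."""
    return [records[i:i + size] for i in range(0, len(records), size)]


def send(client: Any, stream_name: str, events: Iterable[Dict[str, Any]]) -> int:
    """Send events in batches; return how many records the service rejected.

    `client` is assumed to behave like boto3.client("firehose").
    """
    failed = 0
    for batch in chunk(to_firehose_records(events)):
        response = client.put_record_batch(
            DeliveryStreamName=stream_name, Records=batch
        )
        failed += response.get("FailedPutCount", 0)
    return failed
```

A caller would typically retry the rejected records; that step is left out here to keep the sketch short.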

Single / monolithic vs. purpose-built / microservices

Benefits of purpose-built architectures
• Better performance
• Better scale
• More functionality
• Easier to debug
• Independence between teams

What is the data structური?
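The two lookups on this slide (data structure and access pattern) can be combined into a small helper. The sketch below is only one way to read them: the dictionary keys are shortened labels taken from the slide, and intersecting the two answers is an assumption made for illustration, not a rule stated in the deck.

```python
from typing import Dict, List

# Data structure -> candidate store categories (from the slide).
BY_STRUCTURE: Dict[str, List[str]] = {
    "fixed schema": ["SQL", "NoSQL"],
    "schema-free (json)": ["NoSQL", "Search"],
    "key/value": ["In-memory", "NoSQL"],
    "graph": ["GraphDB"],
    "time interval": ["Time Series"],
    "ledger": ["Ledger"],
}

# Access pattern -> candidate store categories (from the slide).
BY_ACCESS: Dict[str, List[str]] = {
    "put/get": ["In-memory", "NoSQL"],
    "simple relationships": ["NoSQL"],
    "multi-table joins": ["SQL"],
    "faceting, search": ["Search"],
    "graph traversal": ["GraphDB"],
}


def candidate_stores(structure: str, access_pattern: str) -> List[str]:
    """Return categories suggested by both lookups (assumed: intersection)."""
    by_structure = BY_STRUCTURE[structure.lower()]
    by_access = BY_ACCESS[access_pattern.lower()]
    return [store for store in by_structure if store in by_access]
```

An empty result, for example a ledger structure with a put/get access pattern, suggests the two requirements pull toward different purpose-built stores.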

Database Characteristics
Service: Amazon QLDB | Amazon DynamoDB | Amazon RDS / Aurora | Amazon Timestream | Amazon Elasticsearch | Amazon Neptune | Amazon S3 + Glacier
Use Cases: Immutable Ledger | Key Value with GSI/LSI Indexes | OLTP, Transactional | Stores and processes data by time intervals | Log Analysis, Reverse Indexing | Graph | Data Lake / File and Object store
Performance: Very High Performance | Ultra High request rate, Ultra low to low latency | Very high request rate, low latency | High request rate, low latency | Medium request rate, low latency | Medium request rate, low latency | High Throughput
Shape: Ledger | K/V and Document | Relational | Time Series | Documents | Node/Edges | Files
Size: TB, PB (no limits) | GB, Mid TB | GB, Low TB | GB, TB | GB, Mid TB | GB, TB, PB, EB (no limits)
Cost / GB: $ | ¢¢ - $$ | $ | $ | $$ | $ | ¢ - ¢4/10
VPC Support: Inside VPC | VPC Endpoint | Inside VPC | Outside or Inside VPC | Inside VPC | VPC Endpoint

Data Staging
Validate, Verify,
Catalog the incoming
Raw Data
Perform common
housekeeping tasks
Questions to ask:
Which validation checks?
How will the raw dataset catalog be populated?
Automated Tagging of data?
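To make "validation checks" and "automated tagging" concrete, here is a minimal sketch of a staging step. The record schema (`station_id`, `observed_at`, `temperature_c`), the plausibility range, and the tag values are all hypothetical; a real pipeline would take its checks from the data owner.

```python
from datetime import datetime
from typing import Any, Dict, List, Tuple

# Hypothetical schema for an incoming raw record.
REQUIRED_FIELDS = ("station_id", "observed_at", "temperature_c")


def validate(record: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the record is valid."""
    problems = [f"missing field: {f}" for f in REQUIRED_FIELDS if f not in record]
    if problems:
        return problems
    try:
        datetime.fromisoformat(record["observed_at"])
    except (TypeError, ValueError):
        problems.append("observed_at is not an ISO 8601 timestamp")
    temp = record["temperature_c"]
    if isinstance(temp, bool) or not isinstance(temp, (int, float)):
        problems.append("temperature_c is not numeric")
    elif not -90.0 <= temp <= 60.0:  # assumed plausibility range
        problems.append("temperature_c outside plausible range")
    return problems


def stage(records: List[Dict[str, Any]]) -> Tuple[List[Dict[str, Any]], List[Dict[str, Any]]]:
    """Split raw records into accepted and rejected, tagging each one."""
    accepted, rejected = [], []
    for record in records:
        problems = validate(record)
        zone = "quarantine" if problems else "staged"
        tagged = {**record, "tags": {"zone": zone}}
        if problems:
            rejected.append({**tagged, "problems": problems})
        else:
            accepted.append(tagged)
    return accepted, rejected
```

Keeping rejected records, with the reasons attached, rather than dropping them makes it possible to audit the raw dataset later.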

Data Cleansing
Transform and
Process data for
downstream
analytics
Questions to ask:
Which users and analytics will consume data?
Is there a common data model?
Optimize for reads/queries or writes?
How will data cleanup over time be performed (compaction, etc.)?
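One common form of cleanup over time is compaction: collapsing many incremental records into the latest version per key. The sketch below assumes hypothetical field names (`station_id`, `updated_at`) and assumes that `updated_at` values sort correctly as strings, as ISO 8601 dates do.

```python
from typing import Any, Dict, Iterable, List


def compact(
    records: Iterable[Dict[str, Any]],
    key: str = "station_id",
    version: str = "updated_at",
) -> List[Dict[str, Any]]:
    """Keep only the most recent record for each key, sorted by key."""
    latest: Dict[Any, Dict[str, Any]] = {}
    for record in records:
        k = record[key]
        # A later (or equal) version replaces the one seen so far.
        if k not in latest or record[version] >= latest[k][version]:
            latest[k] = record
    return sorted(latest.values(), key=lambda r: r[key])
```

Whether to optimize this step for reads or for writes depends on the answer to the question above about who consumes the data.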

ELT/ETL
Preparing Raw, Staging, and Cleansed Data Lakes
Data Lake on AWS: Raw Ingestion → ELT/ETL → Staged Datasets → ELT/ETL → Optimized ML Datasets
Cleansed “views” of the data

Demonstration Setting
• Use data from AWS Open Data:
https://aws.amazon.com/opendata/
• Cornell University has created a public data lake of climate data in ORC* format
• Get data into S3 and the AWS Glue Catalogue
• Look at the structure
• Move to the Redshift data warehouse and analyse temperature development by min/max and location
• Analyse and make a basic prediction with advanced analytics using ML in SageMaker (using DeepAR forecasting)
• *Redshift supports ORC and Parquet
Demonstration Setting
• Cornell Open Data provides climate data
• Data is copied to local S3 or can be queried directly from the Cornell Data Lake
• Glue catalogues the data
• Early insight into data structure
• Redshift loads data for queries on temperature by period and location
• Data is enriched by an ML model (DeepAR) for forecasting
• User can query the report with QuickSight visualisation

Demonstration

Data Analytics & Visualization
Deliver to decision makers the insights needed to transform an organization, by identifying unmet customer needs or by optimizing operational processes
Questions to ask:
What business question is being answered?
Does the data support answering it?
Who are the users driving the insights?
What skills do those users have?

Customer References

FINRA: Migrating to AWS
• Petabytes of data generated on-premises, brought to AWS, and stored in S3
• Thousands of analytical queries performed on EMR and Amazon Redshift
• Stringent security requirements met by leveraging VPC, VPN, encryption at-rest and in-transit, CloudTrail, and database auditing
Diagram labels: Flexible Interactive Queries, Predefined Queries, Surveillance Analytics, Web Applications, Analysts; Regulators

Hearst’s Serverless Data Pipeline
Sources: cosmopolitan.com, caranddriver.com, sfchronicle.com, elle.com
Flow: Ingestion proxy (Node.js) → Serverless data pipeline → Real-time analysis; Offline analysis and archive

Common Types of Data Analytics
1. Process high variety or volume structured or unstructured datasets
• Big Data Processing
2. Power Business Users to drive Insights
• Data Warehousing
3. Interactively query and explore datasets
• Ad Hoc Querying
4. Analyze what’s happening now
• Streaming Analytics
5. Drive operational and security understanding
• Log Analysis

Which Analytics Should I Use? PROCESS / ANALYZE
Batch
Takes minutes to hours
Example: Daily/weekly/monthly reports
Amazon EMR (MapReduce, Hive, Pig, Spark)
Interactive
Takes seconds
Example: Self-service dashboards
Amazon Redshift, Amazon Athena, Amazon EMR (Presto, Spark)
Stream
Takes milliseconds to seconds
Example: Fraud alerts, 1 minute metrics
Amazon EMR (Spark Streaming), Amazon Kinesis Analytics, KCL,
AWS Lambda, etc.
Predictive
Takes milliseconds (real-time) to hours (batch)
Example: Fraud detection, Forecasting demand, Speech
recognition
Amazon SageMaker, Polly, Rekognition, Transcribe, Translate,
Amazon EMR (Spark ML), Deep Learning AMI (MXNet, TensorFlow,
Theano, Torch, CNTK and Caffe)
[Diagram: analytics tools arranged from slow to fast across Batch, Interactive, Predictive, and Stream: Amazon EMR, Presto, Amazon Redshift & Spectrum, Amazon Athena, Amazon ES, Amazon ML, KCL Apps, AWS Lambda, Amazon Kinesis Analytics]

Which Analytics Tool Should I Use?
Tool: Amazon Redshift | Amazon Redshift Spectrum | Amazon Athena | Amazon EMR (Presto, Spark, Hive)
Use case: Optimized for data warehousing | Query S3 data from Redshift | Interactive queries over S3 data | Presto: interactive query; Spark: general purpose; Hive: batch
Scale/Throughput: ~Nodes | ~Nodes | Automatic | ~Nodes
Managed Service: Yes | Yes | Yes, Serverless | Yes
Storage: Local storage | Amazon S3 | Amazon S3 | Amazon S3, HDFS
Optimization: Columnar storage, data compression, and zone maps | AVRO, PARQUET, TEXT, SEQ, RCFILE, ORC, etc. | AVRO, PARQUET, TEXT, SEQ, RCFILE, ORC, etc. | Framework dependent
Metadata: Redshift Catalog | Glue Catalog | Glue Catalog | Glue Catalog or Hive Meta-store
Auth/Access controls: IAM, Users, groups, and access controls | IAM, Users, groups, and access controls | IAM | IAM, LDAP & Kerberos
UDF support: Yes (Scalar) | Yes (Scalar) | No | Yes

Which Stream Processing Technology Should I Use?
Technology: Amazon EMR (Spark Streaming) | KCL Application | Amazon Kinesis Analytics | AWS Lambda
Managed Service: Yes | No (EC2 + Auto Scaling) | Yes | Yes
Serverless: No | No | Yes | Yes
Scale / Throughput: No limits / ~ nodes | No limits / ~ nodes | No limits / automatic | No limits / automatic
Availability: Single AZ | Multi-AZ | Multi-AZ | Multi-AZ
Programming Languages: Java, Python, Scala | Java, others via MultiLangDaemon | ANSI SQL or Java/Flink | Node.js, Java, Python, .NET Core
Sliding Window Functions: Built-in | App needs to implement | Built-in | No
Reliability: KCL and Spark checkpoints | Managed by KCL | Managed by Amazon Kinesis Analytics | Managed by AWS Lambda
AWS Lake Formation
Build a secure data lake in days
• Enforce security policies across multiple services
• Gain and manage new insights
• Identify, ingest, clean, and transform data
Data Archiving
Makes the archival process easy
to manage, and allows you to
focus on the storage of your
data, rather than the
management of your tape
systems and library.

Securing, Protecting and Managing Data
• Access policy options and AWS IAM (resource-based and user-based policies)
• Data encryption with Amazon S3 and AWS KMS
• S3 protects against corruption, loss, and accidental overwrites, modifications, or deletions
• Managing data with object tagging
• S3 compliance certifications and standards include PCI-DSS, SOC 1, 2, and 3, HIPAA/HITECH, FedRAMP, SEC Rule 17, FISMA, and the EU Data Protection Directive
https://docs.aws.amazon.com/en_pv/whitepapers/latest/building-data-lakes/securing-protecting-managing-data
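As a small illustration of combining encryption with AWS KMS and object tagging on upload, the sketch below builds the keyword arguments for a boto3 S3 `put_object` call. The bucket, key, KMS key ID, and tag names are hypothetical placeholders, and the helper `build_put_object_args` is introduced here only for illustration.

```python
from typing import Any, Dict
from urllib.parse import urlencode


def build_put_object_args(
    bucket: str,
    key: str,
    body: bytes,
    kms_key_id: str,
    tags: Dict[str, str],
) -> Dict[str, Any]:
    """Build arguments for an S3 upload with SSE-KMS and object tags."""
    return {
        "Bucket": bucket,
        "Key": key,
        "Body": body,
        "ServerSideEncryption": "aws:kms",  # encrypt at rest with AWS KMS
        "SSEKMSKeyId": kms_key_id,
        "Tagging": urlencode(tags),  # tags are passed as a URL-encoded string
    }


# Usage, assuming boto3 is installed and credentials are configured:
#   import boto3
#   boto3.client("s3").put_object(**build_put_object_args(...))
```

Access to the tagged, encrypted object is still governed by the IAM and bucket policies mentioned above; tagging and encryption complement those policies rather than replace them.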

Architectural Principles
1. Build decoupled systems
• Data → Store → Process → Store → Analyze → Answers
2. Use the right tool for the job
• Data structure, latency, throughput, access patterns
3. Leverage managed and serverless services
• Scalable/elastic, available, reliable, secure, no/low admin
4. Use event-journal design patterns
• Immutable datasets (data lake), materialized views
5. Be cost-conscious
• Big data ≠ big cost
6. Enable your applications with Machine Learning (ML)
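Principle 4 (event-journal design patterns) can be sketched as an append-only log plus a view that is recomputed from it. The example below is a toy in-memory version; the `Event` fields and the balance-style view are hypothetical and chosen only to show the pattern.

```python
from typing import Dict, List, NamedTuple, Tuple


class Event(NamedTuple):
    sequence: int
    account: str
    amount: int


class Journal:
    """Append-only log: events are added, never updated or deleted."""

    def __init__(self) -> None:
        self._events: List[Event] = []

    def append(self, account: str, amount: int) -> Event:
        event = Event(len(self._events) + 1, account, amount)
        self._events.append(event)
        return event

    def read(self) -> Tuple[Event, ...]:
        """Return an immutable snapshot of all events."""
        return tuple(self._events)


def materialize_balances(journal: Journal) -> Dict[str, int]:
    """Derive a view (balance per account) by replaying the journal."""
    balances: Dict[str, int] = {}
    for event in journal.read():
        balances[event.account] = balances.get(event.account, 0) + event.amount
    return balances
```

Because the journal is never modified, a view can be dropped and rebuilt at any time, or a new view added later, without touching the stored data.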

Thank you
& Questions
