Building Data Lakes and Analytics on AWS; Patterns and Best Practices - BDA305 - Toronto AWS Summit

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Alex Coqueiro
Public Sector Solutions Architecture Team
Amazon Web Services
BDA305
Build Data Lakes and Analytics on AWS:
Patterns & Best Practices

Big Data Is Defined Many Different Ways
Volume, Velocity, Variety, Veracity, Value, Visualization, Variability

Data Is Changing → Analytics Are Adopting
Capture and store
new data at PB-EB scale
Do new types of analytics in
a cost-effective way
• Machine learning
• Big data processing
• Real-time analytics
• Full-text search
New types of
analytics

Most Important: Driving Value from Data
Organizations that successfully generate business value from their data will outperform their peers. An Aberdeen survey saw organizations who implemented a data lake outperforming similar companies by 9% in organic revenue growth.*
Organic revenue growth: Leaders 24%, Followers 15%
*Aberdeen: Angling for Insight in Today’s Data Lake, Michael Lock, SVP Analytics and Business Intelligence

Traditionally, Analytics Used to Look Like This
OLTP ERP CRM LOB
Data warehouse
Business intelligence • Relational data
• TBs–PBs scale
• Schema defined prior to data load
• Operational reporting and ad hoc

Data Lakes Extend the Traditional Approach
Data warehouse
Business intelligence
OLTP ERP CRM LOB
• Relational and nonrelational data
• TBs–EBs scale
• Diverse analytical engines
• Low-cost storage & analytics
Devices Web Sensors Social
Data lake
Big data processing,
real-time, machine learning

Data Lakes from AWS
Analytics
• Unmatched durability, and availability at EB scale
• Best security, compliance, and audit capabilities
• Object-level controls for fine-grain access
• Fastest performance by retrieving subsets of data
• The most ways to bring data in
• Analyze with broadest set of analytics & ML services
Machine learning
Real-time data movement
On-premises data movement
Data Lake on AWS

Managed ML Service
Deep Learning AMIs
Video and Image Recognition
Conversational Interfaces
Deep-Learning Video Camera
Natural Language Processing
Language Translation
Speech Recognition
Text-to-Speech
Interactive Analysis
Hadoop & Spark
Data Warehousing
Full-text search
Real-time analytics
Dashboards & Visualizations
Dedicated Network connection
Secure appliances
Ruggedized Shipping Container
Database migration
Connect Devices to AWS
Real-time Data Streams
Real-time Video Streams
Data Lake
on AWS
Storage & Archival Storage | Data Catalog
Analytics | Machine learning
Real-time data movement | On-premises data movement
Data Lakes and Analytics Portfolio from AWS
Broadest, deepest set of analytic services

Data Lakes and Analytics Portfolio from AWS
Broadest, deepest set of analytic services
Amazon SageMaker
AWS Deep Learning AMIs
Amazon Rekognition
Amazon Lex
AWS DeepLens
Amazon Comprehend
Amazon Translate
Amazon Transcribe
Amazon Polly
Amazon Athena
Amazon EMR
Amazon Redshift
Amazon Elasticsearch Service
Amazon Kinesis
Amazon QuickSight
AWS Direct Connect
AWS Snowball
AWS Snowmobile
AWS Database Migration Service
AWS IoT Core
Amazon Kinesis Data Firehose
Amazon Kinesis Data Streams
Amazon Kinesis Video Streams
Data Lake
on AWS
Amazon S3 | AWS Glue

What data do I have?

Gartner: “Through 2018, 80% of data lakes will not include effective metadata management capabilities, making them inefficient.”
What Data Do I Have?
Data Lake
on AWS
Storage | Archival Storage | Data Catalog

AWS Glue
Data Catalog (Discover)
• Apache Hive Metastore compatible
• Integrated with AWS services
• Automatic crawling
Job Authoring (Develop)
• Auto-generates ETL code
• Python and Apache Spark
• Edit, debug, and share
Job Execution (Deploy)
• Serverless execution
• Flexible scheduling
• Monitoring and alerting

IAM Role
AWS Glue Crawler Databases
Amazon
Redshift
Amazon S3
JDBC Connection
Object Connection
Built-in classifiers
MySQL
MariaDB
PostgreSQL
Aurora
Oracle
Amazon Redshift
Avro
Parquet
ORC
XML
JSON & JSONPaths
AWS CloudTrail
BSON
Logs (Apache (Grok), Linux (Grok), MS (Grok), Ruby, Redis, and many others)
Delimited (comma, pipe, tab, semicolon)
< ALWAYS GROWING…>
What can crawlers discover?
Create additional custom
classifiers
Amazon
DynamoDB
NoSQL Connection
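The crawler pieces on this slide (an IAM role, a data store to scan, and a catalog database to populate) line up with the parameters of the Glue CreateCrawler API. The following is a minimal sketch, assuming boto3 is the client library; the crawler name, role ARN, database name, and bucket path are hypothetical placeholders, and the actual API calls are left as comments because they need AWS credentials.

```python
def build_crawler_request(name, role_arn, database, s3_path):
    """Return keyword arguments for the Glue CreateCrawler API call."""
    return {
        "Name": name,
        "Role": role_arn,          # IAM role the crawler assumes
        "DatabaseName": database,  # Data Catalog database for discovered tables
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


# All values below are hypothetical placeholders.
request = build_crawler_request(
    name="raw-zone-crawler",
    role_arn="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    database="datalake_raw",
    s3_path="s3://example-data-lake/raw/",
)

# With credentials configured, the request could be submitted like this:
#   import boto3
#   glue = boto3.client("glue")
#   glue.create_crawler(**request)
#   glue.start_crawler(Name=request["Name"])
```

A JDBC or DynamoDB source would use a different entry under `Targets`; only the S3 case is sketched here.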

Data Lake on Amazon S3 with AWS Glue
On-premises data
Web app data
Amazon RDS
Other databases
Streaming data
Your data
AMAZON QUICKSIGHT
AMAZON SAGEMAKER

Other Ways of Populating the Catalog
• Call the AWS Glue CreateTable API
• Create table manually
• Run Hive DDL statement
• Import from Apache Hive Metastore
Apache Hive Metastore | AWS Glue ETL | AWS Glue Data Catalog
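As an illustration of the CreateTable route, here is a minimal sketch of a request body for a Parquet table, assuming boto3. The database, table, column names, and S3 location are hypothetical; the input format, output format, and SerDe class names are the standard Hive classes for Parquet.

```python
def build_create_table_request(database, table, location, columns, partition_keys=()):
    """Return keyword arguments for the Glue CreateTable API (Parquet layout)."""
    def to_columns(pairs):
        return [{"Name": name, "Type": hive_type} for name, hive_type in pairs]

    return {
        "DatabaseName": database,
        "TableInput": {
            "Name": table,
            "TableType": "EXTERNAL_TABLE",
            "Parameters": {"classification": "parquet"},
            "PartitionKeys": to_columns(partition_keys),
            "StorageDescriptor": {
                "Columns": to_columns(columns),
                "Location": location,
                "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
                "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
                "SerdeInfo": {
                    "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
                },
            },
        },
    }


# Hypothetical table definition.
table_request = build_create_table_request(
    database="datalake_curated",
    table="clicks",
    location="s3://example-data-lake/curated/clicks/",
    columns=[("user_id", "string"), ("clicks", "bigint")],
    partition_keys=[("event_date", "string")],
)

# With credentials configured:
#   import boto3
#   boto3.client("glue").create_table(**table_request)
```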

But I have my own data formats …?
− There is a custom classifier for that …
Row-Based: GROK Classifier
A grok pattern is a named set of regular expressions (regex) that are used to match data one line at a time.
XML: XML Classifier
XML tag that defines a table row in the XML document.
JSON: JSON Classifier
JSON path to the object, array, or value that defines a row of the table being created. Type the name in either dot or bracket JSON syntax using AWS Glue supported operators.
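To make the grok idea concrete, the sketch below imitates it with Python's `re` module: `%{NAME:field}` tokens are expanded from a small library of named regular expressions, and the result is matched one line at a time. This is not the Glue classifier itself; the pattern library and the log format are assumptions chosen for illustration.

```python
import re

# Hypothetical named sub-patterns, in the spirit of a grok pattern library.
NAMED_PATTERNS = {
    "IP": r"\d{1,3}(?:\.\d{1,3}){3}",
    "WORD": r"\w+",
    "INT": r"\d+",
}

_TOKEN = re.compile(r"%\{(\w+):(\w+)\}")


def compile_grok(expression):
    """Expand %{PATTERN:field} tokens into named regex groups."""
    def expand(match):
        pattern_name, field_name = match.groups()
        return "(?P<%s>%s)" % (field_name, NAMED_PATTERNS[pattern_name])

    return re.compile("^" + _TOKEN.sub(expand, expression) + "$")


def parse_lines(lines, expression):
    """Match each line on its own; keep one dict per matching line."""
    matcher = compile_grok(expression)
    rows = []
    for line in lines:
        match = matcher.match(line.rstrip("\n"))
        if match:
            rows.append(match.groupdict())
    return rows


log_lines = ["10.0.0.1 GET 200", "not a log line", "192.168.1.7 POST 500"]
rows = parse_lines(log_lines, "%{IP:client} %{WORD:method} %{INT:status}")
```

Each matched line becomes one row, and the field names become the columns a classifier would report.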

How do I hydrate my Data Lake?

How do I drive value?
Amazon SageMaker
Amazon Rekognition
Amazon Lex
AWS DeepLens
Amazon Comprehend
Amazon Translate
Amazon Transcribe
Amazon Polly
Amazon Athena
Amazon EMR
Amazon Redshift
Amazon Kinesis
Amazon QuickSight
AWS Direct Connect
AWS Snowball
AWS Snowmobile
AWS IoT Core
Data Lake
on AWS
Storage | Archival Storage | Data Catalog
Real-time data movement | Traditional data movement

Ingest data based on the type of data
Open and comprehensive
• Data movement from on-premises datacenters
• Dedicated network connection
• Secure appliances
• Ruggedized shipping container
• Database migration
• Gateway that lets applications write to the cloud
• Data movement from real-time sources
• Connect devices to AWS
• Real-time data streams
• Real-time video streams
AWS Direct Connect
AWS Snowball
AWS Snowmobile
AWS Storage Gateway
AWS IoT Core
Data movement from
real-time sources
Data movement from your
datacenters
Amazon S3
Amazon Glacier
AWS Glue

Amazon
Kinesis Data
Firehose
Real-time data movement and Data Lakes on AWS
AWS Glue
Data Catalog
Amazon
S3 Data
Data Lake
on AWS
Amazon
Kinesis Data
Streams
Data definition
Kinesis Agent
Apache Kafka
AWS SDK
LOG4J
Flume
Fluentd
AWS Mobile SDK
Kinesis Producer Library
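For the AWS SDK producer path, a minimal sketch of preparing events for Kinesis Data Firehose is shown below. It assumes boto3 and newline-delimited JSON records; the event fields and the delivery stream name in the comment are hypothetical. The sketch only respects the 500-records-per-call limit of PutRecordBatch and ignores the per-call size limit.

```python
import json

MAX_RECORDS_PER_BATCH = 500  # PutRecordBatch accepts at most 500 records per call


def to_firehose_batches(events, batch_size=MAX_RECORDS_PER_BATCH):
    """Serialize events as JSON lines and split them into batches."""
    records = [{"Data": (json.dumps(event) + "\n").encode("utf-8")} for event in events]
    return [records[i:i + batch_size] for i in range(0, len(records), batch_size)]


# Hypothetical sensor readings.
events = [{"sensor_id": i, "temperature": 20 + i % 5} for i in range(1200)]
batches = to_firehose_batches(events)

# With credentials configured, each batch could be sent like this:
#   import boto3
#   firehose = boto3.client("firehose")
#   for batch in batches:
#       firehose.put_record_batch(DeliveryStreamName="example-stream", Records=batch)
```

The trailing newline per record keeps the objects that land in Amazon S3 readable as JSON Lines.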

Amazon S3
Amazon Glacier
AWS Glue
IMPORTANT: Ingest data in its raw form …
Open and comprehensive
• Store the data in its raw form:
• BEFORE
• Transforming
• Analyzing
• Manipulating
• Doing … anything … to it
CSV
ORC
Grok
Avro
Parquet
JSON
• This becomes your source of record you can
always go back to …
• Lifecycle policies allow you to shift it to warm and
cold storage.
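The lifecycle idea can be sketched as an S3 lifecycle configuration that moves raw objects to Standard-IA and then to Glacier. This assumes boto3; the prefix, rule ID, bucket name, and day thresholds are hypothetical choices, not recommendations from the talk.

```python
def build_lifecycle_configuration(prefix, warm_after_days, cold_after_days):
    """Return an S3 lifecycle configuration that tiers objects under a prefix."""
    if not 0 < warm_after_days < cold_after_days:
        raise ValueError("expected 0 < warm_after_days < cold_after_days")
    return {
        "Rules": [
            {
                "ID": "tier-raw-data",  # hypothetical rule name
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": warm_after_days, "StorageClass": "STANDARD_IA"},  # warm
                    {"Days": cold_after_days, "StorageClass": "GLACIER"},      # cold
                ],
            }
        ]
    }


lifecycle = build_lifecycle_configuration("raw/", warm_after_days=30, cold_after_days=90)

# With credentials configured:
#   import boto3
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="example-data-lake", LifecycleConfiguration=lifecycle)
```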

Datasets in the Lake
Raw datasets – immutable datasets that you can always go back
to.
• Abstract out the complexities of how the data is stored
through the catalog and SerDes
Optimizing Analytics and Machine Learning:
Curated datasets – query-optimized for consumption across wide
number of tools

Raw data stored in Data Lake:
Preparation:
Normalized
Partitioned
Compressed
Storage Optimized
Extract – Load – Transform
Preparing raw data for consumption
Data Lake
on AWS
Raw
Ingestion
Curated
DataSets
Data Catalog
ELT
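A minimal local sketch of the raw-to-curated step follows: records are grouped by a partition key, written under Hive-style `key=value` directories, and compressed. Gzip JSON Lines stands in for a columnar format such as Parquet, and a temporary directory stands in for Amazon S3; in practice this step would more likely run as an AWS Glue or Spark job. The record fields are hypothetical.

```python
import gzip
import json
import os
import tempfile
from collections import defaultdict


def write_curated(raw_records, curated_root, partition_key="event_date"):
    """Group raw records by partition value and write gzip JSON Lines files."""
    partitions = defaultdict(list)
    for record in raw_records:
        partitions[record[partition_key]].append(record)

    written = []
    for value, records in sorted(partitions.items()):
        # Hive-style partition directory, e.g. event_date=2018-06-01/
        directory = os.path.join(curated_root, "%s=%s" % (partition_key, value))
        os.makedirs(directory, exist_ok=True)
        path = os.path.join(directory, "part-00000.json.gz")
        with gzip.open(path, "wt", encoding="utf-8") as handle:
            for record in records:
                # The partition value lives in the path, so drop it from the row.
                row = {k: v for k, v in record.items() if k != partition_key}
                handle.write(json.dumps(row) + "\n")
        written.append(path)
    return written


raw = [
    {"event_date": "2018-06-01", "user": "a", "clicks": 3},
    {"event_date": "2018-06-02", "user": "b", "clicks": 5},
    {"event_date": "2018-06-01", "user": "c", "clicks": 1},
]
curated_root = tempfile.mkdtemp()
paths = write_curated(raw, curated_root)
```

The raw records are left untouched, so the curated layout can be rebuilt at any time from the source of record.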

Which tool should I use to analyze my
data?

Different tools for different users … solving different problems
Business
Reporting
Data Scientists
Data Engineer
IDE
Data
Catalog
Data Lake
Central Storage
SageMaker
Machine Learning/Deep Learning

How Do I Drive Value?
Amazon SageMaker
Amazon Rekognition
Amazon Lex
AWS DeepLens
Amazon Comprehend
Amazon Translate
Amazon Transcribe
Amazon Polly
Amazon Athena
Amazon EMR
Amazon Redshift
Amazon Kinesis
Amazon QuickSight
AWS Direct Connect
AWS Snowball
AWS Snowmobile
AWS IoT Core
Data Lake
on AWS

Amazon Athena – interactive analysis
Interactive query service to analyze data in Amazon S3 using standard SQL
No infrastructure to set up or manage and no data to load
Ability to run SQL queries on data archived in Amazon Glacier (coming soon)
• Query instantly: Zero setup cost; just point to Amazon S3 and start querying
• Pay per query: Pay only for queries run; save 30%–90% on per-query costs through compression
• Open: ANSI SQL interface, JDBC/ODBC drivers, multiple formats, compression types, and complex joins and data types
• Easy: Serverless: zero infrastructure, zero administration
Integrated with Amazon QuickSight
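Submitting a query programmatically might look like the sketch below, which builds the arguments for the Athena StartQueryExecution API. It assumes boto3; the database, table, and results bucket are hypothetical.

```python
def build_athena_query(sql, database, output_location):
    """Return keyword arguments for the Athena StartQueryExecution API."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        # Athena writes query results to this S3 location.
        "ResultConfiguration": {"OutputLocation": output_location},
    }


# Hypothetical database, table, and results bucket.
query = build_athena_query(
    sql="SELECT event_date, SUM(clicks) AS clicks FROM clicks GROUP BY event_date",
    database="datalake_curated",
    output_location="s3://example-athena-results/",
)

# With credentials configured:
#   import boto3
#   athena = boto3.client("athena")
#   execution_id = athena.start_query_execution(**query)["QueryExecutionId"]
#   athena.get_query_execution(QueryExecutionId=execution_id)  # poll for status
#   athena.get_query_results(QueryExecutionId=execution_id)    # fetch rows
```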

Familiar Technologies Under the Covers
Used for SQL Queries
In-memory distributed query engine
ANSI-SQL compatible with extensions
Used for DDL functionality
Complex data types
Multitude of formats
Supports data partitioning
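The DDL and partitioning support can be illustrated with a small helper that renders a Hive-style `CREATE EXTERNAL TABLE` statement. This is a sketch only; the table, columns, and location are hypothetical, and the helper does no escaping or validation.

```python
def build_external_table_ddl(table, columns, partition_columns, location):
    """Render a Hive-style DDL statement for a partitioned Parquet table."""
    cols = ",\n  ".join("`%s` %s" % (name, hive_type) for name, hive_type in columns)
    parts = ", ".join("`%s` %s" % (name, hive_type) for name, hive_type in partition_columns)
    return (
        "CREATE EXTERNAL TABLE IF NOT EXISTS %s (\n  %s\n)\n"
        "PARTITIONED BY (%s)\n"
        "STORED AS PARQUET\n"
        "LOCATION '%s'" % (table, cols, parts, location)
    )


# Hypothetical table definition.
ddl = build_external_table_ddl(
    table="clicks",
    columns=[("user_id", "string"), ("clicks", "bigint")],
    partition_columns=[("event_date", "string")],
    location="s3://example-data-lake/curated/clicks/",
)
print(ddl)
```

After the table exists, partitions laid out as `event_date=.../` under the location can be registered with `MSCK REPAIR TABLE clicks`.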

Exploring Data with Amazon Athena
On-premises data
Web app data
Amazon RDS
Other databases
Streaming data
SAGEMAKER

Amazon EMR – big data processing
Analytics and ML at scale
19 open-source projects: Apache Hadoop, Spark, HBase, Presto, and more
Enterprise-grade security
• Latest versions: Updated with the latest open source frameworks within 30 days of release
• Low cost: Flexible billing with per-second billing, Amazon EC2 Spot, Reserved Instances, and Auto Scaling to reduce costs 50%-80%
• Use Amazon S3 storage: Process data directly in the Amazon S3 data lake securely with high performance using the EMRFS connector
• Easy: Launch fully managed Hadoop & Spark in minutes; no cluster setup, node provisioning, cluster tuning
Data Lake

EMR – Enterprise - Hadoop & Spark
Deploy latest releases in Hadoop and Spark ecosystems
EMR releases
Component | emr-4.0.0 (July 2015) | emr-4.7.0 (June 2016) | emr-5.3.0 (January 2017) | emr-5.14.0 (June 2018)
Hadoop | 2.6.0 | 2.7.2 | 2.7.3 | 2.8.3
Ganglia | - | 3.7.2 | 3.7.2 | 3.7.2
HBase | - | 1.2.1 | 1.2.3 + S3 | 1.4.2 + S3
Hive & Catalog | 1.0.0 | 1.0.0 | 2.1.1 | 2.3.2
Hue | - | 3.7.1 | 3.11.0 | 4.1.0
Mahout | 0.10.0 | 0.12.0 | 0.12.2 | 0.13.0
Oozie | - | 4.2.0 | 4.3.0 | 4.3.0
Phoenix | - | 4.7.0 | 4.7.0 | 4.13.0
Pig | 0.14.0 | 0.14.0 | 0.16.0 | 0.17.0
Presto | - | 0.147 | 0.157.1 | 0.194
Spark | 1.4.1 | 1.6.1 | 2.1.0 | 2.3.0
Sqoop | - | 1.4.6 | 1.4.6 | 1.4.7
Tez | - | 0.8.3 | 0.8.4 | 0.8.4
Zeppelin | - | 0.5.6 | 0.6.2 | 0.7.3
Zookeeper | - | 3.4.8 | 3.4.9 | 3.4.10
Flink | - | - | 1.1.4 | 1.4.2
Livy | - | - | - | 0.4.0
MXNet | - | - | - | 1.1.0
• Nineteen open-source projects: Apache Hadoop, Spark, HBase, Presto, and more
• Updated with the latest open source frameworks within 30 days of release

Hadoop/Spark Analytics on AWS
YARN (Hadoop Resource Manager)
Batch | Script | Interactive | Real-time | Machine learning | NoSQL
Data Lake
on AWS
Amazon S3
Amazon EMR
Managed Hadoop/Spark
Object Storage

Amazon S3 – Source of Truth, Multiple Clusters
Amazon S3 (Source of Truth)
Amazon EMR: Interactive Spark Cluster
• HDFS
• EC2 Instance Memory
• Intermediates stored on local disk or HDFS
Amazon EMR: Transient ETL Job
• HDFS
• EC2 Instance Memory
• Intermediates stored on local disk or HDFS
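A transient ETL cluster of this kind can be sketched as an EMR RunJobFlow request: the cluster runs one Spark step that reads from and writes to Amazon S3, then terminates. This assumes boto3 and the default EMR roles; the instance types, cluster size, script location, and log bucket are hypothetical.

```python
def build_transient_etl_cluster(name, script_s3_uri, log_uri):
    """Return keyword arguments for the EMR RunJobFlow API (one Spark step)."""
    return {
        "Name": name,
        "ReleaseLabel": "emr-5.14.0",
        "LogUri": log_uri,
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": [
                {"Name": "master", "InstanceRole": "MASTER",
                 "InstanceType": "m4.xlarge", "InstanceCount": 1},
                {"Name": "core", "InstanceRole": "CORE",
                 "InstanceType": "m4.xlarge", "InstanceCount": 2},
            ],
            # Transient: shut the cluster down once the steps have finished.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [
            {
                "Name": "spark-etl",
                "ActionOnFailure": "TERMINATE_CLUSTER",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": ["spark-submit", "--deploy-mode", "cluster", script_s3_uri],
                },
            }
        ],
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


# Hypothetical script and log locations.
cluster = build_transient_etl_cluster(
    name="nightly-etl",
    script_s3_uri="s3://example-data-lake/jobs/etl.py",
    log_uri="s3://example-data-lake/emr-logs/",
)

# With credentials configured:
#   import boto3
#   boto3.client("emr").run_job_flow(**cluster)
```

Because the data stays in Amazon S3, nothing is lost when the cluster terminates; HDFS on the cluster only holds intermediates.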

Fitting this into the Common Data Catalog
Amazon S3
Interactive Spark cluster
Amazon EMR
Amazon EMR
EMRFS
HDFS
Transient ETL job
Source of Truth
EMRFS
HDFS
Describes the data
MySQL DB
instance
Unified data view
AWS Glue
Data Catalog
Stores the data
…

Data processing with Amazon EMR (Spark)
On-premises data
Web app data
Amazon RDS
Other databases
Streaming data
SAGEMAKER

What if I implement machine learning to
identify complex business insights?

Machine Learning on Your Data Lake
Amazon SageMaker
Amazon Rekognition
Amazon Lex
AWS DeepLens
Amazon Comprehend
Amazon Translate
Amazon Transcribe
Amazon Polly
Amazon Athena
Amazon EMR
Amazon Redshift
Amazon Kinesis
Amazon QuickSight
AWS Direct Connect
AWS Snowball
AWS Snowmobile
AWS IoT Core
Data Lake
on AWS

AWS Machine Learning
Application Services
• Vision: Rekognition Image, Rekognition Video
• Speech: Polly, Transcribe
• Language: Translate, Comprehend, Lex
Platform Services
• Amazon SageMaker
Frameworks & Infrastructure
• Frameworks: TensorFlow, Gluon, Apache MXNet, Cognitive Toolkit, Caffe2 & Caffe, PyTorch, Keras
• Infrastructure: GPU, CPU, Mobile, IoT (Greengrass)

Amazon SageMaker
1. Notebook Instances
2. Algorithms
3. ML Training Service
4. ML Hosting Service
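As a sketch of the training service, the request below describes a SageMaker training job that reads its input from the data lake in Amazon S3 and writes the model artifact back to it. It assumes boto3; the job name, container image URI, role ARN, S3 paths, instance type, and time limit are all hypothetical.

```python
def build_training_job_request(job_name, image_uri, role_arn, train_s3_uri, output_s3_uri):
    """Return keyword arguments for the SageMaker CreateTrainingJob API."""
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,   # container holding the algorithm
            "TrainingInputMode": "File",
        },
        "RoleArn": role_arn,
        "InputDataConfig": [
            {
                "ChannelName": "train",
                "DataSource": {
                    "S3DataSource": {
                        "S3DataType": "S3Prefix",
                        "S3Uri": train_s3_uri,
                        "S3DataDistributionType": "FullyReplicated",
                    }
                },
            }
        ],
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        "ResourceConfig": {
            "InstanceType": "ml.m4.xlarge",
            "InstanceCount": 1,
            "VolumeSizeInGB": 10,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
    }


# All identifiers below are hypothetical placeholders.
training_job = build_training_job_request(
    job_name="clicks-model-001",
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/example-algorithm:latest",
    role_arn="arn:aws:iam::123456789012:role/SageMakerRole",
    train_s3_uri="s3://example-data-lake/curated/clicks/",
    output_s3_uri="s3://example-data-lake/models/",
)

# With credentials configured:
#   import boto3
#   boto3.client("sagemaker").create_training_job(**training_job)
```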

Machine Learning with Amazon SageMaker
On-premises data
Web app data
Amazon RDS
Other databases
Streaming data
SAGEMAKER

Agility and Innovation Are Key
Amazon SageMaker
Amazon Rekognition
Amazon Lex
AWS DeepLens
Amazon Comprehend
Amazon Translate
Amazon Transcribe
Amazon Polly
Amazon Athena
Amazon EMR
Amazon Redshift
Amazon Kinesis
Amazon QuickSight
AWS Direct Connect
AWS Snowball
AWS Snowmobile
AWS IoT Core
Data Lake
on AWS

BDA305
Thank You !!!
Alex Coqueiro
Public Sector Solutions Architecture Team
Amazon Web Services

Please complete the session survey in the
summit mobile app.

Submit Session Feedback
1. Tap the Schedule icon.
2. Select the session you attended.
3. Tap Session Evaluation to submit your feedback.
