"Increasing demands to collect, store, and analyze massive amounts of data often means that the same tools and approaches that worked in the past, don't work anymore. That’s why many organizations are shifting to a data lake architecture. A data lake is an architectural approach that allows you to store massive amounts of data into a central location, so it’s readily available to be categorized, processed, analyzed and consumed by diverse groups within an organization. In this tech talk, we introduce key concepts for a data lake and present aspects related to its implementation. We highlight the core components of a data lake, such as storage, compute, analytics, databases, stream processing, data management, and security. We discuss how to choose the right technologies for each component of the data lake, based on criteria such as data structure, query latency, cost, request rate, item size, data volume, durability, and so on. We also provide a reference architecture and recommendations to get started with a data lake implementation on AWS.
Learning Objectives:
• Understand key concepts and architectural components of a data lake architecture
• Describe how and when to use a broad set of analytic and data management tools in a data lake architecture
• Get insights on how to get started with a data lake implementation on AWS
2. What to expect from this session
• Data Lake concept
• Important capabilities of a Data Lake
• Matomy’s Data Lake implementation
• Big Data Reference Architecture
4. What is a Data Lake?
A data lake is an increasingly popular way to store and analyze massive volumes and heterogeneous types of data in a centralized repository.
5. Benefits of a Data Lake – Quick Ingest
“How can I collect data quickly from various sources and store it efficiently?”
Quickly ingest data without needing to force it into a pre-defined schema.
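As an illustrative sketch (not from the talk), schema-free ingest can be as simple as landing records as newline-delimited JSON; the `to_ndjson` helper name is hypothetical.

```python
import json

def to_ndjson(records):
    """Serialize heterogeneous records as newline-delimited JSON.

    No schema is enforced: each record keeps whatever fields it arrived
    with, so new sources can be ingested without upfront modeling.
    """
    return "\n".join(json.dumps(r, sort_keys=True) for r in records)

# Records from different sources with different shapes land side by side.
batch = to_ndjson([
    {"source": "clickstream", "user": 42, "page": "/home"},
    {"source": "billing", "invoice": "INV-7", "amount": 19.99},
])
```

The resulting blob can be written to object storage as-is; any structure is imposed later, at read time.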
6. Benefits of a Data Lake – All Data in One Place
“Why is the data distributed in many locations? Where is the single source of truth?”
Store and analyze all of your data, from all of your sources, in one centralized location.
7. Benefits of a Data Lake – Storage vs Compute
“How can I scale up with the volume of data being generated?”
Separating your storage and compute allows you to scale each component as required.
8. Benefits of a Data Lake – Schema on Read
“Is there a way I can apply multiple analytics and processing frameworks to the same data?”
A Data Lake enables ad-hoc analysis by applying schemas on read, not write.
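To make the schema-on-read idea concrete, here is a minimal sketch (names and field layout are assumptions, not from the talk): the same raw records are read twice, each time through a different projected schema.

```python
import json

# Raw events ingested as-is, with differing shapes; schema-on-write
# would have forced one structure on them at load time.
RAW = [
    '{"user": "7", "page": "/home", "ts": "100"}',
    '{"user": "8", "referrer": "ads", "ts": "101"}',
]

def read_with_schema(lines, schema):
    """Apply a schema at read time: pick fields and cast types.

    Different analyses can project different schemas over the same
    raw data without rewriting it.
    """
    out = []
    for line in lines:
        rec = json.loads(line)
        out.append({
            field: cast(rec[field]) if field in rec else None
            for field, cast in schema.items()
        })
    return out

# Two different "tables" over the same raw data.
pageviews = read_with_schema(RAW, {"user": int, "page": str})
timeline = read_with_schema(RAW, {"user": int, "ts": int})
```

Engines such as Athena, Spark, or EMR apply the same principle at much larger scale: the table definition lives in the query layer, not in the stored files.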
10. Important components of a Data Lake
Ingest and Store
Catalog & Search
Protect & Secure
Access & User Interface
11. Ingest and Store
Ingest streaming and batch data
Support for any type of data at scale
Durable
Low cost
12. Amazon S3 as your cluster’s persistent data store
Separate compute and storage
Resize and shut down analytics compute environments with no data loss
Point multiple compute clusters at the same data in Amazon S3
13. Data Ingestion into Amazon S3
AWS Direct Connect
AWS Snowball
ISV Connectors
Amazon Kinesis Firehose
S3 Transfer Acceleration
AWS Storage Gateway
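Whichever ingestion path is used, a common convention (an assumption here, not prescribed by the talk) is to land objects under date-partitioned, Hive-style keys so downstream engines can prune by date; the `s3_key` helper name is hypothetical.

```python
from datetime import datetime, timezone

def s3_key(source, event_time, suffix="json"):
    """Build a date-partitioned S3 object key.

    The year=/month=/day= layout is a widely used Hive-style
    convention that lets engines such as EMR or Athena skip
    partitions that a query does not need.
    """
    t = event_time.astimezone(timezone.utc)
    return (f"raw/{source}/year={t:%Y}/month={t:%m}/day={t:%d}/"
            f"{t:%H%M%S}.{suffix}")

key = s3_key("clickstream", datetime(2017, 5, 3, 14, 30, tzinfo=timezone.utc))
```

The actual upload would then be a single `put_object` call against the bucket; only the key layout is shown here so the sketch runs without AWS access.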
14. Use S3 as Data Substrate for Compute
Amazon S3: highly durable, low-cost, scalable storage
Compute and analytics services pointed at the same data: EMR, Spark, Storm, Kinesis, Redshift, DynamoDB, RDS, Athena
Bulk transfer: Import/Export Snowball
15. Metadata lake (Catalog & Search)
Used for summary statistics and data classification management
Simplified model for data discovery & governance
16. Catalog & Search Architecture
Data collectors (EC2, ECS) put objects into an S3 bucket.
Object Created and Object Deleted events trigger an AWS Lambda function, which puts an item into the metadata index (Amazon DynamoDB).
The DynamoDB update stream triggers a second AWS Lambda function, which extracts the search fields and updates the search index (Amazon Elasticsearch).
17. API & User Interface
Exposes the data lake to customers
Programmatically query the catalogue
Expose a search API
Ensures that entitlements are respected
18. API & UI Architecture
Users access the data lake through a static website (UI) and an API.
API Gateway fronts AWS Lambda functions for the API and user management.
The Lambda functions query the metadata index (Amazon DynamoDB) and the search index (Amazon Elasticsearch).
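Inside the search Lambda, the request typically becomes an Elasticsearch query body. The sketch below builds such a body using the standard `match` query; the searched field name (`key`) and result size are assumptions, and the call that sends the body to the Amazon Elasticsearch endpoint is omitted so the sketch runs standalone.

```python
def search_request(text, max_results=10):
    """Build an Elasticsearch query body for the catalog search API.

    The API Lambda would POST this body to the search index and
    return matching catalog entries, after checking the caller's
    entitlements.
    """
    return {
        "size": max_results,
        "query": {"match": {"key": text}},
    }

body = search_request("clickstream")
```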
19. Protect and Secure
Access control – authentication & authorization
Data protection – encryption
Logging and monitoring
20. Implement the right controls
Security
§ Identity & Access Management
§ Bucket policies
§ Access Control Lists (ACLs)
§ Query string authentication
§ Private VPC endpoints to Amazon S3
Encryption
§ SSL endpoints
§ Server-side encryption (SSE-S3, SSE-C, SSE-KMS)
§ Client-side encryption
Compliance
§ Bucket access logs
§ Lifecycle management policies
§ Versioning & MFA deletes
§ Certifications – HIPAA, PCI, SOC 1/2/3, etc.
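One concrete way bucket policies and encryption combine is the widely documented S3 policy that denies any `PutObject` request lacking a server-side-encryption header. The sketch below only builds the policy document (the `put_bucket_policy` call that would apply it is omitted); the function name is hypothetical.

```python
import json

def require_sse_policy(bucket):
    """S3 bucket policy denying PutObject requests that do not
    specify server-side encryption.

    Uses the Null condition on the s3:x-amz-server-side-encryption
    key: if the header is absent ("Null": true), the request is denied.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "DenyUnencryptedPuts",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:PutObject",
            "Resource": f"arn:aws:s3:::{bucket}/*",
            "Condition": {
                "Null": {"s3:x-amz-server-side-encryption": "true"},
            },
        }],
    }

policy_json = json.dumps(require_sse_policy("data-lake"))
```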
22. A Data Lake on AWS
Central Storage: S3
Collect & Ingest: Kinesis Firehose, Snowball, Database Migration Service, Direct Connect
Catalog & Search: DynamoDB, Elasticsearch
Access & User Interface: API Gateway, Identity & Access Management, Cognito
Process & Analyze: QuickSight, Amazon AI, EMR, Redshift, Athena, Kinesis Analytics, RDS
Protect & Secure: Security Token Service, CloudWatch, CloudTrail, Key Management Service