ABD318_Architecting a data lake with Amazon S3, Amazon Kinesis, AWS Glue and Amazon Athena

© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
AWS re:INVENT
Architecting a data lake with Amazon S3, Amazon Kinesis, AWS Glue and Amazon Athena
Rohan Dhupelia, Analytics Platform Manager, Atlassian
Abhishek Sinha, Senior Product Manager, Amazon Athena
ABD318

Agenda
• Characteristics of a data lake
• The basic components of building a data lake and services corresponding to each
• Example 1: Building a data lake to unify real-time and batch data processing needs
• Example 2: The Atlassian self-service data lake

Reasons for building a data lake
Exponential growth in data
Transactions
ERP
Sensor Data
Billing
Web logs
Social
Infrastructure logs

Diversified consumers
Data Scientists
Business Analysts
External Consumers
Applications

Multiple access mechanisms
API Access
BI Tools
Notebooks

Characteristics of a data lake
Collect Anything
Dive in Anywhere
Flexible Access
Future Proof

Amazon S3 as the data lake

Simplified architectural view
Data sources: Transactions, Web logs / cookies, ERP, Connected devices
Ingestion mechanism
Amazon S3
Process
Consume

There are lots of ingestion tools
Data sources: Transactions, Web logs / cookies, ERP, Connected devices
Ingestion: S3 Transfer Acceleration
Amazon S3
Process
Consume

Variety of data processing tools
Data sources: Transactions, Web logs / cookies, ERP, Connected devices
Ingestion: S3 Transfer Acceleration
Amazon S3
Process:
• Amazon AI: ML/DL Services
• Amazon Athena: Interactive Query
• Amazon EMR: Managed Hadoop & Spark
• Amazon Redshift + Spectrum: Petabyte-scale Data Warehousing
• Amazon Elasticsearch: Real-time log analytics & search
Consume

And multiple ways to consume the data
Consume:
• Amazon QuickSight: Fast, easy to use, cloud BI
• Analytic Notebooks: Jupyter, Zeppelin, HUE
• Amazon API Gateway: Programmatic Access

Because data is never perfect
AWS Lambda: Trigger-based code execution
AWS Glue: Event-based serverless ETL engine
Amazon EMR: Spark and Hive running on EMR
Clean
Transform
Concatenate
Convert to better formats
Schedule transformations
Event-driven transformations
Transformations expressed as code

ETL when you need it

Metadata? AWS Glue Data Catalog
Central metadata catalog for the data lake
One per account
Allows you to share metadata between Amazon Athena, Amazon Redshift Spectrum, EMR & JDBC sources
We added a few extensions:
• Search over metadata for data discovery
• Connection info – JDBC URLs, credentials
• Classification for identifying and parsing files
• Versioning of table metadata as schemas evolve and other metadata are updated

AWS Glue Data Catalog - Crawlers
Helping catalog your data
Crawlers automatically build your Data Catalog and keep it in sync
Automatically discover new data and extract schema definitions
• Detect schema changes and version tables
• Detect Hive-style partitions on Amazon S3
Built-in classifiers for popular types; custom classifiers using Grok expressions
Run ad hoc or on a schedule; serverless – only pay when the crawler runs
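To illustrate what a Hive-style partition looks like in an S3 key, here is a minimal Python sketch. It is not how Glue crawlers are implemented; the function name and the example key are hypothetical, and it only shows the `name=value` path convention that such a layout relies on.

```python
def parse_hive_partitions(key: str) -> dict:
    """Return the name=value path segments of an S3 object key, in order."""
    partitions = {}
    # The last segment is the object name, so only directory segments are checked.
    for segment in key.split("/")[:-1]:
        name, sep, value = segment.partition("=")
        if sep and name and value:
            partitions[name] = value
    return partitions


# Hypothetical key, used only for illustration.
print(parse_hive_partitions("sensors/year=2017/month=10/day=10/part-0000.json.gz"))
```

Each detected pair would correspond to one partition column of the table.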

Data Catalog – Table Details
Table schema
Table properties
Data statistics
Nested fields

Data Catalog: Version Control
List of table versions
Compare schema versions

Automatic Partition Detection
Automatically register available partitions
Table partitions

A central metadata store for your lake
Hive-compatible Metastore

Real-time (in-stream processing)
Spark Streaming & Flink on EMR
Amazon Kinesis Analytics

Write once, catalog once, read multiple, ETL anywhere

Let’s take an example
Sensor/IoT device: record-level data
Business Questions
1. What is going on with a specific sensor?
2. Daily aggregations (device, inefficiencies, average temperature)
3. A real-time view of how many sensors are showing inefficiencies
Operations
1. Scale
2. High availability
3. Less management overhead
4. Pay for what I need

Let’s push this data into Kinesis
Kinesis Firehose
Amazon S3
Amazon Athena

Querying it in Amazon Athena
Either create a crawler to auto-generate the schema
OR
Write a DDL statement via the Athena console, API, or JDBC/ODBC driver
Start querying data
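As a rough sketch of the "write a DDL" option, the Python helper below assembles a CREATE EXTERNAL TABLE statement for the raw sensor data. The column names are taken from the queries shown later in this deck; the column types, the JSON SerDe, and the bucket location are assumptions and would need to match the actual data.

```python
def build_raw_table_ddl(database: str, table: str, location: str) -> str:
    """Return a CREATE EXTERNAL TABLE statement for JSON sensor events."""
    # Column names come from the queries in this deck; the types are assumptions.
    columns = [
        ("uuid", "string"),
        ("devicets", "string"),
        ("deviceid", "int"),
        ("temp", "int"),
    ]
    column_sql = ",\n  ".join(f"`{name}` {dtype}" for name, dtype in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS `{database}`.`{table}` (\n"
        f"  {column_sql}\n"
        ")\n"
        # Assumes newline-delimited JSON; use a different SerDe for other formats.
        "ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'\n"
        f"LOCATION '{location}';"
    )


# The bucket below is a placeholder, not a location named in the talk.
print(build_raw_table_ddl("awsblogsgluedemo", "raw", "s3://example-bucket/raw/"))
```

The resulting statement could be pasted into the Athena console or submitted through the API or a JDBC/ODBC driver.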

Query daily aggregates in Amazon Athena
Kinesis Firehose
Amazon S3 “raw-time-series”
Amazon S3 “daily-average”

AWS Glue Job
Serverless, event-driven execution
Data is written out to S3
Output table is automatically
created in Amazon Athena
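The aggregation such a job performs can be pictured with a small plain-Python sketch. This is not Glue or Spark code; the record layout (a `devicets` timestamp string, a `deviceid`, and a `pct_inefficiency` value per record) is an assumption based on the column names used in the queries in this deck.

```python
from collections import defaultdict
from datetime import datetime


def daily_average_inefficiency(records):
    """Average pct_inefficiency per (deviceid, day) over an iterable of dicts."""
    totals = defaultdict(lambda: [0.0, 0])  # (deviceid, day) -> [sum, count]
    for rec in records:
        # Assumed timestamp format; adjust to whatever the devices emit.
        day = datetime.strptime(rec["devicets"], "%Y-%m-%d %H:%M:%S").date().isoformat()
        bucket = totals[(rec["deviceid"], day)]
        bucket[0] += rec["pct_inefficiency"]
        bucket[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


sample = [
    {"deviceid": 1, "devicets": "2017-10-10 13:20:15", "pct_inefficiency": 10.0},
    {"deviceid": 1, "devicets": "2017-10-10 18:05:00", "pct_inefficiency": 20.0},
    {"deviceid": 2, "devicets": "2017-10-10 09:00:00", "pct_inefficiency": 5.0},
]
print(daily_average_inefficiency(sample))
```

In the architecture described here, the equivalent output would be written to the “daily-average” location in S3.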

Kinesis Analytics for in-stream analytics
Kinesis Firehose
Amazon S3
Amazon S3 “daily-average”
Kinesis Analytics
Kinesis Firehose
Amazon S3 “results”
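The real-time question from the example (how many sensors are showing inefficiencies) amounts to a windowed count. The sketch below shows that logic in plain Python as a tumbling window; it is not Kinesis Analytics code, and the event tuple layout, window size, and threshold are all assumptions.

```python
def count_inefficient_per_window(events, window_seconds=60, threshold=0.0):
    """Count distinct devices above the threshold in each tumbling window.

    events: iterable of (epoch_seconds, deviceid, pct_inefficiency) tuples.
    Returns {window_start_epoch: distinct_device_count}, ordered by window start.
    """
    windows = {}
    for ts, deviceid, pct in events:
        if pct > threshold:
            # Align the timestamp to the start of its fixed-size window.
            start = ts - ts % window_seconds
            windows.setdefault(start, set()).add(deviceid)
    return {start: len(devices) for start, devices in sorted(windows.items())}


events = [(0, 1, 5.0), (30, 1, 7.0), (45, 2, 3.0), (61, 2, 0.0), (70, 3, 9.0)]
print(count_inefficient_per_window(events))
```

A device is counted once per window, no matter how many inefficient readings it reports in that window.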

“daily-agg” table with daily aggregation
KPI - Overall device daily inefficiency
SELECT (SUM(daily_avg_inefficiency) / COUNT(*)) AS all_device_avg_inefficiency, date
FROM awsblogsgluedemo.daily_avg_inefficiency
GROUP BY date;
“results” table
Top 10 most inefficient devices - event-level granularity
SELECT col0 AS "uuid", col1 AS "deviceid", col2 AS "devicets",
       col3 AS "temp", col4 AS "settemp", col5 AS "pct_inefficiency"
FROM awsblogsgluedemo.results
ORDER BY pct_inefficiency DESC
LIMIT 10;
“raw” table with raw data
Top 20 most active devices
SELECT deviceid, COUNT(*) AS num_events
FROM awsblogsgluedemo."raw"
GROUP BY deviceid
ORDER BY num_events DESC
LIMIT 20;
Events by Device ID
SELECT uuid, devicets, deviceid, temp
FROM awsblogsgluedemo."raw"
WHERE deviceid = 1
ORDER BY devicets DESC;
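To see what the KPI query returns without an AWS account, the sketch below runs it against an in-memory SQLite database from Python. SQLite is only a stand-in and is not Athena; the table layout (including the `deviceid` column) and the sample rows are assumptions made for the demonstration.

```python
import sqlite3

KPI_QUERY = """
SELECT (SUM(daily_avg_inefficiency) / COUNT(*)) AS all_device_avg_inefficiency, date
FROM awsblogsgluedemo.daily_avg_inefficiency
GROUP BY date;
"""


def run_kpi(rows):
    """Load (deviceid, daily_avg_inefficiency, date) rows and run the KPI query."""
    conn = sqlite3.connect(":memory:")
    # Attach a second in-memory database so the schema-qualified name resolves.
    conn.execute("ATTACH DATABASE ':memory:' AS awsblogsgluedemo")
    conn.execute(
        "CREATE TABLE awsblogsgluedemo.daily_avg_inefficiency "
        "(deviceid INTEGER, daily_avg_inefficiency REAL, date TEXT)"
    )
    conn.executemany(
        "INSERT INTO awsblogsgluedemo.daily_avg_inefficiency VALUES (?, ?, ?)", rows
    )
    result = {day: avg for avg, day in conn.execute(KPI_QUERY)}
    conn.close()
    return result


rows = [(1, 10.0, "2017-10-10"), (2, 20.0, "2017-10-10"), (1, 30.0, "2017-10-11")]
print(run_kpi(rows))
```

The query averages the per-device daily averages, so each device carries equal weight regardless of how many events it produced.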

Overall architecture
Kinesis Firehose
Amazon S3
Amazon S3 “daily-average”
Kinesis Analytics
Kinesis Firehose
Amazon S3 “results”

Characteristics
• Scale to hundreds of thousands of data sources
• Virtually infinite storage scalability
• Real-time and batch processing layers
• Interactive queries
• Highly available and durable
• Pay only for what you use
• No servers to manage

Very easy to try – existing template

ROHAN DHUPELIA | ANALYTICS PLATFORM MANAGER | @ROHANDHUPELIA
Building the Atlassian Data Lake

ATLASSIAN OVERVIEW
Planning & tracking, Messaging & communication, Organizing projects, Content collaboration, Code collaboration
Software Teams, IT Teams, Marketing Teams, Finance Teams, HR Teams
Meetings, Decisions, Mentions, Files, Convos, Reactions

Socrates
The Atlassian Data Lake
Image courtesy of © Bar Harel, CC BY-SA 4.0, Wikimedia Commons

The numbers
500+ TBs: Stored in the data lake
1B+ Events: Ingested into the data lake daily
100 Integrations: Providing analytical events
1000 Internal Users: Using the data lake daily

Ingest
Moving away from pull-based ingestion

Challenges with pull-based ingestion
Complex: Various technologies to maintain
Disruptive: Analytics extracts strain source systems
Brittle: As sources change, the pipelines break and need updating

Our Ingestion Journey: Late 2015
Web, CRM, Billing, Product
Kinesis, REST, JDBC, GraphQL
Socrates (Data Lake)

Our Ingestion Journey: Early 2016
Web, CRM, Billing, Product, Micro Services
Kinesis, REST, JDBC, Webhook, ODBC, SFTP, GraphQL
Socrates (Data Lake)

Our Ingestion Journey: Late 2016
Web, CRM, Billing, Product, Micro Services
Socrates (Data Lake)

Our Ingestion Journey: Early 2017
Web, CRM, Billing, Product, Micro Services, Other Micro Services, Other Enterprise Systems
StreamHub (Enterprise Bus)
Socrates (Data Lake)

What is StreamHub?
Event-Driven Architecture: Producers and subscribers integrate via events
Schema Registry: Validates that messages are compatible
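The idea of validating a message against a registered schema can be sketched in a few lines of Python. This is not StreamHub's implementation; the schema representation (field name mapped to an expected Python type, all fields required) and the sample event are hypothetical.

```python
def is_compatible(message: dict, schema: dict) -> bool:
    """Return True if every schema field is present with the expected type."""
    # A missing field yields None, which fails the isinstance check below.
    return all(
        isinstance(message.get(field), expected)
        for field, expected in schema.items()
    )


# Hypothetical schema and event, for illustration only.
comment_created = {"issue_id": int, "author": str}
print(is_compatible({"issue_id": 42, "author": "rohan"}, comment_created))
print(is_compatible({"issue_id": "42"}, comment_created))
```

A producer whose message fails such a check would be rejected before the event reaches subscribers.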

atlassian-socrates-raw-landed/
└── avi:jira:created:comment/
    └── day=2017-10-10/
        ├── events-13:20:15.479940.json.gz
        ├── events-13:21:23.479940.json.gz
        ├── events-13:21:52.479940.json.gz
        ├── events-13:23:37.479940.json.gz
        ├── events-13:23:56.479940.json.gz
        ├── events-13:24:15.479940.json.gz
        ├── events-13:24:21.479940.json.gz
        ├── events-13:25:34.479940.json.gz
        └── events-13:26:13.479940.json.gz
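The landed layout above follows a regular pattern: event type, a `day=` partition, then a timestamped gzip file. A small Python sketch of building such a key is shown below; the function name is hypothetical, and the pattern is inferred only from the listing rather than from Atlassian's actual code.

```python
from datetime import datetime


def landed_key(event_type: str, ts: datetime) -> str:
    """Build an object key of the form <event-type>/day=<date>/events-<time>.json.gz."""
    return f"{event_type}/day={ts:%Y-%m-%d}/events-{ts:%H:%M:%S.%f}.json.gz"


print(landed_key("avi:jira:created:comment", datetime(2017, 10, 10, 13, 20, 15, 479940)))
```

Partitioning by `day=` lets query engines skip whole days of data when a query filters on the date.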

atlassian-socrates-raw-published-stg1/
└── avi:jira:created:comment/
    └── day=2017-10-10/
        ├── <sub-partition>
        │   ├── events-part01.snappy.parquet
        │   └── events-part04.snappy.parquet
        └── <sub-partition>
            ├── events-part05.snappy.parquet
            └── events-part08.snappy.parquet

atlassian-socrates-raw-published-stg2/
└── avi:jira:created:comment/
    └── day=2017-10-10/
        ├── business_key_1
        │   └── events-part01.snappy.parquet
        └── business_key_2
            └── events-part01.snappy.parquet

Prepare
Cleansing and transforming our data

Challenges with preparation
Cluster Management: Clusters could be hard to upgrade, and costs were hard to attribute to jobs
Re-Inventing the Wheel: Lots of time spent re-implementing patterns to perform transformations
Data Engineering Bottleneck: Teams would rely on us to help them with their data transformation needs

Airflow
RAW / UNALTERED
JOB-SCOPED CLUSTERS
PREPARED / TRANSFORMED
CRM/Billing
Product/Web
Aggregated / Derived
Dimensional Model
User Defined Extracts
Support/Ops
Account / Chargeback
Upscale
Quarantine

Airflow DAG
Spin up a dedicated EMR cluster
Copy logs for debugging
Shut down EMR cluster
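The job-scoped cluster pattern can be sketched as a plain Python function: start a cluster for one job, then always copy the logs and shut the cluster down, even if the job fails. This is not Airflow or EMR code; the four callables are hypothetical hooks, and the step order is an assumption based on the labels above.

```python
def run_on_job_scoped_cluster(start_cluster, run_job, copy_logs, stop_cluster):
    """Run one job on a dedicated cluster and always clean up afterwards."""
    cluster_id = start_cluster()
    try:
        return run_job(cluster_id)
    finally:
        # Logs are copied first so they survive the shutdown; the inner
        # finally ensures the cluster stops even if copying logs fails.
        try:
            copy_logs(cluster_id)
        finally:
            stop_cluster(cluster_id)


steps = []
result = run_on_job_scoped_cluster(
    lambda: steps.append("start") or "cluster-1",
    lambda cid: steps.append("job") or "done",
    lambda cid: steps.append("logs"),
    lambda cid: steps.append("stop"),
)
print(result, steps)
```

Because each job owns its cluster, the cluster's cost can be attributed to that job and upgrades can be rolled out one job at a time.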

Transformation as a Service
TaaS

Organize
Storing, securing, and governing our data

Challenges with organizing data
Security: How can we provision buckets for teams who don’t want to face the AWS console head-on?
Categorizing Data: How can we structure our data lake in a way that will scale well?
Teams want flexibility: How do we give teams flexibility on how they organize themselves?

Areas of the data lake
Landed: Unaltered, Unformatted, Unmasked
Raw: Optimized, Partitioned, Masked
Modeled: Conformed dimensions, Standardized facts, aggregated/derived value
Self-Serve: BYO Data, User/Team managed
Self-Service Schemas: What gets provisioned
Provisions the components
• Create an S3 bucket, tagged to the user
• Create a schema in our metastore(s)
• Create an Active Directory group
We call them Zones
We used to call them “Playgrounds”, but they were often used for production loads
e.g. zone_marketing
Use Vault to control access rights
• A tool that manages secrets
• Creates a temporary IAM user (2 hours)
• Passes the credentials to the user
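The names that go with a zone follow a visible convention in this deck: a `zone_marketing` schema, an `atlassian-zone-marketing` bucket, and `zone-marketing-write` / `zone-marketing-read` Vault policies. The Python sketch below derives them from a team name; the function itself is hypothetical, and the convention is inferred from those examples rather than confirmed by the speakers.

```python
def zone_resources(team: str) -> dict:
    """Derive the resource names for a self-service zone from a team name."""
    slug = team.strip().lower().replace(" ", "-")
    return {
        # Schema names use underscores, bucket and policy names use hyphens.
        "schema": "zone_" + slug.replace("-", "_"),
        "bucket": f"atlassian-zone-{slug}",
        "vault_policies": [f"zone-{slug}-write", f"zone-{slug}-read"],
    }


print(zone_resources("Marketing"))
```

Deriving every name from one input keeps the bucket, schema, and access policies for a zone easy to match up.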

Self-Service Schemas: How users interact
Authenticate against Vault
$ vault auth -method=ldap username=<ad_username>
Password (will be hidden): <ad_password>
...
token_policies: [zone-marketing-write zone-marketing-read]
Retrieve your credentials
$ vault read aws/creds/zone-marketing-write
Key              Value
---              -----
lease_id         aws/creds/zone-marketing-write/e1x2a3m4p5l6e7
lease_duration   25h0m0s
lease_renewable  true
access_key       AKIAISANEXAMPLEKEYID
secret_key       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
security_token   <nil>

Self-Service Schemas: How users interact
Apply Credentials
$ aws configure
AWS Access Key ID [None]: AKIAISANEXAMPLEKEYID
AWS Secret Access Key [None]: 1r2a3n4d5o6m7s8t9r0i1n2g3O4f5C6o7u8r9s0e
Upload your file
$ aws s3 cp examplefile s3://atlassian-zone-bucketname
List your bucket
$ aws s3 ls s3://atlassian-zone-marketing/
PRE example_directory/
PRE another_example_directory/
2016-12-08 13:21:35 0 example_text_file.txt
2016-09-27 12:24:48 0 example_csv_file.csv

Discover
Finding, understanding, and exploring data

Challenges with data discovery
Managing query engines: Query engine usage is unpredictable, and doing a bad job blocks analysts
Finding data: Difficult to know which table to trust or to use for what purpose
Teams want options: Different visualization tools better suit different needs

Visual Layer: Tableau, R Shiny, Zeppelin Notebooks, Redash
Interactive Layer: Amazon Athena, Presto (EMR), Spark/Hive (EMR)
Metastore Layer: Hive Metastore, AWS Glue Metastore
Storage Layer: Raw Buckets, Model Buckets, Zone Buckets (Self-Service)

Before: Presto
• Many failed queries
• Difficulties upgrading
• Hard to secure
After: Amazon Athena
• Ability to attribute costs
• Less infrastructure/operational overhead
• Not paying for what we don’t use
• Uses bucket security policies

Challenges with Amazon Athena
No AD Authentication: Only access via JDBC to begin with, using keys
Cost Management: Costs need to be monitored to spot any unusual spikes
Early Adopter Pains: There wasn’t parity with Presto to begin with

Visualization Stack
Tableau: Interactive exploration on core data sets and corporate dashboards
R Shiny: Web apps and standalone dashboards
Zeppelin Notebooks: Web-based notebooks
Redash: Quick queries and visualizations on all data

Key Takeaways
It’s not just flicking on a switch: You can’t just turn on AWS components and have an instant data lake
AWS helps you move up the value chain: Using AWS helps you focus on areas where you can be adding value

ROHAN DHUPELIA | ANALYTICS PLATFORM MANAGER | @ROHANDHUPELIA
Thank you!
