© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
AWS re:Invent
Architecting a data lake with Amazon S3, Amazon Kinesis, AWS Glue and Amazon Athena
Rohan Dhupelia, Analytics Platform Manager, Atlassian
Abhishek Sinha, Senior Product Manager, Amazon Athena
ABD318
Agenda
• Characteristics of a data lake
• The basic components of building a data lake and services corresponding to each
• Example 1: Building a data lake to unify real-time and batch data processing needs
• Example 2: The Atlassian self-service data lake
Exponential growth in data
Reasons for building a data lake
Transactions
ERP
Sensor Data
Billing
Web logs
Social
Infrastructure logs
Diversified consumers
Data Scientists
Business Analysts
External Consumers
Applications
Multiple access mechanisms
API Access
BI Tools
Notebooks
Characteristics of a data lake
• Collect anything
• Dive in anywhere
• Flexible access
• Future proof
Amazon S3 as the data lake
Simplified architectural view
Data sources (transactions, web logs / cookies, ERP, connected devices) → ingestion mechanism → Amazon S3 → process → consume
There are lots of ingestion tools
• S3 Transfer Acceleration
Variety of data processing tools
• Amazon AI: ML/DL services
• Amazon Athena: interactive query
• Amazon EMR: managed Hadoop and Spark
• Amazon Redshift + Spectrum: petabyte-scale data warehousing
• Amazon Elasticsearch: real-time log analytics and search
And multiple ways to consume the data
• Amazon QuickSight: fast, easy-to-use cloud BI
• Analytic notebooks: Jupyter, Zeppelin, HUE
• Amazon API Gateway: programmatic access
Because data is never perfect
• Clean
• Transform
• Concatenate
• Convert to better formats
• Schedule transformations
• Event-driven transformations
• Transformations expressed as code
• AWS Lambda: trigger-based code execution
• AWS Glue: event-based serverless ETL engine
• Amazon EMR: Spark and Hive running on EMR
ETL when you need it
Metadata?
AWS Glue Data Catalog
Central metadata catalog for the data lake
One per account
Allows you to share metadata between Amazon Athena, Amazon Redshift Spectrum, EMR, and JDBC sources
We added a few extensions:
• Search over metadata for data discovery
• Connection info: JDBC URLs, credentials
• Classification for identifying and parsing files
• Versioning of table metadata as schemas evolve and other metadata are updated
AWS Glue Data Catalog: Crawlers
Helping catalog your data
Crawlers automatically build your Data Catalog and keep it in sync
Automatically discover new data and extract schema definitions
• Detect schema changes and version tables
• Detect Hive-style partitions on Amazon S3
Built-in classifiers for popular types; custom classifiers using Grok expressions
Run ad hoc or on a schedule; serverless, so you only pay when the crawler runs
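To make "Hive-style partitions" concrete: path segments of the form key=value are read as partition columns. The sketch below illustrates that idea in plain Python. It is not the Glue crawler's implementation, and the function name and sample object key are assumptions made for illustration.

```python
from typing import Dict


def hive_partitions(object_key: str) -> Dict[str, str]:
    """Extract key=value path segments (Hive-style partitions) from an S3 object key."""
    partitions: Dict[str, str] = {}
    # The last segment is the file name, so only directory segments are inspected.
    for segment in object_key.split("/")[:-1]:
        key, sep, value = segment.partition("=")
        if sep and key and value:
            partitions[key] = value
    return partitions


# Hypothetical key layout, used only for illustration.
example = hive_partitions("sensors/year=2017/month=11/day=28/part-0000.json.gz")
```

Keys without any key=value segment simply yield no partitions, which matches the case of an unpartitioned table.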
AWS Glue Data Catalog
Data Catalog – Table Details
• Table schema
• Table properties
• Data statistics
• Nested fields
Data Catalog: Version Control
• List of table versions
• Compare schema versions
Automatic Partition Detection
• Automatically register available partitions
• Table partitions
A central metadata store for your lake
• AWS Glue Data Catalog: Hive-compatible metastore
Real-time (in-stream) processing
• Spark Streaming and Flink on Amazon EMR
• Amazon Kinesis Analytics
Write once, catalog once, read multiple, ETL anywhere
Characteristics of a data lake
• Collect anything
• Dive in anywhere
• Flexible access
• Future proof
Let’s take an example
Sensor/IoT device producing record-level data
Business Questions
1. What is going on with a specific sensor
2. Daily aggregations (device, inefficiencies, average temperature)
3. A real-time view of how many sensors are showing inefficiencies
Operations
1. Scale
2. High availability
3. Less management overhead
4. Pay for what I need
Let’s push this data into Kinesis
Kinesis Firehose → Amazon S3 → Amazon Athena
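A minimal sketch of the producer side, assuming sensor readings are sent as newline-delimited JSON. The field names (uuid, deviceid, devicets, temp, settemp) come from the queries later in this deck; the sample values and helper names are made up. Firehose's PutRecordBatch accepts at most 500 records per call, so readings are grouped accordingly. The actual send is left as a comment so the block runs without AWS access.

```python
import json
from typing import Dict, Iterable, Iterator, List

MAX_RECORDS_PER_BATCH = 500  # PutRecordBatch limit on records per call


def to_firehose_record(reading: Dict) -> Dict[str, bytes]:
    """Serialize one reading as a newline-terminated JSON record."""
    return {"Data": (json.dumps(reading) + "\n").encode("utf-8")}


def batches(readings: Iterable[Dict], size: int = MAX_RECORDS_PER_BATCH) -> Iterator[List[Dict[str, bytes]]]:
    """Group serialized readings into lists of at most `size` records."""
    batch: List[Dict[str, bytes]] = []
    for reading in readings:
        batch.append(to_firehose_record(reading))
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


# Made-up sample readings, for illustration only.
readings = [
    {"uuid": f"u-{i}", "deviceid": i % 10, "devicets": "2017-11-28 10:00:00",
     "temp": 70 + i % 5, "settemp": 70}
    for i in range(1200)
]
batch_sizes = [len(b) for b in batches(readings)]
# With boto3, each batch could then be sent roughly as:
#   boto3.client("firehose").put_record_batch(
#       DeliveryStreamName="<your-stream>", Records=batch)
```

The trailing newline per record keeps the objects Firehose writes to S3 readable one JSON document per line.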
Querying it in Amazon Athena
Either create a crawler to auto-generate the schema
OR
Write DDL via the Athena console, API, or JDBC/ODBC driver
Start querying data
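For the DDL route, the statement might look like what the helper below generates. This is a sketch under assumptions: the column types, the table name raw_events, and the S3 location are hypothetical, while the database name awsblogsgluedemo and the column names are taken from the queries later in this deck. It assumes JSON input read with the OpenX JSON SerDe.

```python
from typing import Dict


def create_table_ddl(database: str, table: str, columns: Dict[str, str], location: str) -> str:
    """Build a CREATE EXTERNAL TABLE statement for JSON data stored in S3."""
    cols = ",\n".join(f"  `{name}` {sql_type}" for name, sql_type in columns.items())
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {database}.{table} (\n"
        f"{cols}\n"
        ")\n"
        "ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'\n"
        f"LOCATION '{location}'"
    )


# Column types and the bucket path are assumptions for illustration.
ddl = create_table_ddl(
    database="awsblogsgluedemo",
    table="raw_events",
    columns={"uuid": "string", "deviceid": "int", "devicets": "string",
             "temp": "int", "settemp": "int"},
    location="s3://example-bucket/raw-time-series/",
)
```

The resulting string can be pasted into the Athena console or submitted through the API or a JDBC/ODBC connection.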
Query daily aggregates in Amazon Athena
Kinesis Firehose → Amazon S3 (“raw-time-series”) → Amazon S3 (“daily-average”) → Amazon Athena
AWS Glue Job
• Serverless, event-driven execution
• Data is written out to S3
• Output table is automatically created in Amazon Athena
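The aggregation such a job performs can be sketched in plain Python. This is not the Glue job from the talk: it only illustrates a per-device, per-day average. The definition of pct_inefficiency (relative deviation of temp from settemp, in percent) is an assumption, as are the sample events.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def pct_inefficiency(temp: float, settemp: float) -> float:
    # Assumed definition: relative deviation from the set point, in percent.
    return abs(temp - settemp) / settemp * 100.0


def daily_avg_inefficiency(events: Iterable[Dict]) -> Dict[Tuple[int, str], float]:
    """Average pct_inefficiency per (deviceid, day)."""
    sums: Dict[Tuple[int, str], float] = defaultdict(float)
    counts: Dict[Tuple[int, str], int] = defaultdict(int)
    for e in events:
        day = e["devicets"][:10]  # 'YYYY-MM-DD' prefix of the timestamp string
        key = (e["deviceid"], day)
        sums[key] += pct_inefficiency(e["temp"], e["settemp"])
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


# Made-up events, for illustration only.
events = [
    {"deviceid": 1, "devicets": "2017-11-28 10:00:00", "temp": 77, "settemp": 70},
    {"deviceid": 1, "devicets": "2017-11-28 18:00:00", "temp": 63, "settemp": 70},
    {"deviceid": 2, "devicets": "2017-11-28 10:00:00", "temp": 70, "settemp": 70},
]
daily = daily_avg_inefficiency(events)
```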
Kinesis Analytics for in-stream analytics
Kinesis Firehose → Amazon S3 (“raw-time-series”) → Amazon S3 (“daily-average”) → Amazon Athena
Kinesis Firehose → Kinesis Analytics → Kinesis Firehose → Amazon S3 (“results”) → Amazon Athena
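The in-stream question ("how many sensors are showing inefficiencies right now") is a windowed count. Kinesis Analytics expresses this in SQL over the stream; the Python stand-in below only illustrates the tumbling-window logic. The field epoch_seconds, the 60-second window, and the 5 percent threshold are assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def inefficient_devices_per_window(
    events: Iterable[Dict], window_seconds: int = 60, threshold_pct: float = 5.0
) -> Dict[int, int]:
    """Count distinct devices above the threshold in each tumbling window."""
    devices: Dict[int, Set[int]] = defaultdict(set)
    for e in events:
        # Align the event time to the start of its window.
        window_start = e["epoch_seconds"] - e["epoch_seconds"] % window_seconds
        if e["pct_inefficiency"] > threshold_pct:
            devices[window_start].add(e["deviceid"])
    return {start: len(ids) for start, ids in sorted(devices.items())}


# Made-up stream, for illustration only.
stream = [
    {"deviceid": 1, "epoch_seconds": 0, "pct_inefficiency": 8.0},
    {"deviceid": 1, "epoch_seconds": 30, "pct_inefficiency": 9.0},
    {"deviceid": 3, "epoch_seconds": 10, "pct_inefficiency": 7.5},
    {"deviceid": 2, "epoch_seconds": 45, "pct_inefficiency": 2.0},
    {"deviceid": 2, "epoch_seconds": 61, "pct_inefficiency": 9.0},
]
window_counts = inefficient_devices_per_window(stream)
```

A device that reports several inefficient readings in one window is counted once, since the question is about sensors rather than events.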
“raw” table with raw data

Top 20 most active devices
SELECT deviceid, COUNT(*) AS num_events
FROM awsblogsgluedemo."raw"
GROUP BY deviceid
ORDER BY num_events DESC
LIMIT 20;

Events by Device ID
SELECT uuid, devicets, deviceid, temp
FROM awsblogsgluedemo."raw"
WHERE deviceid = 1
ORDER BY devicets DESC;

“daily-agg” table with daily aggregation

KPI - Overall device daily inefficiency
SELECT (SUM(daily_avg_inefficiency) / COUNT(*)) AS all_device_avg_inefficiency, date
FROM awsblogsgluedemo.daily_avg_inefficiency
GROUP BY date;

“result” table

Top 10 most inefficient devices - event-level granularity
SELECT col0 AS "uuid", col1 AS "deviceid", col2 AS "devicets",
       col3 AS "temp", col4 AS "settemp", col5 AS "pct_inefficiency"
FROM awsblogsgluedemo.results
ORDER BY pct_inefficiency DESC
LIMIT 10;
Overall architecture
Kinesis Firehose → Amazon S3 (“raw-time-series”) → Amazon S3 (“daily-average”) → Amazon Athena
Kinesis Firehose → Kinesis Analytics → Kinesis Firehose → Amazon S3 (“results”) → Amazon Athena
Characteristics
• Scale to hundreds of thousands of data sources
• Virtually infinite storage scalability
• Real-time and batch processing layers
• Interactive queries
• Highly available and durable
• Pay only for what you use
• No servers to manage
Very easy to try – existing template
ROHAN DHUPELIA | ANALYTICS PLATFORM MANAGER | @ROHANDHUPELIA
Building the Atlassian Data Lake
ATLASSIAN OVERVIEW
Software teams, IT teams, marketing teams, finance teams, HR teams
Planning & tracking, messaging & communication, organizing projects, content collaboration, code collaboration
Meetings, decisions, mentions, files, convos, reactions
Socrates
The Atlassian Data Lake
Image courtesy of © Bar Harel, CC BY-SA 4.0, Wikimedia Commons
The numbers
• 500+ TB stored in the data lake
• 1B+ events ingested into the data lake daily
• 100 integrations providing analytical events
• 1000 internal users using the data lake daily
Data lake services
Ingest
Moving away from pull-based ingestion
Challenges with pull-based ingestion
• Complex: various technologies to maintain
• Brittle: as sources change, the pipelines break and need updating
• Disruptive: analytics extracts strain sourcing systems
Our Ingestion Journey
Late 2015
• Sources: Web, CRM, Billing, Product
• Interfaces into Socrates (Data Lake): Kinesis, REST, JDBC, GraphQL
Early 2016
• Sources: Web, CRM, Billing, Product, Micro Services
• Interfaces into Socrates (Data Lake): Kinesis, REST, JDBC, Webhook, ODBC, SFTP, GraphQL
Late 2016
• Sources: Web, CRM, Billing, Product, Micro Services
• Target: Socrates (Data Lake)
Early 2017
• Sources: Web, CRM, Billing, Product, Micro Services, Other Micro Services, Other Enterprise Systems
• StreamHub (Enterprise Bus)
• Target: Socrates (Data Lake)
What is StreamHub?
• Event-driven architecture: producers and subscribers integrate via events
• Schema registry: validates that messages are compatible
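What "validates that messages are compatible" could mean in practice is sketched below: each event is checked against a registered schema before it is accepted. This is a hypothetical illustration, not StreamHub's API; the schema, field names, and function are all assumptions.

```python
from typing import Any, Dict, List

# Hypothetical registered schema: field name -> required Python type.
COMMENT_CREATED_SCHEMA: Dict[str, type] = {"event_id": str, "issue_id": int, "body": str}


def validation_errors(message: Dict[str, Any], schema: Dict[str, type]) -> List[str]:
    """Return a list of problems; an empty list means the message is compatible."""
    errors: List[str] = []
    for field, expected in schema.items():
        if field not in message:
            errors.append(f"missing field: {field}")
        elif not isinstance(message[field], expected):
            errors.append(f"wrong type for {field}: expected {expected.__name__}")
    return errors


good = validation_errors({"event_id": "e1", "issue_id": 42, "body": "hi"}, COMMENT_CREATED_SCHEMA)
bad = validation_errors({"event_id": "e2", "body": 7}, COMMENT_CREATED_SCHEMA)
```

Extra fields are allowed here, which lets producers add attributes without breaking existing subscribers.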
How do we land it?
atlassian-socrates-raw-landed/
└── avi:jira:created:comment/
    └── day=2017-10-10/
        ├── events-13:20:15.479940.json.gz
        ├── events-13:21:23.479940.json.gz
        ├── events-13:21:52.479940.json.gz
        ├── events-13:23:37.479940.json.gz
        ├── events-13:23:56.479940.json.gz
        ├── events-13:24:15.479940.json.gz
        ├── events-13:24:21.479940.json.gz
        ├── events-13:25:34.479940.json.gz
        └── events-13:26:13.479940.json.gz

atlassian-socrates-raw-published-stg1/
└── avi:jira:created:comment/
    └── day=2017-10-10/
        ├── <sub-partition>/
        │   ├── events-part01.snappy.parquet
        │   ├── events-part02.snappy.parquet
        │   ├── events-part03.snappy.parquet
        │   └── events-part04.snappy.parquet
        └── <sub-partition>/
            ├── events-part05.snappy.parquet
            ├── events-part06.snappy.parquet
            ├── events-part07.snappy.parquet
            └── events-part08.snappy.parquet

atlassian-socrates-raw-published-stg2/
└── avi:jira:created:comment/
    └── day=2017-10-10/
        ├── business_key_1/
        │   └── events-part01.snappy.parquet
        └── business_key_2/
            └── events-part01.snappy.parquet
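The landed layout above follows a predictable pattern: event type, then a day= partition, then a time-stamped file. A minimal sketch of building such a key is shown below; the function name is an assumption, and the pattern is inferred from the listing rather than taken from Atlassian's code.

```python
from datetime import datetime


def landed_key(event_type: str, received_at: datetime) -> str:
    """Build an object key of the form <event_type>/day=<date>/events-<time>.json.gz."""
    day = received_at.strftime("%Y-%m-%d")
    stamp = received_at.strftime("%H:%M:%S.%f")  # %f is zero-padded microseconds
    return f"{event_type}/day={day}/events-{stamp}.json.gz"


key = landed_key("avi:jira:created:comment", datetime(2017, 10, 10, 13, 20, 15, 479940))
```

Keeping the day as a key=value segment is what lets crawlers and query engines treat it as a partition column.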
Prepare
Cleansing and transforming our data
Challenges with preparation
• Cluster management: clusters could be hard to upgrade, and costs were hard to attribute to jobs
• Re-inventing the wheel: lots of time spent re-implementing patterns to perform transformations
• Data engineering bottleneck: teams would rely on us to help them with their data transformation needs
Airflow
Raw / unaltered → job-scoped clusters → prepared / transformed
• Data sets shown: CRM/Billing, Product/Web, Support/Ops, Aggregated / Derived, Dimensional Model, User Defined Extracts
• Other labels: Account / Chargeback, Upscale, Quarantine
Airflow DAG
• Spin up a dedicated EMR cluster
• Copy logs for debugging
• Shut down EMR cluster
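The job-scoped cluster pattern (spin up, run, copy logs, shut down) can be sketched as a context manager so that cleanup happens even when the job fails. This is an illustration only: FakeClusterClient and its methods are hypothetical stand-ins that record calls, not the EMR or Airflow API.

```python
from contextlib import contextmanager
from typing import Iterator, List


class FakeClusterClient:
    """Stand-in that records calls; a real version would call the EMR API."""

    def __init__(self) -> None:
        self.calls: List[str] = []

    def create_cluster(self, job_name: str) -> str:
        self.calls.append(f"create:{job_name}")
        return f"cluster-for-{job_name}"

    def copy_logs(self, cluster_id: str) -> None:
        self.calls.append(f"copy_logs:{cluster_id}")

    def terminate(self, cluster_id: str) -> None:
        self.calls.append(f"terminate:{cluster_id}")


@contextmanager
def job_scoped_cluster(client: FakeClusterClient, job_name: str) -> Iterator[str]:
    """Create a cluster for one job; always copy logs and terminate afterwards."""
    cluster_id = client.create_cluster(job_name)
    try:
        yield cluster_id
    finally:
        try:
            client.copy_logs(cluster_id)
        finally:
            # Terminate even if copying logs fails, so no cluster is left running.
            client.terminate(cluster_id)


client = FakeClusterClient()
try:
    with job_scoped_cluster(client, "daily_rollup") as cluster_id:
        client.calls.append(f"run:{cluster_id}")
        raise RuntimeError("job failed")
except RuntimeError:
    pass
```

Because each job owns its cluster, its cost can be attributed to that job, and a failed job cannot leave a shared cluster in a bad state.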
Transformation as a Service
TaaS
Organize
Storing, securing, and governing our data
Challenges with organizing data
• Security: how can we provision buckets for teams who don’t want to face the AWS console head-on?
• Categorizing data: how can we structure our data lake in a way that will scale well?
• Teams want flexibility: how do we give teams flexibility on how they organize themselves?
Areas of the data lake
• Landed: unaltered, unformatted, unmasked
• Raw: optimized, partitioned, masked
• Modeled: conformed dimensions, standardized facts, aggregated/derived value
• Self-Serve: BYO data, user/team managed
Request a Schema…
Self-Service Schemas: what gets provisioned
Provisions the components
• Create an S3 bucket, tagged to the user
• Create a schema in our metastore(s)
• Create an Active Directory group
We call them Zones
We used to call them “Playgrounds”, but they were often used for production loads
e.g. zone_marketing
Use Vault to control access rights
• A tool that manages secrets
• Creates a temporary IAM user (2 hours)
• Passes the credentials to the user
Self-Service Schemas: how users interact
Authenticate against Vault
$ vault auth -method=ldap username=<ad_username>
Password (will be hidden): <ad_password>
...
token_policies: [zone-marketing-write zone-marketing-read]
Retrieve your credentials
$ vault read aws/creds/zone-marketing-write
Key              Value
---              -----
lease_id         aws/creds/zone-marketing-write/e1x2a3m4p5l6e7
lease_duration   25h0m0s
lease_renewable  true
access_key       AKIAISANEXAMPLEKEYID
secret_key       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
security_token   <nil>
Self-Service Schemas: how users interact
Apply credentials
$ aws configure
AWS Access Key ID [None]: AKIAISANEXAMPLEKEYID
AWS Secret Access Key [None]: 1r2a3n4d5o6m7s8t9r0i1n2g3O4f5C6o7u8r9s0e
List your bucket
$ aws s3 ls s3://atlassian-zone-marketing/
                           PRE example_directory/
                           PRE another_example_directory/
2016-12-08 13:21:35          0 example_text_file.txt
2016-09-27 12:24:48          0 example_csv_file.csv
Upload your file
$ aws s3 cp examplefile s3://atlassian-zone-bucketname
Discover
Finding, understanding, and exploring data
Challenges with data discovery
• Managing query engines: query engine usage is unpredictable, and doing a bad job blocks analysts
• Finding data: difficult to know which table to trust or to use for what purpose
• Teams want options: different visualization tools better suit different needs
• Visual layer: Tableau, R Shiny, Zeppelin Notebooks, Redash
• Interactive layer: Amazon Athena, Presto on EMR, Spark/Hive on EMR
• Metastore layer: Hive Metastore, AWS Glue Metastore
• Storage layer: Raw Buckets, Model Buckets, Zone Buckets (Self-Service)
Before: Presto
• Many failed queries
• Difficulties upgrading
• Hard to secure
After: Amazon Athena
• Ability to attribute costs
• Less infrastructure/operational overhead
• Not paying for what we don’t use
• Uses bucket security policies
Challenges with Amazon Athena
• No AD authentication: only access via JDBC to begin with, using keys
• Cost management: costs need to be monitored to spot any unusual spikes
• Early adopter pains: there wasn’t parity with Presto to begin with
Visualization Stack
• Tableau: interactive exploration on core data sets and corporate dashboards
• R Shiny: web apps and standalone dashboards
• Zeppelin Notebooks: web-based notebooks
• Redash: quick queries and visualizations on all data
Search the Data Catalog
Key Takeaways
• It’s not just flicking on a switch: you can’t just turn on AWS components and have an instant data lake
• AWS helps you move up the value chain: using AWS helps you focus on areas where you can be adding value
Thank you!

OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...Amazon Web Services
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsAmazon Web Services
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareAmazon Web Services
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSAmazon Web Services
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAmazon Web Services
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareAmazon Web Services
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWSAmazon Web Services
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckAmazon Web Services
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without serversAmazon Web Services
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...Amazon Web Services
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceAmazon Web Services
 

Plus de Amazon Web Services (20)

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 

ABD318_Architecting a data lake with Amazon S3, Amazon Kinesis, AWS Glue and Amazon Athena

  • 1. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. AWS re:Invent. Architecting a data lake with Amazon S3, Amazon Kinesis, AWS Glue and Amazon Athena. Rohan Dhupelia, Analytics Platform Manager, Atlassian. Abhishek Sinha, Senior Product Manager, Amazon Athena. ABD318
  • 2. Agenda • Characteristics of a data lake • The basic components of building a data lake and services corresponding to each • Example 1: Building a data lake to unify real-time and batch data processing needs • Example 2: The Atlassian self-service data lake
  • 3. Reasons for building a data lake. Exponential growth in data: Transactions, ERP, Sensor Data, Billing, Web logs, Social, Infrastructure logs
  • 4. Reasons for building a data lake. Exponential growth in data: Transactions, ERP, Sensor Data, Billing, Web logs, Social, Infrastructure logs. Diversified consumers: Data Scientists, Business Analysts, External Consumers, Applications
  • 5. Reasons for building a data lake. Exponential growth in data: Transactions, ERP, Sensor Data, Billing, Web logs, Social, Infrastructure logs. Diversified consumers: Data Scientists, Business Analysts, External Consumers, Applications. Multiple access mechanisms: API Access, BI Tools, Notebooks
  • 6. Characteristics of a data lake: Future Proof, Flexible Access, Dive in Anywhere, Collect Anything
  • 7. Amazon S3 as the data lake
  • 8. Simplified architectural view: Data sources (Transactions, Web logs / cookies, ERP, Connected devices); Ingestion mechanism; Amazon S3; Process; Consume
  • 9. There are lots of ingestion tools: Data sources (Transactions, Web logs / cookies, ERP, Connected devices); S3 Transfer Acceleration; Amazon S3; Process; Consume
  • 10. Variety of data processing tools: Amazon S3; S3 Transfer Acceleration; Amazon AI (ML/DL Services); Amazon Athena (Interactive Query); Amazon EMR (Managed Hadoop & Spark); Amazon Redshift + Spectrum (Petabyte-scale Data Warehousing); Amazon Elasticsearch (Real-time log analytics & search); Data sources (Transactions, Web logs / cookies, ERP, Connected devices); Consume
  • 11. And multiple ways to consume the data: Amazon S3; S3 Transfer Acceleration; Amazon AI (ML/DL Services); Amazon Athena (Interactive Query); Amazon EMR (Managed Hadoop & Spark); Amazon Redshift + Spectrum (Petabyte-scale Data Warehousing); Amazon Elasticsearch (Real-time log analytics & search); Data sources (Transactions, Web logs / cookies, ERP, Connected devices); Amazon QuickSight (Fast, easy to use, cloud BI); Analytic Notebooks (Jupyter, Zeppelin, HUE); Amazon API Gateway (Programmatic Access)
  • 12. Because data is never perfect. Tools: AWS Lambda (Trigger-based Code Execution); AWS Glue (Event-based serverless ETL engine); Amazon EMR (Spark and Hive running on EMR). Tasks: Clean; Transform; Concatenate; Convert to better formats; Schedule transformations; Event-driven transformations; Transformations expressed as code
  • 13. ETL when you need it: Amazon S3; S3 Transfer Acceleration; Amazon AI (ML/DL Services); Amazon Athena (Interactive Query); Amazon EMR (Managed Hadoop & Spark); Amazon Redshift + Spectrum (Petabyte-scale Data Warehousing); Amazon Elasticsearch (Real-time log analytics & search); Data sources (Transactions, Web logs / cookies, ERP, Connected devices); Amazon QuickSight (Fast, easy to use, cloud BI); Analytic Notebooks (Jupyter, Zeppelin, HUE); API Gateway (Programmatic Access)
  • 14. Metadata? AWS Glue Data Catalog: central metadata catalog for the data lake. One per account. Allows you to share metadata between Amazon Athena, Amazon Redshift Spectrum, EMR & JDBC sources. We added a few extensions: • Search over metadata for data discovery • Connection info: JDBC URLs, credentials • Classification for identifying and parsing files • Versioning of table metadata as schemas evolve and other metadata are updated
  • 15. AWS Glue Data Catalog: Crawlers. Helping catalog your data. Crawlers automatically build your Data Catalog and keep it in sync. Automatically discover new data and extract schema definitions • Detect schema changes and version tables • Detect Hive-style partitions on Amazon S3. Built-in classifiers for popular types; custom classifiers using Grok expressions. Run ad hoc or on a schedule; serverless: only pay when the crawler runs
  • 16. AWS Glue Data Catalog
  • 17. Data Catalog: Table Details. Table schema; Table properties; Data statistics; Nested fields
  • 18. Data Catalog: Version Control. List of table versions; Compare schema versions
  • 19. Automatic Partition Detection. Automatically register available partitions; Table partitions
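The Hive-style partition detection mentioned on the crawler slides can be illustrated with a minimal Python sketch. This is not how AWS Glue crawlers are implemented; it only shows the naming convention they look for, where a path segment of the form `name=value` is read as a partition column. The function names and example keys are hypothetical.

```python
from collections import defaultdict


def parse_partitions(key: str) -> dict:
    """Return the Hive-style partition columns (name=value segments) in an object key."""
    partitions = {}
    for segment in key.split("/")[:-1]:  # the last segment is the file name
        name, sep, value = segment.partition("=")
        if sep and name and value:
            partitions[name] = value
    return partitions


def group_by_partition(keys):
    """Group object keys by partition values, roughly what a catalog registers."""
    groups = defaultdict(list)
    for key in keys:
        groups[tuple(sorted(parse_partitions(key).items()))].append(key)
    return dict(groups)


# Hypothetical keys, for illustration only.
example_keys = [
    "events/day=2017-10-10/part-0001.json.gz",
    "events/day=2017-10-10/part-0002.json.gz",
    "events/day=2017-10-11/part-0001.json.gz",
]
partitions_found = group_by_partition(example_keys)
```

With the keys above, two partitions are found (one per `day` value), and each new `day=` prefix would show up as a new group.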
  • 20. A central metadata store for your lake: Amazon S3; S3 Transfer Acceleration; Amazon AI (ML/DL Services); Amazon Athena (Interactive Query); Amazon EMR (Managed Hadoop & Spark); Amazon Redshift + Spectrum (Petabyte-scale Data Warehousing); Amazon Elasticsearch (Real-time log analytics & search); Data sources (Transactions, Web logs / cookies, ERP, Connected devices); Amazon QuickSight (Fast, easy to use, cloud BI); Analytic Notebooks (Jupyter, Zeppelin, HUE); API Gateway (Programmatic Access); AWS Glue Data Catalog (Hive-compatible Metastore)
  • 21. Real-time (in-stream processing): Amazon S3; S3 Transfer Acceleration; Amazon AI (ML/DL Services); Amazon Athena (Interactive Query); Amazon EMR (Managed Hadoop & Spark); Amazon Redshift + Spectrum (Petabyte-scale Data Warehousing); Amazon Elasticsearch (Real-time log analytics & search); Data sources (Transactions, Web logs / cookies, ERP, Connected devices); Amazon QuickSight (Fast, easy to use, cloud BI); Analytic Notebooks (Jupyter, Zeppelin, HUE); API Gateway (Programmatic Access); AWS Glue Data Catalog (Hive-compatible Metastore); Spark Streaming & Flink on EMR; Amazon Kinesis Analytics
  • 22. Write once, catalog once, read multiple, ETL anywhere: Amazon S3; S3 Transfer Acceleration; Amazon AI (ML/DL Services); Amazon Athena (Interactive Query); Amazon EMR (Managed Hadoop & Spark); Amazon Redshift + Spectrum (Petabyte-scale Data Warehousing); Amazon Elasticsearch (Real-time log analytics & search); Data sources (Transactions, Web logs / cookies, ERP, Connected devices); Amazon QuickSight (Fast, easy to use, cloud BI); Analytic Notebooks (Jupyter, Zeppelin, HUE); API Gateway (Programmatic Access); AWS Glue Data Catalog (Hive-compatible Metastore); Spark Streaming & Flink on EMR; Amazon Kinesis Analytics
  • 23. Characteristics of a data lake: Future Proof, Flexible Access, Dive in Anywhere, Collect Anything
  • 24. Let’s take an example. Sensor/IoT device producing record-level data. Business questions: 1. What is going on with a specific sensor 2. Daily aggregations (device, inefficiencies, average temperature) 3. A real-time view of how many sensors are showing inefficiencies. Operations: 1. Scale 2. High availability 3. Less management overhead 4. Pay for what I need
  • 25. Let’s push this data into Kinesis: Kinesis Firehose; Amazon S3; Amazon Athena
  • 26.
  • 27. Querying it in Amazon Athena. Either create a crawler to auto-generate the schema, or write DDL on the Athena console/API/JDBC/ODBC driver. Start querying data
  • 28. Query daily aggregates in Amazon Athena: Kinesis Firehose; Amazon S3 (“raw-time-series”); Amazon S3 (“daily-average”); Amazon Athena
  • 29. AWS Glue Job. Serverless, event-driven execution. Data is written out to S3. Output table is automatically created in Amazon Athena
  • 30. Query daily aggregates in Amazon Athena: Kinesis Firehose; Amazon S3 (“raw-time-series”); Amazon S3 (“daily-average”); Amazon Athena
  • 31. Kinesis Analytics for in-stream analytics: Kinesis Firehose; Amazon S3 (“raw-time-series”); Amazon S3 (“daily-average”); Amazon Athena; Kinesis Analytics; Kinesis Firehose; Amazon S3 (“results”)
  • 32. “daily-agg” table with daily aggregation
      KPI: Overall device daily inefficiency
        SELECT ( SUM(daily_avg_inefficiency)/COUNT(*) ) AS all_device_avg_inefficiency, date
        FROM awsblogsgluedemo.daily_avg_inefficiency
        GROUP BY date;
    “result” table
      Top 10 most inefficient devices, event-level granularity
        SELECT col0 AS "uuid", col1 AS "deviceid", col2 AS "devicets", col3 AS "temp", col4 AS "settemp", col5 AS "pct_inefficiency"
        FROM awsblogsgluedemo.results
        ORDER BY pct_inefficiency DESC
        LIMIT 10;
    “raw” table with raw data
      Top 20 most active devices
        SELECT deviceid, COUNT(*) AS num_events
        FROM awsblogsgluedemo."raw"
        GROUP BY deviceid
        ORDER BY num_events DESC
        LIMIT 20;
      Events by Device ID
        SELECT uuid, devicets, deviceid, temp
        FROM awsblogsgluedemo."raw"
        WHERE deviceid = 1
        ORDER BY devicets DESC;
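To make the daily KPI query concrete, here is a small sketch that runs the same aggregation locally with Python's built-in `sqlite3` module rather than Amazon Athena. The table columns other than `daily_avg_inefficiency` and `date`, and all of the rows, are assumptions made up for illustration; SQLite's SQL dialect also differs from Athena's in places, so treat this as a way to check the logic, not as Athena code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema: the slides only show the daily_avg_inefficiency and date columns.
conn.execute(
    "CREATE TABLE daily_avg_inefficiency "
    "(deviceid INTEGER, date TEXT, daily_avg_inefficiency REAL)"
)

# Made-up rows: two devices on the first day, one on the second.
rows = [
    (1, "2017-10-10", 10.0),
    (2, "2017-10-10", 30.0),
    (1, "2017-10-11", 20.0),
]
conn.executemany("INSERT INTO daily_avg_inefficiency VALUES (?, ?, ?)", rows)

# Same shape as the KPI query on the slide: per-day mean across devices.
kpi = conn.execute(
    "SELECT SUM(daily_avg_inefficiency) / COUNT(*) AS all_device_avg_inefficiency, date "
    "FROM daily_avg_inefficiency GROUP BY date ORDER BY date"
).fetchall()

for avg_inefficiency, day in kpi:
    print(day, avg_inefficiency)
```

When no values are NULL, `SUM(x) / COUNT(*)` gives the same result as `AVG(x)`, so the KPI is simply the per-day average across devices.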
  • 33. Overall architecture: Kinesis Firehose; Amazon S3 (“raw-time-series”); Amazon S3 (“daily-average”); Amazon Athena; Kinesis Analytics; Kinesis Firehose; Amazon S3 (“results”)
  • 34. Characteristics • Scale to hundreds of thousands of data sources • Virtually infinite storage scalability • Real-time and batch processing layers • Interactive queries • Highly available and durable • Pay only for what you use • No servers to manage
  • 35. Very easy to try: existing template
  • 36. Building the Atlassian Data Lake. ROHAN DHUPELIA | ANALYTICS PLATFORM MANAGER | @ROHANDHUPELIA
  • 37. ATLASSIAN OVERVIEW. Meetings, Decisions, Mentions, Files, Convos, Reactions. Planning & tracking; Messaging & communication; Organizing projects; Content collaboration; Code collaboration. Software Teams, IT Teams, Marketing Teams, Finance Teams, HR Teams
  • 38. Socrates: The Atlassian Data Lake. Image courtesy of © Bar Harel, CC BY-SA 4.0, Wikimedia Commons
  • 39. The numbers: 500+ TBs stored in the data lake; 1B+ events ingested into the data lake daily; 100 integrations providing analytical events; 1000 internal users using the data lake daily
  • 41. Ingest: Moving away from pull-based ingestion
  • 42. Challenges with pull-based ingestion. Complex: various technologies to maintain. Disruptive: analytics extracts strain source systems. Brittle: as sources change, the pipelines break and need updating
  • 43. Our Ingestion Journey, Late 2015: Socrates (Data Lake); Web, CRM, Billing, Product; Kinesis, REST, JDBC, GraphQL
  • 44. Our Ingestion Journey, Early 2016: Socrates (Data Lake); Web, CRM, Billing, Product, Micro Services; Kinesis, REST, JDBC, Webhook, ODBC, SFTP, GraphQL
  • 45. Our Ingestion Journey, Late 2016: Socrates (Data Lake); Web, CRM, Billing, Product, Micro Services
  • 46. Our Ingestion Journey, Early 2017: Socrates (Data Lake); Web, CRM, Billing, Product, Micro Services; Other Micro Services; Other Enterprise Systems; StreamHub (Enterprise Bus)
  • 47. What is StreamHub? Event-Driven Architecture: producers and subscribers integrate via events. Schema Registry: validates that messages are compatible
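The slides do not say which compatibility rule the StreamHub schema registry applies, so the following Python sketch is only one plausible reading: a new schema version is accepted if every existing field is still present with the same type (adding fields is allowed), and a message is accepted if it carries every field the schema declares. The schema representation, function names, and field names are all hypothetical.

```python
def is_compatible(old_schema: dict, new_schema: dict) -> bool:
    """True if every field of old_schema still exists in new_schema with the same type."""
    return all(new_schema.get(field) == ftype for field, ftype in old_schema.items())


def validate_message(message: dict, schema: dict) -> bool:
    """True if the message has every schema field with a value of the declared type."""
    return all(
        field in message and isinstance(message[field], ftype)
        for field, ftype in schema.items()
    )


# Hypothetical schema versions for a "comment created" event.
schema_v1 = {"issue_id": int, "body": str}
schema_v2 = {**schema_v1, "author": str}      # adds a field: accepted under this rule
schema_v3 = {"issue_id": str, "body": str}    # changes a type: rejected under this rule

accepts_v2 = is_compatible(schema_v1, schema_v2)
accepts_v3 = is_compatible(schema_v1, schema_v3)
message_ok = validate_message({"issue_id": 42, "body": "Looks good"}, schema_v1)
```

Checking compatibility when a producer registers a schema, rather than when a subscriber reads a message, is what lets pipelines avoid the brittleness described for pull-based ingestion.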
  • 48.
  • 49.
  • 50.
  • 51. How do we land it?
  • 52. atlassian-socrates-raw-landed/
        └── avi:jira:created:comment/
            └── day=2017-10-10/
                ├── events-13:20:15.479940.json.gz
                ├── events-13:21:23.479940.json.gz
                ├── events-13:21:52.479940.json.gz
                ├── events-13:23:37.479940.json.gz
                ├── events-13:23:56.479940.json.gz
                ├── events-13:24:15.479940.json.gz
                ├── events-13:24:21.479940.json.gz
                ├── events-13:25:34.479940.json.gz
                └── events-13:26:13.479940.json.gz
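The landed layout on this slide (event type, then a `day=` partition, then timestamped gzip files) can be reproduced with a few lines of Python. This is a sketch inferred from the file names shown, not Atlassian's actual code; the function name is hypothetical and the bucket name is left out of the returned key.

```python
from datetime import datetime


def landed_key(event_type: str, written_at: datetime) -> str:
    """Build an object key of the form <event type>/day=<date>/events-<time>.json.gz."""
    day = written_at.strftime("%Y-%m-%d")
    stamp = written_at.strftime("%H:%M:%S.%f")  # %f is zero-padded microseconds
    return f"{event_type}/day={day}/events-{stamp}.json.gz"


example_key = landed_key(
    "avi:jira:created:comment",
    datetime(2017, 10, 10, 13, 20, 15, 479940),
)
print(example_key)
```

Because the `day=` segment follows the Hive partition naming convention, query engines that understand that convention can prune by day without reading every file.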
  • 53. atlassian-socrates-raw-published-stg1/
        └── avi:jira:created:comment/
            └── day=2017-10-10
                ├── <sub-partition>
                │   ├── events-part01.snappy.parquet
                │   ├── events-part02.snappy.parquet
                │   ├── events-part03.snappy.parquet
                │   └── events-part04.snappy.parquet
                └── <sub-partition>
                    ├── events-part05.snappy.parquet
                    ├── events-part06.snappy.parquet
                    ├── events-part07.snappy.parquet
                    └── events-part08.snappy.parquet
  • 54. atlassian-socrates-raw-published-stg2/
        └── avi:jira:created:comment/
            └── day=2017-10-10
                ├── business_key_1
                │   └── events-part01.snappy.parquet
                └── business_key_2
                    └── events-part01.snappy.parquet
  • 56. Challenges with preparation. Cluster Management: clusters could be hard to upgrade and attribute costs to jobs. Re-Inventing the Wheel: lots of time spent re-implementing patterns to perform transformations. Data Engineering Bottleneck: teams would rely on us to help them with their data transformation needs
  • 57. Airflow; RAW / UNALTERED; JOB SCOPED CLUSTERS; PREPARED / TRANSFORMED; CRM/Billing; Product/Web; Aggregated / Derived; Dimensional Model; User Defined Extracts; Support/Ops; Account / Chargeback; Upscale; Quarantine
  • 58. Airflow DAG: Copy logs for debugging; Spin up a dedicated EMR cluster; Shut down EMR cluster
  • 59. Transformation as a Service (TaaS)
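The job-scoped cluster pattern on this slide (a dedicated cluster per job, logs kept for debugging, cluster shut down afterwards) can be sketched in plain Python. This deliberately avoids the Airflow and EMR APIs: the four callables are hypothetical stand-ins for whatever tasks the real DAG uses, and the ordering (start, run, copy logs, stop) is an assumption about how the steps fit together.

```python
def run_on_job_scoped_cluster(start_cluster, run_job, copy_logs, stop_cluster):
    """Run one job on its own cluster; always copy logs and shut the cluster down."""
    cluster_id = start_cluster()
    try:
        return run_job(cluster_id)
    finally:
        try:
            copy_logs(cluster_id)      # keep logs for debugging, even on failure
        finally:
            stop_cluster(cluster_id)   # never leave a cluster running


# Stand-in callables that only record the order in which they are called.
calls = []


def _start():
    calls.append("start")
    return "cluster-1"


def _failing_job(cluster_id):
    calls.append("run")
    raise RuntimeError("job failed")


def _copy_logs(cluster_id):
    calls.append("copy_logs")


def _stop(cluster_id):
    calls.append("stop")


try:
    run_on_job_scoped_cluster(_start, _failing_job, _copy_logs, _stop)
except RuntimeError:
    calls.append("error surfaced")
```

Tying the cluster's lifetime to a single job is also what makes per-job cost attribution straightforward, which the preceding slide lists as a pain point of shared clusters.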
  • 60.
  • 61.
  • 62. Organize: Storing, securing, and governing our data
  • 63. Challenges with organizing data. Security: how can we provision buckets for teams who don’t want to face the AWS console head-on? Categorizing data: how can we structure our data lake in a way that will scale well? Teams want flexibility: how do we give teams flexibility on how they organize themselves?
  • 64. Areas of the data lake. Landed: unaltered, unformatted, unmasked. Raw: optimized, partitioned, masked. Modeled: conformed dimensions, standardized facts, aggregated/derived value. Self-Serve: BYO data, user/team managed
  • 66. Self-Service Schemas: What gets provisioned. We call them Zones: we used to call them “Playgrounds”, but often they were used for production loads, e.g. zone_marketing. Provisions the components: • Create an S3 bucket, tagged to the user • Create a schema in our metastore(s) • Create an Active Directory group. Use Vault to control access rights: • A tool that manages secrets • Creates a temporary IAM user (2 hours) • Passes the credentials to the user
  • 67. Self-Service Schemas: How users interact
      Authenticate against Vault
        $ vault auth -method=ldap username=<ad_username>
        Password (will be hidden): <ad_password>
        ...
        token_policies: [zone-marketing-write zone-marketing-read]
      Retrieve your credentials
        $ vault read aws/creds/zone-marketing-write
        Key              Value
        ---              -----
        lease_id         aws/creds/zone-marketing-write/e1x2a3m4p5l6e7
        lease_duration   25h0m0s
        lease_renewable  true
        access_key       AKIAISANEXAMPLEKEYID
        secret_key       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
        security_token   <nil>
  • 68. Self-Service Schemas: How users interact
      Apply Credentials
        $ aws configure
        AWS Access Key ID [None]: AKIAISANEXAMPLEKEYID
        AWS Secret Access Key [None]: 1r2a3n4d5o6m7s8t9r0i1n2g3O4f5C6o7u8r9s0e
      Upload your file
        $ aws s3 cp examplefile s3://atlassian-zone-bucketname
      List your bucket
        $ aws s3 ls s3://atlassian-zone-marketing/
        PRE example_directory/
        PRE another_example_directory/
        2016-12-08 13:21:35 0 example_text_file.txt
        2016-09-27 12:24:48 0 example_csv_file.csv
  • 70. Challenges with data discovery. Managing query engines: query engine usage is unpredictable, and doing a bad job blocks analysts. Finding data: difficult to know which table to trust or to use for what purpose. Teams want options: different visualization tools better suit different needs
  • 71. Visual Layer: Tableau, R Shiny, Zeppelin Notebooks, Redash. Interactive Layer: Amazon Athena, Presto (EMR), Spark/Hive (EMR). Metastore Layer: Hive Metastore, AWS Glue Metastore. Storage Layer: Raw Buckets, Model Buckets, Zone Buckets (Self-Service)
  • 72. Before: Presto • Many failed queries • Difficulties upgrading • Hard to secure. After: Amazon Athena • Ability to attribute costs • Less infrastructure/operational overhead • Not paying for what we don’t use • Uses bucket security policies
  • 73. Challenges with Amazon Athena. No AD Authentication: only access via JDBC to begin with, using keys. Cost Management: costs need to be monitored to spot any unusual spikes. Early Adopter Pains: there wasn’t parity with Presto to begin with
  • 74. Visualization Stack. Tableau: interactive exploration on core data sets and corporate dashboards. R Shiny: web apps and standalone dashboards. Zeppelin Notebooks: web-based notebooks. Redash: quick queries and visualizations on all data
  • 75. Search the Data Catalog
  • 76.
  • 77.
  • 78. Key Takeaways. It’s not just flicking on a switch: you can’t just turn on AWS components and have an instant data lake. AWS helps you move up the value chain: using AWS helps you focus on areas where you can be adding value
  • 79. Thank you! ROHAN DHUPELIA | ANALYTICS PLATFORM MANAGER | @ROHANDHUPELIA