"Learn how to architect a data lake where different teams within your organization can publish and consume data in a self-service manner. As organizations aim to become more data-driven, data engineering teams have to build architectures that can cater to the needs of diverse users, from developers to business analysts to data scientists. Each of these user groups employs different tools, has different data needs, and accesses data in different ways.
In this talk, we will dive deep into assembling a data lake using Amazon S3, Amazon Kinesis, Amazon Athena, Amazon EMR, and AWS Glue. The session will feature Mohit Rao, Architect and Integration lead at Atlassian, the maker of products such as JIRA, Confluence, and Stride. First, we will look at a couple of common architectures for building a data lake. Then we will show how Atlassian built a self-service data lake, where any team within the company can publish a dataset to be consumed by a broad set of users."
39. The numbers
500+ TBs: Stored in the data lake
1B+ Events: Ingested into the data lake daily
100 Integrations: Providing analytical events
1000 Internal Users: Using the data lake daily
42. Challenges with pull-based ingestion
Complex: Various technologies to maintain
Disruptive: Analytics extracts strain sourcing systems
Brittle: As sources change the pipelines break and need updating
56. Challenges with preparation
Cluster Management: Clusters could be hard to upgrade and attribute costs to jobs
Re-Inventing the Wheel: Lots of time spent re-implementing patterns to perform transformations
Data Engineering Bottleneck: Teams would rely on us to help them with their data transformation needs
63. Challenges with organizing data
Security: How can we provision buckets for teams who don’t want to face the AWS console head-on?
Categorizing Data: How can we structure our data lake in a way that will scale well?
Teams want flexibility: How do we give teams flexibility on how they organize themselves?
64. Areas of the data lake
Landed: Unaltered, Unformatted, Unmasked
Raw: Optimized, Partitioned, Masked
Modeled: Conformed dimensions, Standardized facts, aggregated/derived value
Self-Serve: BYO Data, User/Team managed
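As a rough illustration of the step between the Landed and Raw areas, the sketch below masks sensitive fields and derives a date partition for one event. The field names (`email`, `timestamp`), the hashing choice, and the `dt=YYYY-MM-DD` partition layout are assumptions made for the example; the slides do not say how masking or partitioning is actually done.

```python
import hashlib
from datetime import datetime, timezone
from typing import Dict, Tuple

# Hypothetical set of fields that must not reach the Raw area unmasked.
SENSITIVE_FIELDS = {"email"}


def mask(value: str) -> str:
    """Replace a value with a short, stable SHA-256 digest (illustrative only)."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:16]


def landed_to_raw(event: Dict) -> Tuple[str, Dict]:
    """Return (partition, masked_event) for one landed event.

    Assumes the event carries a Unix `timestamp` in seconds.
    """
    masked = {
        key: mask(str(value)) if key in SENSITIVE_FIELDS else value
        for key, value in event.items()
    }
    day = datetime.fromtimestamp(event["timestamp"], tz=timezone.utc)
    return f"dt={day:%Y-%m-%d}", masked
```

Hashing keeps masked values joinable across datasets, which is one possible design choice; a real pipeline might prefer tokenization or dropping the field entirely.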
66. Self-Service Schemas: What gets provisioned
Provisions the components
• Create an S3 bucket, tagged to the user
• Create a schema in our metastore(s)
• Create an Active Directory group
We call them Zones
We used to call them “Playgrounds” but often they were used for production loads
e.g. zone_marketing
Use Vault to control access rights
• A tool that manages secrets
• Creates a temporary IAM user (2 hours)
• Passes the credentials to the user
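The provisioning steps on this slide can be sketched as a small planning function that derives the three resources for a zone from a team name. The bucket and schema patterns follow the examples in the slides (`atlassian-zone-marketing`, `zone_marketing`); the Active Directory group pattern, the `owner` tag key, and the `ZonePlan` and `plan_zone` names are hypothetical. The sketch only computes names and does not call AWS, the metastore, or Active Directory.

```python
import re
from dataclasses import dataclass, field
from typing import Dict

_SLUG = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


@dataclass(frozen=True)
class ZonePlan:
    bucket: str      # S3 bucket to create
    schema: str      # schema to create in the metastore(s)
    ad_group: str    # Active Directory group to create (pattern is assumed)
    tags: Dict[str, str] = field(default_factory=dict)


def plan_zone(team: str, owner: str) -> ZonePlan:
    """Derive the resources that would be provisioned for a team's zone."""
    slug = re.sub(r"[\s_]+", "-", team.strip().lower())
    if not _SLUG.fullmatch(slug):
        raise ValueError(f"unsupported team name: {team!r}")
    return ZonePlan(
        bucket=f"atlassian-zone-{slug}",
        schema=f"zone_{slug.replace('-', '_')}",
        ad_group=f"zone-{slug}-users",  # hypothetical naming convention
        tags={"owner": owner},          # bucket is tagged to the requesting user
    )
```

For example, `plan_zone("Marketing", "jdoe")` yields the bucket `atlassian-zone-marketing` and the schema `zone_marketing`.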
67. Self-Service Schemas: How users interact
Authenticate against Vault
$ vault auth -method=ldap username=<ad_username>
Password (will be hidden): <ad_password>
...
token_policies: [zone-marketing-write zone-marketing-read]
Retrieve your credentials
$ vault read aws/creds/zone-marketing-write
Key Value
--- -----
lease_id aws/creds/zone-marketing-write/e1x2a3m4p5l6e7
lease_duration 25h0m0s
lease_renewable true
access_key AKIAISANEXAMPLEKEYID
secret_key 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
security_token <nil>
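Because the credentials are temporary, a script that wraps `vault read` may want to know when they expire. The sketch below parses the key/value table printed on this slide and converts the `lease_duration` string into a `timedelta`. The function names are hypothetical, and the parser only assumes the whitespace-separated table layout shown in the slide, not any other Vault output format.

```python
import re
from datetime import timedelta
from typing import Dict

# Matches durations such as "25h0m0s", "2h", or "90s".
_DURATION = re.compile(r"(?:(\d+)h)?(?:(\d+)m)?(?:(\d+)s)?")


def parse_vault_table(output: str) -> Dict[str, str]:
    """Turn the 'Key Value' table printed by `vault read` into a dict."""
    fields = {}
    for line in output.splitlines():
        parts = line.split(None, 1)
        # Skip blank lines, the header row, and the dashed separator row.
        if len(parts) != 2 or parts[0] in ("Key", "---"):
            continue
        fields[parts[0]] = parts[1].strip()
    return fields


def parse_lease_duration(text: str) -> timedelta:
    """Convert a duration like '25h0m0s' into a timedelta."""
    match = _DURATION.fullmatch(text.strip())
    if not match or not any(match.groups()):
        raise ValueError(f"unrecognized duration: {text!r}")
    hours, minutes, seconds = (int(group or 0) for group in match.groups())
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)
```

Adding the parsed duration to the time the credentials were issued gives a point after which the script should request a fresh set.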
68. Self-Service Schemas: How users interact
Apply Credentials
$ aws configure
AWS Access Key ID [None]: AKIAISANEXAMPLEKEYID
AWS Secret Access Key [None]: 1r2a3n4d5o6m7s8t9r0i1n2g3O4f5C6o7u8r9s0e
List your bucket
$ aws s3 ls s3://atlassian-zone-marketing/
PRE example_directory/
PRE another_example_directory/
2016-12-08 13:21:35 0 example_text_file.txt
2016-09-27 12:24:48 0 example_csv_file.csv
Upload your file
$ aws s3 cp examplefile s3://atlassian-zone-bucketname
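For teams that prefer scripts to the CLI, the list and upload steps on this slide could look roughly like the following with boto3. The helper names `list_zone` and `upload` are hypothetical; `s3` is expected to be a boto3 S3 client, and only its `list_objects_v2` and `upload_file` methods are used. This is a sketch, not how the speakers' tooling works.

```python
import os
from typing import List, Optional, Tuple


def list_zone(s3, bucket: str, prefix: str = "") -> Tuple[List[str], List[str]]:
    """Return (directories, files) directly under a prefix, like `aws s3 ls`.

    Only the first page of results (up to 1000 keys) is read in this sketch.
    """
    response = s3.list_objects_v2(Bucket=bucket, Prefix=prefix, Delimiter="/")
    directories = [item["Prefix"] for item in response.get("CommonPrefixes", [])]
    files = [item["Key"] for item in response.get("Contents", [])]
    return directories, files


def upload(s3, path: str, bucket: str, key: Optional[str] = None) -> str:
    """Upload a local file, like `aws s3 cp`, and return its S3 URI."""
    key = key or os.path.basename(path)
    s3.upload_file(path, bucket, key)
    return f"s3://{bucket}/{key}"
```

With boto3 installed, a client built as `boto3.client("s3", aws_access_key_id=..., aws_secret_access_key=...)` from the Vault-issued credentials can be passed in as `s3`.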
70. Challenges with data discovery
Managing query engines: Query engine usage is unpredictable, doing a bad job blocks analysts
Finding data: Difficult to know which table to trust or to use for what purpose
Teams want options: Different visualization tools better suit different needs
71. Visual Layer: Tableau, R Shiny, Zeppelin Notebooks, Redash
Interactive Layer: Amazon Athena, Presto (EMR), Spark/Hive (EMR)
Metastore Layer: Hive Metastore, AWS Glue Metastore
Storage Layer: Raw Buckets, Model Buckets, Zone Buckets (Self-Service)
72. Before: Presto
• Many failed queries
• Difficulties upgrading
• Hard to secure
After: Amazon Athena
• Ability to attribute costs
• Less infrastructure/operational overhead
• Not paying for what we don’t use
• Uses bucket security policies
73. Challenges with Amazon Athena
No AD Authentication: Only access via JDBC to begin with using keys
Cost Management: Costs need to be monitored to spot any unusual spikes
Early Adopter Pains: There wasn’t parity with Presto to begin with
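One way to approach the cost-management point is to track bytes scanned per day and flag days that far exceed the recent baseline. The sketch below is an assumption-laden illustration: the price per terabyte is a placeholder to be replaced with the current published rate, a terabyte is taken as 10^12 bytes, and the seven-day window and 3x threshold are arbitrary. How the daily totals are collected is left out.

```python
from statistics import median
from typing import Dict, List

# Placeholder price; check current Athena pricing before relying on it.
PRICE_PER_TB_USD = 5.0


def estimate_cost(bytes_scanned: int, price_per_tb: float = PRICE_PER_TB_USD) -> float:
    """Estimate query cost from bytes scanned (1 TB taken as 10**12 bytes)."""
    return bytes_scanned / 1e12 * price_per_tb


def find_spikes(daily_bytes: Dict[str, int], window: int = 7, factor: float = 3.0) -> List[str]:
    """Return days whose scanned bytes exceed `factor` x the trailing median.

    `daily_bytes` maps ISO dates (YYYY-MM-DD) to total bytes scanned that day.
    Days without a full `window` of history are skipped.
    """
    days = sorted(daily_bytes)
    spikes = []
    for index, day in enumerate(days):
        history = [daily_bytes[d] for d in days[max(0, index - window):index]]
        if len(history) < window:
            continue
        baseline = median(history)
        if baseline > 0 and daily_bytes[day] > factor * baseline:
            spikes.append(day)
    return spikes
```

A median baseline is used so that one earlier spike does not inflate the threshold for the following days.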
74. Visualization Stack
Tableau: Interactive exploration on core data sets and corporate dashboards
R Shiny: Web apps and standalone dashboards
Zeppelin Notebooks: Web based notebooks
Redash: Quick queries and visualizations on all data
78. Key Takeaways
It’s not just flicking on a switch: You can’t just turn on AWS components and have an instant data lake
AWS helps you move up the value chain: Using AWS helps you focus on areas where you can be adding value