Building a Metadata Catalog for your Data Lakes using Amazon Elasticsearch Service - July 2017

© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
11 Jul 2017
Jon Handler
AWS Principal Solutions Architect
handler@amazon.com or @_searchgeek

Get started at https://aws.amazon.com/elasticsearch-service/
What produces data?
• Metering records
• Mobile apps
• IoT sensors
• Web clickstream
• Enterprise documents
• Application logs, for example:
[Wed Oct 11 14:32:52 2000] [error] [client 127.0.0.1] client denied by server configuration: /export/home/live/ap/htdocs/test

What must a data lake support?
• Collecting and storing any type of data, at any scale and at low cost
• Securing and protecting this data in the central repository
• Searching and finding relevant data
• Quickly and easily performing data analysis
• Defining the data’s structure at the time of use

Data Lake Architectures
A data lake is a new and increasingly popular architecture for storing and analyzing massive volumes of heterogeneous data in a centralized repository.

Building a data lake on AWS
• Catalog & Search: access and search metadata (DynamoDB, Elasticsearch)
• Access & User Interface: give your users easy and secure access (API Gateway, Identity & Access Management, Cognito)
• Analytics and database services shown in the diagram: QuickSight, Amazon AI, EMR, Redshift, Athena, Kinesis, RDS
• Central Storage: secure, cost-effective storage in Amazon S3
• Data Ingestion: get your data into S3 quickly and securely (Snowball, Database Migration Service, Kinesis Firehose, Direct Connect)
• Protect and Secure: use entitlements to ensure data is secure and users' identities are verified (Security Token Service, CloudWatch, CloudTrail, Key Management Service)

• Real-time, distributed, search & analytics engine:
• Built on top of Apache Lucene
• Developer friendly RESTful API

Elasticsearch for S3 data and metadata
GET metadata*/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "metadata.department": "001"
          }
        },
        {
          "range": {
            "@timestamp": {
              "gte": "2017-07-06T00:15:00"
              ...
}
• Locate objects in S3 based on creation date, author, size, or custom metadata
• Retrieve keys to find source content
• Search against unstructured file contents
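
The query on this slide is truncated, so the following is only a minimal sketch of the two clauses that are visible: a match on the custom department field and a lower bound on @timestamp. The function name build_metadata_query is hypothetical, and whatever followed the "gte" bound on the slide is not reproduced here.

```python
import json


def build_metadata_query(department, since):
    """Return a bool query: department must match AND @timestamp >= since.

    Only the clauses visible on the slide are included; the slide's
    query continues past "gte" and that part is unknown.
    """
    return {
        "query": {
            "bool": {
                "must": [
                    {"match": {"metadata.department": department}},
                    {"range": {"@timestamp": {"gte": since}}},
                ]
            }
        }
    }


if __name__ == "__main__":
    # Print the body that would be sent to GET metadata*/_search.
    body = build_metadata_query("001", "2017-07-06T00:15:00")
    print(json.dumps(body, indent=2))
```

Both clauses sit under "must", so a document is returned only if it satisfies the department match and the time bound together.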

Kibana for monitoring S3 usage

Amazon Elasticsearch Service is a cost-effective
managed service that makes it easy to deploy,
manage, and scale open source Elasticsearch for log
analytics, full-text search and more.

Easy to use and scalable
Diagram components:
• AWS SDK, AWS CLI, AWS CloudFormation
• Elasticsearch data nodes
• Elasticsearch master nodes
• Elastic Load Balancing
• AWS IAM
• CloudWatch
• CloudTrail

Amazon Elasticsearch Service Benefits
• Easy to Use: deploy a production-ready Elasticsearch cluster in minutes
• Open: get direct access to the Elasticsearch open-source API
• Secure: secure clusters with AWS Identity and Access Management (IAM)
• Available: make clusters highly available using Zone Awareness
• AWS Integrated: integrate with Amazon Kinesis Firehose, AWS IoT, and Amazon CloudWatch Logs for seamless data ingestion
• Scalable: scale clusters from a single node up to 100 nodes

Leading enterprises trust Amazon Elasticsearch Service for their search and analytics applications
Customer categories shown: Media & Entertainment, Online Services, Technology, Other

Ingest architecture
Files -> Amazon S3 -> S3 Events -> AWS Lambda function -> (optionally Amazon Kinesis Firehose) -> Amazon Elasticsearch Service
• Files go into the S3 data lake through any channel
• The file creation produces an event that you can catch with a Lambda function
• The Lambda function retrieves S3 data and metadata and pushes them to Amazon ES
• Send directly to the Amazon ES API or, at larger scale, use Kinesis Firehose as a delivery pipe

Synthetic data generation
• Using python-testdata
• Keys are constructed from dest_bin/s3_prefix/UUID.file_type
• Includes custom metadata fields: firstname, lastname, and department
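
As a rough illustration of the key layout and custom metadata described above, here is a minimal sketch that uses only the Python standard library rather than python-testdata. The sample names, the department range, and the function names are all hypothetical. The metadata keys "first", "last", and "department" follow the sample document shown later in the deck.

```python
import random
import uuid

# Hypothetical sample values; the deck does not list the real ones.
FIRST_NAMES = ["Amy", "Luis", "Priya"]
LAST_NAMES = ["Johnston", "Garcia", "Chen"]


def make_key(dest_bin, s3_prefix, file_type):
    """Build a key shaped like dest_bin/s3_prefix/UUID.file_type."""
    return f"{dest_bin}/{s3_prefix}/{uuid.uuid1()}.{file_type}"


def make_metadata(rng):
    """Pick random custom metadata; department is a zero-padded 3-digit code."""
    return {
        "first": rng.choice(FIRST_NAMES),
        "last": rng.choice(LAST_NAMES),
        # The range 1..30 is an assumption made for this sketch.
        "department": f"{rng.randint(1, 30):03d}",
    }


if __name__ == "__main__":
    rng = random.Random()
    print(make_key("bin2", "root/predictions", "docx"))
    print(make_metadata(rng))
```

When uploading with boto3, a dictionary like this can be passed as the Metadata argument of put_object so that S3 stores it as user-defined object metadata.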

Create a Lambda function and connect with S3 Events
• From the Lambda console, select "Create a Lambda function"
• Choose the "Blank Function" template

Configure triggers
• You can trigger Lambda invocations from sources as disparate as S3, CloudFront, DynamoDB, etc.
• Choose S3 as the trigger

Configure with your bucket
• Give the bucket's name
• Pick "Object Created (All)" to receive all creation events
• You can restrict by prefix or suffix
• Enable the trigger!
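
The console steps above have a programmatic equivalent. The sketch below only builds the notification configuration as a plain dictionary; the function name and the ARN in the usage line are hypothetical. The "s3:ObjectCreated:*" event corresponds to the console's "Object Created (All)" choice.

```python
def build_notification_config(function_arn, prefix=None, suffix=None):
    """Build an S3 notification configuration that invokes a Lambda
    function on every object-creation event, optionally restricted
    by key prefix and/or suffix."""
    rules = []
    if prefix:
        rules.append({"Name": "prefix", "Value": prefix})
    if suffix:
        rules.append({"Name": "suffix", "Value": suffix})

    config = {
        "LambdaFunctionArn": function_arn,
        "Events": ["s3:ObjectCreated:*"],
    }
    if rules:
        config["Filter"] = {"Key": {"FilterRules": rules}}
    return {"LambdaFunctionConfigurations": [config]}


if __name__ == "__main__":
    # Hypothetical ARN, for illustration only.
    arn = "arn:aws:lambda:us-west-2:123456789012:function:example"
    print(build_notification_config(arn, prefix="bin2/", suffix=".docx"))
```

With boto3, a dictionary of this shape is what put_bucket_notification_configuration accepts as its NotificationConfiguration argument. S3 also needs permission to invoke the function, which the console normally sets up when you add the trigger.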

Configure the function
• Give the function a name and pick the runtime
• Initial memory and timeout may not be sufficient
• Increase based on document sizing

Add code
• The Lambda function receives batches of records
• For each, retrieve the custom metadata from S3
• Retrieve the object's contents

Add code
• Flatten the structure for easier searching, pulling out interesting details
• Use Firehose for more robust delivery
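
As an illustration of the flattening step, this sketch pulls fields out of the nested S3 event record into a single-level document shaped like the sample document shown later in the deck, with the custom metadata kept under "metadata". The function names are hypothetical, and the slide's actual code may differ. The second helper wraps a document in the Record structure that boto3's Firehose put_record call expects.

```python
import json


def flatten_record(record, metadata):
    """Turn one S3 event record plus custom metadata into a flat document."""
    s3 = record["s3"]
    key = s3["object"]["key"]
    filename = key.rsplit("/", 1)[-1]
    extension = filename.rsplit(".", 1)[-1] if "." in filename else ""
    return {
        "@timestamp": record["eventTime"],
        "bucket": s3["bucket"]["name"],
        "key": key,
        "principalId": record["userIdentity"]["principalId"],
        "filename": filename,
        "extension": extension,
        "eventName": record["eventName"],
        "eTag": s3["object"]["eTag"],
        "awsRegion": record["awsRegion"],
        "size": s3["object"]["size"],
        "metadata": metadata,
    }


def to_firehose_record(doc):
    """Encode a document as newline-terminated JSON bytes under "Data"."""
    return {"Data": (json.dumps(doc) + "\n").encode("utf-8")}
```

A Firehose client would then send it with put_record(DeliveryStreamName=..., Record=to_firehose_record(doc)), where the delivery stream name is whatever you configured.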

Set up an IAM role
• In the IAM console, create a new AWS Service Role for Lambda, name it, and add policies for S3 full access, Firehose access, and Lambda execution
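
For readers who script this step instead of using the console, the sketch below shows the trust policy that lets Lambda assume a role, plus a list of AWS managed policies. Mapping the slide's "S3 full access, Firehose access, and Lambda execution" to these three specific managed policies is an assumption; the deck does not name them.

```python
import json

# Trust policy allowing the Lambda service to assume the role.
LAMBDA_TRUST_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "lambda.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

# Assumed managed policies; narrower custom policies are usually preferable.
MANAGED_POLICY_ARNS = [
    "arn:aws:iam::aws:policy/AmazonS3FullAccess",
    "arn:aws:iam::aws:policy/AmazonKinesisFirehoseFullAccess",
    "arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole",
]


def trust_policy_json():
    """Serialize the trust policy as the JSON string IAM expects."""
    return json.dumps(LAMBDA_TRUST_POLICY)


if __name__ == "__main__":
    print(trust_policy_json())
    for arn in MANAGED_POLICY_ARNS:
        print(arn)
```

The function only needs to read objects, so a read-only S3 policy scoped to the data lake bucket would likely be enough in practice.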

Set access for the function
• Choose the Role you just created
• That's it!

Elasticsearch works with structured JSON
{
  "@timestamp": "2017-07-06T04:00:05.173Z",
  "bucket": "handler-data-lake",
  "key": "bin2/root/predictions/97e05fcc-61ff-11e7-b192-80e6501b37de.docx",
  "principalId": "ACQ34MYGS5IYH",
  "filename": "97e05fcc-61ff-11e7-b192-80e6501b37de.docx",
  "extension": "docx",
  "eventName": "ObjectCreated:Put",
  "eTag": "6797bde3578a82e62a4564de50f18cc0",
  "awsRegion": "us-west-2",
  "size": 242,
  "metadata": {
    "department": "021",
    "last": "Johnston",
    "first": "Amy"
  }
}
• Documents contain fields: name/value pairs
• Fields can nest
• Value types include text, numerics, dates, and geo objects
• Field values can be single or array
• When you send documents to Elasticsearch they should arrive as JSON*
*ES 5 can work with unstructured documents

Layout of data in the cluster
• One index per day in the Amazon ES cluster: logs_01.21.2017, logs_01.22.2017, logs_01.23.2017, logs_01.24.2017, logs_01.25.2017, logs_01.26.2017, logs_01.27.2017
• Each index has multiple shards (Shard 1, Shard 2, Shard 3)
• Each shard contains a set of documents
• Each document contains a set of fields and values (Bucket, Key, Filename, Extension, etc.)
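
A one-index-per-day layout means the writer has to derive the index name from each document's timestamp. Here is a minimal sketch, assuming the logs_MM.DD.YYYY naming shown in the diagram. The function name and the default prefix are illustrative; use whatever pattern your index template matches.

```python
from datetime import datetime


def daily_index_name(timestamp, prefix="logs_"):
    """Return the daily index name for a datetime, e.g. logs_01.21.2017."""
    return f"{prefix}{timestamp:%m.%d.%Y}"


if __name__ == "__main__":
    print(daily_index_name(datetime(2017, 1, 21)))
```

Daily indexes make retention simple: removing a day's data is a single index deletion rather than a delete-by-query.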

Deploy instances based on storage and use
• Each object's metadata document will be about 500 bytes. Multiply by the number of objects to get the total size, e.g., 500 MB for 1 million objects. Double that for one replica. Compare the result with "Total cluster size" to get the instance count right
• 2 instances minimum!
• M4 class of instances as a starting point. R4 for high volume
• Use dedicated masters and Zone Awareness
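
The sizing rule above is simple arithmetic and can be sketched as follows. The 500-byte figure and the two-instance minimum come from the slide. The per-instance storage value is a placeholder you would replace with the figure for your chosen instance type, and the function names are hypothetical.

```python
import math


def estimate_storage_bytes(num_objects, bytes_per_doc=500, replicas=1):
    """Total index size: one primary copy plus `replicas` extra copies."""
    return num_objects * bytes_per_doc * (1 + replicas)


def instance_count(total_bytes, storage_per_instance_bytes, minimum=2):
    """Instances needed to hold the data, never fewer than the minimum."""
    needed = math.ceil(total_bytes / storage_per_instance_bytes)
    return max(minimum, needed)


if __name__ == "__main__":
    total = estimate_storage_bytes(1_000_000)
    # 512 GB per instance is an assumed placeholder, not a quoted limit.
    print(total, instance_count(total, 512 * 10**9))
```

For 1 million objects this gives 500 MB of primary data and 1 GB with one replica, so storage alone would fit on one instance and the two-instance minimum is what sets the count.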

Before you begin, set an index template
• Matches all indexes created as "logs-<date>"
• Set sharding
• refresh_interval controls how often newly indexed documents become searchable (a Lucene segment refresh)
• Set specific schema for fields
• Analyzing the S3 key hierarchically
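
The template on the slide is not included in this transcript. The sketch below shows what such a template might look like for Elasticsearch 5.x, written as a Python dictionary. The shard count, refresh interval, analyzer name, mapping type name "doc", and the choice of mapped fields are all assumptions made for illustration. The path_hierarchy tokenizer is Elasticsearch's built-in way to index each level of a path such as an S3 key.

```python
import json

INDEX_TEMPLATE = {
    # Elasticsearch 5.x uses "template" for the index name pattern.
    "template": "logs-*",
    "settings": {
        "number_of_shards": 3,        # assumed value
        "number_of_replicas": 1,
        "refresh_interval": "5s",     # assumed value
        "analysis": {
            "analyzer": {
                # Hypothetical analyzer name.
                "s3_key_path": {"type": "custom", "tokenizer": "path_hierarchy"}
            }
        },
    },
    "mappings": {
        "doc": {  # hypothetical mapping type name
            "properties": {
                "@timestamp": {"type": "date"},
                "key": {"type": "text", "analyzer": "s3_key_path"},
                "extension": {"type": "keyword"},
                "size": {"type": "long"},
            }
        }
    },
}

if __name__ == "__main__":
    # This JSON would be the body of a PUT to _template/<template-name>.
    print(json.dumps(INDEX_TEMPLATE, indent=2))
```

With the path_hierarchy tokenizer, a key such as bin2/root/predictions/file.docx is indexed as bin2, bin2/root, bin2/root/predictions, and the full key, so a search on any prefix level matches.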

Set up Kibana

Query data – find specific content

Query data – find specific author

Query – who is sending .xlsx files

Visualize – objects added over time

Visualize – heat map for file extensions

Visualize – total data sent to S3 over time

Three Things to Remember
• Data lakes allow you to store and analyze massive volumes of unstructured and heterogeneous data
• Amazon S3 provides a central repository for your data lake
• Send object-creation events to Amazon ES via Lambda to provide catalog and analysis of your data

Find out more:
https://aws.amazon.com/elasticsearch-service/
AWS Centralized Logging:
https://aws.amazon.com/answers/logging/centralized-logging/
Elasticsearch at the AWS Database Blog:
https://aws.amazon.com/blogs/database/category/elasticsearch/
Or ask your Solutions Architect!
