© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Building a Metadata Catalog for your Data Lakes using Amazon Elasticsearch Service
11 Jul 2017
Jon Handler
AWS Principal Solutions Architect
handler@amazon.com or @_searchgeek
Get started at https://aws.amazon.com/elasticsearch-service/
What produces data?
• Metering records
• Mobile apps
• IoT sensors
• Web clickstream
• Enterprise documents
• Application logs
[Wed Oct 11 14:32:52 2000] [error] [client 127.0.0.1] client denied by server configuration: /export/home/live/ap/htdocs/test
What must a data lake support?
• Collecting and storing any type of data, at any scale and at low cost
• Securing and protecting this data in the central repository
• Searching and finding relevant data
• Quickly and easily performing data analysis
• Defining the data’s structure at the time of use
Data Lake Architectures
A data lake is a new and increasingly popular architecture for storing and
analyzing massive volumes of heterogeneous data in a centralized repository
Building a data lake on AWS
• Catalog & Search: access and search metadata (DynamoDB, Elasticsearch)
• Access & User Interface: give your users easy and secure access (API Gateway, Identity & Access Management, Cognito)
• Central Storage: secure, cost-effective storage in Amazon S3
• Data Ingestion: get your data into S3 quickly and securely (Snowball, Database Migration Service, Kinesis Firehose, Direct Connect)
• Protect and Secure: use entitlements to ensure data is secure and users’ identities are verified (Security Token Service, CloudWatch, CloudTrail, Key Management Service)
• Other services shown: QuickSight, Amazon AI, EMR, Redshift, Athena, Kinesis, RDS
Amazon Elasticsearch Service
• Real-time, distributed, search & analytics engine:
• Built on top of Apache Lucene
• Developer friendly RESTful API
Elasticsearch for S3 data and metadata
GET metadata*/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "metadata.department": "001"
          }
        },
        {
          "range": {
            "@timestamp": {
              "gte": "2017-07-06T00:15:00"
              ...
}
• Locate objects in S3 based on creation date, author, size, or custom metadata
• Retrieve keys to find source content
• Search against unstructured file contents
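The query on this slide is cut off after the "gte" bound. As a rough sketch of the same shape, the request body can be built in Python. This assumes the range clause has only the lower bound shown and that nothing else follows in the elided part; the function name is hypothetical.

```python
import json

def build_metadata_query(department, since):
    """Build a bool query: department must match and @timestamp >= since."""
    return {
        "query": {
            "bool": {
                "must": [
                    {"match": {"metadata.department": department}},
                    # Assumption: only a lower bound, as far as the slide shows.
                    {"range": {"@timestamp": {"gte": since}}},
                ]
            }
        }
    }

body = build_metadata_query("001", "2017-07-06T00:15:00")
# This JSON would be sent as the body of GET metadata*/_search.
print(json.dumps(body, indent=2))
```

Both clauses sit under "must", so a document has to satisfy the department match and the timestamp range to be returned.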
Kibana for monitoring S3 usage
Amazon Elasticsearch Service is a cost-effective
managed service that makes it easy to deploy,
manage, and scale open source Elasticsearch for log
analytics, full-text search and more.
Easy to use and scalable
[Diagram: AWS SDK, AWS CLI, AWS CloudFormation; Elasticsearch data nodes; Elasticsearch master nodes; Elastic Load Balancing; AWS IAM; CloudWatch; CloudTrail]
Amazon Elasticsearch Service Benefits
• Easy to Use: deploy a production-ready Elasticsearch cluster in minutes
• Open: get direct access to the Elasticsearch open-source API
• Secure: secure clusters with AWS Identity and Access Management (IAM)
• Available: make clusters highly available using Zone Awareness
• AWS Integrated: integrate with Amazon Kinesis Firehose, AWS IoT, and Amazon CloudWatch Logs for seamless data ingestion
• Scalable: scale clusters from a single node up to 100 nodes
Leading enterprises trust Amazon Elasticsearch
Service for their search and analytics applications
Media & Entertainment, Online Services, Technology, Other
Metadata ingest to Amazon ES
Ingest architecture
[Diagram: files, Amazon S3, S3 events, AWS Lambda function, Amazon Kinesis Firehose, Amazon Elasticsearch Service]
• Files go into the S3 data lake through any channel
• The file creation produces an event that you can catch with a Lambda function
• The Lambda function retrieves S3 data and metadata and pushes them to Amazon ES
• Push directly to the Elasticsearch API or, at larger scale, use Kinesis Firehose as a delivery pipe
Synthetic data generation
• Using python-testdata
• Keys are constructed from dest_bin/s3_prefix/UUID.file_type
• Includes custom metadata fields: firstname, lastname, and department
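A minimal sketch of the key layout and metadata fields, for illustration only. It uses the standard library in place of python-testdata, and the value pools, the department range, and the function names are all assumptions that do not come from the talk.

```python
import random
import uuid

# Hypothetical value pool; the talk does not list the file types it used.
FILE_TYPES = ["docx", "xlsx", "pdf", "txt"]

def make_key(dest_bin, s3_prefix, file_type):
    """Return a key shaped like dest_bin/s3_prefix/UUID.file_type."""
    return "{}/{}/{}.{}".format(dest_bin, s3_prefix, uuid.uuid1(), file_type)

def make_metadata(rng):
    """Return custom metadata with the three fields named on the slide."""
    return {
        "firstname": rng.choice(["Amy", "Luis", "Priya"]),
        "lastname": rng.choice(["Johnston", "Okafor", "Lind"]),
        # Assumption: departments are three-digit, zero-padded codes.
        "department": "{:03d}".format(rng.randint(1, 30)),
    }

rng = random.Random(42)
key = make_key("bin2", "root/predictions", rng.choice(FILE_TYPES))
metadata = make_metadata(rng)
print(key, metadata)
```

When uploading with boto3, a dict like this would typically be passed as the Metadata argument of put_object so that S3 stores it as custom object metadata.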
Create a Lambda function and connect with S3 Events
• From the Lambda console, select "Create a Lambda function"
• Choose the "Blank Function" template
Configure triggers
• You can trigger Lambda invocations from sources as disparate as S3, CloudFront, DynamoDB, etc.
• Choose S3 as the trigger
Configure with your bucket
• Give the bucket's name
• Pick "Object Created (All)" to receive all creation events
• You can restrict by prefix or suffix
• Enable the trigger!
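The console steps above have a rough equivalent as an S3 bucket notification configuration. The sketch below only builds the configuration dict. The function name, the ARN, and the prefix and suffix values are placeholders, and the talk itself uses the console rather than the API.

```python
def build_notification_config(lambda_arn, prefix=None, suffix=None):
    """Build an S3 notification config that sends all creation events to Lambda."""
    rules = []
    if prefix:
        rules.append({"Name": "prefix", "Value": prefix})
    if suffix:
        rules.append({"Name": "suffix", "Value": suffix})
    config = {
        "LambdaFunctionArn": lambda_arn,
        # "s3:ObjectCreated:*" corresponds to "Object Created (All)".
        "Events": ["s3:ObjectCreated:*"],
    }
    if rules:
        config["Filter"] = {"Key": {"FilterRules": rules}}
    return {"LambdaFunctionConfigurations": [config]}

# Placeholder ARN and filters, not taken from the talk.
notification = build_notification_config(
    "arn:aws:lambda:us-west-2:123456789012:function:index-metadata",
    prefix="bin2/",
    suffix=".docx",
)
print(notification)
```

With boto3, a dict of this shape is what put_bucket_notification_configuration accepts as NotificationConfiguration. Outside the console, the function also needs a resource permission that lets S3 invoke it.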
Configure the function
• Give the function a name and pick the runtime
• Initial memory and timeout may not be sufficient
• Increase based on document sizing
Add code
• The Lambda function receives batches of records
• For each, retrieve the custom metadata from S3
• Retrieve the object's contents
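The slide's own code is not in this transcript, so the following is only a sketch of the three steps above. It assumes the standard S3 event record layout and a client exposing get_object the way boto3's S3 client does. StubS3Client is a hypothetical stand-in so the example runs offline.

```python
import io
from urllib.parse import unquote_plus

def iter_s3_objects(event):
    """Yield (bucket, key) for every record in an S3 event notification."""
    for record in event.get("Records", []):
        s3 = record["s3"]
        # Keys arrive URL-encoded in S3 events (spaces become "+").
        yield s3["bucket"]["name"], unquote_plus(s3["object"]["key"])

def collect_objects(event, s3_client):
    """Fetch custom metadata and contents for each object in the event."""
    results = []
    for bucket, key in iter_s3_objects(event):
        response = s3_client.get_object(Bucket=bucket, Key=key)
        results.append({
            "bucket": bucket,
            "key": key,
            "metadata": response.get("Metadata", {}),
            "contents": response["Body"].read(),
        })
    return results

class StubS3Client:
    """Hypothetical stand-in for boto3.client("s3"); returns fixed data."""
    def get_object(self, Bucket, Key):
        return {"Metadata": {"department": "021"}, "Body": io.BytesIO(b"hello")}

sample_event = {"Records": [{"s3": {"bucket": {"name": "handler-data-lake"},
                                    "object": {"key": "bin2/root/my+file.docx"}}}]}
objects = collect_objects(sample_event, StubS3Client())
print(objects)
```

In a real function, the handler would pass boto3.client("s3") instead of the stub. Reading whole objects into memory is one reason the default memory and timeout may need raising.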
Add code
• Flatten the structure for easier searching, pulling out
interesting details
• Use Firehose for more robust delivery
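One way to picture the flattening step: pull the interesting fields out of the nested S3 event record into a single-level document. The field names below follow the sample document shown later in the deck. The function names and the sample record are assumptions, not the talk's code.

```python
import json
import posixpath
from urllib.parse import unquote_plus

def flatten_record(record, metadata):
    """Flatten one S3 event record plus custom metadata into a search document."""
    s3 = record["s3"]
    key = unquote_plus(s3["object"]["key"])
    filename = posixpath.basename(key)
    extension = filename.rsplit(".", 1)[1] if "." in filename else ""
    return {
        "@timestamp": record["eventTime"],
        "bucket": s3["bucket"]["name"],
        "key": key,
        "principalId": record["userIdentity"]["principalId"],
        "filename": filename,
        "extension": extension,
        "eventName": record["eventName"],
        "eTag": s3["object"]["eTag"],
        "awsRegion": record["awsRegion"],
        "size": s3["object"]["size"],
        "metadata": metadata,
    }

def to_firehose_record(doc):
    """Wrap a document as the Record argument expected by Firehose put_record."""
    return {"Data": json.dumps(doc).encode("utf-8")}

# Sample record with illustrative values in the standard S3 event layout.
sample_record = {
    "eventTime": "2017-07-06T04:00:05.173Z",
    "eventName": "ObjectCreated:Put",
    "awsRegion": "us-west-2",
    "userIdentity": {"principalId": "ACQ34MYGS5IYH"},
    "s3": {"bucket": {"name": "handler-data-lake"},
           "object": {"key": "bin2/root/predictions/report.docx",
                      "size": 242, "eTag": "6797bde3578a82e62a4564de50f18cc0"}},
}
doc = flatten_record(sample_record, {"department": "021"})
firehose_record = to_firehose_record(doc)
print(doc)
```

The delivery stream name and the put_record call itself are left out, since they depend on how the stream is set up.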
Set up an IAM role
• In the IAM console, create a new AWS Service Role for
Lambda, name it, and add policies for S3 full access,
Firehose access, and Lambda execution
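For reference, a service role for Lambda carries a trust policy that lets the Lambda service assume it. The sketch below builds that document, plus a narrower permissions policy offered as an assumption-laden alternative to the broad policies named above. The stream ARN is a placeholder and the bucket name is borrowed from the sample document later in the deck.

```python
import json

# Trust policy: lets the Lambda service assume the role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "lambda.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

# Assumption: a narrower set of permissions than the slide's "full access"
# policies, covering object reads, Firehose writes, and CloudWatch logging.
permissions_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow",
         "Action": ["s3:GetObject"],
         "Resource": "arn:aws:s3:::handler-data-lake/*"},
        {"Effect": "Allow",
         "Action": ["firehose:PutRecord", "firehose:PutRecordBatch"],
         # Placeholder delivery stream ARN.
         "Resource": "arn:aws:firehose:us-west-2:123456789012:deliverystream/metadata"},
        {"Effect": "Allow",
         "Action": ["logs:CreateLogGroup", "logs:CreateLogStream", "logs:PutLogEvents"],
         "Resource": "*"},
    ],
}

print(json.dumps(trust_policy, indent=2))
print(json.dumps(permissions_policy, indent=2))
```

The console creates the trust policy automatically when you choose Lambda as the service for the role.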
Set access for the function
• Choose the Role you just created
• That's it!
Amazon ES setup
Elasticsearch works with structured JSON
{
  "@timestamp": "2017-07-06T04:00:05.173Z",
  "bucket": "handler-data-lake",
  "key": "bin2/root/predictions/97e05fcc-61ff-11e7-b192-80e6501b37de.docx",
  "principalId": "ACQ34MYGS5IYH",
  "filename": "97e05fcc-61ff-11e7-b192-80e6501b37de.docx",
  "extension": "docx",
  "eventName": "ObjectCreated:Put",
  "eTag": "6797bde3578a82e62a4564de50f18cc0",
  "awsRegion": "us-west-2",
  "size": 242,
  "metadata": {
    "department": "021",
    "last": "Johnston",
    "first": "Amy"
  }
}
• Documents contain fields (name/value pairs)
• Fields can nest
• Value types include text, numerics, dates, and geo objects
• Field values can be single or array
• When you send documents to Elasticsearch they should arrive as JSON
*ES 5 can work with unstructured documents
Layout of data in the cluster
Amazon ES cluster
• One index per day: logs_01.21.2017, logs_01.22.2017, logs_01.23.2017, logs_01.24.2017, logs_01.25.2017, logs_01.26.2017, logs_01.27.2017
• Each index has multiple shards (Shard 1, Shard 2, Shard 3)
• Each shard contains a set of documents
• Each document contains a set of fields and values (Bucket, Key, Filename, Extension, etc.)
Deploy instances based on storage and use
• Metadata will be ~500 bytes per object. Multiply by the number of objects to get the total size, e.g., 500 MB for 1 million objects. Double it for a replica. Compare against "Total cluster size" to get the instance count right
• 2 instances minimum!
• Use the M4 class of instances as a starting point; R4 for high volume
• Use dedicated masters and Zone Awareness
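The sizing rule above is simple arithmetic. A small sketch, assuming one replica and a hypothetical per-instance storage figure (the talk does not give one):

```python
import math

def estimate_index_bytes(object_count, bytes_per_doc=500, replicas=1):
    """Total storage: documents times size, multiplied by primary plus replicas."""
    return object_count * bytes_per_doc * (1 + replicas)

def instances_needed(total_bytes, storage_per_instance_bytes, minimum=2):
    """Instances required to hold the data, never fewer than the minimum of 2."""
    return max(minimum, math.ceil(total_bytes / storage_per_instance_bytes))

primary_only = estimate_index_bytes(1000000, replicas=0)  # 500,000,000 bytes = 500 MB
with_replica = estimate_index_bytes(1000000)              # 1,000,000,000 bytes = 1 GB

# Hypothetical 512 GB of storage per instance.
count = instances_needed(with_replica, 512 * 10**9)
print(primary_only, with_replica, count)
```

For one million objects the data fits easily on a single instance, so the two-instance minimum decides the count rather than storage.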
Querying and Analyzing data
Before you begin, set an index template
• Matches all indexes created as "logs-<date>"
• Set sharding
• refresh_interval controls Lucene disk flushing
• Set specific schema for fields
• Analyze the S3 key hierarchically
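The template from the slide is not in this transcript. As a sketch of what such a template could look like in Elasticsearch 5.x syntax: the shard count, refresh interval, analyzer names, and field list are all assumptions, and path_hierarchy is one tokenizer that splits a key into its ancestor paths.

```python
import json

def build_index_template(shards=3, refresh_interval="10s"):
    """Build an ES 5.x-style index template body for indexes named logs-*."""
    return {
        "template": "logs-*",
        "settings": {
            "number_of_shards": shards,
            "refresh_interval": refresh_interval,
            "analysis": {
                "tokenizer": {
                    # Emits bin2, bin2/root, bin2/root/predictions, ...
                    "key_path": {"type": "path_hierarchy", "delimiter": "/"}
                },
                "analyzer": {
                    "key_hierarchy": {"type": "custom", "tokenizer": "key_path"}
                },
            },
        },
        "mappings": {
            # Assumption: apply the schema through the 5.x default mapping.
            "_default_": {
                "properties": {
                    "@timestamp": {"type": "date"},
                    "key": {"type": "text", "analyzer": "key_hierarchy"},
                    "extension": {"type": "keyword"},
                    "size": {"type": "long"},
                }
            }
        },
    }

template = build_index_template()
# This JSON would be sent as the body of PUT _template/<name>.
print(json.dumps(template, indent=2))
```

Later Elasticsearch versions replace "template" with "index_patterns" and drop mapping types, so the body would need adjusting there.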
Set up Kibana
Query data – find specific content
Query data – find specific author
Query – who is sending .xlsx files
Visualize – objects added over time
Visualize – heat map for file extensions
Visualize – total data sent to S3 over time
Three Things to Remember
• Data lakes allow you to store and analyze massive volumes of unstructured and heterogeneous data
• Amazon S3 provides a central repository for your data lake
• Send object-creation events to Amazon ES via Lambda to provide catalog and analysis of your data
Find out more:
https://aws.amazon.com/elasticsearch-service/
AWS Centralized Logging:
https://aws.amazon.com/answers/logging/centralized-logging/
Elasticsearch at the AWS Database Blog:
https://aws.amazon.com/blogs/database/category/elasticsearch/
Or ask your Solutions Architect!
