Learning Objectives:
- Understand the search and visualization capabilities of Amazon Elasticsearch Service
- Learn how to set up Amazon Kinesis Firehose to ingest, transform, and load document metadata into Amazon Elasticsearch Service
- Learn best practices for building a metadata repository for your data lakes using Amazon Elasticsearch Service
Data lakes can contain massive amounts of unstructured data, which makes them difficult to search and explore. You can use Amazon Elasticsearch Service to easily index and search both the metadata and the content of the documents in your data lakes. In this tech talk, you will learn how to use Amazon Elasticsearch Service to build a metadata repository for your data lake and index the contents of your documents so you can easily locate files by the text they contain.
Get started at https://aws.amazon.com/elasticsearch-service/
What produces data?
• Metering records
• Mobile apps
• IoT sensors
• Web clickstream
• Enterprise documents
• Application logs
Example Apache error log entry:
[Wed Oct 11 14:32:52 2000] [error] [client 127.0.0.1] client denied by server configuration: /export/home/live/ap/htdocs/test
What must a data lake support?
• Collecting and storing any type of data, at any scale, at low cost
• Securing and protecting this data in the central repository
• Searching and finding relevant data
• Quickly and easily performing data analysis
• Defining the data’s structure at the time of use
Data Lake Architectures
A data lake is a new and increasingly popular architecture for storing and analyzing massive volumes and heterogeneous types of data in a centralized repository.
Building a data lake on AWS
• Data Ingestion – get your data into S3 quickly and securely: Snowball, Database Migration Service, Kinesis Firehose, Direct Connect
• Central Storage – secure, cost-effective storage in Amazon S3
• Catalog & Search – access and search metadata: DynamoDB, Elasticsearch
• Access & User Interface – give your users easy and secure access: API Gateway, Identity & Access Management, Cognito, QuickSight, Amazon AI, EMR, Redshift, Athena, Kinesis, RDS
• Protect and Secure – use entitlements to ensure data is secure and users' identities are verified: Security Token Service, CloudWatch, CloudTrail, Key Management Service
• Real-time, distributed search and analytics engine
• Built on top of Apache Lucene
• Developer-friendly RESTful API
Elasticsearch for S3 data and metadata
GET metadata*/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "metadata.department": "001"
          }
        },
        {
          "range": {
            "@timestamp": {
              "gte": "2017-07-06T00:15:00"
...
}
• Locate objects in S3 based on creation date, author, size, or custom metadata
• Retrieve keys to find source content
• Search against unstructured file contents
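A query like the one above can be sent to the domain from any HTTP client. The sketch below (standard library only) builds the same bool query and POSTs it to the `_search` API; the `endpoint` argument is a hypothetical placeholder for your Amazon ES domain endpoint.

```python
import json
import urllib.request

def build_department_query(department, since):
    """Build the bool query shown above: match on a custom metadata
    field and restrict results by indexing timestamp."""
    return {
        "query": {
            "bool": {
                "must": [
                    {"match": {"metadata.department": department}},
                    {"range": {"@timestamp": {"gte": since}}},
                ]
            }
        }
    }

def search_metadata(endpoint, department, since):
    """POST the query to <endpoint>/metadata*/_search and return the
    parsed response. Requires network access to the domain."""
    body = json.dumps(build_department_query(department, since)).encode()
    req = urllib.request.Request(
        f"{endpoint}/metadata*/_search",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

For a domain that restricts access by IAM policy, requests would additionally need SigV4 signing (for example via a signing library); the sketch omits that.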
Kibana for monitoring S3 usage
Amazon Elasticsearch Service
Amazon Elasticsearch Service is a cost-effective managed service that makes it easy to deploy, manage, and scale open-source Elasticsearch for log analytics, full-text search, and more.
Easy to use and scalable
• AWS SDK, AWS CLI, AWS CloudFormation
• Elasticsearch data nodes
• Elasticsearch master nodes
• Elastic Load Balancing
• AWS IAM
• CloudWatch, CloudTrail
Amazon Elasticsearch Service Benefits
• Easy to Use – Deploy a production-ready Elasticsearch cluster in minutes
• Open – Get direct access to the Elasticsearch open-source API
• Secure – Secure clusters with AWS Identity and Access Management (IAM)
• Available – Make it highly available using Zone Awareness
• AWS Integrated – Integrate with Amazon Kinesis Firehose, AWS IoT, and Amazon CloudWatch Logs for seamless data ingestion
• Scalable – Scale clusters from a single node up to 100 nodes
Leading enterprises trust Amazon Elasticsearch Service for their search and analytics applications
[Customer logos: Media & Entertainment, Online Services, Technology, Other]
Ingest architecture
Files → Amazon S3 → S3 Events → AWS Lambda function → (Amazon Kinesis Firehose) → Amazon Elasticsearch Service
• Files go into the S3 data lake through any channel
• The file creation produces an event that you can catch with a Lambda function
• The Lambda function retrieves S3 data and metadata and pushes them to Amazon ES
• Go direct to the API, or at larger scale, use Kinesis Firehose as a delivery pipe
Synthetic data generation
• Using python-testdata
• Keys are constructed from dest_bin/s3_prefix/UUID.file_type
• Includes custom metadata fields – firstname, lastname, and department
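The generation scheme above can be sketched without python-testdata; the bins, prefixes, and name pools below are illustrative stand-ins, not values from the talk.

```python
import random
import uuid

# Hypothetical pools standing in for python-testdata's generators.
BINS = ["bin1", "bin2", "bin3"]
PREFIXES = ["root/predictions", "root/reports"]
EXTENSIONS = ["docx", "xlsx", "pdf", "txt"]
FIRST_NAMES = ["Amy", "John", "Maria"]
LAST_NAMES = ["Johnston", "Smith", "Garcia"]

def make_object():
    """Build one synthetic S3 key (dest_bin/s3_prefix/UUID.file_type)
    plus the custom metadata fields the slide describes."""
    key = "{}/{}/{}.{}".format(
        random.choice(BINS),
        random.choice(PREFIXES),
        uuid.uuid4(),
        random.choice(EXTENSIONS),
    )
    metadata = {
        "firstname": random.choice(FIRST_NAMES),
        "lastname": random.choice(LAST_NAMES),
        "department": "{:03d}".format(random.randint(0, 99)),
    }
    return key, metadata

# Uploading with boto3 would look like this (not run here); custom
# metadata becomes x-amz-meta-* headers on the object:
# s3.put_object(Bucket=bucket, Key=key, Body=b"...", Metadata=metadata)
```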
Create a Lambda function and connect with S3 events
• From the Lambda console, select "Create a Lambda function"
• Choose the "Blank Function" template
Configure triggers
• You can trigger Lambda invocations from sources as disparate as S3, CloudFront, DynamoDB, etc.
• Choose S3 as the trigger
Configure with your bucket
• Give the bucket's name
• Pick "Object Created (All)" to receive all creation events
• You can restrict by prefix or suffix
• Enable the trigger!
Configure the function
• Give the function a name and pick the runtime
• Initial memory and timeout may not be sufficient
• Increase based on document sizing
Add code
• The Lambda function receives batches of records
• For each record, retrieve its custom metadata from S3
• Retrieve the object's contents
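As a sketch of these steps (assuming the Python Lambda runtime, where boto3 is bundled): parse the S3 notification event, then call `head_object` for the custom metadata and `get_object` for the contents. Function and variable names here are illustrative, not from the talk's code.

```python
import urllib.parse

def object_locations(event):
    """Yield (bucket, key) pairs from an S3-notification event batch.
    Keys arrive URL-encoded in the event, so decode them."""
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        yield bucket, key

def handler(event, context):
    import boto3  # bundled with the AWS Lambda Python runtime
    s3 = boto3.client("s3")
    for bucket, key in object_locations(event):
        # head_object surfaces the user-defined x-amz-meta-* fields
        # as the Metadata dict; get_object returns the contents.
        meta = s3.head_object(Bucket=bucket, Key=key)["Metadata"]
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        # ... flatten and push meta/body toward Amazon ES ...
```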
23. Get started at https://aws.amazon.com/elasticsearch-service/
Add code
• Flatten the structure for easier searching, pulling out interesting details
• Use Firehose for more robust delivery
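A minimal sketch of these two bullets, assuming the Python Lambda runtime: flatten the event details and custom metadata into the single-level document shape shown later in the deck, then hand it to Firehose instead of calling Amazon ES directly. The function names are illustrative.

```python
import json
import os

def flatten(bucket, key, meta, size, event_name):
    """Flatten S3 event details and custom metadata into one document.
    Field names mirror the sample document shown later in the talk."""
    filename = os.path.basename(key)
    return {
        "bucket": bucket,
        "key": key,
        "filename": filename,
        "extension": filename.rsplit(".", 1)[-1] if "." in filename else "",
        "size": size,
        "eventName": event_name,
        "metadata": meta,
    }

def deliver(stream_name, doc):
    """Put the document on a Kinesis Firehose delivery stream
    (more robust than direct API calls at scale). Not run here."""
    import boto3  # bundled with the AWS Lambda Python runtime
    firehose = boto3.client("firehose")
    firehose.put_record(
        DeliveryStreamName=stream_name,
        Record={"Data": json.dumps(doc) + "\n"},
    )
```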
Set up an IAM role
• In the IAM console, create a new AWS Service Role for Lambda, name it, and add policies for S3 full access, Firehose access, and Lambda execution
Set access for the function
• Choose the Role you just created
• That's it!
Elasticsearch works with structured JSON*
{
  "@timestamp": "2017-07-06T04:00:05.173Z",
  "bucket": "handler-data-lake",
  "key": "bin2/root/predictions/97e05fcc-61ff-11e7-b192-80e6501b37de.docx",
  "principalId": "ACQ34MYGS5IYH",
  "filename": "97e05fcc-61ff-11e7-b192-80e6501b37de.docx",
  "extension": "docx",
  "eventName": "ObjectCreated:Put",
  "eTag": "6797bde3578a82e62a4564de50f18cc0",
  "awsRegion": "us-west-2",
  "size": 242,
  "metadata": {
    "department": "021",
    "last": "Johnston",
    "first": "Amy"
  }
}
• Documents contain fields – name/value pairs
• Fields can nest
• Value types include text, numerics, dates, and geo objects
• Field values can be single values or arrays
• When you send documents to Elasticsearch, they should arrive as JSON
*ES 5 can work with unstructured documents
Layout of data in the cluster
• An Amazon ES cluster holds one index per day, e.g. logs_01.21.2017 through logs_01.27.2017
• Each index has multiple shards (Shard 1, Shard 2, Shard 3)
• Each shard contains a set of documents
• Each document contains a set of fields and values (Bucket, Key, Filename, Extension, etc.)
Deploy instances based on storage and use
• Metadata will be ~500 bytes per object. Multiply by the number of objects to get the total size, e.g., 500 MB for 1 million objects. Double that for a replica. Compare to "Total cluster size" to get the instance count correct
• 2 instances minimum!
• M4 class of instances as a starting point; R4 for high volume
• Use dedicated masters and Zone Awareness
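The sizing arithmetic above fits in a few lines; the 500-byte figure is the slide's rule-of-thumb estimate, not a measured constant.

```python
def storage_needed_bytes(object_count, metadata_bytes=500, replicas=1):
    """Rule-of-thumb from the slide: ~500 bytes of metadata per object,
    doubled for one replica copy. Compare the result against the
    domain's total cluster size to pick an instance count."""
    return object_count * metadata_bytes * (1 + replicas)

# 1 million objects -> 500 MB of primary data, 1 GB with one replica.
```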
Before you begin, set an index template
• Matches all indexes created as "logs-<date>"
• Set sharding
• refresh_interval controls Lucene disk flushing
• Set a specific schema for fields
• Analyze the S3 key hierarchically
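A sketch of such a template, sent via the ES 5 `_template` API. The shard count, refresh interval, field names, and mapping type are illustrative assumptions consistent with the slides, not the talk's actual template; the `path_hierarchy` tokenizer provides the hierarchical analysis of the S3 key.

```python
import json
import urllib.request

# Illustrative template body; tune sharding and refresh_interval
# for your own workload.
INDEX_TEMPLATE = {
    "template": "logs-*",  # applies to every index named logs-<date>
    "settings": {
        "number_of_shards": 3,
        "number_of_replicas": 1,
        "refresh_interval": "30s",  # controls Lucene disk flushing
        "analysis": {
            "analyzer": {
                # Tokenize S3 keys hierarchically: "a/b/c" -> a, a/b, a/b/c
                "s3_path": {"type": "custom",
                            "tokenizer": "path_hierarchy"}
            }
        },
    },
    "mappings": {
        "doc": {
            "properties": {
                "@timestamp": {"type": "date"},
                "size": {"type": "long"},
                "key": {"type": "text", "analyzer": "s3_path"},
                "extension": {"type": "keyword"},
            }
        }
    },
}

def put_template(endpoint, name="metadata"):
    """PUT the template to the domain before any matching index exists."""
    req = urllib.request.Request(
        f"{endpoint}/_template/{name}",
        data=json.dumps(INDEX_TEMPLATE).encode(),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```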
Set up Kibana
Query data – find specific content
Query data – find specific author
Query – who is sending .xlsx files
Visualize – objects added over time
Visualize – heat map for file extensions
Visualize – total data sent to S3 over time
Three Things to Remember
• Data lakes allow you to store and analyze massive volumes of unstructured and heterogeneous data
• Amazon S3 provides a central repository for your data lake
• Send object-creation events to Amazon ES via Lambda to provide cataloging and analysis of your data
Find out more:
https://aws.amazon.com/elasticsearch-service/
AWS Centralized Logging:
https://aws.amazon.com/answers/logging/centralized-logging/
Elasticsearch at the AWS Database Blog:
https://aws.amazon.com/blogs/database/category/elasticsearch/
Or ask your Solutions Architect!