AWS Analytics Services - When to use what? | AWS Summit Tel Aviv 2019
1. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. SUMMIT
AWS Analytics Services - When to use what?
With SimilarWeb
Roy Hasson
Business Development Lead – Analytics and Data Lakes
Amazon Web Services
DAT201
Ido Senesh
Sr. Software Engineer
SimilarWeb
3.
Modern Data Challenges
There is more data than people think
● Data grows >10x every 5 years
● Data platforms need to live for 15 years
● Data platforms need to scale 1,000x
4.
Modern Data Challenges
There are more people accessing data
● Data Scientists
● Analysts
● Business Users
● Applications
And more requirements for making data available
● Secure
● Real time
● Flexible
● Scalable
5.
Modern Data Challenges
There are more people working with data than ever before
Democratization of data vs. Governance & control
How do I provide democratized access to data to enable informed decisions while at the same time enforcing data governance and preventing mismanagement of the data?
6.
AWS databases and analytics
Broad and deep portfolio, built for builders
Databases
● RDS: MySQL, PostgreSQL, MariaDB, Oracle, SQL Server
● Aurora: MySQL, PostgreSQL
● DynamoDB: Key value, Document
● ElastiCache: Redis, Memcached
● Neptune: Graph
● Timestream: Time Series
● QLDB: Ledger Database
● RDS on VMware
Analytics
● Amazon Redshift: Data warehousing
● Amazon EMR: Hadoop + Spark
● Athena: Interactive analytics
● Kinesis Analytics: Real-time
● Amazon Elasticsearch Service: Operational Analytics
Business Intelligence & Machine Learning
● Amazon QuickSight
● Amazon SageMaker
● Amazon Comprehend
● Amazon Rekognition
● Amazon Lex
● Amazon Transcribe
● AWS DeepLens
Blockchain
● Managed Blockchain
● Blockchain Templates
Data Lake
● S3/Amazon Glacier
● AWS Glue: ETL & Data Catalog
● Lake Formation: Data Lakes
Data Movement
● Database Migration Service | Snowball | Snowmobile | Kinesis Data Firehose | Kinesis Data Streams | Data Pipeline | Direct Connect
AWS Marketplace
● 730+ Database solutions
● 600+ Analytics solutions
● 25+ Blockchain solutions
● 20+ Data lake solutions
● 250+ solutions
● 30+ solutions
7.
A data lake is a centralized repository that allows
you to store all your structured and unstructured
data at any scale
8.
More data lakes & analytics on AWS than anywhere else
9.
Typical steps of building a data lake
1. Set up storage
2. Move data
3. Cleanse, prep, and catalog data
4. Configure and enforce security and compliance policies
5. Make data available for analytics
11.
Amazon Kinesis Data Streams
12.
Amazon Managed Streaming for Kafka
13.
Amazon Kinesis Data Firehose
14.
Ingestion: Streaming Events
Sources: Devices, Web, Sensors, Social, EDW
● Ingest streaming events in real time with Amazon Kinesis
● Real-time data analysis with Amazon Kinesis Data Analytics
● Take action
● Output streaming data to select destinations
● Optimize file format
S3://bucket/year=yyyy/month=mm/file.parquet
S3://bucket/year=yyyy/month=mm/file.orc
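The year=yyyy/month=mm layout in the S3 paths on this slide is the Hive-style partitioning convention. Below is a minimal sketch of how an ingestion job might derive such a path from an event timestamp. The helper name partitioned_key, the bucket name, and the file name are hypothetical and only illustrate the naming scheme.

```python
from datetime import datetime, timezone


def partitioned_key(bucket: str, event_time: datetime, file_name: str) -> str:
    """Build a Hive-style partitioned S3 URI (year=yyyy/month=mm)."""
    # Normalize to UTC so events land in a consistent partition
    ts = event_time.astimezone(timezone.utc)
    return f"s3://{bucket}/year={ts.year:04d}/month={ts.month:02d}/{file_name}"


if __name__ == "__main__":
    when = datetime(2019, 6, 18, 9, 30, tzinfo=timezone.utc)
    print(partitioned_key("bucket", when, "file.parquet"))
```

Zero-padding the month keeps partitions in lexical order, which makes listings and range filters on the partition columns easier to reason about.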
15.
AWS Database Migration Service
16.
AWS Glue ETL
17.
Ingestion: Database and Data Warehouse
Sources: Devices, Web, Sensors, Social, EDW
S3://bucket/table/LOAD001.csv
S3://bucket/table/20181127-1134010000.csv
S3://bucket/year=yyyy/month=mm/file.parquet
S3://bucket/year=yyyy/month=mm/file.orc
18.
Ingestion: How to choose
[Decision-tree diagram. Node labels: Event Stream; Batch Operation; Database Source; CDC; Snapshot; Incremental; Persist Data; Real-time Analytics; Open Source; Seamless Scaling; AWS DMS; Amazon MSK]
20.
AWS Glue Data Catalog
Search and explore available data
● Unified metadata repository across relational databases, Amazon RDS, Amazon Redshift, and Amazon S3
● Single searchable view into your data, no matter where it is stored
● Ability to automatically crawl and classify your data
● Augment technical metadata with business metadata for tables
● Manage access to data using fine-grained access controls, made even finer with AWS Lake Formation
● Apache Hive metastore compatible and integrated with AWS Analytics services
21.
AWS Glue Crawlers
Automatically catalog your data
● Crawlers automatically build your Data Catalog and keep it in sync
● Automatically discover new data and extract schema definitions
● Detect schema changes and version tables
● Detect Hive style partitions on Amazon S3
● Built-in classifiers for popular types; custom classifiers using Grok expressions
● Run ad hoc or on a schedule; serverless, so you only pay when the crawler runs
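Detecting Hive style partitions comes down to recognizing name=value segments in an object path. The sketch below illustrates that idea only and is not how Glue crawlers are implemented. The function parse_partitions and the sample keys are hypothetical.

```python
from typing import Dict


def parse_partitions(object_key: str) -> Dict[str, str]:
    """Extract Hive-style partition columns (name=value) from an object key."""
    partitions: Dict[str, str] = {}
    # Every path segment except the last one (the file name) may be a partition
    for segment in object_key.split("/")[:-1]:
        name, sep, value = segment.partition("=")
        if sep and name and value:
            partitions[name] = value
    return partitions


if __name__ == "__main__":
    print(parse_partitions("year=2019/month=06/file.parquet"))
    print(parse_partitions("table/LOAD001.csv"))
```

A key without name=value segments yields no partition columns, so such a location would be cataloged as an unpartitioned table.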
22.
AWS Lake Formation (join the preview)
Build, secure, and manage a data lake in days
● Build a data lake in days, not months: build and deploy a fully managed data lake with a few clicks
● Enforce security policies across multiple services: centrally define security, governance, and auditing policies in one place and enforce those policies for all users and all applications
● Combine different analytics approaches: empower analyst and data scientist productivity, giving them self-service discovery and safe access to all data from a single catalog
23.
Hive Metastore Client for AWS Glue Data Catalog
• Connect Hive-Metastore compatible platforms to AWS Glue Data Catalog
• Apache Hive 2.x compatible
• Apache 2.0 license
https://github.com/awslabs/aws-glue-data-catalog-client-for-apache-hive-metastore
24.
Data Catalog
AWS Glue Data Catalog
● Crawl data sources, catalog schema & partitions
● Connect Hive compatible sources via open connector
● Search and discover data in your data lake
● Integrated AWS Analytics tools
25.
Catalog: How to choose
[Decision-tree diagram. Node labels: Auto-discovery; AWS Analytics Integration; Governance FGAC; AWS Lake Formation DC; Managed Hadoop; Hive Metastore; RDS; GDC; Open Connector]
27.
AWS Glue ETL
28.
Amazon EMR
29.
Amazon Redshift
30.
Data Transformation
● Structured and unstructured data available in raw S3 bucket
● Other real-time streaming sources
● Sometimes ELT is a better option
● Transformed S3 bucket for querying
31.
Process: How to choose
[Decision-tree diagram. Node labels: Consistent schema/big data; Variable schema/small data; Cluster customization; Serverless; SQL based transforms; Transactional; <15min job; Apache Spark; Python Shell]
33.
Amazon Athena
[Diagram labels: Data Lake; Permissions; AWS Glue Data Catalog; AWS Cloud; Reporting & Analytics; Machine Learning; Custom Applications]
34.
Amazon EMR Notebooks in the Console
A managed analytics environment based on Jupyter Notebooks
● EMR-managed notebook based on Jupyter notebook
● Auto saves notebook file to your S3 bucket
● Run queries on your remote EMR cluster
[Diagram labels: users; AWS Management Console for EMR; EMR VPC; Customer VPC; Amazon EMR clusters]
35.
Amazon Elasticsearch Service
36.
Amazon QuickSight
37.
Serve and Consume
[Diagram labels: Data Lake; Permissions; AWS Glue Data Catalog; AWS Cloud; Reporting & Analytics; Machine Learning; Custom Applications]
38.
Serve: How to choose
[Decision-tree diagram. Node labels: Interactive query; Free-form search; Milli-sec response; Serverless; Interactive code; Repeated queries]
39.
Build with flexibility in mind
● Open Source
● Secure
● Integrated
● Managed & Elastic
40.
A queryable interface for the entire goddamn internet
SimilarWeb’s Lead Generator
41.
SimilarWeb’s Mission
SimilarWeb gives you digital market intelligence for every website and mobile app worldwide to understand, track and grow your market share.
42.
Some Numbers
● 500M Websites
● 100+ Dimensions
● 50+ Countries
43.
The Use Case
Sales person: "Show me all websites with 50M+ monthly visits from the United States, 60%+ mobile share, a bounce rate of less than 10%, more than 30% of visits by men aged 18-25, and traffic that spiked by 30%+ in the past year"
500M Websites x 100+ Dimensions x 50+ Countries: the internet as measured by SimilarWeb
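To make the sales person's request concrete, here is a sketch of how it could be expressed as a SQL filter. It uses Python's built-in sqlite3 with three made-up rows so that it runs locally. The table site_metrics, its columns, and the sample sites are all hypothetical and are not SimilarWeb's actual schema or data.

```python
import sqlite3

# Hypothetical schema: one row per website and country with monthly metrics.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE site_metrics (
        site TEXT, country TEXT, monthly_visits INTEGER,
        mobile_share REAL, bounce_rate REAL,
        male_18_25_share REAL, yoy_traffic_growth REAL
    )
    """
)
conn.executemany(
    "INSERT INTO site_metrics VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("example-a.com", "US", 80_000_000, 0.72, 0.08, 0.35, 0.40),
        ("example-b.com", "US", 95_000_000, 0.40, 0.05, 0.50, 0.60),  # mobile share too low
        ("example-c.com", "US", 10_000_000, 0.90, 0.02, 0.45, 0.90),  # too few visits
    ],
)

# Each predicate mirrors one clause of the request on the slide.
LEAD_QUERY = """
    SELECT site
    FROM site_metrics
    WHERE country = 'US'
      AND monthly_visits >= 50000000
      AND mobile_share >= 0.60
      AND bounce_rate < 0.10
      AND male_18_25_share > 0.30
      AND yoy_traffic_growth >= 0.30
    ORDER BY monthly_visits DESC
"""

matches = [row[0] for row in conn.execute(LEAD_QUERY)]
print(matches)
```

Only the first sample row satisfies every clause. The same kind of filter, run over hundreds of millions of sites and 100+ dimensions, is what drives the scale requirements that follow.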
44.
Billions of records to query & process for each report
45.
Requirements
● Zero operations
● Cost efficient
● High Availability
● Responsiveness (~seconds query time)
● Data is stored on S3
● Schema evolution
46.
Is it possible?
47.
Amazon Athena
48.
Amazon Athena
Serverless
Just connect to an endpoint
and submit your queries
49.
Amazon Athena
Cost Effectiveness
Running 10k queries at a monthly cost of $150
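A quick back-of-the-envelope check of what these numbers imply per query. The sketch assumes Athena's pay-per-data-scanned model at an illustrative rate of 5 USD per terabyte. That rate and the binary terabyte definition are assumptions, not figures from the talk.

```python
# Assumption: billing is per terabyte of data scanned; 5 USD/TB is used as an
# illustrative rate (check current pricing for your region).
PRICE_PER_TB_USD = 5.0
BYTES_PER_TB = 1024 ** 4

monthly_cost_usd = 150.0   # from the slide
monthly_queries = 10_000   # from the slide

cost_per_query = monthly_cost_usd / monthly_queries
avg_bytes_scanned = cost_per_query / PRICE_PER_TB_USD * BYTES_PER_TB

print(f"cost per query: ${cost_per_query:.3f}")
print(f"implied average scan per query: {avg_bytes_scanned / 1024 ** 3:.2f} GiB")
```

Under these assumptions each query costs about 1.5 cents, which corresponds to roughly 3 GB scanned per query. This is why reducing the data scanned (partitioning, columnar formats, ordering) translates directly into cost.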
50.
Amazon Athena
Fully automated
data discovery
using Glue Crawlers
51.
Amazon Athena
SQL support
Provide customers with rich features
(ordering, aggregations, analytic functions)
without any effort from our side
52.
Amazon Athena
● Serverless
● Cost Effectiveness
● Fully automated data discovery
● SQL support
53.
Preparing for production?
58.
Production Readiness
● Out-of-the-box limit is 5 concurrent queries per second
○ Soft limit - open a limit increase ticket with support
● Workgroups - Control costs and limit parallelism per business case
● Monitoring - A keep-alive every second is not a good idea (it cost us $1,000)
● Disaster Recovery
○ Data - S3 Cross-Region Replication - Provided by AWS
○ Metadata - You need to take care of it yourself (Lambda, Crawlers)
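One way to stay under a concurrency limit like the one in the first bullet is to throttle on the client side. The sketch below caps in-flight queries with a semaphore. The function run_query is a hypothetical stand-in that only sleeps; a real client would submit the query and wait for its result at that point.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENT_QUERIES = 5  # the out-of-the-box limit mentioned above
_slots = threading.BoundedSemaphore(MAX_CONCURRENT_QUERIES)
_lock = threading.Lock()
_running = 0
peak_running = 0


def run_query(sql: str) -> str:
    """Hypothetical stand-in for submitting a query and waiting for its result."""
    global _running, peak_running
    with _slots:  # blocks while MAX_CONCURRENT_QUERIES queries are in flight
        with _lock:
            _running += 1
            peak_running = max(peak_running, _running)
        try:
            time.sleep(0.01)  # simulate query execution time
            return f"done: {sql}"
        finally:
            with _lock:
                _running -= 1


# 20 worker threads compete, but at most 5 hold a slot at any moment.
with ThreadPoolExecutor(max_workers=20) as pool:
    results = list(pool.map(run_query, [f"SELECT {i}" for i in range(40)]))

print(len(results), "queries finished; peak concurrency:", peak_running)
```

If the limit is raised through a support ticket, only the constant needs to change. The same pattern can be applied per workgroup to limit parallelism per business case.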
59.
Tips & Tricks for
Performance
(This would save you s*** tons of time & money)
62.
Production Readiness
● Order your data
○ Columnar formats work best if you write the data ordered
by a commonly used filter key
● Use Hive bucketing
○ Directs Athena to specific files instead of scanning a whole directory
● Use JDBC
○ Better than the API for large reports (thousands of rows or more returned to the client)
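Why ordering helps: columnar formats such as Parquet and ORC keep min/max statistics per block of rows (row group or stripe), so a reader can skip blocks whose range cannot contain the filter value. The sketch below simulates this with plain Python lists. The row-group size, the key distribution, and the helper names are illustrative assumptions, not measurements from SimilarWeb's data.

```python
import random
from typing import List, Tuple

ROW_GROUP_SIZE = 1_000


def row_group_stats(values: List[int]) -> List[Tuple[int, int]]:
    """Split values into fixed-size row groups and record (min, max) per group."""
    groups = [values[i:i + ROW_GROUP_SIZE] for i in range(0, len(values), ROW_GROUP_SIZE)]
    return [(min(g), max(g)) for g in groups]


def groups_to_read(stats: List[Tuple[int, int]], wanted: int) -> int:
    """Count row groups whose min/max range could contain the wanted key."""
    return sum(1 for lo, hi in stats if lo <= wanted <= hi)


random.seed(0)
keys = [random.randrange(10_000) for _ in range(100_000)]

# Same data, written once in arrival order and once sorted by the filter key.
unordered = groups_to_read(row_group_stats(keys), wanted=4242)
ordered = groups_to_read(row_group_stats(sorted(keys)), wanted=4242)
print(f"row groups read: unordered={unordered}, ordered={ordered}")
```

With unordered data almost every row group spans the full key range and has to be read. With sorted data the filter touches only a couple of groups, which means less data scanned and therefore lower latency and cost.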
63.
Read more on our experience working & optimizing
Athena and other cool stuff we are doing at
similarweb.engineering
64.
Feel free to send me your CV at
ido.senesh@similarweb.com
or at linkedin.com/in/senesh
66.
Thank you!
Roy Hasson
royon@amazon.com
@royhasson http://bit.ly/2SJ6WBa
Ido Senesh
ido.senesh@similarweb.com
linkedin.com/in/senesh