SlideShare une entreprise Scribd logo
1  sur  31
Télécharger pour lire hors ligne
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
SSID: Guest
Password: Cube@11999
Building Data Lake on AWS
Adir Sharabi
Solutions Architect, Amazon Web Services
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Floor28 Agenda
GameDay
24 Oct
Enterprise IT Day
23 Oct
Builders Day
AppSync, Alexa & IoT
22 Oct
Big Data Day
14 Oct
ML & DL Day
15 Oct
DevOps Day
16 Oct
DevOps Day
17 Oct
Technical Sessions
Serverless Data Workshop
Big Data UG Meetup
Technical Sessions
SageMaker Workshop
ML&DL Meetup
Technical Sessions
K8s Workshop
DevOps Meetup
Technical Sessions
Spot Workshop
Databases Day
18 Oct
Technical Sessions
Serverless Workshop
Virtual assistants UG Meetup
Technical Sessions
PyTorch Meetup
Technical Sessions
CDK Workshop
AWS IL UG Meetup
Builders Day
Serverless backend
21 Oct
Technical Sessions
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Big Data Day Agenda
# Time Title Speaker
1 9:30 - 10:15 Building Data Lake on AWS Adir Sharabi
2 10:30 - 11:15
Store once, query thrice: Introduction to query
engines on AWS
Daniel Haviv
3 11:30 – 12:15
Introduction to Real-Time Streaming Analytics -
Amazon Kinesis State Of Union
Roy Ben Alta
4 12:30 - 13:15 From data to insights Orit Alul
5 15:00 – 18:00 Serverless Data Processing Workshop Adir Sharabi
6 18:00 – 20:00 Big Data User Group Meetup
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Documents and files Streams
Multiple sources and formats… and growing everyday
Your Data Sources
Records
Amazon
RDS
Amazon
DynamoDB
AWS IoT
On Premises
databases
Spreadsheets Infrastructure
logs
Clickstream data Mobile app data
Social media data Amazon
Redshift
Device data
Sensor data
ERP WEB
Clickstream
Mobile Apps
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Data Challenges
Analysts Applications
Data Scientists
Business Users API Access BI Tools
Notebooks
Multiple consumers
and requirements
1990 2000 2010 2020
Generated Data
Available for
Analysis
Data Visibility Multiple Access
Mechanisms
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Traditionally, Analytics Used to Look Like This
OLTP ERP CRM LOB
Data Warehouse
Business Intelligence
Relational data
Schema defined prior to data load
TBs-PBs Scale
Operational reporting and ad hoc
Large initial capex + $10K–$50K/TB/Year
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Data Lakes Extend the Traditional Approach
OLTP ERP CRM LOB
Data Warehouse
Business
Intelligence
Data Lake
1001100001001010111001
0101011100101010000101
1111011010
0011110010110010110
0100011000010
Devices Web Sensors Social
Catalog
Machine
Learning
DW
Queries
Big data
processing
Interactive Real-time
Relational and non-relational data
Schema defined during analysis
Scale storage and compute independently
Diverse analytical engines to gain insights
Designed for low-cost storage and analytics
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Snowball
Snowmobile Kinesis
Data Firehose
Kinesis
Data Streams
Amazon S3
Redshift
EMR
Athena Kinesis
Elasticsearch
Service
Amazon S3 as Data Lakes Storage Layer
Kinesis
Video Streams
AI Services
Many ways to bring all kinds of data
Unmatched durability and availability at EB scale
Best security, compliance, and audit capabilities
Integration with Big Data Tools
Run any analytics on the same data without movement
Cost effective - Store data at $0.023 / GB / Month
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Store
Simplified Big Data Pipeline
Amazon S3
Ingest
Process &
Analyze Consume
Data sources
Transactions
Web logs /
cookies
ERP
Connected
devices
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Lots of ingestion tools
IngestData sources
Transactions
Web logs /
cookies
ERP
Connected
devices
Process &
Analyze Consume
Store
Amazon S3
Amazon S3
API
Amazon Kinesis
Firehose
Direct Connect
Snowball
Database
Migration Service
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Real-time data movement and Data Lakes on AWS
Amazon
Kinesis Data
Firehose
Amazon
Kinesis Data
Streams
Kinesis Agent
Apache Kafka
AWS SDK
LOG4J
Flume
Fluentd
AWS Mobile SDK
Kinesis Producer Library
Amazon S3
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Lots of ingestion tools
IngestData sources
Transactions
Web logs /
cookies
ERP
Connected
devices
Process &
Analyze Consume
Store
Amazon S3
Amazon S3
API
Amazon Kinesis
Firehose
Direct Connect
Snowball
Database
Migration Service
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Variety of data processing tools
IngestData sources
Transactions
Web logs /
cookies
ERP
Connected
devices
Consume
Amazon Athena
Amazon EMR
Amazon Redshift
& Spectrum
Amazon
Elasticsearch
Amazon AI/ML/DL
Services
Store
Amazon S3
Process & Analyze
Amazon S3
API
Amazon Kinesis
Firehose
Direct Connect
Snowball
Database
Migration Service
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Interactive query service to analyze data in Amazon S3 using standard SQL
No infrastructure to set up or manage and no data to load
Ability to run SQL queries on data archived in Amazon Glacier (coming soon)
Amazon Athena – interactive analysis
$ SQL
Query instantly
Zero setup cost;
just point to
Amazon S3 and
start querying
Pay per query
Pay only for queries run;
save 30%–90% on per-
query costs through
compression
Open
ANSI SQL interface,
JDBC/ODBC drivers,
multiple formats,
compression types, and
complex joins and data
types
Easy
Serverless: zero
infrastructure, zero
administration
Integrated with Amazon
QuickSight
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Analytics and ML at scale
19 open-source projects: Apache Hadoop, Spark, HBase, Presto, and more
Enterprise-grade security
Amazon EMR – big data processing
Latest versions
Updated with the latest
open source frameworks
within 30 days of
release
Low cost
Flexible billing with per-
second billing, Amazon
EC2 Spot, Reserved
Instances, and Auto
Scaling to reduce costs
50%-80%
Use Amazon S3
storage
Process data directly in
the Amazon S3 data
lake securely with high
performance using the
EMRFS connector
Easy
Launch fully managed
Hadoop & Spark in
minutes; no cluster
setup, node provisioning,
cluster tuning
Data Lake
100110000100101011100
1010101110010101000
00111100101100101
010001100001
$
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Fast, powerful, simple, and fully managed data warehouse at 1/10 the cost
Massively parallel, scale from gigabytes to petabytes
Amazon Redshift – data warehousing
Fast at scale
Columnar storage
technology to improve
I/O efficiency and scale
query performance
Inexpensive
As low as $1,000 per
terabyte per year, 1/10 the
cost of traditional data
warehouse solutions; start
at $0.25 per hour
Open file formats Secure
Audit everything;
encrypt data end-to-
end; extensive
certification and
compliance
Analyze optimized data
formats on the latest
SSD, and all open data
formats in Amazon S3
$
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Extend the data warehouse to exabytes of data in Amazon S3 data lake
Amazon Redshift Spectrum
Amazon S3
Data Lake
Amazon
Redshift data
Amazon Redshift Spectrum
query engine
Exabyte Redshift SQL queries against Amazon S3
Join data across Redshift and Amazon S3
Scale compute and storage separately
Stable query performance and unlimited concurrency
CSV, ORC, Grok, Avro, & Parquet data formats
Pay only for the amount of data scanned
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Amazon
Redshift
JDBC / ODBC
...
1 2 3 4 N
Redshift Spectrum
Scale-out serverless compute
COPY
commands
Hot data
Query directly on data
lake
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Multiple ways to consume the data
IngestData sources
Transactions
Web logs /
cookies
ERP
Connected
devices
Consume
Amazon Athena
Amazon EMR
Amazon Redshift
& Spectrum
Amazon
Elasticsearch
Amazon AI/ML/DL
Services
Store
Amazon S3
Process & Analyze
Amazon S3
API
Amazon Kinesis
Firehose
Direct Connect
Snowball
Database
Migration Service
Amazon
QuickSight
Jupyter, Zeppelin,
HUE
Amazon API
Gateway
BI Tools
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Because data is NEVER perfect
Amazon EMR
Spark and Hive running on EMR
Clean
Transform
Concatenate
Convert to better formats
Schedule transformations
Event-driven transformations
Transformations expressed as code
AWS Glue
Event based Server-less ETL engine
AWS Lambda
Trigger-based Code Execution
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
AWS Glue
Job Authoring Job Execution
Auto-generates ETL code
Python/Scala and Apache Spark
Edit, debug, and share
Develop
Serverless execution
Flexible scheduling
Monitoring and alerting
Deploy
Data Catalog
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
ETL when you need it
IngestData sources
Transactions
Web logs /
cookies
ERP
Connected
devices
Consume
Amazon Athena
Amazon EMR
Amazon Redshift
& Spectrum
Amazon
Elasticsearch
Amazon AI/ML/DL
Services
Store
Amazon S3
Process & Analyze
Amazon S3
API
Amazon Kinesis
Firehose
Direct Connect
Snowball
Database
Migration Service
Amazon
QuickSight
Jupyter, Zeppelin,
HUE
Amazon API
Gateway
BI Tools
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Realtime - in-stream processing
IngestData sources
Transactions
Web logs /
cookies
ERP
Connected
devices
Consume
Amazon Athena
Amazon EMR
Amazon Redshift &
Spectrum
Amazon
Elasticsearch
Amazon AI/ML/DL
Services
Store
Amazon S3
Process & Analyze
Spark
Streaming
& Flink
Amazon
Kinesis
Analytics
In stream process
Amazon S3
API
Amazon Kinesis
Firehose
Direct Connect
Snowball
Database
Migration Service
Amazon
QuickSight
Jupyter, Zeppelin,
HUE
Amazon API
Gateway
BI Tools
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
AWS Glue Data Catalog
One per account
Allows you to share metadata between Amazon Athena, Amazon Redshift
Spectrum, EMR & JDBC sources
Serverless
We added a few extensions:
§ Search over metadata for data discovery
§ Manage Connections – JDBC URLs, credentials
§ Classification for identifying and parsing files
§ Versioning of table metadata as schemas evolve and other
metadata are updated
Central Metadata
Catalog for the data
lake
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
AWS Glue Data Catalog Crawlers
Crawlers automatically build your Data Catalog and keep it in sync
Automatically discover new data, extracts schema definitions
• Detect schema changes and version tables
• Detect Hive style partitions on Amazon S3
Built-in classifiers for popular types; custom classifiers using Grok
expression
Run ad hoc or on a schedule; serverless – only pay when crawler runs
Catalogs Your Data
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
AWS Glue Crawler Databases
Amazon
Redshift
Amazon S3
JDBC Connection
Object Connection
• MySQL
• MariaDB
• PostgreSQL
• Amazon Redshift
Create additional custom
classifiers
Amazon
DynamoDB
NoSQL Connection
• Aurora
• Oracle
Built-in classifiers
• Avro
• Parquet
• ORC
• XML
• JSON & JSONPaths
• AWS CloudTrail
• BSON
• Logs
• (Apache (Grok), Linux(Grok), MS(Grok),
Ruby, Redis, and many others)
• Delimited
• (comma, pipe, tab, semicolon)
• < ALWAYS GROWING…>
What can crawlers discover?
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Call the AWS Glue CreateTable API
Create table manually DDL statement (in Amazon Athena or Amazon EMR)
Apache Hive
Metastore
AWS GLUE ETL AWS GLUE
DATA CATALOG
Import from Apache Hive Metastore
Other ways of populating the catalog
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Write once, catalog once, read multiple, ETL Anywhere
IngestData sources
Transactions
Web logs /
cookies
ERP
Connected
devices
Consume
Amazon Athena
Amazon EMR
Amazon Redshift &
Spectrum
Amazon
Elasticsearch
Amazon AI/ML/DL
Services
Data Catalog
Store
Amazon S3
Process & Analyze
Amazon
QuickSight
Jupyter, Zeppelin,
HUE
Amazon API
Gateway
Amazon S3
API
Amazon Kinesis
Firehose
Direct Connect
Snowball
Database
Migration Service
Spark
Streaming
& Flink
Amazon
Kinesis
Analytics
In stream process
BI Tools
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
• Data lakes and data warehouses complement each other
• Loose Coupling, but highly performant
• Storage, analytics, metadata management, etc..
• Choosing the best tool for the job
• Future-proof your analytics
• Elasticity and multiple clusters for dedicated purposes
• Replace capacity planning with a consumption model
• Don’t forget metadata management
Core Tenets
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Thank You!
Adir Sharabi
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
SSID: Guest
Password: Cube@11999
GAME DAY
PUT YOUR SKILLS TO THE TEST
OCT 24
Register now: bit.ly/Floor28GameDay

Contenu connexe

Tendances

Building a Modern Data Architecture on AWS - Webinar
Building a Modern Data Architecture on AWS - WebinarBuilding a Modern Data Architecture on AWS - Webinar
Building a Modern Data Architecture on AWS - WebinarAmazon Web Services
 
Visualize your data in Data Lake with AWS Athena and AWS Quicksight Hands-on ...
Visualize your data in Data Lake with AWS Athena and AWS Quicksight Hands-on ...Visualize your data in Data Lake with AWS Athena and AWS Quicksight Hands-on ...
Visualize your data in Data Lake with AWS Athena and AWS Quicksight Hands-on ...Amazon Web Services
 
Building Data Lakes in the AWS Cloud
Building Data Lakes in the AWS CloudBuilding Data Lakes in the AWS Cloud
Building Data Lakes in the AWS CloudAmazon Web Services
 
The AWS Big Data Platform – Overview
The AWS Big Data Platform – OverviewThe AWS Big Data Platform – Overview
The AWS Big Data Platform – OverviewAmazon Web Services
 
Build Data Lakes & Analytics on AWS: Patterns & Best Practices - BDA305 - Ana...
Build Data Lakes & Analytics on AWS: Patterns & Best Practices - BDA305 - Ana...Build Data Lakes & Analytics on AWS: Patterns & Best Practices - BDA305 - Ana...
Build Data Lakes & Analytics on AWS: Patterns & Best Practices - BDA305 - Ana...Amazon Web Services
 
ABD318_Architecting a data lake with Amazon S3, Amazon Kinesis, AWS Glue and ...
ABD318_Architecting a data lake with Amazon S3, Amazon Kinesis, AWS Glue and ...ABD318_Architecting a data lake with Amazon S3, Amazon Kinesis, AWS Glue and ...
ABD318_Architecting a data lake with Amazon S3, Amazon Kinesis, AWS Glue and ...Amazon Web Services
 
Building Data Lakes That Cost Less and Deliver Results Faster - AWS Online Te...
Building Data Lakes That Cost Less and Deliver Results Faster - AWS Online Te...Building Data Lakes That Cost Less and Deliver Results Faster - AWS Online Te...
Building Data Lakes That Cost Less and Deliver Results Faster - AWS Online Te...Amazon Web Services
 
How Amazon.com uses AWS Analytics
How Amazon.com uses AWS AnalyticsHow Amazon.com uses AWS Analytics
How Amazon.com uses AWS AnalyticsAmazon Web Services
 
AWS November Webinar Series - Architectural Patterns & Best Practices for Big...
AWS November Webinar Series - Architectural Patterns & Best Practices for Big...AWS November Webinar Series - Architectural Patterns & Best Practices for Big...
AWS November Webinar Series - Architectural Patterns & Best Practices for Big...Amazon Web Services
 
What's New with Amazon Redshift ft. Dow Jones (ANT350-R) - AWS re:Invent 2018
What's New with Amazon Redshift ft. Dow Jones (ANT350-R) - AWS re:Invent 2018What's New with Amazon Redshift ft. Dow Jones (ANT350-R) - AWS re:Invent 2018
What's New with Amazon Redshift ft. Dow Jones (ANT350-R) - AWS re:Invent 2018Amazon Web Services
 
Implementazione di una soluzione Data Lake.pdf
Implementazione di una soluzione Data Lake.pdfImplementazione di una soluzione Data Lake.pdf
Implementazione di una soluzione Data Lake.pdfAmazon Web Services
 
Data preparation and transformation - Spin your straw into gold - Tel Aviv Su...
Data preparation and transformation - Spin your straw into gold - Tel Aviv Su...Data preparation and transformation - Spin your straw into gold - Tel Aviv Su...
Data preparation and transformation - Spin your straw into gold - Tel Aviv Su...Amazon Web Services
 
Qubole on AWS - White paper
Qubole on AWS - White paper Qubole on AWS - White paper
Qubole on AWS - White paper Vasu S
 

Tendances (20)

Building a Modern Data Architecture on AWS - Webinar
Building a Modern Data Architecture on AWS - WebinarBuilding a Modern Data Architecture on AWS - Webinar
Building a Modern Data Architecture on AWS - Webinar
 
Building Data Lakes with AWS
Building Data Lakes with AWSBuilding Data Lakes with AWS
Building Data Lakes with AWS
 
Visualize your data in Data Lake with AWS Athena and AWS Quicksight Hands-on ...
Visualize your data in Data Lake with AWS Athena and AWS Quicksight Hands-on ...Visualize your data in Data Lake with AWS Athena and AWS Quicksight Hands-on ...
Visualize your data in Data Lake with AWS Athena and AWS Quicksight Hands-on ...
 
Building Data Lakes in the AWS Cloud
Building Data Lakes in the AWS CloudBuilding Data Lakes in the AWS Cloud
Building Data Lakes in the AWS Cloud
 
The AWS Big Data Platform – Overview
The AWS Big Data Platform – OverviewThe AWS Big Data Platform – Overview
The AWS Big Data Platform – Overview
 
Preparing Data for the Lake
Preparing Data for the LakePreparing Data for the Lake
Preparing Data for the Lake
 
Build Data Lakes & Analytics on AWS: Patterns & Best Practices - BDA305 - Ana...
Build Data Lakes & Analytics on AWS: Patterns & Best Practices - BDA305 - Ana...Build Data Lakes & Analytics on AWS: Patterns & Best Practices - BDA305 - Ana...
Build Data Lakes & Analytics on AWS: Patterns & Best Practices - BDA305 - Ana...
 
ABD318_Architecting a data lake with Amazon S3, Amazon Kinesis, AWS Glue and ...
ABD318_Architecting a data lake with Amazon S3, Amazon Kinesis, AWS Glue and ...ABD318_Architecting a data lake with Amazon S3, Amazon Kinesis, AWS Glue and ...
ABD318_Architecting a data lake with Amazon S3, Amazon Kinesis, AWS Glue and ...
 
Preparing Data for the Lake
Preparing Data for the LakePreparing Data for the Lake
Preparing Data for the Lake
 
Preparing Data for the Lake
Preparing Data for the LakePreparing Data for the Lake
Preparing Data for the Lake
 
How Amazon uses AWS Analytics
How Amazon uses AWS AnalyticsHow Amazon uses AWS Analytics
How Amazon uses AWS Analytics
 
Building Data Lakes That Cost Less and Deliver Results Faster - AWS Online Te...
Building Data Lakes That Cost Less and Deliver Results Faster - AWS Online Te...Building Data Lakes That Cost Less and Deliver Results Faster - AWS Online Te...
Building Data Lakes That Cost Less and Deliver Results Faster - AWS Online Te...
 
Using Data Lakes
Using Data LakesUsing Data Lakes
Using Data Lakes
 
How Amazon.com uses AWS Analytics
How Amazon.com uses AWS AnalyticsHow Amazon.com uses AWS Analytics
How Amazon.com uses AWS Analytics
 
AWS November Webinar Series - Architectural Patterns & Best Practices for Big...
AWS November Webinar Series - Architectural Patterns & Best Practices for Big...AWS November Webinar Series - Architectural Patterns & Best Practices for Big...
AWS November Webinar Series - Architectural Patterns & Best Practices for Big...
 
What's New with Amazon Redshift ft. Dow Jones (ANT350-R) - AWS re:Invent 2018
What's New with Amazon Redshift ft. Dow Jones (ANT350-R) - AWS re:Invent 2018What's New with Amazon Redshift ft. Dow Jones (ANT350-R) - AWS re:Invent 2018
What's New with Amazon Redshift ft. Dow Jones (ANT350-R) - AWS re:Invent 2018
 
Implementazione di una soluzione Data Lake.pdf
Implementazione di una soluzione Data Lake.pdfImplementazione di una soluzione Data Lake.pdf
Implementazione di una soluzione Data Lake.pdf
 
Data preparation and transformation - Spin your straw into gold - Tel Aviv Su...
Data preparation and transformation - Spin your straw into gold - Tel Aviv Su...Data preparation and transformation - Spin your straw into gold - Tel Aviv Su...
Data preparation and transformation - Spin your straw into gold - Tel Aviv Su...
 
Qubole on AWS - White paper
Qubole on AWS - White paper Qubole on AWS - White paper
Qubole on AWS - White paper
 
Data Warehouses and Data Lakes
Data Warehouses and Data LakesData Warehouses and Data Lakes
Data Warehouses and Data Lakes
 

Similaire à Building Data Lake on AWS | AWS Floor28

Building a Data Lake in Amazon S3 & Amazon Glacier (STG401-R1) - AWS re:Inven...
Building a Data Lake in Amazon S3 & Amazon Glacier (STG401-R1) - AWS re:Inven...Building a Data Lake in Amazon S3 & Amazon Glacier (STG401-R1) - AWS re:Inven...
Building a Data Lake in Amazon S3 & Amazon Glacier (STG401-R1) - AWS re:Inven...Amazon Web Services
 
AWS Data Lake: data analysis @ scale
AWS Data Lake: data analysis @ scaleAWS Data Lake: data analysis @ scale
AWS Data Lake: data analysis @ scaleAmazon Web Services
 
Big Data on AWS - To infinity and beyond! - Tel Aviv Summit 2018
Big Data on AWS - To infinity and beyond! - Tel Aviv Summit 2018Big Data on AWS - To infinity and beyond! - Tel Aviv Summit 2018
Big Data on AWS - To infinity and beyond! - Tel Aviv Summit 2018Amazon Web Services
 
Big Data Analytics Architectural Patterns and Best Practices (ANT201-R1) - AW...
Big Data Analytics Architectural Patterns and Best Practices (ANT201-R1) - AW...Big Data Analytics Architectural Patterns and Best Practices (ANT201-R1) - AW...
Big Data Analytics Architectural Patterns and Best Practices (ANT201-R1) - AW...Amazon Web Services
 
Big Data@Scale_AWSPSSummit_Singapore
Big Data@Scale_AWSPSSummit_SingaporeBig Data@Scale_AWSPSSummit_Singapore
Big Data@Scale_AWSPSSummit_SingaporeAmazon Web Services
 
AWS Lake Formation Deep Dive
AWS Lake Formation Deep DiveAWS Lake Formation Deep Dive
AWS Lake Formation Deep DiveCobus Bernard
 
Analyze your Data Lake, Fast @ Any Scale - AWS Online Tech Talks
Analyze your Data Lake, Fast @ Any Scale - AWS Online Tech TalksAnalyze your Data Lake, Fast @ Any Scale - AWS Online Tech Talks
Analyze your Data Lake, Fast @ Any Scale - AWS Online Tech TalksAmazon Web Services
 
Build Data Lakes and Analytics on AWS: Patterns & Best Practices
Build Data Lakes and Analytics on AWS: Patterns & Best PracticesBuild Data Lakes and Analytics on AWS: Patterns & Best Practices
Build Data Lakes and Analytics on AWS: Patterns & Best PracticesAmazon Web Services
 
Building Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWSBuilding Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWSAmazon Web Services
 
Building Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWSBuilding Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWSAmazon Web Services
 
Building a Data Lake for Your Enterprise, ft. Sysco (STG309) - AWS re:Invent ...
Building a Data Lake for Your Enterprise, ft. Sysco (STG309) - AWS re:Invent ...Building a Data Lake for Your Enterprise, ft. Sysco (STG309) - AWS re:Invent ...
Building a Data Lake for Your Enterprise, ft. Sysco (STG309) - AWS re:Invent ...Amazon Web Services
 
雲上打造資料湖 (Data Lake):智能化駕馭商機 (Level 300)
雲上打造資料湖 (Data Lake):智能化駕馭商機 (Level 300)雲上打造資料湖 (Data Lake):智能化駕馭商機 (Level 300)
雲上打造資料湖 (Data Lake):智能化駕馭商機 (Level 300)Amazon Web Services
 
Get to Know Your Customers - Build and Innovate with a Modern Data Architecture
Get to Know Your Customers - Build and Innovate with a Modern Data ArchitectureGet to Know Your Customers - Build and Innovate with a Modern Data Architecture
Get to Know Your Customers - Build and Innovate with a Modern Data ArchitectureAmazon Web Services
 
SRV309 AWS Purpose-Built Database Strategy: The Right Tool for the Right Job
 SRV309 AWS Purpose-Built Database Strategy: The Right Tool for the Right Job SRV309 AWS Purpose-Built Database Strategy: The Right Tool for the Right Job
SRV309 AWS Purpose-Built Database Strategy: The Right Tool for the Right JobAmazon Web Services
 
Data Catalog & ETL - Glue & Athena
Data Catalog & ETL - Glue & AthenaData Catalog & ETL - Glue & Athena
Data Catalog & ETL - Glue & AthenaAmazon Web Services
 
Data Catalog & ETL - Glue & Athena
Data Catalog & ETL - Glue & AthenaData Catalog & ETL - Glue & Athena
Data Catalog & ETL - Glue & AthenaAmazon Web Services
 
Build and Innovate with a Modern Data Architecture
Build and Innovate with a Modern Data ArchitectureBuild and Innovate with a Modern Data Architecture
Build and Innovate with a Modern Data ArchitectureAmazon Web Services
 
BI & Analytics - A Datalake on AWS
BI & Analytics - A Datalake on AWSBI & Analytics - A Datalake on AWS
BI & Analytics - A Datalake on AWSAmazon Web Services
 

Similaire à Building Data Lake on AWS | AWS Floor28 (20)

Data_Analytics_and_AI_ML
Data_Analytics_and_AI_MLData_Analytics_and_AI_ML
Data_Analytics_and_AI_ML
 
Building a Data Lake in Amazon S3 & Amazon Glacier (STG401-R1) - AWS re:Inven...
Building a Data Lake in Amazon S3 & Amazon Glacier (STG401-R1) - AWS re:Inven...Building a Data Lake in Amazon S3 & Amazon Glacier (STG401-R1) - AWS re:Inven...
Building a Data Lake in Amazon S3 & Amazon Glacier (STG401-R1) - AWS re:Inven...
 
AWS Data Lake: data analysis @ scale
AWS Data Lake: data analysis @ scaleAWS Data Lake: data analysis @ scale
AWS Data Lake: data analysis @ scale
 
Big Data on AWS - To infinity and beyond! - Tel Aviv Summit 2018
Big Data on AWS - To infinity and beyond! - Tel Aviv Summit 2018Big Data on AWS - To infinity and beyond! - Tel Aviv Summit 2018
Big Data on AWS - To infinity and beyond! - Tel Aviv Summit 2018
 
Big Data Analytics Architectural Patterns and Best Practices (ANT201-R1) - AW...
Big Data Analytics Architectural Patterns and Best Practices (ANT201-R1) - AW...Big Data Analytics Architectural Patterns and Best Practices (ANT201-R1) - AW...
Big Data Analytics Architectural Patterns and Best Practices (ANT201-R1) - AW...
 
Big Data@Scale_AWSPSSummit_Singapore
Big Data@Scale_AWSPSSummit_SingaporeBig Data@Scale_AWSPSSummit_Singapore
Big Data@Scale_AWSPSSummit_Singapore
 
AWS Lake Formation Deep Dive
AWS Lake Formation Deep DiveAWS Lake Formation Deep Dive
AWS Lake Formation Deep Dive
 
Analyze your Data Lake, Fast @ Any Scale - AWS Online Tech Talks
Analyze your Data Lake, Fast @ Any Scale - AWS Online Tech TalksAnalyze your Data Lake, Fast @ Any Scale - AWS Online Tech Talks
Analyze your Data Lake, Fast @ Any Scale - AWS Online Tech Talks
 
Build Data Lakes and Analytics on AWS: Patterns & Best Practices
Build Data Lakes and Analytics on AWS: Patterns & Best PracticesBuild Data Lakes and Analytics on AWS: Patterns & Best Practices
Build Data Lakes and Analytics on AWS: Patterns & Best Practices
 
Building Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWSBuilding Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWS
 
Building Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWSBuilding Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWS
 
Building a Data Lake for Your Enterprise, ft. Sysco (STG309) - AWS re:Invent ...
Building a Data Lake for Your Enterprise, ft. Sysco (STG309) - AWS re:Invent ...Building a Data Lake for Your Enterprise, ft. Sysco (STG309) - AWS re:Invent ...
Building a Data Lake for Your Enterprise, ft. Sysco (STG309) - AWS re:Invent ...
 
雲上打造資料湖 (Data Lake):智能化駕馭商機 (Level 300)
雲上打造資料湖 (Data Lake):智能化駕馭商機 (Level 300)雲上打造資料湖 (Data Lake):智能化駕馭商機 (Level 300)
雲上打造資料湖 (Data Lake):智能化駕馭商機 (Level 300)
 
Get to Know Your Customers - Build and Innovate with a Modern Data Architecture
Get to Know Your Customers - Build and Innovate with a Modern Data ArchitectureGet to Know Your Customers - Build and Innovate with a Modern Data Architecture
Get to Know Your Customers - Build and Innovate with a Modern Data Architecture
 
SRV309 AWS Purpose-Built Database Strategy: The Right Tool for the Right Job
 SRV309 AWS Purpose-Built Database Strategy: The Right Tool for the Right Job SRV309 AWS Purpose-Built Database Strategy: The Right Tool for the Right Job
SRV309 AWS Purpose-Built Database Strategy: The Right Tool for the Right Job
 
Data Catalog & ETL - Glue & Athena
Data Catalog & ETL - Glue & AthenaData Catalog & ETL - Glue & Athena
Data Catalog & ETL - Glue & Athena
 
Data Catalog & ETL - Glue & Athena
Data Catalog & ETL - Glue & AthenaData Catalog & ETL - Glue & Athena
Data Catalog & ETL - Glue & Athena
 
Build and Innovate with a Modern Data Architecture
Build and Innovate with a Modern Data ArchitectureBuild and Innovate with a Modern Data Architecture
Build and Innovate with a Modern Data Architecture
 
BI & Analytics - A Datalake on AWS
BI & Analytics - A Datalake on AWSBI & Analytics - A Datalake on AWS
BI & Analytics - A Datalake on AWS
 
BI & Analytics
BI & AnalyticsBI & Analytics
BI & Analytics
 

Plus de Amazon Web Services

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Amazon Web Services
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Amazon Web Services
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateAmazon Web Services
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSAmazon Web Services
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Amazon Web Services
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Amazon Web Services
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...Amazon Web Services
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsAmazon Web Services
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareAmazon Web Services
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSAmazon Web Services
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAmazon Web Services
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareAmazon Web Services
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWSAmazon Web Services
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckAmazon Web Services
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without serversAmazon Web Services
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...Amazon Web Services
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceAmazon Web Services
 

Plus de Amazon Web Services (20)

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 

Building Data Lake on AWS | AWS Floor28

  • 1. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. SSID: Guest Password: Cube@11999 Building Data Lake on AWS Adir Sharabi Solutions Architect, Amazon Web Services
  • 2. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Floor28 Agenda GameDay 24 Oct Enterprise IT Day 23 Oct Builders Day AppSync, Alexa & IoT 22 Oct Big Data Day 14 Oct ML & DL Day 15 Oct DevOps Day 16 Oct DevOps Day 17 Oct Technical Sessions Serverless Data Workshop Big Data UG Meetup Technical Sessions SageMaker Workshop ML&DL Meetup Technical Sessions K8s Workshop DevOps Meetup Technical Sessions Spot Workshop Databases Day 18 Oct Technical Sessions Serverless Workshop Virtual assistants UG Meetup Technical Sessions PyTorch Meetup Technical Sessions CDK Workshop AWS IL UG Meetup Builders Day Serverless backend 21 Oct Technical Sessions
  • 3. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Big Data Day Agenda # Time Title Speaker 1 9:30 - 10:15 Building Data Lake on AWS Adir Sharabi 2 10:30 - 11:15 Store once, query thrice: Introduction to query engines on AWS Daniel Haviv 3 11:30 – 12:15 Introduction to Real-Time Streaming Analytics - Amazon Kinesis State Of Union Roy Ben Alta 4 12:30 - 13:15 From data to insights Orit Alul 5 15:00 – 18:00 Serverless Data Processing Workshop Adir Sharabi 6 18:00 – 20:00 Big Data User Group Meetup
  • 4. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Documents and files Streams Multiple sources and formats… and growing everyday Your Data Sources Records Amazon RDS Amazon DynamoDB AWS IoT On Premises databases Spreadsheets Infrastructure logs Clickstream data Mobile app data Social media data Amazon Redshift Device data Sensor data ERP WEB Clickstream Mobile Apps
  • 5. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Data Challenges Analysts Applications Data Scientists Business Users API Access BI Tools Notebooks Multiple consumers and requirements 1990 2000 2010 2020 Generated Data Available for Analysis Data Visibility Multiple Access Mechanisms
  • 6. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Traditionally, Analytics Used to Look Like This OLTP ERP CRM LOB Data Warehouse Business Intelligence Relational data Schema defined prior to data load TBs-PBs Scale Operational reporting and ad hoc Large initial capex + $10K–$50K/TB/Year
  • 7. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Data Lakes Extend the Traditional Approach OLTP ERP CRM LOB Data Warehouse Business Intelligence Data Lake 1001100001001010111001 0101011100101010000101 1111011010 0011110010110010110 0100011000010 Devices Web Sensors Social Catalog Machine Learning DW Queries Big data processing Interactive Real-time Relational and non-relational data Schema defined during analysis Scale storage and compute independently Diverse analytical engines to gain insights Designed for low-cost storage and analytics
  • 8. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Snowball Snowmobile Kinesis Data Firehose Kinesis Data Streams Amazon S3 Redshift EMR Athena Kinesis Elasticsearch Service Amazon S3 as Data Lakes Storage Layer Kinesis Video Streams AI Services Many ways to bring all kinds of data Unmatched durability and availability at EB scale Best security, compliance, and audit capabilities Integration with Big Data Tools Run any analytics on the same data without movement Cost effective - Store data at $0.023 / GB / Month
  • 9. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Store Simplified Big Data Pipeline Amazon S3 Ingest Process & Analyze Consume Data sources Transactions Web logs / cookies ERP Connected devices
  • 10. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Lots of ingestion tools IngestData sources Transactions Web logs / cookies ERP Connected devices Process & Analyze Consume Store Amazon S3 Amazon S3 API Amazon Kinesis Firehose Direct Connect Snowball Database Migration Service
  • 11. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Real-time data movement and Data Lakes on AWS Amazon Kinesis Data Firehose Amazon Kinesis Data Streams Kinesis Agent Apache Kafka AWS SDK LOG4J Flume Fluentd AWS Mobile SDK Kinesis Producer Library Amazon S3
  • 12. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Lots of ingestion tools IngestData sources Transactions Web logs / cookies ERP Connected devices Process & Analyze Consume Store Amazon S3 Amazon S3 API Amazon Kinesis Firehose Direct Connect Snowball Database Migration Service
  • 13. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Variety of data processing tools IngestData sources Transactions Web logs / cookies ERP Connected devices Consume Amazon Athena Amazon EMR Amazon Redshift & Spectrum Amazon Elasticsearch Amazon AI/ML/DL Services Store Amazon S3 Process & Analyze Amazon S3 API Amazon Kinesis Firehose Direct Connect Snowball Database Migration Service
  • 14. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Interactive query service to analyze data in Amazon S3 using standard SQL No infrastructure to set up or manage and no data to load Ability to run SQL queries on data archived in Amazon Glacier (coming soon) Amazon Athena – interactive analysis $ SQL Query instantly Zero setup cost; just point to Amazon S3 and start querying Pay per query Pay only for queries run; save 30%–90% on per- query costs through compression Open ANSI SQL interface, JDBC/ODBC drivers, multiple formats, compression types, and complex joins and data types Easy Serverless: zero infrastructure, zero administration Integrated with Amazon QuickSight
  • 15. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Analytics and ML at scale 19 open-source projects: Apache Hadoop, Spark, HBase, Presto, and more Enterprise-grade security Amazon EMR – big data processing Latest versions Updated with the latest open source frameworks within 30 days of release Low cost Flexible billing with per- second billing, Amazon EC2 Spot, Reserved Instances, and Auto Scaling to reduce costs 50%-80% Use Amazon S3 storage Process data directly in the Amazon S3 data lake securely with high performance using the EMRFS connector Easy Launch fully managed Hadoop & Spark in minutes; no cluster setup, node provisioning, cluster tuning Data Lake 100110000100101011100 1010101110010101000 00111100101100101 010001100001 $
  • 16. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Fast, powerful, simple, and fully managed data warehouse at 1/10 the cost Massively parallel, scale from gigabytes to petabytes Amazon Redshift – data warehousing Fast at scale Columnar storage technology to improve I/O efficiency and scale query performance Inexpensive As low as $1,000 per terabyte per year, 1/10 the cost of traditional data warehouse solutions; start at $0.25 per hour Open file formats Secure Audit everything; encrypt data end-to- end; extensive certification and compliance Analyze optimized data formats on the latest SSD, and all open data formats in Amazon S3 $
  • 17. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Extend the data warehouse to exabytes of data in Amazon S3 data lake Amazon Redshift Spectrum Amazon S3 Data Lake Amazon Redshift data Amazon Redshift Spectrum query engine Exabyte Redshift SQL queries against Amazon S3 Join data across Redshift and Amazon S3 Scale compute and storage separately Stable query performance and unlimited concurrency CSV, ORC, Grok, Avro, & Parquet data formats Pay only for the amount of data scanned
  • 18. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon Redshift JDBC / ODBC ... 1 2 3 4 N Redshift Spectrum Scale-out serverless compute COPY commands Hot data Query directly on data lake
  • 19. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Multiple ways to consume the data IngestData sources Transactions Web logs / cookies ERP Connected devices Consume Amazon Athena Amazon EMR Amazon Redshift & Spectrum Amazon Elasticsearch Amazon AI/ML/DL Services Store Amazon S3 Process & Analyze Amazon S3 API Amazon Kinesis Firehose Direct Connect Snowball Database Migration Service Amazon QuickSight Jupyter, Zeppelin, HUE Amazon API Gateway BI Tools
  • 20. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Because data is NEVER perfect Amazon EMR Spark and Hive running on EMR Clean Transform Concatenate Convert to better formats Schedule transformations Event-driven transformations Transformations expressed as code AWS Glue Event based Server-less ETL engine AWS Lambda Trigger-based Code Execution
  • 21. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. AWS Glue Job Authoring Job Execution Auto-generates ETL code Python/Scala and Apache Spark Edit, debug, and share Develop Serverless execution Flexible scheduling Monitoring and alerting Deploy Data Catalog
  • 22. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. ETL when you need it IngestData sources Transactions Web logs / cookies ERP Connected devices Consume Amazon Athena Amazon EMR Amazon Redshift & Spectrum Amazon Elasticsearch Amazon AI/ML/DL Services Store Amazon S3 Process & Analyze Amazon S3 API Amazon Kinesis Firehose Direct Connect Snowball Database Migration Service Amazon QuickSight Jupyter, Zeppelin, HUE Amazon API Gateway BI Tools
  • 23. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Realtime - in-stream processing IngestData sources Transactions Web logs / cookies ERP Connected devices Consume Amazon Athena Amazon EMR Amazon Redshift & Spectrum Amazon Elasticsearch Amazon AI/ML/DL Services Store Amazon S3 Process & Analyze Spark Streaming & Flink Amazon Kinesis Analytics In stream process Amazon S3 API Amazon Kinesis Firehose Direct Connect Snowball Database Migration Service Amazon QuickSight Jupyter, Zeppelin, HUE Amazon API Gateway BI Tools
  • 24. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. AWS Glue Data Catalog One per account Allows you to share metadata between Amazon Athena, Amazon Redshift Spectrum, EMR & JDBC sources Serverless We added a few extensions: § Search over metadata for data discovery § Manage Connections – JDBC URLs, credentials § Classification for identifying and parsing files § Versioning of table metadata as schemas evolve and other metadata are updated Central Metadata Catalog for the data lake
  • 25. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. AWS Glue Data Catalog Crawlers Crawlers automatically build your Data Catalog and keep it in sync Automatically discover new data, extracts schema definitions • Detect schema changes and version tables • Detect Hive style partitions on Amazon S3 Built-in classifiers for popular types; custom classifiers using Grok expression Run ad hoc or on a schedule; serverless – only pay when crawler runs Catalogs Your Data
  • 26. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. AWS Glue Crawler Databases Amazon Redshift Amazon S3 JDBC Connection Object Connection • MySQL • MariaDB • PostgreSQL • Amazon Redshift Create additional custom classifiers Amazon DynamoDB NoSQL Connection • Aurora • Oracle Built-in classifiers • Avro • Parquet • ORC • XML • JSON & JSONPaths • AWS CloudTrail • BSON • Logs • (Apache (Grok), Linux(Grok), MS(Grok), Ruby, Redis, and many others) • Delimited • (comma, pipe, tab, semicolon) • < ALWAYS GROWING…> What can crawlers discover?
  • 27. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Call the AWS Glue CreateTable API Create table manually DDL statement (in Amazon Athena or Amazon EMR) Apache Hive Metastore AWS GLUE ETL AWS GLUE DATA CATALOG Import from Apache Hive Metastore Other ways of populating the catalog
  • 28. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Write once, catalog once, read multiple, ETL Anywhere IngestData sources Transactions Web logs / cookies ERP Connected devices Consume Amazon Athena Amazon EMR Amazon Redshift & Spectrum Amazon Elasticsearch Amazon AI/ML/DL Services Data Catalog Store Amazon S3 Process & Analyze Amazon QuickSight Jupyter, Zeppelin, HUE Amazon API Gateway Amazon S3 API Amazon Kinesis Firehose Direct Connect Snowball Database Migration Service Spark Streaming & Flink Amazon Kinesis Analytics In stream process BI Tools
  • 29. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. • Data lakes and data warehouses complement each other • Loose Coupling, but highly performant • Storage, analytics, metadata management, etc.. • Choosing the best tool for the job • Future-proof your analytics • Elasticity and multiple clusters for dedicated purposes • Replace capacity planning with a consumption model • Don’t forget metadata management Core Tenets
  • 30. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Thank You! Adir Sharabi
  • 31. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. SSID: Guest Password: Cube@11999 GAME DAY PUT YOUR SKILLS TO THE TEST OCT 24 Register now: bit.ly/Floor28GameDay