SlideShare une entreprise Scribd logo
1  sur  32
Télécharger pour lire hors ligne
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
AWS re:INVENT
Big Data: Data Lakes and Data
Oceans
L e x C r o s e t t , P r i n c i p a l S o l u t i o n s A r c h i t e c t
S a j e e M a t h e w , P r i n c i p a l S o l u t i o n s A r c h i t e c t
E r i c k D a m e , E n t e r p r i s e S o l u t i o n s A r c h i t e c t
J o h n M a l l o r y , B u s i n e s s D e v e l o p m e n t M a n a g e r
November 28, 2017
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
What to Expect from the Workshop
• Data lake and analytics review (30 min.)
• Set up a data lake using an AWS solution (30 min.)
• Add information to the data lake (30 min.)
• Perform lightweight analysis with AWS big data tools (1 hour)
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Introduction to Data Lake Concepts
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Unlocking Data
Most companies and organizations are embarking on
ambitious innovation initiatives to unlock their data
The data already exists but goes unused or is locked
away from complimentary data sets in isolated data silos
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Challenges with Legacy Data Architectures
• Can’t move data across silos
• Can’t deal with dynamic data and real-time processing
• Can’t deal with format diversity and change rate
• Complex ETL processes
• Difficult to find people with adequate skills to
configure and manage these systems
• Can’t integrate with the explosion of available social
and behavior tracking data
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Legacy Data Architectures Exist as Isolated Data Silos
Hadoop
Cluster
SQL
Database
Data
Warehouse
Appliance
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Legacy Data Architectures Are Monolithic
Multiple layers of functionality all
on a single cluster
CPU
Memory
HDFS Storage
CPU
Memory
HDFS Storage
CPU
Memory
HDFS Storage
Hadoop Master Node
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Evolution of Data Architectures
2009: Decoupled EMR architecture
CPU
Memory
Hadoop Master Node
CPU
Memory
CPU
Memory
Improvements
• Decoupled storage and compute
• Scale CPU and memory resources
independently and up and down
• Only pay for the 500 TB dataset (not 3X)
• Multi-physical facility replication via Amazon S3
• Multiple clusters can run in parallel against
shared data in Amazon S3
• Each job gets its own optimized cluster. For
example, Spark on memory intensive, Hive on
CPU intensive, HBase on I/O intensive, and so
on
Constraints
• Still have a cluster to provision and manage
• Must expose EMR cluster to SQL users via
Hive, Presto, and so on
S3 as HDFS
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Evolution of Data Architectures
2012: Amazon Redshift—cloud DW
Improvements
Constraints
• Still have to load data into a schema
Leader node
Compute
node
10 GigE
(HPC)
Ingestion
Backup
Restore
Customer VPC
Internal VPC
BI tools SQL clientsAnalytics tools
Compute
node
Compute
node
JDBC/ODBC
• Automated installation, patching, backups
• No servers to manage and maintain
• MPP columnar relational database
• $1,000/TB/year
• Accessible to any ODBC or JDBC BI Tool
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Enter Data Lake Architectures
Data lake is a new and increasingly
popular architecture to store and analyze
massive volumes and heterogeneous
types of data
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Benefits of a Data Lake—All Data is in One Place
Analyze all of your data,
from all of your sources, in one stored
location
“Why is the data distributed in
many locations? Where is the
single source of truth?”
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Benefits of a Data Lake—Quick Ingest
Quickly ingest data
without needing to force it into a
predefined schema
“How can I collect data quickly
from various sources and store
it efficiently?”
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Benefits of a Data Lake—Storage vs. Compute
Separating your storage and compute
allows you to scale each component as
required
“How can I scale up with the
volume of data being generated?”
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Benefits of a Data Lake—Schema on Read
“Is there a way I can apply multiple
analytics and processing frameworks
to the same data?”
A data lake enables ad-hoc
analysis by applying schemas
on read, not write
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
AWS Approach to Data Lake
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Amazon S3 is the Data Lake
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Designed for 11 9s
of durability
Designed for
99.99% availability
Durable Available High performance
 Multiple upload
 Range GET
 Scalable throughput
 Store as much as you need
 Scale storage and compute
independently
 No minimum usage commitments
Scalable
 Amazon EMR
 Amazon Redshift Spectrum
 Amazon DynamoDB
 Amazon Athena
 AWS Glue
 Amazon Rekognition
 Amazon Macie
Integrated
 Simple REST API
 AWS SDKs
 Simple management tools
 Event notification
 Lifecycle policies
Easy to use
Why Amazon S3 for a Data Lake?
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Benefits of an AWS Amazon S3 Data Lake
Fixed cluster data lake AWS Amazon S3 data lake
• Limited to only the single tool contained
on the cluster (for example, Hadoop or
data warehouse or Cassandra). Use
cases and ecosystem tools change
rapidly.
• Expensive to add nodes to add storage
capacity
• Expensive to replicate data against
node loss
• Complexity in scaling local storage
capacity
• Long refresh cycles to add additional
storage equipment
• Decouple storage and compute by
making S3 object based storage, not a
fixed tool to manage the data lake
• Flexibility to use any and all tools in the
ecosystem. The right tool for the job.
• Catalog, transform, and query in place
• Future-proof your architecture. As new
use cases and tools emerge you can
plug and play current best of breed.
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Building a Data Lake on AWS
Kinesis Firehose
Athena
Query Service
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Processing and Analytics
Real-time Batch
AI and Predictive
BI and Data Visualization
Transactional and
RDBMS
AWS Lambda
Apache Storm
on EMR
Apache Flink
on EMR
Spark Streaming
on EMR
Elasticsearch
Service
Kinesis Analytics,
Kinesis Streams
DynamoDB
NoSQL DB Relational Database
Aurora
EMR
Hadoop, Spark,
Presto
Amazon Redshift
Data Warehouse
Amazon Athena
Query Service
Amazon Lex
Speech
recognition
Amazon
Rekognition
Amazon Polly
Text to speech
Machine Learning
Predictive analytics
Kinesis Streams
and Firehose
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Further Evolution of Data Lake Architectures
Today: clusterless Improvements
• No cluster/infrastructure to manage
• Business users and analysts can write SQL
without having to provision a cluster or touch
infrastructure
• Pay by the query
• Zero administration
• Process data where it lives
Constraints
• Limited to SQL, Hive, and Spark jobs today
• More frameworks to come!
SQL Interface in web
browser
Amazon Athena for SQL
S3 Data Lake
AWS Glue for ETL
S3 Data Lake
Spark and Hive
Interface in web
browser
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Open Data Formats—Free Your Data!
• Parquet, ORC, Avro, JSON, CSV, others
• Allows multiple analytic tools to operate on same data
• Proprietary data = countdown to next migration
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Use an Optimal Combination of Highly Interoperable Services
Amazon
QuickSight
4. Visualize all your data
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Summary of AWS Analytics, Database, and AI Tools
Amazon Redshift
Enterprise Data Warehouse
Amazon EMR
Hadoop/Spark
Amazon Athena
Clusterless SQL
AWS Glue
Clusterless ETL
Amazon Aurora
Managed Relational Database
Amazon Machine Learning
Predictive Analytics
Amazon QuickSight
Business Intelligence/Visualization
Amazon Elasticsearch Service
Elasticsearch
Amazon ElastiCache
Redis In-memory Datastore
Amazon DynamoDB
Managed NoSQL Database
Amazon Rekognition
Deep Learning-based Image Recognition
Amazon Lex
Voice or Text Chatbots
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
AWS Innovations to Produce
More Efficient Data Lake Architectures
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
AWS Direct Connect AWS Snowball ISV Connectors
Amazon Kinesis
Firehose
Amazon S3 Transfer
Acceleration
AWS Storage
Gateway
Data Ingestion into Amazon S3
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Amazon Kinesis Firehose
Load massive volumes of streaming data into Amazon S3, Amazon
Redshift, and Amazon Elasticsearch
Zero administration: capture and deliver streaming data into Amazon S3, Amazon
Redshift, and Amazon Elasticsearch without writing an application or managing
infrastructure
Direct-to-data store integration: batch, compress, and encrypt streaming data for
delivery into data destinations in as little as 60 seconds using simple configurations
Seamless elasticity: seamlessly scales to match data throughput without intervention
Capture and submit
streaming data to Firehose
Analyze streaming data using your
favorite BI tools
Firehose loads streaming data
continuously into Amazon S3, Amazon
Redshift, and Amazon Elasticsearch
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Encryption ComplianceSecurity
 Identity and access
Management (IAM) policies
 Bucket policies
 Access Control Lists (ACLs)
 Private VPC endpoints to
Amazon S3
 Amazon S3 object tagging to
manage access policies
 SSL endpoints
 Server-side encryption
(SSE-S3)
 S3 server-side
encryption with
provided keys (SSE-C,
SSE-KMS)
 Client-side encryption
 Buckets access logs
 Lifecycle management
policies
 Access Control Lists
(ACLs)
 Versioning and MFA
deletes
 Certifications—HIPAA,
PCI, SOC 1/2/3, etc.
Implement the Right Cloud Security Controls
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Data Lake Best Practices
• Use Amazon S3 as the storage repository for your data lake, instead of a
Hadoop cluster or data warehouse
• Decoupled storage and compute is cheaper and more efficient to operate
• Decoupled storage and compute allow us to evolve to clusterless
architectures like Amazon Athena and AWS Glue
• Do not build data silos in Hadoop or the Enterprise DW
• Use granular encryption, roles, and access controls to build a secure,
multi-tenant centralized data platform
• Gain flexibility to use all the analytics tools in the ecosystem around
Amazon S3 and future-proof the architecture
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Workshop Process
• Log in to your AWS account
• Apply account credit
• Open the workshop guide
• Find and run the Data Lake Foundation on AWS
• Once complete, load the Behavioral Risk Factor Surveillance
system (BRFSS) data into Amazon S3 following the workshop
instructions
• Follow the analysis steps in the workshop guide
• Run Delete Stack to remove resources when done
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Questions?
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
THANK YOU!

Contenu connexe

Tendances

The Four V’s of Big Data Testing: Variety, Volume, Velocity, and Veracity
The Four V’s of Big Data Testing: Variety, Volume, Velocity, and VeracityThe Four V’s of Big Data Testing: Variety, Volume, Velocity, and Veracity
The Four V’s of Big Data Testing: Variety, Volume, Velocity, and VeracityTechWell
 
The Future Of Big Data
The Future Of Big DataThe Future Of Big Data
The Future Of Big DataMatthew Dennis
 
Big data-analytics-cpe8035
Big data-analytics-cpe8035Big data-analytics-cpe8035
Big data-analytics-cpe8035Neelam Rawat
 
Handling and Processing Big Data
Handling and Processing Big DataHandling and Processing Big Data
Handling and Processing Big DataUmair Shafique
 
Hadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsHadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsLynn Langit
 
big data and machine learning ppt.pptx
big data and machine learning ppt.pptxbig data and machine learning ppt.pptx
big data and machine learning ppt.pptxNATASHABANO
 
Introduction to Graph Databases
Introduction to Graph DatabasesIntroduction to Graph Databases
Introduction to Graph DatabasesDataStax
 
Relational databases vs Non-relational databases
Relational databases vs Non-relational databasesRelational databases vs Non-relational databases
Relational databases vs Non-relational databasesJames Serra
 
ChatGPT for Data Science Projects
ChatGPT for Data Science ProjectsChatGPT for Data Science Projects
ChatGPT for Data Science ProjectsAjitesh Kumar
 
Hadoop File system (HDFS)
Hadoop File system (HDFS)Hadoop File system (HDFS)
Hadoop File system (HDFS)Prashant Gupta
 
NodeJS & Socket IO on Microsoft Azure Cloud Web Sites - DWX 2014
NodeJS & Socket IO on Microsoft Azure Cloud Web Sites - DWX 2014NodeJS & Socket IO on Microsoft Azure Cloud Web Sites - DWX 2014
NodeJS & Socket IO on Microsoft Azure Cloud Web Sites - DWX 2014Stéphane ESCANDELL
 
FUTURE OF DATA SCIENCE IN INDIA
FUTURE OF DATA SCIENCE IN INDIAFUTURE OF DATA SCIENCE IN INDIA
FUTURE OF DATA SCIENCE IN INDIAkaranramani4
 
Scaling and Modernizing Data Platform with Databricks
Scaling and Modernizing Data Platform with DatabricksScaling and Modernizing Data Platform with Databricks
Scaling and Modernizing Data Platform with DatabricksDatabricks
 
Üzleti intelligencia és adatvizualizáció
Üzleti intelligencia és adatvizualizációÜzleti intelligencia és adatvizualizáció
Üzleti intelligencia és adatvizualizációMihály Minkó
 
Breaking down an Industrial IoT reference architecture.pptx
Breaking down an Industrial IoT reference architecture.pptxBreaking down an Industrial IoT reference architecture.pptx
Breaking down an Industrial IoT reference architecture.pptxNeel Sendas
 

Tendances (20)

The Four V’s of Big Data Testing: Variety, Volume, Velocity, and Veracity
The Four V’s of Big Data Testing: Variety, Volume, Velocity, and VeracityThe Four V’s of Big Data Testing: Variety, Volume, Velocity, and Veracity
The Four V’s of Big Data Testing: Variety, Volume, Velocity, and Veracity
 
The Future Of Big Data
The Future Of Big DataThe Future Of Big Data
The Future Of Big Data
 
Big data-analytics-cpe8035
Big data-analytics-cpe8035Big data-analytics-cpe8035
Big data-analytics-cpe8035
 
Handling and Processing Big Data
Handling and Processing Big DataHandling and Processing Big Data
Handling and Processing Big Data
 
Hadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsHadoop MapReduce Fundamentals
Hadoop MapReduce Fundamentals
 
big data and machine learning ppt.pptx
big data and machine learning ppt.pptxbig data and machine learning ppt.pptx
big data and machine learning ppt.pptx
 
Big Data
Big DataBig Data
Big Data
 
Introduction to Graph Databases
Introduction to Graph DatabasesIntroduction to Graph Databases
Introduction to Graph Databases
 
Relational databases vs Non-relational databases
Relational databases vs Non-relational databasesRelational databases vs Non-relational databases
Relational databases vs Non-relational databases
 
ChatGPT for Data Science Projects
ChatGPT for Data Science ProjectsChatGPT for Data Science Projects
ChatGPT for Data Science Projects
 
Big data-analytics-ebook
Big data-analytics-ebookBig data-analytics-ebook
Big data-analytics-ebook
 
Hadoop File system (HDFS)
Hadoop File system (HDFS)Hadoop File system (HDFS)
Hadoop File system (HDFS)
 
Hadoop
HadoopHadoop
Hadoop
 
NodeJS & Socket IO on Microsoft Azure Cloud Web Sites - DWX 2014
NodeJS & Socket IO on Microsoft Azure Cloud Web Sites - DWX 2014NodeJS & Socket IO on Microsoft Azure Cloud Web Sites - DWX 2014
NodeJS & Socket IO on Microsoft Azure Cloud Web Sites - DWX 2014
 
FUTURE OF DATA SCIENCE IN INDIA
FUTURE OF DATA SCIENCE IN INDIAFUTURE OF DATA SCIENCE IN INDIA
FUTURE OF DATA SCIENCE IN INDIA
 
Scaling and Modernizing Data Platform with Databricks
Scaling and Modernizing Data Platform with DatabricksScaling and Modernizing Data Platform with Databricks
Scaling and Modernizing Data Platform with Databricks
 
Üzleti intelligencia és adatvizualizáció
Üzleti intelligencia és adatvizualizációÜzleti intelligencia és adatvizualizáció
Üzleti intelligencia és adatvizualizáció
 
Breaking down an Industrial IoT reference architecture.pptx
Breaking down an Industrial IoT reference architecture.pptxBreaking down an Industrial IoT reference architecture.pptx
Breaking down an Industrial IoT reference architecture.pptx
 
Big data in telecom
Big data in telecomBig data in telecom
Big data in telecom
 
Data Mining
Data MiningData Mining
Data Mining
 

Similaire à STG206_Big Data Data Lakes and Data Oceans

21st Century Analytics with Zopa
21st Century Analytics with Zopa21st Century Analytics with Zopa
21st Century Analytics with ZopaAmazon Web Services
 
Architecting an Open Data Lake for the Enterprise
Architecting an Open Data Lake for the EnterpriseArchitecting an Open Data Lake for the Enterprise
Architecting an Open Data Lake for the EnterpriseAmazon Web Services
 
ABD201-Big Data Architectural Patterns and Best Practices on AWS
ABD201-Big Data Architectural Patterns and Best Practices on AWSABD201-Big Data Architectural Patterns and Best Practices on AWS
ABD201-Big Data Architectural Patterns and Best Practices on AWSAmazon Web Services
 
Technology Trends in Data Processing - DAT311 - re:Invent 2017
Technology Trends in Data Processing - DAT311 - re:Invent 2017Technology Trends in Data Processing - DAT311 - re:Invent 2017
Technology Trends in Data Processing - DAT311 - re:Invent 2017Amazon Web Services
 
Migrating your traditional Data Warehouse to a Modern Data Lake
Migrating your traditional Data Warehouse to a Modern Data LakeMigrating your traditional Data Warehouse to a Modern Data Lake
Migrating your traditional Data Warehouse to a Modern Data LakeAmazon Web Services
 
ABD318_Architecting a data lake with Amazon S3, Amazon Kinesis, AWS Glue and ...
ABD318_Architecting a data lake with Amazon S3, Amazon Kinesis, AWS Glue and ...ABD318_Architecting a data lake with Amazon S3, Amazon Kinesis, AWS Glue and ...
ABD318_Architecting a data lake with Amazon S3, Amazon Kinesis, AWS Glue and ...Amazon Web Services
 
RET301-Build Single Customer View across Multiple Retail Channels using AWS S...
RET301-Build Single Customer View across Multiple Retail Channels using AWS S...RET301-Build Single Customer View across Multiple Retail Channels using AWS S...
RET301-Build Single Customer View across Multiple Retail Channels using AWS S...Amazon Web Services
 
Big Data Architecture and Design Patterns
Big Data Architecture and Design PatternsBig Data Architecture and Design Patterns
Big Data Architecture and Design PatternsJohn Yeung
 
Fanatics Ingests Streaming Data to a Data Lake on AWS
Fanatics Ingests Streaming Data to a Data Lake on AWSFanatics Ingests Streaming Data to a Data Lake on AWS
Fanatics Ingests Streaming Data to a Data Lake on AWSAmazon Web Services
 
ARC201_Scaling Up to Your First 10 Million Users
ARC201_Scaling Up to Your First 10 Million UsersARC201_Scaling Up to Your First 10 Million Users
ARC201_Scaling Up to Your First 10 Million UsersAmazon Web Services
 
ABD206-Building Visualizations and Dashboards with Amazon QuickSight
ABD206-Building Visualizations and Dashboards with Amazon QuickSightABD206-Building Visualizations and Dashboards with Amazon QuickSight
ABD206-Building Visualizations and Dashboards with Amazon QuickSightAmazon Web Services
 
Best Practices for Building a Data Lake in Amazon S3 and Amazon Glacier, with...
Best Practices for Building a Data Lake in Amazon S3 and Amazon Glacier, with...Best Practices for Building a Data Lake in Amazon S3 and Amazon Glacier, with...
Best Practices for Building a Data Lake in Amazon S3 and Amazon Glacier, with...Amazon Web Services
 
How Amazon.com uses AWS Analytics
How Amazon.com uses AWS AnalyticsHow Amazon.com uses AWS Analytics
How Amazon.com uses AWS AnalyticsAmazon Web Services
 
Implementazione di una soluzione Data Lake.pdf
Implementazione di una soluzione Data Lake.pdfImplementazione di una soluzione Data Lake.pdf
Implementazione di una soluzione Data Lake.pdfAmazon Web Services
 
AWS Database and Analytics State of the Union
AWS Database and Analytics State of the UnionAWS Database and Analytics State of the Union
AWS Database and Analytics State of the UnionAmazon Web Services
 
How Amazon.com Uses AWS Analytics: Data Analytics Week SF
How Amazon.com Uses AWS Analytics: Data Analytics Week SFHow Amazon.com Uses AWS Analytics: Data Analytics Week SF
How Amazon.com Uses AWS Analytics: Data Analytics Week SFAmazon Web Services
 
AWS Database and Analytics State of the Union - 2017 - DAT201 - re:Invent 2017
AWS Database and Analytics State of the Union - 2017 - DAT201 - re:Invent 2017AWS Database and Analytics State of the Union - 2017 - DAT201 - re:Invent 2017
AWS Database and Analytics State of the Union - 2017 - DAT201 - re:Invent 2017Amazon Web Services
 

Similaire à STG206_Big Data Data Lakes and Data Oceans (20)

21st Century Analytics with Zopa
21st Century Analytics with Zopa21st Century Analytics with Zopa
21st Century Analytics with Zopa
 
Architecting an Open Data Lake for the Enterprise
Architecting an Open Data Lake for the EnterpriseArchitecting an Open Data Lake for the Enterprise
Architecting an Open Data Lake for the Enterprise
 
ABD201-Big Data Architectural Patterns and Best Practices on AWS
ABD201-Big Data Architectural Patterns and Best Practices on AWSABD201-Big Data Architectural Patterns and Best Practices on AWS
ABD201-Big Data Architectural Patterns and Best Practices on AWS
 
Technology Trends in Data Processing - DAT311 - re:Invent 2017
Technology Trends in Data Processing - DAT311 - re:Invent 2017Technology Trends in Data Processing - DAT311 - re:Invent 2017
Technology Trends in Data Processing - DAT311 - re:Invent 2017
 
Migrating your traditional Data Warehouse to a Modern Data Lake
Migrating your traditional Data Warehouse to a Modern Data LakeMigrating your traditional Data Warehouse to a Modern Data Lake
Migrating your traditional Data Warehouse to a Modern Data Lake
 
ABD318_Architecting a data lake with Amazon S3, Amazon Kinesis, AWS Glue and ...
ABD318_Architecting a data lake with Amazon S3, Amazon Kinesis, AWS Glue and ...ABD318_Architecting a data lake with Amazon S3, Amazon Kinesis, AWS Glue and ...
ABD318_Architecting a data lake with Amazon S3, Amazon Kinesis, AWS Glue and ...
 
Deep Dive on Big Data
Deep Dive on Big Data Deep Dive on Big Data
Deep Dive on Big Data
 
RET301-Build Single Customer View across Multiple Retail Channels using AWS S...
RET301-Build Single Customer View across Multiple Retail Channels using AWS S...RET301-Build Single Customer View across Multiple Retail Channels using AWS S...
RET301-Build Single Customer View across Multiple Retail Channels using AWS S...
 
STG401_This Is My Architecture
STG401_This Is My ArchitectureSTG401_This Is My Architecture
STG401_This Is My Architecture
 
Big Data Architecture and Design Patterns
Big Data Architecture and Design PatternsBig Data Architecture and Design Patterns
Big Data Architecture and Design Patterns
 
Building Data Lakes with AWS
Building Data Lakes with AWSBuilding Data Lakes with AWS
Building Data Lakes with AWS
 
Fanatics Ingests Streaming Data to a Data Lake on AWS
Fanatics Ingests Streaming Data to a Data Lake on AWSFanatics Ingests Streaming Data to a Data Lake on AWS
Fanatics Ingests Streaming Data to a Data Lake on AWS
 
ARC201_Scaling Up to Your First 10 Million Users
ARC201_Scaling Up to Your First 10 Million UsersARC201_Scaling Up to Your First 10 Million Users
ARC201_Scaling Up to Your First 10 Million Users
 
ABD206-Building Visualizations and Dashboards with Amazon QuickSight
ABD206-Building Visualizations and Dashboards with Amazon QuickSightABD206-Building Visualizations and Dashboards with Amazon QuickSight
ABD206-Building Visualizations and Dashboards with Amazon QuickSight
 
Best Practices for Building a Data Lake in Amazon S3 and Amazon Glacier, with...
Best Practices for Building a Data Lake in Amazon S3 and Amazon Glacier, with...Best Practices for Building a Data Lake in Amazon S3 and Amazon Glacier, with...
Best Practices for Building a Data Lake in Amazon S3 and Amazon Glacier, with...
 
How Amazon.com uses AWS Analytics
How Amazon.com uses AWS AnalyticsHow Amazon.com uses AWS Analytics
How Amazon.com uses AWS Analytics
 
Implementazione di una soluzione Data Lake.pdf
Implementazione di una soluzione Data Lake.pdfImplementazione di una soluzione Data Lake.pdf
Implementazione di una soluzione Data Lake.pdf
 
AWS Database and Analytics State of the Union
AWS Database and Analytics State of the UnionAWS Database and Analytics State of the Union
AWS Database and Analytics State of the Union
 
How Amazon.com Uses AWS Analytics: Data Analytics Week SF
How Amazon.com Uses AWS Analytics: Data Analytics Week SFHow Amazon.com Uses AWS Analytics: Data Analytics Week SF
How Amazon.com Uses AWS Analytics: Data Analytics Week SF
 
AWS Database and Analytics State of the Union - 2017 - DAT201 - re:Invent 2017
AWS Database and Analytics State of the Union - 2017 - DAT201 - re:Invent 2017AWS Database and Analytics State of the Union - 2017 - DAT201 - re:Invent 2017
AWS Database and Analytics State of the Union - 2017 - DAT201 - re:Invent 2017
 

Plus de Amazon Web Services

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Amazon Web Services
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Amazon Web Services
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateAmazon Web Services
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSAmazon Web Services
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Amazon Web Services
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Amazon Web Services
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...Amazon Web Services
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsAmazon Web Services
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareAmazon Web Services
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSAmazon Web Services
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAmazon Web Services
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareAmazon Web Services
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWSAmazon Web Services
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckAmazon Web Services
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without serversAmazon Web Services
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...Amazon Web Services
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceAmazon Web Services
 

Plus de Amazon Web Services (20)

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 

STG206_Big Data Data Lakes and Data Oceans

  • 1. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. AWS re:INVENT Big Data: Data Lakes and Data Oceans L e x C r o s e t t , P r i n c i p a l S o l u t i o n s A r c h i t e c t S a j e e M a t h e w , P r i n c i p a l S o l u t i o n s A r c h i t e c t E r i c k D a m e , E n t e r p r i s e S o l u t i o n s A r c h i t e c t J o h n M a l l o r y , B u s i n e s s D e v e l o p m e n t M a n a g e r November 28, 2017
  • 2. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. What to Expect from the Workshop • Data lake and analytics review (30 min.) • Set up a data lake using an AWS solution (30 min.) • Add information to the data lake (30 min.) • Perform lightweight analysis with AWS big data tools (1 hour)
  • 3. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Introduction to Data Lake Concepts
  • 4. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Unlocking Data Most companies and organizations are embarking on ambitious innovation initiatives to unlock their data The data already exists but goes unused or is locked away from complimentary data sets in isolated data silos
  • 5. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Challenges with Legacy Data Architectures • Can’t move data across silos • Can’t deal with dynamic data and real-time processing • Can’t deal with format diversity and change rate • Complex ETL processes • Difficult to find people with adequate skills to configure and manage these systems • Can’t integrate with the explosion of available social and behavior tracking data
  • 6. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Legacy Data Architectures Exist as Isolated Data Silos Hadoop Cluster SQL Database Data Warehouse Appliance
  • 7. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Legacy Data Architectures Are Monolithic Multiple layers of functionality all on a single cluster CPU Memory HDFS Storage CPU Memory HDFS Storage CPU Memory HDFS Storage Hadoop Master Node
  • 8. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Evolution of Data Architectures 2009: Decoupled EMR architecture CPU Memory Hadoop Master Node CPU Memory CPU Memory Improvements • Decoupled storage and compute • Scale CPU and memory resources independently and up and down • Only pay for the 500 TB dataset (not 3X) • Multi-physical facility replication via Amazon S3 • Multiple clusters can run in parallel against shared data in Amazon S3 • Each job gets its own optimized cluster. For example, Spark on memory intensive, Hive on CPU intensive, HBase on I/O intensive, and so on Constraints • Still have a cluster to provision and manage • Must expose EMR cluster to SQL users via Hive, Presto, and so on S3 as HDFS
  • 9. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Evolution of Data Architectures 2012: Amazon Redshift—cloud DW Improvements Constraints • Still have to load data into a schema Leader node Compute node 10 GigE (HPC) Ingestion Backup Restore Customer VPC Internal VPC BI tools SQL clientsAnalytics tools Compute node Compute node JDBC/ODBC • Automated installation, patching, backups • No servers to manage and maintain • MPP columnar relational database • $1,000/TB/year • Accessible to any ODBC or JDBC BI Tool
  • 10. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Enter Data Lake Architectures Data lake is a new and increasingly popular architecture to store and analyze massive volumes and heterogeneous types of data
  • 11. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Benefits of a Data Lake—All Data is in One Place Analyze all of your data, from all of your sources, in one stored location “Why is the data distributed in many locations? Where is the single source of truth?”
  • 12. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Benefits of a Data Lake—Quick Ingest Quickly ingest data without needing to force it into a predefined schema “How can I collect data quickly from various sources and store it efficiently?”
  • 13. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Benefits of a Data Lake—Storage vs. Compute Separating your storage and compute allows you to scale each component as required “How can I scale up with the volume of data being generated?”
  • 14. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Benefits of a Data Lake—Schema on Read “Is there a way I can apply multiple analytics and processing frameworks to the same data?” A data lake enables ad-hoc analysis by applying schemas on read, not write
  • 15. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. AWS Approach to Data Lake
  • 16. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon S3 is the Data Lake
  • 17. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Designed for 11 9s of durability Designed for 99.99% availability Durable Available High performance  Multiple upload  Range GET  Scalable throughput  Store as much as you need  Scale storage and compute independently  No minimum usage commitments Scalable  Amazon EMR  Amazon Redshift Spectrum  Amazon DynamoDB  Amazon Athena  AWS Glue  Amazon Rekognition  Amazon Macie Integrated  Simple REST API  AWS SDKs  Simple management tools  Event notification  Lifecycle policies Easy to use Why Amazon S3 for a Data Lake?
  • 18. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Benefits of an AWS Amazon S3 Data Lake Fixed cluster data lake AWS Amazon S3 data lake • Limited to only the single tool contained on the cluster (for example, Hadoop or data warehouse or Cassandra). Use cases and ecosystem tools change rapidly. • Expensive to add nodes to add storage capacity • Expensive to replicate data against node loss • Complexity in scaling local storage capacity • Long refresh cycles to add additional storage equipment • Decouple storage and compute by making S3 object based storage, not a fixed tool to manage the data lake • Flexibility to use any and all tools in the ecosystem. The right tool for the job. • Catalog, transform, and query in place • Future-proof your architecture. As new use cases and tools emerge you can plug and play current best of breed.
  • 19. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Building a Data Lake on AWS Kinesis Firehose Athena Query Service
  • 20. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Processing and Analytics Real-time Batch AI and Predictive BI and Data Visualization Transactional and RDBMS AWS Lambda Apache Storm on EMR Apache Flink on EMR Spark Streaming on EMR Elasticsearch Service Kinesis Analytics, Kinesis Streams DynamoDB NoSQL DB Relational Database Aurora EMR Hadoop, Spark, Presto Amazon Redshift Data Warehouse Amazon Athena Query Service Amazon Lex Speech recognition Amazon Rekognition Amazon Polly Text to speech Machine Learning Predictive analytics Kinesis Streams and Firehose
  • 21. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Further Evolution of Data Lake Architectures Today: clusterless Improvements • No cluster/infrastructure to manage • Business users and analysts can write SQL without having to provision a cluster or touch infrastructure • Pay by the query • Zero administration • Process data where it lives Constraints • Limited to SQL, Hive, and Spark jobs today • More frameworks to come! SQL Interface in web browser Amazon Athena for SQL S3 Data Lake AWS Glue for ETL S3 Data Lake Spark and Hive Interface in web browser
  • 22. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Open Data Formats—Free Your Data! • Parquet, ORC, Avro, JSON, CSV, others • Allows multiple analytic tools to operate on same data • Proprietary data = countdown to next migration
  • 23. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Use an Optimal Combination of Highly Interoperable Services Amazon QuickSight 4. Visualize all your data
  • 24. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Summary of AWS Analytics, Database, and AI Tools Amazon Redshift Enterprise Data Warehouse Amazon EMR Hadoop/Spark Amazon Athena Clusterless SQL AWS Glue Clusterless ETL Amazon Aurora Managed Relational Database Amazon Machine Learning Predictive Analytics Amazon QuickSight Business Intelligence/Visualization Amazon Elasticsearch Service Elasticsearch Amazon ElastiCache Redis In-memory Datastore Amazon DynamoDB Managed NoSQL Database Amazon Rekognition Deep Learning-based Image Recognition Amazon Lex Voice or Text Chatbots
  • 25. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. AWS Innovations to Produce More Efficient Data Lake Architectures
  • 26. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. AWS Direct Connect AWS Snowball ISV Connectors Amazon Kinesis Firehose Amazon S3 Transfer Acceleration AWS Storage Gateway Data Ingestion into Amazon S3
  • 27. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon Kinesis Firehose Load massive volumes of streaming data into Amazon S3, Amazon Redshift, and Amazon Elasticsearch Zero administration: capture and deliver streaming data into Amazon S3, Amazon Redshift, and Amazon Elasticsearch without writing an application or managing infrastructure Direct-to-data store integration: batch, compress, and encrypt streaming data for delivery into data destinations in as little as 60 seconds using simple configurations Seamless elasticity: seamlessly scales to match data throughput without intervention Capture and submit streaming data to Firehose Analyze streaming data using your favorite BI tools Firehose loads streaming data continuously into Amazon S3, Amazon Redshift, and Amazon Elasticsearch
  • 28. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Encryption ComplianceSecurity  Identity and access Management (IAM) policies  Bucket policies  Access Control Lists (ACLs)  Private VPC endpoints to Amazon S3  Amazon S3 object tagging to manage access policies  SSL endpoints  Server-side encryption (SSE-S3)  S3 server-side encryption with provided keys (SSE-C, SSE-KMS)  Client-side encryption  Buckets access logs  Lifecycle management policies  Access Control Lists (ACLs)  Versioning and MFA deletes  Certifications—HIPAA, PCI, SOC 1/2/3, etc. Implement the Right Cloud Security Controls
  • 29. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Data Lake Best Practices • Use Amazon S3 as the storage repository for your data lake, instead of a Hadoop cluster or data warehouse • Decoupled storage and compute is cheaper and more efficient to operate • Decoupled storage and compute allow us to evolve to clusterless architectures like Amazon Athena and AWS Glue • Do not build data silos in Hadoop or the Enterprise DW • Use granular encryption, roles, and access controls to build a secure, multi-tenant centralized data platform • Gain flexibility to use all the analytics tools in the ecosystem around Amazon S3 and future-proof the architecture
  • 30. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Workshop Process • Log in to your AWS account • Apply account credit • Open the workshop guide • Find and run the Data Lake Foundation on AWS • Once complete, load the Behavioral Risk Factor Surveillance system (BRFSS) data into Amazon S3 following the workshop instructions • Follow the analysis steps in the workshop guide • Run Delete Stack to remove resources when done
  • 31. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Questions?
  • 32. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. THANK YOU!