© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Ben Snively, Solutions Architect – Data and Analytics, AI/ML
Wednesday, May 22, 2019
Data Lifecycle – Best Practices
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Data Lifecycle
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Data Ingest
Mechanism for data
movement from
external sources into
your data system
Questions to ask:
What are my data sources?
What is the format of the data?
Is the data source immutable?
Is it real-time or batch?
Where is the destination?
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Data Ingestion:
AWS Direct Connect
AWS Snowball
AWS Snowmobile
AWS Database Migration Service
AWS IoT Core
Amazon Kinesis Data Firehose
Amazon Kinesis Data Streams
Amazon Kinesis Video Streams
Amazon Managed Streaming for Kafka
Traditional Data Sources:
Media and Log Files
ERP Systems
Databases (SQL/NoSQL)
Data Warehouses (EDW)
Real-time Data Sources:
IoT Sensors
Clickstream
Telemetry
Business Activities
Destinations:
Data Lake
Database
Data Warehouse
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Real-time data movement and Data Lakes on AWS
Producers: Kinesis Agent, Apache Kafka, AWS SDK, LOG4J, Flume, Fluentd, AWS Mobile SDK, Kinesis Producer Library
Streaming services: Amazon Kinesis Data Streams, Amazon Kinesis Data Firehose
Data definition: AWS Glue Data Catalog
Storage: Amazon S3 data (Data Lake on AWS)
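To make the ingestion path above concrete, here is a minimal sketch (Python with boto3) that pushes one JSON event into a Kinesis Data Firehose delivery stream assumed to already deliver into the S3 data lake; the stream name, region, and event fields are hypothetical.

```python
# Minimal sketch: push one clickstream-style event into a Kinesis Data Firehose
# delivery stream that (by assumption) is configured to deliver to S3.
import json
import boto3

firehose = boto3.client("firehose", region_name="us-east-1")  # region is an assumption

event = {
    "user_id": "u-123",            # placeholder fields
    "action": "page_view",
    "ts": "2019-05-22T12:00:00Z",
}

response = firehose.put_record(
    DeliveryStreamName="clickstream-to-datalake",   # hypothetical stream name
    Record={"Data": (json.dumps(event) + "\n").encode("utf-8")},
)
print(response["RecordId"])
```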
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Database Anti-Pattern
Single Database Tier
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Best Practice: Use the Right Tool for the Job
Data Tier
Relational: referential integrity with strong consistency, transactions, and hardened scale; complex query support via SQL
Key-value: low-latency, key-based queries with high throughput and fast data ingestion; simple query methods with filters
Document: indexing and storing of documents with support for query on any property; simple queries with filters, projections, and aggregates
In-memory: microsecond latency, key-based queries, specialized data structures; simple query methods with filters
Graph: creating and navigating relations between data easily and quickly; easily express queries in terms of relations
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
How will the data be accessed?
Access Pattern → What to use?
Put/Get (key, value): In-memory, NoSQL
Simple relationships (1:N, M:N): NoSQL
Multi-table joins, transactions, SQL: SQL
Faceting, search: Search
Graph traversal: GraphDB

What is the data structure?
Data Structure → What to use?
Fixed schema: SQL, NoSQL
Schema-free (JSON): NoSQL, Search
Key/Value: In-memory, NoSQL
Graph: GraphDB
Time interval: Time Series
Ledger: Ledger
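As a concrete illustration of the Put/Get (key, value) row above, the hedged sketch below uses DynamoDB through boto3. The table name, key schema, and attributes are hypothetical, and the table is assumed to already exist.

```python
# Minimal key-value access sketch against an assumed existing DynamoDB table
# with partition key "session_id" (hypothetical schema).
import boto3

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")  # region is an assumption
table = dynamodb.Table("user-sessions")                          # hypothetical table

# Put (key, value)
table.put_item(Item={"session_id": "abc-123", "cart_items": 3, "last_page": "/checkout"})

# Get by key
item = table.get_item(Key={"session_id": "abc-123"}).get("Item")
print(item)
```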
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Database Characteristics (columns ordered warm data → cold data)
Services: Amazon QLDB | Amazon DynamoDB | Amazon RDS / Aurora | Amazon Timestream | Amazon Elasticsearch | Amazon Neptune | Amazon S3 + Glacier
Use cases: immutable ledger | key-value with GSI/LSI indexes | OLTP, transactional | stores and processes data by time intervals | log analysis, reverse indexing | graph | data lake / file and object store
Performance: very high performance | ultra-high request rate, ultra-low to low latency | very high request rate, low latency | high request rate, low latency | medium request rate, low latency | medium request rate, low latency | high throughput
Shape: ledger | K/V and document | relational | time series | documents | nodes/edges | files
Size: TB, PB (no limits) | GB, mid TB | GB, low TB | GB, TB | GB, mid TB | GB, TB, PB, EB (no limits)
Cost / GB: $ | ¢¢ - $$ | $ | $ | $$ | $ | ¢ - ¢¢
VPC support: inside VPC | VPC endpoint | inside VPC | outside or inside VPC | inside VPC | VPC endpoint
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Data Staging
Validate, Verify,
Catalog the incoming
Raw Data
Perform common
housekeeping tasks
Questions to ask:
Which validation checks?
How will the raw dataset catalog be populated?
Will tagging of data be automated?
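One way to automate the catalog-and-tag housekeeping described above is sketched below with boto3: start an AWS Glue crawler over the raw landing prefix so the Data Catalog stays populated, then tag the incoming S3 object. The crawler, bucket, key, and tag names are illustrative assumptions.

```python
# Hedged sketch: populate the raw-data catalog and tag an incoming object.
import boto3

glue = boto3.client("glue")
s3 = boto3.client("s3")

# Crawl the raw landing prefix so new datasets show up in the Glue Data Catalog.
glue.start_crawler(Name="raw-zone-crawler")          # hypothetical crawler

# Tag the newly landed object for downstream governance/housekeeping.
s3.put_object_tagging(
    Bucket="my-data-lake",                           # hypothetical bucket
    Key="raw/clickstream/2019/05/22/events.json",    # hypothetical key
    Tagging={"TagSet": [
        {"Key": "zone", "Value": "raw"},
        {"Key": "source", "Value": "clickstream"},
    ]},
)
```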
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Data Cleansing
Transform and
Process data for
downstream
analytics
Questions to ask:
Which users and analytics will consume data?
Is there a common data model?
Optimize for reads/queries or writes?
How will data cleanup over time be performed (compaction, etc.)?
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
ELT/ETL
Preparing Raw, Staging, and Cleansed Data Lakes
Data Lake on AWS: Raw Ingestion → ELT/ETL → Staged Datasets → ELT/ETL → Optimized ML Datasets
Cleansed “views” of the data
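A common shape for the raw-to-staged ELT/ETL step is a Spark job that reads raw JSON and writes cleansed, partitioned, columnar data back to the lake. The PySpark sketch below is illustrative only: bucket prefixes, field names, and the dedupe/partition choices are assumptions, and the same transform could run on AWS Glue or Amazon EMR.

```python
# Illustrative ELT step: raw JSON in S3 -> cleansed, partitioned Parquet in S3.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("raw-to-staged").getOrCreate()

raw = spark.read.json("s3://my-data-lake/raw/clickstream/")      # hypothetical prefix

staged = (
    raw.dropDuplicates(["event_id"])                # assumed event_id field
       .withColumn("event_date", F.to_date("ts"))   # assumed ts field
       .filter(F.col("user_id").isNotNull())
)

(staged.write
       .mode("overwrite")
       .partitionBy("event_date")
       .parquet("s3://my-data-lake/staged/clickstream/"))        # hypothetical prefix
```

Writing columnar, date-partitioned output is what later keeps the ad hoc engines (Athena, Redshift Spectrum, EMR Presto) fast and inexpensive, since they scan only the partitions and columns a query needs.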
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Demonstration
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Data Analytics & Visualization
Deliver decision makers the insights to transform an organization by identifying unmet customer needs or by optimizing operational processes
Questions to ask:
What business question is being answered?
Does the data support answering them?
Who are the users driving the insights?
What skills do those users have?
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Start with the Business Problem and Users:
Business Reporting
Data Scientists
Data Engineer
IDE
Data Catalog
Central Storage
Amazon SageMaker (Machine Learning / Deep Learning)
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
1. Process high variety or volume structured or unstructured datasets
• Big Data Processing
2. Power Business Users to drive Insights
• Data Warehousing
3. Interactively query and explore datasets
• Ad Hoc Querying
4. Analyze what’s happening now
• Streaming Analytics
5. Drive operational and security understanding.
• Log Analysis
Common Types of Data Analytics
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Ad Hoc & Big Data Analytics
store: Amazon S3 (files), fed by Amazon Kinesis Firehose
process (big data analytics): Amazon EMR (Hive, Pig, Spark), Amazon Kinesis Analytics, Machine Learning (batch and real-time prediction)
process (ad hoc): Amazon EMR (Presto, Spark), Amazon Athena, Amazon Redshift, Amazon ES
Consume
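For the ad hoc path, Amazon Athena can also be driven programmatically. The hedged boto3 sketch below submits a query against an assumed Glue Data Catalog database, polls until it finishes, and reads the first rows; the database, table, and results-bucket names are placeholders.

```python
# Hedged sketch: run an ad hoc Athena query over the data lake and poll for completion.
import time
import boto3

athena = boto3.client("athena")

execution = athena.start_query_execution(
    QueryString="SELECT action, COUNT(*) AS events FROM clickstream GROUP BY action",
    QueryExecutionContext={"Database": "datalake_staged"},              # hypothetical database
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # hypothetical bucket
)
query_id = execution["QueryExecutionId"]

while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    print(rows[:5])
```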
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Demonstration
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Petabytes of data generated on-premises, brought to AWS, and stored in S3.
Thousands of analytical queries performed on EMR and Amazon Redshift.
Stringent security requirements met by leveraging VPC, VPN, encryption at rest and in transit, CloudTrail, and database auditing.
Flexible Interactive Queries
Predefined Queries
Surveillance Analytics
Web Applications
Analysts; Regulators
FINRA: Migrating to AWS
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Interactive & Batch
Sources: transactions from Amazon DynamoDB / Amazon RDS (via change data capture or export), streams (Amazon Kinesis, Amazon Kinesis Firehose), and files
Data lake: Amazon S3
Interactive & batch analytics: Amazon Redshift, Amazon Athena, Amazon EMR (Presto, Hive, Pig, Spark)
Real-time: AWS Lambda, KCL, Spark Streaming on Amazon EMR, Amazon Kinesis Analytics, Machine Learning
App state or materialized view: Amazon ElastiCache, Amazon DynamoDB, Amazon RDS, Amazon ES
Consumers: applications
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Streaming Analytics
Ingest and fan out downstream: Amazon Kinesis
process: Amazon EMR (Spark Streaming), KCL app, AWS Lambda, Amazon Kinesis Analytics, Machine Learning (real-time prediction)
store (app state or materialized view, KPIs): Amazon ElastiCache (Redis), Amazon DynamoDB, Amazon RDS, Amazon ES
Notifications / alerts: Amazon SNS
Log: Amazon S3
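One of the lighter-weight consumers in this pattern is AWS Lambda reading directly from a Kinesis stream and maintaining a small materialized view. The sketch below shows the general shape of such a handler; the record fields, the DynamoDB table, and the KPI logic are assumptions, not part of the original architecture.

```python
# Sketch of a Lambda handler consuming a Kinesis stream and updating a
# materialized view in DynamoDB (table name and metric are hypothetical).
import base64
import json
import boto3

table = boto3.resource("dynamodb").Table("realtime-kpis")   # hypothetical table

def handler(event, context):
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        # Maintain a simple per-action counter as the "app state / materialized view".
        table.update_item(
            Key={"metric": payload.get("action", "unknown")},
            UpdateExpression="ADD event_count :one",
            ExpressionAttributeValues={":one": 1},
        )
    return {"processed": len(event["Records"])}
```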
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Hearst’s Serverless Data Pipeline
cosmopolitan.com
caranddriver.com
sfchronicle.com
elle.com
Ingestion proxy (Node.js)
Serverless data pipeline
Offline analysis and archive
Real-time analysis
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Log data analytics
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Log Analytics Data Lake
Real-time application and user activities
On-premises activities
AWS Security Services
Analytics
Machine/Deep Learning
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Amazon Elasticsearch: Analyzing Log Data
Application monitoring & root-cause analysis
Security Information and Event Management (SIEM)
IoT & mobile
Business & clickstream analytics
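For these log-analytics use cases, applications typically push structured log documents to the Amazon Elasticsearch domain over its REST API. The sketch below uses plain requests against a hypothetical endpoint and index; authentication (for example SigV4 request signing or a domain access policy) is deliberately omitted and left as an assumption.

```python
# Hedged sketch: index one structured log event into an Amazon Elasticsearch domain.
# Endpoint and index names are placeholders; request signing/auth is omitted here.
import requests

ES_ENDPOINT = "https://search-my-logs-domain.us-east-1.es.amazonaws.com"  # hypothetical
INDEX = "app-logs-2019.05.22"

doc = {
    "timestamp": "2019-05-22T12:00:01Z",
    "level": "ERROR",
    "service": "checkout",
    "message": "payment gateway timeout",
}

resp = requests.post(f"{ES_ENDPOINT}/{INDEX}/_doc", json=doc, timeout=5)
resp.raise_for_status()
print(resp.json()["_id"])
```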
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Which Analytics Should I Use? PROCESS / ANALYZE
Batch
Takes minutes to hours
Example: Daily/weekly/monthly reports
Amazon EMR (MapReduce, Hive, Pig, Spark)
Interactive
Takes seconds
Example: Self-service dashboards
Amazon Redshift, Amazon Athena, Amazon EMR (Presto, Spark)
Stream
Takes milliseconds to seconds
Example: Fraud alerts, 1 minute metrics
Amazon EMR (Spark Streaming), Amazon Kinesis Analytics, KCL,
AWS Lambda, etc.
Predictive
Takes milliseconds (real-time) to hours (batch)
Example: Fraud detection, Forecasting demand, Speech
recognition
Amazon SageMaker, Polly, Rekognition, Transcribe, Translate,
Amazon EMR (Spark ML), Deep Learning AMI (MXNet, TensorFlow,
Theano, Torch, CNTK and Caffe)
[Diagram: latency spectrum from slow to fast. Batch and interactive: Amazon Redshift & Spectrum, Amazon Athena, Amazon EMR (Presto), Amazon ES. Predictive: Amazon ML. Stream/streaming (fast): Amazon Kinesis Analytics, KCL apps, AWS Lambda.]
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Which Analytics Tool Should I Use?
Engines: Amazon Redshift | Amazon Redshift Spectrum | Amazon Athena | Amazon EMR (Presto, Spark, Hive)
Use case: optimized for data warehousing | query S3 data from Redshift | interactive queries over S3 data | interactive query (Presto), general purpose (Spark), batch (Hive)
Scale/Throughput: ~nodes | ~nodes | automatic | ~nodes
Managed service: Yes | Yes | Yes, serverless | Yes
Storage: local storage | Amazon S3 | Amazon S3 | Amazon S3, HDFS
Optimization: columnar storage, data compression, and zone maps | AVRO, PARQUET, TEXT, SEQ, RCFILE, ORC, etc. | AVRO, PARQUET, TEXT, SEQ, RCFILE, ORC, etc. | framework dependent
Metadata: Redshift catalog | Glue Catalog | Glue Catalog | Glue Catalog or Hive metastore
Auth/Access controls: IAM, users, groups, and access controls | IAM, users, groups, and access controls | IAM | IAM, LDAP & Kerberos
UDF support: Yes (scalar) | Yes (scalar) | No | Yes
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Which Stream Processing Technology Should I Use?
Technologies: Amazon EMR (Spark Streaming) | KCL Application | Amazon Kinesis Analytics | AWS Lambda
Managed service: Yes | No (EC2 + Auto Scaling) | Yes | Yes
Serverless: No | No | Yes | Yes
Scale/Throughput: no limits / ~nodes | no limits / ~nodes | no limits / automatic | no limits / automatic
Availability: Single AZ | Multi-AZ | Multi-AZ | Multi-AZ
Programming languages: Java, Python, Scala | Java, others via MultiLangDaemon | ANSI SQL or Java/Flink | Node.js, Java, Python, .NET Core
Sliding window functions: built-in | app needs to implement | built-in | no
Reliability: KCL and Spark checkpoints | managed by KCL | managed by Amazon Kinesis Analytics | managed by AWS Lambda
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Data Archiving
AWS cloud storage makes the archival process easy to manage and allows you to focus on the storage of your data rather than the management of your tape systems and libraries.
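In practice, archiving in this model is usually just an S3 lifecycle rule rather than a tape workflow. The hedged boto3 sketch below transitions objects under an assumed raw/ prefix to Glacier and later to Glacier Deep Archive; the bucket name, prefix, and day counts are illustrative.

```python
# Illustrative lifecycle rule: raw data -> Glacier after 90 days -> Deep Archive after 365.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-data-lake",                       # hypothetical bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-raw-zone",
                "Filter": {"Prefix": "raw/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 90, "StorageClass": "GLACIER"},
                    {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
                ],
            }
        ]
    },
)
```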
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Architectural Principles
• Build decoupled systems
• Data → Store → Process → Store → Analyze → Answers
• Use the right tool for the job
• Data structure, latency, throughput, access patterns
• Leverage managed and serverless services
• Scalable/elastic, available, reliable, secure, no/low admin
• Use event-journal design patterns
• Immutable datasets (data lake), materialized views
• Be cost-conscious
• Big data ≠ big cost
• ML-enable your applications
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Questions

Speaker notes

  1. Learn about data lifecycle best practices in the AWS Cloud, so you can optimize performance and lower the costs of data ingestion, staging, storage, cleansing, analytics and visualization, and archiving.
  2. The ingest mechanism describes the movement of data from an external source into the data lifecycle. Data ingest refers to identifying the correct data sources, validating and importing the data files from those sources, and sending the data to the desired destination. Data sources include transactions, ERP systems, clickstream data, log files, devices, and disparate databases being migrated over. Generally, the destination is some form of storage or a database (we discuss the “destination” in the following chapter on Data Staging).
  3. Amazon Kinesis Data Firehose provides a simple way to capture and load streaming data with just a few clicks in the AWS Management Console. You can easily create a Firehose delivery stream from the AWS Management Console, configure it with a few clicks, and start sending data to the stream from hundreds of thousands of data sources to be loaded continuously to AWS – all in just a few minutes.
  4. If you tried to use a giant all-purpose tool to do every function, it wouldn’t be really good at any single thing. That’s often what customers see when running their systems, both in the cloud and on premises.
  5. Because of this, AWS offers a set of data-tier services, ranging from traditional relational stores like Aurora, Oracle, and SQL Server, to NoSQL databases that store key/value, document, and graph data, to in-memory stores that provide microsecond retrieval of re-hydratable data.
  6. Different data structures often require different types of storage. Key-value stores are great for data that needs to be quickly stored and queried; examples include session data or lat/long tracking data of car locations (as Lyft does on AWS). Other use cases require different storage, such as heavily connected graph data, data warehousing data, or relational data. The way the data gets queried is also a major characteristic, and it is related to the structure of the data.
  7. These are some of the characteristics, or really design decisions, for choosing which database to use. Two important categories to focus on are the use case and the shape of the data; the rest is often driven by those factors.
  8. Staging provides the opportunity to perform any data housekeeping tasks prior to making the data available to the organization or its users for analytics. One of the most common challenges we hear from customers is that their organization has data in multiple systems or locations, including data warehouses, spreadsheets, databases, and text files. Not only is the variety expanding, but its volume in many cases is growing exponentially. Add the complexity of mandatory data security and governance, user access, and the data demands of the analytics, business, and reporting teams, and an organization can find itself unable to see a way forward.
  9. Before data is analyzed, data cleansing ensures that data is transformed and presented in a format that is optimized for code. The Extract, Transform, Load (ETL) process is carried out as part of the data cleansing stage in the lifecycle. For example, a field may contain a date/time in a format that does not meet an algorithm’s requirements, or there may be a name field where the first and last names need to be separated out. Other concerns addressed during the data cleansing stage might include merging of data sources, aligning formats, converting strings to numerical data, or summarizing of data.
  10. Problem #1 – Many organizations don’t know what they have. When you accumulate such a diversity of data, you need mechanisms to understand what data you have, where it is located, and in what format. This is metadata management. If not managed properly (or at all), the data is essentially lost: it is taking up space, but you have no means to put it to use. A common issue, regardless of whether it is on-prem or in the cloud, is the lack of a metadata management approach from the onset. The Financial Industry Regulatory Authority (FINRA) oversees more than 3,900 securities firms with approximately 640,000 brokers. FINRA processes approximately 6 terabytes of data and 37 billion records on an average day to build a complete, holistic picture of market trading in the U.S. On busy days, the stock markets can generate 75 billion+ records. The way they’re able to make all this data useful, whether to data scientists or business users or others, is through a metadata system they developed and open sourced, called HERD. This is the same platform that is used by LinkedIn, for example. But most organizations don’t actually go off and build their own tooling. Ivy Tech is a community college with 60,000 online and in-person course sections, 8,300 staff, 170,000 students, and 130 locations. Ivy Tech uses metadata capabilities provided by AWS to manage its information.
  11. Data analytics is the stage where an organization can identify ways to increase revenue or reduce cost. Analytics and visualization deliver decision makers the insights to transform an organization by identifying unmet customer needs or by optimizing operational processes. Data-driven decisions transform how managers allocate resources and evaluate results within an organization. Reliance on data reduces the role of hearsay and instincts when making choices. A manager’s intuition is now backed with data at the front end of the planning process, through the course of implementation, and when evaluating the impact of his or her decisions. Key considerations in this phase include the requirements for analytics being clearly defined, the output being aligned to the use cases, and the consumers of data within the organization finding the generated insights actionable. Let’s review some of the solutions available for analytics within the AWS portfolio during this stage.
  12–21. Picking the right analytical engine for your needs (200): AWS offers analytical engines for several use cases such as big data processing, data warehousing, ad hoc analysis, real-time streaming, and operational/log analytics. In this session, you will learn which engines you can use for your use case to analyze all of your data stored in your Amazon S3 data lake in open formats. You will also learn how to use these engines together to generate new insights, such as complementing your data warehouse workloads with ad hoc and real-time analytics engines to incorporate new data into your reports.
  22. Redshift Spectrum formats: AVRO, PARQUET, TEXTFILE, SEQUENCEFILE, RCFILE, RegexSerDe, ORC, Grok, OpenCSV. Athena formats: SEQUENCEFILE, TEXTFILE, RCFILE, ORC, PARQUET, AVRO. Athena – simple; Redshift – fast; EMR – configurable. Add connector. 30% of queries & 70% of data – Athena… Directed acyclic graphs? Exactly-once processing & DAGs – how do you do this? https://storm.apache.org/documentation/Rationale.html http://www.slideshare.net/ptgoetz/apache-storm-vs-spark-streaming
  23. Add connector. Directed acyclic graphs? Exactly-once processing & DAGs – how do you do this? https://storm.apache.org/documentation/Rationale.html http://www.slideshare.net/ptgoetz/apache-storm-vs-spark-streaming
  24. Amazon Web Services offers a complete set of cloud storage services for archiving. You can choose Amazon S3 Glacier or Deep Archive for affordable, non-time-sensitive cloud storage, or Amazon S3 for faster storage, depending on your needs. AWS cloud storage solutions have achieved numerous compliance standards and security certifications and provide built-in encryption, helping to ensure that your data stored in AWS meets all the requirements for your business. AWS cloud storage solutions make the archival process easy to manage and allow you to focus on the storage of your data, rather than the management of your tape systems and libraries.