SlideShare une entreprise Scribd logo
1  sur  56
Télécharger pour lire hors ligne
© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Ian Meyers, Principal Solutions Architect
July 7th
, 2016
Big Data Architectural Patterns
and Best Practices on AWS
Big Data Evolution
Batch
processing
Stream
processing
Machine
learning
Powerful Services, Open Ecosystem
Amazon
Glacier
S3 DynamoDB
RDS
EMR
Amazon
Redshift
Data Pipeline
Amazon
Kinesis
Kinesis-enabled
app
Lambda ML
SQS
ElastiCache
DynamoDB
Streams
Amazon Elasticsearch
Service
Amazon Web Services Open Source & Commercial Ecosystem
Separation of compute & storage
• Event driven, loose coupling, independent scaling & sizing
“Data Sympathy” - use the right tool for the job
Data structure, latency, throughput, access patterns
Managed services, open technologies
Scalable/elastic, available, reliable, secure, no/low admin
Focus on the data, and your business, not undifferentiated tasks
Architectural Principles
Big Data != Big Cost
Aim to reduce the cost of experimentation to near 0
Simplify Big Data Processing
COLLECT STORE
PROCESS/
ANALYZE
CONSUME
Time to answer (Latency)
Throughput
Cost
Devices
Sensors &
IoT platforms
AWS IoT STREAMS
IoTApplications
AWS Import/Export
Snowball
Logging
Amazon
CloudWatch
AWS
CloudTrail
DOCUMENTS
FILES
LoggingTransport
Messaging
Message MESSAGES
Messaging
COLLECT
Mobile apps
Web apps
Data centers
AWS Direct
Connect
RECORDS
AWS Mobile SDK
AWS IoT
Amazon CloudWatchLogs Agent
Amazon CognitoSync
AWS Mobile Analytics
Easily Capture Mobile Data
through Sync and Server
Logging
File based integration to
Streaming Storage
API Integration via SDK’s
and Ecosystem Tools
Amazon Kinesis Producer Library
Amazon Kinesis Agent
Principles for Collection Architecture
• Should be easy to install, manage, and
understand
• Meet requirements for data durability & recovery
• Ensure ability to meet latency & performance
requirements
• Data handling should be sympathetic to the type
of data
Simplify Big Data Processing
COLLECT STORE
PROCESS/
ANALYZE
CONSUME
Time to answer (Latency)
Throughput
Cost
Types of Data
Devices
Sensors &
IoT platforms
AWS IoT STREAMS
IoTApplications
AWS Import/Export
Snowball
Logging
Amazon
CloudWatch
AWS
CloudTrail
DOCUMENTS
FILES
LoggingTransport
Messaging
Message MESSAGES
Messaging
COLLECT
Mobile apps
Web apps
Data centers
AWS Direct
Connect
RECORDS
Database
STORE
Search
File store
Queue
Stream
storage
Cache Heavily read, application defined structure
Consistently read, relational or document data
Frequently read, often requiring natural language
processing or geospatial access
Data received as files, or where object storage is
appropriate
Data intended for a single recipient, with low
latency
Data intended for many recipients, with low
latency
COLLECT STORE
Mobile apps
Web apps
Data centers
AWS Direct
Connect
RECORDS
AWS Import/Export
Snowball
Logging
Amazon
CloudWatch
AWS
CloudTrail
DOCUMENT
S
FILES
Messaging
Message MESSAGES
Devices
Sensors &
IoT platforms
AWS IoT STREAMS
Stream
Amazon SQS
Message
Amazon Elasticsearch
Service
Amazon DynamoDB
Amazon S3
Amazon ElastiCache
Amazon RDS
SearchSQLNoSQLCacheFile
LoggingIoTApplicationsTransportMessaging
File store
Queue
Amazon Kinesis
Firehose
Amazon Kinesis
Streams
Amazon DynamoDB
Streams
Stream
storage
Cache,
Database,
Search
Database Anti-pattern
Myriad of Applications
One GREAT BIG Database
Myriad of Applications
Best Practice - Use the Right Tool for the job
Data Tier
Search
Amazon Elasticsearch Service
Amazon CloudSearch
Cache
Amazon ElastiCache
Redis
Memcached
SQL
Amazon Aurora
Amazon RDS
Amazon Redshift
NoSQL
Amazon DynamoDB
Cassandra
HBase
MongoDB
Database Tier
COLLECT STORE
Mobile apps
Web apps
Data centers
AWS Direct
Connect
RECORDS
AWS Import/Export
Snowball
Logging
Amazon
CloudWatch
AWS
CloudTrail
DOCUMENT
S
FILES
Messaging
Message MESSAGES
Devices
Sensors &
IoT platforms
AWS IoT STREAMS
Stream
Amazon SQS
Message
Amazon S3
File
LoggingIoTApplicationsTransportMessaging
Amazon Kinesis
Firehose
Amazon Kinesis
Streams
Amazon DynamoDB
Streams
Queue
Stream
storage
Amazon Elasticsearch
Service
Amazon DynamoDB
Amazon ElastiCache
Amazon RDS
File store
Why is Amazon S3 Good for Big Data?
• Virtually unlimited storage, unlimited object count
• Very high bandwidth – no aggregate throughput limit
• Highly available – can tolerate AZ failure
• Designed for 99.999999999% durability
• Tired-storage (Standard, IA, Amazon Glacier) via life-cycle
policies
• Native support for Versioning
• Secure – SSL, client/server-side encryption at rest
• Low cost
Why is Amazon S3 Good for Big Data?
• Natively supported by big data frameworks (Spark, Hive,
Presto, etc.)
• No need to run compute clusters for storage (unlike HDFS)
• No need to pay for data replication
• Can run transient Hadoop clusters & Amazon EC2 Spot
Instances
• Multiple & heterogenous analysis clusters can use the same
data
What about HDFS & Data Tiering (S3 IA & Amazon Glacier)?
• Use HDFS for very frequently accessed
(hot) data
• Use Amazon S3 Standard as system of
record for durability
• Use Amazon S3 Standard – IA for near line
archive data
• Use Amazon Glacier for cold/offline archive
data
Amazon Kinesis
Firehose
Amazon Kinesis
Streams
Amazon DynamoDB
Streams
Amazon SQS
Amazon SQS
Managed message queue
service
Amazon Kinesis Streams
Managed stream storage +
processing
Amazon Kinesis Firehose
Managed data delivery
Amazon DynamoDB
Streams
Managed NoSQL database
Tables can be stream-enabled
Message & Stream Data
Devices
Sensors &
IoT platforms
AWS IoT STREAMS
IoT
COLLECT STORE
Mobile apps
Web apps
Data centers
AWS Direct
Connect
RECORDS
Applications
AWS Import/Export
Snowball
Logging
Amazon
CloudWatch
AWS
CloudTrail
DOCUMENTS
FILES
LoggingTransport
Messaging
Message MESSAGES
Messaging
Queue
Stream
Amazon Elasticsearch
Service
Amazon DynamoDB
Amazon ElastiCache
Amazon RDS
Amazon S3
Queue
Stream
storage
Why Stream Storage?
Decouple producers & consumers
Persistent buffer
Collect multiple streams
Preserve client ordering
Streaming analysis
Parallel consumption
4 4 3 3 2 2 1 1
4 3 2 1
4 3 2 1
4 3 2 1
4 3 2 1
4 4 3 3 2 2 1 1
shard 1 / partition 1
shard 2 / partition 2
Consumer 1
Count of
red = 4
Count of
violet = 4
Consumer 2
Count of
blue = 4
Count of
green = 4
DynamoDB stream Amazon Kinesis stream
What Data Store Should I Use?
Data structure → Fixed schema, JSON, key-value
Access patterns → Store data in the format you will access it
Data / access characteristics → Hot, warm, cold
Cost → Right cost
Data Structure and Access Patterns
Access Patterns What to use?
Put/Get (key, value) Cache, NoSQL
Simple relationships → 1:N, M:N NoSQL
Cross table joins, transaction, SQL SQL
Faceting, natural language, geo Search
Data Structure What to use?
Fixed schema SQL, NoSQL
Schema-free (JSON) NoSQL, Search
(Key, value) Cache, NoSQL
What Is the Temperature of Your Data / Access ?
Hot Warm Cold
Volume MB–GB GB–TB PB
Item size B–KB KB–MB KB–TB
Latency ms ms, sec min, hrs
Durability Low – high High Very high
Request rate Very high High-Med Low
Cost/GB $$-$ $-¢¢ ¢
Hot data Warm data Cold data
Data / Access Characteristics: Hot, Warm, Cold
Amazon ElastiCache Amazon
DynamoDB
Amazon
RDS/Aurora
Amazon
Elasticsearch
Amazon S3 Amazon Glacier
Average
latency
ms ms ms, sec ms,sec ms,sec,min
(~ size)
hrs
Typical
data stored
GB GB–TBs
(no limit)
GB–TB
(64 TB max)
GB–TB MB–PB
(no limit)
GB–PB
(no limit)
Typical
item size
B-KB KB
(400 KB max)
KB
(64 KB max)
KB
(2 GB max)
KB-TB
(5 TB max)
GB
(40 TB max)
Request
Rate
High – very high Very high
(no limit)
High High Low – high
(no limit)
Very low
Storage cost
GB/month
$$ ¢¢ ¢¢ ¢¢ ¢ ¢/10
Durability Low - moderate Very high Very high High Very high Very high
Availability High
2 AZ
Very high
3 AZ
Very high
3 AZ
High
2 AZ
Very high
3 AZ
Very high
3 AZ
Hot data Warm data Cold data
What Data Store Should I Use?
Simplify Big Data Processing
COLLECT STORE
PROCESS/
ANALYZE
CONSUME
Time to answer (Latency)
Throughput
Cost
COLLECT STORE PROCESS / ANALYZE
Amazon Elasticsearch
Service
Amazon SQS
Amazon DynamoDB
Amazon S3
Amazon ElastiCache
Amazon RDS
HotWarm
SearchSQLNoSQLCacheFileMessage
Stream
Mobile apps
Web apps
Devices
Messaging
Message
Sensors &
IoT platforms
AWS IoT
Data centers
AWS Direct
Connect
AWS Import/Export
Snowball
Logging
Amazon
CloudWatch
AWS
CloudTrail
RECORDS
DOCUMENT
S
FILES
MESSAGES
STREAMS
LoggingIoTApplicationsTransportMessaging
Amazon Kinesis
Firehose
Amazon Kinesis
Streams
Amazon DynamoDB
Streams
Hot
Tools and Frameworks
Amazon SQS apps
Streaming
Amazon Kinesis
Analytics
Amazon KCL
apps
AWS Lambda
Amazon Redshift
PROCESS / ANALYZE
Amazon Machine
Learning
Presto
Amazon
EMR
FastSlowFast
BatchMessageInteractiveStreamML
Amazon EC2
Amazon EC2
Batch - Minutes or hours on cold data
Daily/weekly/monthly reports
Interactive – Seconds on warm/cold data
Self-service dashboards
Messaging – Milliseconds or seconds on hot data
Message/eventbuffering
Streaming - Milliseconds or seconds on hot data
Billing/fraud alerts, 1 minute metrics
Tools and Frameworks
Batch
Amazon EMR (MapReduce, Hive, Pig, Spark)
Interactive
Amazon Redshift, Amazon EMR (Presto, Spark)
Messaging
Amazon SQS application on Amazon EC2
Streaming
Micro-batch: Spark Streaming, KCL
Real-time: Amazon Kinesis Analytics, Storm,
AWS Lambda, KCL
Amazon SQS apps
Streaming
Amazon Kinesis
Analytics
Amazon KCL
apps
AWS Lambda
Amazon Redshift
PROCESS / ANALYZE
Amazon Machine
Learning
Presto
Amazon
EMR
FastSlowFast
BatchMessageInteractiveStreamML
Amazon EC2
Amazon EC2
What Analytics Technology Should I Use?
Amazon
Redshift
Amazon EMR
Presto Spark Hive
Query
latency
Low Low Low High
Durability High High High High
Data volume 1.6 PB max ~Nodes ~Nodes ~Nodes
AWS
managed
Yes Yes Yes Yes
Storage Native HDFS / S3 HDFS / S3 HDFS / S3
SQL
compatibility
High High Low (SparkSQL) Medium (HQL)
Slow
What Streaming Analysis Technology Should I Use?
Spark
Streaming
Apache
Storm
Kinesis KCL
Application
AWS Lambda Amazon SQS
Apps
Scale ~ Nodes ~ Nodes ~ Nodes Automatic ~ Nodes
Micro-batch or
Real-time
Micro-batch Real-time Near-real-time Near-real-time Near-real-time
AWS managed
service
Yes (EMR) No (EC2) No (KCL + EC2
+ Auto Scaling)
Yes No (EC2 + Auto
Scaling)
Scalability No limits
~ nodes
No limits
~ nodes
No limits
~ nodes
No limits No limits
Availability Single AZ Configurable Multi-AZ Multi-AZ Multi-AZ
Programming
languages
Java, Python,
Scala
Any language
via Thrift
Java, via
MultiLang
Daemon (.NET,
Python, Ruby,
Node.js)
Node.js, Java,
Python
AWS SDK
languages
(Java, .NET,
Python, …)
Simplify Big Data Processing
COLLECT STORE
PROCESS/
ANALYZE
CONSUME
Time to answer (Latency)
Throughput
Cost
Amazon SQS apps
Streaming
Amazon Kinesis
Analytics
Amazon KCL
apps
AWS Lambda
Amazon Redshift
COLLECT STORE CONSUMEPROCESS / ANALYZE
Amazon Machine
Learning
Presto
Amazon
EMR
Amazon Elasticsearch
Service
Amazon SQS
Amazon DynamoDB
Amazon S3
Amazon ElastiCache
Amazon RDS
HotHotWarm
FastSlowFast
BatchMessageInteractiveStreamML
SearchSQLNoSQLCacheFileMessage
Stream
Amazon EC2
Amazon EC2
Mobile apps
Web apps
Devices
Messaging
Message
Sensors &
IoT platforms
AWS IoT
Data centers
AWS Direct
Connect
AWS Import/Export
Snowball
Logging
Amazon
CloudWatch
AWS
CloudTrail
RECORDS
DOCUMENT
S
FILES
MESSAGES
STREAMS
LoggingIoTApplicationsTransportMessaging
ETL
Amazon Kinesis
Firehose
Amazon Kinesis
Streams
Amazon DynamoDB
Streams
STORE CONSUMEPROCESS / ANALYZE
Amazon QuickSight
Analysis&visualizationNotebooksIDE
Apps & Services
API
Applications & API
Reporting, Exploratory and
Ad-Hoc Analysis and Data
Visualization
Notebooks
Data Science
Business
users
Data scientist,
developers
COLLECT ETL
Design Patterns
Primitive: Batch Analysis & Enhancement
process
store
Amazon S3
Amazon
DynamoDB
Amazon EMR
Presto
Spark
SQL
Amazon
Redshift
Amazon
Quicksight
AWS
Marketplace
Primitive: Event Processing “Data Bus”
Multiple stages
Storage raises events to decouple processing stages
Store Process Store Process
Primitive: Pub/Sub
process
store
Amazon
Kinesis
AWS Lambda
Amazon Kinesis
S3
Connector
Apache Spark
Kinesis Client
Library
Combining Primitives: Event Processing Network
Combines data bus and pub/sub patterns to create
asynchronous, continuously executing transformations
Store Process Store Process
Process Store Process
Process
Process
Primitive: Messaging “Fan Out”
process
store
Amazon SQS AWS Lambda
Amazon
Kinesis
Real Time
Processing
Batch Analytics
process
store
Amazon S3
Amazon
Redshift
Amazon EMR
Presto
Hive
Pig
Spark
Real Time
StorageAmazon
ElastiCache
Amazon
DynamoDB
Amazon
RDS
Amazon
ES
Amazon
ML
KCL
AWS Lambda
Storm
Real Time &
Batch
Architecture
Spark Streaming
on Amazon EMR
Applications
Amazon
Kinesis
© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Zoltan Arvai, Big Data Architect
July 7th, 2016
Leveraging S3 and EMR at Adbrain
About Adbrain
• Adbrain is a market leader in providing Probabilistic
Identity Solutions
• At the core of our business we build an identity graph
that represents connected devices on the planet
• We enable our clients to understand and target their
audience better
What does that mean?
• Ingesting large amounts of log data from
different data sources
• Build a graph using statistical and
machine learning algorithms
• Extract a relevant user graph for our
clients
Our main scenarios
• Production graph generation and servicing our clients
• Data Science team to run computations (Spark)
• R&D work with ML
• Data feed evaluation
• Graph analytics
• Data Engineering development workload
Our original setup
• HDFS Cluster on 18 nodes with most of our hot data
• Not leveraging co-location of data
• Spark clusters are separated
• Spark instance type optimized according to workload
• Custom cluster provisioning solution to run Spark jobs
• Built in-house
• Required maintenance
challenge
What are the issues?
• “Hadoop is a cheap Big Data solution on
commodity hardware” … define cheap for a
startup
• Scaling Hadoop means more focus on
infrastructure
• Data grows really quickly (adding a new country,
data feed, etc.)
What do we need?
• Better scale
• Cheaper solution
• Less focus on infrastructure
• Reduced risk
EMR and S3 to the rescue!
• S3 is significantly cheaper compared to HDFS
• We use infrequent access and Glacier storage as
well
• EMR is a fully managed solution
• Up to date Spark distributions
• Support of Spot instances
• Managed Hadoop when / if needed
Amazon
S3
Amazon
EMR
The implementation challenge
Some technical challenges
• Spark 1.3.1 did not work well with S3 and
Parquet files
• Writing to S3 using the default
ParquetOutputComitter is slow
• Calculating output summary metadata is slow
• S3 is not consistent by default (it’s not a file
system)
• S3 can be slower than HDFS
Solutions
• Upgrade to Spark 1.5.1+
• Using DirectParquetOutputCommitter with speculation
disabled
• parquet.enable.summary-metadata = false
• Enable S3 Consistent View in EMR!
• (Remember the DynamoDB tables!)
• In case of a multi-step Spark workflow, utilize HDFS on
the Spark Cluster
The end result
• Decreasing costs by 70%+
• Focusing on development instead of
commodities (hardware/versioning/patching)
• Eliminated risks
Thank you!
aws.amazon.com/big-data
Remember to complete
your evaluations!

Contenu connexe

Tendances

Migrating Oracle Databases to AWS
Migrating Oracle Databases to AWSMigrating Oracle Databases to AWS
Migrating Oracle Databases to AWSAWS Germany
 
ABD315_Serverless ETL with AWS Glue
ABD315_Serverless ETL with AWS GlueABD315_Serverless ETL with AWS Glue
ABD315_Serverless ETL with AWS GlueAmazon Web Services
 
Running Active Directory in the AWS Cloud
Running Active Directory in the AWS Cloud Running Active Directory in the AWS Cloud
Running Active Directory in the AWS Cloud Amazon Web Services
 
Getting Started with AWS Database Migration Service
Getting Started with AWS Database Migration ServiceGetting Started with AWS Database Migration Service
Getting Started with AWS Database Migration ServiceAmazon Web Services
 
Cloud Migration, Application Modernization and Security for Partners
Cloud Migration, Application Modernization and Security for PartnersCloud Migration, Application Modernization and Security for Partners
Cloud Migration, Application Modernization and Security for PartnersAmazon Web Services
 
Data Migration Using AWS Snowball, Snowball Edge & Snowmobile
Data Migration Using AWS Snowball, Snowball Edge & SnowmobileData Migration Using AWS Snowball, Snowball Edge & Snowmobile
Data Migration Using AWS Snowball, Snowball Edge & SnowmobileAmazon Web Services
 
Architect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh ArchitectureArchitect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh ArchitectureDatabricks
 
Building Serverless ETL Pipelines with AWS Glue
Building Serverless ETL Pipelines with AWS GlueBuilding Serverless ETL Pipelines with AWS Glue
Building Serverless ETL Pipelines with AWS GlueAmazon Web Services
 
Cloud Architecture - Multi Cloud, Edge, On-Premise
Cloud Architecture - Multi Cloud, Edge, On-PremiseCloud Architecture - Multi Cloud, Edge, On-Premise
Cloud Architecture - Multi Cloud, Edge, On-PremiseAraf Karsh Hamid
 
Big Data Architectural Patterns and Best Practices on AWS
Big Data Architectural Patterns and Best Practices on AWSBig Data Architectural Patterns and Best Practices on AWS
Big Data Architectural Patterns and Best Practices on AWSAmazon Web Services
 
Database Migration Using AWS DMS and AWS SCT (GPSCT307) - AWS re:Invent 2018
Database Migration Using AWS DMS and AWS SCT (GPSCT307) - AWS re:Invent 2018Database Migration Using AWS DMS and AWS SCT (GPSCT307) - AWS re:Invent 2018
Database Migration Using AWS DMS and AWS SCT (GPSCT307) - AWS re:Invent 2018Amazon Web Services
 
Speed up data preparation for ML pipelines on AWS
Speed up data preparation for ML pipelines on AWSSpeed up data preparation for ML pipelines on AWS
Speed up data preparation for ML pipelines on AWSData Science Milan
 
Executing a Large-Scale Migration to AWS
Executing a Large-Scale Migration to AWSExecuting a Large-Scale Migration to AWS
Executing a Large-Scale Migration to AWSAmazon Web Services
 
Introducing the Snowflake Computing Cloud Data Warehouse
Introducing the Snowflake Computing Cloud Data WarehouseIntroducing the Snowflake Computing Cloud Data Warehouse
Introducing the Snowflake Computing Cloud Data WarehouseSnowflake Computing
 
Getting started with Amazon ElastiCache
Getting started with Amazon ElastiCacheGetting started with Amazon ElastiCache
Getting started with Amazon ElastiCacheAmazon Web Services
 

Tendances (20)

Migrating Oracle Databases to AWS
Migrating Oracle Databases to AWSMigrating Oracle Databases to AWS
Migrating Oracle Databases to AWS
 
ABD315_Serverless ETL with AWS Glue
ABD315_Serverless ETL with AWS GlueABD315_Serverless ETL with AWS Glue
ABD315_Serverless ETL with AWS Glue
 
Building-a-Data-Lake-on-AWS
Building-a-Data-Lake-on-AWSBuilding-a-Data-Lake-on-AWS
Building-a-Data-Lake-on-AWS
 
Running Active Directory in the AWS Cloud
Running Active Directory in the AWS Cloud Running Active Directory in the AWS Cloud
Running Active Directory in the AWS Cloud
 
Getting Started with AWS Database Migration Service
Getting Started with AWS Database Migration ServiceGetting Started with AWS Database Migration Service
Getting Started with AWS Database Migration Service
 
ElastiCache & Redis
ElastiCache & RedisElastiCache & Redis
ElastiCache & Redis
 
Cloud Migration, Application Modernization and Security for Partners
Cloud Migration, Application Modernization and Security for PartnersCloud Migration, Application Modernization and Security for Partners
Cloud Migration, Application Modernization and Security for Partners
 
Data Migration Using AWS Snowball, Snowball Edge & Snowmobile
Data Migration Using AWS Snowball, Snowball Edge & SnowmobileData Migration Using AWS Snowball, Snowball Edge & Snowmobile
Data Migration Using AWS Snowball, Snowball Edge & Snowmobile
 
Architect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh ArchitectureArchitect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh Architecture
 
Migrating to the Cloud
Migrating to the CloudMigrating to the Cloud
Migrating to the Cloud
 
Building Serverless ETL Pipelines with AWS Glue
Building Serverless ETL Pipelines with AWS GlueBuilding Serverless ETL Pipelines with AWS Glue
Building Serverless ETL Pipelines with AWS Glue
 
Cloud Architecture - Multi Cloud, Edge, On-Premise
Cloud Architecture - Multi Cloud, Edge, On-PremiseCloud Architecture - Multi Cloud, Edge, On-Premise
Cloud Architecture - Multi Cloud, Edge, On-Premise
 
Big Data Architectural Patterns and Best Practices on AWS
Big Data Architectural Patterns and Best Practices on AWSBig Data Architectural Patterns and Best Practices on AWS
Big Data Architectural Patterns and Best Practices on AWS
 
Database Migration Using AWS DMS and AWS SCT (GPSCT307) - AWS re:Invent 2018
Database Migration Using AWS DMS and AWS SCT (GPSCT307) - AWS re:Invent 2018Database Migration Using AWS DMS and AWS SCT (GPSCT307) - AWS re:Invent 2018
Database Migration Using AWS DMS and AWS SCT (GPSCT307) - AWS re:Invent 2018
 
Snowflake Datawarehouse Architecturing
Snowflake Datawarehouse ArchitecturingSnowflake Datawarehouse Architecturing
Snowflake Datawarehouse Architecturing
 
Speed up data preparation for ML pipelines on AWS
Speed up data preparation for ML pipelines on AWSSpeed up data preparation for ML pipelines on AWS
Speed up data preparation for ML pipelines on AWS
 
Executing a Large-Scale Migration to AWS
Executing a Large-Scale Migration to AWSExecuting a Large-Scale Migration to AWS
Executing a Large-Scale Migration to AWS
 
Introducing the Snowflake Computing Cloud Data Warehouse
Introducing the Snowflake Computing Cloud Data WarehouseIntroducing the Snowflake Computing Cloud Data Warehouse
Introducing the Snowflake Computing Cloud Data Warehouse
 
Cloud Migration Workshop
Cloud Migration WorkshopCloud Migration Workshop
Cloud Migration Workshop
 
Getting started with Amazon ElastiCache
Getting started with Amazon ElastiCacheGetting started with Amazon ElastiCache
Getting started with Amazon ElastiCache
 

En vedette

(BDT310) Big Data Architectural Patterns and Best Practices on AWS
(BDT310) Big Data Architectural Patterns and Best Practices on AWS(BDT310) Big Data Architectural Patterns and Best Practices on AWS
(BDT310) Big Data Architectural Patterns and Best Practices on AWSAmazon Web Services
 
Best Practices for Building a Data Lake with Amazon S3 - August 2016 Monthly ...
Best Practices for Building a Data Lake with Amazon S3 - August 2016 Monthly ...Best Practices for Building a Data Lake with Amazon S3 - August 2016 Monthly ...
Best Practices for Building a Data Lake with Amazon S3 - August 2016 Monthly ...Amazon Web Services
 
Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv
Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv
Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv Amazon Web Services
 
Getting started with Amazon Kinesis
Getting started with Amazon KinesisGetting started with Amazon Kinesis
Getting started with Amazon KinesisAmazon Web Services
 
(DAT401) Amazon DynamoDB Deep Dive
(DAT401) Amazon DynamoDB Deep Dive(DAT401) Amazon DynamoDB Deep Dive
(DAT401) Amazon DynamoDB Deep DiveAmazon Web Services
 
Exploitez le Big Data dans le cadre de votre stratégie MDM
Exploitez le Big Data dans le cadre de votre stratégie MDMExploitez le Big Data dans le cadre de votre stratégie MDM
Exploitez le Big Data dans le cadre de votre stratégie MDMJean-Michel Franco
 
The Impact of Messaging Standards on Event-Driven Architecture and IoT
The Impact of Messaging Standards on Event-Driven Architecture and IoTThe Impact of Messaging Standards on Event-Driven Architecture and IoT
The Impact of Messaging Standards on Event-Driven Architecture and IoTSolace
 
Real-time personal trainer on the SMACK stack
Real-time personal trainer on the SMACK stackReal-time personal trainer on the SMACK stack
Real-time personal trainer on the SMACK stackAnirvan Chakraborty
 
Demystifying salesforce for developers
Demystifying salesforce for developersDemystifying salesforce for developers
Demystifying salesforce for developersHeitor Souza
 
Kafka Lambda architecture with mirroring
Kafka Lambda architecture with mirroringKafka Lambda architecture with mirroring
Kafka Lambda architecture with mirroringAnant Rustagi
 
Extreme Salesforce Data Volumes Webinar
Extreme Salesforce Data Volumes WebinarExtreme Salesforce Data Volumes Webinar
Extreme Salesforce Data Volumes WebinarSalesforce Developers
 
How Apache Kafka is transforming Hadoop, Spark and Storm
How Apache Kafka is transforming Hadoop, Spark and StormHow Apache Kafka is transforming Hadoop, Spark and Storm
How Apache Kafka is transforming Hadoop, Spark and StormEdureka!
 
Large volume data analysis on the Typesafe Reactive Platform - Big Data Scala...
Large volume data analysis on the Typesafe Reactive Platform - Big Data Scala...Large volume data analysis on the Typesafe Reactive Platform - Big Data Scala...
Large volume data analysis on the Typesafe Reactive Platform - Big Data Scala...Martin Zapletal
 
AWS re:Invent 2016: Lift and Evolve – Saving Money in the Cloud is Easy, Maki...
AWS re:Invent 2016: Lift and Evolve – Saving Money in the Cloud is Easy, Maki...AWS re:Invent 2016: Lift and Evolve – Saving Money in the Cloud is Easy, Maki...
AWS re:Invent 2016: Lift and Evolve – Saving Money in the Cloud is Easy, Maki...Amazon Web Services
 
Machine Learning & Data Lake for IoT scenarios on AWS
Machine Learning & Data Lake for IoT scenarios on AWSMachine Learning & Data Lake for IoT scenarios on AWS
Machine Learning & Data Lake for IoT scenarios on AWSAmazon Web Services
 
Handling of Large Data by Salesforce
Handling of Large Data by SalesforceHandling of Large Data by Salesforce
Handling of Large Data by SalesforceThinqloud
 
Big Data Day LA 2015 - Event Driven Architecture for Web Analytics by Peyman ...
Big Data Day LA 2015 - Event Driven Architecture for Web Analytics by Peyman ...Big Data Day LA 2015 - Event Driven Architecture for Web Analytics by Peyman ...
Big Data Day LA 2015 - Event Driven Architecture for Web Analytics by Peyman ...Data Con LA
 
Case Study: Elasticsearch Ingest Using StreamSets @ Cisco Intercloud
Case Study: Elasticsearch Ingest Using StreamSets @ Cisco IntercloudCase Study: Elasticsearch Ingest Using StreamSets @ Cisco Intercloud
Case Study: Elasticsearch Ingest Using StreamSets @ Cisco IntercloudStreamsets Inc.
 

En vedette (20)

Big Data Architectural Patterns
Big Data Architectural PatternsBig Data Architectural Patterns
Big Data Architectural Patterns
 
(BDT310) Big Data Architectural Patterns and Best Practices on AWS
(BDT310) Big Data Architectural Patterns and Best Practices on AWS(BDT310) Big Data Architectural Patterns and Best Practices on AWS
(BDT310) Big Data Architectural Patterns and Best Practices on AWS
 
Best Practices for Building a Data Lake with Amazon S3 - August 2016 Monthly ...
Best Practices for Building a Data Lake with Amazon S3 - August 2016 Monthly ...Best Practices for Building a Data Lake with Amazon S3 - August 2016 Monthly ...
Best Practices for Building a Data Lake with Amazon S3 - August 2016 Monthly ...
 
Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv
Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv
Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv
 
Big Data Architectural Patterns
Big Data Architectural PatternsBig Data Architectural Patterns
Big Data Architectural Patterns
 
Getting started with Amazon Kinesis
Getting started with Amazon KinesisGetting started with Amazon Kinesis
Getting started with Amazon Kinesis
 
(DAT401) Amazon DynamoDB Deep Dive
(DAT401) Amazon DynamoDB Deep Dive(DAT401) Amazon DynamoDB Deep Dive
(DAT401) Amazon DynamoDB Deep Dive
 
Exploitez le Big Data dans le cadre de votre stratégie MDM
Exploitez le Big Data dans le cadre de votre stratégie MDMExploitez le Big Data dans le cadre de votre stratégie MDM
Exploitez le Big Data dans le cadre de votre stratégie MDM
 
The Impact of Messaging Standards on Event-Driven Architecture and IoT
The Impact of Messaging Standards on Event-Driven Architecture and IoTThe Impact of Messaging Standards on Event-Driven Architecture and IoT
The Impact of Messaging Standards on Event-Driven Architecture and IoT
 
Real-time personal trainer on the SMACK stack
Real-time personal trainer on the SMACK stackReal-time personal trainer on the SMACK stack
Real-time personal trainer on the SMACK stack
 
Demystifying salesforce for developers
Demystifying salesforce for developersDemystifying salesforce for developers
Demystifying salesforce for developers
 
Kafka Lambda architecture with mirroring
Kafka Lambda architecture with mirroringKafka Lambda architecture with mirroring
Kafka Lambda architecture with mirroring
 
Extreme Salesforce Data Volumes Webinar
Extreme Salesforce Data Volumes WebinarExtreme Salesforce Data Volumes Webinar
Extreme Salesforce Data Volumes Webinar
 
How Apache Kafka is transforming Hadoop, Spark and Storm
How Apache Kafka is transforming Hadoop, Spark and StormHow Apache Kafka is transforming Hadoop, Spark and Storm
How Apache Kafka is transforming Hadoop, Spark and Storm
 
Large volume data analysis on the Typesafe Reactive Platform - Big Data Scala...
Large volume data analysis on the Typesafe Reactive Platform - Big Data Scala...Large volume data analysis on the Typesafe Reactive Platform - Big Data Scala...
Large volume data analysis on the Typesafe Reactive Platform - Big Data Scala...
 
AWS re:Invent 2016: Lift and Evolve – Saving Money in the Cloud is Easy, Maki...
AWS re:Invent 2016: Lift and Evolve – Saving Money in the Cloud is Easy, Maki...AWS re:Invent 2016: Lift and Evolve – Saving Money in the Cloud is Easy, Maki...
AWS re:Invent 2016: Lift and Evolve – Saving Money in the Cloud is Easy, Maki...
 
Machine Learning & Data Lake for IoT scenarios on AWS
Machine Learning & Data Lake for IoT scenarios on AWSMachine Learning & Data Lake for IoT scenarios on AWS
Machine Learning & Data Lake for IoT scenarios on AWS
 
Handling of Large Data by Salesforce
Handling of Large Data by SalesforceHandling of Large Data by Salesforce
Handling of Large Data by Salesforce
 
Big Data Day LA 2015 - Event Driven Architecture for Web Analytics by Peyman ...
Big Data Day LA 2015 - Event Driven Architecture for Web Analytics by Peyman ...Big Data Day LA 2015 - Event Driven Architecture for Web Analytics by Peyman ...
Big Data Day LA 2015 - Event Driven Architecture for Web Analytics by Peyman ...
 
Case Study: Elasticsearch Ingest Using StreamSets @ Cisco Intercloud
Case Study: Elasticsearch Ingest Using StreamSets @ Cisco IntercloudCase Study: Elasticsearch Ingest Using StreamSets @ Cisco Intercloud
Case Study: Elasticsearch Ingest Using StreamSets @ Cisco Intercloud
 

Similaire à Big Data Architectural Patterns and Best Practices on AWS

Big Data Architectural Patterns and Best Practices on AWS
Big Data Architectural Patterns and Best Practices on AWSBig Data Architectural Patterns and Best Practices on AWS
Big Data Architectural Patterns and Best Practices on AWSAmazon Web Services
 
Big Data Architectural Patterns and Best Practices
Big Data Architectural Patterns and Best PracticesBig Data Architectural Patterns and Best Practices
Big Data Architectural Patterns and Best PracticesAmazon Web Services
 
Big Data adoption success using AWS Big Data Services - Pop-up Loft TLV 2017
Big Data adoption success using AWS Big Data Services - Pop-up Loft TLV 2017Big Data adoption success using AWS Big Data Services - Pop-up Loft TLV 2017
Big Data adoption success using AWS Big Data Services - Pop-up Loft TLV 2017Amazon Web Services
 
Building a Data Processing Pipeline on AWS - AWS Summit SG 2017
Building a Data Processing Pipeline on AWS - AWS Summit SG 2017Building a Data Processing Pipeline on AWS - AWS Summit SG 2017
Building a Data Processing Pipeline on AWS - AWS Summit SG 2017Amazon Web Services
 
Database and Analytics on the AWS Cloud - AWS Innovate Toronto
Database and Analytics on the AWS Cloud - AWS Innovate TorontoDatabase and Analytics on the AWS Cloud - AWS Innovate Toronto
Database and Analytics on the AWS Cloud - AWS Innovate TorontoAmazon Web Services
 
Big Data Architectural Patterns and Best Practices on AWS
Big Data Architectural Patterns and Best Practices on AWSBig Data Architectural Patterns and Best Practices on AWS
Big Data Architectural Patterns and Best Practices on AWSAmazon Web Services
 
Building a Data Processing Pipeline on AWS
Building a Data Processing Pipeline on AWSBuilding a Data Processing Pipeline on AWS
Building a Data Processing Pipeline on AWSAmazon Web Services
 
AWS re:Invent 2016: Big Data Architectural Patterns and Best Practices on AWS...
AWS re:Invent 2016: Big Data Architectural Patterns and Best Practices on AWS...AWS re:Invent 2016: Big Data Architectural Patterns and Best Practices on AWS...
AWS re:Invent 2016: Big Data Architectural Patterns and Best Practices on AWS...Amazon Web Services
 
Building Data Lakes and Analytics on AWS; Patterns and Best Practices - BDA30...
Building Data Lakes and Analytics on AWS; Patterns and Best Practices - BDA30...Building Data Lakes and Analytics on AWS; Patterns and Best Practices - BDA30...
Building Data Lakes and Analytics on AWS; Patterns and Best Practices - BDA30...Amazon Web Services
 
ABD201-Big Data Architectural Patterns and Best Practices on AWS
ABD201-Big Data Architectural Patterns and Best Practices on AWSABD201-Big Data Architectural Patterns and Best Practices on AWS
ABD201-Big Data Architectural Patterns and Best Practices on AWSAmazon Web Services
 
AWS Enterprise Summit Netherlands - Big Data Architectural Patterns & Best Pr...
AWS Enterprise Summit Netherlands - Big Data Architectural Patterns & Best Pr...AWS Enterprise Summit Netherlands - Big Data Architectural Patterns & Best Pr...
AWS Enterprise Summit Netherlands - Big Data Architectural Patterns & Best Pr...Amazon Web Services
 
Big Data Architecture and Design Patterns
Big Data Architecture and Design PatternsBig Data Architecture and Design Patterns
Big Data Architecture and Design PatternsJohn Yeung
 
Big Data Architectural Patterns and Best Practices on AWS
Big Data Architectural Patterns and Best Practices on AWSBig Data Architectural Patterns and Best Practices on AWS
Big Data Architectural Patterns and Best Practices on AWSAmazon Web Services
 
Build Data Lakes and Analytics on AWS: Patterns & Best Practices - BDA305 - A...
Build Data Lakes and Analytics on AWS: Patterns & Best Practices - BDA305 - A...Build Data Lakes and Analytics on AWS: Patterns & Best Practices - BDA305 - A...
Build Data Lakes and Analytics on AWS: Patterns & Best Practices - BDA305 - A...Amazon Web Services
 
Simplify Big Data with AWS
Simplify Big Data with AWSSimplify Big Data with AWS
Simplify Big Data with AWSJulien SIMON
 
Big Data Analytics Architectural Patterns and Best Practices (ANT201-R1) - AW...
Big Data Analytics Architectural Patterns and Best Practices (ANT201-R1) - AW...Big Data Analytics Architectural Patterns and Best Practices (ANT201-R1) - AW...
Big Data Analytics Architectural Patterns and Best Practices (ANT201-R1) - AW...Amazon Web Services
 
Building Data Lakes and Analytics on AWS; Patterns and Best Practices - BDA30...
Building Data Lakes and Analytics on AWS; Patterns and Best Practices - BDA30...Building Data Lakes and Analytics on AWS; Patterns and Best Practices - BDA30...
Building Data Lakes and Analytics on AWS; Patterns and Best Practices - BDA30...Amazon Web Services
 
AWS re:Invent 2016: Building Big Data Applications with the AWS Big Data Plat...
AWS re:Invent 2016: Building Big Data Applications with the AWS Big Data Plat...AWS re:Invent 2016: Building Big Data Applications with the AWS Big Data Plat...
AWS re:Invent 2016: Building Big Data Applications with the AWS Big Data Plat...Amazon Web Services
 

Similaire à Big Data Architectural Patterns and Best Practices on AWS (20)

Big Data Architectural Patterns and Best Practices on AWS
Big Data Architectural Patterns and Best Practices on AWSBig Data Architectural Patterns and Best Practices on AWS
Big Data Architectural Patterns and Best Practices on AWS
 
Big Data Architectural Patterns and Best Practices
Big Data Architectural Patterns and Best PracticesBig Data Architectural Patterns and Best Practices
Big Data Architectural Patterns and Best Practices
 
Big Data adoption success using AWS Big Data Services - Pop-up Loft TLV 2017
Big Data adoption success using AWS Big Data Services - Pop-up Loft TLV 2017Big Data adoption success using AWS Big Data Services - Pop-up Loft TLV 2017
Big Data adoption success using AWS Big Data Services - Pop-up Loft TLV 2017
 
Building a Data Processing Pipeline on AWS - AWS Summit SG 2017
Building a Data Processing Pipeline on AWS - AWS Summit SG 2017Building a Data Processing Pipeline on AWS - AWS Summit SG 2017
Building a Data Processing Pipeline on AWS - AWS Summit SG 2017
 
Database and Analytics on the AWS Cloud - AWS Innovate Toronto
Database and Analytics on the AWS Cloud - AWS Innovate TorontoDatabase and Analytics on the AWS Cloud - AWS Innovate Toronto
Database and Analytics on the AWS Cloud - AWS Innovate Toronto
 
Big Data Architectural Patterns and Best Practices on AWS
Big Data Architectural Patterns and Best Practices on AWSBig Data Architectural Patterns and Best Practices on AWS
Big Data Architectural Patterns and Best Practices on AWS
 
Building a Data Processing Pipeline on AWS
Building a Data Processing Pipeline on AWSBuilding a Data Processing Pipeline on AWS
Building a Data Processing Pipeline on AWS
 
AWS re:Invent 2016: Big Data Architectural Patterns and Best Practices on AWS...
AWS re:Invent 2016: Big Data Architectural Patterns and Best Practices on AWS...AWS re:Invent 2016: Big Data Architectural Patterns and Best Practices on AWS...
AWS re:Invent 2016: Big Data Architectural Patterns and Best Practices on AWS...
 
Building Data Lakes and Analytics on AWS; Patterns and Best Practices - BDA30...
Building Data Lakes and Analytics on AWS; Patterns and Best Practices - BDA30...Building Data Lakes and Analytics on AWS; Patterns and Best Practices - BDA30...
Building Data Lakes and Analytics on AWS; Patterns and Best Practices - BDA30...
 
Deep Dive in Big Data
Deep Dive in Big DataDeep Dive in Big Data
Deep Dive in Big Data
 
ABD201-Big Data Architectural Patterns and Best Practices on AWS
ABD201-Big Data Architectural Patterns and Best Practices on AWSABD201-Big Data Architectural Patterns and Best Practices on AWS
ABD201-Big Data Architectural Patterns and Best Practices on AWS
 
AWS Enterprise Summit Netherlands - Big Data Architectural Patterns & Best Pr...
AWS Enterprise Summit Netherlands - Big Data Architectural Patterns & Best Pr...AWS Enterprise Summit Netherlands - Big Data Architectural Patterns & Best Pr...
AWS Enterprise Summit Netherlands - Big Data Architectural Patterns & Best Pr...
 
Big Data Architecture and Design Patterns
Big Data Architecture and Design PatternsBig Data Architecture and Design Patterns
Big Data Architecture and Design Patterns
 
Big Data Architectural Patterns and Best Practices on AWS
Big Data Architectural Patterns and Best Practices on AWSBig Data Architectural Patterns and Best Practices on AWS
Big Data Architectural Patterns and Best Practices on AWS
 
Build Data Lakes and Analytics on AWS: Patterns & Best Practices - BDA305 - A...
Build Data Lakes and Analytics on AWS: Patterns & Best Practices - BDA305 - A...Build Data Lakes and Analytics on AWS: Patterns & Best Practices - BDA305 - A...
Build Data Lakes and Analytics on AWS: Patterns & Best Practices - BDA305 - A...
 
Deep Dive on Big Data
Deep Dive on Big Data Deep Dive on Big Data
Deep Dive on Big Data
 
Simplify Big Data with AWS
Simplify Big Data with AWSSimplify Big Data with AWS
Simplify Big Data with AWS
 
Big Data Analytics Architectural Patterns and Best Practices (ANT201-R1) - AW...
Big Data Analytics Architectural Patterns and Best Practices (ANT201-R1) - AW...Big Data Analytics Architectural Patterns and Best Practices (ANT201-R1) - AW...
Big Data Analytics Architectural Patterns and Best Practices (ANT201-R1) - AW...
 
Building Data Lakes and Analytics on AWS; Patterns and Best Practices - BDA30...
Building Data Lakes and Analytics on AWS; Patterns and Best Practices - BDA30...Building Data Lakes and Analytics on AWS; Patterns and Best Practices - BDA30...
Building Data Lakes and Analytics on AWS; Patterns and Best Practices - BDA30...
 
AWS re:Invent 2016: Building Big Data Applications with the AWS Big Data Plat...
AWS re:Invent 2016: Building Big Data Applications with the AWS Big Data Plat...AWS re:Invent 2016: Building Big Data Applications with the AWS Big Data Plat...
AWS re:Invent 2016: Building Big Data Applications with the AWS Big Data Plat...
 

Plus de Amazon Web Services

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Amazon Web Services
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Amazon Web Services
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateAmazon Web Services
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSAmazon Web Services
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Amazon Web Services
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Amazon Web Services
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...Amazon Web Services
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsAmazon Web Services
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareAmazon Web Services
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSAmazon Web Services
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAmazon Web Services
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareAmazon Web Services
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWSAmazon Web Services
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckAmazon Web Services
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without serversAmazon Web Services
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...Amazon Web Services
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceAmazon Web Services
 

Plus de Amazon Web Services (20)

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 

Dernier

Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamUiPathCommunity
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdfSandro Moreira
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWERMadyBayot
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Zilliz
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...apidays
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Victor Rentea
 
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKJago de Vreede
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Orbitshub
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Angeliki Cooney
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
 

Dernier (20)

Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 

Big Data Architectural Patterns and Best Practices on AWS

  • 1. © 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Ian Meyers, Principal Solutions Architect July 7th , 2016 Big Data Architectural Patterns and Best Practices on AWS
  • 3. Powerful Services, Open Ecosystem Amazon Glacier S3 DynamoDB RDS EMR Amazon Redshift Data Pipeline Amazon Kinesis Kinesis-enabled app Lambda ML SQS ElastiCache DynamoDB Streams Amazon Elasticsearch Service Amazon Web Services Open Source & Commercial Ecosystem
  • 4. Separation of compute & storage • Event driven, loose coupling, independent scaling & sizing “Data Sympathy” - use the right tool for the job Data structure, latency, throughput, access patterns Managed services, open technologies Scalable/elastic, available, reliable, secure, no/low admin Focus on the data, and your business, not undifferentiated tasks Architectural Principles
  • 5. Big Data != Big Cost Aim to reduce the cost of experimentation to near 0
  • 6. Simplify Big Data Processing COLLECT STORE PROCESS/ ANALYZE CONSUME Time to answer (Latency) Throughput Cost
  • 7. Devices Sensors & IoT platforms AWS IoT STREAMS IoTApplications AWS Import/Export Snowball Logging Amazon CloudWatch AWS CloudTrail DOCUMENTS FILES LoggingTransport Messaging Message MESSAGES Messaging COLLECT Mobile apps Web apps Data centers AWS Direct Connect RECORDS AWS Mobile SDK AWS IoT Amazon CloudWatchLogs Agent Amazon CognitoSync AWS Mobile Analytics Easily Capture Mobile Data through Sync and Server Logging File based integration to Streaming Storage API Integration via SDK’s and Ecosystem Tools Amazon Kinesis Producer Library Amazon Kinesis Agent
  • 8. Principles for Collection Architecture • Should be easy to install, manage, and understand • Meet requirements for data durability & recovery • Ensure ability to meet latency & performance requirements • Data handling should be sympathetic to the type of data
  • 9. Simplify Big Data Processing COLLECT STORE PROCESS/ ANALYZE CONSUME Time to answer (Latency) Throughput Cost
  • 10. Types of Data Devices Sensors & IoT platforms AWS IoT STREAMS IoTApplications AWS Import/Export Snowball Logging Amazon CloudWatch AWS CloudTrail DOCUMENTS FILES LoggingTransport Messaging Message MESSAGES Messaging COLLECT Mobile apps Web apps Data centers AWS Direct Connect RECORDS Database STORE Search File store Queue Stream storage Cache Heavily read, application defined structure Consistently read, relational or document data Frequently read, often requiring natural language processing or geospatial access Data received as files, or where object storage is appropriate Data intended for a single recipient, with low latency Data intended for many recipients, with low latency
  • 11. COLLECT STORE Mobile apps Web apps Data centers AWS Direct Connect RECORDS AWS Import/Export Snowball Logging Amazon CloudWatch AWS CloudTrail DOCUMENT S FILES Messaging Message MESSAGES Devices Sensors & IoT platforms AWS IoT STREAMS Stream Amazon SQS Message Amazon Elasticsearch Service Amazon DynamoDB Amazon S3 Amazon ElastiCache Amazon RDS SearchSQLNoSQLCacheFile LoggingIoTApplicationsTransportMessaging File store Queue Amazon Kinesis Firehose Amazon Kinesis Streams Amazon DynamoDB Streams Stream storage Cache, Database, Search
  • 12. Database Anti-pattern Myriad of Applications One GREAT BIG Database
  • 13. Myriad of Applications Best Practice - Use the Right Tool for the job Data Tier Search Amazon Elasticsearch Service Amazon CloudSearch Cache Amazon ElastiCache Redis Memcached SQL Amazon Aurora Amazon RDS Amazon Redshift NoSQL Amazon DynamoDB Cassandra HBase MongoDB Database Tier
  • 14. COLLECT STORE Mobile apps Web apps Data centers AWS Direct Connect RECORDS AWS Import/Export Snowball Logging Amazon CloudWatch AWS CloudTrail DOCUMENT S FILES Messaging Message MESSAGES Devices Sensors & IoT platforms AWS IoT STREAMS Stream Amazon SQS Message Amazon S3 File LoggingIoTApplicationsTransportMessaging Amazon Kinesis Firehose Amazon Kinesis Streams Amazon DynamoDB Streams Queue Stream storage Amazon Elasticsearch Service Amazon DynamoDB Amazon ElastiCache Amazon RDS File store
  • 15. Why is Amazon S3 Good for Big Data? • Virtually unlimited storage, unlimited object count • Very high bandwidth – no aggregate throughput limit • Highly available – can tolerate AZ failure • Designed for 99.999999999% durability • Tired-storage (Standard, IA, Amazon Glacier) via life-cycle policies • Native support for Versioning • Secure – SSL, client/server-side encryption at rest • Low cost
  • 16. Why is Amazon S3 Good for Big Data? • Natively supported by big data frameworks (Spark, Hive, Presto, etc.) • No need to run compute clusters for storage (unlike HDFS) • No need to pay for data replication • Can run transient Hadoop clusters & Amazon EC2 Spot Instances • Multiple & heterogenous analysis clusters can use the same data
  • 17. What about HDFS & Data Tiering (S3 IA & Amazon Glacier)? • Use HDFS for very frequently accessed (hot) data • Use Amazon S3 Standard as system of record for durability • Use Amazon S3 Standard – IA for near line archive data • Use Amazon Glacier for cold/offline archive data
  • 18. Amazon Kinesis Firehose Amazon Kinesis Streams Amazon DynamoDB Streams Amazon SQS Amazon SQS Managed message queue service Amazon Kinesis Streams Managed stream storage + processing Amazon Kinesis Firehose Managed data delivery Amazon DynamoDB Streams Managed NoSQL database Tables can be stream-enabled Message & Stream Data Devices Sensors & IoT platforms AWS IoT STREAMS IoT COLLECT STORE Mobile apps Web apps Data centers AWS Direct Connect RECORDS Applications AWS Import/Export Snowball Logging Amazon CloudWatch AWS CloudTrail DOCUMENTS FILES LoggingTransport Messaging Message MESSAGES Messaging Queue Stream Amazon Elasticsearch Service Amazon DynamoDB Amazon ElastiCache Amazon RDS Amazon S3 Queue Stream storage
  • 19. Why Stream Storage? Decouple producers & consumers Persistent buffer Collect multiple streams Preserve client ordering Streaming analysis Parallel consumption 4 4 3 3 2 2 1 1 4 3 2 1 4 3 2 1 4 3 2 1 4 3 2 1 4 4 3 3 2 2 1 1 shard 1 / partition 1 shard 2 / partition 2 Consumer 1 Count of red = 4 Count of violet = 4 Consumer 2 Count of blue = 4 Count of green = 4 DynamoDB stream Amazon Kinesis stream
  • 20. What Data Store Should I Use? Data structure → Fixed schema, JSON, key-value Access patterns → Store data in the format you will access it Data / access characteristics → Hot, warm, cold Cost → Right cost
  • 21. Data Structure and Access Patterns Access Patterns What to use? Put/Get (key, value) Cache, NoSQL Simple relationships → 1:N, M:N NoSQL Cross table joins, transaction, SQL SQL Faceting, natural language, geo Search Data Structure What to use? Fixed schema SQL, NoSQL Schema-free (JSON) NoSQL, Search (Key, value) Cache, NoSQL
  • 22. What Is the Temperature of Your Data / Access ?
  • 23. Hot Warm Cold Volume MB–GB GB–TB PB Item size B–KB KB–MB KB–TB Latency ms ms, sec min, hrs Durability Low – high High Very high Request rate Very high High-Med Low Cost/GB $$-$ $-¢¢ ¢ Hot data Warm data Cold data Data / Access Characteristics: Hot, Warm, Cold
  • 24. Amazon ElastiCache Amazon DynamoDB Amazon RDS/Aurora Amazon Elasticsearch Amazon S3 Amazon Glacier Average latency ms ms ms, sec ms,sec ms,sec,min (~ size) hrs Typical data stored GB GB–TBs (no limit) GB–TB (64 TB max) GB–TB MB–PB (no limit) GB–PB (no limit) Typical item size B-KB KB (400 KB max) KB (64 KB max) KB (2 GB max) KB-TB (5 TB max) GB (40 TB max) Request Rate High – very high Very high (no limit) High High Low – high (no limit) Very low Storage cost GB/month $$ ¢¢ ¢¢ ¢¢ ¢ ¢/10 Durability Low - moderate Very high Very high High Very high Very high Availability High 2 AZ Very high 3 AZ Very high 3 AZ High 2 AZ Very high 3 AZ Very high 3 AZ Hot data Warm data Cold data What Data Store Should I Use?
  • 25. Simplify Big Data Processing COLLECT STORE PROCESS/ ANALYZE CONSUME Time to answer (Latency) Throughput Cost
  • 26. COLLECT STORE PROCESS / ANALYZE Amazon Elasticsearch Service Amazon SQS Amazon DynamoDB Amazon S3 Amazon ElastiCache Amazon RDS HotWarm SearchSQLNoSQLCacheFileMessage Stream Mobile apps Web apps Devices Messaging Message Sensors & IoT platforms AWS IoT Data centers AWS Direct Connect AWS Import/Export Snowball Logging Amazon CloudWatch AWS CloudTrail RECORDS DOCUMENT S FILES MESSAGES STREAMS LoggingIoTApplicationsTransportMessaging Amazon Kinesis Firehose Amazon Kinesis Streams Amazon DynamoDB Streams Hot
  • 27. Tools and Frameworks Amazon SQS apps Streaming Amazon Kinesis Analytics Amazon KCL apps AWS Lambda Amazon Redshift PROCESS / ANALYZE Amazon Machine Learning Presto Amazon EMR FastSlowFast BatchMessageInteractiveStreamML Amazon EC2 Amazon EC2 Batch - Minutes or hours on cold data Daily/weekly/monthly reports Interactive – Seconds on warm/cold data Self-service dashboards Messaging – Milliseconds or seconds on hot data Message/eventbuffering Streaming - Milliseconds or seconds on hot data Billing/fraud alerts, 1 minute metrics
  • 28. Tools and Frameworks Batch Amazon EMR (MapReduce, Hive, Pig, Spark) Interactive Amazon Redshift, Amazon EMR (Presto, Spark) Messaging Amazon SQS application on Amazon EC2 Streaming Micro-batch: Spark Streaming, KCL Real-time: Amazon Kinesis Analytics, Storm, AWS Lambda, KCL Amazon SQS apps Streaming Amazon Kinesis Analytics Amazon KCL apps AWS Lambda Amazon Redshift PROCESS / ANALYZE Amazon Machine Learning Presto Amazon EMR FastSlowFast BatchMessageInteractiveStreamML Amazon EC2 Amazon EC2
  • 29. What Analytics Technology Should I Use? Amazon Redshift Amazon EMR Presto Spark Hive Query latency Low Low Low High Durability High High High High Data volume 1.6 PB max ~Nodes ~Nodes ~Nodes AWS managed Yes Yes Yes Yes Storage Native HDFS / S3 HDFS / S3 HDFS / S3 SQL compatibility High High Low (SparkSQL) Medium (HQL) Slow
  • 30. What Streaming Analysis Technology Should I Use? Spark Streaming Apache Storm Kinesis KCL Application AWS Lambda Amazon SQS Apps Scale ~ Nodes ~ Nodes ~ Nodes Automatic ~ Nodes Micro-batch or Real-time Micro-batch Real-time Near-real-time Near-real-time Near-real-time AWS managed service Yes (EMR) No (EC2) No (KCL + EC2 + Auto Scaling) Yes No (EC2 + Auto Scaling) Scalability No limits ~ nodes No limits ~ nodes No limits ~ nodes No limits No limits Availability Single AZ Configurable Multi-AZ Multi-AZ Multi-AZ Programming languages Java, Python, Scala Any language via Thrift Java, via MultiLang Daemon (.NET, Python, Ruby, Node.js) Node.js, Java, Python AWS SDK languages (Java, .NET, Python, …)
  • 31. Simplify Big Data Processing COLLECT STORE PROCESS/ ANALYZE CONSUME Time to answer (Latency) Throughput Cost
  • 32. Amazon SQS apps Streaming Amazon Kinesis Analytics Amazon KCL apps AWS Lambda Amazon Redshift COLLECT STORE CONSUMEPROCESS / ANALYZE Amazon Machine Learning Presto Amazon EMR Amazon Elasticsearch Service Amazon SQS Amazon DynamoDB Amazon S3 Amazon ElastiCache Amazon RDS HotHotWarm FastSlowFast BatchMessageInteractiveStreamML SearchSQLNoSQLCacheFileMessage Stream Amazon EC2 Amazon EC2 Mobile apps Web apps Devices Messaging Message Sensors & IoT platforms AWS IoT Data centers AWS Direct Connect AWS Import/Export Snowball Logging Amazon CloudWatch AWS CloudTrail RECORDS DOCUMENT S FILES MESSAGES STREAMS LoggingIoTApplicationsTransportMessaging ETL Amazon Kinesis Firehose Amazon Kinesis Streams Amazon DynamoDB Streams
  • 33. STORE CONSUMEPROCESS / ANALYZE Amazon QuickSight Analysis&visualizationNotebooksIDE Apps & Services API Applications & API Reporting, Exploratory and Ad-Hoc Analysis and Data Visualization Notebooks Data Science Business users Data scientist, developers COLLECT ETL
  • 34.
  • 36. Primitive: Batch Analysis & Enhancement process store Amazon S3 Amazon DynamoDB Amazon EMR Presto Spark SQL Amazon Redshift Amazon Quicksight AWS Marketplace
  • 37. Primitive: Event Processing “Data Bus” Multiple stages Storage raises events to decouple processing stages Store Process Store Process
  • 38. Primitive: Pub/Sub process store Amazon Kinesis AWS Lambda Amazon Kinesis S3 Connector Apache Spark Kinesis Client Library
  • 39. Combining Primitives: Event Processing Network Combines data bus and pub/sub patterns to create asynchronous, continuously executing transformations Store Process Store Process Process Store Process Process Process
  • 40. Primitive: Messaging “Fan Out” process store Amazon SQS AWS Lambda Amazon Kinesis
  • 41. Real Time Processing Batch Analytics process store Amazon S3 Amazon Redshift Amazon EMR Presto Hive Pig Spark Real Time StorageAmazon ElastiCache Amazon DynamoDB Amazon RDS Amazon ES Amazon ML KCL AWS Lambda Storm Real Time & Batch Architecture Spark Streaming on Amazon EMR Applications Amazon Kinesis
  • 42. © 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Zoltan Arvai, Big Data Architect July 7th, 2016 Leveraging S3 and EMR at Adbrain
  • 43. About Adbrain • Adbrain is a market leader in providing Probabilistic Identity Solutions • At the core of our business we build an identity graph that represents connected devices on the planet • We enable our clients to understand and target their audience better
  • 44. What does that mean? • Ingesting large amounts of log data from different data sources • Build a graph using statistical and machine learning algorithms • Extract a relevant user graph for our clients
  • 45. Our main scenarios • Production graph generation and servicing our clients • Data Science team to run computations (Spark) • R&D work with ML • Data feed evaluation • Graph analytics • Data Engineering development workload
  • 46. Our original setup • HDFS Cluster on 18 nodes with most of our hot data • Not leveraging co-location of data • Spark clusters are separated • Spark instance type optimized according to workload • Custom cluster provisioning solution to run Spark jobs • Built in-house • Required maintenance
  • 48. What are the issues? • “Hadoop is a cheap Big Data solution on commodity hardware” … define cheap for a startup • Scaling Hadoop means more focus on infrastructure • Data grows really quickly (adding a new country, data feed, etc.)
  • 49. What do we need? • Better scale • Cheaper solution • Less focus on infrastructure • Reduced risk
  • 50. EMR and S3 to the rescue! • S3 is significantly cheaper compared to HDFS • We use infrequent access and Glacier storage as well • EMR is a fully managed solution • Up to date Spark distributions • Support of Spot instances • Managed Hadoop when / if needed Amazon S3 Amazon EMR
  • 52. Some technical challenges • Spark 1.3.1 did not work well with S3 and Parquet files • Writing to S3 using the default ParquetOutputComitter is slow • Calculating output summary metadata is slow • S3 is not consistent by default (it’s not a file system) • S3 can be slower than HDFS
  • 53. Solutions • Upgrade to Spark 1.5.1+ • Using DirectParquetOutputCommitter with speculation disabled • parquet.enable.summary-metadata = false • Enable S3 Consistent View in EMR! • (Remember the DynamoDB tables!) • In case of a multi-step Spark workflow, utilize HDFS on the Spark Cluster
  • 54. The end result • Decreasing costs by 70%+ • Focusing on development instead of commodities (hardware/versioning/patching) • Eliminated risks