AWS Government, Education, and Nonprofit Symposium
Washington, DC | June 25-26, 2015
Big Data and Analytics on AWS
KD Singh
Solutions Architect
Amazon Web Services
©2015, Amazon Web Services, Inc. or its affiliates. All rights reserved.
What is big data?
When your data sets become so large that you have to start
innovating around how to collect, store, organize, analyze, and
share them
• Velocity
– Rate of data flow in
• Latency
– High or Low
• Volume
– High or Low
• Variety
– Diversity of source data
• Item Size
– KB or MB
• Request Rate
– Access patterns
• Change Rate
– How much is the data changing?
• Processing Requirements
– How much computation?
• Durability
– Preservation of source data?
• Availability
– Tolerance for downtime?
• Growth Rate
– Rate of data growth?
• Views
– The diversity of consumers?
Plethora of tools
Amazon
Glacier
Amazon
S3
Amazon
DynamoDB
Amazon
RDS
Amazon
EMR
Amazon
Redshift
AWS
Data Pipeline
Amazon
Kinesis
Cassandra
Amazon
CloudSearch
Ingest Store Analyze Visualize
Data Answers
Time
Multiple stages
Storage decoupled from processing
Simplify data analytics flow
Amazon
Glacier
S3
DynamoDB
RDS
Amazon Kinesis
Spark
Streaming
EMR
Ingest Store Process/Analyze Visualize
Data Pipeline
Storm
Kafka
Amazon
Redshift
Cassandra
Amazon
CloudSearch
Amazon
Kinesis
Connector
Kinesis
enabled app
App Server
Web Server
Devices
Collect / Ingest
Amazon Kinesis
Process / Analyze
Amazon EMR Amazon EC2
Amazon
Redshift
AWS
Data Pipeline
Visualize / Report
Store
Amazon Glacier
Amazon S3
Amazon DynamoDB
Amazon RDS
AWS Import/Export
AWS Direct Connect
Amazon SQS
AWS big data portfolio
Mobile / Cable
Telecom
Oil and Gas
Industrial
Manufacturing
Retail/Consumer
Entertainment
Hospitality
Life Sciences
Scientific
Exploration
Financial
Services
Publishing Media
Advertising
Online Media
Social Network
Gaming
Industries using AWS for data analysis
Ingest: The act of collecting and storing data
Types of data ingest
• Transactional
– Database reads/writes
• File
– Media files; log files
• Stream
– Click-stream logs (sets of events)
Database
Cloud
Storage
Stream
Storage
Logging Frameworks Devices Apps
Real-time processing of streaming data
High throughput
Elastic
Easy to use
Connectors for EMR, S3, Amazon Redshift,
DynamoDB
Amazon
Kinesis
Sending and reading data from Amazon
Kinesis streams
HTTP Post
AWS SDK
LOG4J
Flume
Fluentd
Get* APIs
Kinesis Client Library
+
Connector Library
Apache
Storm
Amazon Elastic
MapReduce
Sending Reading
Write Read
Hparser, Big Data Edition
Flume, Sqoop
AWS Partners for data ingest, load, and
transformation
Storage
App/Web Tier
Client Tier
Database & Storage Tier
Cloud database and storage tier anti-pattern
App/Web Tier
Client Tier
Data Tier
Database & Storage Tier
Search
Hadoop/HDFS
Cache
Blob Store
SQL NoSQL
Cloud database and storage tier — use the right
tool for the job!
Database & Storage Tier
Amazon RDS
Amazon
DynamoDB
Amazon
ElastiCache
Amazon S3
Amazon
Glacier
Amazon
CloudSearch
HDFS on Amazon EMR
Cloud database and storage tier — use the right
tool for the job!
Store anything
Object storage
Scalable
Designed for 99.999999999% durability
Amazon
S3
Aggregate all data in S3 surrounded by a collection of the right tools
Amazon
EMR
Amazon
Kinesis
Amazon
Redshift
Amazon
DynamoDB
Amazon RDS
AWS
Data Pipeline
Spark
Streaming
Cassandra Storm
Amazon S3
• No limit on the number of objects
• Object size up to 5 TB
• Central data storage for all
systems
• High bandwidth
• 99.999999999% durability
• Versioning; lifecycle policies
• Amazon Glacier integration
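As a rough illustration of what the 99.999999999% figure implies, the sketch below estimates expected object loss per year. It assumes the figure is read as an annual, per-object durability, which is an interpretation and not something the slide states; the function name and the object count are illustrative.

```javascript
// Sketch: expected annual object loss under a given durability figure.
// Assumption: durability is an annual, per-object probability of survival.
function expectedAnnualLoss(objectCount, durability) {
  return objectCount * (1 - durability);
}

const DURABILITY = 0.99999999999; // eleven nines, taken from the slide
const loss = expectedAnnualLoss(1e7, DURABILITY);
console.log(loss); // expected objects lost per year for ten million objects
```

Under that reading, ten million stored objects correspond to roughly one lost object every 10,000 years on average.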
Fully managed NoSQL database service
Built on solid-state drives (SSDs)
Consistent low-latency performance
Any throughput rate
No storage limits
Amazon
DynamoDB
• Scaling without downtime
• Automatic sharding
• Security inspections, patches,
upgrades
• Automatic hardware failover
• Multi-AZ replication
• Hardware configuration
designed specifically for
DynamoDB
• Performance tuning
DynamoDB: managed high availability and
durability
Relational databases
Fully managed; zero admin
MySQL, PostgreSQL, Oracle, SQL Server
Aurora
Amazon
RDS
Process and analyze
Processing frameworks
• Batch processing
– Take a large amount (>100 TB) of cold data and ask questions
– Takes minutes or hours to get answers back
– Example: Generating hourly, daily, weekly reports
• Stream processing (real-time)
– Take a small amount of hot data and ask questions
– Takes a short amount of time to get your answer back
– Example: 1-minute metrics
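To make the one-minute metrics example concrete, here is a minimal sketch that counts events per one-minute window. The event shape (an object with a millisecond `timestamp` field) and the function name are assumptions for illustration; a real stream processor would also handle late or out-of-order events.

```javascript
// Sketch: count events per one-minute window.
// Assumption: each event is an object with a millisecond `timestamp` field.
function countPerMinute(events) {
  const counts = new Map();
  for (const event of events) {
    // Truncate the timestamp to the start of its minute.
    const windowStart = Math.floor(event.timestamp / 60000) * 60000;
    counts.set(windowStart, (counts.get(windowStart) || 0) + 1);
  }
  return counts;
}

const sample = [{ timestamp: 0 }, { timestamp: 59999 }, { timestamp: 60000 }];
const perMinute = countPerMinute(sample);
```

The first two sample events fall in the window starting at 0 ms and the third in the window starting at 60000 ms.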
Processing frameworks
• Batch processing/analytic
– Amazon Redshift
– Amazon EMR (Hadoop)
– Spark, Hive/Tez, Pig, Impala, Presto, ….
• Stream processing
– Amazon Kinesis client and connector library
– Spark Streaming
– Storm (+Trident)
MPP, Hadoop, Stream Processing
Columnar data warehouse
ANSI SQL compatible
Massively parallel
Petabyte scale
Fully managed
Very cost-effective
Amazon
Redshift
Amazon Redshift architecture
• Leader Node
– SQL endpoint
– Stores metadata
– Coordinates query execution
• Compute Nodes
– Local, columnar storage
– Execute queries in parallel
– Load, backup, restore via
Amazon S3
– Parallel load from Amazon DynamoDB
• Hardware optimized for data processing
• Two hardware platforms
– DS2 (dense storage): HDD; scale to 1.6PB
– DC1 (dense compute): SSD; scale to 256TB
10 GigE
(HPC)
Ingestion
Backup
Restore
JDBC/ODBC
Hadoop/HDFS clusters
Hive, Pig, Impala, HBase
Easy to use; fully managed
On-demand and spot pricing
Tight integration with S3,
DynamoDB, and Amazon Kinesis
Amazon
Elastic
MapReduce
How does EMR work?
1. Put the data into S3
2. Choose: Hadoop distribution, number of nodes, types of nodes, Hadoop apps like Hive/Pig/HBase
3. Launch the cluster using the EMR console, CLI, SDK, or APIs
4. Get the output from S3
Diagram: EMR cluster, S3
The Hadoop ecosystem works with EMR
Partners – advanced analytics
Visualize
AWS Partners for BI & data visualization
Putting it all together
Amazon EMR
as ETL Grid
and Analysis
Amazon Redshift –
Production DWH
Visualization
Logs
Traffic Statistics
Demo
Demo
ICAO and Hadoop
Marco Merens
Chief (Acting) Integrated Analysis
International Civil Aviation Organization
ICAO in the cloud
Cloudability principles at ICAO
1. What comes from the cloud can stay in the cloud
2. What comes from in-house
A. should stay in-house if private, or
B. can be synced with the cloud if public
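The two principles above can be read as a small decision rule. The sketch below encodes them as a function; the function name, the `origin` values, and the returned strings are illustrative assumptions, not ICAO's actual implementation.

```javascript
// Sketch of the two cloudability principles as a decision function.
// Assumptions: `origin` is "cloud" or "in-house"; `isPrivate` is a boolean.
function placement(origin, isPrivate) {
  if (origin === "cloud") {
    return "stay in cloud";
  }
  if (origin === "in-house") {
    return isPrivate ? "stay in-house" : "sync with cloud";
  }
  throw new Error("unknown origin: " + origin);
}

console.log(placement("in-house", false));
```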
Data sync
Data
Basic UI
Create
Read
Update
Delete
sync Data
Fancy UI
Read
Metrics
Collect Map Reduce Publish
Key Priority
EMR example: blended accident list
Input format
XML
<?xml version="1.0" encoding="utf-8"?>
<root>
<ADREP><FilingInformation
State="XX"><ReportingOrganization>Ascend</ReportingOrganization>
<StateFileNumber>S1982045</StateFileNumber>
<Headline>MU-2, Collision with high ground, (near) Kelowna</Headline>
</FilingInformation>
….
</root>
CSV
|26/12/2001|Germany|Germany|"ICE:Icing"|Accident|Fatal|8|Germany|Bremerhaven|D-IAAI|"BRITTEN NORMAN"||||"2 251 to 5 700 Kg"|Scheduled|Airplane|Take-off||
…
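As a hedged sketch of how one record in the pipe-delimited format might be split into fields, the function below assumes that fields never contain a `|` and that double quotes only wrap whole fields. The slide does not name the columns, so the fields are left positional; `parseRecord` is an illustrative name.

```javascript
// Sketch: split one pipe-delimited record into positional fields.
// Assumptions: fields never contain "|" and quotes only wrap whole fields.
function parseRecord(line) {
  return line.split("|").map(function (field) {
    // Strip one pair of surrounding double quotes, if present.
    return field.replace(/^"(.*)"$/, "$1");
  });
}

const fields = parseRecord('|26/12/2001|Germany|Germany|"ICE:Icing"|Accident|Fatal|8');
console.log(fields);
```

Because each record starts with a delimiter, the first element of the result is an empty string.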
#!/bin/sh
wget "http://somexml" -qO- | tr -d "\n" | tr -d "\r" |
sed "s#<Accident>#\n<Accident>#g" > tmp
aws s3 cp tmp s3://accidents/input/source1
…….
Amazon S3
Amazon EC2
Use Linux crontab to schedule
Make one XML element per line for EMR
Collect
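The flattening step of the collect script can also be sketched in Node.js, the language used for the mapper and reducer. This is an illustrative equivalent of the tr/sed pipeline, not code from the deck; the `<Accident>` tag comes from the sed command, while the `<Id>` elements in the sample are made up.

```javascript
// Sketch: flatten an XML document so each <Accident> element starts a line.
// Removes all CR/LF characters, then inserts a newline before each element.
function oneElementPerLine(xml) {
  return xml
    .replace(/[\r\n]/g, "")
    .replace(/<Accident>/g, "\n<Accident>");
}

// Hypothetical input; the <Id> elements are for illustration only.
const flat = oneElementPerLine(
  "<root>\n<Accident>\n<Id>1</Id>\n</Accident>\n<Accident><Id>2</Id></Accident>\n</root>"
);
console.log(flat);
```

With one element per line, Hadoop streaming can hand each element to the mapper as a single input record.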
EMR command line
elastic-mapreduce
--create
--bootstrap-action
s3://elasticmapreduce/samples/node/install-node-bin-x86.sh
--instance-type m1.small --instance-count 3
--json job.json
--put /home/ec2-user/key/newtest.pem
--to /home/hadoop
--enable-debugging
Put the SSH key in /home/hadoop if you need a remote shell
EMR json config file
[{
  "Name": "Make accident map",
  "ActionOnFailure": "CANCEL_AND_WAIT",
  "HadoopJarStep": {
    "Jar": "/home/hadoop/contrib/streaming/hadoop-streaming.jar",
    "Args": [
      "-input",
      "s3://accidentstats/input/*", …
    ]
  }
}, {
  "Name": "Store in mongo",
  "ActionOnFailure": "CANCEL_AND_WAIT",
  "HadoopJarStep": {
    "Jar": "s3://elasticmapreduce/libs/script-runner/script-runner.jar",
    "Args": [
      "s3://edmscripts/uploadtomongo.sh",
      "accidentstats/output",
      "NEWACCIDENTLIST"
    ]
  }
}]
Move the results
from S3 to
somewhere else
Map
sourceX
#!/usr/bin/env node
// Dispatch each input line to the parser for its source format.
function treatline(line) {
  if (line.indexOf("<ADREP>") !== -1) {
    source1(line)
  }
  // ……
}
// Emit one tab-separated key/value line per record for Hadoop streaming.
function source1(line) {
  var data = xml2json(line)
  data.records.forEach(function (v) {
    var el = {
      Date: v.Date,
      Registration: v.Registration,
      Model: v.Model,
      Source: "Source1",
      Priority: 1
    }
    var key = el.Date + "#" + el.Registration
    process.stdout.write(key + "\t" + JSON.stringify(el) + "\n")
  })
}
mapped
Amazon Elastic
MapReduce
Amazon S3
Reduce
Mapped and
sorted
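#!/usr/bin/env node
var oldkey, key, array = []
// Collect consecutive lines that share a key, then blend each group.
function treatline(line) {
  key = line.split("\t")[0]
  var data = JSON.parse(line.split("\t")[1])
  if ((key == oldkey) || !oldkey) {
    array.push(data)
  } else {
    treat(array)
    array = [data] // start the next group with the current record
  }
  oldkey = key
  // ……
}
// Merge one group of records in priority order and emit the result.
function treat(array) {
  var el = {}
  array = array.sort(prioritysort)
  array.forEach(function (v) {
    el = updateresult(el, v)
  })
  process.stdout.write(JSON.stringify(el) + "\n")
}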
Reduced
Amazon Elastic
MapReduce
Amazon S3
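The map, sort, and reduce flow can be tried locally on in-memory records before launching a cluster. The sketch below is an illustration under stated assumptions: the slides do not show `prioritysort` or `updateresult`, so the versions here (lower `Priority` wins, and the first non-empty value per field is kept) are hypothetical stand-ins, as are the function names and sample records.

```javascript
// Sketch: run the map -> sort -> reduce flow locally on in-memory records.
// Assumptions (not in the slides): lower Priority wins, and merging keeps
// the first non-empty value seen for each field.
function mapRecord(record, source, priority) {
  const el = Object.assign({}, record, { Source: source, Priority: priority });
  return { key: el.Date + "#" + el.Registration, value: el };
}

function prioritySort(a, b) {
  return a.Priority - b.Priority;
}

function updateResult(result, v) {
  for (const field of Object.keys(v)) {
    if (result[field] === undefined || result[field] === "") {
      result[field] = v[field];
    }
  }
  return result;
}

function reduceAll(pairs) {
  // Group values by key, as the shuffle/sort phase would before the reducer.
  const groups = new Map();
  for (const pair of pairs) {
    if (!groups.has(pair.key)) groups.set(pair.key, []);
    groups.get(pair.key).push(pair.value);
  }
  const results = [];
  for (const values of groups.values()) {
    results.push(values.sort(prioritySort).reduce(updateResult, {}));
  }
  return results;
}

// Two hypothetical sources reporting the same accident.
const pairs = [
  mapRecord({ Date: "26/12/2001", Registration: "D-IAAI", Model: "BRITTEN NORMAN" }, "Source2", 2),
  mapRecord({ Date: "26/12/2001", Registration: "D-IAAI", Model: "" }, "Source1", 1),
];
const blended = reduceAll(pairs);
console.log(JSON.stringify(blended));
```

Here the higher-priority source supplies `Source` and `Priority`, while the empty `Model` is filled in from the lower-priority source, which is one plausible reading of a "blended" accident list.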
Real-time statistics
Amazon Elastic
MapReduce
Thank You.
This presentation will be loaded to SlideShare the week following the Symposium.
http://www.slideshare.net/AmazonWebServices
 
Accelerating Time to Science: Transforming Research in the Cloud
Accelerating Time to Science:Transforming Research in the CloudAccelerating Time to Science:Transforming Research in the Cloud
Accelerating Time to Science: Transforming Research in the Cloud
 
Transforming Education in the Cloud
Transforming Education in the CloudTransforming Education in the Cloud
Transforming Education in the Cloud
 
3. 195883 open gis data slides jw_edit_js-mh
3. 195883 open gis data slides jw_edit_js-mh3. 195883 open gis data slides jw_edit_js-mh
3. 195883 open gis data slides jw_edit_js-mh
 
Open GIS Data
Open GIS DataOpen GIS Data
Open GIS Data
 
Scaling by Design: AWS Web Services Patterns
Scaling by Design:AWS Web Services PatternsScaling by Design:AWS Web Services Patterns
Scaling by Design: AWS Web Services Patterns
 
Scaling by Design: AWS Web Services Patterns
Scaling by Design:AWS Web Services PatternsScaling by Design:AWS Web Services Patterns
Scaling by Design: AWS Web Services Patterns
 
Adobe : The Future of SaaS
Adobe : The Future of SaaSAdobe : The Future of SaaS
Adobe : The Future of SaaS
 
A Framework for Cloud IT and Business Transformation
A Framework for Cloud IT and Business TransformationA Framework for Cloud IT and Business Transformation
A Framework for Cloud IT and Business Transformation
 
Big Data on AWS - AWS Washington D.C. Symposium 2014
Big Data on AWS - AWS Washington D.C. Symposium 2014Big Data on AWS - AWS Washington D.C. Symposium 2014
Big Data on AWS - AWS Washington D.C. Symposium 2014
 

Plus de Amazon Web Services

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Amazon Web Services
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Amazon Web Services
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateAmazon Web Services
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSAmazon Web Services
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Amazon Web Services
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Amazon Web Services
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...Amazon Web Services
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsAmazon Web Services
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareAmazon Web Services
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSAmazon Web Services
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAmazon Web Services
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareAmazon Web Services
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWSAmazon Web Services
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckAmazon Web Services
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without serversAmazon Web Services
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...Amazon Web Services
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceAmazon Web Services
 

Plus de Amazon Web Services (20)

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 

Dernier

A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 

Dernier (20)

A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 

Big Data and Analytics on AWS

  • 1. AWS Government, Education, and Nonprofit Symposium, Washington, DC | June 25-26, 2015. Big Data and Analytics on AWS. KD Singh, Solutions Architect, Amazon Web Services. ©2015, Amazon Web Services, Inc. or its affiliates. All rights reserved.
  • 2. What is big data? When your data sets become so large that you have to start innovating around how to collect, store, organize, analyze, and share it. Velocity: rate of data flow in. Latency: high or low. Volume: high or low. Variety: diversity of source data. Item Size: KB or MB. Request Rate: access patterns. Change Rate: how much is the data changing? Processing Requirements: how much computation? Durability: preservation of source data? Availability: tolerance for downtime? Growth Rate: rate of data growth? Views: the diversity of consumers?
  • 3. Plethora of tools: Amazon Glacier, Amazon S3, Amazon DynamoDB, Amazon RDS, Amazon EMR, Amazon Redshift, AWS Data Pipeline, Amazon Kinesis, Cassandra, Amazon CloudSearch.
  • 4. Simplify data analytics flow. Stages: Ingest, Store, Analyze, Visualize. Diagram labels: Data, Answers, Time. Multiple stages; storage decoupled from processing.
  • 5. Stages: Ingest, Store, Process/Analyze, Visualize. Diagram labels: Amazon Glacier, S3, DynamoDB, RDS, Amazon Kinesis, Spark Streaming, EMR, Data Pipeline, Storm, Kafka, Amazon Redshift, Cassandra, Amazon CloudSearch, Amazon Kinesis Connector, Kinesis-enabled app, App Server, Web Server, Devices.
  • 6. AWS big data portfolio. Collect / Ingest: Amazon Kinesis. Process / Analyze: Amazon EMR, Amazon EC2, Amazon Redshift, AWS Data Pipeline. Visualize / Report. Store: Amazon Glacier, Amazon S3, Amazon DynamoDB, Amazon RDS. Also shown: AWS Import/Export, AWS Direct Connect, Amazon SQS.
  • 7. Industries using AWS for data analysis: Mobile / Cable Telecom; Oil and Gas; Industrial Manufacturing; Retail/Consumer; Entertainment; Hospitality; Life Sciences; Scientific Exploration; Financial Services; Publishing; Media; Advertising; Online Media; Social Network; Gaming.
  • 8. Ingest: the act of collecting and storing data.
  • 9. Types of data ingest. Transactional: database reads/writes. File: media files; log files. Stream: click-stream logs (sets of events). Diagram labels: Database, Cloud Storage, Stream Storage; Logging, Frameworks, Devices, Apps.
  • 10. Amazon Kinesis: real-time processing of streaming data; high throughput; elastic; easy to use; connectors for EMR, S3, Amazon Redshift, DynamoDB.
  • 11. Sending and reading data from Amazon Kinesis streams. Sending (write): HTTP POST, AWS SDK, Log4j, Flume, Fluentd. Reading (read): Get* APIs, Kinesis Client Library + Connector Library, Apache Storm, Amazon Elastic MapReduce.
  • 12. AWS Partners for data ingest, load, and transformation. Labels: Hparser, Big Data Edition; Flume, Sqoop.
  • 13. Storage
  • 14. Cloud database and storage tier anti-pattern. Diagram labels: Client Tier, App/Web Tier, Database & Storage Tier.
  • 15. Cloud database and storage tier: use the right tool for the job! Diagram labels: Client Tier, App/Web Tier, Data Tier. Database & Storage Tier: Search, Hadoop/HDFS, Cache, Blob Store, SQL, NoSQL.
  • 16. Cloud database and storage tier: use the right tool for the job! Database & Storage Tier: Amazon RDS, Amazon DynamoDB, Amazon ElastiCache, Amazon S3, Amazon Glacier, Amazon CloudSearch, HDFS on Amazon EMR.
  • 17. Amazon S3: store anything; object storage; scalable; designed for 99.999999999% durability.
  • 18. Aggregate all data in S3, surrounded by a collection of the right tools: Amazon EMR, Amazon Kinesis, Amazon Redshift, Amazon DynamoDB, Amazon RDS, AWS Data Pipeline, Spark Streaming, Cassandra, Storm. Amazon S3: no limit on the number of objects; object size up to 5 TB; central data storage for all systems; high bandwidth; 99.999999999% durability; versioning; lifecycle policies; Amazon Glacier integration.
  • 19. Amazon DynamoDB: fully managed NoSQL database service; built on solid-state drives (SSDs); consistent low-latency performance; any throughput rate; no storage limits.
  • 20. DynamoDB: managed high availability and durability. Scaling without downtime; automatic sharding; security inspections, patches, upgrades; automatic hardware failover; Multi-AZ replication; hardware configuration designed specifically for DynamoDB; performance tuning.
  • 21. Amazon RDS: relational databases; fully managed, zero admin; MySQL, PostgreSQL, Oracle, SQL Server, Aurora.
  • 22. Process and analyze
  • 23. Processing frameworks. Batch processing: take a large amount (>100 TB) of cold data and ask questions; takes minutes or hours to get answers back; example: generating hourly, daily, weekly reports. Stream processing (real-time): take a small amount of hot data and ask questions; takes a short amount of time to get your answer back; example: 1-minute metrics.
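The "1-minute metrics" example on slide 23 can be illustrated with a small tumbling-window counter. This is only a sketch under stated assumptions: the event shape (`{ timestamp }` in milliseconds) and the function name `countPerMinute` are hypothetical and do not come from the deck, and a real stream processor would consume records continuously rather than from an in-memory array.

```javascript
// Count events per one-minute tumbling window.
// Assumption: each event looks like { timestamp: <milliseconds since epoch> }.
function countPerMinute(events) {
  const counts = new Map();
  for (const event of events) {
    // Truncate the timestamp to the start of its minute.
    const windowStart = Math.floor(event.timestamp / 60000) * 60000;
    counts.set(windowStart, (counts.get(windowStart) || 0) + 1);
  }
  // Return [windowStartMs, count] pairs in time order.
  return [...counts.entries()].sort((a, b) => a[0] - b[0]);
}

// Made-up sample events for illustration.
const sample = [
  { timestamp: 0 },
  { timestamp: 59999 },
  { timestamp: 60000 },
  { timestamp: 185000 },
];
console.log(countPerMinute(sample));
```

The same grouping run once over a day of stored events would be the batch variant; the difference the slide draws is in data volume and how long the answer takes, not in the aggregation itself.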
  • 24. Processing frameworks. Batch processing/analytics: Amazon Redshift; Amazon EMR (Hadoop); Spark, Hive/Tez, Pig, Impala, Presto, …. Stream processing: Amazon Kinesis client and connector library; Spark Streaming; Storm (+Trident). Side labels: MPP, Hadoop, Stream Processing.
  • 25. Amazon Redshift: columnar data warehouse; ANSI SQL compatible; massively parallel; petabyte scale; fully managed; very cost-effective.
  • 26. Amazon Redshift architecture. Leader node: SQL endpoint; stores metadata; coordinates query execution. Compute nodes: local, columnar storage; execute queries in parallel; load, backup, restore via Amazon S3; parallel load from Amazon DynamoDB. Hardware optimized for data processing. Two hardware platforms: DS2 (dense storage), HDD, scales to 1.6 PB; DC1 (dense compute), SSD, scales to 256 TB. Diagram labels: 10 GigE (HPC), Ingestion, Backup, Restore, JDBC/ODBC.
  • 27. Amazon Elastic MapReduce: Hadoop/HDFS clusters; Hive, Pig, Impala, HBase; easy to use, fully managed; on-demand and spot pricing; tight integration with S3, DynamoDB, and Amazon Kinesis.
  • 28. How does EMR work? 1. Put the data into S3. 2. Choose: Hadoop distribution, number of nodes, types of nodes, Hadoop apps like Hive/Pig/HBase. 3. Launch the cluster using the EMR console, CLI, SDK, or APIs. 4. Get the output from S3.
  • 29. The Hadoop ecosystem works with EMR
  • 30. Partners: advanced analytics
  • 31. Visualize
  • 32. AWS Partners for BI & data visualization
  • 33. Putting it all together
  • 35. Demo. Diagram labels: Logs; Traffic Statistics; Amazon EMR as ETL Grid and Analysis; Amazon Redshift (Production DWH); Visualization.
  • 36. Demo
  • 37. ICAO and Hadoop. Marco Merens, Chief (Acting), Integrated Analysis, International Civil Aviation Organization.
  • 38. ICAO in the cloud
  • 39. Cloudability principles at ICAO. 1. What comes from the cloud can stay in the cloud. 2. What comes from in-house (A) should stay in-house if private, or (B) can be synced with the cloud if public.
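The two principles on slide 39 amount to a small decision rule. The sketch below encodes them as a function; the name `placement`, the argument values, and the returned strings are hypothetical labels chosen for illustration, not ICAO's actual tooling.

```javascript
// Hypothetical encoding of the cloudability principles.
// origin: "cloud" | "in-house"; visibility: "public" | "private"
function placement(origin, visibility) {
  if (origin === "cloud") {
    return "cloud"; // Principle 1: what comes from the cloud can stay there.
  }
  if (origin !== "in-house") {
    throw new Error("unknown origin: " + origin);
  }
  if (visibility === "private") {
    return "in-house"; // Principle 2A: private in-house data stays in-house.
  }
  return "in-house, synced to cloud"; // Principle 2B: public data can be synced.
}

console.log(placement("in-house", "public"));
```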
  • 40. Data sync. Diagram labels: Data; Basic UI (Create, Read, Update, Delete); sync; Data; Fancy UI (Read, Metrics).
  • 41. EMR example: blended accident list. Steps: Collect, Map, Reduce, Publish. Labels: Key, Priority.
  • 42. Input format
    XML:
      <?xml version="1.0" encoding="utf-8"?>
      <root>
      <ADREP><FilingInformation State="XX"><ReportingOrganization>Ascend</ReportingOrganization>
      <StateFileNumber>S1982045</StateFileNumber>
      <Headline>MU-2, Collision with high ground, (near) Kelowna</Headline>
      </FilingInformation> ….
      </root>
    CSV:
      |26/12/2001|Germany|Germany|"ICE:Icing"|Accident|Fatal|8|Germany|Bremerhaven|D-IAAI|"BRITTEN NORMAN"||||"2 251 to 5 700 Kg"|Scheduled|Airplane|Take-off|| …
  • 43. Collect
      #!/bin/sh
      # Strip newlines and carriage returns, then start a new line at each <Accident> element
      wget "http://somexml" -qO- | tr -d "\n" | tr -d "\r" | sed "s#<Accident>#\n<Accident>#g" > tmp
      # Upload the result to S3 (the AWS CLI copy command is "aws s3 cp")
      aws s3 cp tmp s3://accidents/input/source1
      …….
      Amazon S3; Amazon EC2. Use Linux crontab to schedule. Make one XML element per line for EMR.
  • 44. EMR command line
      elastic-mapreduce --create --bootstrap-action s3://elasticmapreduce/samples/node/install-node-bin-x86.sh --instance-type m1.small --instance-count 3 --json job.json --put /home/ec2-user/key/newtest.pem --to /home/hadoop --enable-debugging
      Put the SSH key on the Hadoop node if you need a remote shell.
  • 45. EMR json config file
      [{
        "Name": "Make accident map",
        "ActionOnFailure": "CANCEL_AND_WAIT",
        "HadoopJarStep": {
          "Jar": "/home/hadoop/contrib/streaming/hadoop-streaming.jar",
          "Args": [ "-input", "s3://accidentstats/input/*", … ]
        }
      },{
        "Name": "Store in mongo",
        "ActionOnFailure": "CANCEL_AND_WAIT",
        "HadoopJarStep": {
          "Jar": "s3://elasticmapreduce/libs/script-runner/script-runner.jar",
          "Args": [ "s3://edmscripts/uploadtomongo.sh", "accidentstats/output", "NEWACCIDENTLIST" ]
        }
      }]
      Move the results from S3 to somewhere else.
  • 46. Map sourceX
      #!/usr/bin/env node
      function treatline(line) {
        // indexOf returns -1 when the tag is absent, so compare explicitly
        if (line.indexOf("<ADREP>") !== -1) { source1(line) }
        ……
      function source1(line) {
        var data = xml2json(line)
        data.records.forEach(function (v) {
          var el = {
            Date: v.Date,
            Registration: v.Registration,
            Model: v.Model,
            Source: "Source1",
            Priority: 1
          }
          var key = el.Date + "#" + el.Registration
          // Emit one tab-separated key/value pair per line for Hadoop streaming
          process.stdout.write(key + "\t" + JSON.stringify(el) + "\n")
        })
      }
      mapped Amazon Elastic MapReduce Amazon S3
  • 47. Reduce (input: mapped and sorted)
      #!/usr/bin/env node
      var oldkey, key, array = []
      function treatline(line) {
        key = line.split("\t")[0]
        var data = JSON.parse(line.split("\t")[1])
        if ((key == oldkey) || !oldkey) { array.push(data) }
        else {
          treat(array)
          array = [data] // start the next group with the current record
        }
        oldkey = key
        ……}
      function treat(array) {
        var el = {}
        array = array.sort(prioritysort)
        array.forEach(function (v) {
          el = updateresult(el, v)
        })
        process.stdout.write(JSON.stringify(el) + "\n")
      }
      Reduced Amazon Elastic MapReduce Amazon S3
  • 48. Real-time statistics Amazon Elastic MapReduce
  • 49. Thank You. This presentation will be loaded to SlideShare the week following the Symposium. http://www.slideshare.net/AmazonWebServices
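The map and reduce steps on slides 46 and 47 can be tried without a cluster. The following is a minimal sketch, not the presenter's code: it runs both steps as plain Node.js functions, with an in-memory sort standing in for Hadoop's shuffle. The helper names (`mapRecord`, `reduceLines`, `blend`), the merge rule (the lowest Priority number wins and other sources only fill in missing fields), and the sample records are assumptions made for illustration.

```javascript
// Map: emit "key<TAB>json" for one record coming from a given source.
function mapRecord(record, source, priority) {
  const el = {
    Date: record.Date,
    Registration: record.Registration,
    Model: record.Model,
    Source: source,
    Priority: priority,
  };
  const key = el.Date + "#" + el.Registration;
  return key + "\t" + JSON.stringify(el);
}

// Hypothetical merge rule: the record with the lowest Priority number wins;
// later records only fill in fields that are still missing or empty.
function blend(group) {
  const ordered = [...group].sort((a, b) => a.Priority - b.Priority);
  const el = {};
  for (const rec of ordered) {
    for (const [field, value] of Object.entries(rec)) {
      if (el[field] === undefined && value !== undefined && value !== "") {
        el[field] = value;
      }
    }
  }
  return el;
}

// Reduce: lines arrive sorted by key; merge each run of equal keys.
function reduceLines(sortedLines) {
  const results = [];
  let oldKey = null;
  let group = [];
  const flush = () => {
    if (group.length > 0) results.push(blend(group));
    group = [];
  };
  for (const line of sortedLines) {
    const tab = line.indexOf("\t");
    const key = line.slice(0, tab);
    const data = JSON.parse(line.slice(tab + 1));
    if (oldKey !== null && key !== oldKey) flush();
    group.push(data);
    oldKey = key;
  }
  flush(); // the last group has no following key to trigger it
  return results;
}

// Illustrative input: two sources report the same accident, one reports another.
const mapped = [
  mapRecord({ Date: "26/12/2001", Registration: "D-IAAI", Model: "" }, "Source2", 2),
  mapRecord({ Date: "26/12/2001", Registration: "D-IAAI", Model: "BRITTEN NORMAN" }, "Source1", 1),
  mapRecord({ Date: "01/01/2002", Registration: "X-TEST", Model: "MU-2" }, "Source1", 1),
].sort(); // stands in for Hadoop's shuffle-and-sort phase

const blended = reduceLines(mapped);
console.log(blended);
```

Grouping on the `Date#Registration` key is what lets reports of the same accident from different sources meet in one reducer call. The final `flush()` matters because the last group never sees a key change.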

Editor's notes

  1. The world is producing an ever-increasing volume, velocity, and variety of big data. Consumers and businesses are demanding up-to-the-second (or even millisecond) analytics on their fast-moving data, in addition to classic batch processing. AWS delivers many technologies for solving big data problems. But what services should you use, why, when, and how? In this session, we simplify big data processing as a data bus comprising various stages, and we dig deeper into AWS services for these different stages.
  2. AWS Big Data Portfolio. Customers can of course use compute, storage, and networking building blocks plus open source tools. But managed services take care of the undifferentiated heavy lifting of setting up, patching, and scaling, allowing you to focus on the mission. For visualization we rely on our partners, who are really good at it and whose tools our customers are already using.
  3. Amazon Kinesis is a fully managed service for real-time data processing over large, distributed data streams. Amazon Kinesis can continuously capture and store terabytes of data per hour from hundreds of thousands of sources such as website clickstreams, financial transactions, social media feeds, IT logs, sensor data, IoT, and location-tracking events. With the Amazon Kinesis Client Library (KCL), you can build Amazon Kinesis Applications and use streaming data to power real-time dashboards, generate alerts, implement dynamic pricing and advertising, and more. You can also emit data from Amazon Kinesis to other AWS services such as Amazon Simple Storage Service (Amazon S3), Amazon Redshift, Amazon Elastic MapReduce (Amazon EMR), and AWS Lambda.
  4. A shard is the base throughput unit of an Amazon Kinesis stream. One shard provides a capacity of 1MB/sec data input and 2MB/sec data output. One shard can support up to 1000 PUT records per second. You will specify the number of shards needed when you create a stream. Amazon Kinesis Client Library (KCL) is a pre-built library that helps you easily build Amazon Kinesis Applications for reading and processing data from an Amazon Kinesis stream. It handles complex issues such as adapting to changes in stream volume, load-balancing streaming data, coordinating distributed services, and processing data with fault-tolerance. Amazon Kinesis Connector Library is a pre-built library that helps you easily integrate Amazon Kinesis with other AWS services and third-party tools. The current version of this library provides connectors to Amazon DynamoDB, Amazon Redshift, Amazon S3, and Elasticsearch. The library also includes sample connectors of each type, plus Apache Ant build files for running the samples.
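The per-shard figures in this note (1 MB/sec in, 2 MB/sec out, 1,000 PUT records per second) are enough for a rough sizing calculation. The sketch below is illustrative only: the function name `estimateShards`, the idea that every consumer reads the full stream, and the use of 1 MB = 1024 KB are assumptions, not part of the talk.

```javascript
// Per-shard limits as quoted in the note above.
const SHARD_WRITE_MB_PER_SEC = 1;
const SHARD_READ_MB_PER_SEC = 2;
const SHARD_PUT_RECORDS_PER_SEC = 1000;

// Estimate how many shards a stream needs.
// Assumption: each consumer application reads the entire stream.
function estimateShards({ recordsPerSec, avgRecordKB, consumers }) {
  const writeMBps = (recordsPerSec * avgRecordKB) / 1024;
  const readMBps = writeMBps * consumers;
  return Math.max(
    1,
    Math.ceil(writeMBps / SHARD_WRITE_MB_PER_SEC), // ingest bandwidth
    Math.ceil(readMBps / SHARD_READ_MB_PER_SEC), // egress bandwidth
    Math.ceil(recordsPerSec / SHARD_PUT_RECORDS_PER_SEC) // record rate
  );
}

const shards = estimateShards({ recordsPerSec: 5000, avgRecordKB: 2, consumers: 3 });
console.log(shards); // 15: read bandwidth is the binding limit here
```

In this example the write side needs 10 shards and the record rate needs 5, but three consumers reading the full stream push the requirement to 15, so the number of readers can matter as much as the ingest rate.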
  5. Data structure; query complexity; data characteristics: hot, warm, cold. 2 x 2 matrix: structured; level of query (from none to complex).
  6. Amazon S3 is an object storage service that is highly scalable, reliable, low-latency, and low-cost. It is designed for eleven 9s of durability. Amazon S3 stores data as objects within resources called "buckets." You can store as many objects as you want within a bucket, and write, read, and delete objects in your bucket. Objects can be up to 5 terabytes in size. You can control access to the bucket (who can create, delete, and retrieve objects in the bucket, for example), view access logs for the bucket and its objects, and choose the AWS region where a bucket is stored to optimize for latency, minimize costs, or address regulatory requirements.
  7. Integrates well with other AWS services and with many tools from ISVs and the open source community. Acts as the data lake in a large majority of big data solutions. Features include versioning and lifecycle management, plus Glacier integration for archiving data. It protects your data by offering encryption at rest and in flight, and provides security and access management features for fine-grained control over who can access the data. Several other features are available, including Event Notifications that can be delivered using Amazon SQS or Amazon SNS, or sent directly to AWS Lambda, enabling you to trigger workflows, alerts, or other processing.
  8. Amazon DynamoDB is a fully managed NoSQL database service that provides fast and predictable performance with seamless scalability. Amazon DynamoDB enables customers to offload the administrative burdens of operating and scaling distributed databases to AWS, so they don’t have to worry about hardware provisioning, setup and configuration, replication, software patching, or cluster scaling. DynamoDB supports key-value and document data structures. A key-value store is a database service that provides support for storing, querying and updating collections of objects that are identified using a key and values that contain the actual content being stored. A document store provides support for storing, querying and updating items in a document format such as JSON, XML, and HTML.
  9. Table: A table is a collection of data items, just as a table in a relational database is a collection of rows. Each table can have an unlimited number of data items. Amazon DynamoDB is schema-less, in that the data items in a table need not have the same attributes or even the same number of attributes. Each table must have a primary key. Item: An item is composed of a primary or composite key and a flexible number of attributes. There is no explicit limitation on the number of attributes associated with an individual item, but the aggregate size of an item, including all the attribute names and attribute values, cannot exceed 400 KB. Attribute: Each attribute associated with a data item is composed of an attribute name (e.g. "Color") and a value or set of values (e.g. "Red" or "Red, Yellow, Green"). Individual attributes have no explicit size limit beyond the 400 KB item limit. Amazon DynamoDB supports GET/PUT operations using a user-defined primary key. The primary key is the only required attribute for items in a table, and it uniquely identifies each item. You specify the primary key when you create a table. In addition, DynamoDB provides flexible querying by letting you query on non-primary-key attributes using Global Secondary Indexes and Local Secondary Indexes.
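Because the 400 KB limit counts attribute names as well as values, it can help to estimate an item's size before writing it. The following is a rough sketch under stated assumptions: it only handles string values and arrays of strings, it counts UTF-8 bytes of names and values, and the names `approxItemBytes` and `fitsInItem` are hypothetical. DynamoDB's real accounting for numbers, binary, and nested types differs, so treat the result as an approximation.

```javascript
// Item size limit from the note above (400 KB, names plus values).
const MAX_ITEM_BYTES = 400 * 1024;

// Approximate size: UTF-8 bytes of every attribute name and every value.
// Assumption: values are strings or arrays of strings (sets).
function approxItemBytes(item) {
  let total = 0;
  for (const [name, value] of Object.entries(item)) {
    total += Buffer.byteLength(name, "utf8");
    const values = Array.isArray(value) ? value : [value];
    for (const v of values) {
      total += Buffer.byteLength(String(v), "utf8");
    }
  }
  return total;
}

function fitsInItem(item) {
  return approxItemBytes(item) <= MAX_ITEM_BYTES;
}

// Schema-less items: the two items below carry different attributes.
const small = { ProductId: "P-100", Color: ["Red", "Yellow", "Green"] };
const large = { ProductId: "P-200", Description: "x".repeat(400 * 1024) };

console.log(approxItemBytes(small), fitsInItem(small)); // 33 true
console.log(fitsInItem(large)); // false
```

Since names count toward the total, short attribute names leave more room for data in items that approach the limit.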
  10. Transition statement: RDBMS is still a viable and important component in a big data architecture. Amazon Relational Database Service (Amazon RDS) is a managed service that makes it easy to set up, operate, and scale a relational database in the cloud. Amazon RDS gives you access to the capabilities of a familiar MySQL, Oracle, SQL Server, or PostgreSQL database. This means that the code, applications, and tools you already use today with your existing databases should work seamlessly with Amazon RDS. Amazon RDS automatically patches the database software and backs up your database, storing the backups for a user-defined retention period. For optional Multi-AZ deployments, Amazon RDS also manages synchronous data replication across Availability Zones and automatic failover. Amazon Aurora is a MySQL-compatible relational database engine that combines the speed and availability of high-end commercial databases with the simplicity and cost-effectiveness of open source databases.
  11. Generally come in two major types: batch and streaming.
  12. Examples
  13. Query speed. Redshift: extremely fast SQL queries. Spark, Impala: extremely fast to fast Hive QL. Hive, Tez: moderately fast to slow Hive QL. Data volume? UDFs? Manageability? http://yahoodevelopers.tumblr.com/post/85930551108/yahoo-betting-on-apache-hive-tez-and-yarn https://amplab.cs.berkeley.edu/benchmark/
  14. Add connector. Directed acyclic graphs? Exactly-once processing and DAG: how do you do this? https://storm.apache.org/documentation/Rationale.html http://www.slideshare.net/ptgoetz/apache-storm-vs-spark-streaming
  15. Amazon Redshift is a fast and powerful, fully managed, petabyte-scale data warehouse service in the cloud. Customers can start small for just $0.25 per hour with no commitments or upfront costs and scale to a petabyte or more for $1,000 per terabyte per year, less than a tenth of most other data warehousing solutions. Traditional data warehouses require significant time and resource to administer, especially for large datasets. And they are costly. Amazon Redshift not only significantly lowers the cost of a data warehouse, but also makes it easy to analyze large amounts of data very quickly. Amazon Redshift uses a variety of innovations to achieve up to ten times higher performance than traditional databases for data warehousing and analytics workloads: Columnar Data Storage: Instead of storing data as a series of rows, Amazon Redshift organizes the data by column. Unlike row-based systems, which are ideal for transaction processing, column-based systems are ideal for data warehousing and analytics, where queries often involve aggregates performed over large data sets. Since only the columns involved in the queries are processed and columnar data is stored sequentially on the storage media, column-based systems require far fewer I/Os, greatly improving query performance. Advanced Compression: Columnar data stores can be compressed much more than row-based data stores because similar data is stored sequentially on disk. Amazon Redshift employs multiple compression techniques and can often achieve significant compression relative to traditional relational data stores. In addition, Amazon Redshift doesn't require indexes or materialized views and so uses less space than traditional relational database systems. When loading data into an empty table, Amazon Redshift automatically samples your data and selects the most appropriate compression scheme. Massively Parallel Processing (MPP): Amazon Redshift automatically distributes data and query load across all nodes. 
Amazon Redshift makes it easy to add nodes to your data warehouse and enables you to maintain fast query performance as your data warehouse grows.
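The columnar storage and compression points above can be illustrated on a toy table. This is a sketch of the general idea only, not of Amazon Redshift internals: the table, the helper names `toColumns` and `runLengthEncode`, and the choice of run-length encoding as the example compression scheme are all assumptions.

```javascript
// A small table in row layout: one object per row.
const rows = [
  { region: "US", product: "A", sales: 10 },
  { region: "US", product: "B", sales: 20 },
  { region: "US", product: "A", sales: 5 },
  { region: "EU", product: "A", sales: 7 },
];

// Pivot to column layout: one array per column, values stored together.
function toColumns(table) {
  const columns = {};
  for (const row of table) {
    for (const [name, value] of Object.entries(row)) {
      if (!columns[name]) columns[name] = [];
      columns[name].push(value);
    }
  }
  return columns;
}

// Run-length encoding: collapse consecutive repeats into (value, count) pairs.
function runLengthEncode(values) {
  const runs = [];
  for (const v of values) {
    const last = runs[runs.length - 1];
    if (last && last.value === v) last.count += 1;
    else runs.push({ value: v, count: 1 });
  }
  return runs;
}

const columns = toColumns(rows);

// An aggregate touches only the one column it needs.
const totalSales = columns.sales.reduce((sum, v) => sum + v, 0);

// Similar values sit next to each other, so the column compresses well.
const regionRuns = runLengthEncode(columns.region);

console.log(totalSales); // 42
console.log(regionRuns); // [ { value: 'US', count: 3 }, { value: 'EU', count: 1 } ]
```

The sum reads one array instead of every field of every row, which is the I/O saving the note describes, and the four region values shrink to two runs because equal values are adjacent.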
  16. Amazon Redshift gives you fast querying capabilities over structured data using familiar SQL-based clients and business intelligence (BI) tools over standard ODBC and JDBC connections. Queries are distributed and parallelized across multiple physical resources. It is easy to scale. Amazon Redshift automatically patches and backs up your data warehouse, storing the backups for a user-defined retention period. You can create a cluster using either Dense Storage (DS) nodes or Dense Compute (DC) nodes. Dense Storage nodes allow you to create very large data warehouses using hard disk drives (HDDs) for a very low price point. Dense Compute nodes allow you to create very high performance data warehouses using fast CPUs, large amounts of RAM, and solid-state disks (SSDs). An Amazon Redshift data warehouse cluster can contain from 1 to 128 compute nodes, depending on the node type.
  17. Amazon EMR enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data for applications such as web indexing, data mining, log file analysis, machine learning, financial analysis, scientific simulation, and bioinformatics research. Amazon EMR uses Apache Hadoop as its distributed data processing engine. Hadoop is an open source, Java software framework that supports data-intensive distributed applications running on large clusters of commodity hardware. Hadoop implements a programming model named "MapReduce," where the data is divided into many small fragments of work, each of which may be executed on any node in the cluster. EMR uses EC2 and S3 to set up Hadoop clusters. Regular Hadoop/HDFS. Support for popular add-ons. Fully managed and easy to use. On-demand and Spot pricing. Integrated with other AWS services: S3, DDB, Kinesis. Bootstrap capabilities have the most flexibility at the layer above core Hadoop/HDFS.
  18. Popular pattern: 1. Customer puts data into S3. 2. Make some decisions about what to run (type, number, and other technologies to install). 3. Use the CLI, SDK, Console, or API to launch. 4. Output is sent to S3. Easy to resize the cluster. Use Spot instances to save money.
  19. Time to resize is going to be a combination of EC2/AMI boot time + the bootstrap options.
  20. Task nodes: additional nodes on a running cluster that are Spot. S3DistCp to load/unload from HDFS. Shut down the cluster (stop being charged except
  21. Core Hadoop is: MapReduce (computational model) and HDFS (Hadoop Distributed File System). Additional tools have entered the ecosystem: tools to help get data into Hadoop, tools to connect to relational systems, monitoring, and machine learning. This slide is a small slice.
  22. Scientific, algorithmic, predictive, etc
  23. Real-time / stream processing: Kinesis and DynamoDB (first two boxes in the first row). Batch processing: last two boxes, HDFS and S3. This is a summary of all six design patterns together. It summarizes all of the solutions available in the context of the temperature of the data and the data processing latency requirements. Hive: 1 year's worth of clickstream data. Spark: 1 year of clickstream data, for example what people frequently buy together. Redshift: reporting, enterprise reporting tools, SQL-heavy. Impala: same as Redshift. Presto: same league as Impala; interactive SQL analytics for those who have a Hadoop installed base…. NoSQL: analytics on NoSQL.
  24. Contains several months of hourly pageview statistics for all articles in Wikipedia. Data is copied into EMR from S3. EMR does the data transformation using Hive. The raw data doesn't have a date and time, so Hive processes the file name to get that information.
  25. Contains several months of hourly pageview statistics for all articles in Wikipedia. Data is copied into EMR from S3. EMR does the data transformation using Hive. The raw data doesn't have a date and time, so Hive processes the file name to get that information.