© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Rahul Bhartia, Ecosystem Solution Architect
15th September 2015
Building your first Big Data application on AWS
Big Data ecosystem on AWS
[Diagram: pipeline stages Collect, Store, Process, Analyze, turning Data into Answers. Groupings shown: Data Collection and Storage; Data Processing; Event Processing; Data Analysis. Services shown: Amazon S3, Amazon Kinesis, Amazon DynamoDB, Amazon RDS (Aurora), AWS Lambda, KCL Apps, Amazon EMR, Amazon Redshift, Amazon Machine Learning]
Your first Big Data application on AWS
?
Big Data ecosystem on AWS - Collect
Big Data ecosystem on AWS - Process
Big Data ecosystem on AWS - Analyze
Setup
Resources
1. AWS Command Line Interface (aws-cli) configured
2. Amazon Kinesis stream with a single shard
3. Amazon S3 bucket to hold the files
4. Amazon EMR cluster (two nodes) with Spark and Hive
5. Amazon Redshift data warehouse cluster (single node)
Amazon Kinesis
Create an Amazon Kinesis stream to hold incoming data:
aws kinesis create-stream \
    --stream-name AccessLogStream \
    --shard-count 1
Amazon S3
Amazon EMR
Amazon Redshift
Your first Big Data application on AWS
1. COLLECT: Stream data into Kinesis with Log4J
2. PROCESS: Process data with EMR using Spark & Hive
3. ANALYZE: Analyze data in Redshift using SQL
STORE
SQL
1. Collect
Amazon Kinesis Log4J Appender
Log file format
Spark
• Fast and general engine for large-scale data processing
• Write applications quickly in Java, Scala or Python
• Combine SQL, streaming, and complex analytics
Using Spark on EMR
Amazon Kinesis and Spark Streaming
[Diagram: a Producer writes to Amazon Kinesis; a Spark-Streaming application on Amazon EMR reads from Kinesis and writes to Amazon S3. Spark-Streaming uses the KCL for Kinesis; DynamoDB also appears in the diagram.]
Spark-streaming - Reading from Kinesis
Spark-streaming – Writing to S3
View the output files in Amazon S3
2. Process
Amazon EMR’s Hive
Adapts a SQL-like (HiveQL) query to run on Hadoop
Schema on read: map table to the input data
Access data in Amazon S3, Amazon DynamoDB, and Amazon Kinesis
Query complex input formats using SerDe
Transform data with User Defined Functions (UDF)
Using Hive on Amazon EMR
Create a table that points to your Amazon S3 bucket
CREATE EXTERNAL TABLE access_log_raw(
  host STRING, identity STRING,
  user STRING, request_time STRING,
  request STRING, status STRING,
  size STRING, referrer STRING,
  agent STRING
)
PARTITIONED BY (year INT, month INT, day INT, hour INT, min INT)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?"
)
LOCATION 's3://YOUR-S3-BUCKET/access-log-raw';

-- register the partitions that already exist under LOCATION
msck repair table access_log_raw;
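To see what the regular expression does to one log line, here is a minimal Python sketch. It uses the conventional Apache access-log pattern for RegexSerDe, written as a Python raw string so the escapes appear once rather than doubled. The sample log line, the `parse_line` helper and the `COLUMNS` tuple are illustrative assumptions, not part of the deck.

```python
import re

# Same pattern as "input.regex", written as a Python raw string.
# One capture group per table column, in column order.
ACCESS_LOG_PATTERN = re.compile(
    r'([^ ]*) ([^ ]*) ([^ ]*) (-|\[[^\]]*\]) '
    r'([^ "]*|"[^"]*") (-|[0-9]*) (-|[0-9]*)'
    r'(?: ([^ "]*|"[^"]*") ([^ "]*|"[^"]*"))?'
)

# Column names copied from the CREATE EXTERNAL TABLE statement.
COLUMNS = ("host", "identity", "user", "request_time", "request",
           "status", "size", "referrer", "agent")


def parse_line(line):
    """Map one log line to a column dict, or None if it does not match."""
    match = ACCESS_LOG_PATTERN.fullmatch(line)
    if match is None:
        return None
    return dict(zip(COLUMNS, match.groups()))


# Hypothetical sample line in Apache combined log format.
sample = ('203.0.113.7 - - [15/Sep/2015:13:04:55 -0700] '
          '"GET /favicon.ico HTTP/1.1" 404 209 "-" "Mozilla/5.0"')
row = parse_line(sample)
print(row["request_time"], row["status"])
```

Note that the captured values keep their surrounding brackets and quotes, which is why the timestamp format used later in the deck starts with `[` and ends with `]`.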
Process data using Hive
We will transform the data returned by the query before writing it to our external Hive table stored in Amazon S3.
Hive User Defined Functions (UDFs) used for the text transformations: from_unixtime, unix_timestamp, and hour.
The “hour” value is important: it is used to split and organize the output files before they are written to Amazon S3. These splits allow us to load the data into Amazon Redshift more efficiently later in the lab using the parallel “COPY” command.
Create an external Hive table in Amazon S3
Configure partition and compression
Query Hive and write output to Amazon S3
-- convert the Apache log timestamp to a UNIX timestamp
-- split files in Amazon S3 by the hour in the log lines
INSERT OVERWRITE TABLE access_log_processed PARTITION (hour)
SELECT
  from_unixtime(unix_timestamp(request_time,
    '[dd/MMM/yyyy:HH:mm:ss Z]')),
  host,
  request,
  status,
  referrer,
  agent,
  hour(from_unixtime(unix_timestamp(request_time,
    '[dd/MMM/yyyy:HH:mm:ss Z]'))) as hour
FROM access_log_raw;
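The timestamp handling in the query can be sketched in Python. The format string is the `strptime` counterpart of `'[dd/MMM/yyyy:HH:mm:ss Z]'`. The helper name and the sample value are assumptions for illustration. The sketch also assumes the Hive session time zone is UTC and an English locale for the month abbreviation; with a different session time zone, Hive's `from_unixtime` would return a different wall-clock value.

```python
from datetime import datetime, timezone

# strptime counterpart of the Hive pattern '[dd/MMM/yyyy:HH:mm:ss Z]'.
APACHE_FORMAT = "[%d/%b/%Y:%H:%M:%S %z]"


def to_timestamp_and_hour(request_time):
    """Return ('yyyy-MM-dd HH:mm:ss', hour) for a bracketed Apache timestamp."""
    parsed = datetime.strptime(request_time, APACHE_FORMAT)
    # Assumption: normalize to UTC, mirroring a UTC Hive session.
    utc = parsed.astimezone(timezone.utc)
    return utc.strftime("%Y-%m-%d %H:%M:%S"), utc.hour


# Hypothetical sample value, including the brackets kept by the SerDe.
ts, hour = to_timestamp_and_hour("[15/Sep/2015:13:04:55 -0700]")
print(ts, hour)
```

The second return value plays the role of the `hour` partition column, so every record from the same hour ends up in the same output split.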
Viewing Job status
http://127.0.0.1:9026
View the output files in Amazon S3
Spark SQL
Spark's module for working with structured data using SQL
Run unmodified Hive queries on existing data.
Using Spark-SQL on Amazon EMR
Query the data with Spark
3. Analyze
Connect to Amazon Redshift
Create an Amazon Redshift table to hold your data
Loading data into Amazon Redshift
“COPY” command loads files in parallel
COPY accesslogs
FROM 's3://YOUR-S3-BUCKET/access-log-processed'
CREDENTIALS 'aws_access_key_id=YOUR-IAM-ACCESS_KEY;aws_secret_access_key=YOUR-IAM-SECRET-KEY'
DELIMITER '\t' IGNOREHEADER 0
MAXERROR 0
GZIP;
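The options on this COPY command describe the file layout it expects: gzip-compressed text, one record per line, fields separated by a tab, and no header row. A minimal Python sketch of that layout follows. The column order and sample rows are assumptions, since the Redshift table definition is not part of this transcript, and both helper names are hypothetical.

```python
import gzip


def write_tsv_gz(rows):
    """Encode rows as gzip-compressed, tab-delimited text without a header."""
    text = "".join("\t".join(fields) + "\n" for fields in rows)
    return gzip.compress(text.encode("utf-8"))


def read_tsv_gz(payload):
    """Decode the same layout back into a list of field lists."""
    text = gzip.decompress(payload).decode("utf-8")
    return [line.split("\t") for line in text.splitlines()]


# Hypothetical rows: request time, host, request, status, referrer, agent.
rows = [
    ["2015-09-15 20:04:55", "203.0.113.7", "GET /favicon.ico HTTP/1.1",
     "404", "-", "Mozilla/5.0"],
    ["2015-09-15 20:05:01", "203.0.113.9", "GET /index.html HTTP/1.1",
     "200", "-", "Mozilla/5.0"],
]
payload = write_tsv_gz(rows)
print(len(payload), read_tsv_gz(payload)[0][3])
```

With one such file per hour under the S3 prefix, COPY has several files to load in parallel rather than a single large one.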
Amazon Redshift test queries
Your first Big Data application on AWS
A favicon would fix 398 of the total 977 PAGE NOT FOUND (404) errors
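The kind of aggregation behind this finding can be sketched in Python: count the 404 responses per request and look at the most frequent ones. The sample records and the `top_not_found` helper are illustrative assumptions. The deck's actual Redshift test queries are not included in this transcript.

```python
from collections import Counter


def top_not_found(records, limit=3):
    """Return the most frequent requests that ended in a 404."""
    counts = Counter(r["request"] for r in records if r["status"] == "404")
    return counts.most_common(limit)


# Hypothetical sample of processed log records.
records = [
    {"request": "GET /favicon.ico HTTP/1.1", "status": "404"},
    {"request": "GET /favicon.ico HTTP/1.1", "status": "404"},
    {"request": "GET /favicon.ico HTTP/1.1", "status": "404"},
    {"request": "GET /missing.html HTTP/1.1", "status": "404"},
    {"request": "GET /index.html HTTP/1.1", "status": "200"},
]
top = top_not_found(records)

# Share of 404 errors attributed to the favicon, using the deck's figures.
favicon_share = 398 / 977
print(top, round(favicon_share * 100, 1))
```

Using the figures from the slide, the missing favicon accounts for roughly 41% of the 404 errors.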
…around the same cost as a cup of coffee
Try it yourself on the AWS Cloud…
Service                 Est. Cost*
Amazon Kinesis          $1.00
Amazon S3 (free tier)   $0
Amazon EMR              $0.44
Amazon Redshift         $1.00
Est. Total              $2.44
*Estimated costs assume: use of the free tier where available, lower-cost instances, a dataset no bigger than 10 MB, and instances running for less than 4 hours. Costs may vary depending on options selected, size of dataset, and usage.
$3.50
Thank you
AWS Big Data blog
blogs.aws.amazon.com/bigdata

AWS September Webinar Series - Building Your First Big Data Application on AWS
