What is Amazon Elastic MapReduce?
Amazon Elastic MapReduce
Getting Started With Amazon Elastic MapReduce http://aws.amazon.com/elasticmapreduce
Overview
What is Amazon Elastic MapReduce?
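The bullet text for this slide was lost in conversion, but the programming model EMR hosts can be sketched locally: Hadoop Streaming's map, shuffle, reduce phases have exactly the shape of a Unix pipeline. A word count as an illustration (this pipeline is not from the deck, just a local analogy):

```shell
# Local sketch of the MapReduce model EMR runs at scale:
# map = tokenize, shuffle = sort, reduce = count per key.
printf 'the quick fox\nthe lazy dog\n' \
  | tr -s ' ' '\n' \
  | sort \
  | uniq -c \
  | sort -rn
```

On a real job flow, Hadoop distributes the same three phases across the cluster's nodes instead of one pipe.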
Getting Started with Elastic MapReduce
Create your AWS Account http://aws.amazon.com
Claim your AWS Credits
Sign up for Amazon Elastic MapReduce http://aws.amazon.com/elasticmapreduce
Sign up for Amazon SimpleDB http://aws.amazon.com/simpledb
Download the Command Line Client http://developer.amazonwebservices.com/connect/entry.jspa?externalID=2264
Install the Command Line Client

cd $HOME
mkdir -p elastic-mapreduce
cd elastic-mapreduce
wget http://elasticmapreduce.s3.amazonaws.com/elastic-mapreduce-ruby.zip
unzip elastic-mapreduce-ruby.zip
export PATH=$PATH:$(pwd)

credentials.json:

{
  "access-id": "1111111111111111111",
  "private-key": "ababababababababababababababaaba",
  "key-pair": "emr-demo",
  "key-pair-file": "/home/richcole/emr-demo.pem",
  "log-uri": "s3://emr-demo/logs"
}
Obtaining your AWS Credentials http://aws.amazon.com/
AWS Management Console http://console.aws.amazon.com/
Create EC2 Keypair
Fill out the credentials.json file

{
  "access-id": "1111111111111111111",
  "private-key": "ababababababababababababababaaba",
  "key-pair": "emr-demo",
  "key-pair-file": "/home/richcole/emr-demo.pem",
  "log-uri": "s3://emr-demo/logs"
}
AWS Java SDK http://aws.amazon.com/sdkforjava/
Recap
Terminology
Job Flow
Job Flow Steps
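The step bullets were lost in conversion. As a sketch, adding a custom-jar step to a running job flow with the command line client looks roughly like this; `--create` and `--ssh` appear elsewhere in this deck, but the `--jobflow`/`--jar`/`--args` flag names here are assumptions about the Ruby client of this era, so check `elastic-mapreduce --help` for your version. The command is printed rather than executed so the shape is visible without a live job flow:

```shell
# Hypothetical sketch: submit a step (a jar in S3) to an existing job flow.
# Printed, not executed -- verify the flag names against your client.
submit_step() {  # usage: submit_step <jobflow-id> <jar-in-s3> <args>
  echo elastic-mapreduce --jobflow "$1" --jar "$2" --args "$3"
}

submit_step j-ABABABABABABA s3://emr-demo/jars/my-job.jar in=s3://emr-demo/input
```

Steps run sequentially on the master node; each step's ActionOnFailure property (see the speaker notes) decides what happens to the remaining steps if it fails.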
Bootstrap Actions
Developing a Bootstrap Action

s3://emr-demo/scripts/install-mysql.sh:

#!/bin/bash
set -e -x
sudo apt-get install -y libmysql-ruby   # -y: non-interactive, so the bootstrap action can't block on a prompt
Test on an Interactive Job Flow

Start a job flow in interactive mode and SSH to the master node:

elastic-mapreduce --create --alive --name "My Development JobFlow" --enable-debugging
elastic-mapreduce --ssh j-ABABABABABABA

Then fetch and run the script by hand:

rm -rf test-tmp && mkdir test-tmp && cd test-tmp
hadoop fs -copyToLocal s3://emr-demo/scripts/install-mysql.sh .
chmod a+x install-mysql.sh
./install-mysql.sh
Testing a Bootstrap Action

elastic-mapreduce --create --alive --name "My Development JobFlow" \
  --enable-debugging \
  --bootstrap-script s3://emr-demo/scripts/install-mysql.sh \
  --bootstrap-name "Install Ruby MySQL"
Predefined Bootstrap Actions
Recap: Bootstrap Actions
Tips and Tricks
Questions?
Hive Example: Outline
Hive Example
Example Continued

SELECT impressions.adId AS adId,
       count(DISTINCT clickId) / count(1) AS clickthrough
FROM impressions
LEFT OUTER JOIN clicks
  ON impressions.impressionId = clicks.impressionId
GROUP BY impressions.adId;

impressions: impression_id, user_id, ad_id, …
             i-ABABABAB, u-ABABA, a-ABABABA, …

clicks:      impression_id, click_id, …
             i-ABABABA, c-ABABA, …
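The join-and-aggregate the Hive query expresses can be sketched locally with awk on toy rows shaped like the ones above; the file names and row values here are made up for illustration:

```shell
# Local sketch of the Hive query: per-ad clicks / impressions via a
# left outer join on impression_id. Toy data, not the sample dataset.
printf 'i-1,u-1,a-1\ni-2,u-2,a-1\ni-3,u-3,a-2\n' > impressions.csv
printf 'i-1,c-1\n' > clicks.csv

awk -F, '
  NR==FNR { clicked[$1] = 1; next }             # pass 1: remember clicked impressions
  { imps[$3]++; if ($1 in clicked) clk[$3]++ }  # pass 2: count per ad_id
  END { for (ad in imps) printf "%s %.2f\n", ad, clk[ad] / imps[ad] }
' clicks.csv impressions.csv | sort
```

Ad a-1 has one click over two impressions (0.50) and a-2 has none (0.00), which is exactly what the Hive query computes per adId over the full dataset.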
Partitioned Tables in Amazon S3

s3://elasticmapreduce/samples/hive-ads/tables/clicks/
  dt=2009-04-14-13-00/
    ec2-93-18-66-22.amazon.com-2009-04-14-13-00.log
    ec2-64-41-91-42.amazon.com-2009-04-14-13-00.log
    ec2-32-38-73-65.amazon.com-2009-04-14-13-00.log
    ec2-15-01-21-88.amazon.com-2009-04-14-13-00.log
  dt=2009-04-14-13-01/
    ec2-93-18-66-22.amazon.com-2009-04-14-13-01.log
    ec2-64-41-91-42.amazon.com-2009-04-14-13-01.log
    ec2-32-38-73-65.amazon.com-2009-04-14-13-01.log
    ec2-15-01-21-88.amazon.com-2009-04-14-13-01.log
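The dt= prefix names encode the partition value, and Hive does not discover them on its own: each prefix needs a matching ADD PARTITION statement (the DDL slides below add one by hand). A small sketch that turns a listing of prefixes into that DDL:

```shell
# Generate ADD PARTITION DDL from dt=... prefix names like those on the slide.
# The listing is inlined here; in practice it would come from an S3 listing tool.
printf 'dt=2009-04-14-13-00/\ndt=2009-04-14-13-01/\n' \
  | sed -e 's|^dt=||' -e 's|/$||' \
  | while read -r dt; do
      echo "ALTER TABLE clicks ADD PARTITION (dt='$dt');"
    done
```

Emitting one statement per prefix this way avoids hand-writing a line for every hour of logs.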
SSH To The Master Node

$ chmod og-rwx $HOME/emr-demo.pem
$ export PATH=$PATH:$HOME/elastic-mapreduce
$ elastic-mapreduce --list --active
j-1FGYJOQRLQ7OH  WAITING  ec2-184-72-141-9.compute-1.amazonaws.com  My Interactive JobFlow
                 COMPLETED  Setup Hadoop Debugging
                 COMPLETED  Setup Hive
$ elastic-mapreduce --ssh --jobflow j-1FGYJOQRLQ7OH
ssh -o StrictHostKeyChecking=no -i /home/... ...
hadoop@ip-10-242-235-81:~$ hive
Start Hive

hive -d SAMPLE=s3://elasticmapreduce/samples/hive-ads \
     -d DAY=2009-04-13 -d HOUR=08 \
     -d NEXT_DAY=2009-04-13 -d NEXT_HOUR=09 \
     -d OUTPUT=s3://mybucket/samples/output
Declare the Impressions Table

ADD JAR ${SAMPLE}/libs/jsonserde.jar;

CREATE EXTERNAL TABLE impressions (
  requestBeginTime string, adId string, impressionId string,
  referrer string, userAgent string, userCookie string, ip string
)
PARTITIONED BY (dt string)
ROW FORMAT SERDE 'com.amazon.elasticmapreduce.JsonSerde'
WITH SERDEPROPERTIES (
  'paths'='requestBeginTime, adId, impressionId, referrer, userAgent, userCookie, ip'
)
LOCATION '${SAMPLE}/tables/impressions';

ALTER TABLE impressions ADD PARTITION (dt='2009-04-13-08-05');
Declare the Clicks Table

CREATE EXTERNAL TABLE clicks (
  impressionId string, clickId string
)
PARTITIONED BY (dt string)
ROW FORMAT SERDE 'com.amazon.elasticmapreduce.JsonSerde'
WITH SERDEPROPERTIES ( 'paths'='impressionId, number' )
LOCATION '${SAMPLE}/tables/clicks';

ALTER TABLE clicks ADD PARTITION (dt='2009-04-13-08-05');
Execute the Hive Query

INSERT OVERWRITE DIRECTORY "s3://emr-demo/output/clickthough"
SELECT impressions.adId AS adId,
       count(DISTINCT clickId) / count(1) AS clickthrough
FROM impressions
LEFT OUTER JOIN clicks
  ON impressions.impressionId = clicks.impressionId
GROUP BY impressions.adId
ORDER BY clickthrough DESC;

Ended Job = job_201006270056_0011
2868 Rows loaded to s3://emr-demo/output/clickthough
Viewing Steps
Viewing Hadoop Jobs
Viewing Tasks
Viewing Task Attempts
Download the Output
Accessing the Hadoop UI

ssh -i c:/Users/richcole/emr-demo.pem -ND 8157 [email_address]

Install FoxyProxy: https://addons.mozilla.org/en-US/firefox/addon/2464/

Leave the default proxy setting as-is and add a new proxy:
- select SOCKS Proxy and SOCKS v5
- set host localhost, port 8157
- add a whitelist rule for http://*ec2*.amazonaws.com*
- add a whitelist rule for http://*ec2.internal*
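The tunnel command above, parameterized by the master node's DNS name from the --list output earlier in the deck; -N runs no remote command and -D opens a dynamic SOCKS forward on the given port, which is what FoxyProxy then points at. Printed rather than executed, since it needs a live job flow:

```shell
# Build the SOCKS tunnel command to a job flow's master node.
# The master DNS name is taken from the --list output shown on an earlier slide.
master=ec2-184-72-141-9.compute-1.amazonaws.com
echo ssh -i "$HOME/emr-demo.pem" -N -D 8157 "hadoop@$master"
```

With the tunnel up and FoxyProxy whitelisting the ec2 hostnames, the browser reaches the Hadoop UI on the master node's internal address.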
The Hadoop UI through FoxyProxy
Viewing Live Task Attempts
Running a Hive Job from the CLI
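The bullets for this slide were lost in conversion. As a sketch, the command line client of this era could also submit a Hive script as a batch job flow rather than running Hive interactively; `--create` and `--name` appear elsewhere in this deck, but `--hive-script` and `--args` are assumptions here, and the script path is hypothetical, so verify both against your client. Printed rather than executed:

```shell
# Hypothetical sketch: submit a Hive script as a batch job flow.
# clickthrough.q is an assumed script path, not one from the deck.
echo elastic-mapreduce --create --name "Hive clickthrough" \
  --hive-script s3://emr-demo/scripts/clickthrough.q \
  --args -d,OUTPUT=s3://emr-demo/output/clickthough
```

Run this way, the script's -d variables play the same role as the -d flags passed to interactive hive on the earlier slide.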
Recap
The End
Starting an Interactive Job Flow
Choose Interactive Session
Select Keypair, Log Path, and Enable Debugging
Proceed with no bootstrap actions
Final Review of Selections

Richard Cole of Amazon Gives Lightning Talk at BigDataCamp


Editor's Notes

  1. We also support Hadoop 0.18.
  2. Hi, I'm Richard Cole, a software engineer on the Amazon Elastic MapReduce team. I'm going to run through some of the features of Elastic MapReduce. At the end of the talk I'll give you the URL to these slides so you can download them; that way you don't need to note down URLs.
  3. Here's an overview. First I'll talk a little about what Amazon Elastic MapReduce is. Then I'll explain how to get set up to use EMR. Next I'll run through an example of developing a bootstrap action, and then a quick example using Hive. My intention is to take you through many of the useful features of our service.
  4. We also support Hadoop 0.18.
  5. Now I want to show you briefly how to get started with Elastic MapReduce. I'm going to show you how to sign up for EMR and SimpleDB.
  6. Go to aws.amazon.com. This is the main page for Amazon Web Services. Click the orange sign up button on the right.
  7. This is the page for Amazon Elastic MapReduce. Click the orange sign up button on the right.
  8. This is the main page for Amazon SimpleDB. Click the sign up button on the right. SimpleDB is required for Hadoop debugging.
  9. Next download the Elastic MapReduce command line client. Click the download button.
  10. To install the command line client you need Ruby installed. You basically unzip the client into a directory and create a credentials file either there or in your home directory. The credentials file needs to be filled in with some details that we'll fetch over the next few slides: your AWS credentials, an EC2 keypair, and a log-uri, which is where log files from your job flow will be uploaded.
  11. Next we need a copy of the access credentials. Copy your access id and private key into the credentials file.
  12. To create an EC2 keypair we go to the AWS Management Console. Click the orange button on the right.
  13. Click on the EC2 tab. The EC2 keypair is required to SSH to the cluster. Click Create a New Key Pair and save the secret key somewhere safe. Copy the name of the key pair and the location of the key pair file into the credentials.json file.
  14. You don’t need to use the command line client. You can also call the web service from Java. Here’s the AWS SDK for Java. To download it you click the yellow button on the right.
  15. Here’s a recap of what we just did.
  16. A job flow is what we call a Hadoop cluster that is running or ran at some time. Log files from the cluster are stored in S3 so that they’re accessible after the job flow has shut down. Typically a job flow runs in batch mode, that is, it executes a series of MapReduce jobs and then terminates. The batch job might analyse log files over some period of time and produce data in a structured format that is stored in S3, for example. You might also run a job flow in interactive mode. The typical use case for an interactive job flow is when you’re developing a batch process. Here you might start with a smaller job flow and a small portion of your data, run the Hadoop jobs that are under development, and test the results that you get. Another reason to run an interactive job flow is ad-hoc analysis: you might, for example, be investigating some aspect of your data, where each query that you run suggests the next query to run. You could also choose to run a job flow as an always-on, long-running job flow. In this case you persist data to Amazon S3 so that you can recover in the event of a master failure, but in the normal case you pull data continuously into your data warehouse and run a variety of batch-mode and ad-hoc processing on the job flow.
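The two main modes above can be sketched with the command line client. The flags are from the Ruby `elastic-mapreduce` client of that era, and the bucket and jar names are hypothetical; check `elastic-mapreduce --help` against your version.

```shell
# Batch mode: run one jar step, then let the job flow terminate on its own.
# (bucket, jar, and path names are hypothetical)
elastic-mapreduce --create --name "nightly log analysis" \
  --jar s3://emr-demo/log-analysis.jar \
  --arg s3://emr-demo/input --arg s3://emr-demo/output

# Interactive / long-running mode: keep the cluster alive after steps finish.
elastic-mapreduce --create --alive --name "dev cluster"
```

Both commands print the new job flow id (something like j-ABC123), which you use to address the cluster later.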
  17. Job flows have steps. A step specifies a jar located in Amazon S3 to be run on the master node. The jar is like a Hadoop job jar: it has a main function that is specified either in the manifest of the jar or on the command line, and it can contain lib jars in the same way that a Hadoop job jar does. Typically a step will use the Hadoop APIs to create one or more Hadoop jobs and wait for them to terminate. Steps are executed sequentially. A step jar indicates failure by returning a non-zero value. There is a step property called ActionOnFailure that says what to do after a step fails. The options are: CONTINUE, which just continues on to the next step, effectively ignoring the error; CANCEL_AND_WAIT, which cancels all following steps; and TERMINATE_JOBFLOW, which terminates the job flow regardless of the KeepJobFlowAliveWhenNoSteps setting. This last property is a property of the job flow; it is used to decide what to do once all the steps have been executed or cancelled. If you want an interactive or long-lived cluster then you need to set this property to true.
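Because steps run sequentially, you can also append a step to a job flow that is already running. A sketch, assuming the Ruby client's flag names and a hypothetical jobflow id, bucket, and jar:

```shell
# Add another jar step to an already-running job flow.
# (j-ABC123 and the S3 paths are hypothetical placeholders)
elastic-mapreduce --jobflow j-ABC123 \
  --jar s3://emr-demo/second-pass.jar \
  --arg s3://emr-demo/output --arg s3://emr-demo/summary
```

The new step is queued behind any steps that are still pending and runs when they complete.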
  18. Steps only run on the master node; bootstrap actions run on all nodes. They are run after Hadoop is configured but before Hadoop is started, so you can use them to modify the site config to set settings that are not settable on a per-job basis. You can also use bootstrap actions to install additional software on the nodes or to modify the machine configuration; for example, you might want to add more swap space to the nodes. Bootstrap actions run as the hadoop user, however the hadoop user can escalate to root without a password using sudo, so really within bootstrap actions you have complete control over the nodes.
  19. Bootstrap actions are typically scripts located in Amazon S3. They can use Hadoop to download additional software to execute from S3. They indicate failure by returning a non-zero value. If a bootstrap action fails then the node will be discarded. Be careful though: if more than 10% of your nodes fail their bootstrap action then the job flow will fail.
  20. Next I want to show you an example of developing a bootstrap action. Let’s say your application requires the MySQL client library for Ruby: you have a streaming job and it needs to fetch some parameters from a running Amazon RDS instance. So you want to make a bootstrap action that installs the MySQL client library. First you create an install script; we’re going to use bash, but you could use Ruby, Python, Perl, or whatever your favorite is. The script first does set -e -x to turn on tracing and to make the script fail with a non-zero value if any command in it fails. Next it escalates to root using sudo and installs the library using apt-get. The nodes run Debian stable, and the tool for installing software under Debian is apt-get. We’ll put this script in a file and upload it to S3.
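The script described above might look like the following. The package name libmysql-ruby is an assumption for the Debian release of that era; verify it with apt-cache search on a node. The sketch writes the script to a local file and syntax-checks it before you upload it to S3.

```shell
# Write the bootstrap action script to a local file.
# (package name libmysql-ruby is an assumption; check your Debian release)
cat > install-mysql-ruby.sh <<'EOF'
#!/bin/bash
set -e -x
sudo apt-get -y install libmysql-ruby
EOF
chmod +x install-mysql-ruby.sh

# Syntax-check the script locally before uploading it to S3.
bash -n install-mysql-ruby.sh
```

set -e makes any failing command abort the script with a non-zero exit code, which is how the bootstrap action signals failure to EMR; set -x echoes each command so the trace ends up in the node's logs.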
  21. So next let’s run an interactive job flow using the command line client. The --alive option makes the job flow keep running even when all steps are finished; it is important for an interactive job flow. Next we SSH to the master node and copy our script from Amazon S3 where we uploaded it. Then we make the script executable and execute it.
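That sequence might look like the sketch below. The jobflow id, bucket, and script name are hypothetical, and the flags assume the Ruby command line client.

```shell
# Start an interactive job flow and note the id it prints (e.g. j-ABC123).
elastic-mapreduce --create --alive --name "bootstrap dev"

# SSH to the master node using the key pair from credentials.json.
elastic-mapreduce --ssh --jobflow j-ABC123

# On the master node: fetch the script from S3, then run it by hand.
hadoop fs -copyToLocal s3://emr-demo/install-mysql-ruby.sh .
chmod +x install-mysql-ruby.sh
./install-mysql-ruby.sh
```

Running the script by hand on one node like this is a cheap way to debug it before turning it into a bootstrap action for the whole cluster.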
  22. Next we’ll run a job flow specifying the bootstrap action script on the command line. The script will then be run on all nodes in the job flow and install the Ruby MySQL client for us.
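With the script debugged, a single flag applies it to every node at startup. Again the bucket and script name are hypothetical placeholders:

```shell
# Every node runs the bootstrap action before Hadoop starts.
# (S3 path is a hypothetical placeholder)
elastic-mapreduce --create --alive \
  --name "cluster with mysql client" \
  --bootstrap-action s3://emr-demo/install-mysql-ruby.sh
```

If the script fails on a node, that node is discarded; if more than 10% of nodes fail it, the whole job flow fails, which is why testing it interactively first pays off.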
  23. Test on a small subset of your data first so you don’t waste lots of money.
  24. Logs are delayed by 5 minutes
  25. Log directory must be a bucket that you own