34. Example Continued

select impressions.adId as adId,
       count(distinct clickId) / count(1) as clickthrough
from impressions
left outer join clicks
  on impressions.impressionId = clicks.impressionId
group by impressions.adId ;

impressions: impression_id, user_id, ad_id, …
             i-ABABABAB,    u-ABABA, a-ABABABA, …
clicks:      impression_id, click_id, …
             i-ABABABA,     c-ABABA, …
38. Declare the Impressions Table

ADD JAR ${SAMPLE}/libs/jsonserde.jar ;

CREATE EXTERNAL TABLE impressions (
  requestBeginTime string, adId string, impressionId string,
  referrer string, userAgent string, userCookie string, ip string
)
PARTITIONED BY (dt string)
ROW FORMAT SERDE 'com.amazon.elasticmapreduce.JsonSerde'
WITH SERDEPROPERTIES (
  'paths'='requestBeginTime, adId, impressionId, referrer, userAgent, userCookie, ip'
)
LOCATION '${SAMPLE}/tables/impressions' ;

ALTER TABLE impressions ADD PARTITION (dt='2009-04-13-08-05') ;
39. Declare Clicks Table

CREATE EXTERNAL TABLE clicks (
  impressionId string, clickId string
)
PARTITIONED BY (dt string)
ROW FORMAT SERDE 'com.amazon.elasticmapreduce.JsonSerde'
WITH SERDEPROPERTIES (
  'paths'='impressionId, number'
)
LOCATION '${SAMPLE}/tables/clicks' ;

ALTER TABLE clicks ADD PARTITION (dt='2009-04-13-08-05') ;
40. Execute Hive Query

INSERT OVERWRITE DIRECTORY "s3://emr-demo/output/clickthough"
SELECT impressions.adId as adId,
       count(distinct clickId) / count(1) as clickthrough
FROM impressions
LEFT OUTER JOIN clicks
  ON impressions.impressionId = clicks.impressionId
GROUP BY impressions.adId
ORDER BY clickthrough DESC ;

Ended Job = job_201006270056_0011
2868 Rows loaded to s3://emr-demo/output/clickthough
46. Accessing the Hadoop UI

ssh -i c:/Users/richcole/emr-demo.pem -ND 8157 [email_address]

Install FoxyProxy: https://addons.mozilla.org/en-US/firefox/addon/2464/

Leave the Default proxy setting as is, then add a new proxy:
- select SOCKS Proxy, and SOCKS 5
- select localhost and port 8157
- add a whitelist rule for http://*ec2*.amazonaws.com*
- add a whitelist rule for http://*ec2.internal*
Hi, I’m Richard Cole, a software engineer on the Amazon Elastic MapReduce team. I’m going to run through some of the features of Elastic MapReduce. At the end of the talk I’ll give you the URL for these slides so you can download them; that way you don’t need to note down URLs.
Here’s an overview. First I’ll talk a little about what Amazon Elastic MapReduce is. Then I’ll explain how to get set up to use EMR. Next I’ll run through an example of developing a bootstrap action, and then a quick example using Hive. My intention here is to take you through many of the useful features of our service.
We also support Hadoop 0.18.
Now I want to show you briefly how to get started with Elastic MapReduce. I’m going to show you how to sign up for EMR and SimpleDB. You should be able to use your existing AWS account.
Go to aws.amazon.com. This is the main page for Amazon Web Services. Click the orange sign up button on the right.
This is the main page for Amazon Elastic MapReduce. Click the orange sign up button on the right.
This is the main page for Amazon SimpleDB. Click the sign up button on the right. SimpleDB is required for Hadoop debugging.
Next download the Elastic MapReduce command line client. Click the download button.
To install the command line client you need to have Ruby installed. You basically unzip the client into a directory and create a credentials file either there or in your home directory. The credentials file needs to be filled in with some details that we’ll fetch in the next few slides: your AWS credentials, an EC2 key pair, and a log URI, which is where log files from your job flow will be uploaded.
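As a sketch, the credentials file might look something like the following. The values are placeholders, and the exact field names may vary between versions of the command line client, so treat this as illustrative rather than authoritative:

```json
{
  "access_id":     "YOUR-AWS-ACCESS-KEY-ID",
  "private_key":   "YOUR-AWS-SECRET-KEY",
  "keypair":       "emr-demo",
  "key-pair-file": "c:/Users/richcole/emr-demo.pem",
  "log_uri":       "s3n://my-log-bucket/logs/"
}
```

The key pair name and file correspond to the EC2 key pair created in the AWS Management Console a few slides from now; the log bucket is any S3 bucket you own.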
Next we need a copy of the access credentials. Copy your access id and private key into the credentials file.
To create an EC2 key pair we’re going to use the AWS Management Console. Click the orange button on the right.
Click on the EC2 tab. The EC2 key pair is required to SSH to the cluster. Click Create a new Key Pair and save the secret key somewhere safe. Copy the name of the key pair and the location of the key pair file into the credentials.json file.
You don’t need to use the command line client. You can also call the web service from Java. Here’s the AWS SDK for Java. To download it you click the yellow button on the right.
Here’s a recap of what we just did.
A job flow is what we call a Hadoop cluster that is running or that ran at some time. Log files from the cluster are stored in S3 so that they’re accessible after the job flow has shut down.

Typically a job flow runs in batch mode: it executes a series of MapReduce jobs and then terminates. The batch job might, for example, analyse log files over some period of time and produce data in a structured format that is stored in S3.

You might also run a job flow in interactive mode. The typical use case for an interactive job flow is when you’re developing a batch process. Here you might start with a smaller job flow and a small portion of your data, run the Hadoop jobs that are under development, and test the results that you get. Another reason to run an interactive job flow is ad hoc analysis: you might be investigating some aspect of your data, where each query that you run suggests the next query to be run.

You could also choose to run a job flow as an always-on, long-running job flow. In this case you persist data to Amazon S3 so that you can recover in the event of a master failure, but in the normal case you pull data continuously into your data warehouse and run a variety of batch-mode and ad hoc processing on the job flow.
Job flows have steps. A step specifies a jar located in Amazon S3 to be run on the master node. The jar is like a Hadoop job jar: it has a main function, specified either in the manifest of the jar or on the command line, and it can contain lib jars in the same way that a Hadoop job jar does. Typically a step will use the Hadoop APIs to create one or more Hadoop jobs and wait for them to terminate. Steps are executed sequentially. A step jar indicates failure by returning a non-zero value. There is a step property called ActionOnFailure, which says what to do after a step fails. The options are: CONTINUE, which just continues on to the next step, effectively ignoring the error; CANCEL_AND_WAIT, which cancels all following steps; and TERMINATE_JOBFLOW, which terminates the job flow regardless of the KeepJobFlowAliveWhenNoSteps setting. This last property is a property of the job flow; it is used to decide what to do once all the steps have been executed or cancelled. If you want an interactive or long-lived cluster, you need to set this property to true.
Steps only run on the master node; bootstrap actions run on all nodes. They are run after Hadoop is configured but before Hadoop is started, so you can use them to modify the site config to set settings that are not settable on a per-job basis. You can also use bootstrap actions to install additional software on the nodes or to modify the machine configuration; for example, you might want to add more swap space to the nodes. Bootstrap actions run as the hadoop user, but the hadoop user can escalate to root without a password using sudo, so within bootstrap actions you really have complete control over the nodes.
Bootstrap actions are typically scripts located in Amazon S3. They can use Hadoop to download additional software to execute from S3. They indicate failure by returning a non-zero value. If a bootstrap action fails then the node is discarded. Be careful though: if more than 10% of your nodes fail their bootstrap action, the job flow fails.
Next I want to show you an example of developing a bootstrap action. Let’s say that your application requires the MySQL client library for Ruby: you have a streaming job that needs to fetch some parameters from a running Amazon RDS instance. So you want to make a bootstrap action that will install the MySQL client library. First you create an install script; we’re going to use bash, but you could use Ruby, Python, Perl, or whatever is your favorite. The script first does set -e -x to turn on tracing and to make the script fail with a non-zero value if any command in it fails. Next it escalates to root using sudo and installs the library using apt-get. The nodes run Debian/stable, and the tool for installing software under Debian is called apt-get. We’ll put this script in a file and upload it to S3.
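The steps just described can be sketched as follows. The script and package names are my assumptions for this example (libmysql-ruby was the usual Debian package name for the Ruby MySQL bindings at the time); adjust them for your own setup:

```shell
# Write the bootstrap action script described above to a local file.
# The file name install-mysql-ruby.sh is a hypothetical choice.
cat > install-mysql-ruby.sh <<'EOF'
#!/bin/bash
# -e: exit with a non-zero value if any command fails, so EMR sees the
#     bootstrap action as failed and discards the node.
# -x: trace each command into the bootstrap action logs.
set -e -x
# Escalate to root with sudo (passwordless for the hadoop user) and
# install the Ruby MySQL client library with apt-get.
sudo apt-get -y install libmysql-ruby
EOF
chmod +x install-mysql-ruby.sh
```

From here the script is uploaded to a bucket you own in S3, so that a job flow can reference it as a bootstrap action.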
So next let’s run an interactive job flow using the command line client. The --alive option makes the job flow keep running even when all steps are finished; it is important for an interactive job flow. Next we ssh to the master node and copy our script from Amazon S3, where we uploaded it. Then we make the script executable and execute it.
Next we’ll run a job flow, specifying the bootstrap action script on the command line. The script will then be run on all nodes in the job flow and install the Ruby MySQL client for us.
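The invocation might look like the sketch below. The bucket, script path, and job flow name are hypothetical, and the command is only assembled and echoed here rather than executed, since actually running it requires a configured credentials file:

```shell
# Hypothetical S3 location where the bootstrap action script was uploaded.
SCRIPT="s3://my-bucket/bootstrap/install-mysql-ruby.sh"

# Build the command line client invocation: --create starts a new job flow,
# --alive keeps it running after all steps finish, and --bootstrap-action
# runs our script on every node before Hadoop starts.
CMD="elastic-mapreduce --create --alive --name mysql-ruby-example --bootstrap-action $SCRIPT"

# Echo the command instead of running it.
echo "$CMD"
```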
Test on a small subset so you don’t waste lots of money