Configuring Your First Hadoop Cluster On EC2
Benjamin Wootton
www.benjaminwootton.co.uk
What Is This Document?
•  A high level tutorial for setting up a small 3 node Hadoop cluster on Amazon's EC2 cloud;

•  It aims to be a complete, simple, and fast guide for people new to Hadoop. Documentation on the Internet is spotty and text dense, so it can be slower to get started than it needs to be;

•  It aims to cover the little stumbling blocks around SSH keys and a few small bug workarounds that can trip you up on the first pass through;

•  It does not aim to cover MapReduce and writing jobs. A small job will be provided at the end to validate the cluster.
Recap - What Is Hadoop?
•  An open source framework for ‘reliable, scalable, distributed computing’;

•  It gives you the ability to process and work with large datasets that are distributed across clusters of commodity hardware;

•  It allows you to parallelize computation and ‘move processing to the data’ using the MapReduce framework.
Recap - What Is Amazon EC2?
•  A ‘cloud’ web host that allows you to dynamically add and remove compute server resources as you need them, allowing you to pay for only the capacity that you need;

•  It is well suited for Hadoop computation – we can bring up enormous clusters within minutes and then spin them down when we’ve finished to reduce costs;

•  EC2 is quick and cost effective for experimental and learning purposes, as well as being proven as a production Hadoop host.
Assumptions & Notes
•  The document assumes basic familiarity with Linux, Java, and SSH;

•  The cluster will be set up manually to demonstrate concepts of Hadoop. In real life, we would typically use a configuration management tool such as Puppet or Chef to manage and automate larger clusters;

•  The configuration shown is not production ready. Real Hadoop clusters need much more bootstrap configuration and security;

•  It assumes that you are running your cluster on Ubuntu Linux, but are accessing the cluster from a Windows host. This is possibly not a sensible assumption, but it’s what I had at the time of writing!
Part 1

EC2 CONFIGURATION
1. Start EC2 Servers
•  Sign up to Amazon Web Services @ http://aws.amazon.com/;

•  Log in and navigate to Amazon EC2. Using the ‘classic wizard’, create three micro instances running the latest 64-bit Ubuntu Server;

•  If you do not already have a key pair .pem file, you will need to create one during the process. We will later use this to connect to the servers and to navigate around within the cluster, so keep it in a safe place.
2. Name EC2 Servers
•  For reference, name the instances Master, Slave 1, and Slave 2 within the EC2 console once they are running;

•  Note down the hostnames for each of the 3 instances in the bottom part of the management console. We will use these to access the servers:
3. Prepare Keys from PEM file
•  We need to break down the Amazon-supplied .pem file into private and public keys in order to access the servers from our local machine;

•  To do this, download PuTTYgen @ http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html

•  Using PuTTYgen, import your PEM file (Conversions > Import Key) and export the public and private keys to a safe place.
4. Configure PuTTY
•  We now need to SSH into our EC2 servers from our local machine. To do this, we will use the PuTTY SSH client for Windows;

•  Begin by configuring PuTTY sessions for each of the three servers and saving them for future convenience;

•  Under Connection > SSH > Auth in the PuTTY tree, point towards the private key that you generated in the previous step.
5. Optional - mRemote
•  I use a tool called mRemote which allows you to embed PuTTY instances into a tabbed browser. Try it @ http://www.mremote.org. I recommend this, as navigating around all of the hosts in your Hadoop cluster can be fiddly for larger manually managed clusters;

•  If you do this, be sure to select the corresponding PuTTY session for each mRemote connection so that the private key is carried through and you can connect:
6. Install Java & Hadoop
•  We need to install Java on the cluster machines in order to run Hadoop. OpenJDK 7 will suffice for this tutorial. Connect to all three machines using PuTTY or mRemote, and on each of the three machines run the following:
        sudo apt-get install openjdk-7-jdk

•  When that’s complete, configure the JAVA_HOME variable by adding the following line at the top of ~/.bashrc:
        export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64/

•  We can now download and unpack Hadoop. On each of the three machines run the following:
        cd ~
        wget http://apache.mirror.rbftpnetworks.com/hadoop/common/hadoop-1.0.3/hadoop-1.0.3-bin.tar.gz
        gzip -d hadoop-1.0.3-bin.tar.gz
        tar -xf hadoop-1.0.3-bin.tar
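
•  As a quick sanity check, assuming the paths used above, confirm on each machine that Java and Hadoop resolve correctly:
        java -version                         # should report OpenJDK 7
        source ~/.bashrc; echo $JAVA_HOME     # should print /usr/lib/jvm/java-7-openjdk-amd64/
        ~/hadoop-1.0.3/bin/hadoop version     # should report Hadoop 1.0.3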
7. Configure SSH Keypairs
•  Hadoop needs to SSH from the master to the slave servers to start and stop processes;

•  All of the Amazon servers will have our generated public key installed in their ~ubuntu/.ssh/authorized_keys file automatically by Amazon. However, we need to put the corresponding private key into the ~ubuntu/.ssh/id_rsa file on the master server so that the master can connect to the slaves;

•  To upload the file, use the file transfer software WinSCP @ http://winscp.net to push the private key into the .ssh folder on the master;

•  Be sure to upload the OpenSSH private key generated from PuTTYgen (Conversions > Export OpenSSH key)!
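
•  It is also worth tightening the permissions on the uploaded key and testing a manual hop from the master to a slave, substituting your own slave hostname:
        chmod 600 ~/.ssh/id_rsa
        ssh ubuntu@ec2-23-20-53-36.compute-1.amazonaws.com hostname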
8. Passwordless SSH
•  It is better if Hadoop can move between boxes without requiring the passphrase to your key file;

•  To do this, we can load the key into the SSH agent by using the following commands on the master server. This will avoid the need to enter the passphrase repeatedly when stopping and starting the cluster:
    ◦   ssh-agent bash
    ◦   ssh-add
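
•  To confirm that the key is loaded, list the agent’s identities; a subsequent SSH to a slave should no longer prompt for the passphrase:
        ssh-add -l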
9. Open Firewall Ports
•  We need to open a number of ports to allow the Hadoop cluster to communicate and to expose its various web interfaces to us. Do this by adding inbound rules to the default security group in the AWS EC2 management console. Open ports 9000, 9001, and 50000-50100.
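
•  If you prefer the command line and have the AWS CLI installed and configured, a rough equivalent of those inbound rules looks like this (the console route above works just as well):
        # Open wide for a short-lived test cluster; restrict the source range for anything real
        aws ec2 authorize-security-group-ingress --group-name default --protocol tcp --port 9000 --cidr 0.0.0.0/0
        aws ec2 authorize-security-group-ingress --group-name default --protocol tcp --port 9001 --cidr 0.0.0.0/0
        aws ec2 authorize-security-group-ingress --group-name default --protocol tcp --port 50000-50100 --cidr 0.0.0.0/0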
Part 2

HADOOP CONFIGURATION
1. Decide On Cluster Layout
•  There are four components of Hadoop which we would like to spread out across the cluster:
    ◦ Data nodes – actually store and manage data;
    ◦ Name node – acts as a catalogue service, showing what data is stored where;
    ◦ Job tracker – tracks and manages submitted MapReduce jobs;
    ◦ Task tracker – low level worker that is issued tasks by the job tracker.

•  Let’s go with the following setup. This is fairly typical: data nodes and task trackers spread across the cluster, and a single instance of the name node and job tracker:

       Node       Hostname             Components
       Master     ec2-23-22-133-70     Name Node, Job Tracker
       Slave 1    ec2-23-20-53-36      Data Node, Task Tracker
       Slave 2    ec2-184-73-42-163    Data Node, Task Tracker
2a. Configure Server Names
•  Log out of all of the machines and log back into the master server;

•  The Hadoop configuration is located here on the server:
         cd /home/ubuntu/hadoop-1.0.3/conf

•  Open the file ‘masters’ and replace the word ‘localhost’ with the hostname of the server that you have allocated as master:
         cd /home/ubuntu/hadoop-1.0.3/conf
         vi masters

•  Open the file ‘slaves’ and replace the word ‘localhost’ with the hostnames of the two slave servers, on two separate lines:
         cd /home/ubuntu/hadoop-1.0.3/conf
         vi slaves
2b. Configure Server Names
•  It should look like the example below, though of course using your own allocated hostnames;

•  Do not use ‘localhost’ in the masters/slaves files as this can lead to non-descriptive errors!
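
•  As a minimal sketch, using the hostnames from the cluster layout table (substitute your own fully qualified hostnames), the two files would contain something like:
        masters:
            ec2-23-22-133-70.compute-1.amazonaws.com
        slaves:
            ec2-23-20-53-36.compute-1.amazonaws.com
            ec2-184-73-42-163.compute-1.amazonaws.com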
3a. Configure HDFS

•  HDFS is the distributed file system that sits behind Hadoop instances, syncing data so that it’s close to the processing and providing redundancy. We should therefore set this up first;

•  We need to specify some mandatory parameters in various XML configuration files to get HDFS up and running;

•  Still on the master server, the first thing we need to do is to set the name of the default file system so that it always points back at master, again using your own fully qualified hostname:

    /home/ubuntu/hadoop-1.0.3/conf/core-site.xml
    <configuration>
        <property>
            <name>fs.default.name</name>
            <value>hdfs://ec2-107-20-118-109.compute-1.amazonaws.com:9000</value>
        </property>
    </configuration>
3b. Configure HDFS
•  Still on the master server, we also need to set the dfs.replication parameter, which says how many nodes each block of data should be replicated to for failover and redundancy purposes:

    /home/ubuntu/hadoop-1.0.3/conf/hdfs-site.xml
    <configuration>
        <property>
            <name>dfs.replication</name>
            <value>1</value>
        </property>
    </configuration>
4. Configure MapReduce
•  As well as the underlying HDFS file system, we have to set one mandatory parameter that will be used by the Hadoop MapReduce framework;

•  Still on the master server, we need to set the job tracker location. As discussed earlier, we will put the job tracker on master in this instance, again being careful to substitute in your own master server hostname:

    /home/ubuntu/hadoop-1.0.3/conf/mapred-site.xml
    <configuration>
        <property>
            <name>mapred.job.tracker</name>
            <value>ec2-107-22-78-136.compute-1.amazonaws.com:54311</value>
        </property>
    </configuration>
5a. Push Configuration To Slaves
•  We need to push out the small amount of mandatory configuration that we have done to all of the slaves. Typically this could sit on a shared drive, but we will do it manually this time using SCP:

    cd /home/ubuntu/hadoop-1.0.3/conf
    scp * ubuntu@ec2-23-20-53-36.compute-1.amazonaws.com:/home/ubuntu/hadoop-1.0.3/conf
    scp * ubuntu@ec2-184-73-42-163.compute-1.amazonaws.com:/home/ubuntu/hadoop-1.0.3/conf

•  By virtue of pushing out the masters and slaves files, the various nodes in this cluster should all be correctly configured and referencing each other at this stage.
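
•  A simple way to confirm that the push worked, assuming the same paths and hostnames as above, is to cat one of the files back from a slave:
        ssh ubuntu@ec2-23-20-53-36.compute-1.amazonaws.com cat /home/ubuntu/hadoop-1.0.3/conf/core-site.xml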
6a. Format HDFS
•  Before we can start Hadoop, we need to format and initialise the underlying distributed file system;

•  To do this, on the master server, execute the following commands:
        cd /home/ubuntu/hadoop-1.0.3/bin
        ./hadoop namenode -format

•  It should only take a minute. The expected output of the format operation is on the next page;

•  The file system will be built and formatted. It will exist in /tmp/hadoop-ubuntu if you would like to browse around. Hadoop will manage this file system, distributing data across the nodes and providing access to it.
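
•  If you would like to see what the format produced, the freshly initialised name node storage is visible on disk, assuming the default /tmp location mentioned above:
        ls -R /tmp/hadoop-ubuntu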
6b. Format HDFS
Part 3

START HADOOP
1a. Start HDFS
•  We will begin by starting the HDFS file system from the master server;

•  There is a script for this which will run the name node on the master and the data nodes on the slaves:
         cd /home/ubuntu/hadoop-1.0.3
         ./bin/start-dfs.sh
1b. Start HDFS
•  At this point, monitor the log files on the master and the slaves. You should see that HDFS cleanly starts on the slaves when we start it on the master:
      cd /home/ubuntu/hadoop-1.0.3
      tail -f logs/hadoop-ubuntu-datanode-ip-10-245-114-186.log

•  If anything appears to have gone wrong, double check that the configuration files above are correct, the firewall ports are open, and everything has been accurately pushed to all slaves.
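
•  Another quick check, assuming the standard Hadoop 1.x daemon names, is to run jps (bundled with the JDK) on each machine. The master should show NameNode and SecondaryNameNode processes, and each slave a DataNode:
        jps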
2. Start MapReduce
•  Once we’ve confirmed that HDFS is up, it is now time to start the MapReduce component of Hadoop;

•  There is a script for this which will run the job tracker on the master and the task trackers on the slaves. Run the following on the master server:
         cd /home/ubuntu/hadoop-1.0.3
         ./bin/start-mapred.sh

•  Again, double check the log files on all servers to check that everything is communicating cleanly before moving further. Double check configuration and firewall ports in the case of issues.
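
•  As before, jps gives a quick confirmation: the master should now also show a JobTracker process, and each slave a TaskTracker:
        jps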
Part 4

EXPLORE HADOOP
2a. Web Interfaces
•  Now that we’re up and running, Hadoop has started a number of web interfaces that give information about the cluster and HDFS. Take a look at these to familiarise yourself with them:

     NameNode       master:50070                   Information about the name node and the health of the distributed file system
     DataNode       slave1:50075, slave2:50075     TBC
     JobTracker     master:50030                   Information about submitted and queued jobs
     TaskTracker    slave1:50060, slave2:50060     Information about tasks that are submitted and queued
2b. Web Interfaces
2c. Web Interfaces
2d. Web Interfaces
Part 5

YOUR FIRST MAPREDUCE JOB
1. Source Dataset
•  It’s now time to submit a processing job to the Hadoop cluster;

•  Though we won’t go into much detail here, for the exercise I used a dataset of UK government spending, which you can bring down onto your master server like so:
      wget http://www.dwp.gov.uk/docs/dwp-payments-april10.csv
2. Push Dataset Into HDFS
•  We need to push the dataset into HDFS so that it’s available to be shared across the nodes for subsequent processing:
    /home/ubuntu/hadoop-1.0.3/bin/hadoop dfs -put dwp-payments-april10.csv dwp-payments-april10.csv

•  After pushing the file, note how the NameNode health page shows the file count and space used increasing after the push.
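
•  You can also list the file directly in HDFS to confirm that it arrived, assuming the default /user/ubuntu home directory used later in this guide:
        /home/ubuntu/hadoop-1.0.3/bin/hadoop dfs -ls /user/ubuntu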
3a. Write A MapReduce Job

•  It’s now time to open our IDE and write the Java code that will represent our MapReduce job. For the purposes of this presentation, we are just looking to validate the Hadoop cluster, so we will not go into detail with regards to MapReduce;

•  My Java project is available @ http://www.benjaminwootton.co.uk/wordcount.zip.

•  As per the sample project, I recommend that you use Maven in order to easily bring in the Hadoop dependencies and build the required JAR. You can manage this part differently if you prefer.
4. Package The MapReduce Job
•  Hadoop jobs are typically packaged as Java JAR files. By virtue of using Maven, we can get the packaged JAR simply by running mvn clean package against the sample project.

•  Upload the JAR onto your master server using WinSCP:
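
•  If you prefer the command line to WinSCP, a sketch of this step looks like the following, assuming the built JAR ends up named firsthadoop.jar as used on the next slide (adjust to whatever Maven actually produces under target/), and substituting your own master hostname:
        mvn clean package
        scp target/firsthadoop.jar ubuntu@ec2-23-22-133-70.compute-1.amazonaws.com:~/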
5a. Execute The MapReduce Job
•  We are finally ready to run the job! To do that, we’ll run the Hadoop script and pass in a reference to the JAR file, the name of the class containing the main method, and the input file and output directory that will be used by our Java code:
    cd /home/ubuntu/hadoop-1.0.3
    ./bin/hadoop jar ~/firsthadoop.jar benjaminwootton.WordCount /user/ubuntu/dwp-payments-april10.csv /user/ubuntu/RESULTS

•  If all goes well, we’ll see the job run with no errors:
    ubuntu@ip-10-212-121-253:~/hadoop-1.0.3$ ./bin/hadoop jar ~/firsthadoop.jar benjaminwootton.WordCount /user/ubuntu/dwp-payments-april10.csv /user/ubuntu/RESULTS
    12/06/03 16:06:12 INFO mapred.JobClient: Running job: job_201206031421_0004
    12/06/03 16:06:13 INFO mapred.JobClient:  map 0% reduce 0%
    12/06/03 16:06:29 INFO mapred.JobClient:  map 50% reduce 0%
    12/06/03 16:06:31 INFO mapred.JobClient:  map 100% reduce 0%
    12/06/03 16:06:41 INFO mapred.JobClient:  map 100% reduce 100%
    12/06/03 16:06:50 INFO mapred.JobClient: Job complete: job_201206031421_0004
5b. Execute The MapReduce Job

•  You will also be able to monitor the progress of the job on the job tracker web application that is running on the master server:
6. Confirm Results!
•  And the final step is to confirm our results by interrogating HDFS:
    ubuntu@ip-10-212-121-253:~/hadoop-1.0.3$ ./bin/hadoop dfs -cat /user/ubuntu/RESULTS/part-00000
    Pension                   2150
    Corporate                 5681
    Employment Programmes     491
    Entity                    2
    Housing Benefits          388
    Jobcentre Plus            14774
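
•  If you would rather work with the output as a local file than read it out of HDFS, you can pull the results directory back, assuming the same paths as above:
        ./bin/hadoop dfs -get /user/ubuntu/RESULTS ./RESULTS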
Part 6

WRAPPING UP!
1. What We Have Done!
•  Set up EC2, requested machines, configured firewalls and passwordless SSH;
•  Installed Java and downloaded Hadoop;
•  Configured HDFS and MapReduce and pushed the configuration around the cluster;
•  Started HDFS and MapReduce;
•  Compiled a MapReduce job using Maven;
•  Submitted the job, ran it successfully, and viewed the output.
2. In Summary….
•  Hopefully you can see how this model of computation would be useful for very large datasets that we wish to perform processing on;

•  Hopefully you are also sold on EC2 as a distributed, fast, cost effective platform for using Hadoop for big-data work.

•  Please get in touch with any questions, comments, or corrections!

@benjaminwootton
www.benjaminwootton.co.uk
