2. Basic Setup
1. Install Ubuntu
2. Install Java and Python, and apply system updates
3. Add a group 'hadoop' and a user 'hduser' in that group (for security and backup; see the commands after this list)
4. Configure SSH
   a) Install OpenSSH Server
   b) Configure it by editing the ssh_config file (save a backup first)
   c) Generate an SSH key for hduser
   d) Enable SSH access to your local machine with the newly created RSA key
   e) Test the connection: hduser@ubuntu:~$ ssh localhost
5. Disable IPv6 in the sysctl.conf file
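For steps 3–5, the usual commands look like the following (a minimal sketch, not taken from the original slides; the empty passphrase lets Hadoop reach localhost over SSH unattended):
• $ sudo addgroup hadoop
• $ sudo adduser --ingroup hadoop hduser
• hduser@ubuntu:~$ ssh-keygen -t rsa -P ""
• hduser@ubuntu:~$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
To disable IPv6, the standard keys to add to /etc/sysctl.conf are:
• net.ipv6.conf.all.disable_ipv6 = 1
• net.ipv6.conf.default.disable_ipv6 = 1
• net.ipv6.conf.lo.disable_ipv6 = 1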
3. Installing Hadoop
1. Download Hadoop from one of the Apache Download Mirrors and unpack it:
   • salil@ubuntu:/usr/local$ sudo tar xzf hadoop-2.0.6-alpha-src.tar.gz
2. Change the owner of the extracted files to hduser in the hadoop group:
   • $ sudo chown -R hduser:hadoop hadoop
3. Update $HOME/.bashrc with the Hadoop-related environment variables (see the snippet below)
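A minimal sketch of those .bashrc additions (the JAVA_HOME path is an assumption; point it at wherever your JDK actually lives):
• export HADOOP_HOME=/usr/local/hadoop
• export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-amd64   (assumed JDK path)
• export PATH=$PATH:$HADOOP_HOME/bin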
4. Configuration
1. Edit environment variables in conf/hadoop-env.sh (at minimum JAVA_HOME)
2. Change settings in conf/*-site.xml
3. Create the directory for Hadoop's temporary data and set the required ownership and permissions:
• $ sudo mkdir -p /app/hadoop/tmp
• $ sudo chown hduser:hadoop /app/hadoop/tmp
• $ sudo chmod 750 /app/hadoop/tmp
4. Add configuration snippets between the <configuration> ... </configuration> tags in core-site.xml, mapred-site.xml and hdfs-site.xml (see the example values after this list)
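Typical single-node values for these files, as a sketch (the ports are the common tutorial defaults, not mandated by Hadoop; the JAVA_HOME path is an assumption):

In conf/hadoop-env.sh:
  export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-amd64

In core-site.xml:
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/app/hadoop/tmp</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:54310</value>
  </property>

In mapred-site.xml:
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:54311</value>
  </property>

In hdfs-site.xml:
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>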
5. Starting your single-node cluster
• First, format the NameNode:
• hduser@ubuntu:~$ /usr/local/hadoop/bin/hadoop namenode -format
• Then start your single-node cluster (command below)
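The start command is not on the original slide; in this Hadoop generation the bundled script is start-all.sh, and jps is the usual way to verify that the daemons came up:
• hduser@ubuntu:~$ /usr/local/hadoop/bin/start-all.sh
• hduser@ubuntu:~$ jps   (should list NameNode, DataNode, JobTracker and TaskTracker)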
6. Running a MapReduce job
• Download data and copy it from the local file system to HDFS:
• hduser@ubuntu:~$ hadoop dfs -copyFromLocal /home/hduser/project.txt /user/new
• hduser@ubuntu:~$ hadoop dfs -copyFromLocal /home/hduser/hadoop/project.txt /user/lol
• hduser@ubuntu:~$ hadoop dfs -ls /user/lol
  Found 2 items
  drwxr-xr-x   - hduser supergroup         0 2013-10-10 06:30 /user/lol/output
  -rw-r--r--   1 hduser supergroup    969039 2013-10-05 20:20 /user/lol/project.txt
• hduser@ubuntu:~$ hadoop jar /home/hduser/hadoop/hadoop-examples-1.0.3.jar wordcount /user/lol/project.txt /user/lol/output/
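To inspect the result, read the output directory back out of HDFS (the part file name varies by Hadoop version; part-r-00000 is the common one for this examples jar):
• hduser@ubuntu:~$ hadoop dfs -cat /user/lol/output/part-r-00000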
• Hadoop Web interfaces
• http://localhost:50070/ – web UI of the NameNode daemon
• http://localhost:50030/ – web UI of the JobTracker daemon
• http://localhost:50060/ – web UI of the TaskTracker daemon
• The NameNode web interface gives us a cluster summary: total and remaining capacity, and live and dead nodes.
• Additionally, we can browse HDFS to view the contents of files and logs.
• The JobTracker web interface provides general job statistics for the Hadoop cluster: running, completed and failed jobs, plus a job history log file.
• The TaskTracker interface provides information about running and non-running tasks.
10. Writing MapReduce programs
• The Hadoop framework is written in Java, which can be hard to code against for non-CS programmers
• MapReduce programs can be written in Python and converted to a .jar file using Jython to run on a Hadoop cluster
• But Jython has an incomplete standard library, because some Python features are not provided in Jython
• The alternative is to use Hadoop Streaming
• Hadoop Streaming is a utility that comes with the Hadoop distribution; it can run any executable script as the mapper and reducer
• Write mapper.py and reducer.py in Python (see the sketch below)
• Download the data and copy it to HDFS
• Run it the same way as the previous Java implementation
• There are other third-party Python MapReduce solutions which are similar to Streaming/Jython but can be used simply as a library in Python
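A minimal word-count pair for Hadoop Streaming could look like this (a sketch, not code from the deck; the reducer relies on Streaming sorting mapper output by key before it arrives):

  #!/usr/bin/env python
  # mapper.py: read lines from stdin, emit one "word<TAB>1" pair per word
  import sys

  for line in sys.stdin:
      for word in line.strip().split():
          print("%s\t%s" % (word, 1))

  #!/usr/bin/env python
  # reducer.py: sum the counts for each word; input arrives sorted by word
  import sys

  current_word, current_count = None, 0
  for line in sys.stdin:
      word, count = line.strip().split("\t", 1)
      if word == current_word:
          current_count += int(count)
      else:
          if current_word is not None:
              print("%s\t%d" % (current_word, current_count))
          current_word, current_count = word, int(count)
  if current_word is not None:
      print("%s\t%d" % (current_word, current_count))

Run it with the streaming jar that ships under contrib/ in this Hadoop generation (the exact jar name depends on your version):
• hduser@ubuntu:~$ hadoop jar /usr/local/hadoop/contrib/streaming/hadoop-streaming-1.0.3.jar -file mapper.py -mapper mapper.py -file reducer.py -reducer reducer.py -input /user/lol/project.txt -output /user/lol/py-output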
12. Python implementation strategies
• Hadoop-based:
  • Streaming
  • mrjob
  • dumbo
  • Hadoopy
• Non-Hadoop:
  • disco
• Prefer Hadoop Streaming if possible, because it is easy and has the lowest overhead
• Prefer mrjob where you need higher abstraction and integration with AWS (see the mrjob sketch below)
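For comparison, the same word count as an mrjob job (a sketch assuming mrjob is installed; the file name wc_mrjob.py is made up):

  # wc_mrjob.py
  from mrjob.job import MRJob

  class MRWordCount(MRJob):
      # emit one (word, 1) pair per word in the input line
      def mapper(self, _, line):
          for word in line.split():
              yield word, 1

      # sum the counts mrjob groups under each word
      def reducer(self, word, counts):
          yield word, sum(counts)

  if __name__ == "__main__":
      MRWordCount.run()

Locally it runs as a plain script (python wc_mrjob.py input.txt); the same file can target a Hadoop cluster or AWS Elastic MapReduce via mrjob's -r switch.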
13. Future Work
• Python implementation in Hadoop
• Running Hadoop on a multi-node cluster
• Pig and its implementation on Linux
• Apache Mahout, Hive, Solr