2. Basic Setup
1. Install Ubuntu
2. Install Java and Python, and apply system updates
3. Add a group 'hadoop' and a user 'hduser' in that group (for security and backup; see the commands after this list)
4. Configure SSH
   a) Install OpenSSH Server
   b) Configure it by editing the ssh_config file (save a backup first)
   c) Generate an SSH key for hduser
   d) Enable SSH access to your local machine with the newly created RSA key
   e) Test the connection: hduser@ubuntu:~$ ssh localhost
5. Disable IPv6 in the sysctl.conf file
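For steps 3–5, the usual commands look like the following (a minimal sketch, not taken from the original slides; the empty passphrase lets Hadoop reach localhost over SSH unattended):
• $ sudo addgroup hadoop
• $ sudo adduser --ingroup hadoop hduser
• hduser@ubuntu:~$ ssh-keygen -t rsa -P ""
• hduser@ubuntu:~$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
To disable IPv6, the standard keys to add to /etc/sysctl.conf are:
• net.ipv6.conf.all.disable_ipv6 = 1
• net.ipv6.conf.default.disable_ipv6 = 1
• net.ipv6.conf.lo.disable_ipv6 = 1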
3. Installing Hadoop
1. Download Hadoop from one of the Apache Download Mirrors and unpack it:
   • salil@ubuntu:/usr/local$ sudo tar xzf hadoop-2.0.6-alpha-src.tar.gz
2. Change the owner of the extracted files to hduser in the hadoop group:
   • $ sudo chown -R hduser:hadoop hadoop
3. Update $HOME/.bashrc with the Hadoop-related environment variables (see the snippet below)
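A minimal sketch of those .bashrc additions (the JAVA_HOME path is an assumption; point it at wherever your JDK actually lives):
• export HADOOP_HOME=/usr/local/hadoop
• export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-amd64   (assumed JDK path)
• export PATH=$PATH:$HADOOP_HOME/bin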
4. Configuration
1. Edit environment variables in conf/hadoop-env.sh (at minimum JAVA_HOME)
2. Change settings in conf/*-site.xml
3. Create the directory for Hadoop's temporary data and set the required ownership and permissions:
• $ sudo mkdir -p /app/hadoop/tmp
• $ sudo chown hduser:hadoop /app/hadoop/tmp
• $ sudo chmod 750 /app/hadoop/tmp
4. Add configuration snippets between the <configuration> ... </configuration> tags in core-site.xml, mapred-site.xml and hdfs-site.xml (see the example values after this list)
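Typical single-node values for these files, as a sketch (the ports are the common tutorial defaults, not mandated by Hadoop; the JAVA_HOME path is an assumption):

In conf/hadoop-env.sh:
  export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-amd64

In core-site.xml:
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/app/hadoop/tmp</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:54310</value>
  </property>

In mapred-site.xml:
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:54311</value>
  </property>

In hdfs-site.xml:
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>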
5. Starting your single-node cluster
• First, format the NameNode:
• hduser@ubuntu:~$ /usr/local/hadoop/bin/hadoop namenode -format
• Then start your single-node cluster (command below)
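The start command is not on the original slide; in this Hadoop generation the bundled script is start-all.sh, and jps is the usual way to verify that the daemons came up:
• hduser@ubuntu:~$ /usr/local/hadoop/bin/start-all.sh
• hduser@ubuntu:~$ jps   (should list NameNode, DataNode, JobTracker and TaskTracker)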
6. Running a MapReduce job
• Download data and copy it from the local file system to HDFS:
• hduser@ubuntu:~$ hadoop dfs -copyFromLocal /home/hduser/project.txt /user/new
• hduser@ubuntu:~$ hadoop dfs -copyFromLocal /home/hduser/hadoop/project.txt /user/lol
• hduser@ubuntu:~$ hadoop dfs -ls /user/lol
  Found 2 items
  drwxr-xr-x   - hduser supergroup         0 2013-10-10 06:30 /user/lol/output
  -rw-r--r--   1 hduser supergroup    969039 2013-10-05 20:20 /user/lol/project.txt
• hduser@ubuntu:~$ hadoop jar /home/hduser/hadoop/hadoop-examples-1.0.3.jar wordcount /user/lol/project.txt /user/lol/output/
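To inspect the result, read the output directory back out of HDFS (the part file name varies by Hadoop version; part-r-00000 is the common one for this examples jar):
• hduser@ubuntu:~$ hadoop dfs -cat /user/lol/output/part-r-00000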
• Hadoop Web interfaces
• http://localhost:50070/ – web UI of the NameNode daemon
• http://localhost:50030/ – web UI of the JobTracker daemon
• http://localhost:50060/ – web UI of the TaskTracker daemon
• The NameNode web interface gives us a cluster summary: total and remaining capacity, and live and dead nodes.
• Additionally, we can browse HDFS to view the contents of files and logs.
• The JobTracker web interface provides general job statistics for the Hadoop cluster: running, completed and failed jobs, plus a job history log file.
• The TaskTracker interface provides information about running and non-running tasks.
10. Writing MapReduce programs
• The Hadoop framework is written in Java, which can be hard to code against for non-CS programmers
• MapReduce programs can be written in Python and converted to a .jar file using Jython to run on a Hadoop cluster
• But Jython has an incomplete standard library, because some Python features are not provided in Jython
• The alternative is to use Hadoop Streaming
• Hadoop Streaming is a utility that comes with the Hadoop distribution; it can run any executable script as the mapper and reducer
• Write mapper.py and reducer.py in Python (see the sketch below)
• Download the data and copy it to HDFS
• Run it the same way as the previous Java implementation
• There are other third-party Python MapReduce solutions which are similar to Streaming/Jython but can be used simply as a library in Python
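A minimal word-count pair for Hadoop Streaming could look like this (a sketch, not code from the deck; the reducer relies on Streaming sorting mapper output by key before it arrives):

  #!/usr/bin/env python
  # mapper.py: read lines from stdin, emit one "word<TAB>1" pair per word
  import sys

  for line in sys.stdin:
      for word in line.strip().split():
          print("%s\t%s" % (word, 1))

  #!/usr/bin/env python
  # reducer.py: sum the counts for each word; input arrives sorted by word
  import sys

  current_word, current_count = None, 0
  for line in sys.stdin:
      word, count = line.strip().split("\t", 1)
      if word == current_word:
          current_count += int(count)
      else:
          if current_word is not None:
              print("%s\t%d" % (current_word, current_count))
          current_word, current_count = word, int(count)
  if current_word is not None:
      print("%s\t%d" % (current_word, current_count))

Run it with the streaming jar that ships under contrib/ in this Hadoop generation (the exact jar name depends on your version):
• hduser@ubuntu:~$ hadoop jar /usr/local/hadoop/contrib/streaming/hadoop-streaming-1.0.3.jar -file mapper.py -mapper mapper.py -file reducer.py -reducer reducer.py -input /user/lol/project.txt -output /user/lol/py-output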
12. Python implementation strategies
• Hadoop-based:
  • Streaming
  • mrjob
  • dumbo
  • Hadoopy
• Non-Hadoop:
  • disco
• Prefer Hadoop Streaming if possible, because it is easy and has the lowest overhead
• Prefer mrjob where you need higher abstraction and integration with AWS (see the mrjob sketch below)
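For comparison, the same word count as an mrjob job (a sketch assuming mrjob is installed; the file name wc_mrjob.py is made up):

  # wc_mrjob.py
  from mrjob.job import MRJob

  class MRWordCount(MRJob):
      # emit one (word, 1) pair per word in the input line
      def mapper(self, _, line):
          for word in line.split():
              yield word, 1

      # sum the counts mrjob groups under each word
      def reducer(self, word, counts):
          yield word, sum(counts)

  if __name__ == "__main__":
      MRWordCount.run()

Locally it runs as a plain script (python wc_mrjob.py input.txt); the same file can target a Hadoop cluster or AWS Elastic MapReduce via mrjob's -r switch.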
13. Future Work
• Python implementation in Hadoop
• Running Hadoop on a multi-node cluster
• Pig and its implementation on Linux
• Apache Mahout, Hive, Solr