SlideShare une entreprise Scribd logo
1  sur  18
Hadoop


    Rahul Jain
    Software Engineer
    http://www.linkedin.com/in/rahuldausa



                                      1
Agenda
• Hadoop
   –   Introduction
   –   Hadoop (Why)
   –   Hadoop History
   –   Uses of Hadoop
   –   High Level Architecture
   –   Map-Reduce
• HDFS
   – GFS (Google File System)
   – HDFS Architecture
• Installation/Configuration
• Examples

                                     2
Introduction
 An open source software framework
 Supports Data intensive Distributed Applications.
 Enables Application to work thousand of computational
  independent computers and petabytes of data.
 Derived from Google’s Map-Reduce and Google File System
  papers.
 Written in the Java Programming Language.
 Started by Doug Cutting, who named it after his son’s toy
  elephant to support distribution for the Nutch ( A sub-project
  of Lucene)




                                                                   3
Hadoop (Why)
• Need to process huge datasets on large no. of computers.
• It is expensive to build reliability into each application.
• Nodes fails everyday
        - Failure is expected, rather than exceptional.
- Need common infrastructure
   - Efficient, reliable, easy to use.
   - Open sourced , Apache License




                                                                4
Hadoop History
•   Dec 2004 – Google GFS paper published
•   July 2005 – Nutch uses Map-Reduce
•   Jan 2006 – Doug Cutting joins Yahoo!
•   Feb 2006 – Become Lucene Subproject
•   Apr 2007 – Yahoo! On 1000 node cluster
•   Jan 2008 – An Apache Top Level Project
•   Feb 2008 – Yahoo Production search index

                                               5
What is Hadoop Used for ?
•   Searching (Yahoo)
•   Log Processing
•   Recommendation Systems (Facebook, LinkedIn, eBay, Amazon)
•   Analytics(Facebook, LinkedIn)
•   Video and Image Analysis (NASA)
•   Data Retention



                                                            6
Hadoop High Level Architecture




                                 7
Map-Reduce
Framework for processing parallel
problems across huge datasets using a
large numbers of computers(nodes),
collectively referred as
Cluster : If all nodes are on same local network and
uses similar network.
Or
Grid: If the nodes are shared across geographically
and uses more heterogeneous hardware.

Consists Two Step :
1.Map Step- The master node takes the input,
divides it into smaller sub-problems, and distributes
them to worker nodes. A worker node may do this
again in turn, leading to a multi-level tree structure.
The worker node processes the smaller problem,
and passes the answer back to its master node.
2.Reduce Step -The master node then collects
the answers to all the sub-problems and combines
them in some way to form the output – the answer
to the problem it was originally trying to solve.

                                                          Multiple Map-Reduce phases

                                                                                       8
Map-Reduce Life-Cycle




Credit : http://code.google.com/edu/parallel/mapreduce-tutorial.html
                                                                       9
HDFS
Hadoop Distributed File System




                                 10
Lets Understand GFS first …
               Google File System




                                    11
GFS Architecture




                   12
Goals of HDFS
1. Very Large Distributed File System
      - 10K nodes, 100 million files, 10 PB
2. Assumes Commodity Hardware
      - Files are replicated to handle hardware failure
      - Detect failures and recovers from them
3. Optimized for Batch Processing
      - Data locations exposed so that computation can move to
      where data resides.


                                                            13
14
Installation/ Configuration
  [rjain@10.1.110.12 hadoop-1.0.3]$ vi conf/hdfs-site.xml     [rjain@10.1.110.12 hadoop-1.0.3]$ pwd
  <configuration>                                             /home/rjain/hadoop-1.0.3
  <property>
       <name>dfs.replication</name>
                                                              [rjain@10.1.110.12 hadoop-1.0.3]$ bin/start-all.sh
       <value>1</value>
      </property>                                             [rjain@10.1.110.12 hadoop-1.0.3]$ bin/start-mapred.sh
   <property>
      <name>dfs.permissions</name>
                                                              [rjain@10.1.110.12 hadoop-1.0.3]$ bin/start-dfs.sh
      <value>true</value>
      </property>                                             [rjain@10.1.110.12 hadoop-1.0.3]$ bin/hadoop fs
  <property>                                                  Usage: java FsShell
     <name>dfs.data.dir</name>                                      [-ls <path>]
     <value>/home/rjain/rahul/hdfs/data</value>                     [-lsr <path>] : Recursive version of ls. Similar to Unix ls -R.
    </property>                                                     [-du <path>] : Displays aggregate length of files contained in the directory or the length of a file.
    <property>                                                      [-dus <path>] : Displays a summary of file lengths.
     <name>dfs.name.dir</name>                                      [-count[-q] <path>]
     <value>/home/rjain/rahul/hdfs/name</value>                     [-mv <src> <dst>]
    </property>                                                     [-cp <src> <dst>]
  </configuration>                                                  [-rm [-skipTrash] <path>]
                                                                    [-rmr [-skipTrash] <path>] : Recursive version of delete(rm).
  [rjain@10.1.110.12 hadoop-1.0.3]$ vi conf/mapred-site.xml         [-expunge] : Empty the Trash
  <configuration>                                                   [-put <localsrc> ... <dst>] : Copy single src, or multiple srcs from local file system to the
      <property>                                              destination filesystem
           <name>mapred.job.tracker</name>                          [-copyFromLocal <localsrc> ... <dst>]
           <value>localhost:9001</value>                            [-moveFromLocal <localsrc> ... <dst>]
      </property>                                                   [-get [-ignoreCrc] [-crc] <src> <localdst>]
  </configuration>                                                  [-getmerge <src> <localdst> [addnl]]
                                                                    [-cat <src>]
  [rjain@10.1.110.12 hadoop-1.0.3]$ vi conf/core-site.xml           [-text <src>] : Takes a source file and outputs the file in text format. The allowed formats are zip
  <configuration>                                             and TextRecordInputStream.
      <property>                                                    [-copyToLocal [-ignoreCrc] [-crc] <src> <localdst>]
           <name>fs.default.name</name>                             [-moveToLocal [-crc] <src> <localdst>]
            <value>hdfs://localhost:9000</value>                    [-mkdir <path>]
      </property>                                                   [-setrep [-R] [-w] <rep> <path/file>] : Changes the replication factor of a file
  </configuration>                                                  [-touchz <path>] : Create a file of zero length.
                                                                    [-test -[ezd] <path>] : -e check to see if the file exists. Return 0 if true. -z check to see if the file is
[rjain@10.1.110.12 hadoop-1.0.3]$ jps                         zero length. Return 0 if true. -d check to see if the path is directory. Return 0 if true.
29756 SecondaryNameNode                                             [-stat [format] <path>] : Returns the stat information on the path like created time of dir
19847 TaskTracker                                                   [-tail [-f] <file>] : Displays last kilobyte of the file to stdout
18756 Jps                                                           [-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...]
29483 NameNode                                                      [-chown [-R] [OWNER][:[GROUP]] PATH...]
29619 DataNode                                                      [-chgrp [-R] GROUP PATH...]                                                                       15
19711 JobTracker                                                    [-help [cmd]]
HDFS- Read/Write Example
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);

Given an input/output file name as string, we construct inFile/outFile Path objects.
Most of the FileSystem APIs accepts Path objects.

Path inFile = new Path(argv[0]);
Path outFile = new Path(argv[1]);

Validate the input/output paths before reading/writing.

if (!fs.exists(inFile))
  printAndExit("Input file not found");
if (!fs.isFile(inFile))
  printAndExit("Input should be a file");
if (fs.exists(outFile))
  printAndExit("Output already exists");

Open inFile for reading.

FSDataInputStream in = fs.open(inFile);

Open outFile for writing.

FSDataOutputStream out = fs.create(outFile);

Read from input stream and write to output stream until EOF.

while ((bytesRead = in.read(buffer)) > 0) {
  out.write(buffer, 0, bytesRead);
}

Close the streams when done.

in.close();
out.close();
                                                                                       16
Hadoop Sub-Projects
•   Hadoop Common: The common utilities that support the other Hadoop subprojects.
•   Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput
    access to application data.
•   Hadoop MapReduce: A software framework for distributed processing of large data sets on
    compute clusters.

Other Hadoop-related projects at Apache include:

•   Avro™: A data serialization system.
•   Cassandra™: A scalable multi-master database with no single points of failure.
•   Chukwa™: A data collection system for managing large distributed systems.
•   HBase™: A scalable, distributed database that supports structured data storage for large tables.
•   Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.
•   Mahout™: A Scalable machine learning and data mining library.
•   Pig™: A high-level data-flow language and execution framework for parallel computation.
•   ZooKeeper™: A high-performance coordination service for distributed applications.




                                                                                                       17
Questions ?




              18

Contenu connexe

Tendances

Hive Quick Start Tutorial
Hive Quick Start TutorialHive Quick Start Tutorial
Hive Quick Start TutorialCarl Steinbach
 
Apache Spark Introduction
Apache Spark IntroductionApache Spark Introduction
Apache Spark Introductionsudhakara st
 
July 2010 Triangle Hadoop Users Group - Chad Vawter Slides
July 2010 Triangle Hadoop Users Group - Chad Vawter SlidesJuly 2010 Triangle Hadoop Users Group - Chad Vawter Slides
July 2010 Triangle Hadoop Users Group - Chad Vawter Slidesryancox
 
HDFS: Hadoop Distributed Filesystem
HDFS: Hadoop Distributed FilesystemHDFS: Hadoop Distributed Filesystem
HDFS: Hadoop Distributed FilesystemSteve Loughran
 
11. From Hadoop to Spark 2/2
11. From Hadoop to Spark 2/211. From Hadoop to Spark 2/2
11. From Hadoop to Spark 2/2Fabio Fumarola
 
Hadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - Cloudera
Hadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - ClouderaHadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - Cloudera
Hadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - ClouderaCloudera, Inc.
 
Apache Hadoop and HBase
Apache Hadoop and HBaseApache Hadoop and HBase
Apache Hadoop and HBaseCloudera, Inc.
 
From docker to kubernetes: running Apache Hadoop in a cloud native way
From docker to kubernetes: running Apache Hadoop in a cloud native wayFrom docker to kubernetes: running Apache Hadoop in a cloud native way
From docker to kubernetes: running Apache Hadoop in a cloud native wayDataWorks Summit
 
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)Hari Shankar Sreekumar
 
Hadoop Tutorial
Hadoop TutorialHadoop Tutorial
Hadoop Tutorialawesomesos
 
Hive Anatomy
Hive AnatomyHive Anatomy
Hive Anatomynzhang
 
Practical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & PigPractical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & PigMilind Bhandarkar
 

Tendances (20)

Hive Quick Start Tutorial
Hive Quick Start TutorialHive Quick Start Tutorial
Hive Quick Start Tutorial
 
6.hive
6.hive6.hive
6.hive
 
Apache Spark Introduction
Apache Spark IntroductionApache Spark Introduction
Apache Spark Introduction
 
Hadoop-Introduction
Hadoop-IntroductionHadoop-Introduction
Hadoop-Introduction
 
Hadoop2.2
Hadoop2.2Hadoop2.2
Hadoop2.2
 
July 2010 Triangle Hadoop Users Group - Chad Vawter Slides
July 2010 Triangle Hadoop Users Group - Chad Vawter SlidesJuly 2010 Triangle Hadoop Users Group - Chad Vawter Slides
July 2010 Triangle Hadoop Users Group - Chad Vawter Slides
 
HDFS: Hadoop Distributed Filesystem
HDFS: Hadoop Distributed FilesystemHDFS: Hadoop Distributed Filesystem
HDFS: Hadoop Distributed Filesystem
 
HDFS Internals
HDFS InternalsHDFS Internals
HDFS Internals
 
Sqoop
SqoopSqoop
Sqoop
 
11. From Hadoop to Spark 2/2
11. From Hadoop to Spark 2/211. From Hadoop to Spark 2/2
11. From Hadoop to Spark 2/2
 
Hadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - Cloudera
Hadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - ClouderaHadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - Cloudera
Hadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - Cloudera
 
Apache Hadoop and HBase
Apache Hadoop and HBaseApache Hadoop and HBase
Apache Hadoop and HBase
 
From docker to kubernetes: running Apache Hadoop in a cloud native way
From docker to kubernetes: running Apache Hadoop in a cloud native wayFrom docker to kubernetes: running Apache Hadoop in a cloud native way
From docker to kubernetes: running Apache Hadoop in a cloud native way
 
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
 
An Introduction to Hadoop
An Introduction to HadoopAn Introduction to Hadoop
An Introduction to Hadoop
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
Hadoop Tutorial
Hadoop TutorialHadoop Tutorial
Hadoop Tutorial
 
Hive Anatomy
Hive AnatomyHive Anatomy
Hive Anatomy
 
Practical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & PigPractical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & Pig
 
Nov 2011 HUG: Blur - Lucene on Hadoop
Nov 2011 HUG: Blur - Lucene on HadoopNov 2011 HUG: Blur - Lucene on Hadoop
Nov 2011 HUG: Blur - Lucene on Hadoop
 

En vedette

Introduction to Kafka Streams
Introduction to Kafka StreamsIntroduction to Kafka Streams
Introduction to Kafka StreamsGuozhang Wang
 
Hadoop Operations - Best Practices from the Field
Hadoop Operations - Best Practices from the FieldHadoop Operations - Best Practices from the Field
Hadoop Operations - Best Practices from the FieldDataWorks Summit
 
Distributed Computing with Apache Hadoop: Technology Overview
Distributed Computing with Apache Hadoop: Technology OverviewDistributed Computing with Apache Hadoop: Technology Overview
Distributed Computing with Apache Hadoop: Technology OverviewKonstantin V. Shvachko
 
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...DataWorks Summit/Hadoop Summit
 
Hadoop - Lessons Learned
Hadoop - Lessons LearnedHadoop - Lessons Learned
Hadoop - Lessons Learnedtcurdt
 
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesScaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesDataWorks Summit/Hadoop Summit
 
Hadoop HDFS Architeture and Design
Hadoop HDFS Architeture and DesignHadoop HDFS Architeture and Design
Hadoop HDFS Architeture and Designsudhakara st
 
Hadoop & Big Data benchmarking
Hadoop & Big Data benchmarkingHadoop & Big Data benchmarking
Hadoop & Big Data benchmarkingBart Vandewoestyne
 
Unleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerUnleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerDataWorks Summit/Hadoop Summit
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkRahul Jain
 
Apache Kafka 0.8 basic training - Verisign
Apache Kafka 0.8 basic training - VerisignApache Kafka 0.8 basic training - Verisign
Apache Kafka 0.8 basic training - VerisignMichael Noll
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation HadoopVarun Narang
 

En vedette (18)

Apache kafka
Apache kafkaApache kafka
Apache kafka
 
Introduction to Kafka Streams
Introduction to Kafka StreamsIntroduction to Kafka Streams
Introduction to Kafka Streams
 
Hadoop Operations - Best Practices from the Field
Hadoop Operations - Best Practices from the FieldHadoop Operations - Best Practices from the Field
Hadoop Operations - Best Practices from the Field
 
Distributed Computing with Apache Hadoop: Technology Overview
Distributed Computing with Apache Hadoop: Technology OverviewDistributed Computing with Apache Hadoop: Technology Overview
Distributed Computing with Apache Hadoop: Technology Overview
 
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
 
Hadoop - Lessons Learned
Hadoop - Lessons LearnedHadoop - Lessons Learned
Hadoop - Lessons Learned
 
Big data- HDFS(2nd presentation)
Big data- HDFS(2nd presentation)Big data- HDFS(2nd presentation)
Big data- HDFS(2nd presentation)
 
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesScaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
 
Hadoop introduction
Hadoop introductionHadoop introduction
Hadoop introduction
 
HDFS Design Principles
HDFS Design PrinciplesHDFS Design Principles
HDFS Design Principles
 
Hadoop HDFS Architeture and Design
Hadoop HDFS Architeture and DesignHadoop HDFS Architeture and Design
Hadoop HDFS Architeture and Design
 
Hadoop & Big Data benchmarking
Hadoop & Big Data benchmarkingHadoop & Big Data benchmarking
Hadoop & Big Data benchmarking
 
Unleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerUnleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache Ranger
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache Spark
 
Apache Kafka 0.8 basic training - Verisign
Apache Kafka 0.8 basic training - VerisignApache Kafka 0.8 basic training - Verisign
Apache Kafka 0.8 basic training - Verisign
 
Hadoop
HadoopHadoop
Hadoop
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation Hadoop
 
reveal.js 3.0.0
reveal.js 3.0.0reveal.js 3.0.0
reveal.js 3.0.0
 

Similaire à Hadoop & HDFS for Beginners

Introduction to HDFS and MapReduce
Introduction to HDFS and MapReduceIntroduction to HDFS and MapReduce
Introduction to HDFS and MapReduceUday Vakalapudi
 
Hadoop Interacting with HDFS
Hadoop Interacting with HDFSHadoop Interacting with HDFS
Hadoop Interacting with HDFSApache Apex
 
Big data using Hadoop, Hive, Sqoop with Installation
Big data using Hadoop, Hive, Sqoop with InstallationBig data using Hadoop, Hive, Sqoop with Installation
Big data using Hadoop, Hive, Sqoop with Installationmellempudilavanya999
 
Hadoop Introduction
Hadoop IntroductionHadoop Introduction
Hadoop IntroductionSNEHAL MASNE
 
ACADGILD:: HADOOP LESSON
ACADGILD:: HADOOP LESSON ACADGILD:: HADOOP LESSON
ACADGILD:: HADOOP LESSON Padma shree. T
 
The Family of Hadoop
The Family of HadoopThe Family of Hadoop
The Family of HadoopNam Nham
 
Data analysis on hadoop
Data analysis on hadoopData analysis on hadoop
Data analysis on hadoopFrank Y
 
Hadoop 2.x HDFS Cluster Installation (VirtualBox)
Hadoop 2.x  HDFS Cluster Installation (VirtualBox)Hadoop 2.x  HDFS Cluster Installation (VirtualBox)
Hadoop 2.x HDFS Cluster Installation (VirtualBox)Amir Sedighi
 
Hadoop 20111117
Hadoop 20111117Hadoop 20111117
Hadoop 20111117exsuns
 
Drupal cambs ansible for drupal april 2015
Drupal cambs ansible for drupal april 2015Drupal cambs ansible for drupal april 2015
Drupal cambs ansible for drupal april 2015Ryan Brown
 
Hadoop Installation presentation
Hadoop Installation presentationHadoop Installation presentation
Hadoop Installation presentationpuneet yadav
 
Design and Research of Hadoop Distributed Cluster Based on Raspberry
Design and Research of Hadoop Distributed Cluster Based on RaspberryDesign and Research of Hadoop Distributed Cluster Based on Raspberry
Design and Research of Hadoop Distributed Cluster Based on RaspberryIJRESJOURNAL
 
Hadoop Interview Questions and Answers by rohit kapa
Hadoop Interview Questions and Answers by rohit kapaHadoop Interview Questions and Answers by rohit kapa
Hadoop Interview Questions and Answers by rohit kapakapa rohit
 
Hadoop configuration & performance tuning
Hadoop configuration & performance tuningHadoop configuration & performance tuning
Hadoop configuration & performance tuningVitthal Gogate
 

Similaire à Hadoop & HDFS for Beginners (20)

Hadoop Tutorial
Hadoop TutorialHadoop Tutorial
Hadoop Tutorial
 
Introduction to HDFS and MapReduce
Introduction to HDFS and MapReduceIntroduction to HDFS and MapReduce
Introduction to HDFS and MapReduce
 
Hadoop Interacting with HDFS
Hadoop Interacting with HDFSHadoop Interacting with HDFS
Hadoop Interacting with HDFS
 
MapReduce1.pptx
MapReduce1.pptxMapReduce1.pptx
MapReduce1.pptx
 
HDFS_Command_Reference
HDFS_Command_ReferenceHDFS_Command_Reference
HDFS_Command_Reference
 
Bd class 2 complete
Bd class 2 completeBd class 2 complete
Bd class 2 complete
 
Big data using Hadoop, Hive, Sqoop with Installation
Big data using Hadoop, Hive, Sqoop with InstallationBig data using Hadoop, Hive, Sqoop with Installation
Big data using Hadoop, Hive, Sqoop with Installation
 
Hadoop Introduction
Hadoop IntroductionHadoop Introduction
Hadoop Introduction
 
ACADGILD:: HADOOP LESSON
ACADGILD:: HADOOP LESSON ACADGILD:: HADOOP LESSON
ACADGILD:: HADOOP LESSON
 
Lecture 2 part 1
Lecture 2 part 1Lecture 2 part 1
Lecture 2 part 1
 
The Family of Hadoop
The Family of HadoopThe Family of Hadoop
The Family of Hadoop
 
Data analysis on hadoop
Data analysis on hadoopData analysis on hadoop
Data analysis on hadoop
 
Hadoop 2.x HDFS Cluster Installation (VirtualBox)
Hadoop 2.x  HDFS Cluster Installation (VirtualBox)Hadoop 2.x  HDFS Cluster Installation (VirtualBox)
Hadoop 2.x HDFS Cluster Installation (VirtualBox)
 
Hadoop 20111117
Hadoop 20111117Hadoop 20111117
Hadoop 20111117
 
Unit 1
Unit 1Unit 1
Unit 1
 
Drupal cambs ansible for drupal april 2015
Drupal cambs ansible for drupal april 2015Drupal cambs ansible for drupal april 2015
Drupal cambs ansible for drupal april 2015
 
Hadoop Installation presentation
Hadoop Installation presentationHadoop Installation presentation
Hadoop Installation presentation
 
Design and Research of Hadoop Distributed Cluster Based on Raspberry
Design and Research of Hadoop Distributed Cluster Based on RaspberryDesign and Research of Hadoop Distributed Cluster Based on Raspberry
Design and Research of Hadoop Distributed Cluster Based on Raspberry
 
Hadoop Interview Questions and Answers by rohit kapa
Hadoop Interview Questions and Answers by rohit kapaHadoop Interview Questions and Answers by rohit kapa
Hadoop Interview Questions and Answers by rohit kapa
 
Hadoop configuration & performance tuning
Hadoop configuration & performance tuningHadoop configuration & performance tuning
Hadoop configuration & performance tuning
 

Plus de Rahul Jain

Flipkart Strategy Analysis and Recommendation
Flipkart Strategy Analysis and RecommendationFlipkart Strategy Analysis and Recommendation
Flipkart Strategy Analysis and RecommendationRahul Jain
 
Emerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big DataEmerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big DataRahul Jain
 
Case study of Rujhaan.com (A social news app )
Case study of Rujhaan.com (A social news app )Case study of Rujhaan.com (A social news app )
Case study of Rujhaan.com (A social news app )Rahul Jain
 
Building a Large Scale SEO/SEM Application with Apache Solr
Building a Large Scale SEO/SEM Application with Apache SolrBuilding a Large Scale SEO/SEM Application with Apache Solr
Building a Large Scale SEO/SEM Application with Apache SolrRahul Jain
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache SparkRahul Jain
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine LearningRahul Jain
 
Introduction to Scala
Introduction to ScalaIntroduction to Scala
Introduction to ScalaRahul Jain
 
What is NoSQL and CAP Theorem
What is NoSQL and CAP TheoremWhat is NoSQL and CAP Theorem
What is NoSQL and CAP TheoremRahul Jain
 
Introduction to Elasticsearch with basics of Lucene
Introduction to Elasticsearch with basics of LuceneIntroduction to Elasticsearch with basics of Lucene
Introduction to Elasticsearch with basics of LuceneRahul Jain
 
Introduction to Apache Lucene/Solr
Introduction to Apache Lucene/SolrIntroduction to Apache Lucene/Solr
Introduction to Apache Lucene/SolrRahul Jain
 
Introduction to Lucene & Solr and Usecases
Introduction to Lucene & Solr and UsecasesIntroduction to Lucene & Solr and Usecases
Introduction to Lucene & Solr and UsecasesRahul Jain
 
Introduction to Kafka and Zookeeper
Introduction to Kafka and ZookeeperIntroduction to Kafka and Zookeeper
Introduction to Kafka and ZookeeperRahul Jain
 
Hibernate tutorial for beginners
Hibernate tutorial for beginnersHibernate tutorial for beginners
Hibernate tutorial for beginnersRahul Jain
 

Plus de Rahul Jain (13)

Flipkart Strategy Analysis and Recommendation
Flipkart Strategy Analysis and RecommendationFlipkart Strategy Analysis and Recommendation
Flipkart Strategy Analysis and Recommendation
 
Emerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big DataEmerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big Data
 
Case study of Rujhaan.com (A social news app )
Case study of Rujhaan.com (A social news app )Case study of Rujhaan.com (A social news app )
Case study of Rujhaan.com (A social news app )
 
Building a Large Scale SEO/SEM Application with Apache Solr
Building a Large Scale SEO/SEM Application with Apache SolrBuilding a Large Scale SEO/SEM Application with Apache Solr
Building a Large Scale SEO/SEM Application with Apache Solr
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learning
 
Introduction to Scala
Introduction to ScalaIntroduction to Scala
Introduction to Scala
 
What is NoSQL and CAP Theorem
What is NoSQL and CAP TheoremWhat is NoSQL and CAP Theorem
What is NoSQL and CAP Theorem
 
Introduction to Elasticsearch with basics of Lucene
Introduction to Elasticsearch with basics of LuceneIntroduction to Elasticsearch with basics of Lucene
Introduction to Elasticsearch with basics of Lucene
 
Introduction to Apache Lucene/Solr
Introduction to Apache Lucene/SolrIntroduction to Apache Lucene/Solr
Introduction to Apache Lucene/Solr
 
Introduction to Lucene & Solr and Usecases
Introduction to Lucene & Solr and UsecasesIntroduction to Lucene & Solr and Usecases
Introduction to Lucene & Solr and Usecases
 
Introduction to Kafka and Zookeeper
Introduction to Kafka and ZookeeperIntroduction to Kafka and Zookeeper
Introduction to Kafka and Zookeeper
 
Hibernate tutorial for beginners
Hibernate tutorial for beginnersHibernate tutorial for beginners
Hibernate tutorial for beginners
 

Dernier

Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesZilliz
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 

Dernier (20)

Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector Databases
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 

Hadoop & HDFS for Beginners

  • 1. Hadoop Rahul Jain Software Engineer http://www.linkedin.com/in/rahuldausa 1
  • 2. Agenda • Hadoop – Introduction – Hadoop (Why) – Hadoop History – Uses of Hadoop – High Level Architecture – Map-Reduce • HDFS – GFS (Google File System) – HDFS Architecture • Installation/Configuration • Examples 2
  • 3. Introduction  An open source software framework  Supports Data intensive Distributed Applications.  Enables Application to work thousand of computational independent computers and petabytes of data.  Derived from Google’s Map-Reduce and Google File System papers.  Written in the Java Programming Language.  Started by Doug Cutting, who named it after his son’s toy elephant to support distribution for the Nutch ( A sub-project of Lucene) 3
  • 4. Hadoop (Why) • Need to process huge datasets on large no. of computers. • It is expensive to build reliability into each application. • Nodes fails everyday - Failure is expected, rather than exceptional. - Need common infrastructure - Efficient, reliable, easy to use. - Open sourced , Apache License 4
  • 5. Hadoop History • Dec 2004 – Google GFS paper published • July 2005 – Nutch uses Map-Reduce • Jan 2006 – Doug Cutting joins Yahoo! • Feb 2006 – Become Lucene Subproject • Apr 2007 – Yahoo! On 1000 node cluster • Jan 2008 – An Apache Top Level Project • Feb 2008 – Yahoo Production search index 5
  • 6. What is Hadoop Used for ? • Searching (Yahoo) • Log Processing • Recommendation Systems (Facebook, LinkedIn, eBay, Amazon) • Analytics(Facebook, LinkedIn) • Video and Image Analysis (NASA) • Data Retention 6
  • 7. Hadoop High Level Architecture 7
  • 8. Map-Reduce Framework for processing parallel problems across huge datasets using a large numbers of computers(nodes), collectively referred as Cluster : If all nodes are on same local network and uses similar network. Or Grid: If the nodes are shared across geographically and uses more heterogeneous hardware. Consists Two Step : 1.Map Step- The master node takes the input, divides it into smaller sub-problems, and distributes them to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes the smaller problem, and passes the answer back to its master node. 2.Reduce Step -The master node then collects the answers to all the sub-problems and combines them in some way to form the output – the answer to the problem it was originally trying to solve. Multiple Map-Reduce phases 8
  • 9. Map-Reduce Life-Cycle Credit : http://code.google.com/edu/parallel/mapreduce-tutorial.html 9
  • 11. Lets Understand GFS first … Google File System 11
  • 13. Goals of HDFS 1. Very Large Distributed File System - 10K nodes, 100 million files, 10 PB 2. Assumes Commodity Hardware - Files are replicated to handle hardware failure - Detect failures and recovers from them 3. Optimized for Batch Processing - Data locations exposed so that computation can move to where data resides. 13
  • 14. 14
  • 15. Installation/ Configuration [rjain@10.1.110.12 hadoop-1.0.3]$ vi conf/hdfs-site.xml [rjain@10.1.110.12 hadoop-1.0.3]$ pwd <configuration> /home/rjain/hadoop-1.0.3 <property> <name>dfs.replication</name> [rjain@10.1.110.12 hadoop-1.0.3]$ bin/start-all.sh <value>1</value> </property> [rjain@10.1.110.12 hadoop-1.0.3]$ bin/start-mapred.sh <property> <name>dfs.permissions</name> [rjain@10.1.110.12 hadoop-1.0.3]$ bin/start-dfs.sh <value>true</value> </property> [rjain@10.1.110.12 hadoop-1.0.3]$ bin/hadoop fs <property> Usage: java FsShell <name>dfs.data.dir</name> [-ls <path>] <value>/home/rjain/rahul/hdfs/data</value> [-lsr <path>] : Recursive version of ls. Similar to Unix ls -R. </property> [-du <path>] : Displays aggregate length of files contained in the directory or the length of a file. <property> [-dus <path>] : Displays a summary of file lengths. <name>dfs.name.dir</name> [-count[-q] <path>] <value>/home/rjain/rahul/hdfs/name</value> [-mv <src> <dst>] </property> [-cp <src> <dst>] </configuration> [-rm [-skipTrash] <path>] [-rmr [-skipTrash] <path>] : Recursive version of delete(rm). [rjain@10.1.110.12 hadoop-1.0.3]$ vi conf/mapred-site.xml [-expunge] : Empty the Trash <configuration> [-put <localsrc> ... <dst>] : Copy single src, or multiple srcs from local file system to the <property> destination filesystem <name>mapred.job.tracker</name> [-copyFromLocal <localsrc> ... <dst>] <value>localhost:9001</value> [-moveFromLocal <localsrc> ... <dst>] </property> [-get [-ignoreCrc] [-crc] <src> <localdst>] </configuration> [-getmerge <src> <localdst> [addnl]] [-cat <src>] [rjain@10.1.110.12 hadoop-1.0.3]$ vi conf/core-site.xml [-text <src>] : Takes a source file and outputs the file in text format. The allowed formats are zip <configuration> and TextRecordInputStream. <property> [-copyToLocal [-ignoreCrc] [-crc] <src> <localdst>] <name>fs.default.name</name> [-moveToLocal [-crc] <src> <localdst>] <value>hdfs://localhost:9000</value> [-mkdir <path>] </property> [-setrep [-R] [-w] <rep> <path/file>] : Changes the replication factor of a file </configuration> [-touchz <path>] : Create a file of zero length. [-test -[ezd] <path>] : -e check to see if the file exists. Return 0 if true. -z check to see if the file is [rjain@10.1.110.12 hadoop-1.0.3]$ jps zero length. Return 0 if true. -d check to see if the path is directory. Return 0 if true. 29756 SecondaryNameNode [-stat [format] <path>] : Returns the stat information on the path like created time of dir 19847 TaskTracker [-tail [-f] <file>] : Displays last kilobyte of the file to stdout 18756 Jps [-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...] 29483 NameNode [-chown [-R] [OWNER][:[GROUP]] PATH...] 29619 DataNode [-chgrp [-R] GROUP PATH...] 15 19711 JobTracker [-help [cmd]]
  • 16. HDFS- Read/Write Example Configuration conf = new Configuration(); FileSystem fs = FileSystem.get(conf); Given an input/output file name as string, we construct inFile/outFile Path objects. Most of the FileSystem APIs accepts Path objects. Path inFile = new Path(argv[0]); Path outFile = new Path(argv[1]); Validate the input/output paths before reading/writing. if (!fs.exists(inFile)) printAndExit("Input file not found"); if (!fs.isFile(inFile)) printAndExit("Input should be a file"); if (fs.exists(outFile)) printAndExit("Output already exists"); Open inFile for reading. FSDataInputStream in = fs.open(inFile); Open outFile for writing. FSDataOutputStream out = fs.create(outFile); Read from input stream and write to output stream until EOF. while ((bytesRead = in.read(buffer)) > 0) { out.write(buffer, 0, bytesRead); } Close the streams when done. in.close(); out.close(); 16
  • 17. Hadoop Sub-Projects • Hadoop Common: The common utilities that support the other Hadoop subprojects. • Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data. • Hadoop MapReduce: A software framework for distributed processing of large data sets on compute clusters. Other Hadoop-related projects at Apache include: • Avro™: A data serialization system. • Cassandra™: A scalable multi-master database with no single points of failure. • Chukwa™: A data collection system for managing large distributed systems. • HBase™: A scalable, distributed database that supports structured data storage for large tables. • Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying. • Mahout™: A Scalable machine learning and data mining library. • Pig™: A high-level data-flow language and execution framework for parallel computation. • ZooKeeper™: A high-performance coordination service for distributed applications. 17